[ovs-dev,v4,1/2] dpif-netdev: Add SMC cache after EMC cache

Message ID 1530294772-44098-2-git-send-email-yipeng1.wang@intel.com
State New
Series
  • dpif-netdev: Combine CD/DFC patch for datapath refactor

Commit Message

Yipeng Wang June 29, 2018, 5:52 p.m.
This patch adds a signature match cache (SMC) after the exact match
cache (EMC). The difference between SMC and EMC is that SMC stores
only a signature of a flow, which makes it much more memory efficient.
In the same memory space, EMC can store 8k flows while SMC can store
1M flows. It is generally beneficial to turn on SMC and turn off EMC
when the traffic flow count is much larger than the EMC size.

The SMC maps a signature to a netdev_flow index in the flow_table.
To support this, we add two new cmap APIs: look up a key by index
and look up an index by key.

For now, SMC is an experimental feature that is turned off by
default. One can turn it on using ovsdb options.

Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com>
---
 Documentation/topics/dpdk/bridge.rst |  15 ++
 NEWS                                 |   2 +
 lib/cmap.c                           |  73 +++++++++
 lib/cmap.h                           |   5 +
 lib/dpif-netdev-perf.h               |   1 +
 lib/dpif-netdev.c                    | 310 ++++++++++++++++++++++++++++++-----
 tests/pmd.at                         |   1 +
 vswitchd/vswitch.xml                 |  13 ++
 8 files changed, 383 insertions(+), 37 deletions(-)

Comments

O Mahony, Billy July 4, 2018, 5:42 p.m. | #1
Hi,

I've checked the latest patch and the performance results I get are similar to the ones given in the previous patches. Also, enabling/disabling the DFC on the fly works as expected.

The main query I have regards the slow sweep for the SMC.

[[BO'M]] The slow sweep removes EMC entries that are no longer valid because the associated dpcls rule has been changed or has expired. Is there a mechanism to remove SMC entries associated with a changed/expired dpcls rules?

Further comments, queries inline. Review not complete yet, I'll try to finish off tomorrow.

Regards,
Billy.


> -----Original Message-----
> From: Wang, Yipeng1
> Sent: Friday, June 29, 2018 6:53 PM
> To: dev@openvswitch.org
> Cc: Wang, Yipeng1 <yipeng1.wang@intel.com>; jan.scheurich@ericsson.com;
> Stokes, Ian <ian.stokes@intel.com>; O Mahony, Billy
> <billy.o.mahony@intel.com>; Loftus, Ciara <ciara.loftus@intel.com>
> Subject: [PATCH v4 1/2] dpif-netdev: Add SMC cache after EMC cache
> 
> This patch adds a signature match cache (SMC) after exact match cache (EMC).
> The difference between SMC and EMC is SMC only stores a signature of a flow
> thus it is much more memory efficient. With same memory space, EMC can
> store 8k flows while SMC can store 1M flows. It is generally beneficial to turn on
> SMC but turn off EMC when traffic flow count is much larger than EMC size.
> 
> SMC cache will map a signature to an netdev_flow index in flow_table. Thus, we
> add two new APIs in cmap for lookup key by index and lookup index by key.
> 
> For now, SMC is an experimental feature that it is turned off by default. One can
> turn it on using ovsdb options.
> 
> Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com>
> ---
>  Documentation/topics/dpdk/bridge.rst |  15 ++
>  NEWS                                 |   2 +
>  lib/cmap.c                           |  73 +++++++++
>  lib/cmap.h                           |   5 +
>  lib/dpif-netdev-perf.h               |   1 +
>  lib/dpif-netdev.c                    | 310 ++++++++++++++++++++++++++++++-----
>  tests/pmd.at                         |   1 +
>  vswitchd/vswitch.xml                 |  13 ++
>  8 files changed, 383 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/bridge.rst
> b/Documentation/topics/dpdk/bridge.rst
> index 63f8a62..df74c02 100644
> --- a/Documentation/topics/dpdk/bridge.rst
> +++ b/Documentation/topics/dpdk/bridge.rst
> @@ -102,3 +102,18 @@ For certain traffic profiles with many parallel flows, it's
> recommended to set  ``N`` to '0' to achieve higher forwarding performance.
> 
>  For more information on the EMC refer to :doc:`/intro/install/dpdk` .
> +
> +
> +SMC cache (experimental)
> +-------------------------
> +
> +SMC cache or signature match cache is a new cache level after EMC cache.
> +The difference between SMC and EMC is SMC only stores a signature of a
> +flow thus it is much more memory efficient. With same memory space, EMC
> +can store 8k flows while SMC can store 1M flows. When traffic flow
> +count is much larger than EMC size, it is generally beneficial to turn
> +off EMC and turn on SMC. It is currently turned off by default and an
> +experimental feature.
> +
> +To turn on SMC::
> +
> +    $ ovs-vsctl --no-wait set Open_vSwitch . other_config:smc-enable=true
> diff --git a/NEWS b/NEWS
> index cd15a33..26d6ef1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -40,6 +40,8 @@ Post-v2.9.0
>           ovs-appctl dpif-netdev/pmd-perf-show
>       * Supervision of PMD performance metrics and logging of suspicious
>         iterations
> +     * Add signature match cache (SMC) as experimental feature. When turned on,
> +       it improves throughput when traffic has many more flows than EMC size.
>     - ERSPAN:
>       * Implemented ERSPAN protocol (draft-foschiano-erspan-00.txt) for
>         both kernel datapath and userspace datapath.
> diff --git a/lib/cmap.c b/lib/cmap.c
> index 07719a8..db1c806 100644
> --- a/lib/cmap.c
> +++ b/lib/cmap.c
> @@ -373,6 +373,79 @@ cmap_find(const struct cmap *cmap, uint32_t hash)
>                         hash);
>  }
> 
> +/* Find a node by the index of the entry of cmap. For example, index 7 means
> + * the second bucket and the third item.
[[BO'M]] Is this assuming 4 for the bucket size? Maybe explicitly add where the bucket size is coming from - CMAP_K? If so, is that value not 5 for a 64-bit system?

> + * Notice that it is not protected by the optimistic lock (versioning) because
> + * of performance reasons. Currently it is only used by the datapath DFC cache.
> + *
> + * Return node for the entry of index or NULL if the index beyond boundary */
> +const struct cmap_node *
> +cmap_find_by_index(const struct cmap *cmap, uint16_t index)
> +{
> +    const struct cmap_impl *impl = cmap_get_impl(cmap);
> +
> +    uint32_t b = index / CMAP_K;
> +    uint32_t e = index % CMAP_K;
> +
> +    if (b > impl->mask) {
> +        return NULL;
> +    }
> +
> +    const struct cmap_bucket *bucket = &impl->buckets[b];
> +
> +    return cmap_node_next(&bucket->nodes[e]);
> +}
> +
> +/* Find the index of certain hash value. Currently only used by the datapath
> + * DFC cache.
> + *
> + * Return the index of the entry if found, or UINT32_MAX if not found */
> +
> +uint32_t
> +cmap_find_index(const struct cmap *cmap, uint32_t hash)
> +{
> +    const struct cmap_impl *impl = cmap_get_impl(cmap);
> +    uint32_t h1 = rehash(impl, hash);
> +    uint32_t h2 = other_hash(h1);
> +
> +    uint32_t b_index1 = h1 & impl->mask;
> +    uint32_t b_index2 = h2 & impl->mask;
> +
> +    uint32_t c1, c2;
> +    uint32_t index = UINT32_MAX;
> +
> +    const struct cmap_bucket *b1 = &impl->buckets[b_index1];
> +    const struct cmap_bucket *b2 = &impl->buckets[b_index2];
> +
> +    do {
> +        do {
> +            c1 = read_even_counter(b1);
> +            for (int i = 0; i < CMAP_K; i++) {
> +                if (b1->hashes[i] == hash) {
> +                    index = b_index1 * CMAP_K + i;
> +                 }
> +            }
> +        } while (OVS_UNLIKELY(counter_changed(b1, c1)));
> +        if (index != UINT32_MAX) {
> +            break;
> +        }
> +        do {
> +            c2 = read_even_counter(b2);
> +            for (int i = 0; i < CMAP_K; i++) {
> +                if (b2->hashes[i] == hash) {
> +                    index = b_index2 * CMAP_K + i;
> +                }
> +            }
> +        } while (OVS_UNLIKELY(counter_changed(b2, c2)));
> +
> +        if (index != UINT32_MAX) {
> +            break;
> +        }
> +    } while (OVS_UNLIKELY(counter_changed(b1, c1)));
> +
> +    return index;
> +}
> +
>  /* Looks up multiple 'hashes', when the corresponding bit in 'map' is 1,
>   * and sets the corresponding pointer in 'nodes', if the hash value was
>   * found from the 'cmap'.  In other cases the 'nodes' values are not changed,
> diff --git a/lib/cmap.h b/lib/cmap.h
> index 8bfb6c0..076afd6 100644
> --- a/lib/cmap.h
> +++ b/lib/cmap.h
> @@ -145,6 +145,11 @@ size_t cmap_replace(struct cmap *, struct cmap_node
> *old_node,  const struct cmap_node *cmap_find(const struct cmap *, uint32_t
> hash);  struct cmap_node *cmap_find_protected(const struct cmap *, uint32_t
> hash);
> 
> +/* Find node by index or find index by hash */
> +const struct cmap_node *cmap_find_by_index(const struct cmap *,
> +                                           uint16_t index);
> +uint32_t cmap_find_index(const struct cmap *, uint32_t hash);
> +
>  /* Looks up multiple 'hashes', when the corresponding bit in 'map' is 1,
>   * and sets the corresponding pointer in 'nodes', if the hash value was
>   * found from the 'cmap'.  In other cases the 'nodes' values are not changed,
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index b8aa4e3..299d52a 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -56,6 +56,7 @@ extern "C" {
> 
>  enum pmd_stat_type {
>      PMD_STAT_EXACT_HIT,     /* Packets that had an exact match (emc). */
> +    PMD_STAT_SMC_HIT,        /* Packets that had a sig match hit (SMC). */
>      PMD_STAT_MASKED_HIT,    /* Packets that matched in the flow table. */
>      PMD_STAT_MISS,          /* Packets that did not match and upcall was ok. */
>      PMD_STAT_LOST,          /* Packets that did not match and upcall failed. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 9390fff..de7d2ca 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -129,7 +129,9 @@ struct netdev_flow_key {
>      uint64_t buf[FLOW_MAX_PACKET_U64S];  };
> 
> -/* Exact match cache for frequently used flows
> +/* EMC cache and SMC cache compose of the datapath flow cache (DFC)
[[BO'M]] 'compose of the' -> 'compose the' or else 'together make the' 
> + *
> + * Exact match cache for frequently used flows
>   *
>   * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to
>   * search its entries for a miniflow that matches exactly the miniflow of the @@
> -143,6 +145,15 @@ struct netdev_flow_key {
>   * value is the index of a cache entry where the miniflow could be.
>   *
>   *
> + * Signature match cache (SMC)
> + *
> + * This cache stores a 16-bit signature for each flow without storing keys, and
> + * stores the corresponding 16-bit flow_table index to the 'dp_netdev_flow'.
> + * Each flow thus occupies 32bit which is much more memory efficient than EMC.
> + * SMC uses a set-associative design that each bucket contains
> + * SMC_ENTRY_PER_BUCKET number of entries.
> + *
> + *
>   * Thread-safety
>   * =============
>   *
> @@ -155,6 +166,11 @@ struct netdev_flow_key {
>  #define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)
>  #define EM_FLOW_HASH_SEGS 2
> 
> +#define SMC_ENTRY_PER_BUCKET 4
> +#define SMC_ENTRIES (1u << 20)
> +#define SMC_BUCKET_CNT (SMC_ENTRIES / SMC_ENTRY_PER_BUCKET)
> +#define SMC_MASK (SMC_BUCKET_CNT - 1)
> +
>  /* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB */
>  #define DEFAULT_EM_FLOW_INSERT_INV_PROB 100
>  #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
> @@ -170,6 +186,21 @@ struct emc_cache {
>      int sweep_idx;                /* For emc_cache_slow_sweep(). */
>  };
> 
> +struct smc_bucket {
> +    uint16_t sig[SMC_ENTRY_PER_BUCKET];
> +    uint16_t flow_idx[SMC_ENTRY_PER_BUCKET];
> +};
> +
> +/* Signature match cache, differentiate from EMC cache */
> +struct smc_cache {
> +    struct smc_bucket buckets[SMC_BUCKET_CNT];
> +};
> +
> +struct dfc_cache {
> +    struct emc_cache emc_cache;
> +    struct smc_cache smc_cache;
> +};
> +
>  /* Iterate in the exact match cache through every entry that might contain a
>   * miniflow with hash 'HASH'. */
>  #define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH)
> \
> @@ -214,10 +245,11 @@ static void dpcls_insert(struct dpcls *, struct
> dpcls_rule *,
>                           const struct netdev_flow_key *mask);  static void
> dpcls_remove(struct dpcls *, struct dpcls_rule *);  static bool dpcls_lookup(struct
> dpcls *cls,
> -                         const struct netdev_flow_key keys[],
> +                         const struct netdev_flow_key *keys[],
>                           struct dpcls_rule **rules, size_t cnt,
>                           int *num_lookups_p);
> -
> +static bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
> +                            const struct netdev_flow_key *target);
>  /* Set of supported meter flags */
>  #define DP_SUPPORTED_METER_FLAGS_MASK \
>      (OFPMF13_STATS | OFPMF13_PKTPS | OFPMF13_KBPS | OFPMF13_BURST)
> @@ -284,6 +316,8 @@ struct dp_netdev {
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
>      /* Enable collection of PMD performance metrics. */
>      atomic_bool pmd_perf_metrics;
> +    /* Enable the SMC cache from ovsdb config */
> +    atomic_bool smc_enable_db;
> 
>      /* Protects access to ofproto-dpif-upcall interface during revalidator
>       * thread synchronization. */
> @@ -552,7 +586,7 @@ struct dp_netdev_pmd_thread {
>       * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>       * need to be protected by 'non_pmd_mutex'.  Every other instance
>       * will only be accessed by its own pmd thread. */
> -    struct emc_cache flow_cache;
> +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache;
> 
>      /* Flow-Table and classifiers
>       *
> @@ -720,6 +754,7 @@ static int dpif_netdev_xps_get_tx_qid(const struct
> dp_netdev_pmd_thread *pmd,
> 
>  static inline bool emc_entry_alive(struct emc_entry *ce);  static void
> emc_clear_entry(struct emc_entry *ce);
> +static void smc_clear_entry(struct smc_bucket *b, int idx);
> 
>  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);  static inline
> bool @@ -740,6 +775,24 @@ emc_cache_init(struct emc_cache *flow_cache)
> }
> 
>  static void
> +smc_cache_init(struct smc_cache *smc_cache)
> +{
> +    int i, j;
> +    for (i = 0; i < SMC_BUCKET_CNT; i++) {
> +        for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) {
> +            smc_cache->buckets[i].flow_idx[j] = UINT16_MAX;
> +        }
> +    }
> +}
> +
> +static void
> +dfc_cache_init(struct dfc_cache *flow_cache)
> +{
> +    emc_cache_init(&flow_cache->emc_cache);
> +    smc_cache_init(&flow_cache->smc_cache);
> +}
> +
> +static void
>  emc_cache_uninit(struct emc_cache *flow_cache)  {
>      int i;
> @@ -749,6 +802,25 @@ emc_cache_uninit(struct emc_cache *flow_cache)
>      }
>  }
> 
> +static void
> +smc_cache_uninit(struct smc_cache *smc)
> +{
> +    int i, j;
> +
> +    for (i = 0; i < SMC_BUCKET_CNT; i++) {
> +        for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) {
> +            smc_clear_entry(&(smc->buckets[i]), j);
> +        }
> +    }
> +}
> +
> +static void
> +dfc_cache_uninit(struct dfc_cache *flow_cache)
> +{
> +    smc_cache_uninit(&flow_cache->smc_cache);
> +    emc_cache_uninit(&flow_cache->emc_cache);
> +}
> +
>  /* Check and clear dead flow references slowly (one entry at each
>   * invocation).  */
>  static void
> @@ -860,6 +932,7 @@ pmd_info_show_stats(struct ds *reply,
>                    "  packet recirculations: %"PRIu64"\n"
>                    "  avg. datapath passes per packet: %.02f\n"
>                    "  emc hits: %"PRIu64"\n"
> +                  "  smc hits: %"PRIu64"\n"
>                    "  megaflow hits: %"PRIu64"\n"
>                    "  avg. subtable lookups per megaflow hit: %.02f\n"
>                    "  miss with success upcall: %"PRIu64"\n"
> @@ -867,6 +940,7 @@ pmd_info_show_stats(struct ds *reply,
>                    "  avg. packets per output batch: %.02f\n",
>                    total_packets, stats[PMD_STAT_RECIRC],
>                    passes_per_pkt, stats[PMD_STAT_EXACT_HIT],
> +                  stats[PMD_STAT_SMC_HIT],
>                    stats[PMD_STAT_MASKED_HIT], lookups_per_hit,
>                    stats[PMD_STAT_MISS], stats[PMD_STAT_LOST],
>                    packets_per_batch);
> @@ -1580,6 +1654,7 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct
> dpif_dp_stats *stats)
>          stats->n_flows += cmap_count(&pmd->flow_table);
>          pmd_perf_read_counters(&pmd->perf_stats, pmd_stats);
>          stats->n_hit += pmd_stats[PMD_STAT_EXACT_HIT];
> +        stats->n_hit += pmd_stats[PMD_STAT_SMC_HIT];
>          stats->n_hit += pmd_stats[PMD_STAT_MASKED_HIT];
>          stats->n_missed += pmd_stats[PMD_STAT_MISS];
>          stats->n_lost += pmd_stats[PMD_STAT_LOST];
> @@ -2272,10 +2347,11 @@ emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd,
>       * probability of 1/100 ie. 1% */
> 
>      uint32_t min;
> +
>      atomic_read_relaxed(&pmd->dp->emc_insert_min, &min);
> 
>      if (min && random_uint32() <= min) {
> -        emc_insert(&pmd->flow_cache, key, flow);
> +        emc_insert(&(pmd->flow_cache).emc_cache, key, flow);
>      }
>  }
> 
> @@ -2297,6 +2373,76 @@ emc_lookup(struct emc_cache *cache, const struct
> netdev_flow_key *key)
>      return NULL;
>  }
> 
> +static inline const struct cmap_node *
> +smc_entry_get(struct dp_netdev_pmd_thread *pmd, struct smc_cache *cache,
> +                const uint32_t hash)
[[BO'M]] Minor quibble. We can get the cache arg from the pmd arg - &pmd->flow_cache.smc_cache.
> +{
> +    struct smc_bucket *bucket = &cache->buckets[hash & SMC_MASK];
> +    uint16_t sig = hash >> 16;
> +    uint16_t index = UINT16_MAX;
> +
> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
> +        if (bucket->sig[i] == sig) {
> +            index = bucket->flow_idx[i];
> +            break;
> +        }
> +    }
> +    if (index != UINT16_MAX) {
> +        return cmap_find_by_index(&pmd->flow_table, index);
> +    }
> +    return NULL;
> +}
> +
> +static void
> +smc_clear_entry(struct smc_bucket *b, int idx)
> +{
> +    b->flow_idx[idx] = UINT16_MAX;
> +}
> +
> +static inline int
> +smc_insert(struct dp_netdev_pmd_thread *pmd,
> +           const struct netdev_flow_key *key,
> +           uint32_t hash)
> +{
> +    struct smc_cache *smc_cache = &pmd->flow_cache.smc_cache;
> +    struct smc_bucket *bucket = &smc_cache->buckets[key->hash &
> SMC_MASK];
> +    uint16_t index;
> +    uint32_t cmap_index;
> +    bool smc_enable_db;
> +
> +    atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
> +    if (!smc_enable_db) {
> +        return 0;
> +    }
> +
> +    cmap_index = cmap_find_index(&pmd->flow_table, hash);
> +    index = (cmap_index >= UINT16_MAX) ? UINT16_MAX : (uint16_t)cmap_index;
> +
> +    /* If the index is larger than SMC can handle */
> +    if (index == UINT16_MAX) {
> +        return 0;
> +    }
> +
> +    uint16_t sig = key->hash >> 16;
> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
> +        if(bucket->sig[i] == sig) {
> +            bucket->flow_idx[i] = index;
> +            return 1;
> +        }
> +    }
> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
> +        if(bucket->flow_idx[i] == UINT16_MAX) {
> +            bucket->sig[i] = sig;
> +            bucket->flow_idx[i] = index;
> +            return 1;
> +        }
> +    }
> +    int idx = random_uint32() & (SMC_ENTRY_PER_BUCKET - 1);
> +    bucket->sig[idx] = sig;
> +    bucket->flow_idx[idx] = index;
> +    return 1;
> +}
> +
>  static struct dp_netdev_flow *
>  dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
>                            const struct netdev_flow_key *key,
> @@ -2310,7 +2456,7 @@ dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
> 
>      cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
>      if (OVS_LIKELY(cls)) {
> -        dpcls_lookup(cls, key, &rule, 1, lookup_num_p);
> +        dpcls_lookup(cls, &key, &rule, 1, lookup_num_p);
>          netdev_flow = dp_netdev_flow_cast(rule);
>      }
>      return netdev_flow;
> @@ -3137,6 +3283,17 @@ dpif_netdev_set_config(struct dpif *dpif, const
> struct smap *other_config)
>          }
>      }
> 
> +    bool smc_enable = smap_get_bool(other_config, "smc-enable", false);
> +    bool cur_smc;
> +    atomic_read_relaxed(&dp->smc_enable_db, &cur_smc);
> +    if (smc_enable != cur_smc) {
> +        atomic_store_relaxed(&dp->smc_enable_db, smc_enable);
> +        if (smc_enable) {
> +            VLOG_INFO("SMC cache is enabled");
> +        } else {
> +            VLOG_INFO("SMC cache is disabled");
> +        }
> +    }
>      return 0;
>  }
> 
> @@ -4270,7 +4427,7 @@ pmd_thread_main(void *f_)
>      ovs_numa_thread_setaffinity_core(pmd->core_id);
>      dpdk_set_lcore_id(pmd->core_id);
>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
> -    emc_cache_init(&pmd->flow_cache);
> +    dfc_cache_init(&pmd->flow_cache);
>  reload:
>      pmd_alloc_static_tx_qid(pmd);
> 
> @@ -4324,7 +4481,7 @@ reload:
>              coverage_try_clear();
>              dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt);
>              if (!ovsrcu_try_quiesce()) {
> -                emc_cache_slow_sweep(&pmd->flow_cache);
> +                emc_cache_slow_sweep(&((pmd->flow_cache).emc_cache));
[[BO'M]] The slow sweep removes EMC entries that are no longer valid because the associated dpcls rule has been changed or has expired. Is there a mechanism to remove SMC entries associated with a changed/expired dpcls rules?
>              }
> 
>              atomic_read_relaxed(&pmd->reload, &reload);
> @@ -4349,7 +4506,7 @@ reload:
>          goto reload;
>      }
> 
> -    emc_cache_uninit(&pmd->flow_cache);
> +    dfc_cache_uninit(&pmd->flow_cache);
>      free(poll_list);
>      pmd_free_cached_ports(pmd);
>      return NULL;
> @@ -4785,7 +4942,7 @@ dp_netdev_configure_pmd(struct
> dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      /* init the 'flow_cache' since there is no
>       * actual thread created for NON_PMD_CORE_ID. */
>      if (core_id == NON_PMD_CORE_ID) {
> -        emc_cache_init(&pmd->flow_cache);
> +        dfc_cache_init(&pmd->flow_cache);
>          pmd_alloc_static_tx_qid(pmd);
>      }
>      pmd_perf_stats_init(&pmd->perf_stats);
> @@ -4828,7 +4985,7 @@ dp_netdev_del_pmd(struct dp_netdev *dp, struct
> dp_netdev_pmd_thread *pmd)
>       * but extra cleanup is necessary */
>      if (pmd->core_id == NON_PMD_CORE_ID) {
>          ovs_mutex_lock(&dp->non_pmd_mutex);
> -        emc_cache_uninit(&pmd->flow_cache);
> +        dfc_cache_uninit(&pmd->flow_cache);
>          pmd_free_cached_ports(pmd);
>          pmd_free_static_tx_qid(pmd);
>          ovs_mutex_unlock(&dp->non_pmd_mutex);
> @@ -5132,10 +5289,67 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>      packet_batch_per_flow_update(batch, pkt, mf);
>  }
> 
> -/* Try to process all ('cnt') the 'packets' using only the exact match cache
> +/* SMC lookup function for a batch of packets.
> + * By doing batching SMC lookup, we can use prefetch
> + * to hide memory access latency.
> + */
> +static inline void
> +smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
> +            struct netdev_flow_key *keys,
> +            struct netdev_flow_key **missed_keys,
> +            struct dp_packet_batch *packets_,
> +            struct packet_batch_per_flow batches[],
> +            size_t *n_batches, const int cnt)
> +{
> +    int i;
> +    struct dp_packet *packet;
> +    size_t n_smc_hit = 0, n_missed = 0;
> +    struct dfc_cache *cache = &pmd->flow_cache;
> +    struct smc_cache *smc_cache = &cache->smc_cache;
> +    const struct cmap_node *flow_node;
> +
> +    /* Prefetch buckets for all packets */
> +    for (i = 0; i < cnt; i++) {
> +        OVS_PREFETCH(&smc_cache->buckets[keys[i].hash & SMC_MASK]);
> +    }
> +
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
> +        struct dp_netdev_flow *flow = NULL;
> +        flow_node = smc_entry_get(pmd, smc_cache, keys[i].hash);
> +
> +        if (OVS_LIKELY(flow_node != NULL)) {
> +            /* Optimistically we only check the head of the node linked list */
> +            flow = CONTAINER_OF(flow_node, struct dp_netdev_flow, node);
> +            /* Since we dont have per-port megaflow to check the port number,
> +             * we need to  verify that the input ports match. */
> +            if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, &keys[i]) &&
> +                flow->flow.in_port.odp_port == packet->md.in_port.odp_port)) {
> +                /* SMC hit and emc miss, we insert into EMC */
> +                emc_probabilistic_insert(pmd, &keys[i], flow);
> +                keys[i].len =
> +                        netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
> +                dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> +                        n_batches);
> +                n_smc_hit++;
> +                continue;
> +            }
> +        }
> +
> +        /* SMC missed. Group missed packets together at
> +         * the beginning of the 'packets' array. */
> +        dp_packet_batch_refill(packets_, packet, i);
> +        /* Put missed keys to the pointer arrays return to the caller */
> +        missed_keys[n_missed++] = &keys[i];
> +    }
> +
> +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
> +}
> +
> +/* Try to process all ('cnt') the 'packets' using only the datapath flow cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>   * miniflow is copied into 'keys' and the packet pointer is moved at the
> - * beginning of the 'packets' array.
> + * beginning of the 'packets' array. The pointers of missed keys are put in the
> + * missed_keys pointer array for future processing.
>   *
>   * The function returns the number of packets that needs to be processed in the
>   * 'packets' array (they have been moved to the beginning of the vector).
> @@ -5147,20 +5361,24 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>   * will be ignored.
>   */
>  static inline size_t
> -emc_processing(struct dp_netdev_pmd_thread *pmd,
> +dfc_processing(struct dp_netdev_pmd_thread *pmd,
>                 struct dp_packet_batch *packets_,
>                 struct netdev_flow_key *keys,
> +               struct netdev_flow_key **missed_keys,
>                 struct packet_batch_per_flow batches[], size_t *n_batches,
>                 bool md_is_valid, odp_port_t port_no)  {
> -    struct emc_cache *flow_cache = &pmd->flow_cache;
>      struct netdev_flow_key *key = &keys[0];
> -    size_t n_missed = 0, n_dropped = 0;
> +    size_t n_missed = 0, n_emc_hit = 0;
> +    struct dfc_cache *cache = &pmd->flow_cache;
>      struct dp_packet *packet;
>      const size_t cnt = dp_packet_batch_size(packets_);
>      uint32_t cur_min;
>      int i;
> +    bool smc_enable_db;
> +
> 
> +    atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
>      atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
>      pmd_perf_update_counter(&pmd->perf_stats,
>                              md_is_valid ? PMD_STAT_RECIRC : PMD_STAT_RECV, @@ -
> 5171,7 +5389,6 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> 
>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>              dp_packet_delete(packet);
> -            n_dropped++;
>              continue;
>          }
> 
> @@ -5187,34 +5404,47 @@ emc_processing(struct dp_netdev_pmd_thread
> *pmd,
>          }
>          miniflow_extract(packet, &key->mf);
>          key->len = 0; /* Not computed yet. */
> -        /* If EMC is disabled skip hash computation and emc_lookup */
> -        if (cur_min) {
> +        /* If EMC and SMC disabled skip hash computation */
> +        if (smc_enable_db == true || cur_min != 0) {
>              if (!md_is_valid) {
>                  key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
>                          &key->mf);
>              } else {
>                  key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
>              }
> -            flow = emc_lookup(flow_cache, key);
> +        }
> +        if (cur_min) {
> +            flow = emc_lookup(&cache->emc_cache, key);
>          } else {
>              flow = NULL;
>          }
>          if (OVS_LIKELY(flow)) {
>              dp_netdev_queue_batches(packet, flow, &key->mf, batches,
>                                      n_batches);
> +            n_emc_hit++;
>          } else {
>              /* Exact match cache missed. Group missed packets together at
>               * the beginning of the 'packets' array. */
>              dp_packet_batch_refill(packets_, packet, i);
>              /* 'key[n_missed]' contains the key of the current packet and it
> -             * must be returned to the caller. The next key should be extracted
> -             * to 'keys[n_missed + 1]'. */
> +             * will be passed to SMC lookup. The next key should be extracted
> +             * to 'keys[n_missed + 1]'.
> +             * We also maintain a pointer array to keys missed both SMC and EMC
> +             * which will be returned to the caller for future processing. */
> +            missed_keys[n_missed] = key;
>              key = &keys[++n_missed];
>          }
>      }
> 
> -    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
> -                            cnt - n_dropped - n_missed);
> +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, n_emc_hit);
> +
> +    if (!smc_enable_db) {
> +        return dp_packet_batch_size(packets_);
> +    }
> +
> +    /* Packets that miss the EMC will do a batch lookup in SMC if enabled */
> +    smc_lookup_batch(pmd, keys, missed_keys, packets_, batches,
> +                            n_batches, n_missed);
> 
>      return dp_packet_batch_size(packets_);
>  }
> 
> @@ -5282,6 +5512,8 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>                                               add_actions->size);
>          }
>          ovs_mutex_unlock(&pmd->flow_mutex);
> +        uint32_t hash = dp_netdev_flow_hash(&netdev_flow->ufid);
> +        smc_insert(pmd, key, hash);
>          emc_probabilistic_insert(pmd, key, netdev_flow);
>      }
>      if (pmd_perf_metrics_enabled(pmd)) {
> @@ -5298,7 +5530,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>  static inline void
>  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                       struct dp_packet_batch *packets_,
> -                     struct netdev_flow_key *keys,
> +                     struct netdev_flow_key **keys,
>                       struct packet_batch_per_flow batches[],
>                       size_t *n_batches,
>                       odp_port_t in_port) @@ -5320,12 +5552,13 @@
> fast_path_processing(struct dp_netdev_pmd_thread *pmd,
> 
>      for (size_t i = 0; i < cnt; i++) {
>          /* Key length is needed in all the cases, hash computed on demand. */
> -        keys[i].len = netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
> +        keys[i]->len =
> + netdev_flow_key_size(miniflow_n_values(&keys[i]->mf));
>      }
>      /* Get the classifier for the in_port */
>      cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
>      if (OVS_LIKELY(cls)) {
> -        any_miss = !dpcls_lookup(cls, keys, rules, cnt, &lookup_cnt);
> +        any_miss = !dpcls_lookup(cls, (const struct netdev_flow_key **)keys,
> +                                rules, cnt, &lookup_cnt);
>      } else {
>          any_miss = true;
>          memset(rules, 0, sizeof(rules)); @@ -5347,7 +5580,7 @@
> fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>              /* It's possible that an earlier slow path execution installed
>               * a rule covering this flow.  In this case, it's a lot cheaper
>               * to catch it here than execute a miss. */
> -            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i],
> +            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, keys[i],
>                                                      &add_lookup_cnt);
>              if (netdev_flow) {
>                  lookup_cnt += add_lookup_cnt; @@ -5355,7 +5588,7 @@
> fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                  continue;
>              }
> 
> -            int error = handle_packet_upcall(pmd, packet, &keys[i],
> +            int error = handle_packet_upcall(pmd, packet, keys[i],
>                                               &actions, &put_actions);
> 
>              if (OVS_UNLIKELY(error)) {
> @@ -5385,9 +5618,11 @@ fast_path_processing(struct dp_netdev_pmd_thread
> *pmd,
>          }
> 
>          flow = dp_netdev_flow_cast(rules[i]);
> +        uint32_t hash =  dp_netdev_flow_hash(&flow->ufid);
> +        smc_insert(pmd, keys[i], hash);
> +        emc_probabilistic_insert(pmd, keys[i], flow);
> 
> -        emc_probabilistic_insert(pmd, &keys[i], flow);
> -        dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
> +        dp_netdev_queue_batches(packet, flow, &keys[i]->mf, batches,
> + n_batches);
>      }
> 
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
> @@ -5417,17 +5652,18 @@ dp_netdev_input__(struct dp_netdev_pmd_thread
> *pmd,  #endif
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
>          struct netdev_flow_key keys[PKT_ARRAY_SIZE];
> +    struct netdev_flow_key *missed_keys[PKT_ARRAY_SIZE];
>      struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
>      size_t n_batches;
>      odp_port_t in_port;
> 
>      n_batches = 0;
> -    emc_processing(pmd, packets, keys, batches, &n_batches,
> +    dfc_processing(pmd, packets, keys, missed_keys, batches,
> + &n_batches,
>                              md_is_valid, port_no);
>      if (!dp_packet_batch_is_empty(packets)) {
>          /* Get ingress port from first packet's metadata. */
>          in_port = packets->packets[0]->md.in_port.odp_port;
> -        fast_path_processing(pmd, packets, keys,
> +        fast_path_processing(pmd, packets, missed_keys,
>                               batches, &n_batches, in_port);
>      }
> 
> @@ -6377,7 +6613,7 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule
> *rule)
> 
>  /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
>   * in 'mask' the values in 'key' and 'target' are the same. */ -static inline bool
> +static bool
>  dpcls_rule_matches_key(const struct dpcls_rule *rule,
>                         const struct netdev_flow_key *target)  { @@ -6404,7 +6640,7 @@
> dpcls_rule_matches_key(const struct dpcls_rule *rule,
>   *
>   * Returns true if all miniflows found a corresponding rule. */  static bool -
> dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
> +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
>               struct dpcls_rule **rules, const size_t cnt,
>               int *num_lookups_p)
>  {
> @@ -6443,7 +6679,7 @@ dpcls_lookup(struct dpcls *cls, const struct
> netdev_flow_key keys[],
>           * masked with the subtable's mask to avoid hashing the wildcarded
>           * bits. */
>          ULLONG_FOR_EACH_1(i, keys_map) {
> -            hashes[i] = netdev_flow_key_hash_in_mask(&keys[i],
> +            hashes[i] = netdev_flow_key_hash_in_mask(keys[i],
>                                                       &subtable->mask);
>          }
>          /* Lookup. */
> @@ -6457,7 +6693,7 @@ dpcls_lookup(struct dpcls *cls, const struct
> netdev_flow_key keys[],
>              struct dpcls_rule *rule;
> 
>              CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
> -                if (OVS_LIKELY(dpcls_rule_matches_key(rule, &keys[i]))) {
> +                if (OVS_LIKELY(dpcls_rule_matches_key(rule, keys[i])))
> + {
>                      rules[i] = rule;
>                      /* Even at 20 Mpps the 32-bit hit_cnt cannot wrap
>                       * within one second optimization interval. */ diff --git
> a/tests/pmd.at b/tests/pmd.at index f3fac63..60452f5 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -185,6 +185,7 @@ CHECK_PMD_THREADS_CREATED()  AT_CHECK([ovs-
> appctl vlog/set dpif_netdev:dbg])  AT_CHECK([ovs-ofctl add-flow br0
> action=normal])  AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:emc-
> insert-inv-prob=1])
> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:smc-enable=true])
> 
>  sleep 1
> 
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
> 7609485..b4622f5 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -388,6 +388,19 @@
>          </p>
>        </column>
> 
> +      <column name="other_config" key="smc-enable"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Signature match cache or SMC is a cache between EMC and megaflow
> +          cache. It does not store the full key of the flow, so it is more
> +          memory efficient comparing to EMC cache. SMC is especially useful
> +          when flow count is larger than EMC capacity.
> +        </p>
> +        <p>
> +          Defaults to false but can be changed at any time.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="n-handler-threads"
>                type='{"type": "integer", "minInteger": 1}'>
>          <p>
> --
> 2.7.4
O Mahony, Billy July 5, 2018, 4:24 p.m. | #2
Hi Yipeng,

Some further comments below. Mainly to do with readability and understanding of the changes.

Regards,
Billy.

> -----Original Message-----
> From: Wang, Yipeng1
> Sent: Friday, June 29, 2018 6:53 PM
> To: dev@openvswitch.org
> Cc: Wang, Yipeng1 <yipeng1.wang@intel.com>; jan.scheurich@ericsson.com;
> Stokes, Ian <ian.stokes@intel.com>; O Mahony, Billy
> <billy.o.mahony@intel.com>; Loftus, Ciara <ciara.loftus@intel.com>
> Subject: [PATCH v4 1/2] dpif-netdev: Add SMC cache after EMC cache
> 
> This patch adds a signature match cache (SMC) after exact match cache (EMC).
> The difference between SMC and EMC is SMC only stores a signature of a flow
> thus it is much more memory efficient. With same memory space, EMC can
> store 8k flows while SMC can store 1M flows. It is generally beneficial to turn on
> SMC but turn off EMC when traffic flow count is much larger than EMC size.
> 
> SMC cache will map a signature to an netdev_flow index in flow_table. Thus, we
> add two new APIs in cmap for lookup key by index and lookup index by key.
> 
> For now, SMC is an experimental feature that it is turned off by default. One can
> turn it on using ovsdb options.
> 
> Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com>
> ---
>  Documentation/topics/dpdk/bridge.rst |  15 ++
>  NEWS                                 |   2 +
>  lib/cmap.c                           |  73 +++++++++
>  lib/cmap.h                           |   5 +
>  lib/dpif-netdev-perf.h               |   1 +
>  lib/dpif-netdev.c                    | 310 ++++++++++++++++++++++++++++++-----
>  tests/pmd.at                         |   1 +
>  vswitchd/vswitch.xml                 |  13 ++
>  8 files changed, 383 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/bridge.rst
> b/Documentation/topics/dpdk/bridge.rst
> index 63f8a62..df74c02 100644
> --- a/Documentation/topics/dpdk/bridge.rst
> +++ b/Documentation/topics/dpdk/bridge.rst
> @@ -102,3 +102,18 @@ For certain traffic profiles with many parallel flows, it's
> recommended to set  ``N`` to '0' to achieve higher forwarding performance.
> 
>  For more information on the EMC refer to :doc:`/intro/install/dpdk` .
> +
> +
> +SMC cache (experimental)
> +-------------------------
> +
> +SMC cache or signature match cache is a new cache level after EMC cache.
> +The difference between SMC and EMC is SMC only stores a signature of a
> +flow thus it is much more memory efficient. With same memory space, EMC
> +can store 8k flows while SMC can store 1M flows. When traffic flow
> +count is much larger than EMC size, it is generally beneficial to turn
> +off EMC and turn on SMC. It is currently turned off by default and an
> experimental feature.
> +
> +To turn on SMC::
> +
> +    $ ovs-vsctl --no-wait set Open_vSwitch .
> + other_config:smc-enable=true
> diff --git a/NEWS b/NEWS
> index cd15a33..26d6ef1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -40,6 +40,8 @@ Post-v2.9.0
>           ovs-appctl dpif-netdev/pmd-perf-show
>       * Supervision of PMD performance metrics and logging of suspicious
>         iterations
> +     * Add signature match cache (SMC) as experimental feature. When turned
> on,
> +       it improves throughput when traffic has many more flows than EMC size.
>     - ERSPAN:
>       * Implemented ERSPAN protocol (draft-foschiano-erspan-00.txt) for
>         both kernel datapath and userspace datapath.
> diff --git a/lib/cmap.c b/lib/cmap.c
> index 07719a8..db1c806 100644
> --- a/lib/cmap.c
> +++ b/lib/cmap.c
> @@ -373,6 +373,79 @@ cmap_find(const struct cmap *cmap, uint32_t hash)
>                         hash);
>  }
> 
> +/* Find a node by the index of the entry of cmap. For example, index 7
> +means
> + * the second bucket and the third item.
> + * Notice that it is not protected by the optimistic lock (versioning)
> +because
> + * of performance reasons. Currently it is only used by the datapath DFC cache.
> + *
> + * Return node for the entry of index or NULL if the index beyond
> +boundary */ const struct cmap_node * cmap_find_by_index(const struct
> +cmap *cmap, uint16_t index) {
> +    const struct cmap_impl *impl = cmap_get_impl(cmap);
> +
> +    uint32_t b = index / CMAP_K;
> +    uint32_t e = index % CMAP_K;
> +
> +    if (b > impl->mask) {
> +        return NULL;
> +    }
> +
> +    const struct cmap_bucket *bucket = &impl->buckets[b];
> +
> +    return cmap_node_next(&bucket->nodes[e]);
> +}
> +
> +/* Find the index of certain hash value. Currently only used by the
> +datapath
> + * DFC cache.
> + *
> + * Return the index of the entry if found, or UINT32_MAX if not found
[[BO'M]] An intro to the concept of 'index' would be useful here, especially as it does not currently exist in cmap. Something like: "The 'index' of a cmap entry is a way to combine the specific bucket and item occupied by an entry into a convenient single integer value. It is not used internally by cmap." Unless of course that is actually wrong :)
    
If a cmap's capacity exceeds the range of UINT32_MAX, what happens? Does something different happen if the entry is in a bucket whose index can be expressed in a uint32_t versus a bucket that is outside of that range?
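A minimal sketch of the index encoding being asked about (not OVS code; the CMAP_K value of 5 is inferred from the patch comment that index 7 means "the second bucket and the third item", i.e. 7 = 1 * 5 + 2):

```c
#include <assert.h>
#include <stdint.h>

/* Slots per bucket; 5 matches the "index 7 = bucket 1, item 2" example
 * in the patch comment.  The real value depends on cache-line size. */
#define CMAP_K 5

/* Pack (bucket, slot-in-bucket) into a single integer index. */
static uint32_t entry_index(uint32_t bucket, uint32_t slot)
{
    return bucket * CMAP_K + slot;
}

/* Recover (bucket, slot) from an index, as cmap_find_by_index does. */
static void entry_decode(uint32_t index, uint32_t *bucket, uint32_t *slot)
{
    *bucket = index / CMAP_K;
    *slot = index % CMAP_K;
}
```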
    
> +*/
> +
> +uint32_t
> +cmap_find_index(const struct cmap *cmap, uint32_t hash) {
> +    const struct cmap_impl *impl = cmap_get_impl(cmap);
> +    uint32_t h1 = rehash(impl, hash);
> +    uint32_t h2 = other_hash(h1);
> +
> +    uint32_t b_index1 = h1 & impl->mask;
> +    uint32_t b_index2 = h2 & impl->mask;
> +
> +    uint32_t c1, c2;
> +    uint32_t index = UINT32_MAX;
> +
> +    const struct cmap_bucket *b1 = &impl->buckets[b_index1];
> +    const struct cmap_bucket *b2 = &impl->buckets[b_index2];
> +
> +    do {
> +        do {
> +            c1 = read_even_counter(b1);
> +            for (int i = 0; i < CMAP_K; i++) {
> +                if (b1->hashes[i] == hash) {
> +                    index = b_index1 * CMAP_K + i;
> +                 }
> +            }
> +        } while (OVS_UNLIKELY(counter_changed(b1, c1)));
> +        if (index != UINT32_MAX) {
> +            break;
> +        }
> +        do {
> +            c2 = read_even_counter(b2);
> +            for (int i = 0; i < CMAP_K; i++) {
> +                if (b2->hashes[i] == hash) {
> +                    index = b_index2 * CMAP_K + i;
> +                }
> +            }
> +        } while (OVS_UNLIKELY(counter_changed(b2, c2)));
> +
> +        if (index != UINT32_MAX) {
> +            break;
> +        }
> +    } while (OVS_UNLIKELY(counter_changed(b1, c1)));
> +
> +    return index;
> +}
> +
>  /* Looks up multiple 'hashes', when the corresponding bit in 'map' is 1,
>   * and sets the corresponding pointer in 'nodes', if the hash value was
>   * found from the 'cmap'.  In other cases the 'nodes' values are not changed,
> diff --git a/lib/cmap.h b/lib/cmap.h index 8bfb6c0..076afd6 100644
> --- a/lib/cmap.h
> +++ b/lib/cmap.h
> @@ -145,6 +145,11 @@ size_t cmap_replace(struct cmap *, struct cmap_node
> *old_node,  const struct cmap_node *cmap_find(const struct cmap *, uint32_t
> hash);  struct cmap_node *cmap_find_protected(const struct cmap *, uint32_t
> hash);
> 
> +/* Find node by index or find index by hash */ const struct cmap_node *
> +cmap_find_by_index(const struct cmap *,
> +                                            uint16_t index); uint32_t
> +cmap_find_index(const struct cmap *, uint32_t hash);
> +
>  /* Looks up multiple 'hashes', when the corresponding bit in 'map' is 1,
>   * and sets the corresponding pointer in 'nodes', if the hash value was
>   * found from the 'cmap'.  In other cases the 'nodes' values are not changed,
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index
> b8aa4e3..299d52a 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -56,6 +56,7 @@ extern "C" {
> 
>  enum pmd_stat_type {
>      PMD_STAT_EXACT_HIT,     /* Packets that had an exact match (emc). */
> +    PMD_STAT_SMC_HIT,        /* Packets that had a sig match hit (SMC). */
>      PMD_STAT_MASKED_HIT,    /* Packets that matched in the flow table. */
>      PMD_STAT_MISS,          /* Packets that did not match and upcall was ok. */
>      PMD_STAT_LOST,          /* Packets that did not match and upcall failed. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 9390fff..de7d2ca 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -129,7 +129,9 @@ struct netdev_flow_key {
>      uint64_t buf[FLOW_MAX_PACKET_U64S];  };
> 
> -/* Exact match cache for frequently used flows
> +/* EMC cache and SMC cache compose of the datapath flow cache (DFC)
> + *
> + * Exact match cache for frequently used flows
>   *
>   * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to
>   * search its entries for a miniflow that matches exactly the miniflow of the @@
> -143,6 +145,15 @@ struct netdev_flow_key {
>   * value is the index of a cache entry where the miniflow could be.
>   *
>   *
> + * Signature match cache (SMC)
> + *
> + * This cache stores a 16-bit signature for each flow without storing
> + keys, and
> + * stores the corresponding 16-bit flow_table index to the 'dp_netdev_flow'.
> + * Each flow thus occupies 32bit which is much more memory efficient than
> EMC.
> + * SMC uses a set-associative design that each bucket contains
> + * SMC_ENTRY_PER_BUCKET number of entries.
> + *
> + *
>   * Thread-safety
>   * =============
>   *
> @@ -155,6 +166,11 @@ struct netdev_flow_key {  #define
> EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)  #define
> EM_FLOW_HASH_SEGS 2
> 
> +#define SMC_ENTRY_PER_BUCKET 4
[[BO'M]] SMC_ENTRY_PER_BUCKET -> SMC_ENTRIES_PER_BUCKET.
Also a comment something like "A bucket forms the set of possible entries that a flow hash can occupy. Therefore SMC_ENTRIES_PER_BUCKET for SMC is analogous to EM_FLOW_HASH_SEGS for EMC." might help people familiar with the current EMC to grok the SMC a little faster.
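For reference, the sizing arithmetic behind these constants (constants copied from the patch; the 4 MiB figure follows from 2 bytes of signature plus 2 bytes of flow index per entry):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMC_ENTRY_PER_BUCKET 4
#define SMC_ENTRIES (1u << 20)
#define SMC_BUCKET_CNT (SMC_ENTRIES / SMC_ENTRY_PER_BUCKET)

/* Same layout as the patch: each entry is a 16-bit signature plus a
 * 16-bit flow_table index, i.e. 4 bytes per cached flow. */
struct smc_bucket {
    uint16_t sig[SMC_ENTRY_PER_BUCKET];
    uint16_t flow_idx[SMC_ENTRY_PER_BUCKET];
};

/* Total SMC footprint per PMD thread: 1M entries * 4 B = 4 MiB. */
static size_t smc_cache_bytes(void)
{
    return SMC_BUCKET_CNT * sizeof(struct smc_bucket);
}
```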

> +#define SMC_ENTRIES (1u << 20)
> +#define SMC_BUCKET_CNT (SMC_ENTRIES / SMC_ENTRY_PER_BUCKET)
> #define
> +SMC_MASK (SMC_BUCKET_CNT - 1)
> +
>  /* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB
> */  #define DEFAULT_EM_FLOW_INSERT_INV_PROB 100
>  #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
> @@ -170,6 +186,21 @@ struct emc_cache {
>      int sweep_idx;                /* For emc_cache_slow_sweep(). */
>  };
> 
> +struct smc_bucket {
> +    uint16_t sig[SMC_ENTRY_PER_BUCKET];
> +    uint16_t flow_idx[SMC_ENTRY_PER_BUCKET]; };
> +
> +/* Signature match cache, differentiate from EMC cache */ struct
> +smc_cache {
> +    struct smc_bucket buckets[SMC_BUCKET_CNT]; };
> +
> +struct dfc_cache {
> +    struct emc_cache emc_cache;
> +    struct smc_cache smc_cache;
> +};
> +
>  /* Iterate in the exact match cache through every entry that might contain a
>   * miniflow with hash 'HASH'. */
>  #define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH)
> \
> @@ -214,10 +245,11 @@ static void dpcls_insert(struct dpcls *, struct
> dpcls_rule *,
>                           const struct netdev_flow_key *mask);  static void
> dpcls_remove(struct dpcls *, struct dpcls_rule *);  static bool dpcls_lookup(struct
> dpcls *cls,
> -                         const struct netdev_flow_key keys[],
> +                         const struct netdev_flow_key *keys[],
>                           struct dpcls_rule **rules, size_t cnt,
>                           int *num_lookups_p);
> -

> +static bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
> +                            const struct netdev_flow_key *target);
>  /* Set of supported meter flags */
>  #define DP_SUPPORTED_METER_FLAGS_MASK \
>      (OFPMF13_STATS | OFPMF13_PKTPS | OFPMF13_KBPS | OFPMF13_BURST)
> @@ -284,6 +316,8 @@ struct dp_netdev {
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
>      /* Enable collection of PMD performance metrics. */
>      atomic_bool pmd_perf_metrics;
> +    /* Enable the SMC cache from ovsdb config */
> +    atomic_bool smc_enable_db;
> 
>      /* Protects access to ofproto-dpif-upcall interface during revalidator
>       * thread synchronization. */
> @@ -552,7 +586,7 @@ struct dp_netdev_pmd_thread {
>       * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>       * need to be protected by 'non_pmd_mutex'.  Every other instance
>       * will only be accessed by its own pmd thread. */
> -    struct emc_cache flow_cache;
> +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache;
> 
>      /* Flow-Table and classifiers
>       *
> @@ -720,6 +754,7 @@ static int dpif_netdev_xps_get_tx_qid(const struct
> dp_netdev_pmd_thread *pmd,
> 
>  static inline bool emc_entry_alive(struct emc_entry *ce);  static void
> emc_clear_entry(struct emc_entry *ce);
> +static void smc_clear_entry(struct smc_bucket *b, int idx);
> 
>  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);  static inline
> bool @@ -740,6 +775,24 @@ emc_cache_init(struct emc_cache *flow_cache)
> }
> 
>  static void
> +smc_cache_init(struct smc_cache *smc_cache) {
> +    int i, j;
> +    for (i = 0; i < SMC_BUCKET_CNT; i++) {
> +        for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) {
> +            smc_cache->buckets[i].flow_idx[j] = UINT16_MAX;
> +        }
> +    }
> +}
> +
> +static void
> +dfc_cache_init(struct dfc_cache *flow_cache) {
> +    emc_cache_init(&flow_cache->emc_cache);
> +    smc_cache_init(&flow_cache->smc_cache);
> +}
> +
> +static void
>  emc_cache_uninit(struct emc_cache *flow_cache)  {
>      int i;
> @@ -749,6 +802,25 @@ emc_cache_uninit(struct emc_cache *flow_cache)
>      }
>  }
> 
> +static void
> +smc_cache_uninit(struct smc_cache *smc) {
> +    int i, j;
> +
> +    for (i = 0; i < SMC_BUCKET_CNT; i++) {
> +        for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) {
> +            smc_clear_entry(&(smc->buckets[i]), j);
> +        }
> +    }
> +}
> +
> +static void
> +dfc_cache_uninit(struct dfc_cache *flow_cache) {
> +    smc_cache_uninit(&flow_cache->smc_cache);
> +    emc_cache_uninit(&flow_cache->emc_cache);
> +}
> +
>  /* Check and clear dead flow references slowly (one entry at each
>   * invocation).  */
>  static void
> @@ -860,6 +932,7 @@ pmd_info_show_stats(struct ds *reply,
>                    "  packet recirculations: %"PRIu64"\n"
>                    "  avg. datapath passes per packet: %.02f\n"
>                    "  emc hits: %"PRIu64"\n"
> +                  "  smc hits: %"PRIu64"\n"
>                    "  megaflow hits: %"PRIu64"\n"
>                    "  avg. subtable lookups per megaflow hit: %.02f\n"
>                    "  miss with success upcall: %"PRIu64"\n"
> @@ -867,6 +940,7 @@ pmd_info_show_stats(struct ds *reply,
>                    "  avg. packets per output batch: %.02f\n",
>                    total_packets, stats[PMD_STAT_RECIRC],
>                    passes_per_pkt, stats[PMD_STAT_EXACT_HIT],
> +                  stats[PMD_STAT_SMC_HIT],
>                    stats[PMD_STAT_MASKED_HIT], lookups_per_hit,
>                    stats[PMD_STAT_MISS], stats[PMD_STAT_LOST],
>                    packets_per_batch);
> @@ -1580,6 +1654,7 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct
> dpif_dp_stats *stats)
>          stats->n_flows += cmap_count(&pmd->flow_table);
>          pmd_perf_read_counters(&pmd->perf_stats, pmd_stats);
>          stats->n_hit += pmd_stats[PMD_STAT_EXACT_HIT];
> +        stats->n_hit += pmd_stats[PMD_STAT_SMC_HIT];
>          stats->n_hit += pmd_stats[PMD_STAT_MASKED_HIT];
>          stats->n_missed += pmd_stats[PMD_STAT_MISS];
>          stats->n_lost += pmd_stats[PMD_STAT_LOST]; @@ -2272,10 +2347,11
> @@ emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd,
>       * probability of 1/100 ie. 1% */
> 
>      uint32_t min;
> +
>      atomic_read_relaxed(&pmd->dp->emc_insert_min, &min);
> 
>      if (min && random_uint32() <= min) {
> -        emc_insert(&pmd->flow_cache, key, flow);
> +        emc_insert(&(pmd->flow_cache).emc_cache, key, flow);
>      }
>  }
> 
> @@ -2297,6 +2373,76 @@ emc_lookup(struct emc_cache *cache, const struct
> netdev_flow_key *key)
>      return NULL;
>  }
> 
> +static inline const struct cmap_node *
> +smc_entry_get(struct dp_netdev_pmd_thread *pmd, struct smc_cache *cache,
> +                const uint32_t hash)
> +{
> +    struct smc_bucket *bucket = &cache->buckets[hash & SMC_MASK];
> +    uint16_t sig = hash >> 16;
> +    uint16_t index = UINT16_MAX;
> +
> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
> +        if (bucket->sig[i] == sig) {
> +            index = bucket->flow_idx[i];
> +            break;
> +        }
> +    }
> +    if (index != UINT16_MAX) {
> +        return cmap_find_by_index(&pmd->flow_table, index);
> +    }
> +    return NULL;
> +}
> +
> +static void
> +smc_clear_entry(struct smc_bucket *b, int idx) {
> +    b->flow_idx[idx] = UINT16_MAX;
> +}
> +
> +static inline int
[[BO'M]] As the return value seems to be 1 for ok and 0 for failure, suggest using a bool for the return value. Also, a comment describing when an insert may fail would help. The insertion strategy should be described too; it seems to be: 'if the entry already exists, update it; otherwise insert into a free slot; if no free slot is available, randomly evict an entry from the bucket'.
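Something like this sketch of the apparent strategy (names hypothetical, bool return as suggested; the extra free-slot check in the signature-match loop is my assumption, to avoid matching a stale signature in an emptied slot):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define ENTRIES_PER_BUCKET 4
#define IDX_FREE UINT16_MAX

struct bucket {
    uint16_t sig[ENTRIES_PER_BUCKET];
    uint16_t flow_idx[ENTRIES_PER_BUCKET];
};

/* Insert (sig, index): update a live entry with the same signature,
 * else fill a free slot, else evict a random victim.  Always succeeds
 * at this level; the caller decides whether the index is cacheable. */
static bool bucket_insert(struct bucket *b, uint16_t sig, uint16_t index)
{
    for (int i = 0; i < ENTRIES_PER_BUCKET; i++) {
        if (b->flow_idx[i] != IDX_FREE && b->sig[i] == sig) {
            b->flow_idx[i] = index;          /* Update in place. */
            return true;
        }
    }
    for (int i = 0; i < ENTRIES_PER_BUCKET; i++) {
        if (b->flow_idx[i] == IDX_FREE) {
            b->sig[i] = sig;                 /* Fill a free slot. */
            b->flow_idx[i] = index;
            return true;
        }
    }
    int victim = rand() % ENTRIES_PER_BUCKET; /* Random eviction. */
    b->sig[victim] = sig;
    b->flow_idx[victim] = index;
    return true;
}
```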
    
> +smc_insert(struct dp_netdev_pmd_thread *pmd,
> +           const struct netdev_flow_key *key,
> +           uint32_t hash)
> +{
> +    struct smc_cache *smc_cache = &pmd->flow_cache.smc_cache;
> +    struct smc_bucket *bucket = &smc_cache->buckets[key->hash &
> SMC_MASK];
> +    uint16_t index;
> +    uint32_t cmap_index;
> +    bool smc_enable_db;
> +
> +    atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
> +    if (!smc_enable_db) {
> +        return 0;
> +    }
> +
> +    cmap_index = cmap_find_index(&pmd->flow_table, hash);
> +    index = (cmap_index >= UINT16_MAX) ? UINT16_MAX :
> + (uint16_t)cmap_index;
> +
> +    /* If the index is larger than SMC can handel */
[[BO'M]]     handel -> handle. 
Also, it should be stated explicitly that while cmap_find_index can generate an index of up to UINT32_MAX, we only handle index values up to UINT16_MAX in order to reduce the SMC memory footprint.
> +    if (index == UINT16_MAX) {
> +        return 0;
> +    }
> +
> +    uint16_t sig = key->hash >> 16;
> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
> +        if(bucket->sig[i] == sig) {
> +            bucket->flow_idx[i] = index;
> +            return 1;
> +        }
> +    }
> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
> +        if(bucket->flow_idx[i] == UINT16_MAX) {
> +            bucket->sig[i] = sig;
> +            bucket->flow_idx[i] = index;
> +            return 1;
> +        }
> +    }
> +    int idx = random_uint32() & (SMC_ENTRY_PER_BUCKET - 1);
[[BO'M]] This is assuming SMC_ENTRY_PER_BUCKET is a power of 2? But there is no BUILD_ASSERT_DECL for that. Would using mod to avoid the power-of-2 constraint be too slow?
Also 'idx' makes me think of the cmap 'index' concept which is not relevant here. Re-using 'i' above as the array index would be more consistent.
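The constraint could be made explicit at compile time; a sketch using a plain C11 static assertion in place of OVS's BUILD_ASSERT_DECL:

```c
#include <assert.h>

#define SMC_ENTRY_PER_BUCKET 4

/* The mask trick below is only valid for power-of-two sizes. */
_Static_assert((SMC_ENTRY_PER_BUCKET & (SMC_ENTRY_PER_BUCKET - 1)) == 0,
               "SMC_ENTRY_PER_BUCKET must be a power of two");

/* Map a random value to a victim slot.  '%' instead of '&' would lift
 * the power-of-two constraint at a small cost on most CPUs. */
static int pick_victim(unsigned int r)
{
    return r & (SMC_ENTRY_PER_BUCKET - 1);
}
```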

> +    bucket->sig[idx] = sig;
> +    bucket->flow_idx[idx] = index;
> +    return 1;
> +}
> +
>  static struct dp_netdev_flow *
>  dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
>                            const struct netdev_flow_key *key, @@ -2310,7 +2456,7 @@
> dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
> 
>      cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
>      if (OVS_LIKELY(cls)) {
> -        dpcls_lookup(cls, key, &rule, 1, lookup_num_p);
> +        dpcls_lookup(cls, &key, &rule, 1, lookup_num_p);
>          netdev_flow = dp_netdev_flow_cast(rule);
>      }
>      return netdev_flow;
> @@ -3137,6 +3283,17 @@ dpif_netdev_set_config(struct dpif *dpif, const
> struct smap *other_config)
>          }
>      }
> 
> +    bool smc_enable = smap_get_bool(other_config, "smc-enable", false);
> +    bool cur_smc;
> +    atomic_read_relaxed(&dp->smc_enable_db, &cur_smc);
> +    if (smc_enable != cur_smc) {
> +        atomic_store_relaxed(&dp->smc_enable_db, smc_enable);
> +        if (smc_enable) {
> +            VLOG_INFO("SMC cache is enabled");
> +        } else {
> +            VLOG_INFO("SMC cache is disabled");
> +        }
> +    }
>      return 0;
>  }
> 
> @@ -4270,7 +4427,7 @@ pmd_thread_main(void *f_)
>      ovs_numa_thread_setaffinity_core(pmd->core_id);
>      dpdk_set_lcore_id(pmd->core_id);
>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
> -    emc_cache_init(&pmd->flow_cache);
> +    dfc_cache_init(&pmd->flow_cache);
>  reload:
>      pmd_alloc_static_tx_qid(pmd);
> 
> @@ -4324,7 +4481,7 @@ reload:
>              coverage_try_clear();
>              dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt);
>              if (!ovsrcu_try_quiesce()) {
> -                emc_cache_slow_sweep(&pmd->flow_cache);
> +                emc_cache_slow_sweep(&((pmd->flow_cache).emc_cache));
>              }
> 
>              atomic_read_relaxed(&pmd->reload, &reload); @@ -4349,7 +4506,7
> @@ reload:
>          goto reload;
>      }
> 
> -    emc_cache_uninit(&pmd->flow_cache);
> +    dfc_cache_uninit(&pmd->flow_cache);
>      free(poll_list);
>      pmd_free_cached_ports(pmd);
>      return NULL;
> @@ -4785,7 +4942,7 @@ dp_netdev_configure_pmd(struct
> dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      /* init the 'flow_cache' since there is no
>       * actual thread created for NON_PMD_CORE_ID. */
>      if (core_id == NON_PMD_CORE_ID) {
> -        emc_cache_init(&pmd->flow_cache);
> +        dfc_cache_init(&pmd->flow_cache);
>          pmd_alloc_static_tx_qid(pmd);
>      }
>      pmd_perf_stats_init(&pmd->perf_stats);
> @@ -4828,7 +4985,7 @@ dp_netdev_del_pmd(struct dp_netdev *dp, struct
> dp_netdev_pmd_thread *pmd)
>       * but extra cleanup is necessary */
>      if (pmd->core_id == NON_PMD_CORE_ID) {
>          ovs_mutex_lock(&dp->non_pmd_mutex);
> -        emc_cache_uninit(&pmd->flow_cache);
> +        dfc_cache_uninit(&pmd->flow_cache);
>          pmd_free_cached_ports(pmd);
>          pmd_free_static_tx_qid(pmd);
>          ovs_mutex_unlock(&dp->non_pmd_mutex);
> @@ -5132,10 +5289,67 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>      packet_batch_per_flow_update(batch, pkt, mf);  }
> 
> -/* Try to process all ('cnt') the 'packets' using only the exact match cache
> +/* SMC lookup function for a batch of pckets.
> + * By doing batching SMC lookup, we can use prefetch
> + * to hide memory access latency.
> + */
> +static inline void
> +smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
> +            struct netdev_flow_key *keys,
> +            struct netdev_flow_key **missed_keys,
> +            struct dp_packet_batch *packets_,
> +            struct packet_batch_per_flow batches[],
> +            size_t *n_batches, const int cnt) {
> +    int i;
> +    struct dp_packet *packet;
> +    size_t n_smc_hit = 0, n_missed = 0;
> +    struct dfc_cache *cache = &pmd->flow_cache;
> +    struct smc_cache *smc_cache = &cache->smc_cache;
> +    const struct cmap_node *flow_node;
> +
> +    /* Prefetch buckets for all packets */
> +    for (i = 0; i < cnt; i++) {
> +        OVS_PREFETCH(&smc_cache->buckets[keys[i].hash & SMC_MASK]);
> +    }
> +
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
> +        struct dp_netdev_flow *flow = NULL;
> +        flow_node = smc_entry_get(pmd, smc_cache, keys[i].hash);
> +
> +        if (OVS_LIKELY(flow_node != NULL)) {
> +            /* Optimistically we only check the head of the node linked list */
[[BO'M]] So this is only an issue for dpcls flows that share the same hash for pmd->flow_table (one time in 2^32 flows)? In that case the dp_netdev_flows hidden behind the first one will never be hit in the DFC but will still be found by dpcls_lookup? i.e. functionally still correct, just a very rare performance hit. Would a loop to follow the possible node chain hurt the usual-case performance much?
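The chain-walking variant I have in mind, as a generic sketch (the node and match types are hypothetical stand-ins for the cmap node chain and dpcls_rule_matches_key); the common no-collision case still exits after one iteration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the cmap collision chain and the rule-match check. */
struct node {
    struct node *next;
    int flow_id;
};

static bool rule_matches(const struct node *n, int want)
{
    return n->flow_id == want;
}

/* Walk the whole hash-collision chain instead of checking only the
 * head node; returns the first matching node or NULL. */
static const struct node *find_in_chain(const struct node *head, int want)
{
    for (const struct node *n = head; n; n = n->next) {
        if (rule_matches(n, want)) {
            return n;
        }
    }
    return NULL;
}
```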
> +            flow = CONTAINER_OF(flow_node, struct dp_netdev_flow, node);
> +            /* Since we dont have per-port megaflow to check the port number,
> +             * we need to  verify that the input ports match. */
> +            if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, &keys[i]) &&
> +                flow->flow.in_port.odp_port == packet->md.in_port.odp_port)) {
> +                /* SMC hit and emc miss, we insert into EMC */
> +                emc_probabilistic_insert(pmd, &keys[i], flow);
> +                keys[i].len =
> +                        netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
> +                dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> +                        n_batches);
> +                n_smc_hit++;
> +                continue;
> +            }
> +        }
> +
> +        /* SMC missed. Group missed packets together at
> +         * the beginning of the 'packets' array. */
> +        dp_packet_batch_refill(packets_, packet, i);
> +        /* Put missed keys to the pointer arrays return to the caller */
> +        missed_keys[n_missed++] = &keys[i];
> +    }
> +
> +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT,
> +n_smc_hit); }
> +
> +/* Try to process all ('cnt') the 'packets' using only the datapath
> +flow cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>   * miniflow is copied into 'keys' and the packet pointer is moved at the
> - * beginning of the 'packets' array.
> + * beginning of the 'packets' array. The pointers of missed keys are
> + put in the
> + * missed_keys pointer array for future processing.
>   *
>   * The function returns the number of packets that needs to be processed in the
>   * 'packets' array (they have been moved to the beginning of the vector).
> @@ -5147,20 +5361,24 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>   * will be ignored.
>   */
>  static inline size_t
> -emc_processing(struct dp_netdev_pmd_thread *pmd,
> +dfc_processing(struct dp_netdev_pmd_thread *pmd,
>                 struct dp_packet_batch *packets_,
>                 struct netdev_flow_key *keys,
> +               struct netdev_flow_key **missed_keys,
>                 struct packet_batch_per_flow batches[], size_t *n_batches,
>                 bool md_is_valid, odp_port_t port_no)  {
> -    struct emc_cache *flow_cache = &pmd->flow_cache;
>      struct netdev_flow_key *key = &keys[0];
> -    size_t n_missed = 0, n_dropped = 0;
> +    size_t n_missed = 0, n_emc_hit = 0;
> +    struct dfc_cache *cache = &pmd->flow_cache;
>      struct dp_packet *packet;
>      const size_t cnt = dp_packet_batch_size(packets_);
>      uint32_t cur_min;
>      int i;
> +    bool smc_enable_db;
> +
> 
> +    atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
>      atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
>      pmd_perf_update_counter(&pmd->perf_stats,
>                              md_is_valid ? PMD_STAT_RECIRC : PMD_STAT_RECV, @@ -
> 5171,7 +5389,6 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> 
>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>              dp_packet_delete(packet);
> -            n_dropped++;
>              continue;
>          }
> 
> @@ -5187,34 +5404,47 @@ emc_processing(struct dp_netdev_pmd_thread
> *pmd,
>          }
>          miniflow_extract(packet, &key->mf);
>          key->len = 0; /* Not computed yet. */
> -        /* If EMC is disabled skip hash computation and emc_lookup */
> -        if (cur_min) {
> +        /* If EMC and SMC disabled skip hash computation */
> +        if (smc_enable_db == true || cur_min != 0) {
>              if (!md_is_valid) {
>                  key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
>                          &key->mf);
>              } else {
>                  key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
>              }
> -            flow = emc_lookup(flow_cache, key);
> +        }
> +        if (cur_min) {
> +            flow = emc_lookup(&cache->emc_cache, key);
>          } else {
>              flow = NULL;
>          }
>          if (OVS_LIKELY(flow)) {
>              dp_netdev_queue_batches(packet, flow, &key->mf, batches,
>                                      n_batches);
> +            n_emc_hit++;
>          } else {
>              /* Exact match cache missed. Group missed packets together at
>               * the beginning of the 'packets' array. */
>              dp_packet_batch_refill(packets_, packet, i);
>              /* 'key[n_missed]' contains the key of the current packet and it
> -             * must be returned to the caller. The next key should be extracted
> -             * to 'keys[n_missed + 1]'. */
> +             * will be passed to SMC lookup. The next key should be extracted
> +             * to 'keys[n_missed + 1]'.
> +             * We also maintain a pointer array to keys missed both SMC and EMC
> +             * which will be returned to the caller for future processing. */
> +            missed_keys[n_missed] = key;
>              key = &keys[++n_missed];
>          }
>      }
> 
> -    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
> -                            cnt - n_dropped - n_missed);
> +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
> + n_emc_hit);
> +
> +    if (!smc_enable_db) {
> +        return dp_packet_batch_size(packets_);
> +    }
> +
> +    /* Packets miss EMC will do a batch lookup in SMC if enabled */
> +    smc_lookup_batch(pmd, keys, missed_keys, packets_, batches,
> +                            n_batches, n_missed);
> 
>      return dp_packet_batch_size(packets_);  } @@ -5282,6 +5512,8 @@
> handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>                                               add_actions->size);
>          }
>          ovs_mutex_unlock(&pmd->flow_mutex);
> +        uint32_t hash = dp_netdev_flow_hash(&netdev_flow->ufid);
> +        smc_insert(pmd, key, hash);
>          emc_probabilistic_insert(pmd, key, netdev_flow);
>      }
>      if (pmd_perf_metrics_enabled(pmd)) { @@ -5298,7 +5530,7 @@
> handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,  static inline void
> fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                       struct dp_packet_batch *packets_,
> -                     struct netdev_flow_key *keys,
> +                     struct netdev_flow_key **keys,
>                       struct packet_batch_per_flow batches[],
>                       size_t *n_batches,
>                       odp_port_t in_port) @@ -5320,12 +5552,13 @@
> fast_path_processing(struct dp_netdev_pmd_thread *pmd,
> 
>      for (size_t i = 0; i < cnt; i++) {
>          /* Key length is needed in all the cases, hash computed on demand. */
> -        keys[i].len = netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
> +        keys[i]->len =
> + netdev_flow_key_size(miniflow_n_values(&keys[i]->mf));
>      }
>      /* Get the classifier for the in_port */
>      cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
>      if (OVS_LIKELY(cls)) {
> -        any_miss = !dpcls_lookup(cls, keys, rules, cnt, &lookup_cnt);
> +        any_miss = !dpcls_lookup(cls, (const struct netdev_flow_key **)keys,
> +                                rules, cnt, &lookup_cnt);
>      } else {
>          any_miss = true;
>          memset(rules, 0, sizeof(rules)); @@ -5347,7 +5580,7 @@
> fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>              /* It's possible that an earlier slow path execution installed
>               * a rule covering this flow.  In this case, it's a lot cheaper
>               * to catch it here than execute a miss. */
> -            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i],
> +            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, keys[i],
>                                                      &add_lookup_cnt);
>              if (netdev_flow) {
>                  lookup_cnt += add_lookup_cnt; @@ -5355,7 +5588,7 @@
> fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                  continue;
>              }
> 
> -            int error = handle_packet_upcall(pmd, packet, &keys[i],
> +            int error = handle_packet_upcall(pmd, packet, keys[i],
>                                               &actions, &put_actions);
> 
>              if (OVS_UNLIKELY(error)) {
> @@ -5385,9 +5618,11 @@ fast_path_processing(struct dp_netdev_pmd_thread
> *pmd,
>          }
> 
>          flow = dp_netdev_flow_cast(rules[i]);
> +        uint32_t hash =  dp_netdev_flow_hash(&flow->ufid);
> +        smc_insert(pmd, keys[i], hash);
> +        emc_probabilistic_insert(pmd, keys[i], flow);
> 
> -        emc_probabilistic_insert(pmd, &keys[i], flow);
> -        dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
> +        dp_netdev_queue_batches(packet, flow, &keys[i]->mf, batches,
> + n_batches);
>      }
> 
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
> @@ -5417,17 +5652,18 @@ dp_netdev_input__(struct dp_netdev_pmd_thread
> *pmd,  #endif
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
>          struct netdev_flow_key keys[PKT_ARRAY_SIZE];
> +    struct netdev_flow_key *missed_keys[PKT_ARRAY_SIZE];
>      struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
>      size_t n_batches;
>      odp_port_t in_port;
> 
>      n_batches = 0;
> -    emc_processing(pmd, packets, keys, batches, &n_batches,
> +    dfc_processing(pmd, packets, keys, missed_keys, batches,
> + &n_batches,
>                              md_is_valid, port_no);
>      if (!dp_packet_batch_is_empty(packets)) {
>          /* Get ingress port from first packet's metadata. */
>          in_port = packets->packets[0]->md.in_port.odp_port;
> -        fast_path_processing(pmd, packets, keys,
> +        fast_path_processing(pmd, packets, missed_keys,
>                               batches, &n_batches, in_port);
[[BO'M]] Does the introduction of 'missed_keys' (array of pointers to keys) add anything to the existing 'keys' array populated by dfc_processing? It's still the same information passed to fast_path_processing, just now as an array of pointers as opposed to a pointer to an array.
>      }
> 
> @@ -6377,7 +6613,7 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule
> *rule)
> 
>  /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
>   * in 'mask' the values in 'key' and 'target' are the same. */ -static inline bool
> +static bool
>  dpcls_rule_matches_key(const struct dpcls_rule *rule,
>                         const struct netdev_flow_key *target)  { @@ -6404,7 +6640,7 @@
> dpcls_rule_matches_key(const struct dpcls_rule *rule,
>   *
>   * Returns true if all miniflows found a corresponding rule. */  static bool -
> dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
> +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
[[BO'M]] Is there a benefit to changing from keys[] to *keys[] ? struct netdev_flow_key **keys as used elsewhere is easier to understand.
>               struct dpcls_rule **rules, const size_t cnt,
>               int *num_lookups_p)
>  {
> @@ -6443,7 +6679,7 @@ dpcls_lookup(struct dpcls *cls, const struct
> netdev_flow_key keys[],
>           * masked with the subtable's mask to avoid hashing the wildcarded
>           * bits. */
>          ULLONG_FOR_EACH_1(i, keys_map) {
> -            hashes[i] = netdev_flow_key_hash_in_mask(&keys[i],
> +            hashes[i] = netdev_flow_key_hash_in_mask(keys[i],
>                                                       &subtable->mask);
>          }
>          /* Lookup. */
> @@ -6457,7 +6693,7 @@ dpcls_lookup(struct dpcls *cls, const struct
> netdev_flow_key keys[],
>              struct dpcls_rule *rule;
> 
>              CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
> -                if (OVS_LIKELY(dpcls_rule_matches_key(rule, &keys[i]))) {
> +                if (OVS_LIKELY(dpcls_rule_matches_key(rule, keys[i])))
> + {
>                      rules[i] = rule;
>                      /* Even at 20 Mpps the 32-bit hit_cnt cannot wrap
>                       * within one second optimization interval. */ diff --git
> a/tests/pmd.at b/tests/pmd.at index f3fac63..60452f5 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -185,6 +185,7 @@ CHECK_PMD_THREADS_CREATED()  AT_CHECK([ovs-
> appctl vlog/set dpif_netdev:dbg])  AT_CHECK([ovs-ofctl add-flow br0
> action=normal])  AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:emc-
> insert-inv-prob=1])
> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:smc-enable=true])
> 
>  sleep 1
> 
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
> 7609485..b4622f5 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -388,6 +388,19 @@
>          </p>
>        </column>
> 
> +      <column name="other_config" key="smc-enable"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Signature match cache or SMC is a cache between EMC and megaflow
> +          cache. It does not store the full key of the flow, so it is more
> +          memory efficient comparing to EMC cache. SMC is especially useful
> +          when flow count is larger than EMC capacity.
> +        </p>
> +        <p>
> +          Defaults to false but can be changed at any time.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="n-handler-threads"
>                type='{"type": "integer", "minInteger": 1}'>
>          <p>
> --
> 2.7.4
Yipeng Wang July 6, 2018, 1:40 a.m. | #3
Thanks for the comments, please see my reply inlined.

>I've checked the latest patch and the performance results I get are similar to the ones give in the previous patches. Also
>enabling/disabling the DFC on the fly works as expected.
>
>The main query I have regards the slow sweep for SMC
>
>[[BO'M]] The slow sweep removes EMC entries that are no longer valid because the associated dpcls rule has been changed or has
>expired. Is there a mechanism to remove SMC entries associated with a changed/expired dpcls rules?

[Wang, Yipeng] It is a good question. EMC needs to sweep mostly because EMC holds pointers to the dp_netdev_flow. If EMC did not sweep and release those pointers, a dead dpcls rule could not be de-allocated from memory as long as the EMC entry referencing it stayed. SMC does not hold pointers, so a dead dpcls rule can be de-allocated at any time. SMC holds an index into flow_table; if the entry in flow_table is removed, SMC will simply find a miss (which is fine, since SMC is just a cache). So functionally, slow sweeping is not needed. Sweeping SMC periodically would free up more empty entries for new flows, but it adds code complexity, so we eventually decided to remove that part of the code.
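The stale-index semantics described above can be sketched as follows (a toy model with invented names, not the OVS code): the cache stores only a signature and a flow-table index, so when the flow is removed from the table the leftover cache entry simply produces a miss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BUCKET_ENTRIES 4

/* One set-associative bucket: signatures plus flow-table indices only. */
struct toy_bucket {
    uint16_t sig[TOY_BUCKET_ENTRIES];
    uint16_t flow_idx[TOY_BUCKET_ENTRIES];
};

/* Stand-in for the flow table; a NULL slot means the flow was removed. */
struct toy_flow_table {
    const char *flows[64];
};

/* Returns the cached flow, or NULL on a miss (including a stale index). */
static const char *
toy_smc_lookup(const struct toy_bucket *b, const struct toy_flow_table *ft,
               uint32_t hash)
{
    uint16_t sig = hash >> 16;

    for (int i = 0; i < TOY_BUCKET_ENTRIES; i++) {
        if (b->sig[i] == sig) {
            /* No pointer is held: if the flow is gone, this is NULL and
             * the lookup degrades to an ordinary cache miss. */
            return ft->flows[b->flow_idx[i]];
        }
    }
    return NULL;
}
```

This is why no sweep is needed for correctness: removing a flow never leaves a dangling pointer behind, only a signature that will miss.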

>>  }
>>
>> +/* Find a node by the index of the entry of cmap. For example, index 7
>> +means
>> + * the second bucket and the third item.
>[[BO'M]] Is this assuming 4 for the bucket size? Maybe explicitly state where the bucket size comes from - CMAP_K? If so, is that
>value not 5 on a 64-bit system?
>
[Wang, Yipeng] I will change the comment to use CMAP_K instead of the numbers as example.

>> +++ b/lib/dpif-netdev.c
>> @@ -129,7 +129,9 @@ struct netdev_flow_key {
>>      uint64_t buf[FLOW_MAX_PACKET_U64S];  };
>>
>> -/* Exact match cache for frequently used flows
>> +/* EMC cache and SMC cache compose of the datapath flow cache (DFC)
>[[BO'M]] 'compose of the' -> 'compose the' or else 'together make the'
>> @@ -2297,6 +2373,76 @@ emc_lookup(struct emc_cache *cache, const struct
>> netdev_flow_key *key)
>>      return NULL;
>>  }
>>
>> +static inline const struct cmap_node *
>> +smc_entry_get(struct dp_netdev_pmd_thread *pmd, struct smc_cache *cache,
>> +                const uint32_t hash)
>[[BO'M]] Minor quibble. We can get cache arg from pmd arg - pmd->flow_cache->smc_cache.

[Wang, Yipeng] I will fix all the issues you mentioned above.

Thanks!
Yipeng
Yipeng Wang July 9, 2018, 11:55 p.m. | #4
Thanks for the comments, please see my reply inlined. I made all the changes you suggested and included in v5 which I will send out soon.

>> diff --git a/lib/cmap.c b/lib/cmap.c
>> index 07719a8..db1c806 100644
>> --- a/lib/cmap.c
>> +++ b/lib/cmap.c
>> @@ -373,6 +373,79 @@ cmap_find(const struct cmap *cmap, uint32_t hash)
>>                         hash);
>>  }
>>
>> +/* Find a node by the index of the entry of cmap. For example, index 7
>> +means
>> + * the second bucket and the third item.
>> + * Notice that it is not protected by the optimistic lock (versioning)
>> +because
>> + * of performance reasons. Currently it is only used by the datapath DFC cache.
>> + *
>> + * Return node for the entry of index or NULL if the index beyond
>> +boundary */ const struct cmap_node * cmap_find_by_index(const struct
>> +cmap *cmap, uint16_t index) {
>> +    const struct cmap_impl *impl = cmap_get_impl(cmap);
>> +
>> +    uint32_t b = index / CMAP_K;
>> +    uint32_t e = index % CMAP_K;
>> +
>> +    if (b > impl->mask) {
>> +        return NULL;
>> +    }
>> +
>> +    const struct cmap_bucket *bucket = &impl->buckets[b];
>> +
>> +    return cmap_node_next(&bucket->nodes[e]);
>> +}
>> +
>> +/* Find the index of certain hash value. Currently only used by the
>> +datapath
>> + * DFC cache.
>> + *
>> + * Return the index of the entry if found, or UINT32_MAX if not found
>[[BO'M]]  An intro to the concept of index would be useful here, especially as it does not currently exist in cmap. Something like: "The
>'index' of a cmap entry is a way to combine the specific bucket and item occupied by an entry into a convenient single integer value. It
>is not used internally by cmap." Unless of course that is actually wrong :)
[Wang, Yipeng]  I will add the comments you suggested. It indeed makes it more understandable.

>If a cmap's capacity exceeds the range of UINT32_MAX what happens? Does something different happen if the entry is in a bucket
>that can be expressed in a uint32_t versus a bucket that is outside of that range?
[Wang, Yipeng] I currently assume the cmap cannot be larger than UINT32_MAX. That is a very large number, and I expect OvS should not have to deal with a table that big. But you are right that I should make it clear;
I added comments in the cmap header file and other places to explain this restriction for the newly added API.
>
>> +*/
>> @@ -155,6 +166,11 @@ struct netdev_flow_key {  #define
>> EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)  #define
>> EM_FLOW_HASH_SEGS 2
>>
>> +#define SMC_ENTRY_PER_BUCKET 4
>[[BO'M]] SMC_ENTRY_PER_BUCKET -> SMC_ENTRIES_PER_BUCKET.
>Also a comment something like "A bucket forms the set of possible entries that a flow hash can occupy. Therefore
>SMC_ENTRIES_PER_BUCKET for SMC is analogous to EM_FLOW_HASH_SEGS for EMC." Might help people familiar with the current
>EMC to grok the SMC a little faster.
>
[Wang, Yipeng] I added more explanations as you suggested. For SMC_ENTRIES_PER_BUCKET, it is actually slightly different from
EM_FLOW_HASH_SEGS. For EMC, two hash functions are used to index two locations; for SMC, I currently use one hash
function to index a bucket and iterate over the entries in that bucket. It is more like a middle ground between EMC and CMAP.

>> @@ -2297,6 +2373,76 @@ emc_lookup(struct emc_cache *cache, const struct
>> netdev_flow_key *key)
>>      return NULL;
>>  }
>>
>> +static inline const struct cmap_node *
>> +smc_entry_get(struct dp_netdev_pmd_thread *pmd, struct smc_cache *cache,
>> +                const uint32_t hash)
>> +{
>> +    struct smc_bucket *bucket = &cache->buckets[hash & SMC_MASK];
>> +    uint16_t sig = hash >> 16;
>> +    uint16_t index = UINT16_MAX;
>> +
>> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
>> +        if (bucket->sig[i] == sig) {
>> +            index = bucket->flow_idx[i];
>> +            break;
>> +        }
>> +    }
>> +    if (index != UINT16_MAX) {
>> +        return cmap_find_by_index(&pmd->flow_table, index);
>> +    }
>> +    return NULL;
>> +}
>> +
>> +static void
>> +smc_clear_entry(struct smc_bucket *b, int idx) {
>> +    b->flow_idx[idx] = UINT16_MAX;
>> +}
>> +
>> +static inline int
>[[BO'M]] As the return value seems to be 1 for ok and 0 for failure, suggest using a bool for the return value. Also a comment describing
>when an insert may fail. Describe the insertion strategy, which seems to be 'if the entry already exists update it, otherwise insert in a free
>space, if no free space is available randomly pick an entry from the bucket'
>
[Wang, Yipeng] Thanks for pointing it out. I eventually changed it to a void function, since I realized that I do not actually need a return value.
I will include this change and more comments for the logic in V5. 

>> +smc_insert(struct dp_netdev_pmd_thread *pmd,
>> +           const struct netdev_flow_key *key,
>> +           uint32_t hash)
>> +{
>> +    struct smc_cache *smc_cache = &pmd->flow_cache.smc_cache;
>> +    struct smc_bucket *bucket = &smc_cache->buckets[key->hash &
>> SMC_MASK];
>> +    uint16_t index;
>> +    uint32_t cmap_index;
>> +    bool smc_enable_db;
>> +
>> +    atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
>> +    if (!smc_enable_db) {
>> +        return 0;
>> +    }
>> +
>> +    cmap_index = cmap_find_index(&pmd->flow_table, hash);
>> +    index = (cmap_index >= UINT16_MAX) ? UINT16_MAX :
>> + (uint16_t)cmap_index;
>> +
>> +    /* If the index is larger than SMC can handel */
>[[BO'M]]     handel -> handle.
>Also should state explicitly that while cmap_find_index can generate an index of up to UINT32_MAX, we only handle entry values up to
>UINT16_MAX in order to reduce SMC memory footprint.
[Wang, Yipeng] I will add the suggested comments.

>> +    if (index == UINT16_MAX) {
>> +        return 0;
>> +    }
>> +
>> +    uint16_t sig = key->hash >> 16;
>> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
>> +        if(bucket->sig[i] == sig) {
>> +            bucket->flow_idx[i] = index;
>> +            return 1;
>> +        }
>> +    }
>> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
>> +        if(bucket->flow_idx[i] == UINT16_MAX) {
>> +            bucket->sig[i] = sig;
>> +            bucket->flow_idx[i] = index;
>> +            return 1;
>> +        }
>> +    }
>> +    int idx = random_uint32() & (SMC_ENTRY_PER_BUCKET - 1);
>[[BO'M]] This is assuming SMC_ENTRY_PER_BUCKET is a power of 2? But there is no BUILD_ASSERT_DECL for that. Would using mod
>to avoid the power-of-2 constraint be too slow?
>Also 'idx' makes me think of the cmap 'index' concept which is not relevant here. Re-using 'i' above as the array index would be more
>consistent.
[Wang, Yipeng] You are right, I should use mod. If it is a power of 2, I believe the compiler will optimize it to use &.
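The point above is easy to check in isolation: for unsigned operands, `x % N` and `x & (N - 1)` agree whenever N is a power of two, so writing the modulo keeps the code correct for any bucket count while still compiling down to an AND. A minimal sketch (toy names, assuming a power-of-two bucket size as in the patch):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WAYS 4   /* power of two, like SMC_ENTRY_PER_BUCKET */

/* Portable form: correct for any TOY_WAYS. */
static uint32_t
pick_entry_mod(uint32_t r)
{
    return r % TOY_WAYS;
}

/* Mask form: only valid when TOY_WAYS is a power of two. */
static uint32_t
pick_entry_mask(uint32_t r)
{
    return r & (TOY_WAYS - 1);
}
```

With optimization enabled, compilers emit the same AND instruction for both, so there is no runtime cost to the safer spelling.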
>
>> @@ -5132,10 +5289,67 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>>      packet_batch_per_flow_update(batch, pkt, mf);  }
>>
>> -/* Try to process all ('cnt') the 'packets' using only the exact match cache
>> +/* SMC lookup function for a batch of pckets.
>> + * By doing batching SMC lookup, we can use prefetch
>> + * to hide memory access latency.
>> + */
>> +static inline void
>> +smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>> +            struct netdev_flow_key *keys,
>> +            struct netdev_flow_key **missed_keys,
>> +            struct dp_packet_batch *packets_,
>> +            struct packet_batch_per_flow batches[],
>> +            size_t *n_batches, const int cnt) {
>> +    int i;
>> +    struct dp_packet *packet;
>> +    size_t n_smc_hit = 0, n_missed = 0;
>> +    struct dfc_cache *cache = &pmd->flow_cache;
>> +    struct smc_cache *smc_cache = &cache->smc_cache;
>> +    const struct cmap_node *flow_node;
>> +
>> +    /* Prefetch buckets for all packets */
>> +    for (i = 0; i < cnt; i++) {
>> +        OVS_PREFETCH(&smc_cache->buckets[keys[i].hash & SMC_MASK]);
>> +    }
>> +
>> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>> +        struct dp_netdev_flow *flow = NULL;
>> +        flow_node = smc_entry_get(pmd, smc_cache, keys[i].hash);
>> +
>> +        if (OVS_LIKELY(flow_node != NULL)) {
>> +            /* Optimistically we only check the head of the node linked list */
>[[BO'M]] So this is only an issue for dpcls flows that share the same hash for pmd->flow_table (one time in 2^32 flows)? In that case
>the dp_netdev_flows hidden behind the first one will never be hit in the DFC, but they would be found in dpcls_lookup? i.e. functionally still
>correct, just a very rare performance hit. Would a loop to follow the possible node chain hurt the usual-case performance much?
[Wang, Yipeng] You understand it correctly. I was worried about the overhead of the loop (and branches) and thought it was a corner case anyway.
But I tested again, and adding the loop back did not seem to harm performance in my new test run. So I will add it back.
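The fix being discussed — walking the whole collision chain instead of checking only its head — can be sketched like this (toy structures with invented names, standing in for the cmap node list and the full miniflow comparison):

```c
#include <assert.h>
#include <stddef.h>

/* A node in a hash-collision chain, like a cmap node list. */
struct toy_node {
    struct toy_node *next;   /* next node with the same hash */
    int flow_id;
    unsigned key;            /* stand-in for the full-key compare */
};

/* Walk the whole chain so flows hidden behind the head are still found;
 * returns the matching flow id, or -1 on a miss. */
static int
toy_chain_find(const struct toy_node *head, unsigned key)
{
    for (const struct toy_node *n = head; n; n = n->next) {
        if (n->key == key) {
            return n->flow_id;
        }
    }
    return -1;
}
```

Checking only `head` would be the optimistic variant: functionally safe here only because a miss falls back to the slower lookup, at a rare performance cost for colliding flows.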
>> @@ -5417,17 +5652,18 @@ dp_netdev_input__(struct dp_netdev_pmd_thread
>> *pmd,  #endif
>>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
>>          struct netdev_flow_key keys[PKT_ARRAY_SIZE];
>> +    struct netdev_flow_key *missed_keys[PKT_ARRAY_SIZE];
>>      struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
>>      size_t n_batches;
>>      odp_port_t in_port;
>>
>>      n_batches = 0;
>> -    emc_processing(pmd, packets, keys, batches, &n_batches,
>> +    dfc_processing(pmd, packets, keys, missed_keys, batches,
>> + &n_batches,
>>                              md_is_valid, port_no);
>>      if (!dp_packet_batch_is_empty(packets)) {
>>          /* Get ingress port from first packet's metadata. */
>>          in_port = packets->packets[0]->md.in_port.odp_port;
>> -        fast_path_processing(pmd, packets, keys,
>> +        fast_path_processing(pmd, packets, missed_keys,
>>                               batches, &n_batches, in_port);
>[[BO'M]] Does the introduction of 'missed_keys' (array of pointers to keys) add anything to the existing 'keys' array populated by
>dfc_processing? It's still the same information passed to fast_path_processing just now as an array of pointers as opposed to a pointer
>to an array.
 [Wang, Yipeng] It is a good question, and there is no major functional difference; it is just a pointer array now instead of the key array.
Here is the reason I made this change. Initially, SMC lookup was not batched: every EMC miss would go look up SMC, and
if a key missed both EMC and SMC, it was left in the key array for later fast_path_processing.
I found this was not optimal. Batching the SMC lookup lets me use prefetching to improve performance (or even software pipelining).
With batched SMC lookup, EMC lookup stays the same as before: keys missing EMC are left in the key array, which I then pass to SMC for the batched lookup.
During the SMC lookup, I build the key pointer array to point at the keys that miss SMC as well, and eventually pass that pointer array to fast_path_processing.
I use the new pointer array because I do not want to move keys around to form a new key array during SMC lookup; that would be costly. Another way to do it
is to keep a "hit mask" showing which keys in the key array already hit SMC, but then I would still need to modify many functions to pass this mask around.
Functionally, nothing should change compared to the master head.
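The miss-collection scheme described above can be modeled in a few lines (toy types and invented names, not the dpif-netdev code): keys stay in their original array, and misses are gathered as pointers, so no key struct is ever copied between the cache stage and the fast path.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct netdev_flow_key. */
struct toy_key {
    unsigned hash;
};

/* Collect pointers to the keys whose cache lookup missed;
 * returns the number of misses. */
static size_t
toy_collect_misses(struct toy_key *keys, const int hit[], size_t cnt,
                   struct toy_key **missed)
{
    size_t n = 0;

    for (size_t i = 0; i < cnt; i++) {
        if (!hit[i]) {
            missed[n++] = &keys[i];   /* pointer only, never a copy */
        }
    }
    return n;
}
```

The alternative hit-mask design would avoid the second array but would thread an extra mask argument through every downstream function, which is the complexity being traded away here.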
>>
>> @@ -6377,7 +6613,7 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule
>> *rule)
>>
>>  /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
>>   * in 'mask' the values in 'key' and 'target' are the same. */ -static inline bool
>> +static bool
>>  dpcls_rule_matches_key(const struct dpcls_rule *rule,
>>                         const struct netdev_flow_key *target)  { @@ -6404,7 +6640,7 @@
>> dpcls_rule_matches_key(const struct dpcls_rule *rule,
>>   *
>>   * Returns true if all miniflows found a corresponding rule. */  static bool -
>> dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
>> +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
>[[BO'M]] Is there a benefit to changing from keys[] to *keys[] ? struct netdev_flow_key **keys as used elsewhere is easier to
>understand.
[Wang, Yipeng] Similar to what I explained above, it is just because the batched SMC lookup now forms a pointer array instead
of the original key array, and I pass the pointer array around after SMC lookup instead of the key array as before.
Functionally it is the same as the master head; there is no other specific reason.

Patch

diff --git a/Documentation/topics/dpdk/bridge.rst b/Documentation/topics/dpdk/bridge.rst
index 63f8a62..df74c02 100644
--- a/Documentation/topics/dpdk/bridge.rst
+++ b/Documentation/topics/dpdk/bridge.rst
@@ -102,3 +102,18 @@  For certain traffic profiles with many parallel flows, it's recommended to set
 ``N`` to '0' to achieve higher forwarding performance.
 
 For more information on the EMC refer to :doc:`/intro/install/dpdk` .
+
+
+SMC cache (experimental)
+-------------------------
+
+SMC cache or signature match cache is a new cache level after EMC cache.
+The difference between SMC and EMC is SMC only stores a signature of a flow
+thus it is much more memory efficient. With same memory space, EMC can store 8k
+flows while SMC can store 1M flows. When traffic flow count is much larger than
+EMC size, it is generally beneficial to turn off EMC and turn on SMC. It is
+currently turned off by default and an experimental feature.
+
+To turn on SMC::
+
+    $ ovs-vsctl --no-wait set Open_vSwitch . other_config:smc-enable=true
diff --git a/NEWS b/NEWS
index cd15a33..26d6ef1 100644
--- a/NEWS
+++ b/NEWS
@@ -40,6 +40,8 @@  Post-v2.9.0
          ovs-appctl dpif-netdev/pmd-perf-show
      * Supervision of PMD performance metrics and logging of suspicious
        iterations
+     * Add signature match cache (SMC) as experimental feature. When turned on,
+       it improves throughput when traffic has many more flows than EMC size.
    - ERSPAN:
      * Implemented ERSPAN protocol (draft-foschiano-erspan-00.txt) for
        both kernel datapath and userspace datapath.
diff --git a/lib/cmap.c b/lib/cmap.c
index 07719a8..db1c806 100644
--- a/lib/cmap.c
+++ b/lib/cmap.c
@@ -373,6 +373,79 @@  cmap_find(const struct cmap *cmap, uint32_t hash)
                        hash);
 }
 
+/* Find a node by the index of the entry of cmap. For example, index 7 means
+ * the second bucket and the third item.
+ * Notice that it is not protected by the optimistic lock (versioning) because
+ * of performance reasons. Currently it is only used by the datapath DFC cache.
+ *
+ * Return node for the entry of index or NULL if the index beyond boundary */
+const struct cmap_node *
+cmap_find_by_index(const struct cmap *cmap, uint16_t index)
+{
+    const struct cmap_impl *impl = cmap_get_impl(cmap);
+
+    uint32_t b = index / CMAP_K;
+    uint32_t e = index % CMAP_K;
+
+    if (b > impl->mask) {
+        return NULL;
+    }
+
+    const struct cmap_bucket *bucket = &impl->buckets[b];
+
+    return cmap_node_next(&bucket->nodes[e]);
+}
+
+/* Find the index of certain hash value. Currently only used by the datapath
+ * DFC cache.
+ *
+ * Return the index of the entry if found, or UINT32_MAX if not found */
+
+uint32_t
+cmap_find_index(const struct cmap *cmap, uint32_t hash)
+{
+    const struct cmap_impl *impl = cmap_get_impl(cmap);
+    uint32_t h1 = rehash(impl, hash);
+    uint32_t h2 = other_hash(h1);
+
+    uint32_t b_index1 = h1 & impl->mask;
+    uint32_t b_index2 = h2 & impl->mask;
+
+    uint32_t c1, c2;
+    uint32_t index = UINT32_MAX;
+
+    const struct cmap_bucket *b1 = &impl->buckets[b_index1];
+    const struct cmap_bucket *b2 = &impl->buckets[b_index2];
+
+    do {
+        do {
+            c1 = read_even_counter(b1);
+            for (int i = 0; i < CMAP_K; i++) {
+                if (b1->hashes[i] == hash) {
+                    index = b_index1 * CMAP_K + i;
+                 }
+            }
+        } while (OVS_UNLIKELY(counter_changed(b1, c1)));
+        if (index != UINT32_MAX) {
+            break;
+        }
+        do {
+            c2 = read_even_counter(b2);
+            for (int i = 0; i < CMAP_K; i++) {
+                if (b2->hashes[i] == hash) {
+                    index = b_index2 * CMAP_K + i;
+                }
+            }
+        } while (OVS_UNLIKELY(counter_changed(b2, c2)));
+
+        if (index != UINT32_MAX) {
+            break;
+        }
+    } while (OVS_UNLIKELY(counter_changed(b1, c1)));
+
+    return index;
+}
+
 /* Looks up multiple 'hashes', when the corresponding bit in 'map' is 1,
  * and sets the corresponding pointer in 'nodes', if the hash value was
  * found from the 'cmap'.  In other cases the 'nodes' values are not changed,
diff --git a/lib/cmap.h b/lib/cmap.h
index 8bfb6c0..076afd6 100644
--- a/lib/cmap.h
+++ b/lib/cmap.h
@@ -145,6 +145,11 @@  size_t cmap_replace(struct cmap *, struct cmap_node *old_node,
 const struct cmap_node *cmap_find(const struct cmap *, uint32_t hash);
 struct cmap_node *cmap_find_protected(const struct cmap *, uint32_t hash);
 
+/* Find node by index or find index by hash */
+const struct cmap_node * cmap_find_by_index(const struct cmap *,
+                                            uint16_t index);
+uint32_t cmap_find_index(const struct cmap *, uint32_t hash);
+
 /* Looks up multiple 'hashes', when the corresponding bit in 'map' is 1,
  * and sets the corresponding pointer in 'nodes', if the hash value was
  * found from the 'cmap'.  In other cases the 'nodes' values are not changed,
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index b8aa4e3..299d52a 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -56,6 +56,7 @@  extern "C" {
 
 enum pmd_stat_type {
     PMD_STAT_EXACT_HIT,     /* Packets that had an exact match (emc). */
+    PMD_STAT_SMC_HIT,        /* Packets that had a sig match hit (SMC). */
     PMD_STAT_MASKED_HIT,    /* Packets that matched in the flow table. */
     PMD_STAT_MISS,          /* Packets that did not match and upcall was ok. */
     PMD_STAT_LOST,          /* Packets that did not match and upcall failed. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 9390fff..de7d2ca 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -129,7 +129,9 @@  struct netdev_flow_key {
     uint64_t buf[FLOW_MAX_PACKET_U64S];
 };
 
-/* Exact match cache for frequently used flows
+/* EMC cache and SMC cache compose of the datapath flow cache (DFC)
+ *
+ * Exact match cache for frequently used flows
  *
  * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to
  * search its entries for a miniflow that matches exactly the miniflow of the
@@ -143,6 +145,15 @@  struct netdev_flow_key {
  * value is the index of a cache entry where the miniflow could be.
  *
  *
+ * Signature match cache (SMC)
+ *
+ * This cache stores a 16-bit signature for each flow without storing keys, and
+ * stores the corresponding 16-bit flow_table index to the 'dp_netdev_flow'.
+ * Each flow thus occupies 32 bits, which is much more memory-efficient
+ * than EMC. SMC uses a set-associative design in which each bucket
+ * contains SMC_ENTRY_PER_BUCKET entries.
+ *
+ *
  * Thread-safety
  * =============
  *
@@ -155,6 +166,11 @@  struct netdev_flow_key {
 #define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)
 #define EM_FLOW_HASH_SEGS 2
 
+#define SMC_ENTRY_PER_BUCKET 4
+#define SMC_ENTRIES (1u << 20)
+#define SMC_BUCKET_CNT (SMC_ENTRIES / SMC_ENTRY_PER_BUCKET)
+#define SMC_MASK (SMC_BUCKET_CNT - 1)
+
 /* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB */
 #define DEFAULT_EM_FLOW_INSERT_INV_PROB 100
 #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
@@ -170,6 +186,21 @@  struct emc_cache {
     int sweep_idx;                /* For emc_cache_slow_sweep(). */
 };
 
+struct smc_bucket {
+    uint16_t sig[SMC_ENTRY_PER_BUCKET];
+    uint16_t flow_idx[SMC_ENTRY_PER_BUCKET];
+};
+
+/* Signature match cache, as distinct from the EMC cache. */
+struct smc_cache {
+    struct smc_bucket buckets[SMC_BUCKET_CNT];
+};
+
+struct dfc_cache {
+    struct emc_cache emc_cache;
+    struct smc_cache smc_cache;
+};
+
 /* Iterate in the exact match cache through every entry that might contain a
  * miniflow with hash 'HASH'. */
 #define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH)                 \
@@ -214,10 +245,11 @@  static void dpcls_insert(struct dpcls *, struct dpcls_rule *,
                          const struct netdev_flow_key *mask);
 static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
 static bool dpcls_lookup(struct dpcls *cls,
-                         const struct netdev_flow_key keys[],
+                         const struct netdev_flow_key *keys[],
                          struct dpcls_rule **rules, size_t cnt,
                          int *num_lookups_p);
-
+static bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
+                                   const struct netdev_flow_key *target);
 /* Set of supported meter flags */
 #define DP_SUPPORTED_METER_FLAGS_MASK \
     (OFPMF13_STATS | OFPMF13_PKTPS | OFPMF13_KBPS | OFPMF13_BURST)
@@ -284,6 +316,8 @@  struct dp_netdev {
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
     /* Enable collection of PMD performance metrics. */
     atomic_bool pmd_perf_metrics;
+    /* Enable the SMC cache from the ovsdb config. */
+    atomic_bool smc_enable_db;
 
     /* Protects access to ofproto-dpif-upcall interface during revalidator
      * thread synchronization. */
@@ -552,7 +586,7 @@  struct dp_netdev_pmd_thread {
      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
      * need to be protected by 'non_pmd_mutex'.  Every other instance
      * will only be accessed by its own pmd thread. */
-    struct emc_cache flow_cache;
+    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache;
 
     /* Flow-Table and classifiers
      *
@@ -720,6 +754,7 @@  static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
 
 static inline bool emc_entry_alive(struct emc_entry *ce);
 static void emc_clear_entry(struct emc_entry *ce);
+static void smc_clear_entry(struct smc_bucket *b, int idx);
 
 static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
 static inline bool
@@ -740,6 +775,24 @@  emc_cache_init(struct emc_cache *flow_cache)
 }
 
 static void
+smc_cache_init(struct smc_cache *smc_cache)
+{
+    int i, j;
+    for (i = 0; i < SMC_BUCKET_CNT; i++) {
+        for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) {
+            smc_cache->buckets[i].flow_idx[j] = UINT16_MAX;
+        }
+    }
+}
+
+static void
+dfc_cache_init(struct dfc_cache *flow_cache)
+{
+    emc_cache_init(&flow_cache->emc_cache);
+    smc_cache_init(&flow_cache->smc_cache);
+}
+
+static void
 emc_cache_uninit(struct emc_cache *flow_cache)
 {
     int i;
@@ -749,6 +802,25 @@  emc_cache_uninit(struct emc_cache *flow_cache)
     }
 }
 
+static void
+smc_cache_uninit(struct smc_cache *smc)
+{
+    int i, j;
+
+    for (i = 0; i < SMC_BUCKET_CNT; i++) {
+        for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) {
+            smc_clear_entry(&(smc->buckets[i]), j);
+        }
+    }
+}
+
+static void
+dfc_cache_uninit(struct dfc_cache *flow_cache)
+{
+    smc_cache_uninit(&flow_cache->smc_cache);
+    emc_cache_uninit(&flow_cache->emc_cache);
+}
+
 /* Check and clear dead flow references slowly (one entry at each
  * invocation).  */
 static void
@@ -860,6 +932,7 @@  pmd_info_show_stats(struct ds *reply,
                   "  packet recirculations: %"PRIu64"\n"
                   "  avg. datapath passes per packet: %.02f\n"
                   "  emc hits: %"PRIu64"\n"
+                  "  smc hits: %"PRIu64"\n"
                   "  megaflow hits: %"PRIu64"\n"
                   "  avg. subtable lookups per megaflow hit: %.02f\n"
                   "  miss with success upcall: %"PRIu64"\n"
@@ -867,6 +940,7 @@  pmd_info_show_stats(struct ds *reply,
                   "  avg. packets per output batch: %.02f\n",
                   total_packets, stats[PMD_STAT_RECIRC],
                   passes_per_pkt, stats[PMD_STAT_EXACT_HIT],
+                  stats[PMD_STAT_SMC_HIT],
                   stats[PMD_STAT_MASKED_HIT], lookups_per_hit,
                   stats[PMD_STAT_MISS], stats[PMD_STAT_LOST],
                   packets_per_batch);
@@ -1580,6 +1654,7 @@  dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
         stats->n_flows += cmap_count(&pmd->flow_table);
         pmd_perf_read_counters(&pmd->perf_stats, pmd_stats);
         stats->n_hit += pmd_stats[PMD_STAT_EXACT_HIT];
+        stats->n_hit += pmd_stats[PMD_STAT_SMC_HIT];
         stats->n_hit += pmd_stats[PMD_STAT_MASKED_HIT];
         stats->n_missed += pmd_stats[PMD_STAT_MISS];
         stats->n_lost += pmd_stats[PMD_STAT_LOST];
@@ -2272,10 +2347,11 @@  emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd,
      * probability of 1/100 ie. 1% */
 
     uint32_t min;
+
     atomic_read_relaxed(&pmd->dp->emc_insert_min, &min);
 
     if (min && random_uint32() <= min) {
-        emc_insert(&pmd->flow_cache, key, flow);
+        emc_insert(&pmd->flow_cache.emc_cache, key, flow);
     }
 }
 
@@ -2297,6 +2373,76 @@  emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
     return NULL;
 }
 
+static inline const struct cmap_node *
+smc_entry_get(struct dp_netdev_pmd_thread *pmd, struct smc_cache *cache,
+                const uint32_t hash)
+{
+    struct smc_bucket *bucket = &cache->buckets[hash & SMC_MASK];
+    uint16_t sig = hash >> 16;
+    uint16_t index = UINT16_MAX;
+
+    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
+        if (bucket->sig[i] == sig) {
+            index = bucket->flow_idx[i];
+            break;
+        }
+    }
+    if (index != UINT16_MAX) {
+        return cmap_find_by_index(&pmd->flow_table, index);
+    }
+    return NULL;
+}
+
+static void
+smc_clear_entry(struct smc_bucket *b, int idx)
+{
+    b->flow_idx[idx] = UINT16_MAX;
+}
+
+static inline int
+smc_insert(struct dp_netdev_pmd_thread *pmd,
+           const struct netdev_flow_key *key,
+           uint32_t hash)
+{
+    struct smc_cache *smc_cache = &pmd->flow_cache.smc_cache;
+    struct smc_bucket *bucket = &smc_cache->buckets[key->hash & SMC_MASK];
+    uint16_t index;
+    uint32_t cmap_index;
+    bool smc_enable_db;
+
+    atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
+    if (!smc_enable_db) {
+        return 0;
+    }
+
+    cmap_index = cmap_find_index(&pmd->flow_table, hash);
+    index = (cmap_index >= UINT16_MAX) ? UINT16_MAX : (uint16_t)cmap_index;
+
+    /* If the index is larger than the SMC can handle. */
+    if (index == UINT16_MAX) {
+        return 0;
+    }
+
+    uint16_t sig = key->hash >> 16;
+    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
+        if (bucket->sig[i] == sig) {
+            bucket->flow_idx[i] = index;
+            return 1;
+        }
+    }
+    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
+        if (bucket->flow_idx[i] == UINT16_MAX) {
+            bucket->sig[i] = sig;
+            bucket->flow_idx[i] = index;
+            return 1;
+        }
+    }
+    int idx = random_uint32() & (SMC_ENTRY_PER_BUCKET - 1);
+    bucket->sig[idx] = sig;
+    bucket->flow_idx[idx] = index;
+    return 1;
+}
+
 static struct dp_netdev_flow *
 dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
                           const struct netdev_flow_key *key,
@@ -2310,7 +2456,7 @@  dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
 
     cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
     if (OVS_LIKELY(cls)) {
-        dpcls_lookup(cls, key, &rule, 1, lookup_num_p);
+        dpcls_lookup(cls, &key, &rule, 1, lookup_num_p);
         netdev_flow = dp_netdev_flow_cast(rule);
     }
     return netdev_flow;
@@ -3137,6 +3283,17 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
         }
     }
 
+    bool smc_enable = smap_get_bool(other_config, "smc-enable", false);
+    bool cur_smc;
+    atomic_read_relaxed(&dp->smc_enable_db, &cur_smc);
+    if (smc_enable != cur_smc) {
+        atomic_store_relaxed(&dp->smc_enable_db, smc_enable);
+        if (smc_enable) {
+            VLOG_INFO("SMC cache is enabled");
+        } else {
+            VLOG_INFO("SMC cache is disabled");
+        }
+    }
     return 0;
 }
 
@@ -4270,7 +4427,7 @@  pmd_thread_main(void *f_)
     ovs_numa_thread_setaffinity_core(pmd->core_id);
     dpdk_set_lcore_id(pmd->core_id);
     poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
-    emc_cache_init(&pmd->flow_cache);
+    dfc_cache_init(&pmd->flow_cache);
 reload:
     pmd_alloc_static_tx_qid(pmd);
 
@@ -4324,7 +4481,7 @@  reload:
             coverage_try_clear();
             dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt);
             if (!ovsrcu_try_quiesce()) {
-                emc_cache_slow_sweep(&pmd->flow_cache);
+                emc_cache_slow_sweep(&pmd->flow_cache.emc_cache);
             }
 
             atomic_read_relaxed(&pmd->reload, &reload);
@@ -4349,7 +4506,7 @@  reload:
         goto reload;
     }
 
-    emc_cache_uninit(&pmd->flow_cache);
+    dfc_cache_uninit(&pmd->flow_cache);
     free(poll_list);
     pmd_free_cached_ports(pmd);
     return NULL;
@@ -4785,7 +4942,7 @@  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     /* init the 'flow_cache' since there is no
      * actual thread created for NON_PMD_CORE_ID. */
     if (core_id == NON_PMD_CORE_ID) {
-        emc_cache_init(&pmd->flow_cache);
+        dfc_cache_init(&pmd->flow_cache);
         pmd_alloc_static_tx_qid(pmd);
     }
     pmd_perf_stats_init(&pmd->perf_stats);
@@ -4828,7 +4985,7 @@  dp_netdev_del_pmd(struct dp_netdev *dp, struct dp_netdev_pmd_thread *pmd)
      * but extra cleanup is necessary */
     if (pmd->core_id == NON_PMD_CORE_ID) {
         ovs_mutex_lock(&dp->non_pmd_mutex);
-        emc_cache_uninit(&pmd->flow_cache);
+        dfc_cache_uninit(&pmd->flow_cache);
         pmd_free_cached_ports(pmd);
         pmd_free_static_tx_qid(pmd);
         ovs_mutex_unlock(&dp->non_pmd_mutex);
@@ -5132,10 +5289,67 @@  dp_netdev_queue_batches(struct dp_packet *pkt,
     packet_batch_per_flow_update(batch, pkt, mf);
 }
 
-/* Try to process all ('cnt') the 'packets' using only the exact match cache
+/* SMC lookup function for a batch of packets.
+ * By batching SMC lookups, we can use prefetching to hide
+ * memory access latency.
+ */
+static inline void
+smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
+            struct netdev_flow_key *keys,
+            struct netdev_flow_key **missed_keys,
+            struct dp_packet_batch *packets_,
+            struct packet_batch_per_flow batches[],
+            size_t *n_batches, const int cnt)
+{
+    int i;
+    struct dp_packet *packet;
+    size_t n_smc_hit = 0, n_missed = 0;
+    struct dfc_cache *cache = &pmd->flow_cache;
+    struct smc_cache *smc_cache = &cache->smc_cache;
+    const struct cmap_node *flow_node;
+
+    /* Prefetch the buckets for all packets. */
+    for (i = 0; i < cnt; i++) {
+        OVS_PREFETCH(&smc_cache->buckets[keys[i].hash & SMC_MASK]);
+    }
+
+    DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
+        struct dp_netdev_flow *flow = NULL;
+        flow_node = smc_entry_get(pmd, smc_cache, keys[i].hash);
+
+        if (OVS_LIKELY(flow_node != NULL)) {
+            /* Optimistically check only the head of the node list. */
+            flow = CONTAINER_OF(flow_node, struct dp_netdev_flow, node);
+            /* Since we don't have per-port megaflows to check the port
+             * number, we need to verify that the input ports match. */
+            if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, &keys[i]) &&
+                flow->flow.in_port.odp_port == packet->md.in_port.odp_port)) {
+                /* SMC hit and EMC miss: insert into the EMC. */
+                emc_probabilistic_insert(pmd, &keys[i], flow);
+                keys[i].len =
+                        netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
+                dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
+                        n_batches);
+                n_smc_hit++;
+                continue;
+            }
+        }
+
+        /* SMC missed. Group missed packets together at
+         * the beginning of the 'packets' array. */
+        dp_packet_batch_refill(packets_, packet, i);
+        /* Put missed keys into the pointer array returned to the caller. */
+        missed_keys[n_missed++] = &keys[i];
+    }
+
+    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
+}
+
+/* Try to process all ('cnt') the 'packets' using only the datapath flow cache
  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
  * miniflow is copied into 'keys' and the packet pointer is moved at the
- * beginning of the 'packets' array.
+ * beginning of the 'packets' array. Pointers to the missed keys are stored
+ * in the 'missed_keys' array for further processing.
  *
  * The function returns the number of packets that needs to be processed in the
  * 'packets' array (they have been moved to the beginning of the vector).
@@ -5147,20 +5361,24 @@  dp_netdev_queue_batches(struct dp_packet *pkt,
  * will be ignored.
  */
 static inline size_t
-emc_processing(struct dp_netdev_pmd_thread *pmd,
+dfc_processing(struct dp_netdev_pmd_thread *pmd,
                struct dp_packet_batch *packets_,
                struct netdev_flow_key *keys,
+               struct netdev_flow_key **missed_keys,
                struct packet_batch_per_flow batches[], size_t *n_batches,
                bool md_is_valid, odp_port_t port_no)
 {
-    struct emc_cache *flow_cache = &pmd->flow_cache;
     struct netdev_flow_key *key = &keys[0];
-    size_t n_missed = 0, n_dropped = 0;
+    size_t n_missed = 0, n_emc_hit = 0;
+    struct dfc_cache *cache = &pmd->flow_cache;
     struct dp_packet *packet;
     const size_t cnt = dp_packet_batch_size(packets_);
     uint32_t cur_min;
     int i;
+    bool smc_enable_db;
 
+    atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
     atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
     pmd_perf_update_counter(&pmd->perf_stats,
                             md_is_valid ? PMD_STAT_RECIRC : PMD_STAT_RECV,
@@ -5171,7 +5389,6 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
 
         if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
             dp_packet_delete(packet);
-            n_dropped++;
             continue;
         }
 
@@ -5187,34 +5404,47 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
         }
         miniflow_extract(packet, &key->mf);
         key->len = 0; /* Not computed yet. */
-        /* If EMC is disabled skip hash computation and emc_lookup */
-        if (cur_min) {
+        /* If both EMC and SMC are disabled, skip hash computation. */
+        if (smc_enable_db || cur_min) {
             if (!md_is_valid) {
                 key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
                         &key->mf);
             } else {
                 key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
             }
-            flow = emc_lookup(flow_cache, key);
+        }
+        if (cur_min) {
+            flow = emc_lookup(&cache->emc_cache, key);
         } else {
             flow = NULL;
         }
         if (OVS_LIKELY(flow)) {
             dp_netdev_queue_batches(packet, flow, &key->mf, batches,
                                     n_batches);
+            n_emc_hit++;
         } else {
             /* Exact match cache missed. Group missed packets together at
              * the beginning of the 'packets' array. */
             dp_packet_batch_refill(packets_, packet, i);
             /* 'key[n_missed]' contains the key of the current packet and it
-             * must be returned to the caller. The next key should be extracted
-             * to 'keys[n_missed + 1]'. */
+             * will be passed to SMC lookup. The next key should be extracted
+             * to 'keys[n_missed + 1]'.
+             * We also maintain a pointer array to the keys that missed both
+             * EMC and SMC, returned to the caller for further processing. */
+            missed_keys[n_missed] = key;
             key = &keys[++n_missed];
         }
     }
 
-    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
-                            cnt - n_dropped - n_missed);
+    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, n_emc_hit);
+
+    if (!smc_enable_db) {
+        return dp_packet_batch_size(packets_);
+    }
+
+    /* Packets that miss the EMC do a batched SMC lookup, if enabled. */
+    smc_lookup_batch(pmd, keys, missed_keys, packets_, batches,
+                            n_batches, n_missed);
 
     return dp_packet_batch_size(packets_);
 }
@@ -5282,6 +5512,8 @@  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
                                              add_actions->size);
         }
         ovs_mutex_unlock(&pmd->flow_mutex);
+        uint32_t hash = dp_netdev_flow_hash(&netdev_flow->ufid);
+        smc_insert(pmd, key, hash);
         emc_probabilistic_insert(pmd, key, netdev_flow);
     }
     if (pmd_perf_metrics_enabled(pmd)) {
@@ -5298,7 +5530,7 @@  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
 static inline void
 fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                      struct dp_packet_batch *packets_,
-                     struct netdev_flow_key *keys,
+                     struct netdev_flow_key **keys,
                      struct packet_batch_per_flow batches[],
                      size_t *n_batches,
                      odp_port_t in_port)
@@ -5320,12 +5552,13 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
     for (size_t i = 0; i < cnt; i++) {
         /* Key length is needed in all the cases, hash computed on demand. */
-        keys[i].len = netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
+        keys[i]->len = netdev_flow_key_size(miniflow_n_values(&keys[i]->mf));
     }
     /* Get the classifier for the in_port */
     cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
     if (OVS_LIKELY(cls)) {
-        any_miss = !dpcls_lookup(cls, keys, rules, cnt, &lookup_cnt);
+        any_miss = !dpcls_lookup(cls, (const struct netdev_flow_key **)keys,
+                                rules, cnt, &lookup_cnt);
     } else {
         any_miss = true;
         memset(rules, 0, sizeof(rules));
@@ -5347,7 +5580,7 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
             /* It's possible that an earlier slow path execution installed
              * a rule covering this flow.  In this case, it's a lot cheaper
              * to catch it here than execute a miss. */
-            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i],
+            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, keys[i],
                                                     &add_lookup_cnt);
             if (netdev_flow) {
                 lookup_cnt += add_lookup_cnt;
@@ -5355,7 +5588,7 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                 continue;
             }
 
-            int error = handle_packet_upcall(pmd, packet, &keys[i],
+            int error = handle_packet_upcall(pmd, packet, keys[i],
                                              &actions, &put_actions);
 
             if (OVS_UNLIKELY(error)) {
@@ -5385,9 +5618,11 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
         }
 
         flow = dp_netdev_flow_cast(rules[i]);
+        uint32_t hash = dp_netdev_flow_hash(&flow->ufid);
+        smc_insert(pmd, keys[i], hash);
+        emc_probabilistic_insert(pmd, keys[i], flow);
 
-        emc_probabilistic_insert(pmd, &keys[i], flow);
-        dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
+        dp_netdev_queue_batches(packet, flow, &keys[i]->mf, batches, n_batches);
     }
 
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
@@ -5417,17 +5652,18 @@  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
 #endif
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
         struct netdev_flow_key keys[PKT_ARRAY_SIZE];
+    struct netdev_flow_key *missed_keys[PKT_ARRAY_SIZE];
     struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
     size_t n_batches;
     odp_port_t in_port;
 
     n_batches = 0;
-    emc_processing(pmd, packets, keys, batches, &n_batches,
+    dfc_processing(pmd, packets, keys, missed_keys, batches, &n_batches,
                             md_is_valid, port_no);
     if (!dp_packet_batch_is_empty(packets)) {
         /* Get ingress port from first packet's metadata. */
         in_port = packets->packets[0]->md.in_port.odp_port;
-        fast_path_processing(pmd, packets, keys,
+        fast_path_processing(pmd, packets, missed_keys,
                              batches, &n_batches, in_port);
     }
 
@@ -6377,7 +6613,7 @@  dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
 
 /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
  * in 'mask' the values in 'key' and 'target' are the same. */
-static inline bool
+static bool
 dpcls_rule_matches_key(const struct dpcls_rule *rule,
                        const struct netdev_flow_key *target)
 {
@@ -6404,7 +6640,7 @@  dpcls_rule_matches_key(const struct dpcls_rule *rule,
  *
  * Returns true if all miniflows found a corresponding rule. */
 static bool
-dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
+dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
              struct dpcls_rule **rules, const size_t cnt,
              int *num_lookups_p)
 {
@@ -6443,7 +6679,7 @@  dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
          * masked with the subtable's mask to avoid hashing the wildcarded
          * bits. */
         ULLONG_FOR_EACH_1(i, keys_map) {
-            hashes[i] = netdev_flow_key_hash_in_mask(&keys[i],
+            hashes[i] = netdev_flow_key_hash_in_mask(keys[i],
                                                      &subtable->mask);
         }
         /* Lookup. */
@@ -6457,7 +6693,7 @@  dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
             struct dpcls_rule *rule;
 
             CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
-                if (OVS_LIKELY(dpcls_rule_matches_key(rule, &keys[i]))) {
+                if (OVS_LIKELY(dpcls_rule_matches_key(rule, keys[i]))) {
                     rules[i] = rule;
                     /* Even at 20 Mpps the 32-bit hit_cnt cannot wrap
                      * within one second optimization interval. */
diff --git a/tests/pmd.at b/tests/pmd.at
index f3fac63..60452f5 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -185,6 +185,7 @@  CHECK_PMD_THREADS_CREATED()
 AT_CHECK([ovs-appctl vlog/set dpif_netdev:dbg])
 AT_CHECK([ovs-ofctl add-flow br0 action=normal])
 AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=1])
+AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:smc-enable=true])
 
 sleep 1
 
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 7609485..b4622f5 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -388,6 +388,19 @@ 
         </p>
       </column>
 
+      <column name="other_config" key="smc-enable"
+              type='{"type": "boolean"}'>
+        <p>
+          The signature match cache (SMC) is a cache level between the EMC
+          and the megaflow cache. It does not store the full flow key, so it
+          is more memory-efficient than the EMC. The SMC is especially
+          useful when the flow count exceeds the EMC capacity.
+        </p>
+        <p>
+          Defaults to false but can be changed at any time.
+        </p>
+      </column>
+
       <column name="other_config" key="n-handler-threads"
               type='{"type": "integer", "minInteger": 1}'>
         <p>
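For reference, the new knob is toggled like the existing other_config options; the commands below follow the form used in the pmd.at test above and would take effect at runtime without restarting the daemon:

```shell
# Enable the signature match cache (defaults to false).
ovs-vsctl set Open_vSwitch . other_config:smc-enable=true

# Disable it again at any time.
ovs-vsctl set Open_vSwitch . other_config:smc-enable=false
```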