[ovs-dev] dpif-netdev: Refactor datapath flow cache

Message ID CFF8EF42F1132E4CBE2BF0AB6C21C58D8CEC59FC@ESESSMB107.ericsson.se
State Superseded
Delegated to: Ian Stokes
Series: [ovs-dev] dpif-netdev: Refactor datapath flow cache

Commit Message

Jan Scheurich Nov. 20, 2017, 5:33 p.m. UTC
So far the netdev datapath uses an 8K EMC to speed up the lookup of
frequently used flows by comparing the parsed packet headers against
the miniflow of a cached flow, using 13 bits of the packet RSS hash
as index. The EMC is too small for many applications with 100K or more
parallel packet flows, so EMC thrashing actually degrades performance.
Furthermore, the size of struct miniflow and the flow copying cost
prevent us from making it much larger.
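
As a minimal sketch (using the macro names that this patch removes), the
EMC slot for a packet is selected like this; the existing code probes a
second slot with the next 13 hash bits:

    /* Current EMC indexing: 13 bits of the 32-bit RSS hash select one
     * of 8K slots. */
    #define EM_FLOW_HASH_SHIFT 13
    #define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT)   /* 8192 */
    #define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)

    uint32_t slot = rss_hash & EM_FLOW_HASH_MASK;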

At the same time, the lookup cost of the megaflow classifier (DPCLS)
increases as the number of frequently hit subtables grows with the
complexity of the pipeline and the number of recirculations.

To close the performance gap for many parallel flows, this patch
introduces the datapath flow cache (DFC) with 1M entries as a lookup
stage between EMC and DPCLS. It directly maps 20 bits of the RSS hash
to a pointer to the last-hit megaflow entry and performs a masked
comparison of the packet flow with the megaflow key to confirm the
hit. This avoids the costly DPCLS lookup even for very large numbers
of parallel flows, at a small memory overhead.
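
Each DFC entry holds a single pointer, so the 2^20 entries cost about
8 MB per PMD thread on a 64-bit system. A condensed sketch of the lookup
described above (simplified from dfc_lookup() in the patch; the EMC
stage is omitted):

    struct dfc_entry *entry = &cache->entries[key->hash & DFC_MASK];
    struct dp_netdev_flow *flow = entry->flow;

    if (flow && !flow->dead && dpcls_rule_matches_key(&flow->cr, key)) {
        return flow;    /* DFC hit, confirmed by the masked comparison. */
    }
    return NULL;        /* DFC miss, fall back to DPCLS. */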

Due to the large size of the DFC and the low risk of DFC thrashing, any
DPCLS hit immediately inserts an entry into the DFC so that subsequent
packets are sped up. The DFC thus also accelerates short-lived flows.

To further accelerate the lookup of a few elephant flows, every DFC hit
triggers a probabilistic EMC insertion of the flow. As the DFC entry is
already in place, the default EMC insertion probability can be reduced to
1/1000 to minimize EMC thrashing should there still be many fat flows.
The inverse EMC insertion probability remains configurable.
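
A sketch of the insertion decision (the threshold shown is the value
hard-coded in emc_probabilistic_insert() in this patch):

    /* On a DFC hit, insert the flow into the EMC with probability 1/1000. */
    const uint32_t threshold = UINT32_MAX / 1000;

    if (random_uint32() <= threshold) {
        emc_change_entry(emc_entry_get(emc, key->hash), key, flow);
    }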

The EMC implementation is simplified by removing the possibility to
store a flow in two slots, as there is no particular reason why two
flows should systematically collide (the RSS hash is not symmetric).
The maximum size of the EMC flow key is limited to 256 bytes to reduce
the memory footprint. This should be sufficient to hold most real-life
packet flow keys. Larger flows are not installed in the EMC.
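
The limit is enforced at insertion time, as in emc_change_entry() below;
248 bytes remain for the key beside the 8-byte flow pointer in a
256-byte entry:

    /* We only store small enough flows in the EMC. */
    size_t key_size = offsetof(struct netdev_flow_key, mf) + key->len;
    if (key_size > sizeof(ce->key)) {      /* sizeof(ce->key) == 248 */
        return;                            /* Too large, skip EMC insertion. */
    }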

The pmd-stats-show command is enhanced to show both EMC and DFC hits
separately.
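
With the format string changed in this patch, the per-PMD output should
look roughly as follows (figures invented for illustration):

    pmd thread numa_id 0 core_id 1:
            emc hits:52048883
            dfc hits:9378615
            megaflow hits:87430
            avg. subtable lookups per hit:1.00
            miss:12465
            lost:0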

The sweep speed for cleaning up obsolete EMC and DFC flow entries and
freeing dead megaflow entries is increased. With a typical PMD cycle
duration of 100us under load and checking one DFC entry per cycle, the
DFC sweep should normally complete in about 100s.
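The estimate follows directly from these numbers: sweeping one of the
2^20 DFC entries per PMD iteration of ~100 us gives 2^20 * 100 us, i.e.
about 105 s for a full pass.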

In PVP performance tests with an L3 pipeline over VXLAN we determined the
optimal EMC size to be 16K entries to obtain a uniform speedup compared
to the master branch over the full range of parallel flows. The measurement
below is for 64-byte packets; the average number of subtable lookups
per DPCLS hit in this pipeline is 1.0, i.e. the acceleration already starts
with a single busy mask. Tests with many visited subtables should show a
much larger gain through the DFC.

Flows   master  DFC+EMC  Gain
        [Mpps]  [Mpps]
------------------------------
8       4.45    4.62     3.8%
100     4.17    4.47     7.2%
1000    3.88    4.34    12.0%
2000    3.54    4.17    17.8%
5000    3.01    3.82    27.0%
10000   2.75    3.63    31.9%
20000   2.64    3.50    32.8%
50000   2.60    3.33    28.1%
100000  2.59    3.23    24.7%
500000  2.59    3.16    21.9%


Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
---
 lib/dpif-netdev.c | 349 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 235 insertions(+), 114 deletions(-)

--
1.9.1

Comments

Billy O'Mahony Nov. 21, 2017, 11:55 a.m. UTC | #1
Hi Jan,

Thanks, that's a really interesting patch.

The patch currently does not apply to the head of master - what rev can I apply it to?

Some more below - including one way down in the code.

Thanks,
/Billy.

> -----Original Message-----
> From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-
> bounces@openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Monday, November 20, 2017 5:33 PM
> To: dev@openvswitch.org
> Subject: [ovs-dev] [PATCH] dpif-netdev: Refactor datapath flow cache
> 
> So far the netdev datapath uses an 8K EMC to speed up the lookup of frequently
> used flows by comparing the parsed packet headers against the miniflow of a
> cached flow, using 13 bits of the packet RSS hash as index. The EMC is too small
> for many applications with 100K or more parallel packet flows, so EMC
> thrashing actually degrades performance.
> Furthermore, the size of struct miniflow and the flow copying cost prevent us
> from making it much larger.
> 
> At the same time the lookup cost of the megaflow classifier (DPCLS) is increasing
> as the number of frequently hit subtables grows with the complexity of the pipeline
> and the number of recirculations.
> 
> To close the performance gap for many parallel flows, this patch introduces the
> datapath flow cache (DFC) with 1M entries as a lookup stage between EMC and
> DPCLS. It directly maps 20 bits of the RSS hash to a pointer to the last hit
> megaflow entry and performs a masked comparison of the packet flow with the
> megaflow key to confirm the hit. This avoids the costly DPCLS lookup even for
> very large numbers of parallel flows with a small memory overhead.
> 
> Due to the large size of the DFC and the low risk of DFC thrashing, any DPCLS hit
> immediately inserts an entry into the DFC so that subsequent packets are sped
> up. The DFC thus also accelerates short-lived flows.
> 
> To further accelerate the lookup of a few elephant flows, every DFC hit triggers a
> probabilistic EMC insertion of the flow. As the DFC entry is already in place the
> default EMC insertion probability can be reduced to
> 1/1000 to minimize EMC thrashing should there still be many fat flows.
> The inverse EMC insertion probability remains configurable.
> 
> The EMC implementation is simplified by removing the possibility to store a flow
> in two slots, as there is no particular reason why two flows should systematically
> collide (the RSS hash is not symmetric).
> The maximum size of the EMC flow key is limited to 256 bytes to reduce the
> memory footprint. This should be sufficient to hold most real life packet flow
> keys. Larger flows are not installed in the EMC.
[[BO'M]] Does miniflow_extract work ok with the reduced miniflow size?
Miniflow comment says: 
  Caller is responsible for initializing 'dst' with enough storage for FLOW_U64S * 8 bytes.

> 
> The pmd-stats-show command is enhanced to show both EMC and DFC hits
> separately.
> 
> The sweep speed for cleaning up obsolete EMC and DFC flow entries and freeing
> dead megaflow entries is increased. With a typical PMD cycle duration of 100us
> under load and checking one DFC entry per cycle, the DFC sweep should
> normally complete in about 100s.
> 
> In PVP performance tests with an L3 pipeline over VXLAN we determined the
> optimal EMC size to be 16K entries to obtain a uniform speedup compared to
> the master branch over the full range of parallel flows. The measurement below
> is for 64-byte packets; the average number of subtable lookups per DPCLS hit
> in this pipeline is 1.0, i.e. the acceleration already starts with a single busy mask.
> Tests with many visited subtables should show a strong increase of the gain
> through DFC.
> 
> Flows   master  DFC+EMC  Gain
>         [Mpps]  [Mpps]
> ------------------------------
> 8       4.45    4.62     3.8%
> 100     4.17    4.47     7.2%
> 1000    3.88    4.34    12.0%
> 2000    3.54    4.17    17.8%
> 5000    3.01    3.82    27.0%
> 10000   2.75    3.63    31.9%
> 20000   2.64    3.50    32.8%
> 50000   2.60    3.33    28.1%
> 100000  2.59    3.23    24.7%
> 500000  2.59    3.16    21.9%
[[BO'M]]
What is the flow distribution here? 

Are there other flow distributions that we want to ensure do not suffer a possible regression? I'm not sure what they are exactly - I have some ideas, but admittedly I've only ever tested with either a uniform flow distribution or else a round-robin distribution.

> 
> 
> Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
> ---
>  lib/dpif-netdev.c | 349 ++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 235 insertions(+), 114 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index db78318..efcf2e9 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -127,19 +127,19 @@ struct netdev_flow_key {
>      uint64_t buf[FLOW_MAX_PACKET_U64S];
>  };
> 
> -/* Exact match cache for frequently used flows
> +/* Datapath flow cache (DFC) for frequently used flows
>   *
> - * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to
> - * search its entries for a miniflow that matches exactly the miniflow of the
> - * packet. It stores the 'dpcls_rule' (rule) that matches the miniflow.
> + * The cache uses the 32-bit hash of the packet (which can be the RSS hash)
> + * to directly look up a pointer to the matching megaflow. To check for a
> + * match the packet's flow key is compared against the key and mask of the
> + * megaflow.
>   *
> - * A cache entry holds a reference to its 'dp_netdev_flow'.
> - *
> - * A miniflow with a given hash can be in one of EM_FLOW_HASH_SEGS different
> - * entries. The 32-bit hash is split into EM_FLOW_HASH_SEGS values (each of
> - * them is EM_FLOW_HASH_SHIFT bits wide and the remainder is thrown away). Each
> - * value is the index of a cache entry where the miniflow could be.
> + * For even faster lookup, the most frequently used packet flows are also
> + * inserted into a small exact match cache (EMC). The EMC uses a part of the
> + * packet hash to look up a miniflow that matches exactly the miniflow of the
> + * packet. The matching EMC entry also returns a reference to the megaflow.
>   *
> + * Flows are promoted from the DFC to the EMC through probabilistic insertion
> + * after successful DFC lookup with minor probability to favor elephant flows.
>   *
>   * Thread-safety
>   * =============
> @@ -148,33 +148,38 @@ struct netdev_flow_key {
>   * If dp_netdev_input is not called from a pmd thread, a mutex is used.
>   */
> 
> -#define EM_FLOW_HASH_SHIFT 13
> -#define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT)
> -#define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)
> -#define EM_FLOW_HASH_SEGS 2
> +#define DFC_MASK_LEN 20
> +#define DFC_ENTRIES (1u << DFC_MASK_LEN)
> +#define DFC_MASK (DFC_ENTRIES - 1)
> +#define EMC_MASK_LEN 14
> +#define EMC_ENTRIES (1u << EMC_MASK_LEN)
> +#define EMC_MASK (EMC_ENTRIES - 1)
> 
>  /* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB */
> -#define DEFAULT_EM_FLOW_INSERT_INV_PROB 100
> +#define DEFAULT_EM_FLOW_INSERT_INV_PROB 1000
>  #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
>                                      DEFAULT_EM_FLOW_INSERT_INV_PROB)
> 
>  struct emc_entry {
>      struct dp_netdev_flow *flow;
> -    struct netdev_flow_key key;   /* key.hash used for emc hash value. */
> +    char key[248];   /* Holds struct netdev_flow_key of limited size. */
>  };
> 
>  struct emc_cache {
> -    struct emc_entry entries[EM_FLOW_HASH_ENTRIES];
> -    int sweep_idx;                /* For emc_cache_slow_sweep(). */
> +    struct emc_entry entries[EMC_ENTRIES];
> +    int sweep_idx;
> +};
> +
> +struct dfc_entry {
> +    struct dp_netdev_flow *flow;
> +};
> +
> +struct dfc_cache {
> +    struct emc_cache emc_cache;
> +    struct dfc_entry entries[DFC_ENTRIES];
> +    int sweep_idx;
>  };
> 
> -/* Iterate in the exact match cache through every entry that might contain a
> - * miniflow with hash 'HASH'. */
> -#define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH)                \
> -    for (uint32_t i__ = 0, srch_hash__ = (HASH);                             \
> -         (CURRENT_ENTRY) = &(EMC)->entries[srch_hash__ & EM_FLOW_HASH_MASK], \
> -         i__ < EM_FLOW_HASH_SEGS;                                            \
> -         i__++, srch_hash__ >>= EM_FLOW_HASH_SHIFT)
> 
>  /* Simple non-wildcarding single-priority classifier. */
> 
> @@ -214,6 +219,8 @@ static bool dpcls_lookup(struct dpcls *cls,
>                           const struct netdev_flow_key keys[],
>                           struct dpcls_rule **rules, size_t cnt,
>                           int *num_lookups_p);
> +static bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
> +                                   const struct netdev_flow_key *target);
> 
>  /* Set of supported meter flags */
>  #define DP_SUPPORTED_METER_FLAGS_MASK \
> @@ -332,6 +339,7 @@ static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,
> 
>  enum dp_stat_type {
>      DP_STAT_EXACT_HIT,          /* Packets that had an exact match (emc). */
> +    DP_STAT_DFC_HIT,            /* Packets that had a flow cache hit (dfc). */
>      DP_STAT_MASKED_HIT,         /* Packets that matched in the flow table. */
>      DP_STAT_MISS,               /* Packets that did not match. */
>      DP_STAT_LOST,               /* Packets not passed up to the client. */
> @@ -565,7 +573,7 @@ struct dp_netdev_pmd_thread {
>       * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>       * need to be protected by 'non_pmd_mutex'.  Every other instance
>       * will only be accessed by its own pmd thread. */
> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
> +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache;
>      struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
> 
>      /* Queue id used by this pmd thread to send packets on all netdevs if
> @@ -744,46 +752,59 @@ dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
>  static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
>                                        struct tx_port *tx, long long now);
> 
> -static inline bool emc_entry_alive(struct emc_entry *ce);
> +static inline bool dfc_entry_alive(struct dfc_entry *ce);
>  static void emc_clear_entry(struct emc_entry *ce);
> +static void dfc_clear_entry(struct dfc_entry *ce);
> 
>  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
> 
>  static void
> -emc_cache_init(struct emc_cache *flow_cache)
> +emc_cache_init(struct emc_cache *emc)
>  {
>      int i;
> 
> -    flow_cache->sweep_idx = 0;
> +    for (i = 0; i < ARRAY_SIZE(emc->entries); i++) {
> +        emc->entries[i].flow = NULL;
> +        struct netdev_flow_key *key =
> +                (struct netdev_flow_key *) emc->entries[i].key;
> +        key->hash = 0;
> +        key->len = sizeof(struct miniflow);
> +        flowmap_init(&key->mf.map);
> +    }
> +    emc->sweep_idx = 0;
> +}
> +
> +static void
> +dfc_cache_init(struct dfc_cache *flow_cache)
> +{
> +    int i;
> +
> +    emc_cache_init(&flow_cache->emc_cache);
>      for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
>          flow_cache->entries[i].flow = NULL;
> -        flow_cache->entries[i].key.hash = 0;
> -        flow_cache->entries[i].key.len = sizeof(struct miniflow);
> -        flowmap_init(&flow_cache->entries[i].key.mf.map);
>      }
> +    flow_cache->sweep_idx = 0;
>  }
> 
>  static void
> -emc_cache_uninit(struct emc_cache *flow_cache)
> +emc_cache_uninit(struct emc_cache *emc)
>  {
>      int i;
> 
> -    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
> -        emc_clear_entry(&flow_cache->entries[i]);
> +    for (i = 0; i < ARRAY_SIZE(emc->entries); i++) {
> +        emc_clear_entry(&emc->entries[i]);
>      }
>  }
> 
> -/* Check and clear dead flow references slowly (one entry at each
> - * invocation).  */
>  static void
> -emc_cache_slow_sweep(struct emc_cache *flow_cache)
> +dfc_cache_uninit(struct dfc_cache *flow_cache)
>  {
> -    struct emc_entry *entry = &flow_cache->entries[flow_cache->sweep_idx];
> +    int i;
> 
> -    if (!emc_entry_alive(entry)) {
> -        emc_clear_entry(entry);
> +    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
> +        dfc_clear_entry(&flow_cache->entries[i]);
>      }
> -    flow_cache->sweep_idx = (flow_cache->sweep_idx + 1) & EM_FLOW_HASH_MASK;
> +    emc_cache_uninit(&flow_cache->emc_cache);
>  }
> 
>  /* Returns true if 'dpif' is a netdev or dummy dpif, false otherwise. */
> @@ -837,8 +858,8 @@ pmd_info_show_stats(struct ds *reply,
>      }
> 
>      /* Sum of all the matched and not matched packets gives the total.  */
> -    total_packets = stats[DP_STAT_EXACT_HIT] + stats[DP_STAT_MASKED_HIT]
> -                    + stats[DP_STAT_MISS];
> +    total_packets = stats[DP_STAT_EXACT_HIT] + stats[DP_STAT_DFC_HIT]
> +                    + stats[DP_STAT_MASKED_HIT] + stats[DP_STAT_MISS];
> 
>      for (i = 0; i < PMD_N_CYCLES; i++) {
>          if (cycles[i] > pmd->cycles_zero[i]) {
> @@ -862,10 +883,13 @@ pmd_info_show_stats(struct ds *reply,
>      ds_put_cstr(reply, ":\n");
> 
>      ds_put_format(reply,
> -                  "\temc hits:%llu\n\tmegaflow hits:%llu\n"
> +                  "\temc hits:%llu\n\tdfc hits:%llu\n"
> +                  "\tmegaflow hits:%llu\n"
>                    "\tavg. subtable lookups per hit:%.2f\n"
>                    "\tmiss:%llu\n\tlost:%llu\n",
> -                  stats[DP_STAT_EXACT_HIT], stats[DP_STAT_MASKED_HIT],
> +                  stats[DP_STAT_EXACT_HIT],
> +                  stats[DP_STAT_DFC_HIT],
> +                  stats[DP_STAT_MASKED_HIT],
>                    stats[DP_STAT_MASKED_HIT] > 0
>                    ? (1.0*stats[DP_STAT_LOOKUP_HIT])/stats[DP_STAT_MASKED_HIT]
>                    : 0,
> @@ -1492,6 +1516,8 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
>          stats->n_hit += n;
>          atomic_read_relaxed(&pmd->stats.n[DP_STAT_EXACT_HIT], &n);
>          stats->n_hit += n;
> +        atomic_read_relaxed(&pmd->stats.n[DP_STAT_DFC_HIT], &n);
> +        stats->n_hit += n;
>          atomic_read_relaxed(&pmd->stats.n[DP_STAT_MISS], &n);
>          stats->n_missed += n;
>          atomic_read_relaxed(&pmd->stats.n[DP_STAT_LOST], &n);
> @@ -2111,6 +2137,16 @@ netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key,
>      return hash_finish(hash, (p - miniflow_get_values(&mask->mf)) * 8);
>  }
> 
> +/*
> + * Datapath Flow Cache and EMC implementation
> + */
> +
> +static inline struct emc_entry *
> +emc_entry_get(struct emc_cache *emc, const uint32_t hash)
> +{
> +    return &emc->entries[hash & EMC_MASK];
> +}
> +
>  static inline bool
>  emc_entry_alive(struct emc_entry *ce)
>  {
> @@ -2127,9 +2163,15 @@ emc_clear_entry(struct emc_entry *ce)
>  }
> 
>  static inline void
> -emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
> -                 const struct netdev_flow_key *key)
> +emc_change_entry(struct emc_entry *ce, const struct netdev_flow_key *key,
> +                 struct dp_netdev_flow *flow)
>  {
> +    /* We only store small enough flows in the EMC. */
> +    size_t key_size = offsetof(struct netdev_flow_key, mf) + key->len;
> +    if (key_size > sizeof(ce->key)) {
> +        return;
> +    }
> +
>      if (ce->flow != flow) {
>          if (ce->flow) {
>              dp_netdev_flow_unref(ce->flow);
> @@ -2141,73 +2183,148 @@ emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
>              ce->flow = NULL;
>          }
>      }
> -    if (key) {
> -        netdev_flow_key_clone(&ce->key, key);
> -    }
> +    netdev_flow_key_clone((struct netdev_flow_key *) ce->key, key);
>  }
> 
>  static inline void
> -emc_insert(struct emc_cache *cache, const struct netdev_flow_key *key,
> -           struct dp_netdev_flow *flow)
> +emc_probabilistic_insert(struct emc_cache *emc,
> +                         const struct netdev_flow_key *key,
> +                         struct dp_netdev_flow *flow)
>  {
> -    struct emc_entry *to_be_replaced = NULL;
> -    struct emc_entry *current_entry;
> +    const uint32_t threshold = UINT32_MAX/1000;
> 
> -    EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
> -        if (netdev_flow_key_equal(&current_entry->key, key)) {
> -            /* We found the entry with the 'mf' miniflow */
> -            emc_change_entry(current_entry, flow, NULL);
> -            return;
> +    if (random_uint32() <= threshold) {
> +
> +        struct emc_entry *current_entry = emc_entry_get(emc, key->hash);
> +        emc_change_entry(current_entry, key, flow);
> +    }
> +}
> +
> +static inline struct dp_netdev_flow *
> +emc_lookup(struct emc_cache *emc, const struct netdev_flow_key *key)
> +{
> +    struct emc_entry *current_entry = emc_entry_get(emc, key->hash);
> +    struct netdev_flow_key *current_key =
> +            (struct netdev_flow_key *) current_entry->key;
> +
> +    if (current_key->hash == key->hash
> +        && emc_entry_alive(current_entry)
> +        && netdev_flow_key_equal_mf(current_key, &key->mf)) {
> +
> +        /* We found the entry with the 'key->mf' miniflow */
> +        return current_entry->flow;
> +    }
> +    return NULL;
> +}
> +
> +static inline struct dfc_entry *
> +dfc_entry_get(struct dfc_cache *cache, const uint32_t hash)
> +{
> +    return &cache->entries[hash & DFC_MASK];
> +}
> +
> +static inline bool
> +dfc_entry_alive(struct dfc_entry *ce)
> +{
> +    return ce->flow && !ce->flow->dead;
> +}
> +
> +static void
> +dfc_clear_entry(struct dfc_entry *ce)
> +{
> +    if (ce->flow) {
> +        dp_netdev_flow_unref(ce->flow);
> +        ce->flow = NULL;
> +    }
> +}
> +
> +static inline void
> +dfc_change_entry(struct dfc_entry *ce, struct dp_netdev_flow *flow)
> +{
> +    if (ce->flow != flow) {
> +        if (ce->flow) {
> +            dp_netdev_flow_unref(ce->flow);
>          }
> 
> -        /* Replacement policy: put the flow in an empty (not alive) entry, or
> -         * in the first entry where it can be */
> -        if (!to_be_replaced
> -            || (emc_entry_alive(to_be_replaced)
> -                && !emc_entry_alive(current_entry))
> -            || current_entry->key.hash < to_be_replaced->key.hash) {
> -            to_be_replaced = current_entry;
> +        if (dp_netdev_flow_ref(flow)) {
> +            ce->flow = flow;
> +        } else {
> +            ce->flow = NULL;
>          }
>      }
> -    /* We didn't find the miniflow in the cache.
> -     * The 'to_be_replaced' entry is where the new flow will be stored */
> -
> -    emc_change_entry(to_be_replaced, flow, key);
>  }
> 
>  static inline void
> -emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd,
> -                         const struct netdev_flow_key *key,
> -                         struct dp_netdev_flow *flow)
> +dfc_insert(struct dp_netdev_pmd_thread *pmd,
> +           const struct netdev_flow_key *key,
> +           struct dp_netdev_flow *flow)
>  {
> -    /* Insert an entry into the EMC based on probability value 'min'. By
> -     * default the value is UINT32_MAX / 100 which yields an insertion
> -     * probability of 1/100 ie. 1% */
> +    struct dfc_cache *cache = &pmd->flow_cache;
> +    struct dfc_entry *current_entry;
> +
> +    current_entry = dfc_entry_get(cache, key->hash);
> +    dfc_change_entry(current_entry, flow);
> +}
> +
> +static inline struct dp_netdev_flow *
> +dfc_lookup(struct dfc_cache *cache, const struct netdev_flow_key *key,
> +           bool *exact_match)
> +{
> +    struct dp_netdev_flow *flow;
> 
> -    uint32_t min;
> -    atomic_read_relaxed(&pmd->dp->emc_insert_min, &min);
> +    /* Try an EMC lookup first. */
> +    flow = emc_lookup(&cache->emc_cache, key);
> +    if (flow) {
> +        *exact_match = true;
> +        return flow;
> +    }
> +
> +    /* EMC lookup not successful: try DFC lookup. */
> +    struct dfc_entry *current_entry = dfc_entry_get(cache, key->hash);
> +    flow = current_entry->flow;

 [[BO'M]] There is no equivalent of netdev_flow_key_equal_mf() used in emc_lookup here - which does a memcmp to verify the miniflows are exactly equal. Without something similar for the DFC, isn't there a possibility that the 20-bit hashes of two packets will accidentally point to the same flow struct?

> 
> -    if (min && random_uint32() <= min) {
> -        emc_insert(&pmd->flow_cache, key, flow);
> +    if (dfc_entry_alive(current_entry) &&
> +        dpcls_rule_matches_key(&flow->cr, key)) {
> +
> +        /* Found a match in DFC. Insert into EMC for subsequent lookups.
> +         * We use probabilistic insertion here so that mainly elephant
> +         * flows enter EMC. */
> +        emc_probabilistic_insert(&cache->emc_cache, key, flow);
> +        *exact_match = false;
> +        return flow;
> +    } else {
> +
> +        /* No match. Need to go to DPCLS lookup. */
> +        return NULL;
>      }
>  }
> 
> -static inline struct dp_netdev_flow *
> -emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
> +/* Check and clear dead flow references slowly (one entry at each
> + * invocation).  */
> +static void
> +emc_slow_sweep(struct emc_cache *emc)
>  {
> -    struct emc_entry *current_entry;
> +    struct emc_entry *entry = &emc->entries[emc->sweep_idx];
> 
> -    EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
> -        if (current_entry->key.hash == key->hash
> -            && emc_entry_alive(current_entry)
> -            && netdev_flow_key_equal_mf(&current_entry->key, &key->mf)) {
> +    if (!emc_entry_alive(entry)) {
> +        emc_clear_entry(entry);
> +    }
> +    emc->sweep_idx = (emc->sweep_idx + 1) & EMC_MASK;
> +}
> 
> -            /* We found the entry with the 'key->mf' miniflow */
> -            return current_entry->flow;
> -        }
> +static void
> +dfc_slow_sweep(struct dfc_cache *cache)
> +{
> +    /* Sweep the EMC so that both finish in the same time. */
> +    if ((cache->sweep_idx & (DFC_ENTRIES/EMC_ENTRIES - 1)) == 0) {
> +        emc_slow_sweep(&cache->emc_cache);
>      }
> 
> -    return NULL;
> +    struct dfc_entry *entry = &cache->entries[cache->sweep_idx];
> +    if (!dfc_entry_alive(entry)) {
> +        dfc_clear_entry(entry);
> +    }
> +    cache->sweep_idx = (cache->sweep_idx + 1) & DFC_MASK;
>  }
> 
>  static struct dp_netdev_flow *
> @@ -4048,7 +4165,7 @@ pmd_thread_main(void *f_)
>      ovs_numa_thread_setaffinity_core(pmd->core_id);
>      dpdk_set_lcore_id(pmd->core_id);
>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
> -    emc_cache_init(&pmd->flow_cache);
> +    dfc_cache_init(&pmd->flow_cache);
>  reload:
>      pmd_alloc_static_tx_qid(pmd);
> 
> @@ -4078,17 +4195,16 @@ reload:
>                                                        : PMD_CYCLES_IDLE);
>          }
> 
> -        if (lc++ > 1024) {
> -            bool reload;
> +        dfc_slow_sweep(&pmd->flow_cache);
> 
> +        if (lc++ > 1024) {
>              lc = 0;
> 
>              coverage_try_clear();
>              dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt);
> -            if (!ovsrcu_try_quiesce()) {
> -                emc_cache_slow_sweep(&pmd->flow_cache);
> -            }
> +            ovsrcu_try_quiesce();
> 
> +            bool reload;
>              atomic_read_relaxed(&pmd->reload, &reload);
>              if (reload) {
>                  break;
> @@ -4110,7 +4226,7 @@ reload:
>          goto reload;
>      }
> 
> -    emc_cache_uninit(&pmd->flow_cache);
> +    dfc_cache_uninit(&pmd->flow_cache);
>      free(poll_list);
>      pmd_free_cached_ports(pmd);
>      return NULL;
> @@ -4544,7 +4660,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      /* init the 'flow_cache' since there is no
>       * actual thread created for NON_PMD_CORE_ID. */
>      if (core_id == NON_PMD_CORE_ID) {
> -        emc_cache_init(&pmd->flow_cache);
> +        dfc_cache_init(&pmd->flow_cache);
>          pmd_alloc_static_tx_qid(pmd);
>      }
>      cmap_insert(&dp->poll_threads, CONST_CAST(struct cmap_node *, &pmd->node),
> @@ -4586,7 +4702,7 @@ dp_netdev_del_pmd(struct dp_netdev *dp, struct dp_netdev_pmd_thread *pmd)
>       * but extra cleanup is necessary */
>      if (pmd->core_id == NON_PMD_CORE_ID) {
>          ovs_mutex_lock(&dp->non_pmd_mutex);
> -        emc_cache_uninit(&pmd->flow_cache);
> +        dfc_cache_uninit(&pmd->flow_cache);
>          pmd_free_cached_ports(pmd);
>          pmd_free_static_tx_qid(pmd);
>          ovs_mutex_unlock(&dp->non_pmd_mutex);
> @@ -4896,7 +5012,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>      packet_batch_per_flow_update(batch, pkt, mf);
>  }
> 
> -/* Try to process all ('cnt') the 'packets' using only the exact match cache
> +/* Try to process all ('cnt') the 'packets' using only the PMD flow cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>   * miniflow is copied into 'keys' and the packet pointer is moved at the
>   * beginning of the 'packets' array.
> @@ -4911,19 +5027,20 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>   * will be ignored.
>   */
>  static inline size_t
> -emc_processing(struct dp_netdev_pmd_thread *pmd,
> +dfc_processing(struct dp_netdev_pmd_thread *pmd,
>                 struct dp_packet_batch *packets_,
>                 struct netdev_flow_key *keys,
>                 struct packet_batch_per_flow batches[], size_t *n_batches,
>                 bool md_is_valid, odp_port_t port_no)
>  {
> -    struct emc_cache *flow_cache = &pmd->flow_cache;
> +    struct dfc_cache *flow_cache = &pmd->flow_cache;
>      struct netdev_flow_key *key = &keys[0];
> -    size_t n_missed = 0, n_dropped = 0;
> +    size_t n_missed = 0, n_dfc_hit = 0, n_emc_hit = 0;
>      struct dp_packet *packet;
>      const size_t cnt = dp_packet_batch_size(packets_);
>      uint32_t cur_min;
>      int i;
> +    bool exact_match;
> 
>      atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
> 
> @@ -4932,7 +5049,6 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> 
>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>              dp_packet_delete(packet);
> -            n_dropped++;
>              continue;
>          }
> 
> @@ -4948,7 +5064,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>          }
>          miniflow_extract(packet, &key->mf);
>          key->len = 0; /* Not computed yet. */
> -        /* If EMC is disabled skip hash computation and emc_lookup */
> +        /* If DFC is disabled skip hash computation and DFC lookup */
>          if (cur_min) {
>              if (!md_is_valid) {
>                  key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
> @@ -4956,11 +5072,16 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>              } else {
>                  key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
>              }
> -            flow = emc_lookup(flow_cache, key);
> +            flow = dfc_lookup(flow_cache, key, &exact_match);
>          } else {
>              flow = NULL;
>          }
>          if (OVS_LIKELY(flow)) {
> +            if (exact_match) {
> +                n_emc_hit++;
> +            } else {
> +                n_dfc_hit++;
> +            }
>              dp_netdev_queue_batches(packet, flow, &key->mf, batches,
>                                      n_batches);
>          } else {
> @@ -4974,8 +5095,8 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>          }
>      }
> 
> -    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT,
> -                           cnt - n_dropped - n_missed);
> +    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT, n_emc_hit);
> +    dp_netdev_count_packet(pmd, DP_STAT_DFC_HIT, n_dfc_hit);
> 
>      return dp_packet_batch_size(packets_);
>  }
> @@ -5044,7 +5165,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>                                               add_actions->size);
>          }
>          ovs_mutex_unlock(&pmd->flow_mutex);
> -        emc_probabilistic_insert(pmd, key, netdev_flow);
> +        dfc_insert(pmd, key, netdev_flow);
>      }
>  }
> 
> @@ -5136,7 +5257,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
> 
>          flow = dp_netdev_flow_cast(rules[i]);
> 
> -        emc_probabilistic_insert(pmd, &keys[i], flow);
> +        dfc_insert(pmd, &keys[i], flow);
>          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
>      }
> 
> @@ -5169,7 +5290,7 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>      odp_port_t in_port;
> 
>      n_batches = 0;
> -    emc_processing(pmd, packets, keys, batches, &n_batches,
> +    dfc_processing(pmd, packets, keys, batches, &n_batches,
>                              md_is_valid, port_no);
>      if (!dp_packet_batch_is_empty(packets)) {
>          /* Get ingress port from first packet's metadata. */
> @@ -6063,7 +6184,7 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
> 
>  /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
>   * in 'mask' the values in 'key' and 'target' are the same. */
> -static inline bool
> +static bool
>  dpcls_rule_matches_key(const struct dpcls_rule *rule,
>                         const struct netdev_flow_key *target)
>  {
> --
> 1.9.1
> 
> 
Jan Scheurich Nov. 21, 2017, 3:40 p.m. UTC | #2
Hi Billy,

Thanks for your feedback.

I continue to have problems with sending out patches via mail from our company network. This time it seems it is the "Form Feed" (^L) character in the dpif-netdev.c file that got corrupted in the emailed version of the patch. I wish there were a way to upload patches other than through email.

Please try the attached patch file instead. This one applies cleanly on the tip of master for me.

Some more answers below.

BR, Jan

> > The EMC implementation is simplified by removing the possibility to store a flow
> > in two slots, as there is no particular reason why two flows should systematically
> > collide (the RSS hash is not symmetric).
> > The maximum size of the EMC flow key is limited to 256 bytes to reduce the
> > memory footprint. This should be sufficient to hold most real life packet flow
> > keys. Larger flows are not installed in the EMC.
> [[BO'M]] Does miniflow_extract work ok with the reduced miniflow size?
> Miniflow comment says:
>   Caller is responsible for initializing 'dst' with enough storage for FLOW_U64S * 8 bytes.
> 

[Jan] miniflow_extract() is not used here. At EMC insertion I just copy the extracted miniflow into the EMC entry, provided that its length does not exceed the 248 available octets. If it is too large I simply don't insert the flow.
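
In other words, a sketch of the two stages (names as in the patch; 248
is the size of the emc_entry key array):

    /* Extraction always writes into a full-size key on the stack or in
     * the per-batch 'keys[]' array, so nothing is truncated here. */
    struct netdev_flow_key key;
    miniflow_extract(packet, &key.mf);

    /* Only the copy into the fixed-size EMC slot is length-checked: */
    if (offsetof(struct netdev_flow_key, mf) + key.len <= 248) {
        /* netdev_flow_key_clone() into the 248-byte ce->key ... */
    }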

> > The pmd-stats-show command is enhanced to show both EMC and DFC hits
> > separately.
> >
> > The sweep speed for cleaning up obsolete EMC and DFC flow entries and freeing
> > dead megaflow entries is increased. With a typical PMD cycle duration of 100us
> > under load and checking one DFC entry per cycle, the DFC sweep should
> > normally complete in about 100s.
> >
> > In PVP performance tests with an L3 pipeline over VXLAN we determined the
> > optimal EMC size to be 16K entries to obtain a uniform speedup compared to
> > the master branch over the full range of parallel flows. The measurement below
> > is for 64 byte packets and the average number of subtable lookups per DPCLS hit
> > in this pipeline is 1.0, i.e. the acceleration already starts for a single busy mask.
> > Tests with many visited subtables should show a strong increase of the gain
> > through DFC.
> >
> > Flows   master  DFC+EMC  Gain
> >         [Mpps]  [Mpps]
> > ------------------------------
> > 8       4.45    4.62     3.8%
> > 100     4.17    4.47     7.2%
> > 1000    3.88    4.34    12.0%
> > 2000    3.54    4.17    17.8%
> > 5000    3.01    3.82    27.0%
> > 10000   2.75    3.63    31.9%
> > 20000   2.64    3.50    32.8%
> > 50000   2.60    3.33    28.1%
> > 100000  2.59    3.23    24.7%
> > 500000  2.59    3.16    21.9%
> [[BO'M]]
> What is the flow distribution here?
> 
> Are there other flow distributions that we want to ensure do not suffer a possible regression. I'm not sure what they are exactly - I have
> some ideas, but admittedly I've only ever tested with either a uniform flow distribution or else a round-robin distribution.

[Jan] Since we are using DPDK Pktgen as traffic generator in our standard setup, this is a round-robin flow distribution, which, from all I understand, is the worst case for flow caching, as the packets of a flow are equidistant and there is no clustering.

I would be happy if people with other traffic generators can provide complementary measurements. 

> > +
> > +    /* EMC lookup not successful: try DFC lookup. */
> > +    struct dfc_entry *current_entry = dfc_entry_get(cache, key->hash);
> > +    flow = current_entry->flow;
> 
>  [[BO'M]] There is no equivalent of netdev_flow_key_equal_mf() used in emc_lookup here - which does a memcmp to verify the miniflows
> are exactly equal. Without something similar for the DFC, isn't there a possibility that the 20-bit hashes of two packets will accidentally point
> to the same flow struct?
> 

Yes, there is a risk of collision. The check whether the DFC entry points to the correct megaflow entry is done below.

> >
> > -    if (min && random_uint32() <= min) {
> > -        emc_insert(&pmd->flow_cache, key, flow);
> > +    if (dfc_entry_alive(current_entry) &&
> > +        dpcls_rule_matches_key(&flow->cr, key)) {

[Jan] If the pointed-to megaflow is alive, we use the existing function dpcls_rule_matches_key() to do a masked comparison of the packet's flow key with the megaflow's rule. This function is used by the DPCLS classifier to check the packet against rules in the subtable matching the subtable hash, and we can reuse it as is.

In case the match fails, we have a DFC miss. The subsequent DPCLS lookup will then overwrite the current DFC entry. I decided not to implement probabilistic DFC insertion for two reasons: a) the DFC is much larger, so thrashing will only set in at much higher flow counts, and b) updating the megaflow pointer in the DFC is far cheaper than an EMC insertion.
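
For readers following the thread, a simplified sketch of the check that
dpcls_rule_matches_key() performs (illustrative variable names; the real
code walks the miniflows with iterator macros):

    /* The packet matches the cached megaflow iff, for every 64-bit word
     * covered by the megaflow's mask, the masked packet value equals the
     * megaflow's pre-masked key value. */
    for (size_t i = 0; i < n_words; i++) {
        if ((pkt_value[i] & mask_value[i]) != rule_value[i]) {
            return false;
        }
    }
    return true;
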
Ferriter, Cian Dec. 11, 2017, 4:28 p.m. UTC | #3
Hi Jan,

This is a very interesting patch with compelling results.

I had a few questions after reading the commit message for this patch: 
How did you decide on 1M as the proposed size for the DFC?
Will the size of this DFC be configurable (i.e. map a different number of hash bits)?
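
For reference, in the patch as posted the DFC size is a compile-time
constant; making it configurable would presumably mean replacing these
macros with a runtime parameter:

    #define DFC_MASK_LEN 20                    /* 2^20 = 1M entries */
    #define DFC_ENTRIES (1u << DFC_MASK_LEN)
    #define DFC_MASK (DFC_ENTRIES - 1)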

I was mostly interested in how the performance of the very basic phy2phy testcase would be affected by this change.
Below are my results from 1-10000 flows, 64B packets, unidirectional with 1 PMD. One OpenFlow rule is installed for each stream of traffic being sent:
Flows   master  DFC+EMC  Gain
        [Mpps]  [Mpps]
------------------------------
1       12.34   13.12    6.3%
10      7.83    8.19     4.5%
100     4.71    4.78     1.3%
1000    3.77    3.83     1.5%
10000   3.27    3.51     7.1%

This shows a performance improvement for this testcase also.

Tested-by: Cian Ferriter <cian.ferriter@intel.com>

Let me know your thoughts on the above questions.
Thanks,
Cian

> +        emc_slow_sweep(&cache->emc_cache);
>      }
> 
> -    return NULL;
> +    struct dfc_entry *entry = &cache->entries[cache->sweep_idx];
> +    if (!dfc_entry_alive(entry)) {
> +        dfc_clear_entry(entry);
> +    }
> +    cache->sweep_idx = (cache->sweep_idx + 1) & DFC_MASK;
>  }
> 
>  static struct dp_netdev_flow *
> @@ -4048,7 +4165,7 @@ pmd_thread_main(void *f_)
>      ovs_numa_thread_setaffinity_core(pmd->core_id);
>      dpdk_set_lcore_id(pmd->core_id);
>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
> -    emc_cache_init(&pmd->flow_cache);
> +    dfc_cache_init(&pmd->flow_cache);
>  reload:
>      pmd_alloc_static_tx_qid(pmd);
> 
> @@ -4078,17 +4195,16 @@ reload:
>                                                        : PMD_CYCLES_IDLE);
>          }
> 
> -        if (lc++ > 1024) {
> -            bool reload;
> +        dfc_slow_sweep(&pmd->flow_cache);
> 
> +        if (lc++ > 1024) {
>              lc = 0;
> 
>              coverage_try_clear();
>              dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt);
> -            if (!ovsrcu_try_quiesce()) {
> -                emc_cache_slow_sweep(&pmd->flow_cache);
> -            }
> +            ovsrcu_try_quiesce();
> 
> +            bool reload;
>              atomic_read_relaxed(&pmd->reload, &reload);
>              if (reload) {
>                  break;
> @@ -4110,7 +4226,7 @@ reload:
>          goto reload;
>      }
> 
> -    emc_cache_uninit(&pmd->flow_cache);
> +    dfc_cache_uninit(&pmd->flow_cache);
>      free(poll_list);
>      pmd_free_cached_ports(pmd);
>      return NULL;
> @@ -4544,7 +4660,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      /* init the 'flow_cache' since there is no
>       * actual thread created for NON_PMD_CORE_ID. */
>      if (core_id == NON_PMD_CORE_ID) {
> -        emc_cache_init(&pmd->flow_cache);
> +        dfc_cache_init(&pmd->flow_cache);
>          pmd_alloc_static_tx_qid(pmd);
>      }
>      cmap_insert(&dp->poll_threads, CONST_CAST(struct cmap_node *, &pmd->node),
> @@ -4586,7 +4702,7 @@ dp_netdev_del_pmd(struct dp_netdev *dp, struct dp_netdev_pmd_thread *pmd)
>       * but extra cleanup is necessary */
>      if (pmd->core_id == NON_PMD_CORE_ID) {
>          ovs_mutex_lock(&dp->non_pmd_mutex);
> -        emc_cache_uninit(&pmd->flow_cache);
> +        dfc_cache_uninit(&pmd->flow_cache);
>          pmd_free_cached_ports(pmd);
>          pmd_free_static_tx_qid(pmd);
>          ovs_mutex_unlock(&dp->non_pmd_mutex);
> @@ -4896,7 +5012,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>      packet_batch_per_flow_update(batch, pkt, mf);
>  }
> 
> -/* Try to process all ('cnt') the 'packets' using only the exact match cache
> +/* Try to process all ('cnt') the 'packets' using only the PMD flow cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>   * miniflow is copied into 'keys' and the packet pointer is moved at the
>   * beginning of the 'packets' array.
> @@ -4911,19 +5027,20 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>   * will be ignored.
>   */
>  static inline size_t
> -emc_processing(struct dp_netdev_pmd_thread *pmd,
> +dfc_processing(struct dp_netdev_pmd_thread *pmd,
>                 struct dp_packet_batch *packets_,
>                 struct netdev_flow_key *keys,
>                 struct packet_batch_per_flow batches[], size_t *n_batches,
>                 bool md_is_valid, odp_port_t port_no)
>  {
> -    struct emc_cache *flow_cache = &pmd->flow_cache;
> +    struct dfc_cache *flow_cache = &pmd->flow_cache;
>      struct netdev_flow_key *key = &keys[0];
> -    size_t n_missed = 0, n_dropped = 0;
> +    size_t n_missed = 0, n_dfc_hit = 0, n_emc_hit = 0;
>      struct dp_packet *packet;
>      const size_t cnt = dp_packet_batch_size(packets_);
>      uint32_t cur_min;
>      int i;
> +    bool exact_match;
> 
>      atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
> 
> @@ -4932,7 +5049,6 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> 
>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>              dp_packet_delete(packet);
> -            n_dropped++;
>              continue;
>          }
> 
> @@ -4948,7 +5064,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>          }
>          miniflow_extract(packet, &key->mf);
>          key->len = 0; /* Not computed yet. */
> -        /* If EMC is disabled skip hash computation and emc_lookup */
> +        /* If DFC is disabled skip hash computation and DFC lookup */
>          if (cur_min) {
>              if (!md_is_valid) {
>                  key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
> @@ -4956,11 +5072,16 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>              } else {
>                  key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
>              }
> -            flow = emc_lookup(flow_cache, key);
> +            flow = dfc_lookup(flow_cache, key, &exact_match);
>          } else {
>              flow = NULL;
>          }
>          if (OVS_LIKELY(flow)) {
> +            if (exact_match) {
> +                n_emc_hit++;
> +            } else {
> +                n_dfc_hit++;
> +            }
>              dp_netdev_queue_batches(packet, flow, &key->mf, batches,
>                                      n_batches);
>          } else {
> @@ -4974,8 +5095,8 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>          }
>      }
> 
> -    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT,
> -                           cnt - n_dropped - n_missed);
> +    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT, n_emc_hit);
> +    dp_netdev_count_packet(pmd, DP_STAT_DFC_HIT, n_dfc_hit);
> 
>      return dp_packet_batch_size(packets_);
>  }
> @@ -5044,7 +5165,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>                                               add_actions->size);
>          }
>          ovs_mutex_unlock(&pmd->flow_mutex);
> -        emc_probabilistic_insert(pmd, key, netdev_flow);
> +        dfc_insert(pmd, key, netdev_flow);
>      }
>  }
> 
> @@ -5136,7 +5257,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
> 
>          flow = dp_netdev_flow_cast(rules[i]);
> 
> -        emc_probabilistic_insert(pmd, &keys[i], flow);
> +        dfc_insert(pmd, &keys[i], flow);
>          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
>                                  n_batches);
>      }
> 
> @@ -5169,7 +5290,7 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>      odp_port_t in_port;
> 
>      n_batches = 0;
> -    emc_processing(pmd, packets, keys, batches, &n_batches,
> +    dfc_processing(pmd, packets, keys, batches, &n_batches,
>                              md_is_valid, port_no);
>      if (!dp_packet_batch_is_empty(packets)) {
>          /* Get ingress port from first packet's metadata. */
> @@ -6063,7 +6184,7 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
> 
>  /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
>   * in 'mask' the values in 'key' and 'target' are the same. */
> -static inline bool
> +static bool
>  dpcls_rule_matches_key(const struct dpcls_rule *rule,
>                         const struct netdev_flow_key *target)
>  {
> --
> 1.9.1
> 
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Jan Scheurich Dec. 11, 2017, 5:02 p.m. UTC | #4
Hi Cian,

Thank you for your feedback and the interesting measurement results. 

At first I was a bit surprised that the gain is smaller in your P2P test case. I guess that is because your traffic passes the datapath (and hence the EMC) only once, while in our PVP test with VXLAN each packet causes three passes and EMC lookups, so the effective number of occupied EMC entries is three times as high. Could you test your scenario with significantly more flows to confirm that the gain grows with the number of flows?

I set the DFC size to 1M entries by macro mainly because it gave good acceleration across the entire range of parallel flows we typically test in our NFVI setup (up to 500K flows). The static memory overhead of a 1M DFC is not very high (8 MB per PMD). From that perspective there seemed little need to offer a knob to tune the size for specific use cases.
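
For reference, the 8 MB figure follows directly from the macros and the entry struct in the patch, assuming 64-bit pointers:

    #define DFC_MASK_LEN 20
    #define DFC_ENTRIES (1u << DFC_MASK_LEN)    /* 2^20 = 1M entries */

    struct dfc_entry {
        struct dp_netdev_flow *flow;            /* 8 bytes on 64-bit */
    };

    /* DFC_ENTRIES * sizeof(struct dfc_entry) = 1M * 8 B = 8 MB per PMD. */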

However, if there are enough parallel flows to completely fill the DFC, the 8 MB L3 cache footprint per PMD is significant: it may limit the performance gain across all PMDs and cause additional L3 cache contention with guest applications (VNFs) running on the compute host. So the main motivation for making the DFC size configurable (or for disabling the DFC completely) would be to limit the L3 cache footprint of a PMD. There may be more effective tools at the CPU level (e.g. "cache QoS") to achieve the same effect.

I am open to discussion on this topic and would welcome multi-PMD measurements.

Regards, Jan

> -----Original Message-----
> From: Ferriter, Cian [mailto:cian.ferriter@intel.com]
> Sent: Monday, 11 December, 2017 17:29
> To: Jan Scheurich <jan.scheurich@ericsson.com>; dev@openvswitch.org
> Subject: RE: [PATCH] dpif-netdev: Refactor datapath flow cache
> 
> Hi Jan,
> 
> This is a very interesting patch with compelling results.
> 
> I had a few questions after reading the commit message for this patch:
> How did you decide on 1M as the proposed size for the DFC?
> Will the size of this DFC be configurable (i.e. map a different number of hash bits)?
> 
> I was mostly interested in how the performance of the very basic phy2phy testcase would be affected by this change.
> Below are my results from 1-10000 flows, 64B packets, unidirectional with 1 PMD. One OpenFlow rule is installed for each stream of
> traffic being sent:
> Flows   master  DFC+EMC  Gain
>         [Mpps]  [Mpps]
> ------------------------------
> 1       12.34   13.12    6.3%
> 10      7.83    8.19     4.5%
> 100     4.71    4.78     1.3%
> 1000    3.77    3.83     1.5%
> 10000   3.27    3.51     7.1%
> 
> This shows a performance improvement for this testcase also.
> 
> Tested-by: Cian Ferriter <cian.ferriter@intel.com>
> 
> Let me know your thoughts on the above questions.
> Thanks,
> Cian
> 
Wang, Yipeng1 Dec. 17, 2017, 6:48 p.m. UTC | #5
Hi, Jan

We went through the code and did some performance comparisons. We notice that the patch contains two optimizations: simplifying and resizing the EMC, and adding another cache layer in front of the megaflow cache. The new cache goes in the same direction as the cuckoo distributor (CD) proposal we posted back in April (https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/330570.html) and presented at the OvS 2017 conference. Comparing the two patches, we see pros and cons for both CD and DFC, and we are currently seeking a way to combine the benefits of both. We are also looking for other ways to further simplify the current datapath, to arrive at a solution that is scalable yet simple. Below are some detailed comments:

For the EMC part, we wonder if you enabled transparent huge pages (THP) during the test. In our test case the new EMC only gives a small speedup with THP enabled, since with huge pages shrinking the EMC entries does not help much. Also, reducing the EMC from two hash locations to one actually hurts certain traffic patterns we tested; I guess the benefit of this optimization will largely depend on the traffic pattern. Another question: it seems that when the EMC lookup calls "netdev_flow_key_equal_mf", the key length is not initialized yet, so the key comparison is not done correctly. Could you please double check?
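
To make the suspected problem concrete, here is an untested sketch of where the length could be computed in dfc_processing(), using the existing netdev_flow_key_size() and miniflow_n_values() helpers (this only illustrates the report, it is not a verified fix):

    miniflow_extract(packet, &key->mf);
    /* The patch leaves key->len = 0 here ("Not computed yet"), but
     * emc_lookup() -> netdev_flow_key_equal_mf() compares key->len bytes
     * and emc_change_entry() uses key->len for its size check, so both
     * would operate on a zero length.  Computing it up front avoids that: */
    key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));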

For the DFC part, we compared it against the CD patch we presented at the OvS conference. CD begins to perform better than DFC with 2 or more subtables, reaching ~60% higher throughput with 20 subtables, especially when the flow count is large (around 100k or more). We found CD to be the more scalable implementation w.r.t. subtable count and flow count. Part of the reason is that DFC's direct-mapped (single-hash) table does not perform consistently across traffic patterns and easily causes conflict misses. DFC's advantage shows up when all flows hit the DFC or only one subtable exists. We are currently thinking about combining both approaches, for example in a hybrid model.
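
To illustrate the structural difference (the types below are purely illustrative and taken from neither patch): DFC is direct-mapped with one candidate flow per hash bucket, while CD keeps a small set-associative table of hash signatures, so a few colliding flows can coexist instead of evicting each other:

    /* DFC (this patch): direct-mapped, one candidate per bucket. */
    struct dfc_entry {
        struct dp_netdev_flow *flow;
    };

    /* CD-style bucket (illustrative only): 4-way set-associative with
     * 16-bit signatures derived from the packet hash, so up to four
     * colliding flows can share a bucket before anything is evicted. */
    struct cd_bucket {
        uint16_t sig[4];
        struct dp_netdev_flow *flow[4];
    };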

We would love to hear your opinion on this. Ideally we can find a way to harmonize both patches and arrive at a scalable and efficient refactoring of the datapath.

Thanks
Yipeng

From: Jan Scheurich [mailto:jan.scheurich@ericsson.com]
Sent: Monday, November 20, 2017 9:33 AM
To: dev@openvswitch.org
Cc: Wang, Yipeng1 <yipeng1.wang@intel.com>; Gobriel, Sameh <sameh.gobriel@intel.com>
Subject: [PATCH] dpif-netdev: Refactor datapath flow cache


diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index db78318..efcf2e9 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -127,19 +127,19 @@ struct netdev_flow_key {
     uint64_t buf[FLOW_MAX_PACKET_U64S];
};

-/* Exact match cache for frequently used flows
+/* Datapath flow cache (DFC) for frequently used flows
  *
- * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to
- * search its entries for a miniflow that matches exactly the miniflow of the
- * packet. It stores the 'dpcls_rule' (rule) that matches the miniflow.
+ * The cache uses the 32-bit hash of the packet (which can be the RSS hash) to
+ * directly look up a pointer to the matching megaflow. To check for a match
+ * the packet's flow key is compared against the key and mask of the megaflow.
  *
- * A cache entry holds a reference to its 'dp_netdev_flow'.
- *
- * A miniflow with a given hash can be in one of EM_FLOW_HASH_SEGS different
- * entries. The 32-bit hash is split into EM_FLOW_HASH_SEGS values (each of
- * them is EM_FLOW_HASH_SHIFT bits wide and the remainder is thrown away). Each
- * value is the index of a cache entry where the miniflow could be.
+ * For even faster lookup, the most frequently used packet flows are also
+ * inserted into a small exact match cache (EMC). The EMC uses a part of the
+ * packet hash to look up a miniflow that matches exactly the miniflow of the
+ * packet. The matching EMC also returns a reference to the megaflow.
  *
+ * Flows are promoted from the DFC to the EMC through probabilistic insertion
+ * after successful DFC lookup with minor probability to favor elephant flows.
  *
  * Thread-safety
  * =============
@@ -148,33 +148,38 @@ struct netdev_flow_key {
  * If dp_netdev_input is not called from a pmd thread, a mutex is used.
  */

-#define EM_FLOW_HASH_SHIFT 13
-#define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT)
-#define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)
-#define EM_FLOW_HASH_SEGS 2
+#define DFC_MASK_LEN 20
+#define DFC_ENTRIES (1u << DFC_MASK_LEN)
+#define DFC_MASK (DFC_ENTRIES - 1)
+#define EMC_MASK_LEN 14
+#define EMC_ENTRIES (1u << EMC_MASK_LEN)
+#define EMC_MASK (EMC_ENTRIES - 1)

/* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB */
-#define DEFAULT_EM_FLOW_INSERT_INV_PROB 100
+#define DEFAULT_EM_FLOW_INSERT_INV_PROB 1000
#define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
                                     DEFAULT_EM_FLOW_INSERT_INV_PROB)

struct emc_entry {
     struct dp_netdev_flow *flow;
-    struct netdev_flow_key key;   /* key.hash used for emc hash value. */
+    char key[248];   /* Holds struct netdev_flow_key of limited size. */
};

struct emc_cache {
-    struct emc_entry entries[EM_FLOW_HASH_ENTRIES];
-    int sweep_idx;                /* For emc_cache_slow_sweep(). */
+    struct emc_entry entries[EMC_ENTRIES];
+    int sweep_idx;
+};
+
+struct dfc_entry {
+    struct dp_netdev_flow *flow;
+};
+
+struct dfc_cache {
+    struct emc_cache emc_cache;
+    struct dfc_entry entries[DFC_ENTRIES];
+    int sweep_idx;
};

-/* Iterate in the exact match cache through every entry that might contain a
- * miniflow with hash 'HASH'. */
-#define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH)                 \
-    for (uint32_t i__ = 0, srch_hash__ = (HASH);                             \
-         (CURRENT_ENTRY) = &(EMC)->entries[srch_hash__ & EM_FLOW_HASH_MASK], \
-         i__ < EM_FLOW_HASH_SEGS;                                            \
-         i__++, srch_hash__ >>= EM_FLOW_HASH_SHIFT)

/* Simple non-wildcarding single-priority classifier. */

@@ -214,6 +219,8 @@ static bool dpcls_lookup(struct dpcls *cls,
                          const struct netdev_flow_key keys[],
                          struct dpcls_rule **rules, size_t cnt,
                          int *num_lookups_p);
+static bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
+                                   const struct netdev_flow_key *target);

/* Set of supported meter flags */
#define DP_SUPPORTED_METER_FLAGS_MASK \
@@ -332,6 +339,7 @@ static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,

enum dp_stat_type {
     DP_STAT_EXACT_HIT,          /* Packets that had an exact match (emc). */
+    DP_STAT_DFC_HIT,            /* Packets that had a flow cache hit (dfc). */
     DP_STAT_MASKED_HIT,         /* Packets that matched in the flow table. */
     DP_STAT_MISS,               /* Packets that did not match. */
     DP_STAT_LOST,               /* Packets not passed up to the client. */
@@ -565,7 +573,7 @@ struct dp_netdev_pmd_thread {
      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
      * need to be protected by 'non_pmd_mutex'.  Every other instance
      * will only be accessed by its own pmd thread. */
-    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
+    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache;
     struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */

     /* Queue id used by this pmd thread to send packets on all netdevs if
@@ -744,46 +752,59 @@ dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
                                       struct tx_port *tx, long long now);

-static inline bool emc_entry_alive(struct emc_entry *ce);
+static inline bool dfc_entry_alive(struct dfc_entry *ce);
static void emc_clear_entry(struct emc_entry *ce);
+static void dfc_clear_entry(struct dfc_entry *ce);

static void dp_netdev_request_reconfigure(struct dp_netdev *dp);

static void
-emc_cache_init(struct emc_cache *flow_cache)
+emc_cache_init(struct emc_cache *emc)
{
     int i;

-    flow_cache->sweep_idx = 0;
+    for (i = 0; i < ARRAY_SIZE(emc->entries); i++) {
+        emc->entries[i].flow = NULL;
+        struct netdev_flow_key *key =
+                (struct netdev_flow_key *) emc->entries[i].key;
+        key->hash = 0;
+        key->len = sizeof(struct miniflow);
+        flowmap_init(&key->mf.map);
+    }
+    emc->sweep_idx = 0;
+}
+
+static void
+dfc_cache_init(struct dfc_cache *flow_cache)
+{
+    int i;
+
+    emc_cache_init(&flow_cache->emc_cache);
     for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
         flow_cache->entries[i].flow = NULL;
-        flow_cache->entries[i].key.hash = 0;
-        flow_cache->entries[i].key.len = sizeof(struct miniflow);
-        flowmap_init(&flow_cache->entries[i].key.mf.map);
     }
+    flow_cache->sweep_idx = 0;
}

static void
-emc_cache_uninit(struct emc_cache *flow_cache)
+emc_cache_uninit(struct emc_cache *emc)
{
     int i;

-    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
-        emc_clear_entry(&flow_cache->entries[i]);
+    for (i = 0; i < ARRAY_SIZE(emc->entries); i++) {
+        emc_clear_entry(&emc->entries[i]);
     }
}

-/* Check and clear dead flow references slowly (one entry at each
- * invocation).  */
static void
-emc_cache_slow_sweep(struct emc_cache *flow_cache)
+dfc_cache_uninit(struct dfc_cache *flow_cache)
{
-    struct emc_entry *entry = &flow_cache->entries[flow_cache->sweep_idx];
+    int i;

-    if (!emc_entry_alive(entry)) {
-        emc_clear_entry(entry);
+    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
+        dfc_clear_entry(&flow_cache->entries[i]);
     }
-    flow_cache->sweep_idx = (flow_cache->sweep_idx + 1) & EM_FLOW_HASH_MASK;
+    emc_cache_uninit(&flow_cache->emc_cache);
}

/* Returns true if 'dpif' is a netdev or dummy dpif, false otherwise. */
@@ -837,8 +858,8 @@ pmd_info_show_stats(struct ds *reply,
     }

     /* Sum of all the matched and not matched packets gives the total.  */
-    total_packets = stats[DP_STAT_EXACT_HIT] + stats[DP_STAT_MASKED_HIT]
-                    + stats[DP_STAT_MISS];
+    total_packets = stats[DP_STAT_EXACT_HIT] + stats[DP_STAT_DFC_HIT]
+                    + stats[DP_STAT_MASKED_HIT] + stats[DP_STAT_MISS];

     for (i = 0; i < PMD_N_CYCLES; i++) {
         if (cycles[i] > pmd->cycles_zero[i]) {
@@ -862,10 +883,13 @@ pmd_info_show_stats(struct ds *reply,
     ds_put_cstr(reply, ":\n");

     ds_put_format(reply,
-                  "\temc hits:%llu\n\tmegaflow hits:%llu\n"
+                  "\temc hits:%llu\n\tdfc hits:%llu\n"
+                  "\tmegaflow hits:%llu\n"
                   "\tavg. subtable lookups per hit:%.2f\n"
                   "\tmiss:%llu\n\tlost:%llu\n",
-                  stats[DP_STAT_EXACT_HIT], stats[DP_STAT_MASKED_HIT],
+                  stats[DP_STAT_EXACT_HIT],
+                  stats[DP_STAT_DFC_HIT],
+                  stats[DP_STAT_MASKED_HIT],
                   stats[DP_STAT_MASKED_HIT] > 0
                   ? (1.0*stats[DP_STAT_LOOKUP_HIT])/stats[DP_STAT_MASKED_HIT]
                   : 0,
@@ -1492,6 +1516,8 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
         stats->n_hit += n;
         atomic_read_relaxed(&pmd->stats.n[DP_STAT_EXACT_HIT], &n);
         stats->n_hit += n;
+        atomic_read_relaxed(&pmd->stats.n[DP_STAT_DFC_HIT], &n);
+        stats->n_hit += n;
         atomic_read_relaxed(&pmd->stats.n[DP_STAT_MISS], &n);
         stats->n_missed += n;
         atomic_read_relaxed(&pmd->stats.n[DP_STAT_LOST], &n);
@@ -2111,6 +2137,16 @@ netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key,
     return hash_finish(hash, (p - miniflow_get_values(&mask->mf)) * 8);
}

+/*
+ * Datapath Flow Cache and EMC implementation
+ */
+
+static inline struct emc_entry *
+emc_entry_get(struct emc_cache *emc, const uint32_t hash)
+{
+    return &emc->entries[hash & EMC_MASK];
+}
+
static inline bool
emc_entry_alive(struct emc_entry *ce)
{
@@ -2127,9 +2163,15 @@ emc_clear_entry(struct emc_entry *ce)
}

static inline void
-emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
-                 const struct netdev_flow_key *key)
+emc_change_entry(struct emc_entry *ce, const struct netdev_flow_key *key,
+                 struct dp_netdev_flow *flow)
{
+    /* We only store small enough flows in the EMC. */
+    size_t key_size = offsetof(struct netdev_flow_key, mf) + key->len;
+    if (key_size > sizeof(ce->key)) {
+        return;
+    }
+
     if (ce->flow != flow) {
         if (ce->flow) {
             dp_netdev_flow_unref(ce->flow);
@@ -2141,73 +2183,148 @@ emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
             ce->flow = NULL;
         }
     }
-    if (key) {
-        netdev_flow_key_clone(&ce->key, key);
-    }
+    netdev_flow_key_clone((struct netdev_flow_key *) ce->key, key);
}

static inline void
-emc_insert(struct emc_cache *cache, const struct netdev_flow_key *key,
-           struct dp_netdev_flow *flow)
+emc_probabilistic_insert(struct emc_cache *emc,
+                         const struct netdev_flow_key *key,
+                         struct dp_netdev_flow *flow)
{
-    struct emc_entry *to_be_replaced = NULL;
-    struct emc_entry *current_entry;
+    const uint32_t threshold = UINT32_MAX/1000;

-    EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
-        if (netdev_flow_key_equal(&current_entry->key, key)) {
-            /* We found the entry with the 'mf' miniflow */
-            emc_change_entry(current_entry, flow, NULL);
-            return;
+    if (random_uint32() <= threshold) {
+
+        struct emc_entry *current_entry = emc_entry_get(emc, key->hash);
+        emc_change_entry(current_entry, key, flow);
+    }
+}
+
+static inline struct dp_netdev_flow *
+emc_lookup(struct emc_cache *emc, const struct netdev_flow_key *key)
+{
+    struct emc_entry *current_entry = emc_entry_get(emc, key->hash);
+    struct netdev_flow_key *current_key =
+            (struct netdev_flow_key *) current_entry->key;
+
+    if (current_key->hash == key->hash
+        && emc_entry_alive(current_entry)
+        && netdev_flow_key_equal_mf(current_key, &key->mf)) {
+
+        /* We found the entry with the 'key->mf' miniflow */
+        return current_entry->flow;
+    }
+    return NULL;
+}
+
+static inline struct dfc_entry *
+dfc_entry_get(struct dfc_cache *cache, const uint32_t hash)
+{
+    return &cache->entries[hash & DFC_MASK];
+}
+
+static inline bool
+dfc_entry_alive(struct dfc_entry *ce)
+{
+    return ce->flow && !ce->flow->dead;
+}
+
+static void
+dfc_clear_entry(struct dfc_entry *ce)
+{
+    if (ce->flow) {
+        dp_netdev_flow_unref(ce->flow);
+        ce->flow = NULL;
+    }
+}
+
+static inline void
+dfc_change_entry(struct dfc_entry *ce, struct dp_netdev_flow *flow)
+{
+    if (ce->flow != flow) {
+        if (ce->flow) {
+            dp_netdev_flow_unref(ce->flow);
         }

-        /* Replacement policy: put the flow in an empty (not alive) entry, or
-         * in the first entry where it can be */
-        if (!to_be_replaced
-            || (emc_entry_alive(to_be_replaced)
-                && !emc_entry_alive(current_entry))
-            || current_entry->key.hash < to_be_replaced->key.hash) {
-            to_be_replaced = current_entry;
+        if (dp_netdev_flow_ref(flow)) {
+            ce->flow = flow;
+        } else {
+            ce->flow = NULL;
         }
     }
-    /* We didn't find the miniflow in the cache.
-     * The 'to_be_replaced' entry is where the new flow will be stored */
-
-    emc_change_entry(to_be_replaced, flow, key);
}

static inline void
-emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd,
-                         const struct netdev_flow_key *key,
-                         struct dp_netdev_flow *flow)
+dfc_insert(struct dp_netdev_pmd_thread *pmd,
+           const struct netdev_flow_key *key,
+           struct dp_netdev_flow *flow)
{
-    /* Insert an entry into the EMC based on probability value 'min'. By
-     * default the value is UINT32_MAX / 100 which yields an insertion
-     * probability of 1/100 ie. 1% */
+    struct dfc_cache *cache = &pmd->flow_cache;
+    struct dfc_entry *current_entry;
+
+    current_entry = dfc_entry_get(cache, key->hash);
+    dfc_change_entry(current_entry, flow);
+}
+
+static inline struct dp_netdev_flow *
+dfc_lookup(struct dfc_cache *cache, const struct netdev_flow_key *key,
+           bool *exact_match)
+{
+    struct dp_netdev_flow *flow;

-    uint32_t min;
-    atomic_read_relaxed(&pmd->dp->emc_insert_min, &min);
+    /* Try an EMC lookup first. */
+    flow = emc_lookup(&cache->emc_cache, key);
+    if (flow) {
+        *exact_match = true;
+        return flow;
+    }
+
+    /* EMC lookup not successful: try DFC lookup. */
+    struct dfc_entry *current_entry = dfc_entry_get(cache, key->hash);
+    flow = current_entry->flow;

-    if (min && random_uint32() <= min) {
-        emc_insert(&pmd->flow_cache, key, flow);
+    if (dfc_entry_alive(current_entry) &&
+        dpcls_rule_matches_key(&flow->cr, key)) {
+
+        /* Found a match in DFC. Insert into EMC for subsequent lookups.
+         * We use probabilistic insertion here so that mainly elephant
+         * flows enter EMC. */
+        emc_probabilistic_insert(&cache->emc_cache, key, flow);
+        *exact_match = false;
+        return flow;
+    } else {
+
+        /* No match. Need to go to DPCLS lookup. */
+        return NULL;
     }
}

-static inline struct dp_netdev_flow *
-emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
+/* Check and clear dead flow references slowly (one entry at each
+ * invocation).  */
+static void
+emc_slow_sweep(struct emc_cache *emc)
{
-    struct emc_entry *current_entry;
+    struct emc_entry *entry = &emc->entries[emc->sweep_idx];

-    EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
-        if (current_entry->key.hash == key->hash
-            && emc_entry_alive(current_entry)
-            && netdev_flow_key_equal_mf(&current_entry->key, &key->mf)) {
+    if (!emc_entry_alive(entry)) {
+        emc_clear_entry(entry);
+    }
+    emc->sweep_idx = (emc->sweep_idx + 1) & EMC_MASK;
+}

-            /* We found the entry with the 'key->mf' miniflow */
-            return current_entry->flow;
-        }
+static void
+dfc_slow_sweep(struct dfc_cache *cache)
+{
+    /* Sweep the EMC so that both finish in the same time. */
+    if ((cache->sweep_idx & (DFC_ENTRIES/EMC_ENTRIES - 1)) == 0) {
+        emc_slow_sweep(&cache->emc_cache);
     }

-    return NULL;
+    struct dfc_entry *entry = &cache->entries[cache->sweep_idx];
+    if (!dfc_entry_alive(entry)) {
+        dfc_clear_entry(entry);
+    }
+    cache->sweep_idx = (cache->sweep_idx + 1) & DFC_MASK;
}

static struct dp_netdev_flow *
@@ -4048,7 +4165,7 @@ pmd_thread_main(void *f_)
     ovs_numa_thread_setaffinity_core(pmd->core_id);
     dpdk_set_lcore_id(pmd->core_id);
     poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
-    emc_cache_init(&pmd->flow_cache);
+    dfc_cache_init(&pmd->flow_cache);
reload:
     pmd_alloc_static_tx_qid(pmd);

@@ -4078,17 +4195,16 @@ reload:
                                                       : PMD_CYCLES_IDLE);
         }

-        if (lc++ > 1024) {
-            bool reload;
+        dfc_slow_sweep(&pmd->flow_cache);

+        if (lc++ > 1024) {
             lc = 0;

             coverage_try_clear();
             dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt);
-            if (!ovsrcu_try_quiesce()) {
-                emc_cache_slow_sweep(&pmd->flow_cache);
-            }
+            ovsrcu_try_quiesce();

+            bool reload;
             atomic_read_relaxed(&pmd->reload, &reload);
             if (reload) {
                 break;
@@ -4110,7 +4226,7 @@ reload:
         goto reload;
     }

-    emc_cache_uninit(&pmd->flow_cache);
+    dfc_cache_uninit(&pmd->flow_cache);
     free(poll_list);
     pmd_free_cached_ports(pmd);
     return NULL;
@@ -4544,7 +4660,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     /* init the 'flow_cache' since there is no
      * actual thread created for NON_PMD_CORE_ID. */
     if (core_id == NON_PMD_CORE_ID) {
-        emc_cache_init(&pmd->flow_cache);
+        dfc_cache_init(&pmd->flow_cache);
         pmd_alloc_static_tx_qid(pmd);
     }
     cmap_insert(&dp->poll_threads, CONST_CAST(struct cmap_node *, &pmd->node),
@@ -4586,7 +4702,7 @@ dp_netdev_del_pmd(struct dp_netdev *dp, struct dp_netdev_pmd_thread *pmd)
      * but extra cleanup is necessary */
     if (pmd->core_id == NON_PMD_CORE_ID) {
         ovs_mutex_lock(&dp->non_pmd_mutex);
-        emc_cache_uninit(&pmd->flow_cache);
+        dfc_cache_uninit(&pmd->flow_cache);
         pmd_free_cached_ports(pmd);
         pmd_free_static_tx_qid(pmd);
         ovs_mutex_unlock(&dp->non_pmd_mutex);
@@ -4896,7 +5012,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
     packet_batch_per_flow_update(batch, pkt, mf);
}

-/* Try to process all ('cnt') the 'packets' using only the exact match cache
+/* Try to process all ('cnt') the 'packets' using only the PMD flow cache
  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
  * miniflow is copied into 'keys' and the packet pointer is moved at the
  * beginning of the 'packets' array.
@@ -4911,19 +5027,20 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
  * will be ignored.
  */
static inline size_t
-emc_processing(struct dp_netdev_pmd_thread *pmd,
+dfc_processing(struct dp_netdev_pmd_thread *pmd,
                struct dp_packet_batch *packets_,
                struct netdev_flow_key *keys,
                struct packet_batch_per_flow batches[], size_t *n_batches,
                bool md_is_valid, odp_port_t port_no)
{
-    struct emc_cache *flow_cache = &pmd->flow_cache;
+    struct dfc_cache *flow_cache = &pmd->flow_cache;
     struct netdev_flow_key *key = &keys[0];
-    size_t n_missed = 0, n_dropped = 0;
+    size_t n_missed = 0, n_dfc_hit = 0, n_emc_hit = 0;
     struct dp_packet *packet;
     const size_t cnt = dp_packet_batch_size(packets_);
     uint32_t cur_min;
     int i;
+    bool exact_match;

     atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);

@@ -4932,7 +5049,6 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,

         if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
             dp_packet_delete(packet);
-            n_dropped++;
             continue;
         }

@@ -4948,7 +5064,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
         }
         miniflow_extract(packet, &key->mf);
         key->len = 0; /* Not computed yet. */
-        /* If EMC is disabled skip hash computation and emc_lookup */
+        /* If DFC is disabled skip hash computation and DFC lookup */
         if (cur_min) {
             if (!md_is_valid) {
                 key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
@@ -4956,11 +5072,16 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
             } else {
                 key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
             }
-            flow = emc_lookup(flow_cache, key);
+            flow = dfc_lookup(flow_cache, key, &exact_match);
         } else {
             flow = NULL;
         }
         if (OVS_LIKELY(flow)) {
+            if (exact_match) {
+                n_emc_hit++;
+            } else {
+                n_dfc_hit++;
+            }
             dp_netdev_queue_batches(packet, flow, &key->mf, batches,
                                     n_batches);
         } else {
@@ -4974,8 +5095,8 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
         }
     }

-    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT,
-                           cnt - n_dropped - n_missed);
+    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT, n_emc_hit);
+    dp_netdev_count_packet(pmd, DP_STAT_DFC_HIT, n_dfc_hit);

     return dp_packet_batch_size(packets_);
}
@@ -5044,7 +5165,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
                                              add_actions->size);
         }
         ovs_mutex_unlock(&pmd->flow_mutex);
-        emc_probabilistic_insert(pmd, key, netdev_flow);
+        dfc_insert(pmd, key, netdev_flow);
     }
}

@@ -5136,7 +5257,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,

         flow = dp_netdev_flow_cast(rules[i]);

-        emc_probabilistic_insert(pmd, &keys[i], flow);
+        dfc_insert(pmd, &keys[i], flow);
         dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
     }

@@ -5169,7 +5290,7 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
     odp_port_t in_port;

     n_batches = 0;
-    emc_processing(pmd, packets, keys, batches, &n_batches,
+    dfc_processing(pmd, packets, keys, batches, &n_batches,
                             md_is_valid, port_no);
     if (!dp_packet_batch_is_empty(packets)) {
         /* Get ingress port from first packet's metadata. */
@@ -6063,7 +6184,7 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)

/* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
  * in 'mask' the values in 'key' and 'target' are the same. */
-static inline bool
+static bool
dpcls_rule_matches_key(const struct dpcls_rule *rule,
                        const struct netdev_flow_key *target)
{
--
1.9.1
Jan Scheurich Dec. 18, 2017, 5:01 p.m. UTC | #6
Hi Yipeng,

Thanks a lot for your feedback. Please find some responses below.

Regards, Jan


From: Wang, Yipeng1 [mailto:yipeng1.wang@intel.com]
Sent: Sunday, 17 December, 2017 19:49
To: Jan Scheurich <jan.scheurich@ericsson.com>; dev@openvswitch.org
Cc: Gobriel, Sameh <sameh.gobriel@intel.com>; Tai, Charlie <charlie.tai@intel.com>
Subject: RE: [PATCH] dpif-netdev: Refactor datapath flow cache

Hi, Jan

We went through the code and did some performance comparisons. We notice that the patch contains two parts of optimization: EMC simplification/resizing, and another layer of cache added before the megaflow cache. The new cache idea takes the same direction as the cuckoo distributor (CD) proposal we posted back in April (https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/330570.html) and presented at the OvS 2017 conference. Compared with our patch, we see pros and cons for both CD and DFC, and we are currently seeking a way to combine the benefits of both patches. We are also looking at other ways to further simplify the current datapath, to arrive at a solution that is both scalable and simple. Below are some detailed comments:

For the EMC part, we wonder if you enabled transparent huge pages (THP) during the test. For our test case, the new EMC only gives a little speedup if THP is enabled, since with huge pages, reducing the EMC entry size does not help much. Also, reducing 2-hash to 1-hash actually harms certain traffic patterns we tested. I guess the optimization will largely depend on traffic patterns. Another question is that it seems that when the EMC lookup does "netdev_flow_key_equal_mf", the key length is not initialized yet, so the key comparison is not done correctly. Could you please double check?

[Jan] Yes, THP is enabled on my test setup, but I have doubts that it significantly boosts the performance of the DFC/EMC by transparently allocating that memory on a hugepage. Do you have a means to check that on a running OVS?

My primary goal when I chose to change the EMC implementation from 2-way to 1-way associativity was to simplify the code. In my tests I have not seen any benefit from having two possible locations for an EMC entry. As far as I can see there is no theoretical reason why we should expect systematic collisions of pairs of flows that would justify such a design. There may well be specific traffic patterns that benefit from a 2-way EMC, but then there are obviously others for which 1-way performs better. When in doubt, I believe we should choose the simpler design.
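
To make the associativity difference concrete, here is a minimal sketch of the two lookup disciplines (the 2-way variant reflects the pre-patch entry layout and the macros this patch removes; the helper names are made up for illustration):

/* 1-way: the hash selects exactly one candidate slot. */
static inline struct emc_entry *
emc_slot_1way(struct emc_cache *emc, uint32_t hash)
{
    return &emc->entries[hash & EMC_MASK];
}

/* 2-way: a second candidate slot is derived from the next hash
 * segment, so two flows that collide on the first slot can coexist. */
static inline struct emc_entry *
emc_slot_2way(struct emc_cache *emc, uint32_t hash,
              const struct netdev_flow_key *key)
{
    int i;

    for (i = 0; i < EM_FLOW_HASH_SEGS; i++, hash >>= EM_FLOW_HASH_SHIFT) {
        struct emc_entry *ce = &emc->entries[hash & EM_FLOW_HASH_MASK];

        if (netdev_flow_key_equal(&ce->key, key)) {
            return ce;
        }
    }
    return NULL;
}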

Regarding "netdev_flow_key_equal_mf", there is no difference to the baseline. The key length is taken from the stored EMC entry, not the packet's flow key.

For the DFC cache part, we compared with the CD patch we presented at the OvS conference. We saw CD begin to perform better than DFC with 2 or more subtables, and achieve ~60% higher throughput with 20 subtables, especially when the flow count is large (around 100k or more). We found CD to be a more scalable implementation w.r.t. subtable count and flow count. Part of the reason is that the 1-hash hash table of DFC does not perform consistently across traffic patterns and easily causes conflict misses. DFC's advantage shows up when all flows hit DFC or only 1 subtable exists. We are currently thinking about combining both approaches, for example in a hybrid model.

[Jan] A DFC hit will always be faster than a CD hit because the latter involves a DPCLS subtable lookup. In my tests the DFC miss rate goes up from ~4% at 5000 parallel flows to ~25% at 150K flows, so even for a large number of flows most packets still hit the DFC. The cost of the traditional DPCLS lookup (i.e. subtable search) must be very high to cause a big degradation.

Can you specify your test setup in more detail? What kind of DFC miss rates do you measure depending on the flow rate? Can you publish your measurement results?

Do you have an updated version of your CD patch that works with the membership library in DPDK 17.11 now that OVS master builds against 17.11?

We would love to hear your opinion on this. We think the best case is that we find a way to harmonize both patches and refactor the datapath in a way that is both scalable and efficient.

I would be interested to see your ideas on how to combine DFC and CD in a good way.

In principle I think CD acceleration of the DPCLS lookup could complement DFC, but I am a bit concerned about the combined memory and cache footprint of EMC, DFC and CD. Even for EMC+DFC I have some doubts here. This should be evaluated in multi-core setups of OVS and with real VNFs/applications in the guests that exert significant L3 cache contention.

Also, as all three are based on the same RSS hash as key, isn't there a likelihood of hash collisions hitting all three in the same way? I am thinking about packets that have little/no entropy in the outer headers (e.g. GRE tunnels).

Thanks
Yipeng
Wang, Yipeng1 Dec. 19, 2017, 11:08 p.m. UTC | #7
Hi, Jan,

Please see my comments inlined.

Thanks
Yipeng

>-----Original Message-----
>From: Jan Scheurich [mailto:jan.scheurich@ericsson.com]
>Sent: Monday, December 18, 2017 9:01 AM
>To: Wang, Yipeng1 <yipeng1.wang@intel.com>; dev@openvswitch.org
>Cc: Gobriel, Sameh <sameh.gobriel@intel.com>; Tai, Charlie
><charlie.tai@intel.com>
>Subject: RE: [PATCH] dpif-netdev: Refactor datapath flow cache
>
>Hi Yipeng,
>
>Thanks a lot for your feedback. Please find some responses below.
>
>Regards, Jan
>
>
>[snip: quoted headers and earlier context]
>
>For EMC part, we wonder if you enabled transparent huge page (THP) during
>the test. For our test case, the new EMC only gives a little speedup if THP
>enabled, since with huge page, reducing EMC entry does not benefit much.
>Also, reducing 2-hash to 1-hash actually harms certain traffic patterns we
>tested. I guess the optimization will largely depend on traffic patterns.
>Another question is that it seems when EMC lookup does
>"netdev_flow_key_equal_mf", the key length is not initialized yet. Thus, the
>key comparison actually does not do correctly. Could you please double check?
>
>[Jan] Yes, THP is enabled on my test setup, but I have doubts that it
>significantly boosts the performance of the DFC/EMC by transparently
>allocating that memory on a hugepage. Do you have a means to check that on
>a running OVS?

[Wang, Yipeng] In my test, I compared the proposed EMC with the current EMC, both with 16k entries.
If I turned off THP, the current EMC caused many TLB misses because of its larger entry size, which I profiled with vTunes.
Once I turned on THP with no other changes, the current EMC's throughput increased a lot and became comparable with the newly
proposed EMC. In vTunes, the EMC lookup TLB misses decreased from 100 million to 0 during the 30 sec profiling time.
So if THP is enabled, reducing the EMC entry size may not give much benefit compared to the current EMC.
It is worth mentioning that both use a similar amount of CPU cache, since only the miniflow struct is accessed by the CPU;
thus the TLB should be the major concern.

>My primary goal when I chose to change the EMC implementation from 2-way
>to 1-way associativity was to simplify the code. In my tests I have not seen any
>benefit of having two possible locations for an EMC entry. As far as I can see
>there is no theoretical reason why we should expect systematic collisions of
>pairs of flows that would justify such a design. There may well be specific
>traffic patterns that benefit from 2-way EMC, but then there are obviously
>others for which 1-way performs better. In doubt I believe we should choose
>the simpler design.
>
[Wang, Yipeng] Yes, there are no systematic collisions. However, in general,
a 1-hash table tends to cause many more misses than a 2-hash one. For code simplicity,
I agree that 1-hash is simpler and much easier to understand. For performance,
if the flows fit in the 1-hash table, they should also stay in the primary location of the 2-hash table,
so basically they should have similar lookup speed. For large numbers of flows in general,
traffic will see a higher miss ratio in a 1-hash than in a 2-hash table. In one of our tests with
10k flows, 3 subtables (test cases described later), and an EMC sized for 16k entries,
the 2-hash EMC showed about a 14% miss ratio, while the 1-hash EMC showed a 47% miss ratio.

>Regarding "netdev_flow_key_equal_mf", there is no difference to the
>baseline. The key length is taken from the stored EMC entry, not the packet's
>flow key.
>
[Wang, Yipeng] Sorry that I did not explain it correctly. It seems the key length may not be initialized
when emc_probabilistic_insert is called. If I am correct, the EMC entry does not contain the right
key length and the EMC lookup does not use the correct length. This is because the EMC insert
now happens during dfc_lookup, whereas originally it happened after fast_path_processing.
Do I understand it correctly?
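
If so, a possible fix (just an untested sketch on my side, reusing the existing netdev_flow_key_size() and miniflow_n_values() helpers) could be to fill in the length right after miniflow extraction in dfc_processing(), rather than leaving it zero until the fast path:

    miniflow_extract(packet, &key->mf);
    /* Sketch: compute the key length here so that a later
     * dfc_insert() -> emc_probabilistic_insert() sees a valid
     * key->len; this trades a little per-packet work for a
     * correct EMC size check and key copy. */
    key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));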

>For the DFC cache part, we compare with our CD patch we presented in the
>OvS conference. We saw CD begins to perform better than DFC with 2 or
>more subtables, and ~60% higher throughput with 20 subtables, especially
>when flow count is large (around 100k or more). We found CD is a more
>scalable implementation w.r.t subtable count and flow count. Part of the
>reason is that the 1-hash hash table of DFC does not perform consistently with
>various traffic patterns, and easily cause conflict misses. DFC's advantage
>shows up when all flows hit DFC or only 1 subtable exists. We are currently
>thinking about combining both approaches for example a hybrid model.
>
>[Jan] A DFC hit will always be faster than a CD hit because the latter involves
>a DPCLS subtable lookup. In my tests the DFC miss rate goes up from ~4% at
>5000 parallel flows to ~25% at 150K flows, so even for large number of flows
>most still hit DFC. The cost of traditional DPCLS lookup (i.e. subtable search)
>must be very high to cause a big degradation.
>
[Wang, Yipeng] We agree that a DFC hit performs better than a CD hit, but CD usually has a higher
hit rate for large numbers of flows, as the data below shows.

>Can you specify your test setup in more detail? What kind of DFC miss rates do
>you measure depending on the flow rate? Can you publish your
>measurement results?
>
[Wang, Yipeng] We used the tests/rules we posted with our CD patch. Basically we vary src_IP to hit different subtables,
and then vary dst_IP to create various numbers of flows. We use Spirent to generate src_IP from
1.0.0.0 to 20.0.0.0 depending on the subtable count, and dst_IP from 0.0.0.0 up to a certain value depending on the
flow count. It is similar to your traffic pattern with varying UDP port numbers.
We used your proposed EMC design for both schemes. Here is the performance ratio we collected:

throughput ratio: CD to DFC (both have 1M entries; CD costs 4MB while DFC 8MB, THP on).
table cnt/flow cnt	1	3	5	10	20
10k			1.00	1.00	0.85	1.00	1.00
100k			0.81	1.15	1.17	1.35	1.55
1M			0.80	1.12	1.31	1.37	1.63
                                                                         
>Do you have an updated version of your CD patch that works with the
>membership library in DPDK 17.11 now that OVS master builds against 17.11?
>
[Wang, Yipeng] We have the code and we can send it to you for testing
if you would like. But since we now think it is better to combine the benefits of both DFC and CD,
it would be better to post a more mature patch to the mailing list later.

>We would love to hear your opinion on this and we think the best case is we
>could find a way to harmonize both patches, and find a both scalable and
>efficient way to refactor the datapath.
>
>I would be interested to see your ideas how to combine DFC and CD in a good
>way.
>
[Wang, Yipeng] We are thinking of using the indirect table to store either a pointer to the megaflow (like DFC),
or a pointer to the subtable (like CD). The heuristic will depend on the number of active megaflows and the locality
of the accesses. This way, we could keep the smaller size of CD thanks to the indirect table, and its higher hit rate,
while avoiding the DPCLS subtable access, as DFC does.
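
To sketch what we have in mind (a hypothetical layout, not in either patch yet):

/* Hypothetical hybrid entry: a single indirect table whose entries
 * point either at a megaflow (DFC-style) or at a DPCLS subtable
 * (CD-style), chosen per entry by a heuristic. */
struct hybrid_entry {
    uint32_t signature;      /* Partial hash for cheap early rejection. */
    uint16_t is_subtable;    /* 1: 'index' names a subtable (CD mode);
                              * 0: 'index' names a megaflow (DFC mode). */
    uint16_t index;          /* Index into an indirect pointer array,
                              * which keeps the entry itself small. */
};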

>In principle I think CD acceleration of DPCLS lookup could complement DFC but
>I am a bit concerned about the combined memory and cache footprint of EMC,
>DFC and CD. Even for EMC+DFC I have some doubts here. This should be
>evaluated in multi-core setups of OVS and with real VNFs/applications in the
>guests that exert a significant L3 cache contention.
>
[Wang, Yipeng] The cache footprint was our concern too when we worked on CD; that is why we store
a subtable index instead of a pointer to the megaflow. With the hybrid model mentioned
above, I think we can get a solution that is both memory efficient and scalable.

>Also, as all three are based on the same RSS hash as key, isn't there a
>likelihood of hash collisions hitting all three in the same way? I am thinking
>about packets that have little/no entropy in the outer headers (e.g. GRE
>tunnels).
>
[Wang, Yipeng] If there is no entropy within the range considered by the RSS, they
will collide. But this is another case where the 2-hash EMC may help compared to
1-hash, right? Keys with the same signature have an alternative location to go to.

>Thanks
>Yipeng
>
William Tu Dec. 29, 2017, 10:36 p.m. UTC | #8
On Tue, Dec 19, 2017 at 3:08 PM, Wang, Yipeng1 <yipeng1.wang@intel.com> wrote:
> [snip: full quote of Yipeng's previous message]

Hi Yipeng,

I like the idea of using Cuckoo Distributer, but I also have some
doubts about the RSS hash collision. (Sorry for jumping into the
discussion.)

If there are rules using fields outside the 5-tuple, for example:
rule1 match: 5-tuple and vlan_id >= 1024
rule2 match: 5-tuple and vlan_id < 1024
Then rule1 will have it's unique mask in subtable1 and rule2 has
another mask in subtable2.

For packets with the same 5-tuple, but different vlan_id, they will
collide into the same CD hash table, and potentially pointing to the
wrong subtable. Since the current OVS-CD implementation (if I
understand correctly) will overwrite the existing CD hash entry, if
the traffic have patterns like one packet matching rule1 and another
matching rule2, and so on, then the lookup miss for rule1 will insert
entry pointing to subtable1, and the second packet targeting rule2
will miss again and insert entry pointing to subtable2, and so on. In
the end the miss rate will be 100% in such a worst case?

Regards,
William
Wang, Yipeng1 Jan. 2, 2018, 6:45 p.m. UTC | #9
Hi, William,

Thanks for the comment. You are right. If the RSS hash does not consider the fields that matter, the situation you mentioned could happen.

There are two design options for CD, as you may have found: when the signatures collide, we could either replace the existing entry (the current design), or still insert the entry into the cache. If we chose the second design, I think the situation you mentioned could be alleviated. We chose the first one mostly because of its simplicity and speed in the hit case. For example, if we allow multiple entries with the same signature to stay in one bucket, then the lookup function needs to iterate over all the entries in a bucket to find all the matches (for the scalar version), and additional loops and variables are required to iterate over all the matches. We expected some throughput impact for cache-hit cases.

But as you suggested, if the situation you mentioned is very common in real use cases, and RSS does not consider the VLAN id, we could choose not to overwrite. Another option is to reduce the insertion rate (or turn off CD) when CD's miss ratio is high (this is similar to the EMC conditional insertion); then the 100% miss ratio case can be alleviated, and this is an easy change for CD. Or we could use a software hash together with RSS to take the VLAN tag into account; this could benefit the EMC too, I guess.
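
The throttled insertion could look much like the existing EMC conditional insertion, e.g. (a rough sketch; struct cd_table and cd_overwrite_entry() are placeholders, random_uint32() is the existing OVS helper):

/* Sketch: insert into CD only with probability 1/inv_prob so that a
 * thrashing workload cannot evict a useful entry on every miss. */
static inline void
cd_conditional_insert(struct cd_table *cd, uint32_t hash,
                      uint16_t subtable_idx, uint32_t inv_prob)
{
    uint32_t min = inv_prob ? UINT32_MAX / inv_prob : 0;

    if (min && random_uint32() <= min) {
        cd_overwrite_entry(cd, hash, subtable_idx);
    }
}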

There are many design options and trade-offs, but we eventually want to have a design that works for most use cases.

Thanks
Yipeng

>[snip: quote of William's message above]
William Tu Jan. 3, 2018, 1:32 a.m. UTC | #10
Hi Yipeng,

Thanks for the reply.

On Tue, Jan 2, 2018 at 10:45 AM, Wang, Yipeng1 <yipeng1.wang@intel.com> wrote:
> Hi, William,
>
> Thanks for the comment. You are right. If the RSS hash does not consider the fields that matter, the situation you mentioned could happen.
>
> There are two design options for CD as you may find, when the signatures collide, we could either replace the existing entry (the current design), or still insert the entry into the cache. If we chose the second design, I think the situation you mentioned could be alleviated. We chose the first one mostly because of its simplicity and speed for the hit case. For example, if we allow multiple entries with the same signature stay in one bucket, then the lookup function needs to iterate all the entries in a bucket to find all the matches (for scalar version). And additional loops and variables are required to iterate all the matches. We expected to see some percentage of throughput influence for cache hit cases.

I think the cost of having multiple entries with the same signature is
too high; basically the CD lookup time increases from O(1) to O(n),
where n is the bucket size.

>
> But as you suggested, if the situation you mentioned is very common in real use cases, and RSS does not consider the vlan id, we could choose to not overwrite. Another option is to reduce the insertion rate (or turn off CD) as CD's miss ratio is high (this is similar to the EMC conditional insertion). Then the 100% miss ratio case can be alleviated. This is an easy change for CD. Or we could use software hash together with RSS to consider vlan tag, this could benefit EMC too I guess.
>
> There are many design options and trade-offs but we eventually want to have a design that work for most use cases.

I don't have any traffic dataset, but I would assume it's pretty
common that multiple tunneling protocols are deployed. That said, the
RSS hash, which is based on the outer-header 5-tuple, might show little
variation and cause high collision rates when flows match on fields
such as the VXLAN VNI or the Geneve metadata field. Matching the inner
packets requires recirculation, so the RSS of the inner 5-tuple comes
from the CPU, and I guess CD's hit rate is higher for inner packets.

The DFC (datapath flow cache) patch seems to have similar drawbacks.
The fundamental issue seems to be the choice of hash function (RSS),
which only covers the 5-tuple. Can we configure the RSS hash to hash on
more fields when subtables use more than the 5-tuple?

Regards,
William
Fischetti, Antonio Jan. 3, 2018, 9:12 a.m. UTC | #11
Thanks William for your comments.
Some replies inline.

-Antonio

> -----Original Message-----
> From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-
> bounces@openvswitch.org] On Behalf Of William Tu
> Sent: Wednesday, January 3, 2018 1:32 AM
> To: Wang, Yipeng1 <yipeng1.wang@intel.com>
> Cc: dev@openvswitch.org; Tai, Charlie <charlie.tai@intel.com>
> Subject: Re: [ovs-dev] [PATCH] dpif-netdev: Refactor datapath flow cache
> 
> Hi Yipeng,
> 
> Thanks for the reply.
> 
> On Tue, Jan 2, 2018 at 10:45 AM, Wang, Yipeng1 <yipeng1.wang@intel.com>
> wrote:
> > Hi, William,
> >
> > Thanks for the comment. You are right. If the RSS hash does not
> consider the fields that matter, the situation you mentioned could
> happen.
> >
> > There are two design options for CD as you may find, when the
> signatures collide, we could either replace the existing entry (the
> current design), or still insert the entry into the cache. If we chose
> the second design, I think the situation you mentioned could be
> alleviated. We chose the first one mostly because of its simplicity and
> speed for the hit case. For example, if we allow multiple entries with
> the same signature stay in one bucket, then the lookup function needs to
> iterate all the entries in a bucket to find all the matches (for scalar
> version). And additional loops and variables are required to iterate all
> the matches. We expected to see some percentage of throughput influence
> for cache hit cases.
> 
> I think the cost of having multiple entries with the same signature is
> too high, basically the CD lookup time increase from O(1) to O(n),
> where n is the bucket size.
> 
> >
> > But as you suggested, if the situation you mentioned is very common in
> real use cases, and RSS does not consider the vlan id, we could choose
> to not overwrite. Another option is to reduce the insertion rate (or
> turn off CD) as CD's miss ratio is high (this is similar to the EMC
> conditional insertion). Then the 100% miss ratio case can be alleviated.
> This is an easy change for CD. Or we could use software hash together
> with RSS to consider vlan tag, this could benefit EMC too I guess.

[Antonio] Hmm, I expect that adding some software hash computation would
kill the performance.


> >
> > There are many design options and trade-offs but we eventually want to
> have a design that work for most use cases.
> 
> I don't have any traffic dataset, but I would assume it's pretty
> common that multiple tunneling protocols are deployed. That said, the
> RSS hash, which is based on outer-header 5-tuple, might have little
> difference and cause high collision when flows try to match fields
> such as vxlan vni, or geneve metadata field. Matching the inner
> packets requires recirculation, so the rss of inner 5-tuple come from
> cpu, and I guess the CD's hit rate is higher for inner packets.
> 
> The DFC (datapath flow cache) patch seems to have similar drawbacks?

[Antonio] This is a general issue due to the fact that RSS hash is 
computed on a pre-defined 5-tuple of the outer header. So it's not
specific to the CD or the DFC implementations. For sure it affects
the EMC.


> The fundamental issue seems to be the choice of hash function (RSS),
> which only covers 5-tuple. Can we configure the rss hash to hash on
> more fields when subtables uses more than 5-tuple?

[Antonio] 
> 
> Regards,
> William
Wang, Yipeng1 Jan. 5, 2018, 1:41 a.m. UTC | #12
I agree with you guys on the issue with the RSS hash.

RSS was not originally designed for hash tables, so it is not flexible enough to be configured
in the NICs. Smart NICs will be more capable of offloading the hash calculation, I think.

Another thing is that I think it will not be very costly to combine RSS with a software hash.
We just need to use the RSS hash together with the VLAN field to calculate a new hash (using CRC hash,
which is fast). It is like what recirculation does. We do not need to hash the 5-tuple again in software.
Users could then choose which fields to include in the hash for the hash table structures in OvS.

What do you think?
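
As a rough sketch of what I mean (assuming the hash_add()/hash_finish() helpers from lib/hash.h, which use CRC32 instructions on SSE4.2 builds):

/* Sketch: fold a field that RSS does not cover (here the VLAN TCI)
 * into the NIC-provided RSS hash before using it as the cache key. */
static inline uint32_t
flow_cache_hash(uint32_t rss_hash, ovs_be16 vlan_tci)
{
    return hash_finish(hash_add(rss_hash, (uint32_t) ntohs(vlan_tci)), 8);
}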

>[snip: full quote of the previous message]
Jan Scheurich Jan. 5, 2018, 3:12 p.m. UTC | #13
Hi Yipeng,

Thank you very much for your comments. Please find my answers inline.

Regards, Jan

> [snip: quoted headers and earlier context]
> >For the EMC part, we wonder if you enabled transparent huge pages (THP)
> >during the test. For our test case, the new EMC only gives a little speedup
> >if THP is enabled, since with huge pages, reducing the EMC entry size does
> >not bring much benefit.
> >Also, reducing 2-hash to 1-hash actually harms certain traffic patterns we
> >tested. I guess the optimization will largely depend on traffic patterns.
> >Another question is that it seems when the EMC lookup does
> >"netdev_flow_key_equal_mf", the key length is not initialized yet. Thus, the
> >key comparison is not actually done correctly. Could you please double check?
> >
> >[Jan] Yes, THP is enabled on my test setup, but I have doubts that it
> >significantly boosts the performance of the DFC/EMC by transparently
> >allocating that memory on a hugepage. Do you have a means to check that on
> >a running OVS?
> 
> [Wang, Yipeng] In my test, I compared the proposed EMC with the current EMC with the same 16k entries.
> If I turned off THP, the current EMC will cause many TLB misses because of its larger entry size, which I profiled with VTune.
> Once I turned on THP with no other changes, the current EMC's throughput increases a lot and is comparable with the newly
> proposed EMC. From VTune, the EMC lookup TLB misses decreased from 100 million to 0 during the 30sec profiling time.
> So if THP is enabled, reducing the EMC entry size may not give much benefit compared to the current EMC.
> It is worth mentioning that both use a similar amount of CPU cache since only the miniflow struct is accessed by the CPU,
> thus the TLB should be the major concern.

I understand your point. But I can't seem to reproduce the effect of THP on my system.
I don't have VTune available, but I guess "perf stat" should also provide TLB miss
statistics.

How can you check if ovs-vswitchd is using transparent huge pages for backing 
e.g. the EMC memory?

> 
> >My primary goal when I chose to change the EMC implementation from 2-way
> >to 1-way associativity was to simplify the code. In my tests I have not seen any
> >benefit of having two possible locations for an EMC entry. As far as I can see
> >there is no theoretical reason why we should expect systematic collisions of
> >pairs of flows that would justify such a design. There may well be specific
> >traffic patterns that benefit from 2-way EMC, but then there are obviously
> >others for which 1-way performs better. When in doubt, I believe we should
> >choose the simpler design.
> >
> [Wang, Yipeng] Yes, there are no systematic collisions. However, in general,
> a 1-hash table tends to cause many more misses than a 2-hash one. For code simplicity,
> I agree that 1-hash is simpler and much easier to understand. For performance,
> if the flows can fit in the 1-hash table, they should also stay in the primary location of the 2-hash table,
> so basically they should have similar lookup speed. For large numbers of flows in general,
> traffic will have a higher miss ratio in a 1-hash than in a 2-hash table. In one of our tests with
> 10k flows and 3 subtables (test cases described later), with the EMC sized for 16k entries,
> the 2-hash EMC shows about a 14% miss ratio, while the 1-hash EMC shows a 47% miss ratio.

I agree that a lower EMC hit rate is a concern with just DPCLS or CD+DPCLS as second stage.
But with DFC the extra cost for a miss on EMC is low, as the DFC lookup cost is only slightly
higher than the EMC's. The EMC miss is cheap as it will typically already be detected when
comparing the full RSS hash.

Furthermore, the EMC is now mainly meant to speed up the biggest elephant
flows, so it can be smaller, and thrashing is avoided by a very low insertion probability.
Simplistic benchmarks using a large number of "eternal" flows with equidistantly spaced
packets are really an unrealistic worst case for any cache-based architecture.

> 
> >Regarding "netdev_flow_key_equal_mf", there is no difference to the
> >baseline. The key length is taken from the stored EMC entry, not the packet's
> >flow key.
> >
> [Wang, Yipeng] Sorry that I did not explain it correctly. It seems the key length may not be initialized
> when emc_probabilistic_insert is called. If I am correct, the EMC entry does not contain the right
> key length and the EMC lookup does not use the correct length. This is because the EMC insert
> now happens during dfc_lookup but originally it happened after fast_path_processing.
> Do I understand it correctly?

You may be right. I will double-check that and fix it if necessary.
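
If it is indeed uninitialized, a minimal sketch of a fix could be to compute the
length lazily at insertion time, reusing the existing netdev_flow_key_size() and
miniflow_n_values() helpers (untested, just to illustrate the idea):

static inline void
emc_probabilistic_insert(struct emc_cache *emc,
                         const struct netdev_flow_key *key,
                         struct dp_netdev_flow *flow)
{
    const uint32_t threshold = UINT32_MAX / 1000;

    if (random_uint32() <= threshold) {
        struct netdev_flow_key *k =
            CONST_CAST(struct netdev_flow_key *, key);
        if (!k->len) {
            /* 'len' is only computed lazily in dfc_processing(), so it can
             * still be 0 here; without this the EMC would store and later
             * compare a key of bogus length. */
            k->len = netdev_flow_key_size(miniflow_n_values(&key->mf));
        }
        emc_change_entry(emc_entry_get(emc, key->hash), key, flow);
    }
}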

> 
> >For the DFC cache part, we compared with the CD patch we presented at the
> >OvS conference. We saw CD begin to perform better than DFC with 2 or
> >more subtables, and ~60% higher throughput with 20 subtables, especially
> >when the flow count is large (around 100k or more). We found CD to be a more
> >scalable implementation w.r.t. subtable count and flow count. Part of the
> >reason is that the 1-hash table of DFC does not perform consistently with
> >various traffic patterns, and easily causes conflict misses. DFC's advantage
> >shows up when all flows hit DFC or only 1 subtable exists. We are currently
> >thinking about combining both approaches, for example in a hybrid model.
> >
> >[Jan] A DFC hit will always be faster than a CD hit because the latter involves
> >a DPCLS subtable lookup. In my tests the DFC miss rate goes up from ~4% at
> >5000 parallel flows to ~25% at 150K flows, so even for large numbers of flows
> >most still hit the DFC. The cost of a traditional DPCLS lookup (i.e. subtable search)
> >must be very high to cause a big degradation.
> >
> [Wang, Yipeng] We agree that a DFC hit performs better than a CD hit, but CD usually has a higher
> hit rate for large numbers of flows, as the data shows later.

That is something I don't yet understand. Is this because CD stores
up to 16 entries per hash bucket and handles collisions better?

> 
> >Can you specify your test setup in more detail? What kind of DFC miss rates do
> >you measure depending on the flow rate? Can you publish your
> >measurement results?
> >
> [Wang, Yipeng] We use the test/rules we posted with our CD patch. Basically we vary src_IP to hit different subtables,
> and then vary dst_IP to create various numbers of flows. We use Spirent to generate src_IP from
> 1.0.0.0 to 20.0.0.0 depending on the subtable count, and dst_IP from 0.0.0.0 to a certain value depending on the
> flow count. It is similar to your traffic pattern with varying UDP port numbers.
> We use your proposed EMC design for both schemes. Here is the performance ratio we collected:
> 
> throughput ratio: CD to DFC (both have 1M entries; CD costs 4 MB while DFC costs 8 MB, THP on).
> table cnt/flow cnt	1	3	5	10	20
> 10k			1.00	1.00	0.85	1.00	1.00
> 100k			0.81	1.15	1.17	1.35	1.55
> 1M			0.80	1.12	1.31	1.37	1.63

I assume this is 10k/100k/1M flows in total, independent of the number of subtables, right?

The degradation of DFC for large flow counts and many subtables comes from the increasing
cost of linear DPCLS searches after DFC misses. I wonder how CD can avoid a similar number of
misses with the same number of CD entries. Is this just because of the 16 entries per bucket?

> 
> >Do you have an updated version of your CD patch that works with the
> >membership library in DPDK 17.11 now that OVS master builds against 17.11?
> >
> [Wang, Yipeng] We have the code and we can send it to you for testing
> if you would like. But since we now think it is better to combine the benefits of both DFC and CD,
> it would be better to post a more mature patch on the mailing list later.
> 
> >We would love to hear your opinion on this, and we think the best case is that we
> >could find a way to harmonize both patches, and find a scalable and
> >efficient way to refactor the datapath.
> >
> >I would be interested to see your ideas on how to combine DFC and CD in a good
> >way.
> >
> [Wang, Yipeng] We are thinking of using an indirect table to store either the pointer to the megaflow (like DFC),
> or the pointer to the subtable (like CD). The heuristic will depend on the number of active megaflows and the locality
> of the accesses. This way, we could keep the smaller size of CD by using the indirect table, and a higher hit rate,
> while avoiding the DPCLS subtable access, like DFC does.

Yes, I was thinking about that kind of approach. Can you explain the concept of "indirect table"?
How does that save memory?

> 
> >In principle I think CD acceleration of DPCLS lookup could complement DFC but
> >I am a bit concerned about the combined memory and cache footprint of EMC,
> >DFC and CD. Even for EMC+DFC I have some doubts here. This should be
> >evaluated in multi-core setups of OVS and with real VNFs/applications in the
> >guests that exert significant L3 cache contention.
> >
> [Wang, Yipeng] The cache footprint was our concern too when we worked on CD; thus we store
> a subtable index instead of a pointer to the megaflow. With the hybrid model mentioned
> above, I think it would be a better solution that is both memory efficient and scalable.
> 
> >Also, as all three are based on the same RSS hash as key, isn't there a
> >likelihood of hash collisions hitting all three in the same way? I am thinking
> >about packets that have little/no entropy in the outer headers (e.g. GRE
> >tunnels).
> >
> [Wang, Yipeng] If there is no entropy within the range considered by the RSS, they
> will collide. But this is another case where the 2-hash EMC may help compared to
> 1-hash, right? Keys with the same signature have an alternative location to go to.

The problem with too little or no entropy in the RSS hash is that many or all flows are
mapped to the same RSS hash value, so a two-way EMC won't help at all. Here only 
a better suited RSS hash algorithm on the NIC can be the solution. Many NICs today
offer possibilities to tailor the RSS hashing to specific needs. But the question is still
how to set this up.

As a work-around one could consider falling back to a SW hash in OVS-DPDK if the
HW RSS hash is empirically found to be inadequate (extreme collision rate).
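
To make the work-around concrete, a rough sketch of such a fallback (the
rss_is_bad heuristic is hypothetical; miniflow_hash_5tuple() and the dp_packet
hash accessors are existing OVS helpers):

#include "dp-packet.h"   /* dp_packet_get_rss_hash(), dp_packet_set_rss_hash() */
#include "flow.h"        /* miniflow_hash_5tuple() */

static inline uint32_t
packet_hash_with_sw_fallback(struct dp_packet *packet,
                             const struct miniflow *mf,
                             bool rss_is_bad)
{
    uint32_t hash = dp_packet_get_rss_hash(packet);

    if (rss_is_bad || !hash) {
        /* Recompute in software and cache it back in the packet so the
         * cost is paid only once per packet. */
        hash = miniflow_hash_5tuple(mf, 0);
        dp_packet_set_rss_hash(packet, hash);
    }
    return hash;
}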

Regards, Jan
William Tu Jan. 5, 2018, 4:15 p.m. UTC | #14
On Thu, Jan 4, 2018 at 5:41 PM, Wang, Yipeng1 <yipeng1.wang@intel.com> wrote:
> I agree with you guys on the issue with RSS hash.
>
> RSS was not originally designed for hash tables, so it is not flexible enough to be configured
> in the NICs. Smart NICs will be more capable of offloading hash calculation etc., I think.
>
> Another thing is that I think it will not be very costly if we combine RSS with a software hash.
> We just need to use the RSS hash together with the VLAN field to calculate a new hash (using CRC hash,
> which is fast). It is like what recirculation does. We do not need to hash the 5-tuple again in software.
> Users can choose which fields they want to include in the hash for the hash table structures in OvS.
>
> What do you think?
>

I like the idea of an incremental hash using CRC. As for adding new
fields, can we do it on demand without relying on the user to choose which
fields?

Assume that in the beginning the 5-tuple covers the union of all fields
used in all subtables, so RSS works fine. As subtables increase, there
might be new subtables using new fields, such as the VLAN or tunnel
metadata. At the moment (or periodically) when we detect a new subtable
using a new field, we consider adding that field to the hash
function. So for simple traffic using the 5-tuple, RSS remains effective
and unchanged; for tunneled traffic or traffic matching more fields,
this on-demand hashing approach can alleviate the collisions. Of
course, changing the hash function requires rehashing all existing
elements, so we should do it infrequently.
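
To make this concrete, here is a small sketch of the incremental folding
(hash_add() and hash_finish() are existing OVS helpers from lib/hash.h; the
choice of VLAN TCI and tunnel ID as extra fields is just for illustration):

#include "hash.h"   /* hash_add(), hash_finish() */

static inline uint32_t
hash_extend(uint32_t rss_hash, uint16_t vlan_tci, uint32_t tun_id)
{
    /* Fold the extra fields into the existing RSS hash instead of
     * rehashing the whole 5-tuple in software. */
    uint32_t hash = hash_add(rss_hash, vlan_tci);
    hash = hash_add(hash, tun_id);
    return hash_finish(hash, 8);   /* 8 = extra bytes folded in */
}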

Regards,
William
Wang, Yipeng1 Jan. 8, 2018, 9:54 p.m. UTC | #15
I think this is an interesting idea.

One caveat is that in this case we use the rules' fields to infer the flows' fields. If the rule
does not consider a field within which the flow has high entropy, it still does not help.
Or if the rule considers many fields but the flow does not have high entropy there, then
we add the incremental software hash overhead for little benefit.

I am not sure if that is common in real cases; I am more concerned with the second scenario.
What do you think?

We should profile a little bit to see how much overhead the incremental CRC will
cost.
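
In case it helps, here is a quick standalone micro-benchmark I would start
from (my own sketch, x86 with SSE4.2; the serial dependency on 'hash'
measures the per-field folding latency):

/* gcc -O2 -msse4.2 crcbench.c */
#include <nmmintrin.h>   /* _mm_crc32_u32 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    const long iters = 100000000L;
    uint32_t hash = 0x12345678;  /* stand-in for the packet's RSS hash */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        hash = _mm_crc32_u32(hash, (uint32_t) i);  /* fold one field */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per folded field (hash=%08x)\n", ns / iters, hash);
    return 0;
}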

Thanks
Yipeng

William Tu Jan. 9, 2018, 5:25 p.m. UTC | #16
Hi Yipeng,

On Mon, Jan 8, 2018 at 1:54 PM, Wang, Yipeng1 <yipeng1.wang@intel.com> wrote:
> I think this is an interesting idea.
>
> One caveat is that in this case we use the rules' fields to infer the flows' fields. If the rule
> does not consider a field within which the flow has high entropy, it still does not help.
> Or if the rule considers many fields but the flow does not have high entropy there, then
> we add the incremental software hash overhead for little benefit.
>
> I am not sure if that is common in real cases; I am more concerned with the second scenario.
> What do you think?
>
Yes, you're right.
I don't have any good solution, but are we able to observe from the
subtable entries or their statistics whether a field has high
entropy? Then, in your second scenario, we don't have to add all new
fields into the hash but only a subset of them.

> We should profile a little bit to see how much overhead the incremental CRC will
> cost.
>
That would be very helpful!
Thanks
William
Wang, Yipeng1 Jan. 10, 2018, 2:54 a.m. UTC | #17
>-----Original Message-----
>> I think this is an interesting idea.
>>
>> One caveat is that in this case we use the rules' fields to infer the flows' fields. If the rule
>> does not consider a field within which the flow has high entropy, it still does not help.
>> Or if the rule considers many fields but the flow does not have high entropy there, then
>> we add the incremental software hash overhead for little benefit.
>>
>> I am not sure if that is common in real cases; I am more concerned with the second scenario.
>> What do you think?
>>
>Yes, you're right.
>I don't have any good solution, but are we able to observe from the
>subtable entries or their statistics whether a field has high
>entropy? Then, in your second scenario, we don't have to add all new
>fields into the hash but only a subset of them.
>

[Wang, Yipeng]
To profile entropy, I believe sampling the packets into a bloom filter/bitmap could do the job.
But I think eventually people will debate whether the profiling cost and code complexity are worth the effort.
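
For illustration, a very rough sketch of the sampling idea (names are mine;
hash_int() is an existing OVS helper): set a bit per observed field value in a
small bitmap, and if only a handful of bits are ever set, the field has low
entropy and is not worth folding into the hash.

#include "hash.h"    /* hash_int() */
#include <stdint.h>

#define ENTROPY_BITS 4096   /* arbitrary sample size */

struct entropy_sampler {
    unsigned long map[ENTROPY_BITS / (8 * sizeof(unsigned long))];
    unsigned int distinct;  /* approximate count of distinct values seen */
};

static inline void
entropy_sample(struct entropy_sampler *s, uint32_t field_value)
{
    uint32_t b = hash_int(field_value, 0) % ENTROPY_BITS;
    unsigned long *word = &s->map[b / (8 * sizeof *word)];
    unsigned long mask = 1UL << (b % (8 * sizeof *word));

    if (!(*word & mask)) {
        *word |= mask;
        s->distinct++;  /* stays small after many samples => low entropy */
    }
}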

Wang, Yipeng1 Jan. 10, 2018, 4:38 a.m. UTC | #18
Hi Jan, please find my replies inlined.

>-----Original Message-----
>>
>> [Wang, Yipeng] In my test, I compared the proposed EMC with the current EMC with the same 16k entries.
>> If I turned off THP, the current EMC will cause many TLB misses because of its larger entry size, which I profiled with VTune.
>> Once I turned on THP with no other changes, the current EMC's
>> throughput increases a lot and is comparable with the newly proposed EMC. From VTune, the EMC lookup TLB misses decreased from 100 million to 0 during the 30sec profiling time.
>> So if THP is enabled, reducing the EMC entry size may not give much benefit compared to the current EMC.
>> It is worth mentioning that both use a similar amount of CPU cache
>> since only the miniflow struct is accessed by the CPU, thus the TLB should be the major concern.
>
>I understand your point. But I can't seem to reproduce the effect of THP on my system.
>I don't have VTune available, but I guess "perf stat" should also
>provide TLB miss statistics.
>
>How can you check if ovs-vswitchd is using transparent huge pages for 
>backing e.g. the EMC memory?
>

[Wang, Yipeng]
I used master OVS and changed the EMC to 16k entries. I fed 10k or more flows to stress the EMC. With perf, I tried this command:
sudo perf stat -p PID -e dTLB-load-misses
It shows that the TLB misses changed a lot with THP on or off on my machine. VTune shows the emc_lookup function's data separately, though.

To check if THP is used by OvS, I found a Red Hat-suggested command handy:
From: https://access.redhat.com/solutions/46111
grep -e AnonHugePages  /proc/*/smaps | awk  '{ if($2>4) print $0} ' |  awk -F "/"  '{print $0; system("ps -fp " $3)} '
I don't know how to check each individual function, though.

>>
>> [Wang, Yipeng] Yes, there are no systematic collisions. However,
>> in general, a 1-hash table tends to cause many more misses than a 2-hash one.
>> For code simplicity, I agree that 1-hash is simpler and much easier
>> to understand. For performance, if the flows can fit in the 1-hash table,
>> they should also stay in the primary location of the 2-hash table, so
>> basically they should have similar lookup speed. For large numbers of
>> flows in general, traffic will have a higher miss ratio in a 1-hash than in a
>> 2-hash table. In one of our tests with 10k flows and 3 subtables (test cases described later), with the EMC sized for 16k entries, the 2-hash EMC shows about a 14% miss ratio, while the 1-hash EMC shows a 47% miss ratio.
>
>I agree that a lower EMC hit rate is a concern with just DPCLS or CD+DPCLS as second stage.
>But with DFC the extra cost for a miss on EMC is low, as the DFC lookup cost
>is only slightly higher than the EMC's. The EMC miss is cheap as it will
>typically already be detected when comparing the full RSS hash.
>
>Furthermore, the EMC is now mainly meant to speed up the biggest
>elephant flows, so it can be smaller, and thrashing is avoided by a very low insertion probability.
>Simplistic benchmarks using a large number of "eternal" flows with
>equidistantly spaced packets are really an unrealistic worst case for any cache-based architecture.
>

[Wang, Yipeng]
If realistic traffic patterns mostly hit the EMC with elephant flows, I agree that the EMC could be simplified.

>>
>> [Wang, Yipeng] We agree that a DFC hit performs better than a CD hit,
>> but CD usually has a higher hit rate for large numbers of flows, as the data shows later.
>
>That is something I don't yet understand. Is this because CD stores
>up to 16 entries per hash bucket and handles collisions better?

[Wang, Yipeng]
Yes, with two hash functions and 16 entries per bucket, CD has far fewer misses in general.

As a first step to combine both CD and DFC, I incorporated the signature and way-associative structure from CD into DFC. I just did a simple prototype without any performance tuning; preliminary results show good improvement in miss ratio and throughput. I will post the complete results soon.

Since DFC/CD is much faster than the megaflow lookup, I believe a higher hit rate is preferable, so a CD-like way-associative structure should be helpful (see the sketch below). The signature per entry also helps performance, with a similar effect to the EMC.
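
To illustrate what I mean (my own simplified sketch, not the actual CD code):
each bucket holds several entries plus a small signature per entry, and a
second candidate bucket is derived cuckoo-style; the exact alternate-bucket
function here is an assumption.

#define CD_WAYS 8

struct cd_bucket {
    uint16_t sig[CD_WAYS];                /* short hash signatures */
    struct dp_netdev_flow *flow[CD_WAYS];
};

static inline struct dp_netdev_flow *
cd_bucket_lookup(const struct cd_bucket *b, uint16_t sig)
{
    for (int i = 0; i < CD_WAYS; i++) {
        if (b->flow[i] && b->sig[i] == sig) {
            return b->flow[i];   /* caller still confirms the full key */
        }
    }
    return NULL;
}

static inline struct dp_netdev_flow *
cd_lookup(const struct cd_bucket bkts[], uint32_t n_bkts, uint32_t hash)
{
    uint16_t sig = hash >> 16;
    struct dp_netdev_flow *f =
        cd_bucket_lookup(&bkts[hash & (n_bkts - 1)], sig);

    if (!f) {
        /* Secondary location, derived from the signature so it can be
         * recomputed when an entry is displaced (assumed scheme). */
        uint32_t alt = hash ^ (sig * 0x5bd1e995u);
        f = cd_bucket_lookup(&bkts[alt & (n_bkts - 1)], sig);
    }
    return f;
}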

>>
>> [Wang, Yipeng] We use the test/rules we posted with our CD patch. 
>> Basically we vary src_IP to hit different subtables, and then vary 
>> dst_IP to create various numbers of flows. We use Spirent to generate 
>> src_IP from
>> 1.0.0.0 to 20.0.0.0 depending on the subtable count, and dst_IP from 
>> 0.0.0.0 to a certain value depending on the flow count. It is similar to your traffic pattern with varying UDP port numbers.
>> We use your proposed EMC design for both schemes. Here is the performance ratio we collected:
>>
>> throughput ratio: CD to DFC (both have 1M entries; CD costs 4 MB while DFC costs 8 MB, THP on).
>> table cnt/flow cnt	1	3	5	10	20
>> 10k			1.00	1.00	0.85	1.00	1.00
>> 100k			0.81	1.15	1.17	1.35	1.55
>> 1M			0.80	1.12	1.31	1.37	1.63
>
>I assume this is 10k/100k/1M flows in total, independent of the number of subtables, right?
>
[Wang, Yipeng] Yes.

>The degradation of DFC for large flow counts and many subtables comes
>from the increasing cost of linear DPCLS searches after DFC misses. I
>wonder how CD can avoid a similar number of misses with the same number of CD entries. Is this just because of the 16 entries per bucket?
>

[Wang, Yipeng]
The way-associative design with 2-hash helps a lot on miss ratio.
More ways usually solves conflict misses better. For throughput difference, another reason is that since DFC does not have a signature, the DFC miss case is more expensive because the rule key needs to be accessed. 
We also found the better memory efficiency helped CD too, but with THP on and no VMs to compete CPU cache, it becomes a second order factor and not very critical in this test.

>>
>> [Wang, Yipeng] We have the code and we can send it to you for testing if
>> you would like. But since we now think it is better to combine the
>> benefits of both DFC and CD, it would be better to post a more mature patch on the mailing list later.
>>
>> >We would love to hear your opinion on this and we think the best 
>> >case is we could find a way to harmonize both patches, and find a
>> >scalable and efficient way to refactor the datapath.
>> >
>> >I would be interested to see your ideas on how to combine DFC and CD in
>> >a good way.
>> >
>> [Wang, Yipeng] We are thinking of using an indirect table to store
>> either the pointer to the megaflow (like DFC), or the pointer to the
>> subtable (like CD). The heuristic will depend on the number of active
>> megaflows and the locality of the accesses. This way, we could keep the smaller size of CD by using the indirect table, and a higher hit rate, while avoiding the DPCLS subtable access, like DFC does.
>
>Yes, I was thinking about that kind of approach. Can you explain the concept of "indirect table"?
>How does that save memory?
>

[Wang, Yipeng]
We designed the indirect table based on the fact that rules/subtables are far fewer than flows. It basically stores the pointers to rules/subtables, so the CD/DFC entries can store a much shorter index into the indirect table rather than the full 8-byte pointer. A sketch follows below.

With a 1M-entry CD/DFC and multi-threading, using an indirect table with smaller CD/DFC entries could save a lot of CPU cache.
As you mentioned, if there are VMs competing for the LLC, this will be critical for performance.
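
As a sketch of the layout (names and sizes are illustrative only): with a
2-byte signature and a 2-byte index, each cache entry is 4 bytes, which gives
the 4 MB footprint quoted above for 1M entries versus 8 MB for raw pointers.

#define ITBL_SIZE 4096          /* rules + subtables are few vs. flows */

struct indirect_table {
    void *ptrs[ITBL_SIZE];      /* dp_netdev_flow or dpcls subtable */
};

struct cache_entry {
    uint16_t sig;               /* signature to filter misses cheaply */
    uint16_t index;             /* 2-byte index instead of an 8-byte pointer */
};                              /* 4 B/entry -> 4 MB for 1M entries */

static inline void *
cache_entry_deref(const struct indirect_table *tbl,
                  const struct cache_entry *e)
{
    return tbl->ptrs[e->index];
}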

Please let me know what you think.

Thanks!
Yipeng
Wang, Yipeng1 Jan. 11, 2018, 8:48 p.m. UTC | #19
>-----Original Message-----
>[Wang, Yipeng]
>Yes, with two hash functions and 16 entries per bucket, CD has far fewer misses in general.
>
>As a first step to combine both CD and DFC, I incorporated the signature and way-associative structure from CD into DFC. I just did
>a simple prototype without any performance tuning; preliminary results show good improvement in miss ratio and throughput. I will
>post the complete results soon.
>

[Wang, Yipeng] I implemented 8 ways/bucket with a signature per entry. Here are some preliminary results (not optimized). There is some additional overhead, but in general it helps a lot with larger numbers
of flows. With 1 rule/subtable and only 1 subtable, the 1-way DFC without signatures always has a 100% hit rate and is thus faster, but we assume that is not the common use case. For other cases, more ways
and a signature per entry should give better performance in general, and perform more consistently w.r.t. different traffic patterns.

throughput improvement with 8 ways/bucket and signatures
1rule/subtable					
flow/subtable	1	2	3	5	10	20
10k		1.00	0.93	1.18	0.96	0.99	1.00
100k		0.98	1.20	1.45	1.46	1.81	2.41
1M		0.85	1.25	1.32	1.40	1.43	1.09
1.5M		0.76	1.05	1.16	1.27	1.33	1.10
						
10 rule/subtable					
flow/subtable	1	2	3	5	10	20
10k		1.00	0.97	1.00	1.00	1.00	1.00
100k		1.00	0.96	1.08	1.06	1.11	1.16
1M		1.04	1.06	1.09	1.13	1.20	1.27
1.5M		1.04	1.05	1.07	1.09	1.11	1.14
Bodireddy, Bhanuprakash Feb. 16, 2018, 2:59 p.m. UTC | #20
>
>>-----Original Message-----
>>>
>>> [Wang, Yipeng] In my test, I compared the proposed EMC with the current
>>> EMC with the same 16k entries.
>>> If I turned off THP, the current EMC will cause many TLB misses because of
>>> its larger entry size, which I profiled with VTune.
>>> Once I turned on THP with no other changes, the current EMC's
>>> throughput increases a lot and is comparable with the newly proposed
>>> EMC. From VTune, the EMC lookup TLB misses decreased from 100 million to
>>> 0 during the 30sec profiling time.
>>> So if THP is enabled, reducing the EMC entry size may not give much
>>> benefit compared to the current EMC.
>>> It is worth mentioning that both use a similar amount of CPU cache
>>> since only the miniflow struct is accessed by the CPU, thus the TLB should be
>>> the major concern.

[BHANU]
I found this thread on THP interesting and want to share my findings here.
I did some micro-benchmarks on this feature a long time ago and found there was some performance improvement with THP enabled.
Some of this can be attributed to a faster emc_lookup() with THP enabled.

With a large number of flows, emc_lookup() is back-end bound, and further analysis showed that there is significant
DTLB overhead. One way to reduce the overhead is to use larger pages, and with THP the overhead reduced by 40% for this function.

So THP has some positive effect on emc_lookup()!

- Bhanuprakash.

diff mbox series

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index db78318..efcf2e9 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -127,19 +127,19 @@  struct netdev_flow_key {
     uint64_t buf[FLOW_MAX_PACKET_U64S];
 };

-/* Exact match cache for frequently used flows
+/* Datapath flow cache (DFC) for frequently used flows
  *
- * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to
- * search its entries for a miniflow that matches exactly the miniflow of the
- * packet. It stores the 'dpcls_rule' (rule) that matches the miniflow.
+ * The cache uses the 32-bit hash of the packet (which can be the RSS hash) to
+ * directly look up a pointer to the matching megaflow. To check for a match
+ * the packet's flow key is compared against the key and mask of the megaflow.
  *
- * A cache entry holds a reference to its 'dp_netdev_flow'.
- *
- * A miniflow with a given hash can be in one of EM_FLOW_HASH_SEGS different
- * entries. The 32-bit hash is split into EM_FLOW_HASH_SEGS values (each of
- * them is EM_FLOW_HASH_SHIFT bits wide and the remainder is thrown away). Each
- * value is the index of a cache entry where the miniflow could be.
+ * For even faster lookup, the most frequently used packet flows are also
+ * inserted into a small exact match cache (EMC). The EMC uses a part of the
+ * packet hash to look up a miniflow that matches exactly the miniflow of the
+ * packet. The matching EMC also returns a reference to the megaflow.
  *
+ * Flows are promoted from the DFC to the EMC through probabilistic insertion
+ * after successful DFC lookup with minor probability to favor elephant flows.
  *
  * Thread-safety
  * =============
@@ -148,33 +148,38 @@  struct netdev_flow_key {
  * If dp_netdev_input is not called from a pmd thread, a mutex is used.
  */

-#define EM_FLOW_HASH_SHIFT 13
-#define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT)
-#define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)
-#define EM_FLOW_HASH_SEGS 2
+#define DFC_MASK_LEN 20
+#define DFC_ENTRIES (1u << DFC_MASK_LEN)
+#define DFC_MASK (DFC_ENTRIES - 1)
+#define EMC_MASK_LEN 14
+#define EMC_ENTRIES (1u << EMC_MASK_LEN)
+#define EMC_MASK (EMC_ENTRIES - 1)

 /* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB */
-#define DEFAULT_EM_FLOW_INSERT_INV_PROB 100
+#define DEFAULT_EM_FLOW_INSERT_INV_PROB 1000
 #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
                                     DEFAULT_EM_FLOW_INSERT_INV_PROB)

 struct emc_entry {
     struct dp_netdev_flow *flow;
-    struct netdev_flow_key key;   /* key.hash used for emc hash value. */
+    char key[248];   /* Holds struct netdev_flow_key of limited size. */
 };

 struct emc_cache {
-    struct emc_entry entries[EM_FLOW_HASH_ENTRIES];
-    int sweep_idx;                /* For emc_cache_slow_sweep(). */
+    struct emc_entry entries[EMC_ENTRIES];
+    int sweep_idx;
+};
+
+struct dfc_entry {
+    struct dp_netdev_flow *flow;
+};
+
+struct dfc_cache {
+    struct emc_cache emc_cache;
+    struct dfc_entry entries[DFC_ENTRIES];
+    int sweep_idx;
 };

-/* Iterate in the exact match cache through every entry that might contain a
- * miniflow with hash 'HASH'. */
-#define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH)                 \
-    for (uint32_t i__ = 0, srch_hash__ = (HASH);                             \
-         (CURRENT_ENTRY) = &(EMC)->entries[srch_hash__ & EM_FLOW_HASH_MASK], \
-         i__ < EM_FLOW_HASH_SEGS;                                            \
-         i__++, srch_hash__ >>= EM_FLOW_HASH_SHIFT)

 /* Simple non-wildcarding single-priority classifier. */

@@ -214,6 +219,8 @@  static bool dpcls_lookup(struct dpcls *cls,
                          const struct netdev_flow_key keys[],
                          struct dpcls_rule **rules, size_t cnt,
                          int *num_lookups_p);
+static bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
+                                   const struct netdev_flow_key *target);

 /* Set of supported meter flags */
 #define DP_SUPPORTED_METER_FLAGS_MASK \
@@ -332,6 +339,7 @@  static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,

 enum dp_stat_type {
     DP_STAT_EXACT_HIT,          /* Packets that had an exact match (emc). */
+    DP_STAT_DFC_HIT,            /* Packets that had a flow cache hit (dfc). */
     DP_STAT_MASKED_HIT,         /* Packets that matched in the flow table. */
     DP_STAT_MISS,               /* Packets that did not match. */
     DP_STAT_LOST,               /* Packets not passed up to the client. */
@@ -565,7 +573,7 @@  struct dp_netdev_pmd_thread {
      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
      * need to be protected by 'non_pmd_mutex'.  Every other instance
      * will only be accessed by its own pmd thread. */
-    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
+    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache;
     struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */

     /* Queue id used by this pmd thread to send packets on all netdevs if
@@ -744,46 +752,59 @@  dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
 static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
                                       struct tx_port *tx, long long now);

-static inline bool emc_entry_alive(struct emc_entry *ce);
+static inline bool dfc_entry_alive(struct dfc_entry *ce);
 static void emc_clear_entry(struct emc_entry *ce);
+static void dfc_clear_entry(struct dfc_entry *ce);

 static void dp_netdev_request_reconfigure(struct dp_netdev *dp);

 static void
-emc_cache_init(struct emc_cache *flow_cache)
+emc_cache_init(struct emc_cache *emc)
 {
     int i;

-    flow_cache->sweep_idx = 0;
+    for (i = 0; i < ARRAY_SIZE(emc->entries); i++) {
+        emc->entries[i].flow = NULL;
+        struct netdev_flow_key *key =
+                (struct netdev_flow_key *) emc->entries[i].key;
+        key->hash = 0;
+        key->len = sizeof(struct miniflow);
+        flowmap_init(&key->mf.map);
+    }
+    emc->sweep_idx = 0;
+}
+
+static void
+dfc_cache_init(struct dfc_cache *flow_cache)
+{
+    int i;
+
+    emc_cache_init(&flow_cache->emc_cache);
     for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
         flow_cache->entries[i].flow = NULL;
-        flow_cache->entries[i].key.hash = 0;
-        flow_cache->entries[i].key.len = sizeof(struct miniflow);
-        flowmap_init(&flow_cache->entries[i].key.mf.map);
     }
+    flow_cache->sweep_idx = 0;
 }

 static void
-emc_cache_uninit(struct emc_cache *flow_cache)
+emc_cache_uninit(struct emc_cache *emc)
 {
     int i;

-    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
-        emc_clear_entry(&flow_cache->entries[i]);
+    for (i = 0; i < ARRAY_SIZE(emc->entries); i++) {
+        emc_clear_entry(&emc->entries[i]);
     }
 }

-/* Check and clear dead flow references slowly (one entry at each
- * invocation).  */
 static void
-emc_cache_slow_sweep(struct emc_cache *flow_cache)
+dfc_cache_uninit(struct dfc_cache *flow_cache)
 {
-    struct emc_entry *entry = &flow_cache->entries[flow_cache->sweep_idx];
+    int i;

-    if (!emc_entry_alive(entry)) {
-        emc_clear_entry(entry);
+    for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
+        dfc_clear_entry(&flow_cache->entries[i]);
     }
-    flow_cache->sweep_idx = (flow_cache->sweep_idx + 1) & EM_FLOW_HASH_MASK;
+    emc_cache_uninit(&flow_cache->emc_cache);
 }

 /* Returns true if 'dpif' is a netdev or dummy dpif, false otherwise. */
@@ -837,8 +858,8 @@  pmd_info_show_stats(struct ds *reply,
     }

     /* Sum of all the matched and not matched packets gives the total.  */
-    total_packets = stats[DP_STAT_EXACT_HIT] + stats[DP_STAT_MASKED_HIT]
-                    + stats[DP_STAT_MISS];
+    total_packets = stats[DP_STAT_EXACT_HIT] + stats[DP_STAT_DFC_HIT]
+                    + stats[DP_STAT_MASKED_HIT] + stats[DP_STAT_MISS];

     for (i = 0; i < PMD_N_CYCLES; i++) {
         if (cycles[i] > pmd->cycles_zero[i]) {
@@ -862,10 +883,13 @@  pmd_info_show_stats(struct ds *reply,
     ds_put_cstr(reply, ":\n");

     ds_put_format(reply,
-                  "\temc hits:%llu\n\tmegaflow hits:%llu\n"
+                  "\temc hits:%llu\n\tdfc hits:%llu\n"
+                  "\tmegaflow hits:%llu\n"
                   "\tavg. subtable lookups per hit:%.2f\n"
                   "\tmiss:%llu\n\tlost:%llu\n",
-                  stats[DP_STAT_EXACT_HIT], stats[DP_STAT_MASKED_HIT],
+                  stats[DP_STAT_EXACT_HIT],
+                  stats[DP_STAT_DFC_HIT],
+                  stats[DP_STAT_MASKED_HIT],
                   stats[DP_STAT_MASKED_HIT] > 0
                   ? (1.0*stats[DP_STAT_LOOKUP_HIT])/stats[DP_STAT_MASKED_HIT]
                   : 0,
@@ -1492,6 +1516,8 @@  dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
         stats->n_hit += n;
         atomic_read_relaxed(&pmd->stats.n[DP_STAT_EXACT_HIT], &n);
         stats->n_hit += n;
+        atomic_read_relaxed(&pmd->stats.n[DP_STAT_DFC_HIT], &n);
+        stats->n_hit += n;
         atomic_read_relaxed(&pmd->stats.n[DP_STAT_MISS], &n);
         stats->n_missed += n;
         atomic_read_relaxed(&pmd->stats.n[DP_STAT_LOST], &n);
@@ -2111,6 +2137,16 @@  netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key,
     return hash_finish(hash, (p - miniflow_get_values(&mask->mf)) * 8);
 }

+/*
+ * Datapath Flow Cache and EMC implementation
+ */
+
+static inline struct emc_entry *
+emc_entry_get(struct emc_cache *emc, const uint32_t hash)
+{
+    return &emc->entries[hash & EMC_MASK];
+}
+
 static inline bool
 emc_entry_alive(struct emc_entry *ce)
 {
@@ -2127,9 +2163,15 @@  emc_clear_entry(struct emc_entry *ce)
 }

 static inline void
-emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
-                 const struct netdev_flow_key *key)
+emc_change_entry(struct emc_entry *ce, const struct netdev_flow_key *key,
+                 struct dp_netdev_flow *flow)
 {
+    /* We only store small enough flows in the EMC. */
+    size_t key_size = offsetof(struct netdev_flow_key, mf) + key->len;
+    if (key_size > sizeof(ce->key)) {
+        return;
+    }
+
     if (ce->flow != flow) {
         if (ce->flow) {
             dp_netdev_flow_unref(ce->flow);
@@ -2141,73 +2183,148 @@  emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
             ce->flow = NULL;
         }
     }
-    if (key) {
-        netdev_flow_key_clone(&ce->key, key);
-    }
+    netdev_flow_key_clone((struct netdev_flow_key *) ce->key, key);
 }

 static inline void
-emc_insert(struct emc_cache *cache, const struct netdev_flow_key *key,
-           struct dp_netdev_flow *flow)
+emc_probabilistic_insert(struct emc_cache *emc,
+                         const struct netdev_flow_key *key,
+                         struct dp_netdev_flow *flow)
 {
-    struct emc_entry *to_be_replaced = NULL;
-    struct emc_entry *current_entry;
+    const uint32_t threshold = UINT32_MAX/1000;

-    EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
-        if (netdev_flow_key_equal(&current_entry->key, key)) {
-            /* We found the entry with the 'mf' miniflow */
-            emc_change_entry(current_entry, flow, NULL);
-            return;
+    if (random_uint32() <= threshold) {
+
+        struct emc_entry *current_entry = emc_entry_get(emc, key->hash);
+        emc_change_entry(current_entry, key, flow);
+    }
+}
+
+static inline struct dp_netdev_flow *
+emc_lookup(struct emc_cache *emc, const struct netdev_flow_key *key)
+{
+    struct emc_entry *current_entry = emc_entry_get(emc, key->hash);
+    struct netdev_flow_key *current_key =
+            (struct netdev_flow_key *) current_entry->key;
+
+    if (current_key->hash == key->hash
+        && emc_entry_alive(current_entry)
+        && netdev_flow_key_equal_mf(current_key, &key->mf)) {
+
+        /* We found the entry with the 'key->mf' miniflow */
+        return current_entry->flow;
+    }
+    return NULL;
+}
+
+static inline struct dfc_entry *
+dfc_entry_get(struct dfc_cache *cache, const uint32_t hash)
+{
+    return &cache->entries[hash & DFC_MASK];
+}
+
+static inline bool
+dfc_entry_alive(struct dfc_entry *ce)
+{
+    return ce->flow && !ce->flow->dead;
+}
+
+static void
+dfc_clear_entry(struct dfc_entry *ce)
+{
+    if (ce->flow) {
+        dp_netdev_flow_unref(ce->flow);
+        ce->flow = NULL;
+    }
+}
+
+static inline void
+dfc_change_entry(struct dfc_entry *ce, struct dp_netdev_flow *flow)
+{
+    if (ce->flow != flow) {
+        if (ce->flow) {
+            dp_netdev_flow_unref(ce->flow);
         }

-        /* Replacement policy: put the flow in an empty (not alive) entry, or
-         * in the first entry where it can be */
-        if (!to_be_replaced
-            || (emc_entry_alive(to_be_replaced)
-                && !emc_entry_alive(current_entry))
-            || current_entry->key.hash < to_be_replaced->key.hash) {
-            to_be_replaced = current_entry;
+        if (dp_netdev_flow_ref(flow)) {
+            ce->flow = flow;
+        } else {
+            ce->flow = NULL;
         }
     }
-    /* We didn't find the miniflow in the cache.
-     * The 'to_be_replaced' entry is where the new flow will be stored */
-
-    emc_change_entry(to_be_replaced, flow, key);
 }

 static inline void
-emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd,
-                         const struct netdev_flow_key *key,
-                         struct dp_netdev_flow *flow)
+dfc_insert(struct dp_netdev_pmd_thread *pmd,
+           const struct netdev_flow_key *key,
+           struct dp_netdev_flow *flow)
 {
-    /* Insert an entry into the EMC based on probability value 'min'. By
-     * default the value is UINT32_MAX / 100 which yields an insertion
-     * probability of 1/100 ie. 1% */
+    struct dfc_cache *cache = &pmd->flow_cache;
+    struct dfc_entry *current_entry;
+
+    current_entry = dfc_entry_get(cache, key->hash);
+    dfc_change_entry(current_entry, flow);
+}
+
+static inline struct dp_netdev_flow *
+dfc_lookup(struct dfc_cache *cache, const struct netdev_flow_key *key,
+           bool *exact_match)
+{
+    struct dp_netdev_flow *flow;

-    uint32_t min;
-    atomic_read_relaxed(&pmd->dp->emc_insert_min, &min);
+    /* Try an EMC lookup first. */
+    flow = emc_lookup(&cache->emc_cache, key);
+    if (flow) {
+        *exact_match = true;
+        return flow;
+    }
+
+    /* EMC lookup not successful: try DFC lookup. */
+    struct dfc_entry *current_entry = dfc_entry_get(cache, key->hash);
+    flow = current_entry->flow;

-    if (min && random_uint32() <= min) {
-        emc_insert(&pmd->flow_cache, key, flow);
+    if (dfc_entry_alive(current_entry) &&
+        dpcls_rule_matches_key(&flow->cr, key)) {
+
+        /* Found a match in DFC. Insert into EMC for subsequent lookups.
+         * We use probabilistic insertion here so that mainly elephant
+         * flows enter EMC. */
+        emc_probabilistic_insert(&cache->emc_cache, key, flow);
+        *exact_match = false;
+        return flow;
+    } else {
+
+        /* No match. Need to go to DPCLS lookup. */
+        return NULL;
     }
 }

-static inline struct dp_netdev_flow *
-emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
+/* Check and clear dead flow references slowly (one entry at each
+ * invocation).  */
+static void
+emc_slow_sweep(struct emc_cache *emc)
 {
-    struct emc_entry *current_entry;
+    struct emc_entry *entry = &emc->entries[emc->sweep_idx];

-    EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
-        if (current_entry->key.hash == key->hash
-            && emc_entry_alive(current_entry)
-            && netdev_flow_key_equal_mf(&current_entry->key, &key->mf)) {
+    if (!emc_entry_alive(entry)) {
+        emc_clear_entry(entry);
+    }
+    emc->sweep_idx = (emc->sweep_idx + 1) & EMC_MASK;
+}

-            /* We found the entry with the 'key->mf' miniflow */
-            return current_entry->flow;
-        }
+static void
+dfc_slow_sweep(struct dfc_cache *cache)
+{
+    /* Sweep the EMC so that both finish in the same time. */
+    if ((cache->sweep_idx & (DFC_ENTRIES/EMC_ENTRIES - 1)) == 0) {
+        emc_slow_sweep(&cache->emc_cache);
     }

-    return NULL;
+    struct dfc_entry *entry = &cache->entries[cache->sweep_idx];
+    if (!dfc_entry_alive(entry)) {
+        dfc_clear_entry(entry);
+    }
+    cache->sweep_idx = (cache->sweep_idx + 1) & DFC_MASK;
 }

 static struct dp_netdev_flow *
@@ -4048,7 +4165,7 @@  pmd_thread_main(void *f_)
     ovs_numa_thread_setaffinity_core(pmd->core_id);
     dpdk_set_lcore_id(pmd->core_id);
     poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
-    emc_cache_init(&pmd->flow_cache);
+    dfc_cache_init(&pmd->flow_cache);
 reload:
     pmd_alloc_static_tx_qid(pmd);

@@ -4078,17 +4195,16 @@  reload:
                                                       : PMD_CYCLES_IDLE);
         }

-        if (lc++ > 1024) {
-            bool reload;
+        dfc_slow_sweep(&pmd->flow_cache);

+        if (lc++ > 1024) {
             lc = 0;

             coverage_try_clear();
             dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt);
-            if (!ovsrcu_try_quiesce()) {
-                emc_cache_slow_sweep(&pmd->flow_cache);
-            }
+            ovsrcu_try_quiesce();

+            bool reload;
             atomic_read_relaxed(&pmd->reload, &reload);
             if (reload) {
                 break;
@@ -4110,7 +4226,7 @@  reload:
         goto reload;
     }

-    emc_cache_uninit(&pmd->flow_cache);
+    dfc_cache_uninit(&pmd->flow_cache);
     free(poll_list);
     pmd_free_cached_ports(pmd);
     return NULL;
@@ -4544,7 +4660,7 @@  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     /* init the 'flow_cache' since there is no
      * actual thread created for NON_PMD_CORE_ID. */
     if (core_id == NON_PMD_CORE_ID) {
-        emc_cache_init(&pmd->flow_cache);
+        dfc_cache_init(&pmd->flow_cache);
         pmd_alloc_static_tx_qid(pmd);
     }
     cmap_insert(&dp->poll_threads, CONST_CAST(struct cmap_node *, &pmd->node),
@@ -4586,7 +4702,7 @@  dp_netdev_del_pmd(struct dp_netdev *dp, struct dp_netdev_pmd_thread *pmd)
      * but extra cleanup is necessary */
     if (pmd->core_id == NON_PMD_CORE_ID) {
         ovs_mutex_lock(&dp->non_pmd_mutex);
-        emc_cache_uninit(&pmd->flow_cache);
+        dfc_cache_uninit(&pmd->flow_cache);
         pmd_free_cached_ports(pmd);
         pmd_free_static_tx_qid(pmd);
         ovs_mutex_unlock(&dp->non_pmd_mutex);
@@ -4896,7 +5012,7 @@  dp_netdev_queue_batches(struct dp_packet *pkt,
     packet_batch_per_flow_update(batch, pkt, mf);
 }

-/* Try to process all ('cnt') the 'packets' using only the exact match cache
+/* Try to process all ('cnt') the 'packets' using only the PMD flow cache
  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
  * miniflow is copied into 'keys' and the packet pointer is moved at the
  * beginning of the 'packets' array.
@@ -4911,19 +5027,20 @@  dp_netdev_queue_batches(struct dp_packet *pkt,
  * will be ignored.
  */
 static inline size_t
-emc_processing(struct dp_netdev_pmd_thread *pmd,
+dfc_processing(struct dp_netdev_pmd_thread *pmd,
                struct dp_packet_batch *packets_,
                struct netdev_flow_key *keys,
                struct packet_batch_per_flow batches[], size_t *n_batches,
                bool md_is_valid, odp_port_t port_no)
 {
-    struct emc_cache *flow_cache = &pmd->flow_cache;
+    struct dfc_cache *flow_cache = &pmd->flow_cache;
     struct netdev_flow_key *key = &keys[0];
-    size_t n_missed = 0, n_dropped = 0;
+    size_t n_missed = 0, n_dfc_hit = 0, n_emc_hit = 0;
     struct dp_packet *packet;
     const size_t cnt = dp_packet_batch_size(packets_);
     uint32_t cur_min;
     int i;
+    bool exact_match;

     atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);

@@ -4932,7 +5049,6 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,

         if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
             dp_packet_delete(packet);
-            n_dropped++;
             continue;
         }

@@ -4948,7 +5064,7 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
         }
         miniflow_extract(packet, &key->mf);
         key->len = 0; /* Not computed yet. */
-        /* If EMC is disabled skip hash computation and emc_lookup */
+        /* If DFC is disabled skip hash computation and DFC lookup */
         if (cur_min) {
             if (!md_is_valid) {
                 key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
@@ -4956,11 +5072,16 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
             } else {
                 key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
             }
-            flow = emc_lookup(flow_cache, key);
+            flow = dfc_lookup(flow_cache, key, &exact_match);
         } else {
             flow = NULL;
         }
         if (OVS_LIKELY(flow)) {
+            if (exact_match) {
+                n_emc_hit++;
+            } else {
+                n_dfc_hit++;
+            }
             dp_netdev_queue_batches(packet, flow, &key->mf, batches,
                                     n_batches);
         } else {
@@ -4974,8 +5095,8 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
         }
     }

-    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT,
-                           cnt - n_dropped - n_missed);
+    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT, n_emc_hit);
+    dp_netdev_count_packet(pmd, DP_STAT_DFC_HIT, n_dfc_hit);

     return dp_packet_batch_size(packets_);
 }
@@ -5044,7 +5165,7 @@  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
                                              add_actions->size);
         }
         ovs_mutex_unlock(&pmd->flow_mutex);
-        emc_probabilistic_insert(pmd, key, netdev_flow);
+        dfc_insert(pmd, key, netdev_flow);
     }
 }

@@ -5136,7 +5257,7 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,

         flow = dp_netdev_flow_cast(rules[i]);

-        emc_probabilistic_insert(pmd, &keys[i], flow);
+        dfc_insert(pmd, &keys[i], flow);
         dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
     }

@@ -5169,7 +5290,7 @@  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
     odp_port_t in_port;

     n_batches = 0;
-    emc_processing(pmd, packets, keys, batches, &n_batches,
+    dfc_processing(pmd, packets, keys, batches, &n_batches,
                             md_is_valid, port_no);
     if (!dp_packet_batch_is_empty(packets)) {
         /* Get ingress port from first packet's metadata. */
@@ -6063,7 +6184,7 @@  dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)

 /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
  * in 'mask' the values in 'key' and 'target' are the same. */
-static inline bool
+static bool
 dpcls_rule_matches_key(const struct dpcls_rule *rule,
                        const struct netdev_flow_key *target)
 {