From patchwork Mon Apr 2 17:04:40 2018
X-Patchwork-Submitter: "Wang, Yipeng1"
X-Patchwork-Id: 894413
X-Patchwork-Delegate: ian.stokes@intel.com
From: yipeng wang
To: dev@openvswitch.org, jan.scheurich@ericsson.com, u9012063@gmail.com
Date: Mon, 2 Apr 2018 10:04:40 -0700
Message-Id: <1522688683-59140-1-git-send-email-yipeng1.wang@intel.com>
In-Reply-To: <1520364208-54362-1-git-send-email-yipeng1.wang@intel.com>
References: <1520364208-54362-1-git-send-email-yipeng1.wang@intel.com>
Subject: [ovs-dev] [PATCH v2 1/4] dpif-netdev: Refactor datapath flow cache

From: Jan Scheurich

This is a rebase of Jan's previous patch:
[PATCH] dpif-netdev: Refactor datapath flow cache
https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341066.html

So far the netdev datapath uses an 8K EMC to speed up the lookup of
frequently used flows by comparing the parsed packet headers against
the miniflow of a cached flow, using 13 bits of the packet RSS hash as
the index.
The EMC is too small for many applications with 100K or more parallel
packet flows, so that EMC thrashing actually degrades performance.
Furthermore, the size of struct miniflow and the flow copying cost
prevent us from making it much larger. At the same time the lookup cost
of the megaflow classifier (DPCLS) increases as the number of
frequently hit subtables grows with the complexity of the pipeline and
the number of recirculations.

To close the performance gap for many parallel flows, this patch
introduces the datapath flow cache (DFC) with 1M entries as a lookup
stage between EMC and DPCLS. It directly maps 20 bits of the RSS hash
to a pointer to the last-hit megaflow entry and performs a masked
comparison of the packet flow with the megaflow key to confirm the hit.
This avoids the costly DPCLS lookup even for a very large number of
parallel flows, at a small memory overhead.

Due to the large size of the DFC and the low risk of DFC thrashing, any
DPCLS hit immediately inserts an entry into the DFC so that subsequent
packets are sped up. The DFC thus also accelerates short-lived flows.

To further accelerate the lookup of a few elephant flows, every DFC hit
triggers a probabilistic EMC insertion of the flow. As the DFC entry is
already in place, the default EMC insertion probability can be reduced
to 1/1000 to minimize EMC thrashing should there still be many fat
flows. The inverse EMC insertion probability remains configurable.

The EMC implementation is simplified by removing the possibility to
store a flow in two slots, as there is no particular reason why two
flows should systematically collide (the RSS hash is not symmetric).
The maximum size of the EMC flow key is limited to 256 bytes to reduce
the memory footprint. This should be sufficient to hold most real-life
packet flow keys. Larger flows are not installed in the EMC.

The pmd-stats-show command is enhanced to show EMC and DFC hits
separately.

The sweep speed for cleaning up obsolete EMC and DFC flow entries and
freeing dead megaflow entries is increased. With a typical PMD cycle
duration of 100 us under load and checking one DFC entry per cycle, the
DFC sweep should normally complete within 100 s.

In PVP performance tests with an L3 pipeline over VXLAN we determined
the optimal EMC size to be 16K entries to obtain a uniform speedup
compared to the master branch over the full range of parallel flows.
The measurement below is for 64-byte packets, and the average number of
subtable lookups per DPCLS hit in this pipeline is 1.0, i.e. the
acceleration already starts for a single busy mask. Tests with many
visited subtables should show a considerably larger gain from the DFC.

Flows    master   DFC+EMC    Gain
         [Mpps]   [Mpps]
---------------------------------
8        4.45     4.62       3.8%
100      4.17     4.47       7.2%
1000     3.88     4.34      12.0%
2000     3.54     4.17      17.8%
5000     3.01     3.82      27.0%
10000    2.75     3.63      31.9%
20000    2.64     3.50      32.8%
50000    2.60     3.33      28.1%
100000   2.59     3.23      24.7%
500000   2.59     3.16      21.9%

Signed-off-by: Jan Scheurich

dpif-netdev: Fix EMC key length

The EMC key length was not initialized on insertion. Initialize the key
length before insertion.
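As a rough, self-contained illustration of the two-level lookup
described above, here is a minimal C sketch. The toy_* names are
hypothetical, a stored full hash stands in for the patch's masked
megaflow key comparison, and reference counting and the DPCLS itself
are omitted:

#include <stdint.h>
#include <stdlib.h>

#define DFC_BITS 20                    /* 1M direct-mapped DFC entries. */
#define DFC_SIZE (1u << DFC_BITS)
#define EMC_BITS 14                    /* 16K exact-match entries. */
#define EMC_SIZE (1u << EMC_BITS)
#define EMC_INSERT_INV_PROB 1000       /* Promote ~1/1000 DFC hits. */

struct toy_flow {
    uint32_t hash;                     /* Stand-in for the full flow key. */
};

static struct toy_flow *dfc[DFC_SIZE]; /* Filled on DPCLS hits (not shown). */
static struct toy_flow *emc[EMC_SIZE]; /* Small exact-match tier. */

/* Look up 'hash': EMC first, then DFC. A DFC hit probabilistically
 * promotes the flow into the EMC, so mainly elephant flows end up in
 * the small cache. */
static struct toy_flow *
toy_lookup(uint32_t hash)
{
    struct toy_flow *flow = emc[hash & (EMC_SIZE - 1)];

    if (flow && flow->hash == hash) {
        return flow;                            /* EMC hit. */
    }
    flow = dfc[hash & (DFC_SIZE - 1)];
    if (flow && flow->hash == hash) {           /* Patch: masked compare. */
        if (random() % EMC_INSERT_INV_PROB == 0) {
            emc[hash & (EMC_SIZE - 1)] = flow;  /* Probabilistic promote. */
        }
        return flow;                            /* DFC hit. */
    }
    return NULL;                                /* Fall through to DPCLS. */
}

In the real patch a DFC hit is confirmed with dpcls_rule_matches_key()
against the megaflow mask rather than a stored hash, and
emc_probabilistic_insert() compares random_uint32() against a
precomputed threshold instead of using a modulo.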
Signed-off-by: Yipeng Wang --- lib/dpif-netdev-perf.h | 1 + lib/dpif-netdev.c | 371 ++++++++++++++++++++++++++++++++----------------- 2 files changed, 247 insertions(+), 125 deletions(-) diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index 5993c25..157befb 100644 --- a/lib/dpif-netdev-perf.h +++ b/lib/dpif-netdev-perf.h @@ -48,6 +48,7 @@ extern "C" { enum pmd_stat_type { PMD_STAT_EXACT_HIT, /* Packets that had an exact match (emc). */ + PMD_STAT_DFC_HIT, /* Packets that had a flow cache hit (dfc). */ PMD_STAT_MASKED_HIT, /* Packets that matched in the flow table. */ PMD_STAT_MISS, /* Packets that did not match and upcall was ok. */ PMD_STAT_LOST, /* Packets that did not match and upcall failed. */ diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index b07fc6b..cea2bf8 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -128,19 +128,19 @@ struct netdev_flow_key { uint64_t buf[FLOW_MAX_PACKET_U64S]; }; -/* Exact match cache for frequently used flows +/* Datapath flow cache (DFC) for frequently used flows * - * The cache uses a 32-bit hash of the packet (which can be the RSS hash) to - * search its entries for a miniflow that matches exactly the miniflow of the - * packet. It stores the 'dpcls_rule' (rule) that matches the miniflow. + * The cache uses the 32-bit hash of the packet (which can be the RSS hash) to + * directly look up a pointer to the matching megaflow. To check for a match + * the packet's flow key is compared against the key and mask of the megaflow. * - * A cache entry holds a reference to its 'dp_netdev_flow'. - * - * A miniflow with a given hash can be in one of EM_FLOW_HASH_SEGS different - * entries. The 32-bit hash is split into EM_FLOW_HASH_SEGS values (each of - * them is EM_FLOW_HASH_SHIFT bits wide and the remainder is thrown away). Each - * value is the index of a cache entry where the miniflow could be. + * For even faster lookup, the most frequently used packet flows are also + * inserted into a small exact match cache (EMC). The EMC uses a part of the + * packet hash to look up a miniflow that matches exactly the miniflow of the + * packet. The matching EMC also returns a reference to the megaflow. * + * Flows are promoted from the DFC to the EMC through probabilistic insertion + * after successful DFC lookup with minor probability to favor elephant flows. * * Thread-safety * ============= @@ -149,33 +149,38 @@ struct netdev_flow_key { * If dp_netdev_input is not called from a pmd thread, a mutex is used. */ -#define EM_FLOW_HASH_SHIFT 13 -#define EM_FLOW_HASH_ENTRIES (1u << EM_FLOW_HASH_SHIFT) -#define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1) -#define EM_FLOW_HASH_SEGS 2 +#define DFC_MASK_LEN 20 +#define DFC_ENTRIES (1u << DFC_MASK_LEN) +#define DFC_MASK (DFC_ENTRIES - 1) +#define EMC_MASK_LEN 14 +#define EMC_ENTRIES (1u << EMC_MASK_LEN) +#define EMC_MASK (EMC_ENTRIES - 1) /* Default EMC insert probability is 1 / DEFAULT_EM_FLOW_INSERT_INV_PROB */ -#define DEFAULT_EM_FLOW_INSERT_INV_PROB 100 +#define DEFAULT_EM_FLOW_INSERT_INV_PROB 1000 #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \ DEFAULT_EM_FLOW_INSERT_INV_PROB) struct emc_entry { struct dp_netdev_flow *flow; - struct netdev_flow_key key; /* key.hash used for emc hash value. */ + char key[248]; /* Holds struct netdev_flow_key of limited size. */ }; struct emc_cache { - struct emc_entry entries[EM_FLOW_HASH_ENTRIES]; - int sweep_idx; /* For emc_cache_slow_sweep(). 
*/ + struct emc_entry entries[EMC_ENTRIES]; + int sweep_idx; +}; + +struct dfc_entry { + struct dp_netdev_flow *flow; +}; + +struct dfc_cache { + struct emc_cache emc_cache; + struct dfc_entry entries[DFC_ENTRIES]; + int sweep_idx; }; -/* Iterate in the exact match cache through every entry that might contain a - * miniflow with hash 'HASH'. */ -#define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH) \ - for (uint32_t i__ = 0, srch_hash__ = (HASH); \ - (CURRENT_ENTRY) = &(EMC)->entries[srch_hash__ & EM_FLOW_HASH_MASK], \ - i__ < EM_FLOW_HASH_SEGS; \ - i__++, srch_hash__ >>= EM_FLOW_HASH_SHIFT) /* Simple non-wildcarding single-priority classifier. */ @@ -216,7 +221,8 @@ static bool dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[], struct dpcls_rule **rules, size_t cnt, int *num_lookups_p); - +static bool dpcls_rule_matches_key(const struct dpcls_rule *rule, + const struct netdev_flow_key *target); /* Set of supported meter flags */ #define DP_SUPPORTED_METER_FLAGS_MASK \ (OFPMF13_STATS | OFPMF13_PKTPS | OFPMF13_KBPS | OFPMF13_BURST) @@ -548,7 +554,7 @@ struct dp_netdev_pmd_thread { * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly * need to be protected by 'non_pmd_mutex'. Every other instance * will only be accessed by its own pmd thread. */ - struct emc_cache flow_cache; + OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct dfc_cache flow_cache; /* Flow-Table and classifiers * @@ -713,46 +719,59 @@ dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd, static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd, struct tx_port *tx); -static inline bool emc_entry_alive(struct emc_entry *ce); +static inline bool dfc_entry_alive(struct dfc_entry *ce); static void emc_clear_entry(struct emc_entry *ce); +static void dfc_clear_entry(struct dfc_entry *ce); static void dp_netdev_request_reconfigure(struct dp_netdev *dp); static void -emc_cache_init(struct emc_cache *flow_cache) +emc_cache_init(struct emc_cache *emc) { int i; - flow_cache->sweep_idx = 0; + for (i = 0; i < ARRAY_SIZE(emc->entries); i++) { + emc->entries[i].flow = NULL; + struct netdev_flow_key *key = + (struct netdev_flow_key *) emc->entries[i].key; + key->hash = 0; + key->len = sizeof(struct miniflow); + flowmap_init(&key->mf.map); + } + emc->sweep_idx = 0; +} + +static void +dfc_cache_init(struct dfc_cache *flow_cache) +{ + int i; + + emc_cache_init(&flow_cache->emc_cache); for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) { flow_cache->entries[i].flow = NULL; - flow_cache->entries[i].key.hash = 0; - flow_cache->entries[i].key.len = sizeof(struct miniflow); - flowmap_init(&flow_cache->entries[i].key.mf.map); } + flow_cache->sweep_idx = 0; } static void -emc_cache_uninit(struct emc_cache *flow_cache) +emc_cache_uninit(struct emc_cache *emc) { int i; - for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) { - emc_clear_entry(&flow_cache->entries[i]); + for (i = 0; i < ARRAY_SIZE(emc->entries); i++) { + emc_clear_entry(&emc->entries[i]); } } -/* Check and clear dead flow references slowly (one entry at each - * invocation). 
*/ static void -emc_cache_slow_sweep(struct emc_cache *flow_cache) +dfc_cache_uninit(struct dfc_cache *flow_cache) { - struct emc_entry *entry = &flow_cache->entries[flow_cache->sweep_idx]; + int i; - if (!emc_entry_alive(entry)) { - emc_clear_entry(entry); + for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) { + dfc_clear_entry(&flow_cache->entries[i]); } - flow_cache->sweep_idx = (flow_cache->sweep_idx + 1) & EM_FLOW_HASH_MASK; + emc_cache_uninit(&flow_cache->emc_cache); } /* Updates the time in PMD threads context and should be called in three cases: @@ -852,6 +871,7 @@ pmd_info_show_stats(struct ds *reply, "\tpacket recirculations: %"PRIu64"\n" "\tavg. datapath passes per packet: %.02f\n" "\temc hits: %"PRIu64"\n" + "\tdfc hits: %"PRIu64"\n" "\tmegaflow hits: %"PRIu64"\n" "\tavg. subtable lookups per megaflow hit: %.02f\n" "\tmiss with success upcall: %"PRIu64"\n" @@ -859,6 +879,7 @@ pmd_info_show_stats(struct ds *reply, "\tavg. packets per output batch: %.02f\n", total_packets, stats[PMD_STAT_RECIRC], passes_per_pkt, stats[PMD_STAT_EXACT_HIT], + stats[PMD_STAT_DFC_HIT], stats[PMD_STAT_MASKED_HIT], lookups_per_hit, stats[PMD_STAT_MISS], stats[PMD_STAT_LOST], packets_per_batch); @@ -1476,6 +1497,7 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats) stats->n_flows += cmap_count(&pmd->flow_table); pmd_perf_read_counters(&pmd->perf_stats, pmd_stats); stats->n_hit += pmd_stats[PMD_STAT_EXACT_HIT]; + stats->n_hit += pmd_stats[PMD_STAT_DFC_HIT]; stats->n_hit += pmd_stats[PMD_STAT_MASKED_HIT]; stats->n_missed += pmd_stats[PMD_STAT_MISS]; stats->n_lost += pmd_stats[PMD_STAT_LOST]; @@ -2094,6 +2116,16 @@ netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key, return hash_finish(hash, (p - miniflow_get_values(&mask->mf)) * 8); } +/* + * Datapath Flow Cache and EMC implementation + */ + +static inline struct emc_entry * +emc_entry_get(struct emc_cache *emc, const uint32_t hash) +{ + return &emc->entries[hash & EMC_MASK]; +} + static inline bool emc_entry_alive(struct emc_entry *ce) { @@ -2110,9 +2142,15 @@ emc_clear_entry(struct emc_entry *ce) } static inline void -emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow, - const struct netdev_flow_key *key) +emc_change_entry(struct emc_entry *ce, const struct netdev_flow_key *key, + struct dp_netdev_flow *flow) { + /* We only store small enough flows in the EMC. 
*/ + size_t key_size = offsetof(struct netdev_flow_key, mf) + key->len; + if (key_size > sizeof(ce->key)) { + return; + } + if (ce->flow != flow) { if (ce->flow) { dp_netdev_flow_unref(ce->flow); @@ -2124,38 +2162,7 @@ emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow, ce->flow = NULL; } } - if (key) { - netdev_flow_key_clone(&ce->key, key); - } -} - -static inline void -emc_insert(struct emc_cache *cache, const struct netdev_flow_key *key, - struct dp_netdev_flow *flow) -{ - struct emc_entry *to_be_replaced = NULL; - struct emc_entry *current_entry; - - EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) { - if (netdev_flow_key_equal(¤t_entry->key, key)) { - /* We found the entry with the 'mf' miniflow */ - emc_change_entry(current_entry, flow, NULL); - return; - } - - /* Replacement policy: put the flow in an empty (not alive) entry, or - * in the first entry where it can be */ - if (!to_be_replaced - || (emc_entry_alive(to_be_replaced) - && !emc_entry_alive(current_entry)) - || current_entry->key.hash < to_be_replaced->key.hash) { - to_be_replaced = current_entry; - } - } - /* We didn't find the miniflow in the cache. - * The 'to_be_replaced' entry is where the new flow will be stored */ - - emc_change_entry(to_be_replaced, flow, key); + netdev_flow_key_clone((struct netdev_flow_key *) ce->key, key); } static inline void @@ -2168,29 +2175,151 @@ emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd, * probability of 1/100 ie. 1% */ uint32_t min; + struct emc_cache *emc = &(pmd->flow_cache).emc_cache; + atomic_read_relaxed(&pmd->dp->emc_insert_min, &min); if (min && random_uint32() <= min) { - emc_insert(&pmd->flow_cache, key, flow); + struct emc_entry *current_entry = emc_entry_get(emc, key->hash); + emc_change_entry(current_entry, key, flow); } } static inline struct dp_netdev_flow * -emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key) +emc_lookup(struct emc_cache *emc, const struct netdev_flow_key *key) +{ + struct emc_entry *current_entry = emc_entry_get(emc, key->hash); + struct netdev_flow_key *current_key = + (struct netdev_flow_key *) current_entry->key; + + if (current_key->hash == key->hash + && emc_entry_alive(current_entry) + && netdev_flow_key_equal_mf(current_key, &key->mf)) { + + /* We found the entry with the 'key->mf' miniflow */ + return current_entry->flow; + } + return NULL; +} + +static inline struct dfc_entry * +dfc_entry_get(struct dfc_cache *cache, const uint32_t hash) +{ + return &cache->entries[hash & DFC_MASK]; +} + +static inline bool +dfc_entry_alive(struct dfc_entry *ce) { - struct emc_entry *current_entry; + return ce->flow && !ce->flow->dead; +} + +static void +dfc_clear_entry(struct dfc_entry *ce) +{ + if (ce->flow) { + dp_netdev_flow_unref(ce->flow); + ce->flow = NULL; + } +} - EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) { - if (current_entry->key.hash == key->hash - && emc_entry_alive(current_entry) - && netdev_flow_key_equal_mf(¤t_entry->key, &key->mf)) { +static inline void +dfc_change_entry(struct dfc_entry *ce, struct dp_netdev_flow *flow) +{ + if (ce->flow != flow) { + if (ce->flow) { + dp_netdev_flow_unref(ce->flow); + } - /* We found the entry with the 'key->mf' miniflow */ - return current_entry->flow; + if (dp_netdev_flow_ref(flow)) { + ce->flow = flow; + } else { + ce->flow = NULL; } } +} - return NULL; +static inline void +dfc_insert(struct dp_netdev_pmd_thread *pmd, + const struct netdev_flow_key *key, + struct dp_netdev_flow *flow) +{ + struct dfc_cache *cache = 
&pmd->flow_cache; + struct dfc_entry *current_entry; + + current_entry = dfc_entry_get(cache, key->hash); + dfc_change_entry(current_entry, flow); +} + +static inline struct dp_netdev_flow * +dfc_lookup(struct dp_netdev_pmd_thread *pmd, struct netdev_flow_key *key, + bool *exact_match) +{ + struct dp_netdev_flow *flow; + uint32_t cur_min; + struct dfc_cache *cache = &pmd->flow_cache; + + /* Try an EMC lookup first. */ + atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min); + + if(cur_min) { + flow = emc_lookup(&cache->emc_cache, key); + } + else { + flow = NULL; + } + if (flow) { + *exact_match = true; + return flow; + } + + /* EMC lookup not successful: try DFC lookup. */ + struct dfc_entry *current_entry = dfc_entry_get(cache, key->hash); + flow = current_entry->flow; + + if (dfc_entry_alive(current_entry) && + dpcls_rule_matches_key(&flow->cr, key)) { + + /* Found a match in DFC. Insert into EMC for subsequent lookups. + * We use probabilistic insertion here so that mainly elephant + * flows enter EMC. */ + key->len = netdev_flow_key_size(miniflow_n_values(&key->mf)); + emc_probabilistic_insert(pmd, key, flow); + *exact_match = false; + return flow; + } else { + + /* No match. Need to go to DPCLS lookup. */ + return NULL; + } +} + +/* Check and clear dead flow references slowly (one entry at each + * invocation). */ +static void +emc_slow_sweep(struct emc_cache *emc) +{ + struct emc_entry *entry = &emc->entries[emc->sweep_idx]; + + if (!emc_entry_alive(entry)) { + emc_clear_entry(entry); + } + emc->sweep_idx = (emc->sweep_idx + 1) & EMC_MASK; +} + +static void +dfc_slow_sweep(struct dfc_cache *cache) +{ + /* Sweep the EMC so that both finish in the same time. */ + if ((cache->sweep_idx & (DFC_ENTRIES / EMC_ENTRIES - 1)) == 0) { + emc_slow_sweep(&cache->emc_cache); + } + + struct dfc_entry *entry = &cache->entries[cache->sweep_idx]; + if (!dfc_entry_alive(entry)) { + dfc_clear_entry(entry); + } + cache->sweep_idx = (cache->sweep_idx + 1) & DFC_MASK; } static struct dp_netdev_flow * @@ -2443,17 +2572,11 @@ dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd, ovs_assert(match->wc.masks.in_port.odp_port == ODPP_NONE); odp_port_t in_port = match->flow.in_port.odp_port; - /* As we select the dpcls based on the port number, each netdev flow - * belonging to the same dpcls will have the same odp_port value. - * For performance reasons we wildcard odp_port here in the mask. In the - * typical case dp_hash is also wildcarded, and the resulting 8-byte - * chunk {dp_hash, in_port} will be ignored by netdev_flow_mask_init() and - * will not be part of the subtable mask. - * This will speed up the hash computation during dpcls_lookup() because - * there is one less call to hash_add64() in this case. */ - match->wc.masks.in_port.odp_port = 0; + /* Since we added DFC cache per PMD, the input port is needed to match + * the key in DFC. DFC is not per port. Otherwise in_port could be + * excluded from mask */ + netdev_flow_mask_init(&mask, match); - match->wc.masks.in_port.odp_port = ODPP_NONE; /* Make sure wc does not have metadata. 
*/ ovs_assert(!FLOWMAP_HAS_FIELD(&mask.mf.map, metadata) @@ -4113,7 +4236,7 @@ pmd_thread_main(void *f_) ovs_numa_thread_setaffinity_core(pmd->core_id); dpdk_set_lcore_id(pmd->core_id); poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list); - emc_cache_init(&pmd->flow_cache); + dfc_cache_init(&pmd->flow_cache); reload: pmd_alloc_static_tx_qid(pmd); @@ -4164,7 +4287,7 @@ reload: coverage_try_clear(); dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt); if (!ovsrcu_try_quiesce()) { - emc_cache_slow_sweep(&pmd->flow_cache); + dfc_slow_sweep(&pmd->flow_cache); } atomic_read_relaxed(&pmd->reload, &reload); @@ -4187,7 +4310,7 @@ reload: goto reload; } - emc_cache_uninit(&pmd->flow_cache); + dfc_cache_uninit(&pmd->flow_cache); free(poll_list); pmd_free_cached_ports(pmd); return NULL; @@ -4623,7 +4746,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp, /* init the 'flow_cache' since there is no * actual thread created for NON_PMD_CORE_ID. */ if (core_id == NON_PMD_CORE_ID) { - emc_cache_init(&pmd->flow_cache); + dfc_cache_init(&pmd->flow_cache); pmd_alloc_static_tx_qid(pmd); } pmd_perf_stats_init(&pmd->perf_stats); @@ -4666,7 +4789,7 @@ dp_netdev_del_pmd(struct dp_netdev *dp, struct dp_netdev_pmd_thread *pmd) * but extra cleanup is necessary */ if (pmd->core_id == NON_PMD_CORE_ID) { ovs_mutex_lock(&dp->non_pmd_mutex); - emc_cache_uninit(&pmd->flow_cache); + dfc_cache_uninit(&pmd->flow_cache); pmd_free_cached_ports(pmd); pmd_free_static_tx_qid(pmd); ovs_mutex_unlock(&dp->non_pmd_mutex); @@ -4970,7 +5093,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt, packet_batch_per_flow_update(batch, pkt, mf); } -/* Try to process all ('cnt') the 'packets' using only the exact match cache +/* Try to process all ('cnt') the 'packets' using only the PMD flow cache * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the * miniflow is copied into 'keys' and the packet pointer is moved at the * beginning of the 'packets' array. @@ -4985,21 +5108,19 @@ dp_netdev_queue_batches(struct dp_packet *pkt, * will be ignored. */ static inline size_t -emc_processing(struct dp_netdev_pmd_thread *pmd, +dfc_processing(struct dp_netdev_pmd_thread *pmd, struct dp_packet_batch *packets_, struct netdev_flow_key *keys, struct packet_batch_per_flow batches[], size_t *n_batches, bool md_is_valid, odp_port_t port_no) { - struct emc_cache *flow_cache = &pmd->flow_cache; struct netdev_flow_key *key = &keys[0]; - size_t n_missed = 0, n_dropped = 0; + size_t n_missed = 0, n_dfc_hit = 0, n_emc_hit = 0; struct dp_packet *packet; const size_t cnt = dp_packet_batch_size(packets_); - uint32_t cur_min; int i; + bool exact_match; - atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min); pmd_perf_update_counter(&pmd->perf_stats, md_is_valid ? PMD_STAT_RECIRC : PMD_STAT_RECV, cnt); @@ -5009,7 +5130,6 @@ emc_processing(struct dp_netdev_pmd_thread *pmd, if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) { dp_packet_delete(packet); - n_dropped++; continue; } @@ -5025,19 +5145,20 @@ emc_processing(struct dp_netdev_pmd_thread *pmd, } miniflow_extract(packet, &key->mf); key->len = 0; /* Not computed yet. 
*/ - /* If EMC is disabled skip hash computation and emc_lookup */ - if (cur_min) { - if (!md_is_valid) { - key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet, - &key->mf); - } else { - key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf); - } - flow = emc_lookup(flow_cache, key); + if (!md_is_valid) { + key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet, + &key->mf); } else { - flow = NULL; + key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf); } + flow = dfc_lookup(pmd, key, &exact_match); + if (OVS_LIKELY(flow)) { + if (exact_match) { + n_emc_hit++; + } else { + n_dfc_hit++; + } dp_netdev_queue_batches(packet, flow, &key->mf, batches, n_batches); } else { @@ -5051,8 +5172,8 @@ emc_processing(struct dp_netdev_pmd_thread *pmd, } } - pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, - cnt - n_dropped - n_missed); + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, n_emc_hit); + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_DFC_HIT, n_dfc_hit); return dp_packet_batch_size(packets_); } @@ -5119,7 +5240,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd, add_actions->size); } ovs_mutex_unlock(&pmd->flow_mutex); - emc_probabilistic_insert(pmd, key, netdev_flow); + dfc_insert(pmd, key, netdev_flow); } return error; } @@ -5215,7 +5336,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd, flow = dp_netdev_flow_cast(rules[i]); - emc_probabilistic_insert(pmd, &keys[i], flow); + dfc_insert(pmd, &keys[i], flow); dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches); } @@ -5251,7 +5372,7 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd, odp_port_t in_port; n_batches = 0; - emc_processing(pmd, packets, keys, batches, &n_batches, + dfc_processing(pmd, packets, keys, batches, &n_batches, md_is_valid, port_no); if (!dp_packet_batch_is_empty(packets)) { /* Get ingress port from first packet's metadata. */ @@ -6202,7 +6323,7 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule) /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit * in 'mask' the values in 'key' and 'target' are the same. 
 */
-static inline bool
+static bool
 dpcls_rule_matches_key(const struct dpcls_rule *rule,
                        const struct netdev_flow_key *target)
 {

From patchwork Mon Apr 2 17:04:41 2018
X-Patchwork-Submitter: "Wang, Yipeng1"
X-Patchwork-Id: 894415
X-Patchwork-Delegate: ian.stokes@intel.com
From: yipeng wang
To: dev@openvswitch.org, jan.scheurich@ericsson.com, u9012063@gmail.com
Date: Mon, 2 Apr 2018 10:04:41 -0700
Message-Id: <1522688683-59140-2-git-send-email-yipeng1.wang@intel.com>
In-Reply-To: <1522688683-59140-1-git-send-email-yipeng1.wang@intel.com>
References: <1520364208-54362-1-git-send-email-yipeng1.wang@intel.com>
 <1522688683-59140-1-git-send-email-yipeng1.wang@intel.com>
Subject: [ovs-dev] [PATCH v2 2/4] dpif-netdev: Use way-associative cache

From: Yipeng Wang

This commit uses a way-associative cache rather than a simple
single-entry hash table for the DFC. Experiments show that this design
generally achieves a much higher hit rate.
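The bucket layout and insertion policy the diff below introduces can be
summarized in a stand-alone sketch: each bucket holds eight entries
tagged by the upper 16 bits of the hash; insertion reuses a matching
signature, then an empty slot, and otherwise evicts a random entry
within the bucket. The toy_* names are illustrative, and the flow
reference counting done by dfc_change_entry() in the real code is
omitted:

#include <stdint.h>
#include <stdlib.h>

#define ENTRIES_PER_BUCKET 8      /* 8-way associative, as in this patch. */
#define BUCKET_CNT (1u << 17)     /* 1M entries / 8 entries per bucket. */

struct toy_bucket {
    uint16_t sig[ENTRIES_PER_BUCKET];   /* Upper 16 hash bits. */
    void *flow[ENTRIES_PER_BUCKET];
};

static struct toy_bucket buckets[BUCKET_CNT];

/* Insert 'flow' under 'hash': reuse a matching signature slot, else an
 * empty slot, else evict a random victim within the bucket. */
static void
toy_insert(uint32_t hash, void *flow)
{
    struct toy_bucket *b = &buckets[hash & (BUCKET_CNT - 1)];
    uint16_t sig = hash >> 16;
    int i;

    for (i = 0; i < ENTRIES_PER_BUCKET; i++) {
        if (b->sig[i] == sig) {          /* Same signature: overwrite. */
            b->flow[i] = flow;
            return;
        }
    }
    for (i = 0; i < ENTRIES_PER_BUCKET; i++) {
        if (!b->flow[i]) {               /* Empty slot: take it. */
            b->sig[i] = sig;
            b->flow[i] = flow;
            return;
        }
    }
    i = random() & (ENTRIES_PER_BUCKET - 1);   /* Random eviction. */
    b->sig[i] = sig;
    b->flow[i] = flow;
}

Random in-bucket eviction keeps insertion O(1) and needs no extra
metadata, unlike an LRU policy, at the cost of occasionally evicting a
hot flow.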
Signed-off-by: Yipeng Wang --- lib/dpif-netdev.c | 107 +++++++++++++++++++++++++++++++++++------------------- 1 file changed, 70 insertions(+), 37 deletions(-) diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index cea2bf8..28d6086 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -150,8 +150,10 @@ struct netdev_flow_key { */ #define DFC_MASK_LEN 20 +#define DFC_ENTRY_PER_BUCKET 8 #define DFC_ENTRIES (1u << DFC_MASK_LEN) -#define DFC_MASK (DFC_ENTRIES - 1) +#define DFC_BUCKET_CNT (DFC_ENTRIES / DFC_ENTRY_PER_BUCKET) +#define DFC_MASK (DFC_BUCKET_CNT - 1) #define EMC_MASK_LEN 14 #define EMC_ENTRIES (1u << EMC_MASK_LEN) #define EMC_MASK (EMC_ENTRIES - 1) @@ -171,13 +173,14 @@ struct emc_cache { int sweep_idx; }; -struct dfc_entry { - struct dp_netdev_flow *flow; +struct dfc_bucket { + uint16_t sig[DFC_ENTRY_PER_BUCKET]; + struct dp_netdev_flow *flow[DFC_ENTRY_PER_BUCKET]; }; struct dfc_cache { struct emc_cache emc_cache; - struct dfc_entry entries[DFC_ENTRIES]; + struct dfc_bucket buckets[DFC_BUCKET_CNT]; int sweep_idx; }; @@ -719,9 +722,9 @@ dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd, static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd, struct tx_port *tx); -static inline bool dfc_entry_alive(struct dfc_entry *ce); +static inline bool dfc_entry_alive(struct dp_netdev_flow *flow); static void emc_clear_entry(struct emc_entry *ce); -static void dfc_clear_entry(struct dfc_entry *ce); +static void dfc_clear_entry(struct dfc_bucket *b, int idx); static void dp_netdev_request_reconfigure(struct dp_netdev *dp); @@ -744,11 +747,13 @@ emc_cache_init(struct emc_cache *emc) static void dfc_cache_init(struct dfc_cache *flow_cache) { - int i; + int i, j; emc_cache_init(&flow_cache->emc_cache); - for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) { - flow_cache->entries[i].flow = NULL; + for (i = 0; i < DFC_BUCKET_CNT; i++) { + for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) { + flow_cache->buckets[i].flow[j] = NULL; + } } flow_cache->sweep_idx = 0; } @@ -766,10 +771,12 @@ emc_cache_uninit(struct emc_cache *emc) static void dfc_cache_uninit(struct dfc_cache *flow_cache) { - int i; + int i, j; - for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) { - dfc_clear_entry(&flow_cache->entries[i]); + for (i = 0; i < DFC_BUCKET_CNT; i++) { + for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) { + dfc_clear_entry(&(flow_cache->buckets[i]), j); + } } emc_cache_uninit(&flow_cache->emc_cache); } @@ -2202,39 +2209,46 @@ emc_lookup(struct emc_cache *emc, const struct netdev_flow_key *key) return NULL; } -static inline struct dfc_entry * +static inline struct dp_netdev_flow * dfc_entry_get(struct dfc_cache *cache, const uint32_t hash) { - return &cache->entries[hash & DFC_MASK]; + struct dfc_bucket *bucket = &cache->buckets[hash & DFC_MASK]; + uint16_t sig = hash >> 16; + for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) { + if(bucket->sig[i] == sig) { + return bucket->flow[i]; + } + } + return NULL; } static inline bool -dfc_entry_alive(struct dfc_entry *ce) +dfc_entry_alive(struct dp_netdev_flow *flow) { - return ce->flow && !ce->flow->dead; + return flow && !flow->dead; } static void -dfc_clear_entry(struct dfc_entry *ce) +dfc_clear_entry(struct dfc_bucket *b, int idx) { - if (ce->flow) { - dp_netdev_flow_unref(ce->flow); - ce->flow = NULL; + if (b->flow[idx]) { + dp_netdev_flow_unref(b->flow[idx]); + b->flow[idx] = NULL; } } static inline void -dfc_change_entry(struct dfc_entry *ce, struct dp_netdev_flow *flow) +dfc_change_entry(struct dfc_bucket *b, int idx, struct 
dp_netdev_flow *flow) { - if (ce->flow != flow) { - if (ce->flow) { - dp_netdev_flow_unref(ce->flow); + if (b->flow[idx] != flow) { + if (b->flow[idx]) { + dp_netdev_flow_unref(b->flow[idx]); } if (dp_netdev_flow_ref(flow)) { - ce->flow = flow; + b->flow[idx] = flow; } else { - ce->flow = NULL; + b->flow[idx] = NULL; } } } @@ -2245,10 +2259,25 @@ dfc_insert(struct dp_netdev_pmd_thread *pmd, struct dp_netdev_flow *flow) { struct dfc_cache *cache = &pmd->flow_cache; - struct dfc_entry *current_entry; - current_entry = dfc_entry_get(cache, key->hash); - dfc_change_entry(current_entry, flow); + struct dfc_bucket *bucket = &cache->buckets[key->hash & DFC_MASK]; + uint16_t sig = key->hash >> 16; + for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) { + if(bucket->sig[i] == sig) { + dfc_change_entry(bucket, i, flow); + return; + } + } + for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) { + if(bucket->flow[i] == NULL) { + bucket->sig[i] = sig; + dfc_change_entry(bucket, i, flow); + return; + } + } + int idx = random_uint32() & (DFC_ENTRY_PER_BUCKET - 1); + bucket->sig[idx] = sig; + dfc_change_entry(bucket, idx, flow); } static inline struct dp_netdev_flow * @@ -2274,10 +2303,9 @@ dfc_lookup(struct dp_netdev_pmd_thread *pmd, struct netdev_flow_key *key, } /* EMC lookup not successful: try DFC lookup. */ - struct dfc_entry *current_entry = dfc_entry_get(cache, key->hash); - flow = current_entry->flow; + flow = dfc_entry_get(cache, key->hash); - if (dfc_entry_alive(current_entry) && + if (flow != NULL && dfc_entry_alive(flow) && dpcls_rule_matches_key(&flow->cr, key)) { /* Found a match in DFC. Insert into EMC for subsequent lookups. @@ -2311,15 +2339,20 @@ static void dfc_slow_sweep(struct dfc_cache *cache) { /* Sweep the EMC so that both finish in the same time. */ - if ((cache->sweep_idx & (DFC_ENTRIES / EMC_ENTRIES - 1)) == 0) { + if ((cache->sweep_idx & (DFC_BUCKET_CNT / EMC_ENTRIES - 1)) == 0) { emc_slow_sweep(&cache->emc_cache); } - struct dfc_entry *entry = &cache->entries[cache->sweep_idx]; - if (!dfc_entry_alive(entry)) { - dfc_clear_entry(entry); + if((cache->sweep_idx & (DFC_ENTRY_PER_BUCKET - 1)) == 0) { + uint32_t bkt = cache->sweep_idx / DFC_ENTRY_PER_BUCKET; + struct dfc_bucket *bucket = &cache->buckets[bkt]; + for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++){ + if (!dfc_entry_alive(bucket->flow[i])) { + dfc_clear_entry(bucket, i); + } + } } - cache->sweep_idx = (cache->sweep_idx + 1) & DFC_MASK; + cache->sweep_idx = (cache->sweep_idx + 1) & (DFC_ENTRIES - 1); } static struct dp_netdev_flow * From patchwork Mon Apr 2 17:04:42 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Wang, Yipeng1" X-Patchwork-Id: 894414 X-Patchwork-Delegate: ian.stokes@intel.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=openvswitch.org (client-ip=140.211.169.12; helo=mail.linuxfoundation.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=intel.com Received: from mail.linuxfoundation.org (mail.linuxfoundation.org [140.211.169.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 40FTvJ04pqz9s1w for ; Tue, 3 Apr 2018 10:08:12 +1000 (AEST) Received: from mail.linux-foundation.org (localhost [127.0.0.1]) by mail.linuxfoundation.org 
(Postfix) with ESMTP id 78210F19; Tue, 3 Apr 2018 00:07:36 +0000 (UTC)
From: yipeng wang
To: dev@openvswitch.org, jan.scheurich@ericsson.com, u9012063@gmail.com
Date: Mon, 2 Apr 2018 10:04:42 -0700
Message-Id: <1522688683-59140-3-git-send-email-yipeng1.wang@intel.com>
In-Reply-To: <1522688683-59140-1-git-send-email-yipeng1.wang@intel.com>
References: <1520364208-54362-1-git-send-email-yipeng1.wang@intel.com>
 <1522688683-59140-1-git-send-email-yipeng1.wang@intel.com>
Subject: [ovs-dev] [PATCH v2 3/4] dpif-netdev: use flow_table as indirect table

From: Yipeng Wang

It is not memory efficient to store pointers in the distributor. In
this commit, we store a 2-byte index that indexes into the flow_table.
If the flow table is larger than 2^16, the rules stored at higher
indices will not be cached in the distributor.

In cmap, we add two APIs to support finding a flow by index and finding
an index by flow. Since the flow table is a cuckoo hash, keys may be
moved around or rehashed. In such cases the distributor will miss and
refresh. However, this should not happen frequently, and the
distributor, being a cache, cannot cause a functional error.

Signed-off-by: Yipeng Wang
---
 lib/cmap.c        |  73 ++++++++++++++++++++++++++++++
 lib/cmap.h        |   5 +++
 lib/dpif-netdev.c | 131 +++++++++++++++++++++---------------------------------
 3 files changed, 128 insertions(+), 81 deletions(-)

diff --git a/lib/cmap.c b/lib/cmap.c
index 07719a8..db1c806 100644
--- a/lib/cmap.c
+++ b/lib/cmap.c
@@ -373,6 +373,79 @@ cmap_find(const struct cmap *cmap, uint32_t hash)
                      hash);
 }
 
+/* Find a node by the index of the entry of cmap. For example, index 7 means
+ * the second bucket and the third item.
+ * Notice that it is not protected by the optimistic lock (versioning) because
+ * of performance reasons. Currently it is only used by the datapath DFC cache.
+ * + * Return node for the entry of index or NULL if the index beyond boundary */ +const struct cmap_node * +cmap_find_by_index(const struct cmap *cmap, uint16_t index) +{ + const struct cmap_impl *impl = cmap_get_impl(cmap); + + uint32_t b = index / CMAP_K; + uint32_t e = index % CMAP_K; + + if (b > impl->mask) { + return NULL; + } + + const struct cmap_bucket *bucket = &impl->buckets[b]; + + return cmap_node_next(&bucket->nodes[e]); +} + +/* Find the index of certain hash value. Currently only used by the datapath + * DFC cache. + * + * Return the index of the entry if found, or UINT32_MAX if not found */ + +uint32_t +cmap_find_index(const struct cmap *cmap, uint32_t hash) +{ + const struct cmap_impl *impl = cmap_get_impl(cmap); + uint32_t h1 = rehash(impl, hash); + uint32_t h2 = other_hash(h1); + + uint32_t b_index1 = h1 & impl->mask; + uint32_t b_index2 = h2 & impl->mask; + + uint32_t c1, c2; + uint32_t index = UINT32_MAX; + + const struct cmap_bucket *b1 = &impl->buckets[b_index1]; + const struct cmap_bucket *b2 = &impl->buckets[b_index2]; + + do { + do { + c1 = read_even_counter(b1); + for (int i = 0; i < CMAP_K; i++) { + if (b1->hashes[i] == hash) { + index = b_index1 * CMAP_K + i; + } + } + } while (OVS_UNLIKELY(counter_changed(b1, c1))); + if (index != UINT32_MAX) { + break; + } + do { + c2 = read_even_counter(b2); + for (int i = 0; i < CMAP_K; i++) { + if (b2->hashes[i] == hash) { + index = b_index2 * CMAP_K + i; + } + } + } while (OVS_UNLIKELY(counter_changed(b2, c2))); + + if (index != UINT32_MAX) { + break; + } + } while (OVS_UNLIKELY(counter_changed(b1, c1))); + + return index; +} + /* Looks up multiple 'hashes', when the corresponding bit in 'map' is 1, * and sets the corresponding pointer in 'nodes', if the hash value was * found from the 'cmap'. In other cases the 'nodes' values are not changed, diff --git a/lib/cmap.h b/lib/cmap.h index 8bfb6c0..076afd6 100644 --- a/lib/cmap.h +++ b/lib/cmap.h @@ -145,6 +145,11 @@ size_t cmap_replace(struct cmap *, struct cmap_node *old_node, const struct cmap_node *cmap_find(const struct cmap *, uint32_t hash); struct cmap_node *cmap_find_protected(const struct cmap *, uint32_t hash); +/* Find node by index or find index by hash */ +const struct cmap_node * cmap_find_by_index(const struct cmap *, + uint16_t index); +uint32_t cmap_find_index(const struct cmap *, uint32_t hash); + /* Looks up multiple 'hashes', when the corresponding bit in 'map' is 1, * and sets the corresponding pointer in 'nodes', if the hash value was * found from the 'cmap'. 
In other cases the 'nodes' values are not changed, diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 28d6086..8fe736d 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -175,13 +175,12 @@ struct emc_cache { struct dfc_bucket { uint16_t sig[DFC_ENTRY_PER_BUCKET]; - struct dp_netdev_flow *flow[DFC_ENTRY_PER_BUCKET]; + uint16_t flow_idx[DFC_ENTRY_PER_BUCKET]; }; struct dfc_cache { struct emc_cache emc_cache; struct dfc_bucket buckets[DFC_BUCKET_CNT]; - int sweep_idx; }; @@ -722,7 +721,6 @@ dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd, static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd, struct tx_port *tx); -static inline bool dfc_entry_alive(struct dp_netdev_flow *flow); static void emc_clear_entry(struct emc_entry *ce); static void dfc_clear_entry(struct dfc_bucket *b, int idx); @@ -752,10 +750,9 @@ dfc_cache_init(struct dfc_cache *flow_cache) emc_cache_init(&flow_cache->emc_cache); for (i = 0; i < DFC_BUCKET_CNT; i++) { for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) { - flow_cache->buckets[i].flow[j] = NULL; + flow_cache->buckets[i].flow_idx[j] = UINT16_MAX; } } - flow_cache->sweep_idx = 0; } static void @@ -2209,84 +2206,72 @@ emc_lookup(struct emc_cache *emc, const struct netdev_flow_key *key) return NULL; } -static inline struct dp_netdev_flow * -dfc_entry_get(struct dfc_cache *cache, const uint32_t hash) +static inline const struct cmap_node * +dfc_entry_get(struct dp_netdev_pmd_thread *pmd, struct dfc_cache *cache, const uint32_t hash) { struct dfc_bucket *bucket = &cache->buckets[hash & DFC_MASK]; uint16_t sig = hash >> 16; + uint16_t index = UINT16_MAX; + for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) { - if(bucket->sig[i] == sig) { - return bucket->flow[i]; + if (bucket->sig[i] == sig) { + index = bucket->flow_idx[i]; + break; } } + if (index != UINT16_MAX) { + return cmap_find_by_index(&pmd->flow_table, index); + } return NULL; } -static inline bool -dfc_entry_alive(struct dp_netdev_flow *flow) -{ - return flow && !flow->dead; -} static void dfc_clear_entry(struct dfc_bucket *b, int idx) { - if (b->flow[idx]) { - dp_netdev_flow_unref(b->flow[idx]); - b->flow[idx] = NULL; - } -} - -static inline void -dfc_change_entry(struct dfc_bucket *b, int idx, struct dp_netdev_flow *flow) -{ - if (b->flow[idx] != flow) { - if (b->flow[idx]) { - dp_netdev_flow_unref(b->flow[idx]); - } - - if (dp_netdev_flow_ref(flow)) { - b->flow[idx] = flow; - } else { - b->flow[idx] = NULL; - } - } + b->flow_idx[idx] = UINT16_MAX; } static inline void dfc_insert(struct dp_netdev_pmd_thread *pmd, const struct netdev_flow_key *key, - struct dp_netdev_flow *flow) + uint16_t index) { struct dfc_cache *cache = &pmd->flow_cache; - struct dfc_bucket *bucket = &cache->buckets[key->hash & DFC_MASK]; + + if (index == UINT16_MAX) { + return; + } + uint16_t sig = key->hash >> 16; for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) { if(bucket->sig[i] == sig) { - dfc_change_entry(bucket, i, flow); + bucket->flow_idx[i] = index; return; } } for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) { - if(bucket->flow[i] == NULL) { + if(bucket->flow_idx[i] == UINT16_MAX) { bucket->sig[i] = sig; - dfc_change_entry(bucket, i, flow); + bucket->flow_idx[i] = index; return; } } int idx = random_uint32() & (DFC_ENTRY_PER_BUCKET - 1); bucket->sig[idx] = sig; - dfc_change_entry(bucket, idx, flow); + bucket->flow_idx[idx] = index; } static inline struct dp_netdev_flow * -dfc_lookup(struct dp_netdev_pmd_thread *pmd, struct netdev_flow_key *key, - bool *exact_match) +dfc_lookup(struct 
dp_netdev_pmd_thread *pmd, + struct netdev_flow_key *key, + bool *exact_match) { struct dp_netdev_flow *flow; uint32_t cur_min; struct dfc_cache *cache = &pmd->flow_cache; + const struct cmap_node *flow_node; /* Try an EMC lookup first. */ atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min); @@ -2303,23 +2288,18 @@ dfc_lookup(struct dp_netdev_pmd_thread *pmd, struct netdev_flow_key *key, } /* EMC lookup not successful: try DFC lookup. */ - flow = dfc_entry_get(cache, key->hash); - - if (flow != NULL && dfc_entry_alive(flow) && - dpcls_rule_matches_key(&flow->cr, key)) { - - /* Found a match in DFC. Insert into EMC for subsequent lookups. - * We use probabilistic insertion here so that mainly elephant - * flows enter EMC. */ - key->len = netdev_flow_key_size(miniflow_n_values(&key->mf)); - emc_probabilistic_insert(pmd, key, flow); - *exact_match = false; - return flow; - } else { + flow_node = dfc_entry_get(pmd, cache, key->hash); - /* No match. Need to go to DPCLS lookup. */ - return NULL; + CMAP_NODE_FOR_EACH (flow, node, flow_node) { + if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, key))) { + key->len = netdev_flow_key_size(miniflow_n_values(&key->mf)); + emc_probabilistic_insert(pmd, key, flow); + *exact_match = false; + return flow; + } } + + return NULL; } /* Check and clear dead flow references slowly (one entry at each @@ -2335,26 +2315,6 @@ emc_slow_sweep(struct emc_cache *emc) emc->sweep_idx = (emc->sweep_idx + 1) & EMC_MASK; } -static void -dfc_slow_sweep(struct dfc_cache *cache) -{ - /* Sweep the EMC so that both finish in the same time. */ - if ((cache->sweep_idx & (DFC_BUCKET_CNT / EMC_ENTRIES - 1)) == 0) { - emc_slow_sweep(&cache->emc_cache); - } - - if((cache->sweep_idx & (DFC_ENTRY_PER_BUCKET - 1)) == 0) { - uint32_t bkt = cache->sweep_idx / DFC_ENTRY_PER_BUCKET; - struct dfc_bucket *bucket = &cache->buckets[bkt]; - for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++){ - if (!dfc_entry_alive(bucket->flow[i])) { - dfc_clear_entry(bucket, i); - } - } - } - cache->sweep_idx = (cache->sweep_idx + 1) & (DFC_ENTRIES - 1); -} - static struct dp_netdev_flow * dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd, const struct netdev_flow_key *key, @@ -4319,8 +4279,9 @@ reload: coverage_try_clear(); dp_netdev_pmd_try_optimize(pmd, poll_list, poll_cnt); + if (!ovsrcu_try_quiesce()) { - dfc_slow_sweep(&pmd->flow_cache); + emc_slow_sweep(&((pmd->flow_cache).emc_cache)); } atomic_read_relaxed(&pmd->reload, &reload); @@ -5178,6 +5139,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd, } miniflow_extract(packet, &key->mf); key->len = 0; /* Not computed yet. */ + if (!md_is_valid) { key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet, &key->mf); @@ -5273,7 +5235,11 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd, add_actions->size); } ovs_mutex_unlock(&pmd->flow_mutex); - dfc_insert(pmd, key, netdev_flow); + uint32_t cmap_index = cmap_find_index(&pmd->flow_table, + dp_netdev_flow_hash(&netdev_flow->ufid)); + uint16_t index = (cmap_index >= UINT16_MAX) ? UINT16_MAX : + (uint16_t)cmap_index; + dfc_insert(pmd, key, index); } return error; } @@ -5368,8 +5334,11 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd, } flow = dp_netdev_flow_cast(rules[i]); - - dfc_insert(pmd, &keys[i], flow); + uint32_t cmap_index = cmap_find_index(&pmd->flow_table, + dp_netdev_flow_hash(&flow->ufid)); + uint16_t index = (cmap_index >= UINT16_MAX) ? 
UINT16_MAX : + (uint16_t)cmap_index; + dfc_insert(pmd, &keys[i], index); dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches); } From patchwork Mon Apr 2 17:04:43 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Wang, Yipeng1" X-Patchwork-Id: 894416 X-Patchwork-Delegate: ian.stokes@intel.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=openvswitch.org (client-ip=140.211.169.12; helo=mail.linuxfoundation.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=intel.com Received: from mail.linuxfoundation.org (mail.linuxfoundation.org [140.211.169.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 40FTwH0TQkz9s1w for ; Tue, 3 Apr 2018 10:09:03 +1000 (AEST) Received: from mail.linux-foundation.org (localhost [127.0.0.1]) by mail.linuxfoundation.org (Postfix) with ESMTP id 40434FBA; Tue, 3 Apr 2018 00:07:38 +0000 (UTC) X-Original-To: dev@openvswitch.org Delivered-To: ovs-dev@mail.linuxfoundation.org Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org [172.17.192.35]) by mail.linuxfoundation.org (Postfix) with ESMTPS id EF20AF0E for ; Tue, 3 Apr 2018 00:07:35 +0000 (UTC) X-Greylist: domain auto-whitelisted by SQLgrey-1.7.6 Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by smtp1.linuxfoundation.org (Postfix) with ESMTPS id B6B8E613 for ; Tue, 3 Apr 2018 00:07:34 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga106.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 02 Apr 2018 17:07:31 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.48,398,1517904000"; d="scan'208";a="188125257" Received: from skx-yipeng.jf.intel.com ([10.54.81.175]) by orsmga004.jf.intel.com with ESMTP; 02 Apr 2018 17:07:31 -0700 From: yipeng wang To: dev@openvswitch.org, jan.scheurich@ericsson.com, u9012063@gmail.com Date: Mon, 2 Apr 2018 10:04:43 -0700 Message-Id: <1522688683-59140-4-git-send-email-yipeng1.wang@intel.com> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1522688683-59140-1-git-send-email-yipeng1.wang@intel.com> References: <1520364208-54362-1-git-send-email-yipeng1.wang@intel.com> <1522688683-59140-1-git-send-email-yipeng1.wang@intel.com> X-Spam-Status: No, score=-0.4 required=5.0 tests=BAYES_00, DATE_IN_PAST_06_12, T_RP_MATCHES_RCVD autolearn=no version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on smtp1.linux-foundation.org Cc: "../patches/v2/v2-0000-cover-letter.patch"@mail.linuxfoundation.org Subject: [ovs-dev] [PATCH v2 4/4] dpif-netdev: Split DFC cache and code optimization X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.12 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Sender: ovs-dev-bounces@openvswitch.org Errors-To: ovs-dev-bounces@openvswitch.org Split DFC cache into EMC and SMC (signature match cache). This makes the code more clear and either of the cache can operate separately. It is also easier for future optimizations. Rewrite the DFC processing code to do batching lookup. 
By using prefetching with batching lookup, memory access cost is hidden and performance improved significantly. Signed-off-by: Yipeng Wang --- lib/dpif-netdev-perf.h | 2 +- lib/dpif-netdev.c | 275 ++++++++++++++++++++++++++++--------------------- 2 files changed, 161 insertions(+), 116 deletions(-) diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index 157befb..3d9de6f 100644 --- a/lib/dpif-netdev-perf.h +++ b/lib/dpif-netdev-perf.h @@ -48,7 +48,7 @@ extern "C" { enum pmd_stat_type { PMD_STAT_EXACT_HIT, /* Packets that had an exact match (emc). */ - PMD_STAT_DFC_HIT, /* Packets that had a flow cache hit (dfc). */ + PMD_STAT_SMC_HIT, /* Packets that had a sig match hit (SMC). */ PMD_STAT_MASKED_HIT, /* Packets that matched in the flow table. */ PMD_STAT_MISS, /* Packets that did not match and upcall was ok. */ PMD_STAT_LOST, /* Packets that did not match and upcall failed. */ diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 8fe736d..a170750 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -128,6 +128,7 @@ struct netdev_flow_key { uint64_t buf[FLOW_MAX_PACKET_U64S]; }; + /* Datapath flow cache (DFC) for frequently used flows * * The cache uses the 32-bit hash of the packet (which can be the RSS hash) to @@ -149,11 +150,11 @@ struct netdev_flow_key { * If dp_netdev_input is not called from a pmd thread, a mutex is used. */ -#define DFC_MASK_LEN 20 -#define DFC_ENTRY_PER_BUCKET 8 -#define DFC_ENTRIES (1u << DFC_MASK_LEN) -#define DFC_BUCKET_CNT (DFC_ENTRIES / DFC_ENTRY_PER_BUCKET) -#define DFC_MASK (DFC_BUCKET_CNT - 1) +#define SMC_ENTRY_PER_BUCKET 4 +#define SMC_ENTRIES (1u << 20) +#define SMC_BUCKET_CNT (SMC_ENTRIES / SMC_ENTRY_PER_BUCKET) +#define SMC_MASK (SMC_BUCKET_CNT - 1) + #define EMC_MASK_LEN 14 #define EMC_ENTRIES (1u << EMC_MASK_LEN) #define EMC_MASK (EMC_ENTRIES - 1) @@ -173,14 +174,19 @@ struct emc_cache { int sweep_idx; }; -struct dfc_bucket { - uint16_t sig[DFC_ENTRY_PER_BUCKET]; - uint16_t flow_idx[DFC_ENTRY_PER_BUCKET]; +struct smc_bucket { + uint16_t sig[SMC_ENTRY_PER_BUCKET]; + uint16_t flow_idx[SMC_ENTRY_PER_BUCKET]; +}; + +/* Signature match cache, differentiate from EMC cache */ +struct smc_cache { + struct smc_bucket buckets[SMC_BUCKET_CNT]; }; struct dfc_cache { struct emc_cache emc_cache; - struct dfc_bucket buckets[DFC_BUCKET_CNT]; + struct smc_cache smc_cache; }; @@ -220,7 +226,7 @@ static void dpcls_insert(struct dpcls *, struct dpcls_rule *, const struct netdev_flow_key *mask); static void dpcls_remove(struct dpcls *, struct dpcls_rule *); static bool dpcls_lookup(struct dpcls *cls, - const struct netdev_flow_key keys[], + const struct netdev_flow_key *keys[], struct dpcls_rule **rules, size_t cnt, int *num_lookups_p); static bool dpcls_rule_matches_key(const struct dpcls_rule *rule, @@ -722,7 +728,7 @@ static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd, struct tx_port *tx); static void emc_clear_entry(struct emc_entry *ce); -static void dfc_clear_entry(struct dfc_bucket *b, int idx); +static void smc_clear_entry(struct smc_bucket *b, int idx); static void dp_netdev_request_reconfigure(struct dp_netdev *dp); @@ -743,19 +749,24 @@ emc_cache_init(struct emc_cache *emc) } static void -dfc_cache_init(struct dfc_cache *flow_cache) +smc_cache_init(struct smc_cache *smc_cache) { int i, j; - - emc_cache_init(&flow_cache->emc_cache); - for (i = 0; i < DFC_BUCKET_CNT; i++) { - for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) { - flow_cache->buckets[i].flow_idx[j] = UINT16_MAX; + for (i = 0; i < SMC_BUCKET_CNT; i++) { + for (j 
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 157befb..3d9de6f 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -48,7 +48,7 @@ extern "C" {
 
 enum pmd_stat_type {
     PMD_STAT_EXACT_HIT,     /* Packets that had an exact match (emc). */
-    PMD_STAT_DFC_HIT,       /* Packets that had a flow cache hit (dfc). */
+    PMD_STAT_SMC_HIT,       /* Packets that had a sig match hit (SMC). */
     PMD_STAT_MASKED_HIT,    /* Packets that matched in the flow table. */
     PMD_STAT_MISS,          /* Packets that did not match and upcall was ok. */
     PMD_STAT_LOST,          /* Packets that did not match and upcall failed. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 8fe736d..a170750 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -128,6 +128,7 @@ struct netdev_flow_key {
     uint64_t buf[FLOW_MAX_PACKET_U64S];
 };
 
+
 /* Datapath flow cache (DFC) for frequently used flows
  *
  * The cache uses the 32-bit hash of the packet (which can be the RSS hash) to
@@ -149,11 +150,11 @@ struct netdev_flow_key {
  * If dp_netdev_input is not called from a pmd thread, a mutex is used.
  */
 
-#define DFC_MASK_LEN 20
-#define DFC_ENTRY_PER_BUCKET 8
-#define DFC_ENTRIES (1u << DFC_MASK_LEN)
-#define DFC_BUCKET_CNT (DFC_ENTRIES / DFC_ENTRY_PER_BUCKET)
-#define DFC_MASK (DFC_BUCKET_CNT - 1)
+#define SMC_ENTRY_PER_BUCKET 4
+#define SMC_ENTRIES (1u << 20)
+#define SMC_BUCKET_CNT (SMC_ENTRIES / SMC_ENTRY_PER_BUCKET)
+#define SMC_MASK (SMC_BUCKET_CNT - 1)
+
 #define EMC_MASK_LEN 14
 #define EMC_ENTRIES (1u << EMC_MASK_LEN)
 #define EMC_MASK (EMC_ENTRIES - 1)
@@ -173,14 +174,19 @@ struct emc_cache {
     int sweep_idx;
 };
 
-struct dfc_bucket {
-    uint16_t sig[DFC_ENTRY_PER_BUCKET];
-    uint16_t flow_idx[DFC_ENTRY_PER_BUCKET];
+struct smc_bucket {
+    uint16_t sig[SMC_ENTRY_PER_BUCKET];
+    uint16_t flow_idx[SMC_ENTRY_PER_BUCKET];
+};
+
+/* Signature match cache, distinct from the EMC cache. */
+struct smc_cache {
+    struct smc_bucket buckets[SMC_BUCKET_CNT];
 };
 
 struct dfc_cache {
     struct emc_cache emc_cache;
-    struct dfc_bucket buckets[DFC_BUCKET_CNT];
+    struct smc_cache smc_cache;
 };
 
@@ -220,7 +226,7 @@ static void dpcls_insert(struct dpcls *, struct dpcls_rule *,
                          const struct netdev_flow_key *mask);
 static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
 static bool dpcls_lookup(struct dpcls *cls,
-                         const struct netdev_flow_key keys[],
+                         const struct netdev_flow_key *keys[],
                          struct dpcls_rule **rules, size_t cnt,
                          int *num_lookups_p);
 static bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
@@ -722,7 +728,7 @@ static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
                                       struct tx_port *tx);
 
 static void emc_clear_entry(struct emc_entry *ce);
-static void dfc_clear_entry(struct dfc_bucket *b, int idx);
+static void smc_clear_entry(struct smc_bucket *b, int idx);
 
 static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
@@ -743,19 +749,24 @@ emc_cache_init(struct emc_cache *emc)
 }
 
 static void
-dfc_cache_init(struct dfc_cache *flow_cache)
+smc_cache_init(struct smc_cache *smc_cache)
 {
     int i, j;
-
-    emc_cache_init(&flow_cache->emc_cache);
-    for (i = 0; i < DFC_BUCKET_CNT; i++) {
-        for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) {
-            flow_cache->buckets[i].flow_idx[j] = UINT16_MAX;
+    for (i = 0; i < SMC_BUCKET_CNT; i++) {
+        for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) {
+            smc_cache->buckets[i].flow_idx[j] = UINT16_MAX;
         }
     }
 }
 
 static void
+dfc_cache_init(struct dfc_cache *flow_cache)
+{
+    emc_cache_init(&flow_cache->emc_cache);
+    smc_cache_init(&flow_cache->smc_cache);
+}
+
+static void
 emc_cache_uninit(struct emc_cache *emc)
 {
     int i;
@@ -766,15 +777,21 @@ emc_cache_uninit(struct emc_cache *emc)
 }
 
 static void
-dfc_cache_uninit(struct dfc_cache *flow_cache)
+smc_cache_uninit(struct smc_cache *smc)
 {
     int i, j;
 
-    for (i = 0; i < DFC_BUCKET_CNT; i++) {
-        for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) {
-            dfc_clear_entry(&(flow_cache->buckets[i]), j);
+    for (i = 0; i < SMC_BUCKET_CNT; i++) {
+        for (j = 0; j < SMC_ENTRY_PER_BUCKET; j++) {
+            smc_clear_entry(&(smc->buckets[i]), j);
         }
     }
+}
+
+static void
+dfc_cache_uninit(struct dfc_cache *flow_cache)
+{
+    smc_cache_uninit(&flow_cache->smc_cache);
     emc_cache_uninit(&flow_cache->emc_cache);
 }
@@ -875,7 +892,7 @@ pmd_info_show_stats(struct ds *reply,
                   "\tpacket recirculations: %"PRIu64"\n"
                   "\tavg. datapath passes per packet: %.02f\n"
                   "\temc hits: %"PRIu64"\n"
-                  "\tdfc hits: %"PRIu64"\n"
+                  "\tsmc hits: %"PRIu64"\n"
                   "\tmegaflow hits: %"PRIu64"\n"
                   "\tavg. subtable lookups per megaflow hit: %.02f\n"
                   "\tmiss with success upcall: %"PRIu64"\n"
@@ -883,7 +900,7 @@ pmd_info_show_stats(struct ds *reply,
                   "\tavg. packets per output batch: %.02f\n",
                   total_packets, stats[PMD_STAT_RECIRC],
                   passes_per_pkt, stats[PMD_STAT_EXACT_HIT],
-                  stats[PMD_STAT_DFC_HIT],
+                  stats[PMD_STAT_SMC_HIT],
                   stats[PMD_STAT_MASKED_HIT], lookups_per_hit,
                   stats[PMD_STAT_MISS], stats[PMD_STAT_LOST],
                   packets_per_batch);
@@ -1501,7 +1518,7 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
         stats->n_flows += cmap_count(&pmd->flow_table);
         pmd_perf_read_counters(&pmd->perf_stats, pmd_stats);
         stats->n_hit += pmd_stats[PMD_STAT_EXACT_HIT];
-        stats->n_hit += pmd_stats[PMD_STAT_DFC_HIT];
+        stats->n_hit += pmd_stats[PMD_STAT_SMC_HIT];
         stats->n_hit += pmd_stats[PMD_STAT_MASKED_HIT];
         stats->n_missed += pmd_stats[PMD_STAT_MISS];
         stats->n_lost += pmd_stats[PMD_STAT_LOST];
@@ -2207,13 +2224,14 @@ emc_lookup(struct emc_cache *emc, const struct netdev_flow_key *key)
 }
 
 static inline const struct cmap_node *
-dfc_entry_get(struct dp_netdev_pmd_thread *pmd, struct dfc_cache *cache, const uint32_t hash)
+smc_entry_get(struct dp_netdev_pmd_thread *pmd, struct smc_cache *cache,
+              const uint32_t hash)
 {
-    struct dfc_bucket *bucket = &cache->buckets[hash & DFC_MASK];
+    struct smc_bucket *bucket = &cache->buckets[hash & SMC_MASK];
     uint16_t sig = hash >> 16;
     uint16_t index = UINT16_MAX;
 
-    for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) {
+    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
         if (bucket->sig[i] == sig) {
             index = bucket->flow_idx[i];
             break;
@@ -2227,79 +2245,44 @@
 
 static void
-dfc_clear_entry(struct dfc_bucket *b, int idx)
+smc_clear_entry(struct smc_bucket *b, int idx)
 {
     b->flow_idx[idx] = UINT16_MAX;
 }
 
-static inline void
-dfc_insert(struct dp_netdev_pmd_thread *pmd,
+static inline int
+smc_insert(struct dp_netdev_pmd_thread *pmd,
            const struct netdev_flow_key *key,
-           uint16_t index)
+           uint32_t hash)
 {
-    struct dfc_cache *cache = &pmd->flow_cache;
-    struct dfc_bucket *bucket = &cache->buckets[key->hash & DFC_MASK];
+    struct smc_cache *smc_cache = &pmd->flow_cache.smc_cache;
+    struct smc_bucket *bucket = &smc_cache->buckets[key->hash & SMC_MASK];
+    uint16_t index;
+    uint32_t cmap_index = cmap_find_index(&pmd->flow_table, hash);
+    index = (cmap_index >= UINT16_MAX) ? UINT16_MAX : (uint16_t)cmap_index;
 
     if (index == UINT16_MAX) {
-        return;
+        return 0;
     }
 
     uint16_t sig = key->hash >> 16;
-    for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) {
+    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
         if(bucket->sig[i] == sig) {
             bucket->flow_idx[i] = index;
-            return;
+            return 1;
         }
     }
-    for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) {
+    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
         if(bucket->flow_idx[i] == UINT16_MAX) {
             bucket->sig[i] = sig;
             bucket->flow_idx[i] = index;
-            return;
+            return 1;
         }
     }
-    int idx = random_uint32() & (DFC_ENTRY_PER_BUCKET - 1);
+    int idx = random_uint32() & (SMC_ENTRY_PER_BUCKET - 1);
     bucket->sig[idx] = sig;
     bucket->flow_idx[idx] = index;
-}
-
-static inline struct dp_netdev_flow *
-dfc_lookup(struct dp_netdev_pmd_thread *pmd,
-           struct netdev_flow_key *key,
-           bool *exact_match)
-{
-    struct dp_netdev_flow *flow;
-    uint32_t cur_min;
-    struct dfc_cache *cache = &pmd->flow_cache;
-    const struct cmap_node *flow_node;
-
-    /* Try an EMC lookup first. */
-    atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
-
-    if(cur_min) {
-        flow = emc_lookup(&cache->emc_cache, key);
-    }
-    else {
-        flow = NULL;
-    }
-    if (flow) {
-        *exact_match = true;
-        return flow;
-    }
-
-    /* EMC lookup not successful: try DFC lookup. */
-    flow_node = dfc_entry_get(pmd, cache, key->hash);
-
-    CMAP_NODE_FOR_EACH (flow, node, flow_node) {
-        if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, key))) {
-            key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));
-            emc_probabilistic_insert(pmd, key, flow);
-            *exact_match = false;
-            return flow;
-        }
-    }
-
-    return NULL;
+    return 1;
 }
 
 /* Check and clear dead flow references slowly (one entry at each
@@ -2327,7 +2310,7 @@ dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
 
     cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
     if (OVS_LIKELY(cls)) {
-        dpcls_lookup(cls, key, &rule, 1, lookup_num_p);
+        dpcls_lookup(cls, &key, &rule, 1, lookup_num_p);
         netdev_flow = dp_netdev_flow_cast(rule);
     }
     return netdev_flow;
@@ -5087,6 +5070,57 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
     packet_batch_per_flow_update(batch, pkt, mf);
 }
 
+/* By batching SMC lookups, we can use prefetching
+ * to hide the memory access cost. */
+static inline void
+smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
+                 struct netdev_flow_key *keys,
+                 struct netdev_flow_key **missed_keys,
+                 struct dp_packet_batch *packets_,
+                 struct packet_batch_per_flow batches[],
+                 size_t *n_batches, const int cnt)
+{
+    int i;
+    struct dp_packet *packet;
+    size_t n_smc_hit = 0, n_missed = 0;
+    struct dfc_cache *cache = &pmd->flow_cache;
+    struct smc_cache *smc_cache = &cache->smc_cache;
+    const struct cmap_node *flow_node;
+
+    /* Prefetch buckets for all packets. */
+    for (i = 0; i < cnt; i++) {
+        OVS_PREFETCH(&smc_cache->buckets[keys[i].hash & SMC_MASK]);
+    }
+
+    DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
+        struct dp_netdev_flow *flow = NULL;
+        flow_node = smc_entry_get(pmd, smc_cache, keys[i].hash);
+
+        if (OVS_LIKELY(flow_node != NULL)) {
+            /* Opportunistically we only check the head of the node list. */
+            flow = CONTAINER_OF(flow_node, struct dp_netdev_flow, node);
+            if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, &keys[i]))) {
+                /* SMC hit and EMC miss: insert the flow into the EMC. */
+                keys[i].len =
+                    netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
+                emc_probabilistic_insert(pmd, &keys[i], flow);
+                dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
+                                        n_batches);
+                n_smc_hit++;
+                continue;
+            }
+        }
+
+        /* SMC missed. Group missed packets together at
+         * the beginning of the 'packets' array. */
+        dp_packet_batch_refill(packets_, packet, i);
+        /* Put missed keys into the pointer array returned to the caller. */
+        missed_keys[n_missed++] = &keys[i];
+    }
+
+    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
+}
+
 /* Try to process all ('cnt') the 'packets' using only the PMD flow cache
  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
  * miniflow is copied into 'keys' and the packet pointer is moved at the
@@ -5105,20 +5139,24 @@ static inline size_t
 dfc_processing(struct dp_netdev_pmd_thread *pmd,
                struct dp_packet_batch *packets_,
                struct netdev_flow_key *keys,
+               struct netdev_flow_key **missed_keys,
                struct packet_batch_per_flow batches[], size_t *n_batches,
                bool md_is_valid, odp_port_t port_no)
 {
     struct netdev_flow_key *key = &keys[0];
-    size_t n_missed = 0, n_dfc_hit = 0, n_emc_hit = 0;
+    size_t n_missed = 0, n_emc_hit = 0;
+    struct dfc_cache *cache = &pmd->flow_cache;
     struct dp_packet *packet;
     const size_t cnt = dp_packet_batch_size(packets_);
     int i;
-    bool exact_match;
+    uint32_t cur_min;
 
     pmd_perf_update_counter(&pmd->perf_stats,
                             md_is_valid ? PMD_STAT_RECIRC : PMD_STAT_RECV,
                             cnt);
+    atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
+
     DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
         struct dp_netdev_flow *flow;
@@ -5139,23 +5177,22 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
         }
         miniflow_extract(packet, &key->mf);
         key->len = 0; /* Not computed yet. */
-
-        if (!md_is_valid) {
-            key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
-                    &key->mf);
+        /* If the EMC is disabled, skip hash computation and emc_lookup. */
+        if (cur_min) {
+            if (!md_is_valid) {
+                key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
+                                &key->mf);
+            } else {
+                key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
+            }
+            flow = emc_lookup(&cache->emc_cache, key);
         } else {
-            key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
+            flow = NULL;
         }
-        flow = dfc_lookup(pmd, key, &exact_match);
-
         if (OVS_LIKELY(flow)) {
-            if (exact_match) {
-                n_emc_hit++;
-            } else {
-                n_dfc_hit++;
-            }
             dp_netdev_queue_batches(packet, flow, &key->mf, batches,
                                     n_batches);
+            n_emc_hit++;
         } else {
             /* Exact match cache missed. Group missed packets together at
              * the beginning of the 'packets' array. */
@@ -5168,7 +5205,12 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
     }
 
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, n_emc_hit);
-    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_DFC_HIT, n_dfc_hit);
+
+    /* Packets that miss the EMC are looked up in the SMC as a batch. */
+    if (n_missed != 0) {
+        smc_lookup_batch(pmd, keys, missed_keys, packets_, batches,
+                         n_batches, n_missed);
+    }
 
     return dp_packet_batch_size(packets_);
 }
@@ -5235,11 +5277,11 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
                                              add_actions->size);
         }
         ovs_mutex_unlock(&pmd->flow_mutex);
-        uint32_t cmap_index = cmap_find_index(&pmd->flow_table,
-                                  dp_netdev_flow_hash(&netdev_flow->ufid));
-        uint16_t index = (cmap_index >= UINT16_MAX) ? UINT16_MAX :
-                                                      (uint16_t)cmap_index;
-        dfc_insert(pmd, key, index);
+        uint32_t hash = dp_netdev_flow_hash(&netdev_flow->ufid);
+        if (!smc_insert(pmd, key, hash)) {
+            /* If SMC insertion failed, insert the flow into the EMC. */
+            emc_probabilistic_insert(pmd, key, netdev_flow);
+        }
     }
     return error;
 }
@@ -5247,7 +5289,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
 static inline void
 fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                      struct dp_packet_batch *packets_,
-                     struct netdev_flow_key *keys,
+                     struct netdev_flow_key **keys,
                      struct packet_batch_per_flow batches[],
                      size_t *n_batches,
                      odp_port_t in_port)
@@ -5269,12 +5311,13 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
     for (size_t i = 0; i < cnt; i++) {
         /* Key length is needed in all the cases, hash computed on demand. */
-        keys[i].len = netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
+        keys[i]->len = netdev_flow_key_size(miniflow_n_values(&keys[i]->mf));
     }
     /* Get the classifier for the in_port */
     cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
     if (OVS_LIKELY(cls)) {
-        any_miss = !dpcls_lookup(cls, keys, rules, cnt, &lookup_cnt);
+        any_miss = !dpcls_lookup(cls, (const struct netdev_flow_key **)keys,
+                                 rules, cnt, &lookup_cnt);
     } else {
         any_miss = true;
         memset(rules, 0, sizeof(rules));
@@ -5296,7 +5339,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
             /* It's possible that an earlier slow path execution installed
              * a rule covering this flow. In this case, it's a lot cheaper
             * to catch it here than execute a miss. */
-            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i],
+            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, keys[i],
                                                     &add_lookup_cnt);
             if (netdev_flow) {
                 lookup_cnt += add_lookup_cnt;
@@ -5304,7 +5347,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                 continue;
             }
 
-            int error = handle_packet_upcall(pmd, packet, &keys[i],
+            int error = handle_packet_upcall(pmd, packet, keys[i],
                                              &actions, &put_actions);
 
             if (OVS_UNLIKELY(error)) {
@@ -5334,12 +5377,13 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
         }
 
         flow = dp_netdev_flow_cast(rules[i]);
-        uint32_t cmap_index = cmap_find_index(&pmd->flow_table,
-                                              dp_netdev_flow_hash(&flow->ufid));
-        uint16_t index = (cmap_index >= UINT16_MAX) ? UINT16_MAX :
-                                                      (uint16_t)cmap_index;
-        dfc_insert(pmd, &keys[i], index);
-        dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
+        uint32_t hash = dp_netdev_flow_hash(&flow->ufid);
+        if (!smc_insert(pmd, keys[i], hash)) {
+            /* If SMC insertion failed, insert the flow into the EMC only. */
+            emc_probabilistic_insert(pmd, keys[i], flow);
+        }
+
+        dp_netdev_queue_batches(packet, flow, &keys[i]->mf, batches, n_batches);
     }
 
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
@@ -5369,17 +5413,18 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
 #endif
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
         struct netdev_flow_key keys[PKT_ARRAY_SIZE];
+    struct netdev_flow_key *missed_keys[PKT_ARRAY_SIZE];
    struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
     size_t n_batches;
     odp_port_t in_port;
 
     n_batches = 0;
-    dfc_processing(pmd, packets, keys, batches, &n_batches,
+    dfc_processing(pmd, packets, keys, missed_keys, batches, &n_batches,
                    md_is_valid, port_no);
     if (!dp_packet_batch_is_empty(packets)) {
         /* Get ingress port from first packet's metadata. */
         in_port = packets->packets[0]->md.in_port.odp_port;
-        fast_path_processing(pmd, packets, keys,
+        fast_path_processing(pmd, packets, missed_keys,
                              batches, &n_batches, in_port);
     }
@@ -6352,7 +6397,7 @@ dpcls_rule_matches_key(const struct dpcls_rule *rule,
  *
  * Returns true if all miniflows found a corresponding rule. */
 static bool
-dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
+dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
              struct dpcls_rule **rules, const size_t cnt,
             int *num_lookups_p)
 {
@@ -6391,7 +6436,7 @@ dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
          * masked with the subtable's mask to avoid hashing the wildcarded
          * bits. */
        ULLONG_FOR_EACH_1(i, keys_map) {
-            hashes[i] = netdev_flow_key_hash_in_mask(&keys[i],
+            hashes[i] = netdev_flow_key_hash_in_mask(keys[i],
                                                      &subtable->mask);
        }
         /* Lookup. */
@@ -6405,7 +6450,7 @@ dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
             struct dpcls_rule *rule;
 
             CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
-                if (OVS_LIKELY(dpcls_rule_matches_key(rule, &keys[i]))) {
+                if (OVS_LIKELY(dpcls_rule_matches_key(rule, keys[i]))) {
                     rules[i] = rule;
                     /* Even at 20 Mpps the 32-bit hit_cnt cannot wrap
                      * within one second optimization interval. */