[ovs-dev,RFC] dpif-netdev: dpcls per in_port with sorted subtables

Message ID 20160701134424.00000e93@web.de
State RFC

Commit Message

Jan Scheurich July 1, 2016, 11:44 a.m. UTC
This is a redesigned version of the earlier RFC patch “dpif-netdev: Sorted
subtable vectors per in_port in dpcls”. 

The user-space datapath (dpif-netdev) consists of a first level "exact match
cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With
many parallel packet flows (e.g. TCP connections) the EMC becomes inefficient
and the OVS forwarding performance is determined by the megaflow classifier.

The megaflow classifier (dpcls) consists of a variable number of hash tables
(aka subtables), each containing megaflow entries with the same mask of
packet header and metadata fields to match upon. A dpcls lookup matches a
given packet against all subtables in sequence until it hits a match. As
megaflow cache entries are by construction non-overlapping, the first match
is the only match.
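
To make the lookup sequence concrete, here is a minimal sketch of the loop
described above. It is not the actual OVS code; subtable_lookup() is a
hypothetical stand-in for the per-subtable hash-table probe.

    struct dpcls_rule *
    dpcls_lookup_sketch(struct dpcls *cls, const struct netdev_flow_key *key)
    {
        struct dpcls_subtable *subtable;

        /* Try each subtable in turn.  Megaflow entries do not overlap, so
         * the first hit is the only possible hit. */
        PVECTOR_FOR_EACH (subtable, &cls->subtables) {
            struct dpcls_rule *rule = subtable_lookup(subtable, key);
            if (rule) {
                return rule;
            }
        }
        return NULL;    /* Miss: the packet goes to the upcall path. */
    }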

Today the order of the subtables in the dpcls is essentially random so that
on average a dpcls lookup has to visit N/2 subtables for a hit, where N is the
total number of subtables. Even though every single hash-table lookup is
fast, the performance of the current dpcls degrades when there are many
subtables.

How does the patch address this issue:

In reality there is often a strong correlation between the ingress port and a
small subset of subtables that have hits. The entire megaflow cache typically
decomposes nicely into partitions that are hit only by packets entering from
a range of similar ports (e.g. traffic from Phy -> VM vs. traffic from VM ->
Phy).

Therefore, maintaining a separate dpcls instance per ingress port with its
subtable vector sorted by frequency of hits reduces the average number of
subtable lookups in the dpcls to a minimum, even if the total number of
subtables gets large. This is possible because megaflows always have an exact
match on in_port, so every megaflow belongs to a unique dpcls instance.
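
The periodic re-sorting itself is cheap: each subtable's priority in the
pvector is simply set to its hit count for the last interval. A simplified
sketch (this mirrors dpcls_optimize() in the diff below):

    static void
    dpcls_sort_subtables_sketch(struct dpcls *cls)
    {
        struct dpcls_subtable *subtable;
        struct pvector *pvec = &cls->subtables;

        PVECTOR_FOR_EACH (subtable, pvec) {
            /* Subtables with more hits get a higher priority and are
             * therefore visited earlier during the next interval. */
            pvector_change_priority(pvec, subtable, subtable->hit_cnt);
            subtable->hit_cnt = 0;
        }
        pvector_publish(pvec);
    }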

To monitor the effectiveness of the patch we have enhanced the ovs-appctl
dpif-netdev/pmd-stats-show command with an extra line "avg. subtable lookups
per hit" to report the average number of subtable lookup needed for a
megaflow match. Ideally, this should be close to 1 and almost all cases much
smaller than N/2.
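
With the patch applied the extended output looks as follows (the values shown
here are placeholders; see the pmd.at changes below for the exact format):

    $ ovs-appctl dpif-netdev/pmd-stats-show
    pmd thread numa_id <N> core_id <N>:
            emc hits:<n>
            megaflow hits:<n>
            avg. subtable lookups per hit:<n.nn>
            miss:<n>
            lost:<n>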

I have benchmarked a cloud L3 overlay pipeline with a VXLAN overlay mesh.
With pure L3 tenant traffic between VMs on different nodes the resulting
netdev dpcls contains N=4 subtables.

Disabling the EMC, I have measured a baseline performance (in+out) of ~1.32
Mpps (64 bytes, 1000 L4 flows). The average number of subtable lookups per
dpcls match is 2.5.

With the patch the average number of subtable lookups per dpcls match is
reduced to 1 and the forwarding performance grows by ~30% to 1.72 Mpps.

As the actual number of subtables will often be higher in reality, we can
assume that this is at the lower end of the speed-up one can expect from this
optimization. Just running a parallel ping between the VXLAN tunnel endpoints
increases the number of subtables and hence the average number of subtable
lookups from 2.5 to 3.5 with a corresponding decrease of throughput to 1.14
Mpps. With the patch the parallel ping has no impact on average number of
subtable lookups and performance. The performance gain is then ~50%.

The main change compared to the previous patch is that instead of having a
subtable vector per in_port in a single dpcls instance, we now have one dpcls
instance (with a single subtable vector) per ingress port. This is better
aligned with the design of the base code and also reduces the number of
subtable lookups in the miss case.
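
For reference, locating the per-port classifier on the fast path is a single
cmap lookup keyed by a hash of the in_port (a sketch that mirrors
dp_netdev_pmd_lookup_dpcls() in the diff below):

    static struct dpcls *
    lookup_port_classifier_sketch(struct dp_netdev_pmd_thread *pmd,
                                  odp_port_t in_port)
    {
        struct dpcls *cls;
        uint32_t hash = hash_port_no(in_port);

        /* Each dpcls is keyed by its in_port in the pmd's 'classifiers' cmap. */
        CMAP_FOR_EACH_WITH_HASH (cls, node, hash, &pmd->classifiers) {
            if (cls->in_port == in_port) {
                return cls;
            }
        }
        return NULL;    /* No megaflow installed yet for this in_port. */
    }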

The PMD tests have been adjusted to the additional line in pmd-stats-show.

Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>


---

 lib/dpif-netdev.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
 tests/pmd.at      |   6 +++--
 2 files changed, 165 insertions(+), 25 deletions(-)

Comments

Jarno Rajahalme July 4, 2016, 12:05 p.m. UTC | #1
I like this RFC patch and suggest you send a non-RFC patch after taking care of the following:

- There are numerous coding style violations. I've done my best to note them below, but you should also read 'CodingStyle.md'.

- The subject line should have 'Patch' in all upper-case and have the version number follow it (e.g., "[RFC PATCH v2]...").

- You should analyze the optimization code path to make sure there are no blocking locks being taken, as we have been carefully removing any such blocking to eliminate the associated latency spikes. It would be best to make a statement about this in the commit message as well.

I'd like someone else to review the non-RFC patch to have another set of eyes on this code.

  Jarno

> On Jul 1, 2016, at 4:44 AM, Jan Scheurich <jan.scheurich@web.de> wrote:
> 
> This is a redesigned version of the earlier RFC patch “dpif-netdev: Sorted
> subtable vectors per in_port in dpcls”. 
> 
> The user-space datapath (dpif-netdev) consists of a first level "exact match
> cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With
> many parallel packet flows (e.g. TCP connections) the EMC becomes inefficient
> and the OVS forwarding performance is determined by the megaflow classifier.
> 
> The megaflow classifier (dpcls) consists of a variable number of hash tables
> (aka subtables), each containing megaflow entries with the same mask of
> packet header and metadata fields to match upon. A dpcls lookup matches a
> given packet against all subtables in sequence until it hits a match. As
> megaflow cache entries are by construction non-overlapping, the first match
> is the only match.
> 
> Today the order of the subtables in the dpcls is essentially random so that
> on average a dpcsl lookup has to visit N/2 subtables for a hit, when N is the
> total number of subtables. Even though every single hash-table lookup is
> fast, the performance of the current dpcls degrades when there are many
> subtables.
> 
> How does the patch address this issue:
> 
> In reality there is often a strong correlation between the ingress port and a
> small subset of subtables that have hits. The entire megaflow cache typically
> decomposes nicely into partitions that are hit only by packets entering from
> a range of similar ports (e.g. traffic from Phy  -> VM vs. traffic from VM ->
> Phy). 
> 
> Therefore, maintaining a separate dpcls instance per ingress port with its
> subtable vector sorted by frequency of hits reduces the average number of
> subtables lookups in the dpcls to a minimum, even if the total number of
> subtables gets large. This is possible because megaflows always have an exact
> match on in_port, so every megaflow belongs to unique dpcls instance.
> 
> To monitor the effectiveness of the patch we have enhanced the ovs-appctl
> dpif-netdev/pmd-stats-show command with an extra line "avg. subtable lookups
> per hit" to report the average number of subtable lookup needed for a
> megaflow match. Ideally, this should be close to 1 and almost all cases much
> smaller than N/2.
> 
> I have benchmarked a cloud L3 overlay pipeline with a VXLAN overlay mesh.
> With pure L3 tenant traffic between VMs on different nodes the resulting
> netdev dpcls contains N=4 subtables.
> 
> Disabling the EMC, I have measured a baseline performance (in+out) of ~1.32
> Mpps (64 bytes, 1000 L4 flows). The average number of subtable lookups per
> dpcls match is 2.5.
> 
> With the patch the average number of subtable lookups per dpcls match is
> reduced 1 and the forwarding performance grows by ~30% to 1.72 Mpps. 
> 
> As the actual number of subtables will often be higher in reality, we can
> assume that this is at the lower end of the speed-up one can expect from this
> optimization. Just running a parallel ping between the VXLAN tunnel endpoints
> increases the number of subtables and hence the average number of subtable
> lookups from 2.5 to 3.5 with a corresponding decrease of throughput to 1.14
> Mpps. With the patch the parallel ping has no impact on average number of
> subtable lookups and performance. The performance gain is then ~50%.
> 
> The main change to the previous patch is that instead of having a subtable
> vector per in_port in a single dplcs instance, we now have one dpcls instance
> with a single subtable per ingress port. This is better aligned with the
> design base code and also improves the number of subtable lookups in a miss
> case.
> 
> The PMD tests have been adjusted to the additional line in pmd-stats-show.
> 
> Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
> 
> 
> ---
> 
> lib/dpif-netdev.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
> tests/pmd.at      |   6 +++--
> 2 files changed, 165 insertions(+), 25 deletions(-)
> 
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index ff4227c..43d6e61 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -72,6 +72,7 @@
> 
> VLOG_DEFINE_THIS_MODULE(dpif_netdev);
> 
> +

Added empty line for no reason.

> #define FLOW_DUMP_MAX_BATCH 50
> /* Use per thread recirc_depth to prevent recirculation loop. */
> #define MAX_RECIRC_DEPTH 5
> @@ -149,7 +150,12 @@ struct emc_cache {
> 
> /* Simple non-wildcarding single-priority classifier. */
> 
> +#define NUM_SUBTABLE_SLOTS 32
> +#define DPCLS_OPTIMIATION_INTERVAL 1000

Spelling mistake, should be "optimization"

> +
> struct dpcls {
> +    struct cmap_node node;
> +    odp_port_t in_port;
>     struct cmap subtables_map;
>     struct pvector subtables;
> };
> @@ -164,12 +170,14 @@ struct dpcls_rule {
> 
> static void dpcls_init(struct dpcls *);
> static void dpcls_destroy(struct dpcls *);
> +static void dpcls_optimize(struct dpcls *);
> static void dpcls_insert(struct dpcls *, struct dpcls_rule *,
>                          const struct netdev_flow_key *mask);
> static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
> -static bool dpcls_lookup(const struct dpcls *cls,
> +static bool dpcls_lookup(struct dpcls *cls,
>                          const struct netdev_flow_key keys[],
> -                         struct dpcls_rule **rules, size_t cnt);
> +                         struct dpcls_rule **rules, size_t cnt,
> +                         int *num_lookups_p);
> 
> /* Datapath based on the network device interface from netdev.h.
>  *
> @@ -239,6 +247,7 @@ enum dp_stat_type {
>     DP_STAT_MASKED_HIT,         /* Packets that matched in the flow table. */
>     DP_STAT_MISS,               /* Packets that did not match. */
>     DP_STAT_LOST,               /* Packets not passed up to the client. */
> +    DP_STAT_LOOKUP_HIT,        /* Number of subtable lookups for flow table hits */
>     DP_N_STATS
> };
> 
> @@ -419,15 +428,19 @@ struct dp_netdev_pmd_thread {
>      * will only be accessed by its own pmd thread. */
>     struct emc_cache flow_cache;
> 
> -    /* Classifier and Flow-Table.
> +    /* Flow-Table and classifiers
>      *
>      * Writers of 'flow_table' must take the 'flow_mutex'.  Corresponding
> -     * changes to 'cls' must be made while still holding the 'flow_mutex'.
> +     * changes to 'classifiesr' must be made while still holding the 'flow_mutex'.
>      */
>     struct ovs_mutex flow_mutex;
> -    struct dpcls cls;
>     struct cmap flow_table OVS_GUARDED; /* Flow table. */
> 
> +    /* One classifier per in_port polled by the pmd */
> +    struct cmap classifiers;
> +    /* Periodically sort subtable vectors according to hit frequencies */
> +    long long int next_optimization;
> +
>     /* Statistics. */
>     struct dp_netdev_pmd_stats stats;
> 
> @@ -538,6 +551,8 @@ static void dp_netdev_pmd_unref(struct dp_netdev_pmd_thread *pmd);
> static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd);
> static void pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
>     OVS_REQUIRES(pmd->port_mutex);
> +static inline void
> +dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd);
> 
> static inline bool emc_entry_alive(struct emc_entry *ce);
> static void emc_clear_entry(struct emc_entry *ce);
> @@ -657,8 +672,10 @@ pmd_info_show_stats(struct ds *reply,
> 
>     ds_put_format(reply,
>                   "\temc hits:%llu\n\tmegaflow hits:%llu\n"
> +                  "\tavg. subtable lookups per hit:%.2f\n"
>                   "\tmiss:%llu\n\tlost:%llu\n",
>                   stats[DP_STAT_EXACT_HIT], stats[DP_STAT_MASKED_HIT],
> +                  stats[DP_STAT_MASKED_HIT] > 0 ? (1.0*stats[DP_STAT_LOOKUP_HIT])/stats[DP_STAT_MASKED_HIT] : 0,
>                   stats[DP_STAT_MISS], stats[DP_STAT_LOST]);
> 
>     if (total_cycles == 0) {
> @@ -1501,20 +1518,59 @@ dp_netdev_flow_hash(const ovs_u128 *ufid)
>     return ufid->u32[0];
> }
> 
> +static inline struct dpcls *
> +dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
> +                           odp_port_t in_port)
> +{
> +    struct dpcls *cls;
> +    uint32_t hash = hash_port_no(in_port);
> +    CMAP_FOR_EACH_WITH_HASH (cls, node, hash, &pmd->classifiers) {
> +        if (cls->in_port == in_port) {
> +            /* Port classifier exists already */
> +            return cls;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +static inline struct dpcls *
> +dp_netdev_pmd_find_dpcls(struct dp_netdev_pmd_thread *pmd,
> +                         odp_port_t in_port)
> +{
> +    struct dpcls *cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
> +    uint32_t hash = hash_port_no(in_port);
> +
> +    if (!cls) {
> +        /* Create new classifier for in_port */
> +        cls = xmalloc(sizeof(*cls));
> +        dpcls_init(cls);
> +        cls->in_port = in_port;
> +        cmap_insert(&pmd->classifiers, &cls->node, hash);
> +        VLOG_DBG("Creating dpcls %p for in_port %d", cls, in_port);
> +    }
> +    return cls;
> +}
> +
> static void
> dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd,
>                           struct dp_netdev_flow *flow)
>     OVS_REQUIRES(pmd->flow_mutex)
> {
>     struct cmap_node *node = CONST_CAST(struct cmap_node *, &flow->node);
> +    struct dpcls *cls;
> +    odp_port_t in_port = flow->flow.in_port.odp_port;
> 
> -    dpcls_remove(&pmd->cls, &flow->cr);
> +    cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
> +    if (cls) {


It would be better to assert that cls was found, as cmap_remove() requires the node to be removed to be in the cmap anyway.

> +        dpcls_remove(cls, &flow->cr);
> +    }
>     cmap_remove(&pmd->flow_table, node, dp_netdev_flow_hash(&flow->ufid));
>     flow->dead = true;
> 
>     dp_netdev_flow_unref(flow);
> }
> 
> +

Extra newline for no reason.

> static void
> dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd)
> {
> @@ -1855,15 +1911,20 @@ emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
> }
> 
> static struct dp_netdev_flow *
> -dp_netdev_pmd_lookup_flow(const struct dp_netdev_pmd_thread *pmd,
> -                          const struct netdev_flow_key *key)
> +dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
> +                          const struct netdev_flow_key *key,
> +                          int *lookup_num_p)
> {
> -    struct dp_netdev_flow *netdev_flow;
> +    struct dpcls *cls;
>     struct dpcls_rule *rule;
> +    odp_port_t in_port = MINIFLOW_GET_U32(&key->mf, in_port);
> +    struct dp_netdev_flow *netdev_flow = NULL;
> 
> -    dpcls_lookup(&pmd->cls, key, &rule, 1);
> -    netdev_flow = dp_netdev_flow_cast(rule);
> -
> +    cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
> +    if (cls) {

Could use OVS_LIKELY() here.

> +        dpcls_lookup(cls, key, &rule, 1, lookup_num_p);
> +        netdev_flow = dp_netdev_flow_cast(rule);
> +    }
>     return netdev_flow;
> }
> 
> @@ -2096,6 +2157,8 @@ dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
> {
>     struct dp_netdev_flow *flow;
>     struct netdev_flow_key mask;
> +    struct dpcls *cls;
> +    odp_port_t in_port = match->flow.in_port.odp_port;
> 
>     netdev_flow_mask_init(&mask, match);
>     /* Make sure wc does not have metadata. */
> @@ -2114,7 +2177,11 @@ dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
>     ovsrcu_set(&flow->actions, dp_netdev_actions_create(actions, actions_len));
> 
>     netdev_flow_key_init_masked(&flow->cr.flow, &match->flow, &mask);
> -    dpcls_insert(&pmd->cls, &flow->cr, &mask);
> +
> +    /* Select dpcls for in_port. Relies on in_port to be exact match */
> +    ovs_assert(match->wc.masks.in_port.odp_port == ~0);

I'm not sure if this is in CodingStyle.md, but I'd rather use UINT32_MAX than ~0.

> +    cls = dp_netdev_pmd_find_dpcls(pmd, in_port);
> +    dpcls_insert(cls, &flow->cr, &mask);
> 
>     cmap_insert(&pmd->flow_table, CONST_CAST(struct cmap_node *, &flow->node),
>                 dp_netdev_flow_hash(&flow->ufid));
> @@ -2195,7 +2262,7 @@ dpif_netdev_flow_put(struct dpif *dpif, const struct dpif_flow_put *put)
>     }
> 
>     ovs_mutex_lock(&pmd->flow_mutex);
> -    netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &key);
> +    netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &key, 0);

Use NULL instead of '0' for a null pointer.

>     if (!netdev_flow) {
>         if (put->flags & DPIF_FP_CREATE) {
>             if (cmap_count(&pmd->flow_table) < MAX_FLOWS) {
> @@ -2872,6 +2939,7 @@ reload:
> 
>             emc_cache_slow_sweep(&pmd->flow_cache);
>             coverage_try_clear();
> +            dp_netdev_pmd_try_optimize(pmd);

Make sure the code path can't block.

>             ovsrcu_quiesce();
> 
>             atomic_read_relaxed(&pmd->change_seq, &seq);
> @@ -3031,8 +3099,9 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>     ovs_mutex_init(&pmd->cond_mutex);
>     ovs_mutex_init(&pmd->flow_mutex);
>     ovs_mutex_init(&pmd->port_mutex);
> -    dpcls_init(&pmd->cls);
>     cmap_init(&pmd->flow_table);
> +    cmap_init(&pmd->classifiers);
> +    pmd->next_optimization = time_msec() + DPCLS_OPTIMIATION_INTERVAL;
>     ovs_list_init(&pmd->poll_list);
>     hmap_init(&pmd->tx_ports);
>     hmap_init(&pmd->port_cache);
> @@ -3048,10 +3117,16 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
> static void
> dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
> {
> +    struct dpcls *cls;
> +
>     dp_netdev_pmd_flow_flush(pmd);
> -    dpcls_destroy(&pmd->cls);
>     hmap_destroy(&pmd->port_cache);
>     hmap_destroy(&pmd->tx_ports);
> +    /* All flows (including their dpcls_rules) have been deleted already */
> +    CMAP_FOR_EACH(cls, node, &pmd->classifiers) {
> +        dpcls_destroy(cls);
> +    }
> +    cmap_destroy(&pmd->classifiers);
>     cmap_destroy(&pmd->flow_table);
>     ovs_mutex_destroy(&pmd->flow_mutex);
>     latch_destroy(&pmd->exit_latch);
> @@ -3221,6 +3296,7 @@ dp_netdev_del_port_from_pmd__(struct dp_netdev_port *port,
>         free(tx);
>         found = true;
>     }
> +

Don't add newline here, especially if you do not add one after the corresponding ovs_mutex_lock().

>     ovs_mutex_unlock(&pmd->port_mutex);
> 
>     return found;
> @@ -3796,7 +3872,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd, struct dp_packet *packet,
>          * to be locking everyone out of making flow installs.  If we
>          * move to a per-core classifier, it would be reasonable. */
>         ovs_mutex_lock(&pmd->flow_mutex);
> -        netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key);
> +        netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key, 0);

NULL.

>         if (OVS_LIKELY(!netdev_flow)) {
>             netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,
>                                              add_actions->data,
> @@ -3822,18 +3898,32 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>     enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
> #endif
>     struct dp_packet **packets = packets_->packets;
> +    struct dpcls *cls;
>     struct dpcls_rule *rules[PKT_ARRAY_SIZE];
>     struct dp_netdev *dp = pmd->dp;
>     struct emc_cache *flow_cache = &pmd->flow_cache;
>     int miss_cnt = 0, lost_cnt = 0;
> +    int lookup_cnt = 0, add_lookup_cnt;
>     bool any_miss;
>     size_t i;
> +    odp_port_t in_port;
> 
>     for (i = 0; i < cnt; i++) {
>         /* Key length is needed in all the cases, hash computed on demand. */
>         keys[i].len = netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
>     }
> -    any_miss = !dpcls_lookup(&pmd->cls, keys, rules, cnt);
> +    /* Get ingress port from metadata of first packet. */
> +    /* Relies on all packets in patch having same ingress port! */

Maybe make this explicit by making 'in_port' a new argument for this function?

> +    ovs_assert (cnt > 0);
> +    in_port = packets[0]->md.in_port.odp_port;
> +    /* Get the classifier for the in_port */
> +    cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
> +    if (cls) {

Could use OVS_LIKELY() here.

> +        any_miss = !dpcls_lookup(cls, keys, rules, cnt, &lookup_cnt);
> +    } else {
> +        any_miss = true;
> +        memset(rules, 0, sizeof(rules));
> +    }
>     if (OVS_UNLIKELY(any_miss) && !fat_rwlock_tryrdlock(&dp->upcall_rwlock)) {
>         uint64_t actions_stub[512 / 8], slow_stub[512 / 8];
>         struct ofpbuf actions, put_actions;
> @@ -3851,8 +3941,9 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>             /* It's possible that an earlier slow path execution installed
>              * a rule covering this flow.  In this case, it's a lot cheaper
>              * to catch it here than execute a miss. */
> -            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i]);
> +            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i], &add_lookup_cnt);
>             if (netdev_flow) {
> +                lookup_cnt += add_lookup_cnt;
>                 rules[i] = &netdev_flow->cr;
>                 continue;
>             }
> @@ -3891,6 +3982,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>     }
> 
>     dp_netdev_count_packet(pmd, DP_STAT_MASKED_HIT, cnt - miss_cnt);
> +    dp_netdev_count_packet(pmd, DP_STAT_LOOKUP_HIT, lookup_cnt);
>     dp_netdev_count_packet(pmd, DP_STAT_MISS, miss_cnt);
>     dp_netdev_count_packet(pmd, DP_STAT_LOST, lost_cnt);
> }
> @@ -3947,7 +4039,7 @@ static void
> dp_netdev_recirculate(struct dp_netdev_pmd_thread *pmd,
>                       struct dp_packet_batch *packets)
> {
> -     dp_netdev_input__(pmd, packets, true, 0);
> +    dp_netdev_input__(pmd, packets, true, 0);

Might fix the whitespace in the preceding dp_netdev_input() as well, even better to put unrelated style fixes to a separate patch.

> }
> 
> struct dp_netdev_execute_aux {
> @@ -4372,6 +4464,7 @@ struct dpcls_subtable {
> 
>     /* These fields are accessed by readers. */
>     struct cmap rules;           /* Contains "struct dpcls_rule"s. */
> +    uint32_t hit_cnt;            /* Number of match hits in subtable. */

Add "for the current optimization interval"

>     struct netdev_flow_key mask; /* Wildcards for fields (const). */
>     /* 'mask' must be the last field, additional space is allocated here. */
> };
> @@ -4388,6 +4481,7 @@ dpcls_init(struct dpcls *cls)
> static void
> dpcls_destroy_subtable(struct dpcls *cls, struct dpcls_subtable *subtable)
> {
> +    VLOG_DBG("Destroying subtable %p for in_port %d", subtable, cls->in_port);

Suggest removing debugging from the final patch.

>     pvector_remove(&cls->subtables, subtable);
>     cmap_remove(&cls->subtables_map, &subtable->cmap_node,
>                 subtable->mask.hash);
> @@ -4422,10 +4516,13 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
>     subtable = xmalloc(sizeof *subtable
>                        - sizeof subtable->mask.mf + mask->len);
>     cmap_init(&subtable->rules);
> +    subtable->hit_cnt = 0;
>     netdev_flow_key_clone(&subtable->mask, mask);
>     cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
> +    /* Insert new subtable with priority zero (no hits yet) */
>     pvector_insert(&cls->subtables, subtable, 0);
>     pvector_publish(&cls->subtables);
> +    VLOG_DBG("Creating %lu. subtable %p for in_port %d", cmap_count(&cls->subtables_map), subtable, cls->in_port);
> 
>     return subtable;
> }
> @@ -4444,6 +4541,37 @@ dpcls_find_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
>     return dpcls_create_subtable(cls, mask);
> }
> 
> +
> +/* Periodically sort the dpcls subtable vectors according to hit counts */
> +static inline void
> +dpcls_optimize(struct dpcls *cls)
> +{
> +    struct dpcls_subtable *subtable;
> +
> +    struct pvector *pvec = &cls->subtables;
> +    PVECTOR_FOR_EACH (subtable, pvec) {
> +        pvector_change_priority(pvec, subtable, subtable->hit_cnt);
> +        subtable->hit_cnt = 0;
> +    }
> +    pvector_publish(pvec);
> +}
> +
> +static inline void
> +dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd)
> +{
> +    struct dpcls *cls;
> +    long long int now = time_msec();
> +
> +    if (now > pmd->next_optimization) {
> +        /* Optimize each classifier */
> +        CMAP_FOR_EACH(cls, node, &pmd->classifiers) {
> +            dpcls_optimize(cls);
> +        }
> +        /* Start new measuring interval */
> +        pmd->next_optimization = now + DPCLS_OPTIMIATION_INTERVAL;
> +    }
> +}
> +
> /* Insert 'rule' into 'cls'. */
> static void
> dpcls_insert(struct dpcls *cls, struct dpcls_rule *rule,
> @@ -4451,6 +4579,7 @@ dpcls_insert(struct dpcls *cls, struct dpcls_rule *rule,
> {
>     struct dpcls_subtable *subtable = dpcls_find_subtable(cls, mask);
> 
> +    /* Refer to subtable's mask, also for later removal */

Add the trailing '.'

>     rule->mask = &subtable->mask;
>     cmap_insert(&subtable->rules, &rule->cmap_node, rule->flow.hash);
> }
> @@ -4463,10 +4592,11 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
> 
>     ovs_assert(rule->mask);
> 
> +    /* Get subtable from reference in rule->mask. */
>     INIT_CONTAINER(subtable, rule->mask, mask);
> -
>     if (cmap_remove(&subtable->rules, &rule->cmap_node, rule->flow.hash)
>         == 0) {
> +        /* Delete empty subtable. */
>         dpcls_destroy_subtable(cls, subtable);
>         pvector_publish(&cls->subtables);
>     }
> @@ -4501,8 +4631,9 @@ dpcls_rule_matches_key(const struct dpcls_rule *rule,
>  *
>  * Returns true if all flows found a corresponding rule. */
> static bool
> -dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
> -             struct dpcls_rule **rules, const size_t cnt)
> +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
> +             struct dpcls_rule **rules, const size_t cnt,
> +             int *num_lookups_p)
> {
>     /* The batch size 16 was experimentally found faster than 8 or 32. */
>     typedef uint16_t map_type;
> @@ -4522,6 +4653,8 @@ dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
>     }
>     memset(rules, 0, cnt * sizeof *rules);
> 
> +    int lookups_match = 0, subtable_pos = 1;
> +
>     PVECTOR_FOR_EACH (subtable, &cls->subtables) {
>         const struct netdev_flow_key *mkeys = keys;
>         struct dpcls_rule **mrules = rules;
> @@ -4554,6 +4687,8 @@ dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
>                 CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
>                     if (OVS_LIKELY(dpcls_rule_matches_key(rule, &mkeys[i]))) {
>                         mrules[i] = rule;
> +                        subtable->hit_cnt++;
> +                        lookups_match += subtable_pos;
>                         goto next;
>                     }
>                 }
> @@ -4565,8 +4700,11 @@ dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
>             remains |= maps[m];
>         }
>         if (!remains) {
> +            if (num_lookups_p) *num_lookups_p = lookups_match;

if statements must have the code block in curly braces and not on the same line.

>             return true;              /* All found. */
>         }
> +        subtable_pos++;
>     }
> +    if (num_lookups_p) *num_lookups_p = lookups_match;

Ditto.

>     return false;                     /* Some misses. */
> }
> diff --git a/tests/pmd.at b/tests/pmd.at
> index 3216762..c033047 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -162,10 +162,11 @@ dummy@ovs-dummy: hit:0 missed:0
> 		p0 7/1: (dummy-pmd: configured_rx_queues=4, configured_tx_queues=<cleared>, requested_rx_queues=4, requested_tx_queues=<cleared>)
> ])
> 
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 4], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 5], [0], [dnl
> pmd thread numa_id <cleared> core_id <cleared>:
> 	emc hits:0
> 	megaflow hits:0
> +	avg. subtable lookups per hit:0.00
> 	miss:0
> 	lost:0
> ])
> @@ -188,10 +189,11 @@ AT_CHECK([cat ovs-vswitchd.log | filter_flow_install | strip_xout], [0], [dnl
> recirc_id(0),in_port(1),eth(src=50:54:00:00:00:77,dst=50:54:00:00:01:78),eth_type(0x0800),ipv4(frag=no), actions: <del>
> ])
> 
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 4], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 5], [0], [dnl
> pmd thread numa_id <cleared> core_id <cleared>:
> 	emc hits:19
> 	megaflow hits:0
> +	avg. subtable lookups per hit:0.00
> 	miss:1
> 	lost:0
> ])
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
Jan Scheurich July 7, 2016, 8:27 a.m. UTC | #2
> From: Jarno Rajahalme [mailto:jarno@ovn.org]
> Sent: Monday, 04 July, 2016 14:06
>
> I like this RFC patch and suggest you send a non-RFC patch after taking care
> of the following:

Thanks for the review. Will implement all requested changes and submit non-RFC PATCH v2 when ready.

> - You should analyze the optimization code path to make sure there are no
> blocking locks being taken, as we have been carefully removing any such
> blocking to eliminate the associated latency spikes. It would be best to make
> a statement about this in the commit message as well.


Everything looks fine except for the final call to pvector_publish(). This internally calls ovsrcu_postpone() for freeing the old impl vector. If I read the postpone function correctly, it may need to flush the cbset to a guarded list of flushed callbacks, i.e. it might block. Do you consider this a blocking lock that requires changes?

I actually have some doubts about thread safety of the dpcls complex. My understanding is that dpcls rules are inserted only by the owning PMD thread itself as a result of processing upcalls. But dpcls rules are removed by revalidator threads as part of deletion of e.g. expired dpif flow entries. I may be missing something here, but I don't see how struct dpcls and its constituent parts (struct cmap subtables_map, struct pvector subtables, struct dpcls_subtable's cmap rules) are protected against concurrent modifications by the owning PMD and revalidator threads.

Since we obviously can't have locks in the PMD loop, either all modifications would have to be handled by non-PMD threads synchronized through locks/mutexes, or all modification must be done by the PMD thread itself, which means the dpif_ops should be passed to the PMD thread e.g. through a DPDK ring, for execution.

Regards, Jan
Jan Scheurich July 7, 2016, 4:11 p.m. UTC | #3
> From: Jan Scheurich
> Sent: Thursday, 07 July, 2016 10:28
>
> I actually have some doubts about thread safety of the dpcls complex. My
> understanding is that dpcls rules are inserted only by the owning PMD thread
> itself as a result of processing upcalls. But dpcls rules are removed by
> revalidator threads as part of deletion of e.g. expired dpif flow entries. I may
> be missing something here, but I don't see how struct dpcls and its
> constituent parts (struct cmap subtables_map, struct pvector subtables,
> struct dpcls_subtable's cmap rules) are protected against concurrent
> modifications by the owning PMD and revalidator threads.


I realized that pmd->flow_mutex is used to prevent simultaneous updates of the flow table and dpcls classifiers by the PMD thread and revalidator threads. So my thread-safety concern seems to be taken care of. But it also means that the PMD thread may block for flow installation after an upcall.

The flow_mutex is only taken by the revalidator process for deleting one flow at a time, so the exclusion period is not very long, but there is a risk that interference between upcalls and revalidator sweeps slows down the PMD. Have you measured the effect of the additional mutex after upcall and taken a conscious decision to keep it that way?

In any case, I believe my patch cannot be applied without synchronizing through the flow_mutex while optimizing the subtable vectors. Would taking the mutex once per second for this be acceptable?

One alternative could be to run timer-controlled optimization from a non-PMD thread and synchronize that through the flow_mutex. This would only block the PMD if an upcall hit an ongoing optimization.

The optimal solution would probably be to do all modifications within the PMD thread and pass the flow delete requests from revalidator to PMD thread through a (pair of) DPDK ring(s).

Regards, Jan

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index ff4227c..43d6e61 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -72,6 +72,7 @@ 
 
 VLOG_DEFINE_THIS_MODULE(dpif_netdev);
 
+
 #define FLOW_DUMP_MAX_BATCH 50
 /* Use per thread recirc_depth to prevent recirculation loop. */
 #define MAX_RECIRC_DEPTH 5
@@ -149,7 +150,12 @@  struct emc_cache {

 /* Simple non-wildcarding single-priority classifier. */
 
+#define NUM_SUBTABLE_SLOTS 32
+#define DPCLS_OPTIMIATION_INTERVAL 1000
+
 struct dpcls {
+    struct cmap_node node;
+    odp_port_t in_port;
     struct cmap subtables_map;
     struct pvector subtables;
 };
@@ -164,12 +170,14 @@  struct dpcls_rule {
 
 static void dpcls_init(struct dpcls *);
 static void dpcls_destroy(struct dpcls *);
+static void dpcls_optimize(struct dpcls *);
 static void dpcls_insert(struct dpcls *, struct dpcls_rule *,
                          const struct netdev_flow_key *mask);
 static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
-static bool dpcls_lookup(const struct dpcls *cls,
+static bool dpcls_lookup(struct dpcls *cls,
                          const struct netdev_flow_key keys[],
-                         struct dpcls_rule **rules, size_t cnt);
+                         struct dpcls_rule **rules, size_t cnt,
+                         int *num_lookups_p);

 /* Datapath based on the network device interface from netdev.h.
  *
@@ -239,6 +247,7 @@  enum dp_stat_type {
     DP_STAT_MASKED_HIT,         /* Packets that matched in the flow table. */
     DP_STAT_MISS,               /* Packets that did not match. */
     DP_STAT_LOST,               /* Packets not passed up to the client. */
+    DP_STAT_LOOKUP_HIT,        /* Number of subtable lookups for flow table hits */
     DP_N_STATS
 };
 
@@ -419,15 +428,19 @@  struct dp_netdev_pmd_thread {
      * will only be accessed by its own pmd thread. */
     struct emc_cache flow_cache;
 
-    /* Classifier and Flow-Table.
+    /* Flow-Table and classifiers
      *
      * Writers of 'flow_table' must take the 'flow_mutex'.  Corresponding
-     * changes to 'cls' must be made while still holding the 'flow_mutex'.
+     * changes to 'classifiesr' must be made while still holding the 'flow_mutex'.
      */
     struct ovs_mutex flow_mutex;
-    struct dpcls cls;
     struct cmap flow_table OVS_GUARDED; /* Flow table. */
 
+    /* One classifier per in_port polled by the pmd */
+    struct cmap classifiers;
+    /* Periodically sort subtable vectors according to hit frequencies */
+    long long int next_optimization;
+
     /* Statistics. */
     struct dp_netdev_pmd_stats stats;
 
@@ -538,6 +551,8 @@  static void dp_netdev_pmd_unref(struct dp_netdev_pmd_thread *pmd);
 static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd);
 static void pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
     OVS_REQUIRES(pmd->port_mutex);
+static inline void
+dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd);
 
 static inline bool emc_entry_alive(struct emc_entry *ce);
 static void emc_clear_entry(struct emc_entry *ce);
@@ -657,8 +672,10 @@  pmd_info_show_stats(struct ds *reply,
 
     ds_put_format(reply,
                   "\temc hits:%llu\n\tmegaflow hits:%llu\n"
+                  "\tavg. subtable lookups per hit:%.2f\n"
                   "\tmiss:%llu\n\tlost:%llu\n",
                   stats[DP_STAT_EXACT_HIT], stats[DP_STAT_MASKED_HIT],
+                  stats[DP_STAT_MASKED_HIT] > 0 ? (1.0*stats[DP_STAT_LOOKUP_HIT])/stats[DP_STAT_MASKED_HIT] : 0,
                   stats[DP_STAT_MISS], stats[DP_STAT_LOST]);
 
     if (total_cycles == 0) {
@@ -1501,20 +1518,59 @@  dp_netdev_flow_hash(const ovs_u128 *ufid)
     return ufid->u32[0];
 }
 
+static inline struct dpcls *
+dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
+                           odp_port_t in_port)
+{
+    struct dpcls *cls;
+    uint32_t hash = hash_port_no(in_port);
+    CMAP_FOR_EACH_WITH_HASH (cls, node, hash, &pmd->classifiers) {
+        if (cls->in_port == in_port) {
+            /* Port classifier exists already */
+            return cls;
+        }
+    }
+    return NULL;
+}
+
+static inline struct dpcls *
+dp_netdev_pmd_find_dpcls(struct dp_netdev_pmd_thread *pmd,
+                         odp_port_t in_port)
+{
+    struct dpcls *cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
+    uint32_t hash = hash_port_no(in_port);
+
+    if (!cls) {
+        /* Create new classifier for in_port */
+        cls = xmalloc(sizeof(*cls));
+        dpcls_init(cls);
+        cls->in_port = in_port;
+        cmap_insert(&pmd->classifiers, &cls->node, hash);
+        VLOG_DBG("Creating dpcls %p for in_port %d", cls, in_port);
+    }
+    return cls;
+}
+
 static void
 dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd,
                           struct dp_netdev_flow *flow)
     OVS_REQUIRES(pmd->flow_mutex)
 {
     struct cmap_node *node = CONST_CAST(struct cmap_node *, &flow->node);
+    struct dpcls *cls;
+    odp_port_t in_port = flow->flow.in_port.odp_port;
 
-    dpcls_remove(&pmd->cls, &flow->cr);
+    cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
+    if (cls) {
+        dpcls_remove(cls, &flow->cr);
+    }
     cmap_remove(&pmd->flow_table, node, dp_netdev_flow_hash(&flow->ufid));
     flow->dead = true;
 
     dp_netdev_flow_unref(flow);
 }
 
+
 static void
 dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd)
 {
@@ -1855,15 +1911,20 @@  emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
 }
 
 static struct dp_netdev_flow *
-dp_netdev_pmd_lookup_flow(const struct dp_netdev_pmd_thread *pmd,
-                          const struct netdev_flow_key *key)
+dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
+                          const struct netdev_flow_key *key,
+                          int *lookup_num_p)
 {
-    struct dp_netdev_flow *netdev_flow;
+    struct dpcls *cls;
     struct dpcls_rule *rule;
+    odp_port_t in_port = MINIFLOW_GET_U32(&key->mf, in_port);
+    struct dp_netdev_flow *netdev_flow = NULL;
 
-    dpcls_lookup(&pmd->cls, key, &rule, 1);
-    netdev_flow = dp_netdev_flow_cast(rule);
-
+    cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
+    if (cls) {
+        dpcls_lookup(cls, key, &rule, 1, lookup_num_p);
+        netdev_flow = dp_netdev_flow_cast(rule);
+    }
     return netdev_flow;
 }
 
@@ -2096,6 +2157,8 @@  dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
 {
     struct dp_netdev_flow *flow;
     struct netdev_flow_key mask;
+    struct dpcls *cls;
+    odp_port_t in_port = match->flow.in_port.odp_port;
 
     netdev_flow_mask_init(&mask, match);
     /* Make sure wc does not have metadata. */
@@ -2114,7 +2177,11 @@  dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
     ovsrcu_set(&flow->actions, dp_netdev_actions_create(actions, actions_len));
 
     netdev_flow_key_init_masked(&flow->cr.flow, &match->flow, &mask);
-    dpcls_insert(&pmd->cls, &flow->cr, &mask);
+
+    /* Select dpcls for in_port. Relies on in_port to be exact match */
+    ovs_assert(match->wc.masks.in_port.odp_port == ~0);
+    cls = dp_netdev_pmd_find_dpcls(pmd, in_port);
+    dpcls_insert(cls, &flow->cr, &mask);
 
     cmap_insert(&pmd->flow_table, CONST_CAST(struct cmap_node *, &flow->node),
                 dp_netdev_flow_hash(&flow->ufid));
@@ -2195,7 +2262,7 @@  dpif_netdev_flow_put(struct dpif *dpif, const struct dpif_flow_put *put)
     }
 
     ovs_mutex_lock(&pmd->flow_mutex);
-    netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &key);
+    netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &key, 0);
     if (!netdev_flow) {
         if (put->flags & DPIF_FP_CREATE) {
             if (cmap_count(&pmd->flow_table) < MAX_FLOWS) {
@@ -2872,6 +2939,7 @@  reload:
 
             emc_cache_slow_sweep(&pmd->flow_cache);
             coverage_try_clear();
+            dp_netdev_pmd_try_optimize(pmd);
             ovsrcu_quiesce();
 
             atomic_read_relaxed(&pmd->change_seq, &seq);
@@ -3031,8 +3099,9 @@  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     ovs_mutex_init(&pmd->cond_mutex);
     ovs_mutex_init(&pmd->flow_mutex);
     ovs_mutex_init(&pmd->port_mutex);
-    dpcls_init(&pmd->cls);
     cmap_init(&pmd->flow_table);
+    cmap_init(&pmd->classifiers);
+    pmd->next_optimization = time_msec() + DPCLS_OPTIMIATION_INTERVAL;
     ovs_list_init(&pmd->poll_list);
     hmap_init(&pmd->tx_ports);
     hmap_init(&pmd->port_cache);
@@ -3048,10 +3117,16 @@  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
 static void
 dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
 {
+    struct dpcls *cls;
+
     dp_netdev_pmd_flow_flush(pmd);
-    dpcls_destroy(&pmd->cls);
     hmap_destroy(&pmd->port_cache);
     hmap_destroy(&pmd->tx_ports);
+    /* All flows (including their dpcls_rules) have been deleted already */
+    CMAP_FOR_EACH(cls, node, &pmd->classifiers) {
+        dpcls_destroy(cls);
+    }
+    cmap_destroy(&pmd->classifiers);
     cmap_destroy(&pmd->flow_table);
     ovs_mutex_destroy(&pmd->flow_mutex);
     latch_destroy(&pmd->exit_latch);
@@ -3221,6 +3296,7 @@  dp_netdev_del_port_from_pmd__(struct dp_netdev_port *port,
         free(tx);
         found = true;
     }
+
     ovs_mutex_unlock(&pmd->port_mutex);
 
     return found;
@@ -3796,7 +3872,7 @@  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd, struct dp_packet *packet,
          * to be locking everyone out of making flow installs.  If we
          * move to a per-core classifier, it would be reasonable. */
         ovs_mutex_lock(&pmd->flow_mutex);
-        netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key);
+        netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key, 0);
         if (OVS_LIKELY(!netdev_flow)) {
             netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,
                                              add_actions->data,
@@ -3822,18 +3898,32 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
     enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
 #endif
     struct dp_packet **packets = packets_->packets;
+    struct dpcls *cls;
     struct dpcls_rule *rules[PKT_ARRAY_SIZE];
     struct dp_netdev *dp = pmd->dp;
     struct emc_cache *flow_cache = &pmd->flow_cache;
     int miss_cnt = 0, lost_cnt = 0;
+    int lookup_cnt = 0, add_lookup_cnt;
     bool any_miss;
     size_t i;
+    odp_port_t in_port;
 
     for (i = 0; i < cnt; i++) {
         /* Key length is needed in all the cases, hash computed on demand. */
         keys[i].len = netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
     }
-    any_miss = !dpcls_lookup(&pmd->cls, keys, rules, cnt);
+    /* Get ingress port from metadata of first packet. */
+    /* Relies on all packets in patch having same ingress port! */
+    ovs_assert (cnt > 0);
+    in_port = packets[0]->md.in_port.odp_port;
+    /* Get the classifier for the in_port */
+    cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
+    if (cls) {
+        any_miss = !dpcls_lookup(cls, keys, rules, cnt, &lookup_cnt);
+    } else {
+        any_miss = true;
+        memset(rules, 0, sizeof(rules));
+    }
     if (OVS_UNLIKELY(any_miss) && !fat_rwlock_tryrdlock(&dp->upcall_rwlock)) {
         uint64_t actions_stub[512 / 8], slow_stub[512 / 8];
         struct ofpbuf actions, put_actions;
@@ -3851,8 +3941,9 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
             /* It's possible that an earlier slow path execution installed
              * a rule covering this flow.  In this case, it's a lot cheaper
              * to catch it here than execute a miss. */
-            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i]);
+            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i], &add_lookup_cnt);
             if (netdev_flow) {
+                lookup_cnt += add_lookup_cnt;
                 rules[i] = &netdev_flow->cr;
                 continue;
             }
@@ -3891,6 +3982,7 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
     }
 
     dp_netdev_count_packet(pmd, DP_STAT_MASKED_HIT, cnt - miss_cnt);
+    dp_netdev_count_packet(pmd, DP_STAT_LOOKUP_HIT, lookup_cnt);
     dp_netdev_count_packet(pmd, DP_STAT_MISS, miss_cnt);
     dp_netdev_count_packet(pmd, DP_STAT_LOST, lost_cnt);
 }
@@ -3947,7 +4039,7 @@  static void
 dp_netdev_recirculate(struct dp_netdev_pmd_thread *pmd,
                       struct dp_packet_batch *packets)
 {
-     dp_netdev_input__(pmd, packets, true, 0);
+    dp_netdev_input__(pmd, packets, true, 0);
 }
 
 struct dp_netdev_execute_aux {
@@ -4372,6 +4464,7 @@  struct dpcls_subtable {
 
     /* These fields are accessed by readers. */
     struct cmap rules;           /* Contains "struct dpcls_rule"s. */
+    uint32_t hit_cnt;            /* Number of match hits in subtable. */
     struct netdev_flow_key mask; /* Wildcards for fields (const). */
     /* 'mask' must be the last field, additional space is allocated here. */
 };
@@ -4388,6 +4481,7 @@  dpcls_init(struct dpcls *cls)
 static void
 dpcls_destroy_subtable(struct dpcls *cls, struct dpcls_subtable *subtable)
 {
+    VLOG_DBG("Destroying subtable %p for in_port %d", subtable, cls->in_port);
     pvector_remove(&cls->subtables, subtable);
     cmap_remove(&cls->subtables_map, &subtable->cmap_node,
                 subtable->mask.hash);
@@ -4422,10 +4516,13 @@  dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
     subtable = xmalloc(sizeof *subtable
                        - sizeof subtable->mask.mf + mask->len);
     cmap_init(&subtable->rules);
+    subtable->hit_cnt = 0;
     netdev_flow_key_clone(&subtable->mask, mask);
     cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
+    /* Insert new subtable with priority zero (no hits yet) */
     pvector_insert(&cls->subtables, subtable, 0);
     pvector_publish(&cls->subtables);
+    VLOG_DBG("Creating %lu. subtable %p for in_port %d", cmap_count(&cls->subtables_map), subtable, cls->in_port);
 
     return subtable;
 }
@@ -4444,6 +4541,37 @@  dpcls_find_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
     return dpcls_create_subtable(cls, mask);
 }
 
+
+/* Periodically sort the dpcls subtable vectors according to hit counts */
+static inline void
+dpcls_optimize(struct dpcls *cls)
+{
+    struct dpcls_subtable *subtable;
+
+    struct pvector *pvec = &cls->subtables;
+    PVECTOR_FOR_EACH (subtable, pvec) {
+        pvector_change_priority(pvec, subtable, subtable->hit_cnt);
+        subtable->hit_cnt = 0;
+    }
+    pvector_publish(pvec);
+}
+
+static inline void
+dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd)
+{
+    struct dpcls *cls;
+    long long int now = time_msec();
+
+    if (now > pmd->next_optimization) {
+        /* Optimize each classifier */
+        CMAP_FOR_EACH(cls, node, &pmd->classifiers) {
+            dpcls_optimize(cls);
+        }
+        /* Start new measuring interval */
+        pmd->next_optimization = now + DPCLS_OPTIMIATION_INTERVAL;
+    }
+}
+
 /* Insert 'rule' into 'cls'. */
 static void
 dpcls_insert(struct dpcls *cls, struct dpcls_rule *rule,
@@ -4451,6 +4579,7 @@  dpcls_insert(struct dpcls *cls, struct dpcls_rule *rule,
 {
     struct dpcls_subtable *subtable = dpcls_find_subtable(cls, mask);
 
+    /* Refer to subtable's mask, also for later removal */
     rule->mask = &subtable->mask;
     cmap_insert(&subtable->rules, &rule->cmap_node, rule->flow.hash);
 }
@@ -4463,10 +4592,11 @@  dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
 
     ovs_assert(rule->mask);
 
+    /* Get subtable from reference in rule->mask. */
     INIT_CONTAINER(subtable, rule->mask, mask);
-
     if (cmap_remove(&subtable->rules, &rule->cmap_node, rule->flow.hash)
         == 0) {
+        /* Delete empty subtable. */
         dpcls_destroy_subtable(cls, subtable);
         pvector_publish(&cls->subtables);
     }
@@ -4501,8 +4631,9 @@  dpcls_rule_matches_key(const struct dpcls_rule *rule,
  *
  * Returns true if all flows found a corresponding rule. */
 static bool
-dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
-             struct dpcls_rule **rules, const size_t cnt)
+dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
+             struct dpcls_rule **rules, const size_t cnt,
+             int *num_lookups_p)
 {
     /* The batch size 16 was experimentally found faster than 8 or 32. */
     typedef uint16_t map_type;
@@ -4522,6 +4653,8 @@  dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
     }
     memset(rules, 0, cnt * sizeof *rules);
 
+    int lookups_match = 0, subtable_pos = 1;
+
     PVECTOR_FOR_EACH (subtable, &cls->subtables) {
         const struct netdev_flow_key *mkeys = keys;
         struct dpcls_rule **mrules = rules;
@@ -4554,6 +4687,8 @@  dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
                 CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
                     if (OVS_LIKELY(dpcls_rule_matches_key(rule, &mkeys[i]))) {
                         mrules[i] = rule;
+                        subtable->hit_cnt++;
+                        lookups_match += subtable_pos;
                         goto next;
                     }
                 }
@@ -4565,8 +4700,11 @@  dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
             remains |= maps[m];
         }
         if (!remains) {
+            if (num_lookups_p) *num_lookups_p = lookups_match;
             return true;              /* All found. */
         }
+        subtable_pos++;
     }
+    if (num_lookups_p) *num_lookups_p = lookups_match;
     return false;                     /* Some misses. */
 }
diff --git a/tests/pmd.at b/tests/pmd.at
index 3216762..c033047 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -162,10 +162,11 @@  dummy@ovs-dummy: hit:0 missed:0
 		p0 7/1: (dummy-pmd: configured_rx_queues=4, configured_tx_queues=<cleared>, requested_rx_queues=4, requested_tx_queues=<cleared>)
 ])
 
-AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 4], [0], [dnl
+AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 5], [0], [dnl
 pmd thread numa_id <cleared> core_id <cleared>:
 	emc hits:0
 	megaflow hits:0
+	avg. subtable lookups per hit:0.00
 	miss:0
 	lost:0
 ])
@@ -188,10 +189,11 @@  AT_CHECK([cat ovs-vswitchd.log | filter_flow_install | strip_xout], [0], [dnl
 recirc_id(0),in_port(1),eth(src=50:54:00:00:00:77,dst=50:54:00:00:01:78),eth_type(0x0800),ipv4(frag=no), actions: <del>
 ])
 
-AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 4], [0], [dnl
+AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 5], [0], [dnl
 pmd thread numa_id <cleared> core_id <cleared>:
 	emc hits:19
 	megaflow hits:0
+	avg. subtable lookups per hit:0.00
 	miss:1
 	lost:0
 ])