diff mbox

[ovs-dev,RFC] dpif-netdev: Sorted subtable vectors per in_port in dpcls

Message ID CFF8EF42F1132E4CBE2BF0AB6C21C58D55E22CF3@ESESSMB205.ericsson.se
State RFC
Headers show

Commit Message

Jan Scheurich June 16, 2016, 1:56 p.m. UTC
The user-space datapath (dpif-netdev) consists of a first level "exact match cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With many parallel packet flows (e.g. TCP connections) the EMC becomes inefficient and the OVS forwarding performance is determined by the megaflow classifier.

The megaflow classifier (dpcls) consists of a variable number of hash tables (aka subtables), each containing megaflow entries with the same mask of packet header and metadata fields to match upon. A dpcls lookup matches a given packet against all subtables in sequence until it hits a match. As megaflow cache entries are by construction non-overlapping, the first match is the only match.

Today the order of the subtables in the dpcls is essentially random so that on average a dpcls lookup has to visit N/2 subtables for a hit, where N is the total number of subtables. Even though every single hash-table lookup is fast, the performance of the current dpcls degrades when there are many subtables.

How does the patch address this issue:

In reality there is often a strong correlation between the ingress port and a small subset of subtables that have hits. The entire megaflow cache typically decomposes nicely into partitions that are hit only by packets entering from a range of similar ports (e.g. traffic from Phy -> VM vs. traffic from VM -> Phy).

Therefore, keeping a separate list of subtables per ingress port, sorted by frequency of hits, reduces the average number of subtable lookups in the dpcls to a minimum, even if the total number of subtables gets large.

The patch introduces 32 subtable vectors per dpcls and hashes the ingress port to select the subtable vector. The patch also counts matches per 32 slots in each vector (hashing the subtable pointer to obtain the slot) and sorts the vectors according to match frequency every second.
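The selection logic described here can be sketched in a few lines of standalone C. The helper names below (mix32, vector_idx, slot_idx) are illustrative stand-ins for OVS's hash_port_no() and hash_pointer(), not the patch's actual code:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SUBTABLE_VECTORS 32
#define NUM_SUBTABLE_SLOTS 32

/* Generic 32-bit integer mixer, standing in for OVS's hash functions. */
static uint32_t mix32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

/* Pick the subtable vector for a given ingress port. */
static uint32_t vector_idx(uint32_t port_no)
{
    return mix32(port_no) % NUM_SUBTABLE_VECTORS;
}

/* Pick the hit-count slot for a given subtable pointer. */
static uint32_t slot_idx(const void *subtable)
{
    return mix32((uint32_t)(uintptr_t)subtable) % NUM_SUBTABLE_SLOTS;
}
```

A lookup then walks the pvector selected by vector_idx(port_no) in priority order and, on a match, bumps the hit counter in the slot selected by slot_idx(subtable) for use in the next periodic re-sort.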

To monitor the effectiveness of the patch we have enhanced the ovs-appctl dpif-netdev/pmd-stats-show command with an extra line "avg. subtable lookups per hit" to report the average number of subtable lookups needed for a megaflow match. Ideally, this should be close to 1 and much smaller than N/2.
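The new statistic is just the ratio of accumulated subtable lookups to megaflow hits, guarded against division by zero; a minimal sketch with illustrative names:

```c
#include <assert.h>

/* Average number of subtable lookups per megaflow hit. */
static double avg_lookups_per_hit(unsigned long long lookup_cnt,
                                  unsigned long long masked_hit_cnt)
{
    return masked_hit_cnt ? (double)lookup_cnt / masked_hit_cnt : 0.0;
}
```

For example, 25 accumulated lookups over 10 masked hits yields the 2.5 average reported in the baseline measurement below.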

I have benchmarked a cloud L3 overlay pipeline with a VXLAN overlay mesh. With pure L3 tenant traffic between VMs on different nodes the resulting netdev dpcls contains N=4 subtables.

With the EMC disabled, I have measured a baseline performance (in+out) of ~1.32 Mpps (64 bytes, 1000 L4 flows). The average number of subtable lookups per dpcls match is 2.5.

With the patch the average number of subtable lookups per dpcls match goes down to 1.25 (apparently there are still two ports of different nature hashed to the same vector, otherwise it should be exactly one). Even so the forwarding performance grows by ~30% to 1.72 Mpps. 

As the number of subtables will often be higher in reality, we can assume that this is at the lower end of the speed-up one can expect from this optimization. Just running a parallel ping between the VXLAN tunnel endpoints increases the number of subtables and hence the average number of subtable lookups from 2.5 to 3.5 with a corresponding decrease of throughput to 1.14 Mpps. With the patch the parallel ping has no impact on average number of subtable lookups and performance. The performance gain is then ~50%.

Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>

-----

 lib/dpif-netdev.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 92 insertions(+), 18 deletions(-)

Comments

Fischetti, Antonio June 20, 2016, 3:42 p.m. UTC | #1
Hi Jan,
that's an interesting approach. 
I added a few comments and questions below.

Thanks,
Antonio


> -----Original Message-----

> From: dev [mailto:dev-bounces@openvswitch.org] On Behalf Of Jan

> Scheurich

> Sent: Thursday, June 16, 2016 2:56 PM

> To: dev@openvswitch.org

> Subject: [ovs-dev] [RFC Patch] dpif-netdev: Sorted subtable vectors

> per in_port in dpcls

> 

> The user-space datapath (dpif-netdev) consists of a first level

> "exact match cache" (EMC) matching on 5-tuples and the normal

> megaflow classifier. With many parallel packet flows (e.g. TCP

> connections) the EMC becomes inefficient and the OVS forwarding

> performance is determined by the megaflow classifier.

> 

> The megaflow classifier (dpcls) consists of a variable number of hash

> tables (aka subtables), each containing megaflow entries with the

> same mask of packet header and metadata fields to match upon. A dpcls

> lookup matches a given packet against all subtables in sequence until

> it hits a match. As megaflow cache entries are by construction non-

> overlapping, the first match is the only match.

> 

> Today the order of the subtables in the dpcls is essentially random

> so that on average a dpcls lookup has to visit N/2 subtables for a

> hit, when N is the total number of subtables. Even though every

> single hash-table lookup is fast, the performance of the current

> dpcls degrades when there are many subtables.

> 

> How does the patch address this issue:

> 

> In reality there is often a strong correlation between the ingress

> port and a small subset of subtables that have hits. The entire

> megaflow cache typically decomposes nicely into partitions that are

> hit only by packets entering from a range of similar ports (e.g.

> traffic from Phy  -> VM vs. traffic from VM -> Phy).

> 

> Therefore, keeping a separate list of subtables per ingress port,

> sorted by frequency of hits, reduces the average number of subtable

> lookups in the dpcls to a minimum, even if the total number of

> subtables gets large.

> 

> The patch introduces 32 subtable vectors per dpcls and hashes the

> ingress port to select the subtable vector. The patch also counts

> matches per 32 slots in each vector (hashing the subtable pointer to

> obtain the slot) and sorts the vectors according to match frequency

> every second.

> 

> To monitor the effectiveness of the patch we have enhanced the ovs-

> appctl dpif-netdev/pmd-stats-show command with an extra line "avg.

> subtable lookups per hit" to report the average number of subtable

> lookups needed for a megaflow match. Ideally, this should be close to

> 1 and much smaller than N/2.

> 

> I have benchmarked a cloud L3 overlay pipeline with a VXLAN overlay

> mesh. With pure L3 tenant traffic between VMs on different nodes the

> resulting netdev dpcls contains N=4 subtables.

> 

> Disabling the EMC, I have measured a baseline performance (in+out) of

> ~1.32 Mpps (64 bytes, 1000 L4 flows). The average number of subtable

> lookups per dpcls match is 2.5.

> 

> With the patch the average number of subtable lookups per dpcls match

> goes down to 1.25 (apparently there are still two ports of different

> nature hashed to the same vector, otherwise it should be exactly

> one). Even so the forwarding performance grows by ~30% to 1.72 Mpps.


[Antonio F] 
What if the range of the hash values is partitioned per port type? 
I mean if the 1st range - say the first 10 values - are reserved for 
dpdk ports

if <is a dpdk port>  // range is [0..9]
    uint32_t port_idx = 0 + hash_port_no(port_no) % 10;
else    // range is [10..NUM_SUBTABLE_VECTORS-1]
    uint32_t port_idx = 10 + hash_port_no(port_no) % (NUM_SUBTABLE_VECTORS - 10);

Could this help to avoid collisions between ports of different types,
eg a dpdk with a vhostuser port?
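The suggestion above, spelled out as compilable code; is_dpdk stands in for a real port-type check and is not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SUBTABLE_VECTORS 32
#define DPDK_RANGE 10   /* vectors [0..9] reserved for dpdk ports */

static uint32_t partitioned_port_idx(uint32_t port_hash, int is_dpdk)
{
    if (is_dpdk) {
        /* dpdk ports hash into [0..9]. */
        return port_hash % DPDK_RANGE;
    }
    /* Other port types hash into [10..31], so the two classes can
     * never collide on a subtable vector. */
    return DPDK_RANGE + port_hash % (NUM_SUBTABLE_VECTORS - DPDK_RANGE);
}
```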


> 

> As the number of subtables will often be higher in reality, we can


[Antonio F] Do you know how many subtables can be used in a real scenario?


> assume that this is at the lower end of the speed-up one can expect

> from this optimization. Just running a parallel ping between the

> VXLAN tunnel endpoints increases the number of subtables and hence

> the average number of subtable lookups from 2.5 to 3.5 with a

> corresponding decrease of throughput to 1.14 Mpps. With the patch the

> parallel ping has no impact on average number of subtable lookups and

> performance. The performance gain is then ~50%.

> 

> Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>

> 

> -----

> 

>  lib/dpif-netdev.c | 110

> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

> +++---------------

>  1 file changed, 92 insertions(+), 18 deletions(-)

> 

> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c

> index 8d39d9e..c491110 100644

> --- a/lib/dpif-netdev.c

> +++ b/lib/dpif-netdev.c

> @@ -149,9 +149,15 @@ struct emc_cache {

> 

>  /* Simple non-wildcarding single-priority classifier. */

> 

> +#define NUM_SUBTABLE_VECTORS 32

> +#define NUM_SUBTABLE_SLOTS 32

> +#define DPCLS_OPTIMIATION_INTERVAL 1000

> +

>  struct dpcls {

>      struct cmap subtables_map;

> -    struct pvector subtables;

> +    struct pvector subtables[NUM_SUBTABLE_VECTORS];

> +    uint64_t hit_cnt[NUM_SUBTABLE_VECTORS][NUM_SUBTABLE_SLOTS];

> +    long long int next_optimization;

>  };

> 

>  /* A rule to be inserted to the classifier. */

> @@ -164,12 +170,14 @@ struct dpcls_rule {

> 

>  static void dpcls_init(struct dpcls *);

>  static void dpcls_destroy(struct dpcls *);

> +static void dpcls_try_optimize(struct dpcls *);

>  static void dpcls_insert(struct dpcls *, struct dpcls_rule *,

>                           const struct netdev_flow_key *mask);

>  static void dpcls_remove(struct dpcls *, struct dpcls_rule *);

> -static bool dpcls_lookup(const struct dpcls *cls,

> +static bool dpcls_lookup(struct dpcls *cls,

>                           const struct netdev_flow_key keys[],

> -                         struct dpcls_rule **rules, size_t cnt);

> +                         struct dpcls_rule **rules, size_t cnt,

> +                         const odp_port_t port_no, int

> *num_lookups_p);

> 

>  /* Datapath based on the network device interface from netdev.h.

>   *

> @@ -239,6 +247,7 @@ enum dp_stat_type {

>      DP_STAT_MASKED_HIT,         /* Packets that matched in the flow

> table. */

>      DP_STAT_MISS,               /* Packets that did not match. */

>      DP_STAT_LOST,               /* Packets not passed up to the

> client. */

> +    DP_STATS_LOOKUP_HIT,        /* Number of subtable lookups for

> flow table hits */

>      DP_N_STATS

>  };

> 

> @@ -657,8 +666,10 @@ pmd_info_show_stats(struct ds *reply,

> 

>      ds_put_format(reply,

>                    "\temc hits:%llu\n\tmegaflow hits:%llu\n"

> +                  "\tavg. subtable lookups per hit:%5.2f\n"

>                    "\tmiss:%llu\n\tlost:%llu\n",

>                    stats[DP_STAT_EXACT_HIT],

> stats[DP_STAT_MASKED_HIT],

> +                  stats[DP_STAT_MASKED_HIT] > 0 ?

> (1.0*stats[DP_STATS_LOOKUP_HIT])/stats[DP_STAT_MASKED_HIT] : 0,

>                    stats[DP_STAT_MISS], stats[DP_STAT_LOST]);

> 

>      if (total_cycles == 0) {

> @@ -1855,13 +1866,13 @@ emc_lookup(struct emc_cache *cache, const

> struct netdev_flow_key *key)

>  }

> 

>  static struct dp_netdev_flow *

> -dp_netdev_pmd_lookup_flow(const struct dp_netdev_pmd_thread *pmd,

> +dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,

>                            const struct netdev_flow_key *key)

>  {

>      struct dp_netdev_flow *netdev_flow;

>      struct dpcls_rule *rule;

> 

> -    dpcls_lookup(&pmd->cls, key, &rule, 1);

> +    dpcls_lookup(&pmd->cls, key, &rule, 1, 0, 0);  /* Use 0 as dummy

> in_port */

>      netdev_flow = dp_netdev_flow_cast(rule);

> 

>      return netdev_flow;

> @@ -2874,6 +2885,7 @@ reload:

> 

>              emc_cache_slow_sweep(&pmd->flow_cache);

>              coverage_try_clear();

> +            dpcls_try_optimize(&pmd->cls);

>              ovsrcu_quiesce();

> 

>              atomic_read_relaxed(&pmd->change_seq, &seq);

> @@ -3814,7 +3826,8 @@ static inline void

>  fast_path_processing(struct dp_netdev_pmd_thread *pmd,

>                       struct dp_packet_batch *packets_,

>                       struct netdev_flow_key *keys,

> -                     struct packet_batch_per_flow batches[], size_t

> *n_batches)

> +                     struct packet_batch_per_flow batches[], size_t

> *n_batches,

> +                     odp_port_t port_no)

>  {

>      int cnt = packets_->count;

>  #if !defined(__CHECKER__) && !defined(_WIN32)

> @@ -3827,7 +3840,7 @@ fast_path_processing(struct

> dp_netdev_pmd_thread *pmd,

>      struct dpcls_rule *rules[PKT_ARRAY_SIZE];

>      struct dp_netdev *dp = pmd->dp;

>      struct emc_cache *flow_cache = &pmd->flow_cache;

> -    int miss_cnt = 0, lost_cnt = 0;

> +    int miss_cnt = 0, lost_cnt = 0, lookup_cnt;

>      bool any_miss;

>      size_t i;

> 

> @@ -3835,7 +3848,7 @@ fast_path_processing(struct

> dp_netdev_pmd_thread *pmd,

>          /* Key length is needed in all the cases, hash computed on

> demand. */

>          keys[i].len =

> netdev_flow_key_size(miniflow_n_values(&keys[i].mf));

>      }

> -    any_miss = !dpcls_lookup(&pmd->cls, keys, rules, cnt);

> +    any_miss = !dpcls_lookup(&pmd->cls, keys, rules, cnt, port_no,

> &lookup_cnt);

>      if (OVS_UNLIKELY(any_miss) && !fat_rwlock_tryrdlock(&dp-

> >upcall_rwlock)) {

>          uint64_t actions_stub[512 / 8], slow_stub[512 / 8];

>          struct ofpbuf actions, put_actions;

> @@ -3893,6 +3906,7 @@ fast_path_processing(struct

> dp_netdev_pmd_thread *pmd,

>      }

> 

>      dp_netdev_count_packet(pmd, DP_STAT_MASKED_HIT, cnt - miss_cnt);

> +    dp_netdev_count_packet(pmd, DP_STATS_LOOKUP_HIT, lookup_cnt);

>      dp_netdev_count_packet(pmd, DP_STAT_MISS, miss_cnt);

>      dp_netdev_count_packet(pmd, DP_STAT_LOST, lost_cnt);

>  }

> @@ -3925,7 +3939,7 @@ dp_netdev_input__(struct dp_netdev_pmd_thread

> *pmd,

>                              md_is_valid, port_no);

>      if (OVS_UNLIKELY(newcnt)) {

>          packets->count = newcnt;

> -        fast_path_processing(pmd, packets, keys, batches,

> &n_batches);

> +        fast_path_processing(pmd, packets, keys, batches,

> &n_batches, port_no);

>      }

> 

>      for (i = 0; i < n_batches; i++) {

> @@ -4354,14 +4368,24 @@ struct dpcls_subtable {

>  static void

>  dpcls_init(struct dpcls *cls)

>  {

> +    int i;

> +

>      cmap_init(&cls->subtables_map);

> -    pvector_init(&cls->subtables);

> +    for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {

> +        pvector_init(&cls->subtables[i]);

> +    }

> +    memset(cls->hit_cnt, 0, sizeof cls->hit_cnt);

> +    cls->next_optimization = time_msec() +

> DPCLS_OPTIMIATION_INTERVAL;

>  }

> 

>  static void

>  dpcls_destroy_subtable(struct dpcls *cls, struct dpcls_subtable

> *subtable)

>  {

> -    pvector_remove(&cls->subtables, subtable);

> +    int i;

> +

> +    for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {

> +        pvector_remove(&cls->subtables[i], subtable);

> +    }

>      cmap_remove(&cls->subtables_map, &subtable->cmap_node,

>                  subtable->mask.hash);

>      cmap_destroy(&subtable->rules);

> @@ -4376,13 +4400,16 @@ dpcls_destroy(struct dpcls *cls)

>  {

>      if (cls) {

>          struct dpcls_subtable *subtable;

> +        int i;

> 

>          CMAP_FOR_EACH (subtable, cmap_node, &cls->subtables_map) {

>              ovs_assert(cmap_count(&subtable->rules) == 0);

>              dpcls_destroy_subtable(cls, subtable);

>          }

>          cmap_destroy(&cls->subtables_map);

> -        pvector_destroy(&cls->subtables);

> +        for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {

> +            pvector_destroy(&cls->subtables[i]);

> +        }

>      }

>  }

> 

> @@ -4390,6 +4417,7 @@ static struct dpcls_subtable *

>  dpcls_create_subtable(struct dpcls *cls, const struct

> netdev_flow_key *mask)

>  {

>      struct dpcls_subtable *subtable;

> +    int i;

> 

>      /* Need to add one. */

>      subtable = xmalloc(sizeof *subtable

> @@ -4397,8 +4425,11 @@ dpcls_create_subtable(struct dpcls *cls, const

> struct netdev_flow_key *mask)

>      cmap_init(&subtable->rules);

>      netdev_flow_key_clone(&subtable->mask, mask);

>      cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask-

> >hash);

> -    pvector_insert(&cls->subtables, subtable, 0);

> -    pvector_publish(&cls->subtables);

> +    for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {

> +        /* Insert new subtable with priority zero (no hits) */

> +        pvector_insert(&cls->subtables[i], subtable, 0);

> +        pvector_publish(&cls->subtables[i]);

> +    }

> 

>      return subtable;

>  }

> @@ -4417,6 +4448,33 @@ dpcls_find_subtable(struct dpcls *cls, const

> struct netdev_flow_key *mask)

>      return dpcls_create_subtable(cls, mask);

>  }

> 

> +

> +/* Periodically sort the subtable vectors according to hit counts */

> +static inline void

> +dpcls_try_optimize(struct dpcls *cls)

> +{

> +    long long int now = time_msec();

> +

> +    if (now > cls->next_optimization) {

> +        struct dpcls_subtable *subtable;

> +        int port_idx, subtable_idx;

> +

> +        for (port_idx=0; port_idx<NUM_SUBTABLE_VECTORS; port_idx++)

> {

> +            struct pvector *pvec = &cls->subtables[port_idx];

> +            PVECTOR_FOR_EACH (subtable, pvec) {

> +                subtable_idx = hash_pointer(subtable, 210365) %

> NUM_SUBTABLE_SLOTS;

> +                pvector_change_priority(pvec, subtable,

> +                        cls->hit_cnt[port_idx][subtable_idx]);

> +            }

> +            pvector_publish(pvec);

> +        }

> +

> +        /* Start new measuring interval */

> +        memset(cls->hit_cnt, 0, sizeof cls->hit_cnt);

> +        cls->next_optimization = now + DPCLS_OPTIMIATION_INTERVAL;

> +    }

> +}

> +

>  /* Insert 'rule' into 'cls'. */

>  static void

>  dpcls_insert(struct dpcls *cls, struct dpcls_rule *rule,

> @@ -4433,6 +4491,7 @@ static void

>  dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)

>  {

>      struct dpcls_subtable *subtable;

> +    int i;

> 

>      ovs_assert(rule->mask);

> 

> @@ -4441,7 +4500,9 @@ dpcls_remove(struct dpcls *cls, struct

> dpcls_rule *rule)

>      if (cmap_remove(&subtable->rules, &rule->cmap_node, rule-

> >flow.hash)

>          == 0) {

>          dpcls_destroy_subtable(cls, subtable);

> -        pvector_publish(&cls->subtables);

> +        for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {

> +            pvector_publish(&cls->subtables[i]);

> +        }

>      }

>  }

> 

> @@ -4474,8 +4535,9 @@ dpcls_rule_matches_key(const struct dpcls_rule

> *rule,

>   *

>   * Returns true if all flows found a corresponding rule. */

>  static bool

> -dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key

> keys[],

> -             struct dpcls_rule **rules, const size_t cnt)

> +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],

> +             struct dpcls_rule **rules, const size_t cnt,

> +             const odp_port_t port_no, int *num_lookups_p)

>  {

>      /* The batch size 16 was experimentally found faster than 8 or

> 32. */

>      typedef uint16_t map_type;

> @@ -4495,7 +4557,12 @@ dpcls_lookup(const struct dpcls *cls, const

> struct netdev_flow_key keys[],

>      }

>      memset(rules, 0, cnt * sizeof *rules);

> 

> -    PVECTOR_FOR_EACH (subtable, &cls->subtables) {

> +    /* Select the subtable vector for the in_port */

> +    uint32_t port_idx = hash_port_no(port_no) %

> NUM_SUBTABLE_VECTORS;



[Antonio F] May we simply use the port_no instead of its hashed value?

Otherwise can we compute port_idx
at the beginning or at the port creation so to avoid computing it 
on every batch of keys?


> +    uint64_t *hit_cnt = cls->hit_cnt[port_idx];

> +    int lookups_match = 0, subtable_pos = 1;

> +

> +    PVECTOR_FOR_EACH (subtable, &cls->subtables[port_idx]) {

>          const struct netdev_flow_key *mkeys = keys;

>          struct dpcls_rule **mrules = rules;

>          map_type remains = 0;

> @@ -4527,6 +4594,10 @@ dpcls_lookup(const struct dpcls *cls, const

> struct netdev_flow_key keys[],

>                  CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {

>                      if (OVS_LIKELY(dpcls_rule_matches_key(rule,

> &mkeys[i]))) {

>                          mrules[i] = rule;

> +                        /* Update the subtable hit statistics for

> the in_port vector */

> +                        int subtable_idx = hash_pointer(subtable,

> 210365) % NUM_SUBTABLE_SLOTS;

> +                        hit_cnt[subtable_idx]++;

> +                        lookups_match += subtable_pos;

>                          goto next;

>                      }

>                  }

> @@ -4538,8 +4609,11 @@ dpcls_lookup(const struct dpcls *cls, const

> struct netdev_flow_key keys[],

>              remains |= maps[m];

>          }

>          if (!remains) {

> +            if (num_lookups_p) *num_lookups_p = lookups_match;

>              return true;              /* All found. */

>          }

> +        subtable_pos++;

>      }

> +    if (num_lookups_p) *num_lookups_p = lookups_match;

>      return false;                     /* Some misses. */

>  }

> 

> _______________________________________________

> dev mailing list

> dev@openvswitch.org

> http://openvswitch.org/mailman/listinfo/dev
Jan Scheurich June 21, 2016, 9:27 a.m. UTC | #2
> [Antonio F]

> What if the range of the hash values is partitioned per port type?

> I mean if the 1st range - say the first 10 values - are reserved for dpdk ports

> 

> if <is a dpdk port>  // range is [0..9]

>     uint32_t port_idx = 0 + hash_port_no(port_no) % 10;

> else    // range is [10..NUM_SUBTABLE_VECTORS]

>     uint32_t port_idx = 10 + hash_port_no(port_no) %

> (NUM_SUBTABLE_VECTORS - 10);

> 

> Could this help to avoid collisions between ports of different types, eg a dpdk

> with a vhostuser port?


You are probably right that ports of different types (dpdk, vhostuser, tunnel, internal, patch, etc.) are likely to have hits in different subsets of subtables. Whether or not static partitioning of the subtable vector space per in_port type gives better results than hashing over a larger set of subtable vectors would have to be tested.

> >

> > As the number of subtables will often be higher in reality, we can

> 

> [Antonio F] Do you know how many subtables can be used in a real scenario?


That very much depends on the OVS bridge setup, the OpenFlow pipeline and the traffic patterns. Any overlay networking solution will have at least two subtables for the tunnel handling plus any number of subtables needed for processing the inner packets. If you have different virtualized networking services present (e.g. ACL, L2, L3, Service Chaining, ...) packets will take different paths through the pipeline, resulting in different megaflow wild-card masks.

Furthermore, (periodic) maintenance packets (e.g. LACP, ARP, PING, BFD) will typically lead to the creation of dedicated subtables that can degrade dpcls performance today.

My guess would be that in a realistic OVS deployment in an OpenStack cloud with an SDN controller (e.g. ODL) running ACL/L2/L3/NAT services and link bonding, there will be on the order of 10-15 subtables. But I don't have access to such a system to provide empirical data.

> >

> > -    PVECTOR_FOR_EACH (subtable, &cls->subtables) {

> > +    /* Select the subtable vector for the in_port */

> > +    uint32_t port_idx = hash_port_no(port_no) %

> > NUM_SUBTABLE_VECTORS; 

> 

> [Antonio F] May we simply use the port_no instead of its hashed value?


I considered using the odp port number directly but didn't dare to do so because of the potentially large number of ports in a switch and the memory impact.

> Otherwise can we compute port_idx

> at the beginning or at the port creation so to avoid computing it on every

> batch of keys?


I can prototype whether computing the index at creation and storing it as port data gives any benefit. I guess the same could be done for the subtable index, right?
Fischetti, Antonio June 21, 2016, 1:13 p.m. UTC | #3
> -----Original Message-----

> From: Jan Scheurich [mailto:jan.scheurich@ericsson.com]

> Sent: Tuesday, June 21, 2016 10:27 AM

> To: Fischetti, Antonio <antonio.fischetti@intel.com>;

> dev@openvswitch.org

> Subject: RE: [RFC Patch] dpif-netdev: Sorted subtable vectors per

> in_port in dpcls

> 

> > [Antonio F]

> > What if the range of the hash values is partitioned per port type?

> > I mean if the 1st range - say the first 10 values - are reserved

> for dpdk ports

> >

> > if <is a dpdk port>  // range is [0..9]

> >     uint32_t port_idx = 0 + hash_port_no(port_no) % 10;

> > else    // range is [10..NUM_SUBTABLE_VECTORS]

> >     uint32_t port_idx = 10 + hash_port_no(port_no) %

> > (NUM_SUBTABLE_VECTORS - 10);

> >

> > Could this help to avoid collisions between ports of different

> types, eg a dpdk

> > with a vhostuser port?

> 

> You are probably right that ports of the different type (dpdk,

> vhostuser, tunnel, internal, patch, etc.) are likely to have

> hits in different subsets of subtables. Whether or not static

> partitioning of the subtable vector space per in_port type gives

> better results than hashing over a larger set of subtable vectors

> would have to be tested.


[Antonio F] ok, it makes sense. 


> 

> > >

> > > As the number of subtables will often be higher in reality, we

> can

> >

> > [Antonio F] Do you know how many subtables can be used in a real

> scenario?

> 

> That very much depends on the OVS bridge setup, the OpenFlow pipeline

> and the traffic patterns. Any overlay networking solution will have

> at least two subtables for the tunnel handling plus any number of

> subtables needed for processing the inner packets. If you have

> different virtualized networking services present (e.g. ACL, L2, L3,

> Service Chaining, ...) packets will take different paths through the

> pipeline, resulting in different megaflow wild-card masks.

> 

> Furthermore, also (periodic) maintenance packets (e.g. LACP, ARP,

> PING, BFD) will typically lead to creation of dedicated subtables,

> that can degrade the dpcls performance today.

> 

> My guess would be that in a realistic OVS deployment in an OpenStack

> cloud with SDN controller (e.g. ODL) running ACL/L2/L3/NAT services

> and link bonding, there will be in the order of 10-15 subtables. But

> I don't have access to such a system to provide empirical data.


[Antonio F] That's interesting, thanks.


> 

> > >

> > > -    PVECTOR_FOR_EACH (subtable, &cls->subtables) {

> > > +    /* Select the subtable vector for the in_port */

> > > +    uint32_t port_idx = hash_port_no(port_no) %

> > > NUM_SUBTABLE_VECTORS;

> >

> > [Antonio F] May we simply use the port_no instead of its hashed

> value?

> 

> I considered using the odp port number directly but didn't dare to do

> so because of the potentially large number of ports in a switch and

> the memory impact.

 
[Antonio F] Yes, it makes sense.


> 

> > Otherwise can we compute port_idx

> > at the beginning or at the port creation so to avoid computing it

> on every

> > batch of keys?


[Antonio F] 
Another benefit of computing port_idx at port creation
could be collision avoidance: if the newly computed port_idx
is already used by another port, you could bump it to the
next unused value.

In this case even a simple progressive number
port_idx = last_port_idx + 1
could be used instead of a hashed value.
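The sequential scheme sketched above would only start colliding once more than NUM_SUBTABLE_VECTORS ports exist; a minimal sketch (illustrative, not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SUBTABLE_VECTORS 32

static uint32_t last_port_idx;

/* Hand out vector indices sequentially at port creation time; the
 * first 32 ports are guaranteed collision-free. */
static uint32_t assign_port_idx(void)
{
    return last_port_idx++ % NUM_SUBTABLE_VECTORS;
}
```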


> 

> I can prototype if computing the index at creation and storing it as

> port data gives any benefit. I guess the same could be done for the

> subtable index, right?


[Antonio F] Yes, especially because hash_pointer(subtable, ...)
is called inside dpcls_lookup() and also in dpcls_try_optimize().
Maybe subtable_idx could live inside struct dpcls_subtable, set
when the subtable is created.
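Caching the slot index in the subtable, as suggested, might look like the sketch below; the struct name, field and hash are illustrative, not the patch's code:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SUBTABLE_SLOTS 32

struct dpcls_subtable_sketch {
    /* ... the existing rules cmap and mask would live here ... */
    uint32_t slot_idx;   /* cached hit-count slot for this subtable */
};

/* Compute the slot once at subtable creation instead of re-hashing
 * the pointer on every hit in dpcls_lookup()/dpcls_try_optimize(). */
static void subtable_init_slot(struct dpcls_subtable_sketch *st)
{
    uintptr_t p = (uintptr_t)st;
    st->slot_idx = (uint32_t)((p >> 4) ^ (p >> 12)) % NUM_SUBTABLE_SLOTS;
}
```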
Bodireddy, Bhanuprakash June 22, 2016, 2:27 p.m. UTC | #4
>-----Original Message-----

>From: dev [mailto:dev-bounces@openvswitch.org] On Behalf Of Jan

>Scheurich

>Sent: Thursday, June 16, 2016 2:56 PM

>To: dev@openvswitch.org

>Subject: [ovs-dev] [RFC Patch] dpif-netdev: Sorted subtable vectors per

>in_port in dpcls

>

>The user-space datapath (dpif-netdev) consists of a first level "exact match

>cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With

>many parallel packet flows (e.g. TCP connections) the EMC becomes

>inefficient and the OVS forwarding performance is determined by the

>megaflow classifier.

>

>The megaflow classifier (dpcls) consists of a variable number of hash tables

>(aka subtables), each containing megaflow entries with the same mask of

>packet header and metadata fields to match upon. A dpcls lookup matches a

>given packet against all subtables in sequence until it hits a match. As

>megaflow cache entries are by construction non-overlapping, the first match

>is the only match.

>

>Today the order of the subtables in the dpcls is essentially random so that on

>average a dpcls lookup has to visit N/2 subtables for a hit, when N is the total

>number of subtables. Even though every single hash-table lookup is fast, the

>performance of the current dpcls degrades when there are many subtables.

>

>How does the patch address this issue:

>

>In reality there is often a strong correlation between the ingress port and a

>small subset of subtables that have hits. The entire megaflow cache typically

>decomposes nicely into partitions that are hit only by packets entering from a

>range of similar ports (e.g. traffic from Phy  -> VM vs. traffic from VM -> Phy).

>

>Therefore, keeping a separate list of subtables per ingress port, sorted by

>frequency of hits, reduces the average number of subtable lookups in the

>dpcls to a minimum, even if the total number of subtables gets large.


I like the proposed approach of subtable prioritization for each ingress port, thereby reducing the lookup time.
+1 on this approach.

>

>The patch introduces 32 subtable vectors per dpcls and hashes the ingress

>port to select the subtable vector. The patch also counts matches per 32 slots

>in each vector (hashing the subtable pointer to obtain the slot) and sorts the

>vectors according to match frequency every second.

>

>To monitor the effectiveness of the patch we have enhanced the ovs-appctl

>dpif-netdev/pmd-stats-show command with an extra line "avg. subtable

>lookups per hit" to report the average number of subtable lookups needed for

>a megaflow match. Ideally, this should be close to 1 and much smaller than

>N/2.

>

>I have benchmarked a cloud L3 overlay pipeline with a VXLAN overlay mesh.

>With pure L3 tenant traffic between VMs on different nodes the resulting

>netdev dpcls contains N=4 subtables.

>

>Disabling the EMC, I have measured a baseline performance (in+out) of ~1.32

>Mpps (64 bytes, 1000 L4 flows). The average number of subtable lookups per

>dpcls match is 2.5.

>

>With the patch the average number of subtable lookups per dpcls match goes

>down to 1.25 (apparently there are still two ports of different nature hashed

>to the same vector, otherwise it should be exactly one). Even so the

>forwarding performance grows by ~30% to 1.72 Mpps.


I ran some benchmarks and observed that the patch improves performance even with multiple subtables around.
EMC was disabled and 5 VMs were doing packet forwarding. The flow rules were set up so that 8 subtables were created,
and a performance improvement of 16% was observed in this case. I would like to try some more complex test scenarios when I get time.

Regards,
Bhanu Prakash.

>
>As the number of subtables will often be higher in reality, we can assume that
>this is at the lower end of the speed-up one can expect from this
>optimization. Just running a parallel ping between the VXLAN tunnel
>endpoints increases the number of subtables and hence the average number
>of subtable lookups from 2.5 to 3.5 with a corresponding decrease of
>throughput to 1.14 Mpps. With the patch the parallel ping has no impact on
>average number of subtable lookups and performance. The performance gain
>is then ~50%.
>
>Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
>
Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 8d39d9e..c491110 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -149,9 +149,15 @@  struct emc_cache {
 
 /* Simple non-wildcarding single-priority classifier. */

+#define NUM_SUBTABLE_VECTORS 32
+#define NUM_SUBTABLE_SLOTS 32
+#define DPCLS_OPTIMIATION_INTERVAL 1000
+
 struct dpcls {
     struct cmap subtables_map;
-    struct pvector subtables;
+    struct pvector subtables[NUM_SUBTABLE_VECTORS];
+    uint64_t hit_cnt[NUM_SUBTABLE_VECTORS][NUM_SUBTABLE_SLOTS];
+    long long int next_optimization;
 };

 /* A rule to be inserted to the classifier. */
@@ -164,12 +170,14 @@  struct dpcls_rule {

 static void dpcls_init(struct dpcls *);
 static void dpcls_destroy(struct dpcls *);
+static void dpcls_try_optimize(struct dpcls *);
 static void dpcls_insert(struct dpcls *, struct dpcls_rule *,
                          const struct netdev_flow_key *mask);
 static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
-static bool dpcls_lookup(const struct dpcls *cls,
+static bool dpcls_lookup(struct dpcls *cls,
                          const struct netdev_flow_key keys[],
-                         struct dpcls_rule **rules, size_t cnt);
+                         struct dpcls_rule **rules, size_t cnt,
+                         const odp_port_t port_no, int *num_lookups_p);
 
 /* Datapath based on the network device interface from netdev.h.
  *
@@ -239,6 +247,7 @@  enum dp_stat_type {
     DP_STAT_MASKED_HIT,         /* Packets that matched in the flow table. */
     DP_STAT_MISS,               /* Packets that did not match. */
     DP_STAT_LOST,               /* Packets not passed up to the client. */
+    DP_STATS_LOOKUP_HIT,        /* Subtable lookups for flow table hits. */
     DP_N_STATS
 };

@@ -657,8 +666,10 @@  pmd_info_show_stats(struct ds *reply,

     ds_put_format(reply,
                   "\temc hits:%llu\n\tmegaflow hits:%llu\n"
+                  "\tavg. subtable lookups per hit:%5.2f\n"
                   "\tmiss:%llu\n\tlost:%llu\n",
                   stats[DP_STAT_EXACT_HIT], stats[DP_STAT_MASKED_HIT],
+                  stats[DP_STAT_MASKED_HIT] > 0 ? (1.0*stats[DP_STATS_LOOKUP_HIT])/stats[DP_STAT_MASKED_HIT] : 0,
                   stats[DP_STAT_MISS], stats[DP_STAT_LOST]);

     if (total_cycles == 0) {
@@ -1855,13 +1866,13 @@  emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
 }

 static struct dp_netdev_flow *
-dp_netdev_pmd_lookup_flow(const struct dp_netdev_pmd_thread *pmd,
+dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
                           const struct netdev_flow_key *key)
 {
     struct dp_netdev_flow *netdev_flow;
     struct dpcls_rule *rule;

-    dpcls_lookup(&pmd->cls, key, &rule, 1);
+    dpcls_lookup(&pmd->cls, key, &rule, 1, 0, NULL); /* Use 0 as dummy in_port. */
     netdev_flow = dp_netdev_flow_cast(rule);

     return netdev_flow;
@@ -2874,6 +2885,7 @@  reload:

             emc_cache_slow_sweep(&pmd->flow_cache);
             coverage_try_clear();
+            dpcls_try_optimize(&pmd->cls);
             ovsrcu_quiesce();

             atomic_read_relaxed(&pmd->change_seq, &seq);
@@ -3814,7 +3826,8 @@  static inline void
 fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                      struct dp_packet_batch *packets_,
                      struct netdev_flow_key *keys,
-                     struct packet_batch_per_flow batches[], size_t *n_batches)
+                     struct packet_batch_per_flow batches[], size_t *n_batches,
+                     odp_port_t port_no)
 {
     int cnt = packets_->count;
 #if !defined(__CHECKER__) && !defined(_WIN32)
@@ -3827,7 +3840,7 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
     struct dpcls_rule *rules[PKT_ARRAY_SIZE];
     struct dp_netdev *dp = pmd->dp;
     struct emc_cache *flow_cache = &pmd->flow_cache;
-    int miss_cnt = 0, lost_cnt = 0;
+    int miss_cnt = 0, lost_cnt = 0, lookup_cnt;
     bool any_miss;
     size_t i;

@@ -3835,7 +3848,7 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
         /* Key length is needed in all the cases, hash computed on demand. */
         keys[i].len = netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
     }
-    any_miss = !dpcls_lookup(&pmd->cls, keys, rules, cnt);
+    any_miss = !dpcls_lookup(&pmd->cls, keys, rules, cnt, port_no, &lookup_cnt);
     if (OVS_UNLIKELY(any_miss) && !fat_rwlock_tryrdlock(&dp->upcall_rwlock)) {
         uint64_t actions_stub[512 / 8], slow_stub[512 / 8];
         struct ofpbuf actions, put_actions;
@@ -3893,6 +3906,7 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
     }

     dp_netdev_count_packet(pmd, DP_STAT_MASKED_HIT, cnt - miss_cnt);
+    dp_netdev_count_packet(pmd, DP_STATS_LOOKUP_HIT, lookup_cnt);
     dp_netdev_count_packet(pmd, DP_STAT_MISS, miss_cnt);
     dp_netdev_count_packet(pmd, DP_STAT_LOST, lost_cnt);
 }
@@ -3925,7 +3939,7 @@  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
                             md_is_valid, port_no);
     if (OVS_UNLIKELY(newcnt)) {
         packets->count = newcnt;
-        fast_path_processing(pmd, packets, keys, batches, &n_batches);
+        fast_path_processing(pmd, packets, keys, batches, &n_batches, port_no);
     }

     for (i = 0; i < n_batches; i++) {
@@ -4354,14 +4368,24 @@  struct dpcls_subtable {
 static void
 dpcls_init(struct dpcls *cls)
 {
+    int i;
+
     cmap_init(&cls->subtables_map);
-    pvector_init(&cls->subtables);
+    for (i = 0; i < NUM_SUBTABLE_VECTORS; i++) {
+        pvector_init(&cls->subtables[i]);
+    }
+    memset(cls->hit_cnt, 0, sizeof cls->hit_cnt);
+    cls->next_optimization = time_msec() + DPCLS_OPTIMIATION_INTERVAL;
 }

 static void
 dpcls_destroy_subtable(struct dpcls *cls, struct dpcls_subtable *subtable)
 {
-    pvector_remove(&cls->subtables, subtable);
+    int i;
+
+    for (i = 0; i < NUM_SUBTABLE_VECTORS; i++) {
+        pvector_remove(&cls->subtables[i], subtable);
+    }
     cmap_remove(&cls->subtables_map, &subtable->cmap_node,
                 subtable->mask.hash);
     cmap_destroy(&subtable->rules);
@@ -4376,13 +4400,16 @@  dpcls_destroy(struct dpcls *cls)
 {
     if (cls) {
         struct dpcls_subtable *subtable;
+        int i;

         CMAP_FOR_EACH (subtable, cmap_node, &cls->subtables_map) {
             ovs_assert(cmap_count(&subtable->rules) == 0);
             dpcls_destroy_subtable(cls, subtable);
         }
         cmap_destroy(&cls->subtables_map);
-        pvector_destroy(&cls->subtables);
+        for (i = 0; i < NUM_SUBTABLE_VECTORS; i++) {
+            pvector_destroy(&cls->subtables[i]);
+        }
     }
 }

@@ -4390,6 +4417,7 @@  static struct dpcls_subtable *
 dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
 {
     struct dpcls_subtable *subtable;
+    int i;

     /* Need to add one. */
     subtable = xmalloc(sizeof *subtable
@@ -4397,8 +4425,11 @@  dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
     cmap_init(&subtable->rules);
     netdev_flow_key_clone(&subtable->mask, mask);
     cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
-    pvector_insert(&cls->subtables, subtable, 0);
-    pvector_publish(&cls->subtables);
+    for (i = 0; i < NUM_SUBTABLE_VECTORS; i++) {
+        /* Insert new subtable with priority zero (no hits) */
+        pvector_insert(&cls->subtables[i], subtable, 0);
+        pvector_publish(&cls->subtables[i]);
+    }

     return subtable;
 }
@@ -4417,6 +4448,33 @@  dpcls_find_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
     return dpcls_create_subtable(cls, mask);
 }

+
+/* Periodically sort the subtable vectors according to hit counts. */
+static inline void
+dpcls_try_optimize(struct dpcls *cls)
+{
+    long long int now = time_msec();
+
+    if (now > cls->next_optimization) {
+        struct dpcls_subtable *subtable;
+        int port_idx, subtable_idx;
+
+        for (port_idx = 0; port_idx < NUM_SUBTABLE_VECTORS; port_idx++) {
+            struct pvector *pvec = &cls->subtables[port_idx];
+            PVECTOR_FOR_EACH (subtable, pvec) {
+                subtable_idx = hash_pointer(subtable, 210365) % NUM_SUBTABLE_SLOTS;
+                pvector_change_priority(pvec, subtable,
+                        cls->hit_cnt[port_idx][subtable_idx]);
+            }
+            pvector_publish(pvec);
+        }
+
+        /* Start new measuring interval */
+        memset(cls->hit_cnt, 0, sizeof cls->hit_cnt);
+        cls->next_optimization = now + DPCLS_OPTIMIATION_INTERVAL;
+    }
+}
+
 /* Insert 'rule' into 'cls'. */
 static void
 dpcls_insert(struct dpcls *cls, struct dpcls_rule *rule,
@@ -4433,6 +4491,7 @@  static void
 dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
 {
     struct dpcls_subtable *subtable;
+    int i;

     ovs_assert(rule->mask);

@@ -4441,7 +4500,9 @@  dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
     if (cmap_remove(&subtable->rules, &rule->cmap_node, rule->flow.hash)
         == 0) {
         dpcls_destroy_subtable(cls, subtable);
-        pvector_publish(&cls->subtables);
+        for (i = 0; i < NUM_SUBTABLE_VECTORS; i++) {
+            pvector_publish(&cls->subtables[i]);
+        }
     }
 }

@@ -4474,8 +4535,9 @@  dpcls_rule_matches_key(const struct dpcls_rule *rule,
  *
  * Returns true if all flows found a corresponding rule. */
 static bool
-dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
-             struct dpcls_rule **rules, const size_t cnt)
+dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
+             struct dpcls_rule **rules, const size_t cnt,
+             const odp_port_t port_no, int *num_lookups_p)
 {
     /* The batch size 16 was experimentally found faster than 8 or 32. */
     typedef uint16_t map_type;
@@ -4495,7 +4557,12 @@  dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
     }
     memset(rules, 0, cnt * sizeof *rules);

-    PVECTOR_FOR_EACH (subtable, &cls->subtables) {
+    /* Select the subtable vector for the in_port */
+    uint32_t port_idx = hash_port_no(port_no) % NUM_SUBTABLE_VECTORS;
+    uint64_t *hit_cnt = cls->hit_cnt[port_idx];
+    int lookups_match = 0, subtable_pos = 1;
+
+    PVECTOR_FOR_EACH (subtable, &cls->subtables[port_idx]) {
         const struct netdev_flow_key *mkeys = keys;
         struct dpcls_rule **mrules = rules;
         map_type remains = 0;
@@ -4527,6 +4594,10 @@  dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
                 CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
                     if (OVS_LIKELY(dpcls_rule_matches_key(rule, &mkeys[i]))) {
                         mrules[i] = rule;
+                        /* Update the subtable hit statistics for the in_port vector */
+                        int subtable_idx = hash_pointer(subtable, 210365) % NUM_SUBTABLE_SLOTS;
+                        hit_cnt[subtable_idx]++;
+                        lookups_match += subtable_pos;
                         goto next;
                     }
                 }
@@ -4538,8 +4609,11 @@  dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
             remains |= maps[m];
         }
         if (!remains) {
+            if (num_lookups_p) *num_lookups_p = lookups_match;
             return true;              /* All found. */
         }
+        subtable_pos++;
     }
+    if (num_lookups_p) *num_lookups_p = lookups_match;
     return false;                     /* Some misses. */
 }