[ovs-dev,v4] Upcall/Slowpath rate-limiter for OVS

Message ID 2CFADA65-5875-4A27-8D16-F4F4E9D495A3@ericsson.com
State Deferred
Delegated to: Ian Stokes
Series [ovs-dev,v4] Upcall/Slowpath rate-limiter for OVS

Commit Message

Manohar Krishnappa Chidambaraswamy May 14, 2018, 9:32 a.m. UTC
Jan,

Have addressed your comments/suggestions. Please take a look and let me
know if you have any more comments.

Signed-off-by: Manohar K C
<manohar.krishnappa.chidambaraswamy@ericsson.com>
CC: Jan Scheurich <jan.scheurich@ericsson.com>
---
v3 : v2 rebased to master and adapted to dpif-netdev-perf counters.
     https://patchwork.ozlabs.org/patch/909676/

v2 : v1 rebased to master.
     https://patchwork.ozlabs.org/patch/860687/

v1 : Initial patch for upcall rate-limiter based on token-bucket.
     https://patchwork.ozlabs.org/patch/836737/

 Documentation/howto/dpdk.rst |  53 ++++++++++++++++++
 lib/dpif-netdev-perf.h       |   1 +
 lib/dpif-netdev.c            | 130 ++++++++++++++++++++++++++++++++++++++++---
 lib/dpif.h                   |   1 +
 vswitchd/vswitch.xml         |  42 ++++++++++++++
 5 files changed, 220 insertions(+), 7 deletions(-)

Comments

Aaron Conole June 7, 2018, 1:36 p.m. UTC | #1
Manohar Krishnappa Chidambaraswamy
<manohar.krishnappa.chidambaraswamy@ericsson.com> writes:

> Jan,
>
> Have addressed your comments/suggestions. Please take a look and let me
> know if you have any more comments.
>
> Signed-off-by: Manohar K C
> <manohar.krishnappa.chidambaraswamy@ericsson.com>
> CC: Jan Scheurich <jan.scheurich@ericsson.com>
> ---

Hi Jan, and Manohar,

Have you considered making this token bucket per-port instead of
per-pmd?  As I read it, a greedy port can exhaust all the tokens from a
particular PMD, possibly leading to unfair performance for that PMD
thread.  Am I just being overly paranoid?

How are you stress testing this?  Is there some framework you're using?

I'm interested in possibly implementing a similar rate-limiter from the
kernel side (for parity), but want to see if it's really a problem
with the netlink datapath first.

-Aaron

Manohar Krishnappa Chidambaraswamy June 8, 2018, 7:44 a.m. UTC | #2
Aaron,

Thanx for your comments/suggestions. Please see inline.

Thanx
Manu

On 07/06/18, 7:06 PM, "Aaron Conole" <aconole@redhat.com> wrote:

    Manohar Krishnappa Chidambaraswamy
    <manohar.krishnappa.chidambaraswamy@ericsson.com> writes:
    
    > Jan,
    >
    > Have addressed your comments/suggestions. Please take a look and let me
    > know if you have any more comments.
    >
    > Signed-off-by: Manohar K C
    > <manohar.krishnappa.chidambaraswamy@ericsson.com>
    > CC: Jan Scheurich <jan.scheurich@ericsson.com>
    > ---
    
    Hi Jan, and Manohar,
    
    Have you considered making this token bucket per-port instead of
    per-pmd?  As I read it, a greedy port can exhaust all the tokens from a
    particular PMD, possibly leading to an unfair performance for that PMD
    thread.  Am I just being overly paranoid?
[manu] Yes, this is possible. But it can happen for both fast-path and slow-path today, as PMDs sequentially iterate through ports. In order to keep it simple, it's done per-PMD. It can be extended to per-port if needed.
    
    How are you stress testing this?  Is there some framework you're using?
[manu] Since the number/rate of new flows coming in depends on the deployment (expected traffic pattern), upcall ratelimit parameters need to be tuned for a given deployment (an illustrative tuning example follows this reply). We have only done functional tests on this for now.
    
    I'm interested in possibly implementing a similar rate-limiter from the
    kernel side (for parity), but want to see if it's really a problem
    with the netlink datapath first.

[manu] In the kernel path, slow-path and fast-path are executed in different contexts, with the slow-path executing in dedicated userspace threads and the fast-path in the kernel. An upcall is posted from the kernel module over netlink sockets (a.k.a. queues) to upcall threads in userspace. Hence the number of upcalls can be controlled by controlling the depth of this queue, so a token bucket would be unnecessary IMO.
    
    -Aaron
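
For concreteness, here is a hypothetical tuning for a deployment that
expects bursts of up to about 2000 new flows per second per thread. It
uses only the configuration keys documented by this patch; the values
are illustrative, not recommendations:

    $ ovs-vsctl set Open_vSwitch . other_config:upcall-rl=true
    $ ovs-vsctl set Open_vSwitch . other_config:upcall-rate=2000
    $ ovs-vsctl set Open_vSwitch . other_config:upcall-burst=2000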
    
Jan Scheurich June 8, 2018, 9 a.m. UTC | #3
>     Have you considered making this token bucket per-port instead of
>     per-pmd?  As I read it, a greedy port can exhaust all the tokens from a
>     particular PMD, possibly leading to an unfair performance for that PMD
>     thread.  Am I just being overly paranoid?
> [manu] Yes, this is possible. But it can happen for both fast-path and slow-path today, as PMDs sequentially iterate through ports. In order
> to keep it simple, it's done per-PMD. It can be extended to per-port if needed.

The purpose of the upcall rate limiter for the netdev datapath is to protect a PMD from getting bogged down by having to process an excessive number of upcalls. It is not to police the number of upcalls per port to some rate, especially not across multiple PMDs (in the case of RSS).

I think what you are after, Aaron, is some kind of fairness scheme that provides each rx queue with a minimum rate of upcalls even if the global PMD rate limit is reached? I don't believe simply partitioning the global PMD rate limit into a number of smaller rx queue buckets would be a good solution. But I don't have a better alternative either.

I agree with Manu that it should not stop us implementing the PMD-level protection. We can add a fairness scheme later, if needed.

BR, Jan
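
As background for the discussion above, below is a minimal stand-alone
sketch of the per-PMD token-bucket admission scheme the patch
implements. It is illustrative only: the patch itself reuses OVS's
existing struct token_bucket with token_bucket_init() and
token_bucket_withdraw() from lib/token-bucket.h, and every demo_* name
here is hypothetical.

/* Minimal stand-alone illustration of a per-PMD upcall policer: a token
 * bucket refilled at 'rate' tokens/s, capped at 'burst', with one token
 * withdrawn per upcall.  Hypothetical names; not the patch's code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct demo_token_bucket {
    uint32_t rate;          /* Tokens added per second (upcalls/s). */
    uint32_t burst;         /* Bucket capacity (max upcall burst). */
    uint64_t tokens;        /* Tokens currently available. */
    uint64_t last_fill_ms;  /* Timestamp of the last refill. */
};

static void
demo_tb_init(struct demo_token_bucket *tb, uint32_t rate, uint32_t burst)
{
    tb->rate = rate;
    tb->burst = burst;
    tb->tokens = burst;     /* Start full so an initial burst is admitted. */
    tb->last_fill_ms = 0;
}

/* Add tokens for the elapsed time, saturating at the burst size. */
static void
demo_tb_refill(struct demo_token_bucket *tb, uint64_t now_ms)
{
    tb->tokens += (now_ms - tb->last_fill_ms) * tb->rate / 1000;
    if (tb->tokens > tb->burst) {
        tb->tokens = tb->burst;
    }
    tb->last_fill_ms = now_ms;
}

/* Returns true if one upcall may enter the slow path; false means the
 * packet is dropped and counted, as with PMD_STAT_RL_DROP in the patch. */
static bool
demo_tb_withdraw(struct demo_token_bucket *tb, uint64_t now_ms)
{
    demo_tb_refill(tb, now_ms);
    if (tb->tokens >= 1) {
        tb->tokens--;
        return true;
    }
    return false;
}

int
main(void)
{
    struct demo_token_bucket tb;
    int admitted = 0, dropped = 0;

    demo_tb_init(&tb, 500, 500);    /* The patch's default rate/burst. */

    /* A burst of 1000 would-be upcalls arriving within one millisecond:
     * roughly one burst worth is admitted, the rest is policed. */
    for (int i = 0; i < 1000; i++) {
        if (demo_tb_withdraw(&tb, 1)) {
            admitted++;
        } else {
            dropped++;
        }
    }
    printf("admitted=%d dropped=%d\n", admitted, dropped);
    return 0;
}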
Aaron Conole June 12, 2018, 7:26 p.m. UTC | #4
Manohar Krishnappa Chidambaraswamy
<manohar.krishnappa.chidambaraswamy@ericsson.com> writes:

>     How are you stress testing this?  Is there some framework you're using?
> [manu] Since the number/rate of new flows coming in depends on the
> deployment (expected traffic pattern), upcall ratelimit parameters
> need to be tuned for a given deployment. We have only done functional
> tests on this for now.

Let me know when you figure out what framework / setup you're planning
on using to stress this.

>     I'm interested in possibly implementing a similar rate-limiter from the
>     kernel side (for parity), but want to see if it's really a problem
>     with the netlink datapath first.
>
> [manu] In kernel path, slow-path and fast-path are executed in
> different contexts, with slow-path executing in dedicated userspace
> threads and fast-path in kernel. An upcall is posted from the kernel
> module over netlink sockets (a.k.a queues) to upcall threads in
> userspace. Hence the number of upcalls can be controlled by
> controlling the depth of this queue. So token-bucket would be
> unnecessary IMO.

Sure, but you end up with similar starvation issues (well... in practice
you run out of file descriptors because of the number of upcall
sockets opened - maybe you see why I'm interested in per-port rate
limiting :)
Aaron Conole June 12, 2018, 7:31 p.m. UTC | #5
Jan Scheurich <jan.scheurich@ericsson.com> writes:

>>     Have you considered making this token bucket per-port instead of
>>     per-pmd?  As I read it, a greedy port can exhaust all the tokens from a
>>     particular PMD, possibly leading to an unfair performance for that PMD
>>     thread.  Am I just being overly paranoid?
>> [manu] Yes, this is possible. But it can happen for both fast and
>> slowpath today, as PMDs sequentially iterate through ports. In order
>> to keep it simple, its done per-PMD. It can be extended to per-port if needed.
>
> The purpose of the upcall rate limiter for the netdev datapath is to
> protect a PMD from getting bogged down by having to process an
> excessive number of upcalls. It is not to police the number of upcalls
> per port to some rate, especially not across multiple PMDs (in the
> case of RSS).

Okay.  I guess you would first create this to police the global upcall
pool, and then add per-port policing later if needed (something like
you indicate at the end of this mail)?

> I think what you are after, Aaron, is some kind of fairness scheme
> that provides each rx queue with a minimum rate of upcalls even if the
> global PMD rate limit is reached? I don't believe simply partitioning
> the global PMD rate limit into a number of smaller rx queue buckets
> would be a good solution. But I don't have a better alternative
> either.

Yes, that's what I'm thinking about.  I'm concerned about scalability
on the kernel side, given the number of upcall fds required in the
kernel case.  I guess that if I'm going to try and propose a solution,
I would want it to match any existing userspace datapath solution (at
least, semantically) if it could.

> I agree with Manu that it should not stop us implementing the
> PMD-level protection. We can add a fairness scheme later, if needed.

Okay - I'm interested in that fairness for other reasons.  Maybe I'll
cook up the patches and see what comes from it.

> BR, Jan
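
A rough sketch of the per-port direction Aaron hints at, reusing the
demo_token_bucket helpers from the sketch above (equally hypothetical;
nothing here comes from the patch): give each port a small guaranteed
bucket and let overflow compete for the shared per-PMD budget, so a
single greedy port cannot drain all tokens.

/* Hypothetical per-port admission on top of the demo_token_bucket
 * helpers sketched earlier in this thread.  A port withdraws from its
 * own guaranteed bucket first and falls back to the shared per-PMD
 * bucket, bounding how much one greedy port can take from the others. */
#define DEMO_N_PORTS 4

struct demo_pmd_policer {
    struct demo_token_bucket shared;                 /* Whole-PMD budget. */
    struct demo_token_bucket per_port[DEMO_N_PORTS]; /* Per-port floor. */
};

static bool
demo_admit_upcall(struct demo_pmd_policer *p, int port, uint64_t now_ms)
{
    return demo_tb_withdraw(&p->per_port[port], now_ms)
           || demo_tb_withdraw(&p->shared, now_ms);
}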

Patch

diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
index 380181d..b717804 100644
--- a/Documentation/howto/dpdk.rst
+++ b/Documentation/howto/dpdk.rst
@@ -358,6 +358,59 @@  devices to bridge ``br0``. Once complete, follow the below steps:
 
        $ cat /proc/interrupts | grep virtio
 
+Upcall rate limiting
+--------------------
+In OVS, when an incoming packet does not match any flow in the flow
+table maintained in fast-path (either in userspace or kernel), the
+packet is processed in the slow-path via a mechanism known as upcall.
+In OVS-DPDK, both fast-path and slow-path execute in the context of a
+common poll-mode-driver (PMD) thread, without any partitioning of CPU
+cycles between the two.
+
+When there is a burst of new flows coming into the data-path, packets
+are punted to slow-path in the order they are received and the PMD is
+busy for the duration of the upcall. Slow-path processing of a packet
+consumes 100-200 times the cycles of fast-path handling. As a result,
+the forwarding performance of a PMD degrades significantly during an
+upcall burst. If the PMD was highly loaded already, it becomes temporarily
+overloaded and its rx queues start filling up. If the upcall burst is long
+enough, packets will be dropped when rx queues are full. This happens even
+if the new flows are unexpected and the slow-path decides to drop the
+packets.
+
+It is likely that most of the packets dropped due to rx queue overflow
+belong to established flows that should have been processed by the
+fast-path. Hence, the current OVS-DPDK architecture favors the handling
+of new flows over the forwarding of established flows. This is generally
+a sub-optimal approach.
+
+Without a limit on the rate of upcalls, OVS-DPDK is vulnerable to DoS
+attacks. Even sporadic bursts of, e.g., unexpected multicast packets
+have been shown to cause such packet drops.
+
+ovs-vsctl can be used to enable and configure upcall rate limit parameters.
+There are two configurable parameters, ``upcall-rate`` and ``upcall-burst``,
+which take effect when the configuration parameter ``upcall-rl`` is set to true.
+
+Upcall rate-limiting is applied per PMD. The configured values for
+``upcall-rate`` and ``upcall-burst`` are used independently in each PMD
+(and non-PMD) thread that executes upcalls.
+
+The upcall rate is set via ``upcall-rate`` in upcalls per second. If not
+configured, a default of 500 upcalls per second is used. For example::
+
+    $ ovs-vsctl set Open_vSwitch . other_config:upcall-rate=2000
+
+The upcall burst size is set via ``upcall-burst`` in upcalls. If not
+configured, a default burst of 500 upcalls is used. For example::
+
+    $ ovs-vsctl set Open_vSwitch . other_config:upcall-burst=2000
+
+The upcall rate-limit feature must be globally enabled via ``upcall-rl``. For
+example::
+
+    $ ovs-vsctl set Open_vSwitch . other_config:upcall-rl=true
+
 Further Reading
 ---------------
 
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 5993c25..e4b945a 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -55,6 +55,7 @@  enum pmd_stat_type {
                              * number of packet passes through the datapath
                              * pipeline and should not be overlapping with each
                              * other. */
+    PMD_STAT_RL_DROP,       /* Packets dropped due to upcall rate-limit. */
     PMD_STAT_MASKED_LOOKUP, /* Number of subtable lookups for flow table
                                hits. Each MASKED_HIT hit will have >= 1
                                MASKED_LOOKUP(s). */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index be31fd0..dba8629 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -101,6 +101,11 @@  static struct shash dp_netdevs OVS_GUARDED_BY(dp_netdev_mutex)
 
 static struct vlog_rate_limit upcall_rl = VLOG_RATE_LIMIT_INIT(600, 600);
 
+/* These default values may be tuned based on upcall performance */
+#define UPCALL_RATELIMIT_DEFAULT false /* Disabled by default */
+#define UPCALL_RATE_DEFAULT      500  /* upcalls-per-sec */
+#define UPCALL_BURST_DEFAULT     500  /* maximum burst of upcalls */
+
 #define DP_NETDEV_CS_SUPPORTED_MASK (CS_NEW | CS_ESTABLISHED | CS_RELATED \
                                      | CS_INVALID | CS_REPLY_DIR | CS_TRACKED \
                                      | CS_SRC_NAT | CS_DST_NAT)
@@ -316,6 +321,12 @@  struct dp_netdev {
     uint64_t last_tnl_conf_seq;
 
     struct conntrack conntrack;
+
+    /* Upcall rate-limiter parameters */
+    atomic_bool upcall_ratelimit;
+    atomic_uint32_t upcall_rate;
+    atomic_uint32_t upcall_burst;
+    bool upcall_params_changed;
 };
 
 static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
@@ -615,6 +626,11 @@  struct dp_netdev_pmd_thread {
     /* Keep track of detailed PMD performance statistics. */
     struct pmd_perf_stats perf_stats;
 
+    /* Policer to rate limit upcalls */
+    struct token_bucket upcall_tb;  /* Cache of rate and burst parameters. */
+    bool upcall_ratelimit;          /* Cache of global enable/disable
+                                       parameter. */
+
     /* Set to true if the pmd thread needs to be reloaded. */
     bool need_reload;
 };
@@ -856,12 +872,13 @@  pmd_info_show_stats(struct ds *reply,
                   "\tavg. subtable lookups per megaflow hit: %.02f\n"
                   "\tmiss with success upcall: %"PRIu64"\n"
                   "\tmiss with failed upcall: %"PRIu64"\n"
+                  "\tdrops due to upcall ratelimiter: %"PRIu64"\n"
                   "\tavg. packets per output batch: %.02f\n",
                   total_packets, stats[PMD_STAT_RECIRC],
                   passes_per_pkt, stats[PMD_STAT_EXACT_HIT],
                   stats[PMD_STAT_MASKED_HIT], lookups_per_hit,
                   stats[PMD_STAT_MISS], stats[PMD_STAT_LOST],
-                  packets_per_batch);
+                  stats[PMD_STAT_RL_DROP], packets_per_batch);
 
     if (total_cycles == 0) {
         return;
@@ -1479,6 +1496,7 @@  dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
         stats->n_hit += pmd_stats[PMD_STAT_MASKED_HIT];
         stats->n_missed += pmd_stats[PMD_STAT_MISS];
         stats->n_lost += pmd_stats[PMD_STAT_LOST];
+        stats->n_rl_dropped += pmd_stats[PMD_STAT_RL_DROP];
     }
     stats->n_masks = UINT32_MAX;
     stats->n_mask_hit = UINT64_MAX;
@@ -1489,11 +1507,24 @@  dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
 static void
 dp_netdev_reload_pmd__(struct dp_netdev_pmd_thread *pmd)
 {
+    uint32_t rate, burst;
+
     if (pmd->core_id == NON_PMD_CORE_ID) {
         ovs_mutex_lock(&pmd->dp->non_pmd_mutex);
         ovs_mutex_lock(&pmd->port_mutex);
         pmd_load_cached_ports(pmd);
         ovs_mutex_unlock(&pmd->port_mutex);
+
+        /* Reconfig the upcall policer if params have changed */
+        atomic_read_relaxed(&pmd->dp->upcall_rate, &rate);
+        atomic_read_relaxed(&pmd->dp->upcall_burst, &burst);
+        if ((rate != pmd->upcall_tb.rate) ||
+            (burst != pmd->upcall_tb.burst)) {
+            token_bucket_init(&pmd->upcall_tb, rate, burst);
+        }
+        atomic_read_relaxed(&pmd->dp->upcall_ratelimit,
+                            &pmd->upcall_ratelimit);
+
         ovs_mutex_unlock(&pmd->dp->non_pmd_mutex);
         return;
     }
@@ -2987,6 +3018,9 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
                         DEFAULT_EM_FLOW_INSERT_INV_PROB);
     uint32_t insert_min, cur_min;
     uint32_t tx_flush_interval, cur_tx_flush_interval;
+    uint32_t rate, cur_rate;
+    uint32_t burst, cur_burst;
+    bool ratelimit, cur_ratelimit;
 
     tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
                                      DEFAULT_TX_FLUSH_INTERVAL);
@@ -3021,6 +3055,44 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
         }
     }
 
+    /* Handle upcall policer params */
+    ratelimit = smap_get_bool(other_config, "upcall-rl",
+                              UPCALL_RATELIMIT_DEFAULT);
+    rate = smap_get_int(other_config, "upcall-rate",
+                        UPCALL_RATE_DEFAULT);
+    burst = smap_get_int(other_config, "upcall-burst",
+                         UPCALL_BURST_DEFAULT);
+
+    atomic_read_relaxed(&dp->upcall_ratelimit, &cur_ratelimit);
+    atomic_read_relaxed(&dp->upcall_rate, &cur_rate);
+    atomic_read_relaxed(&dp->upcall_burst, &cur_burst);
+
+    if ((rate != cur_rate) || (burst != cur_burst)) {
+        VLOG_INFO("Upcall ratelimiter params changed : Old - rate=%d burst=%d "
+                  ": New - rate=%d burst=%d\n", cur_rate, cur_burst,
+                  rate, burst);
+
+        atomic_store_relaxed(&dp->upcall_rate, rate);
+        atomic_store_relaxed(&dp->upcall_burst, burst);
+
+        dp->upcall_params_changed = true;
+    }
+
+    if (ratelimit != cur_ratelimit) {
+        VLOG_INFO("Upcall ratelimiter changed to %s\n",
+                  (ratelimit ? "Enabled" : "Disabled"));
+        VLOG_INFO("Upcall ratelimiter params : rate=%d burst=%d\n",
+                  rate, burst);
+
+        atomic_store_relaxed(&dp->upcall_ratelimit, ratelimit);
+
+        dp->upcall_params_changed = true;
+    }
+
+    if (dp->upcall_params_changed) {
+        dp_netdev_request_reconfigure(dp);
+    }
+
     return 0;
 }
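
The parameter handling above follows a read-compare-store pattern with relaxed
atomics; relaxed ordering suffices because the polling threads only re-read
the values after an explicit reload request. A minimal sketch of the same
pattern in plain C11 (rl_config and rl_config_update are hypothetical names,
standing in for the dp_netdev fields and the code above):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct rl_config {
        _Atomic uint32_t rate;
        _Atomic uint32_t burst;
    };

    /* Returns true when the stored values changed, i.e. when a reload of
     * the polling threads should be requested. */
    static bool
    rl_config_update(struct rl_config *cfg, uint32_t rate, uint32_t burst)
    {
        uint32_t cur_rate = atomic_load_explicit(&cfg->rate,
                                                 memory_order_relaxed);
        uint32_t cur_burst = atomic_load_explicit(&cfg->burst,
                                                  memory_order_relaxed);

        if (rate == cur_rate && burst == cur_burst) {
            return false;
        }
        atomic_store_explicit(&cfg->rate, rate, memory_order_relaxed);
        atomic_store_explicit(&cfg->burst, burst, memory_order_relaxed);
        return true;
    }
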
 
@@ -3653,6 +3725,7 @@  reconfigure_pmd_threads(struct dp_netdev *dp)
     struct hmapx_node *node;
     bool changed = false;
     bool need_to_adjust_static_tx_qids = false;
+    bool need_to_reinit_upcall_ratelimiter = false;
 
     /* The pmd threads should be started only if there's a pmd port in the
      * datapath.  If the user didn't provide any "pmd-cpu-mask", we start
@@ -3674,6 +3747,13 @@  reconfigure_pmd_threads(struct dp_netdev *dp)
         need_to_adjust_static_tx_qids = true;
     }
 
+    /* Check whether the upcall rate limiter parameters have changed. */
+    if (dp->upcall_params_changed) {
+        need_to_reinit_upcall_ratelimiter = true;
+
+        dp->upcall_params_changed = false;
+    }
+
     /* Check for unwanted pmd threads */
     CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
         if (pmd->core_id == NON_PMD_CORE_ID) {
@@ -3682,7 +3762,8 @@  reconfigure_pmd_threads(struct dp_netdev *dp)
         if (!ovs_numa_dump_contains_core(pmd_cores, pmd->numa_id,
                                                     pmd->core_id)) {
             hmapx_add(&to_delete, pmd);
-        } else if (need_to_adjust_static_tx_qids) {
+        } else if (need_to_adjust_static_tx_qids ||
+                   need_to_reinit_upcall_ratelimiter) {
             pmd->need_reload = true;
         }
     }
@@ -4106,6 +4187,7 @@  pmd_thread_main(void *f_)
     int poll_cnt;
     int i;
     int process_packets = 0;
+    uint32_t rate, burst;
 
     poll_list = NULL;
 
@@ -4135,6 +4217,15 @@  reload:
         lc = UINT_MAX;
     }
 
+    /* Reconfigure the upcall policer with the configured parameters. */
+    atomic_read_relaxed(&pmd->dp->upcall_rate, &rate);
+    atomic_read_relaxed(&pmd->dp->upcall_burst, &burst);
+    if ((rate != pmd->upcall_tb.rate) ||
+        (burst != pmd->upcall_tb.burst)) {
+        token_bucket_init(&pmd->upcall_tb, rate, burst);
+    }
+    atomic_read_relaxed(&pmd->dp->upcall_ratelimit, &pmd->upcall_ratelimit);
+
     pmd->intrvl_tsc_prev = 0;
     atomic_store_relaxed(&pmd->intrvl_cycles, 0);
     cycles_counter_update(s);
@@ -4596,6 +4687,8 @@  static void
 dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
                         unsigned core_id, int numa_id)
 {
+    uint32_t rate, burst;
+
     pmd->dp = dp;
     pmd->core_id = core_id;
     pmd->numa_id = numa_id;
@@ -4627,6 +4720,13 @@  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
         emc_cache_init(&pmd->flow_cache);
         pmd_alloc_static_tx_qid(pmd);
     }
+
+    /* Initialize the upcall policer with the configured parameters. */
+    atomic_read_relaxed(&dp->upcall_rate, &rate);
+    atomic_read_relaxed(&dp->upcall_burst, &burst);
+    token_bucket_init(&pmd->upcall_tb, rate, burst);
+    atomic_read_relaxed(&dp->upcall_ratelimit, &pmd->upcall_ratelimit);
+
     pmd_perf_stats_init(&pmd->perf_stats);
     cmap_insert(&dp->poll_threads, CONST_CAST(struct cmap_node *, &pmd->node),
                 hash_int(core_id, 0));
@@ -5145,6 +5245,7 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
     struct dpcls_rule *rules[PKT_ARRAY_SIZE];
     struct dp_netdev *dp = pmd->dp;
     int upcall_ok_cnt = 0, upcall_fail_cnt = 0;
+    int upcall_rl_drop_cnt = 0;
     int lookup_cnt = 0, add_lookup_cnt;
     bool any_miss;
 
@@ -5185,13 +5286,26 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                 continue;
             }
 
-            int error = handle_packet_upcall(pmd, packet, &keys[i],
-                                             &actions, &put_actions);
+            /*
+             * Grab a token from the upcall policer before entering the
+             * slowpath. If no token is available, drop the packet and account
+             * for it. This rate-limits the packets entering the slowpath.
+             */
+            if (pmd->upcall_ratelimit &&
+                !token_bucket_withdraw(&pmd->upcall_tb, 1)) {
+                dp_packet_delete(packet);
 
-            if (OVS_UNLIKELY(error)) {
-                upcall_fail_cnt++;
+                upcall_rl_drop_cnt++;
             } else {
-                upcall_ok_cnt++;
+
+                int error = handle_packet_upcall(pmd, packet, &keys[i],
+                                                 &actions, &put_actions);
+
+                if (OVS_UNLIKELY(error)) {
+                    upcall_fail_cnt++;
+                } else {
+                    upcall_ok_cnt++;
+                }
             }
         }
 
@@ -5228,6 +5342,8 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                             upcall_ok_cnt);
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_LOST,
                             upcall_fail_cnt);
+    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_RL_DROP,
+                            upcall_rl_drop_cnt);
 }
 
 /* Packets enter the datapath from a port (or from recirculation) here.
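
At this point the whole admission decision collapses into a single bucket
withdrawal per flow miss. A condensed, self-contained sketch of that control
flow, reusing tb_sketch from the earlier example (struct pkt, pkt_delete() and
do_upcall() are stubs standing in for dp_packet, dp_packet_delete() and
handle_packet_upcall()):

    #include <stdbool.h>
    #include <stdint.h>

    struct pkt { int id; };                           /* Placeholder packet. */
    static void pkt_delete(struct pkt *p) { (void) p; }        /* Stub drop. */
    static int do_upcall(struct pkt *p) { (void) p; return 0; } /* 0 = ok. */

    struct miss_counters {
        uint64_t upcall_ok;    /* Upcalls admitted and handled. */
        uint64_t upcall_fail;  /* Upcalls admitted but failed. */
        uint64_t rl_drop;      /* Packets dropped by the rate limiter. */
    };

    static void
    handle_flow_miss(struct tb_sketch *tb, bool ratelimit_on,
                     struct pkt *p, struct miss_counters *c)
    {
        if (ratelimit_on && !tb_sketch_withdraw(tb, 1)) {
            pkt_delete(p);          /* No token left: drop, never upcall. */
            c->rl_drop++;
        } else if (do_upcall(p)) {  /* Token granted: take the slowpath. */
            c->upcall_fail++;
        } else {
            c->upcall_ok++;
        }
    }
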
diff --git a/lib/dpif.h b/lib/dpif.h
index 94f89ec..708e798 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -438,6 +438,7 @@  struct dpif_dp_stats {
     uint64_t n_hit;             /* Number of flow table matches. */
     uint64_t n_missed;          /* Number of flow table misses. */
     uint64_t n_lost;            /* Number of misses not sent to userspace. */
+    uint64_t n_rl_dropped;      /* Number of drops due to upcall rate-limit. */
     uint64_t n_flows;           /* Number of flows present. */
     uint64_t n_mask_hit;        /* Number of mega flow masks visited for
                                    flow table matches. */
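
Consumers of struct dpif_dp_stats, e.g. via dpif_get_dp_stats(), can then
report the new counter next to the existing miss counters. A hypothetical
helper, assuming lib/dpif.h is on the include path:

    #include <inttypes.h>
    #include <stdio.h>
    #include "dpif.h"   /* For struct dpif_dp_stats. */

    /* Hypothetical reporting helper: prints the new drop counter alongside
     * the existing datapath counters from a filled-in dpif_dp_stats. */
    static void
    report_dp_stats(const struct dpif_dp_stats *s)
    {
        printf("hit:%"PRIu64" missed:%"PRIu64" lost:%"PRIu64
               " rl-dropped:%"PRIu64"\n",
               s->n_hit, s->n_missed, s->n_lost, s->n_rl_dropped);
    }
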
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 94a64dd..858ba4e 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -428,6 +428,48 @@ 
         </p>
       </column>
 
+      <column name="other_config" key="upcall-rl"
+              type='{"type": "boolean"}'>
+        <p>
+          Set this value to <code>true</code> to enable upcall rate-limiting.
+          The <code>upcall-rate</code> and <code>upcall-burst</code>
+          parameters are ignored unless this is set.
+        </p>
+        <p>
+          The default value is <code>false</code>, which leaves upcall
+          rate-limiting disabled.
+        </p>
+      </column>
+
+      <column name="other_config" key="upcall-rate"
+        type='{"type": "integer", "minInteger": 0, "maxInteger": 4294967295}'>
+        <p>
+          When upcall rate limiting is enabled, this specifies the sustained
+          rate of upcalls, in upcalls per second, admitted per packet-polling
+          thread (PMD or non-PMD) in the netdev datapath. The default value
+          is 500 upcalls per second.
+
+          Setting both upcall-rate and upcall-burst to <code>0</code> disables
+          upcalls in the netdev datapath. This can be used for debugging
+          purposes.
+        </p>
+      </column>
+
+      <column name="other_config" key="upcall-burst"
+        type='{"type": "integer", "minInteger": 0, "maxInteger": 4294967295}'>
+        <p>
+          When upcall rate limiting is enabled, this specifies the maximum
+          burst of upcalls admitted per packet-polling thread (PMD or
+          non-PMD) in the netdev datapath, independently of the packet
+          arrival rate. The default value is a burst of 500 upcalls.
+
+          Setting both upcall-rate and upcall-burst to <code>0</code> disables
+          upcalls in the netdev datapath. This can be used for debugging
+          purposes.
+        </p>
+      </column>
+
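
Taken together, upcall-rate and upcall-burst define the usual token-bucket
envelope: after a quiet period a polling thread may admit up to upcall-burst
upcalls back to back, while upcall-rate bounds the sustained average. The keys
would be set with something on the lines of "ovs-vsctl set Open_vSwitch .
other_config:upcall-rl=true other_config:upcall-rate=500
other_config:upcall-burst=500". The small simulation below, reusing tb_sketch
from the earlier example, makes the defaults concrete:

    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
        struct tb_sketch tb;
        int admitted = 0;

        tb_sketch_init(&tb, 500, 500);          /* The documented defaults. */
        for (int i = 0; i < 1000; i++) {
            admitted += tb_sketch_withdraw(&tb, 1);
        }
        printf("burst phase admitted:  %d\n", admitted);   /* ~500. */

        struct timespec delay = { .tv_sec = 0,
                                  .tv_nsec = 100 * 1000 * 1000 };
        nanosleep(&delay, NULL);                /* Let the bucket refill. */

        admitted = 0;
        for (int i = 0; i < 1000; i++) {
            admitted += tb_sketch_withdraw(&tb, 1);
        }
        printf("after 100 ms admitted: %d\n", admitted);   /* ~50. */
        return 0;
    }

With rate=0 and burst=0 the bucket never holds a token, which is how the
documented debugging mode of disabling upcalls entirely falls out.
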
       <column name="other_config" key="vlan-limit"
               type='{"type": "integer", "minInteger": 0}'>
         <p>