[ovs-dev,v2] netdev_dpdk.c: Add QoS functionality.
diff mbox

Message ID 1445002090-29185-1-git-send-email-ian.stokes@intel.com
State Changes Requested
Headers show

Commit Message

Stokes, Ian Oct. 16, 2015, 1:28 p.m. UTC
This patch provides the modifications required in netdev-dpdk.c and
vswitch.xml to allow for a DPDK user space QoS algorithm.

This patch adds a QoS configuration structure for netdev-dpdk and
expected QoS operations 'dpdk_qos_ops'. Various helper functions
are also supplied.

Also included are the modifications required for vswitch.xml to allow a
new QoS implementation for netdev-dpdk devices. This includes a new QoS type
`egress-policer` as well as its expected QoS table entries.

The QoS functionality implemented for DPDK devices is `egress-policer`.
This can be used to drop egress packets at a configurable rate.

The INSTALL.DPDK.md guide has also been modified to provide an example
configuration of `egress-policer` QoS.

Signed-off-by: Ian Stokes <ian.stokes@intel.com>
---
 INSTALL.DPDK.md      |   20 +++
 lib/netdev-dpdk.c    |  416 +++++++++++++++++++++++++++++++++++++++++++++++++-
 vswitchd/vswitch.xml |   52 +++++++
 3 files changed, 481 insertions(+), 7 deletions(-)
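The commit introduces a `qos_alg_process` hook that takes an array of mbufs plus a count and returns the (possibly smaller) count after policing. A minimal, self-contained sketch of that contract, using a stand-in packet type and a toy length-based drop policy rather than the real `struct rte_mbuf` and rte_meter (all names below are illustrative, not from the patch):

```c
/* Stand-in packet type; the real hook operates on struct rte_mbuf. */
struct pkt {
    int len;
    int freed;  /* set when the "policer" drops the packet */
};

/* Illustrative qos_alg_process-style hook: walk 'pkt_cnt' packets, drop
 * any that violate the (toy) policy, and return the new count with the
 * surviving packets compacted to the front of the array so the caller
 * can pass the same array straight to the send path. */
static int
example_qos_alg_process(struct pkt **pkts, int pkt_cnt, int max_len)
{
    int out = 0;
    int i;

    for (i = 0; i < pkt_cnt; i++) {
        if (pkts[i]->len > max_len) {
            pkts[i]->freed = 1;     /* rte_pktmbuf_free() in the real code */
        } else {
            pkts[out++] = pkts[i];  /* keep this packet */
        }
    }
    return out;
}
```

A hook written this way packs the survivors to the front of the array, so the returned count alone is enough for the caller to hand the same array to the port send function.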

Comments

Stokes, Ian Oct. 22, 2015, 9:29 a.m. UTC | #1
Hi All,

Just wondering if anyone has input on v2 of the netdev-dpdk QoS patch?
Any feedback would be much appreciated.

Thanks
Ian

> -----Original Message-----
> From: Stokes, Ian
> Sent: Friday, October 16, 2015 2:28 PM
> To: dev@openvswitch.org
> Cc: Stokes, Ian
> Subject: [PATCH v2] netdev_dpdk.c: Add QoS functionality.
> 
> This patch provides the modifications required in netdev-dpdk.c and
> vswitch.xml to allow for a DPDK user space QoS algorithm.
> 
> This patch adds a QoS configuration structure for netdev-dpdk and
> expected QoS operations 'dpdk_qos_ops'. Various helper functions are
> also supplied.
> 
> Also included are the modifications required for vswitch.xml to allow a
> new QoS implementation for netdev-dpdk devices. This includes a new QoS
> type `egress-policer` as well as its expected QoS table entries.
> 
> The QoS functionality implemented for DPDK devices is `egress-policer`.
> This can be used to drop egress packets at a configurable rate.
> 
> The INSTALL.DPDK.md guide has also been modified to provide an example
> configuration of `egress-policer` QoS.
> 
> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> ---
>  INSTALL.DPDK.md      |   20 +++
>  lib/netdev-dpdk.c    |  416 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  vswitchd/vswitch.xml |   52 +++++++
>  3 files changed, 481 insertions(+), 7 deletions(-)
> 
> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> index 7bf110c..c64b8ce 100644
> --- a/INSTALL.DPDK.md
> +++ b/INSTALL.DPDK.md
> @@ -207,6 +207,26 @@ Using the DPDK with ovs-vswitchd:
>     ./ovs-ofctl add-flow br0 in_port=2,action=output:1
>     ```
> 
> +8. QoS usage example
> +
> +   Assuming you have a vhost-user port transmitting traffic consisting of
> +   packets of size 64 bytes, the following command would limit the egress
> +   transmission rate of the port to ~1,000,000 packets per second:
> +
> +   `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create qos
> +   type=egress-policer other-config:cir=46000000 other-config:cbs=2048`
> +
> +   To examine the QoS configuration of the port:
> +
> +   `ovs-appctl -t ovs-vswitchd qos/show vhost-user0`
> +
> +   To clear the QoS configuration from the port and ovsdb use the following:
> +
> +   `ovs-vsctl -- destroy QoS vhost-user0 -- clear Port vhost-user0 qos`
> +
> +   For more details regarding egress-policer parameters please refer to the
> +   vswitch.xml.
> +
>  Performance Tuning:
>  -------------------
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 93b0589..cd4c532 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -44,6 +44,7 @@
>  #include "ovs-rcu.h"
>  #include "packets.h"
>  #include "shash.h"
> +#include "smap.h"
>  #include "sset.h"
>  #include "unaligned.h"
>  #include "timeval.h"
> @@ -52,12 +53,14 @@
> 
>  #include "rte_config.h"
>  #include "rte_mbuf.h"
> +#include "rte_meter.h"
>  #include "rte_virtio_net.h"
> 
>  VLOG_DEFINE_THIS_MODULE(dpdk);
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> 
>  #define DPDK_PORT_WATCHDOG_INTERVAL 5
> +#define DPDK_MAX_QOS_NAME_SIZE 10
> 
>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>  #define OVS_VPORT_DPDK "ovs_dpdk"
> @@ -142,6 +145,107 @@ static int rte_eal_init_ret = ENODEV;
> 
>  static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
> 
> +/* Quality of Service */
> +
> +/* An instance of a QoS configuration.  Always associated with a particular
> + * network device.
> + *
> + * Each QoS implementation subclasses this with whatever additional data it
> + * needs.
> + */
> +struct qos_conf {
> +    const struct dpdk_qos_ops *ops;
> +};
> +
> +/* A particular implementation of dpdk QoS operations.
> + *
> + * The functions below return 0 if successful or a positive errno value on
> + * failure, except where otherwise noted. All of them must be provided, except
> + * where otherwise noted.
> + */
> +struct dpdk_qos_ops {
> +
> +    /* Name of the QoS type */
> +    const char *qos_name;
> +
> +    /* Called to construct the QoS implementation on 'netdev'. The
> +     * implementation should make the appropriate calls to configure QoS
> +     * according to 'details'. The implementation may assume that any current
> +     * QoS configuration already installed should be destroyed before
> +     * constructing the new configuration.
> +     *
> +     * The contents of 'details' should be documented as valid for 'ovs_name'
> +     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
> +     * (which is built as ovs-vswitchd.conf.db(8)).
> +     *
> +     * This function must return 0 if and only if it sets 'netdev->qos_conf'
> +     * to an initialized 'struct qos_conf'.
> +     *
> +     * For all QoS implementations it should always be non-null.
> +     */
> +    int (*qos_construct)(struct netdev *netdev, const struct smap *details);
> +
> +    /* Destroys the data structures allocated by the implementation as part of
> +     * 'qos_conf'.
> +     *
> +     * For all QoS implementations it should always be non-null.
> +     */
> +    void (*qos_destruct)(struct netdev *netdev, struct qos_conf *conf);
> +
> +    /* Retrieves details of 'netdev->qos_conf' configuration into 'details'.
> +     *
> +     * The contents of 'details' should be documented as valid for 'ovs_name'
> +     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
> +     * (which is built as ovs-vswitchd.conf.db(8)).
> +     */
> +    int (*qos_get)(const struct netdev *netdev, struct smap *details);
> +
> +    /* Reconfigures 'netdev->qos_conf' according to 'details', performing any
> +     * required calls to complete the reconfiguration.
> +     *
> +     * The contents of 'details' should be documented as valid for 'ovs_name'
> +     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
> +     * (which is built as ovs-vswitchd.conf.db(8)).
> +     *
> +     * This function may be null if 'qos_conf' is not configurable.
> +     */
> +    int (*qos_set)(struct netdev *netdev, const struct smap *details);
> +
> +    /* Modify an array of rte_mbufs. The modification is specific to
> +     * each qos implementation.
> +     *
> +     * The function should take an array of mbufs and an int representing
> +     * the current number of mbufs present in the array.
> +     *
> +     * After the function has performed a qos modification to the array of
> +     * mbufs it returns an int representing the number of mbufs now present in
> +     * the array. This value can then be passed to the port send function
> +     * along with the modified array for transmission.
> +     *
> +     * For all QoS implementations it should always be non-null.
> +     */
> +    int (*qos_alg_process)(struct netdev *netdev, struct rte_mbuf **pkts,
> +                           int pkt_cnt);
> +};
> +
> +/* dpdk_qos_ops for each type of user space QoS implementation */
> +static const struct dpdk_qos_ops egress_policer_ops;
> +
> +/*
> + * Array of dpdk_qos_ops, contains pointer to all supported QoS
> + * operations.
> + */
> +static const struct dpdk_qos_ops *const qos_confs[] = {
> +    &egress_policer_ops,
> +    NULL
> +};
> +
> +/* Action that can be set for a packet for rte_meter */
> +enum egress_policer_action {
> +        GREEN = e_RTE_METER_GREEN,
> +        DROP = 1,
> +};
> +
>  /* Contains all 'struct dpdk_dev's. */
>  static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
>      = OVS_LIST_INITIALIZER(&dpdk_list);
> @@ -235,6 +339,11 @@ struct netdev_dpdk {
> 
>      /* In dpdk_list. */
>      struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
> +
> +    /* QoS configuration and lock for the device */
> +    struct qos_conf *qos_conf;
> +    rte_spinlock_t qos_lock;
> +
>  };
> 
>  struct netdev_rxq_dpdk {
> @@ -614,6 +723,10 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
>          goto unlock;
>      }
> 
> +    /* Initialise QoS configuration to NULL and qos lock to unlocked */
> +    netdev->qos_conf = NULL;
> +    rte_spinlock_init(&netdev->qos_lock);
> +
>      netdev_->n_txq = NR_QUEUE;
>      netdev_->n_rxq = NR_QUEUE;
>      netdev->real_n_txq = NR_QUEUE;
> @@ -1038,6 +1151,21 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet **packets,
>      return 0;
>  }
> 
> +static inline int
> +netdev_dpdk_qos_process__(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> +                        int cnt)
> +{
> +    struct netdev *netdev = &dev->up;
> +
> +    rte_spinlock_lock(&dev->qos_lock);
> +    if (dev->qos_conf != NULL) {
> +        cnt = dev->qos_conf->ops->qos_alg_process(netdev, pkts, cnt);
> +    }
> +    rte_spinlock_unlock(&dev->qos_lock);
> +
> +    return cnt;
> +}
> +
>  static inline void
>  netdev_dpdk_vhost_update_tx_counters(struct netdev_stats *stats,
>                                       struct dp_packet **packets,
> @@ -1072,6 +1200,9 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, struct dp_packet **pkts,
>          goto out;
>      }
> 
> +    /* Check if QoS has been configured for the netdev. */
> +    cnt = netdev_dpdk_qos_process__(vhost_dev, cur_pkts, cnt);
> +
>      /* There is vHost TX single queue, So we need to lock it for TX. */
>      rte_spinlock_lock(&vhost_dev->vhost_tx_lock);
> 
> @@ -1216,6 +1347,10 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet **pkts,
>      if (dev->type == DPDK_DEV_VHOST) {
>          __netdev_dpdk_vhost_send(netdev, (struct dp_packet **) mbufs,
>                                   newcnt, true);
>      } else {
> +
> +        /* Check if QoS has been configured for this netdev. */
> +        newcnt = netdev_dpdk_qos_process__(dev, mbufs, newcnt);
> +
>          dpdk_queue_pkts(dev, qid, mbufs, newcnt);
>          dpdk_queue_flush(dev, qid);
>      }
> @@ -1275,9 +1410,15 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
> 
>              if (OVS_UNLIKELY(size > dev->max_packet_len)) {
>                  if (next_tx_idx != i) {
> +                    cnt = i - next_tx_idx;
> +
> +                    cnt = netdev_dpdk_qos_process__(dev,
> +                                                    (struct rte_mbuf**)pkts,
> +                                                    cnt);
> +
>                      dpdk_queue_pkts(dev, qid,
>                                      (struct rte_mbuf **)&pkts[next_tx_idx],
> -                                    i-next_tx_idx);
> +                                    cnt);
>                  }
> 
>                  VLOG_WARN_RL(&rl, "Too big size %d max_packet_len %d",
> @@ -1289,9 +1430,12 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
>              }
>          }
>          if (next_tx_idx != cnt) {
> -           dpdk_queue_pkts(dev, qid,
> -                            (struct rte_mbuf **)&pkts[next_tx_idx],
> -                            cnt-next_tx_idx);
> +            cnt -= next_tx_idx;
> +
> +            cnt = netdev_dpdk_qos_process__(dev, (struct rte_mbuf**)pkts,
> +                                            cnt);
> +
> +            dpdk_queue_pkts(dev, qid, (struct rte_mbuf **)&pkts[next_tx_idx],
> +                            cnt);
>          }
> 
>          if (OVS_UNLIKELY(dropped)) {
> @@ -2032,6 +2176,264 @@ unlock_dpdk:
>      return err;
>  }
> 
> +/* QoS Functions */
> +
> +/*
> + * Initialize QoS configuration operations.
> + */
> +static void
> +qos_conf_init(struct qos_conf *conf, const struct dpdk_qos_ops *ops)
> +{
> +    conf->ops = ops;
> +}
> +
> +/*
> + * Search existing QoS operations in qos_ops and compare each set of
> + * operations qos_name to name. Return a dpdk_qos_ops pointer to a match,
> + * else return NULL
> + */
> +static const struct dpdk_qos_ops *
> +qos_lookup_name(const char *name)
> +{
> +    const struct dpdk_qos_ops *const *opsp;
> +
> +    for (opsp = qos_confs; *opsp != NULL; opsp++) {
> +        const struct dpdk_qos_ops *ops = *opsp;
> +        if (!strncmp(name, ops->qos_name,DPDK_MAX_QOS_NAME_SIZE)) {
> +            return ops;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +/*
> + * Call qos_destruct to clean up items associated with the netdev's
> + * qos_conf. Set the netdev's qos_conf to NULL.
> + */
> +static void
> +qos_delete_conf(struct netdev *netdev_)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +
> +    rte_spinlock_lock(&netdev->qos_lock);
> +    if (netdev->qos_conf) {
> +        if (netdev->qos_conf->ops->qos_destruct) {
> +            netdev->qos_conf->ops->qos_destruct(netdev_, netdev->qos_conf);
> +        }
> +        netdev->qos_conf = NULL;
> +    }
> +    rte_spinlock_unlock(&netdev->qos_lock);
> +}
> +
> +static int
> +netdev_dpdk_get_qos_types(const struct netdev *netdev OVS_UNUSED,
> +                           struct sset *types)
> +{
> +    const struct dpdk_qos_ops *const *opsp;
> +
> +    for (opsp = qos_confs; *opsp != NULL; opsp++) {
> +        const struct dpdk_qos_ops *ops = *opsp;
> +        if (ops->qos_construct && ops->qos_name[0] != '\0') {
> +            sset_add(types, ops->qos_name);
> +        }
> +    }
> +    return 0;
> +}
> +
> +static int
> +netdev_dpdk_get_qos(const struct netdev *netdev_,
> +                    const char **typep, struct smap *details)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    int error = 0;
> +
> +    ovs_mutex_lock(&netdev->mutex);
> +    if(netdev->qos_conf) {
> +        *typep = netdev->qos_conf->ops->qos_name;
> +        error = (netdev->qos_conf->ops->qos_get
> +                 ? netdev->qos_conf->ops->qos_get(netdev_, details): 0);
> +    }
> +    ovs_mutex_unlock(&netdev->mutex);
> +
> +    return error;
> +}
> +
> +static int
> +netdev_dpdk_set_qos(struct netdev *netdev_,
> +                    const char *type, const struct smap *details)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    const struct dpdk_qos_ops *new_ops = NULL;
> +    int error = 0;
> +
> +    /* If type is empty the current QoS configuration for the dpdk-netdev
> +     * can be destroyed */
> +    if(type[0] == '\0') {
> +        qos_delete_conf(netdev_);
> +    }
> +
> +    new_ops = qos_lookup_name(type);
> +    if (!new_ops || !new_ops->qos_construct) {
> +        return EOPNOTSUPP;
> +    }
> +
> +    ovs_mutex_lock(&netdev->mutex);
> +    if (netdev->qos_conf) {
> +        if (new_ops == netdev->qos_conf->ops) {
> +            error = new_ops->qos_set ? new_ops->qos_set(netdev_, details) : 0;
> +        }
> +        else {
> +            /* Delete existing QoS configuration. */
> +            qos_delete_conf(netdev_);
> +            ovs_assert(netdev->qos_conf == NULL);
> +
> +            /* Install new QoS configuration. */
> +            error = new_ops->qos_construct(netdev_, details);
> +            ovs_assert((error == 0) == (netdev->qos_conf != NULL));
> +        }
> +    } else {
> +        error = new_ops->qos_construct(netdev_, details);
> +        ovs_assert((error == 0) == (netdev->qos_conf != NULL));
> +    }
> +
> +    ovs_mutex_unlock(&netdev->mutex);
> +    return error;
> +}
> +
> +/* egress-policer details */
> +
> +struct egress_policer {
> +    struct qos_conf qos_conf;
> +    struct rte_meter_srtcm_params app_srtcm_params;
> +    struct rte_meter_srtcm test_meter;
> +};
> +
> +static struct egress_policer *
> +egress_policer_get__(const struct netdev *netdev_)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    return CONTAINER_OF(netdev->qos_conf, struct egress_policer, qos_conf);
> +}
> +
> +static int
> +egress_policer_qos_construct(struct netdev *netdev_,
> +                            const struct smap *details)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    struct egress_policer *policer;
> +    const char *cir_s;
> +    const char *cbs_s;
> +    int err = 0;
> +
> +    rte_spinlock_lock(&netdev->qos_lock);
> +    policer = xmalloc(sizeof *policer);
> +    qos_conf_init(&policer->qos_conf, &egress_policer_ops);
> +    netdev->qos_conf = &policer->qos_conf;
> +    cir_s = smap_get(details, "cir");
> +    cbs_s = smap_get(details, "cbs");
> +    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10) : 0;
> +    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10) : 0;
> +    policer->app_srtcm_params.ebs = 0;
> +    err = rte_meter_srtcm_config(&policer->test_meter,
> +                                    &policer->app_srtcm_params);
> +    rte_spinlock_unlock(&netdev->qos_lock);
> +
> +    return err;
> +}
> +
> +static void
> +egress_policer_qos_destruct(struct netdev *netdev_ OVS_UNUSED,
> +                        struct qos_conf *conf)
> +{
> +    struct egress_policer *policer = CONTAINER_OF(conf, struct egress_policer,
> +                                                qos_conf);
> +    free(policer);
> +}
> +
> +static int
> +egress_policer_qos_get(const struct netdev *netdev, struct smap *details)
> +{
> +    struct egress_policer *policer = egress_policer_get__(netdev);
> +    smap_add_format(details, "cir", "%llu",
> +                    1ULL * policer->app_srtcm_params.cir);
> +    smap_add_format(details, "cbs", "%llu",
> +                    1ULL * policer->app_srtcm_params.cbs);
> +    return 0;
> +}
> +
> +static int
> +egress_policer_qos_set(struct netdev *netdev_, const struct smap *details)
> +{
> +    struct egress_policer *policer;
> +    const char *cir_s;
> +    const char *cbs_s;
> +    int err = 0;
> +
> +    policer = egress_policer_get__(netdev_);
> +    cir_s = smap_get(details, "cir");
> +    cbs_s = smap_get(details, "cbs");
> +    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10) : 0;
> +    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10) : 0;
> +    policer->app_srtcm_params.ebs = 0;
> +    err = rte_meter_srtcm_config(&policer->test_meter,
> +                                    &policer->app_srtcm_params);
> +
> +    return err;
> +}
> +
> +static inline int
> +egress_policer_pkt_handle__(struct rte_meter_srtcm *meter,
> +                            struct rte_mbuf *pkt, uint64_t time)
> +{
> +    uint8_t output_color;
> +    uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct ether_hdr);
> +    enum egress_policer_action action = GREEN;
> +
> +    /* color input is not used for blind modes */
> +    output_color = (uint8_t) rte_meter_srtcm_color_blind_check(meter, time,
> +                                                               pkt_len);
> +
> +    /* Check output color, 0 corresponds to GREEN i.e. packet can be
> +     * transmitted. If greater than 0 then action is set to DROP. */
> +    if (output_color) {
> +        action = DROP;
> +    }
> +
> +    return action;
> +}
> +
> +static int
> +egress_policer_alg_process(struct netdev *netdev_, struct rte_mbuf **pkts,
> +                        int pkt_cnt)
> +{
> +    int i = 0;
> +    int cnt = pkt_cnt;
> +    struct egress_policer *policer = egress_policer_get__(netdev_);
> +    struct rte_mbuf *pkt = NULL;
> +    uint64_t current_time = rte_rdtsc();
> +
> +    for(i = 0; i < pkt_cnt; i++) {
> +        pkt = pkts[i];
> +        /* Handle current packet */
> +        if (egress_policer_pkt_handle__(&policer->test_meter, pkt,
> +                                        current_time) == DROP) {
> +            rte_pktmbuf_free(pkt);
> +            cnt = cnt - 1;
> +        }
> +    }
> +
> +    return cnt;
> +}
> +
> +static const struct dpdk_qos_ops egress_policer_ops = {
> +    "egress-policer",    /* qos_name */
> +    egress_policer_qos_construct,
> +    egress_policer_qos_destruct,
> +    egress_policer_qos_get,
> +    egress_policer_qos_set,
> +    egress_policer_alg_process
> +};
> +
>  #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
>      GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
>  {                                                             \
> @@ -2069,10 +2471,10 @@ unlock_dpdk:
>      NULL,                       /* set_advertisements */      \
>                                                                \
>      NULL,                       /* set_policing */            \
> -    NULL,                       /* get_qos_types */           \
> +    netdev_dpdk_get_qos_types,                                \
>      NULL,                       /* get_qos_capabilities */    \
> -    NULL,                       /* get_qos */                 \
> -    NULL,                       /* set_qos */                 \
> +    netdev_dpdk_get_qos,                                      \
> +    netdev_dpdk_set_qos,                                      \
>      NULL,                       /* get_queue */               \
>      NULL,                       /* set_queue */               \
>      NULL,                       /* delete_queue */            \
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 4dbd9e3..5be8eb6 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3298,6 +3298,26 @@
>            for information on how this classifier works.
>          </dd>
>        </dl>
> +      <dl>
> +        <dt><code>egress-policer</code></dt>
> +        <dd>
> +          An egress policer algorithm. This implementation uses the DPDK
> +          rte_meter library. The rte_meter library provides an implementation
> +          of srTCM (RFC2697) which allows the metering and policing of
> +          traffic. The implementation in OVS creates a color blind srTCM
> +          meter with a single token bucket used to police traffic. It should
> +          be noted that when the rte_meter is configured as part of QoS there
> +          will be a performance overhead as the rte_meter itself will consume
> +          CPU cycles in order to police traffic. These CPU cycles ordinarily
> +          are used for packet processing. As such the drop in performance
> +          will be noticed in terms of overall aggregate traffic throughput.
> +
> +          For more information regarding the usage of the DPDK rte_meter see
> +          <code>http://dpdk.org/doc/guides/sample_app_ug/qos_metering.html</code>.
> +          For more information regarding srTCM see
> +          <code>https://tools.ietf.org/html/rfc2697</code>.
> +        </dd>
> +      </dl>
>      </column>
> 
>      <column name="queues">
> @@ -3334,6 +3354,38 @@
>        </column>
>      </group>
> 
> +    <group title="Configuration for egress-policer QoS">
> +      <p>
> +        <ref table="QoS"/> <ref table="QoS" column="type"/>
> +        <code>egress-policer</code> provides egress policing for userspace
> +        port types with DPDK.
> +
> +        It has the following key-value pairs defined.
> +      </p>
> +
> +      <column name="other_config" key="cir" type='{"type": "integer"}'>
> +        The Committed Information Rate (CIR) is measured in bytes of IP
> +        packets per second, i.e. it includes the IP header, but not link
> +        specific (e.g. Ethernet) headers. This represents the bytes per
> +        second rate at which the token bucket will be updated. The cir value
> +        is calculated by (pps x packet data size). For example, assuming a
> +        user wishes to limit a stream consisting of 64 byte packets to 1
> +        million packets per second, the CIR would be set to 46000000. This
> +        value can be broken into '1,000,000 x 46', where 1,000,000 is the
> +        policing rate for the number of packets per second and 46 represents
> +        the size of the packet data for a 64 byte IP packet.
> +      </column>
> +      <column name="other_config" key="cbs" type='{"type": "integer"}'>
> +        The Committed Burst Size (CBS) is measured in bytes and represents a
> +        token bucket. At a minimum this value should be set to the expected
> +        largest size packet in the traffic stream. In practice larger values
> +        may be used to increase the size of the token bucket. If a packet can
> +        be transmitted then the cbs will be decremented by the number of
> +        bytes/tokens of the packet. If there are not enough tokens in the cbs
> +        bucket the packet will be dropped.
> +      </column>
> +    </group>
> +
>      <group title="Common Columns">
>        The overall purpose of these columns is described under <code>Common
>        Columns</code> at the beginning of this document.
> --
> 1.7.4.1
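The `other-config:cir=46000000` value in the usage example above follows from the CIR arithmetic described in the vswitch.xml text: bytes of IP packet per second, i.e. the target packet rate times the per-packet byte count excluding link-level overhead. A hedged sketch of that calculation (the helper name and the explicit 14+4 byte Ethernet accounting are illustrative, not part of the patch):

```c
#include <stdint.h>

/* Illustrative helper: compute an egress-policer CIR, in bytes of IP
 * packet per second, for a target packet rate.  The vswitch.xml text
 * counts IP header and data but not link-specific headers, so for a
 * 64-byte Ethernet frame the 14-byte header and 4-byte CRC are
 * excluded, leaving 46 bytes of IP packet data. */
static uint64_t
egress_policer_cir_for_pps(uint64_t pps, uint32_t frame_len)
{
    const uint32_t eth_header_len = 14;  /* Ethernet header */
    const uint32_t eth_crc_len = 4;      /* frame check sequence */

    return pps * (uint64_t) (frame_len - eth_header_len - eth_crc_len);
}
```

For 1,000,000 packets per second of 64-byte frames this gives 1,000,000 x 46 = 46,000,000, the CIR used in the INSTALL.DPDK.md example.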
Ben Pfaff Nov. 4, 2015, 3:54 a.m. UTC | #2
On Thu, Oct 22, 2015 at 09:29:37AM +0000, Stokes, Ian wrote:
> Just wondering if anyone has input on v2 of the netdev-dpdk QoS patch?
> Any feedback would be much appreciated.

I don't object but I'm the wrong person to affirmatively review it.
Stokes, Ian Nov. 18, 2015, 9:28 a.m. UTC | #3
> -----Original Message-----
> From: Ben Pfaff [mailto:blp@nicira.com]
> Sent: Wednesday, November 04, 2015 3:55 AM
> To: Stokes, Ian
> Cc: dev@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v2] netdev_dpdk.c: Add QoS functionality.
> 
> On Thu, Oct 22, 2015 at 09:29:37AM +0000, Stokes, Ian wrote:
> > Just wondering if anyone has input on v2 of the netdev-dpdk QoS patch?
> > Any feedback would be much appreciated.
> 
> I don't object but I'm the wrong person to affirmatively review it.

Thanks for the reply Ben. I realise the OVS conference is running this week, so people will be busy attending. If anyone is available to review the version 2 patch, it would be much appreciated.

Thanks
Ian
Flavio Leitner Nov. 26, 2015, 8:57 p.m. UTC | #4
Funny, everyone is trying to get more performance out of
vhost-user and you're adding QoS... :-)

I am still looking into it, so don't consider this a
thorough review.


On Fri, Oct 16, 2015 at 02:28:10PM +0100, Ian Stokes wrote:
> This patch provides the modifications required in netdev-dpdk.c and
> vswitch.xml to allow for a DPDK user space QoS algorithm.
> 
> This patch adds a QoS configuration structure for netdev-dpdk and
> expected QoS operations 'dpdk_qos_ops'. Various helper functions
> are also supplied.
> 
> Also included are the modifications required for vswitch.xml to allow a
> new QoS implementation for netdev-dpdk devices. This includes a new QoS type
> `egress-policer` as well as its expected QoS table entries.
> 
> The QoS functionality implemented for DPDK devices is `egress-policer`.
> This can be used to drop egress packets at a configurable rate.
> 
> The INSTALL.DPDK.md guide has also been modified to provide an example
> configuration of `egress-policer` QoS.
> 
> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> ---
>  INSTALL.DPDK.md      |   20 +++
>  lib/netdev-dpdk.c    |  416 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  vswitchd/vswitch.xml |   52 +++++++
>  3 files changed, 481 insertions(+), 7 deletions(-)
> 
> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> index 7bf110c..c64b8ce 100644
> --- a/INSTALL.DPDK.md
> +++ b/INSTALL.DPDK.md
> @@ -207,6 +207,26 @@ Using the DPDK with ovs-vswitchd:
>     ./ovs-ofctl add-flow br0 in_port=2,action=output:1
>     ```
>  
> +8. QoS usage example
> +
> +   Assuming you have a vhost-user port transmitting traffic consisting of
> +   packets of size 64 bytes, the following command would limit the egress
> +   transmission rate of the port to ~1,000,000 packets per second:
> +
> +   `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create qos
> +   type=egress-policer other-config:cir=46000000 other-config:cbs=2048`
> +
> +   To examine the QoS configuration of the port:
> +
> +   `ovs-appctl -t ovs-vswitchd qos/show vhost-user0`
> +
> +   To clear the QoS configuration from the port and ovsdb use the following:
> +
> +   `ovs-vsctl -- destroy QoS vhost-user0 -- clear Port vhost-user0 qos`
> +
> +   For more details regarding egress-policer parameters please refer to the
> +   vswitch.xml.
> +
>  Performance Tuning:
>  -------------------
>  
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 93b0589..cd4c532 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -44,6 +44,7 @@
>  #include "ovs-rcu.h"
>  #include "packets.h"
>  #include "shash.h"
> +#include "smap.h"
>  #include "sset.h"
>  #include "unaligned.h"
>  #include "timeval.h"
> @@ -52,12 +53,14 @@
>  
>  #include "rte_config.h"
>  #include "rte_mbuf.h"
> +#include "rte_meter.h"
>  #include "rte_virtio_net.h"
>  
>  VLOG_DEFINE_THIS_MODULE(dpdk);
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>  
>  #define DPDK_PORT_WATCHDOG_INTERVAL 5
> +#define DPDK_MAX_QOS_NAME_SIZE 10
>  
>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>  #define OVS_VPORT_DPDK "ovs_dpdk"
> @@ -142,6 +145,107 @@ static int rte_eal_init_ret = ENODEV;
>  
>  static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
>  
> +/* Quality of Service */
> +
> +/* An instance of a QoS configuration.  Always associated with a particular
> + * network device.
> + *
> + * Each QoS implementation subclasses this with whatever additional data it
> + * needs.
> + */
> +struct qos_conf {
> +    const struct dpdk_qos_ops *ops;
> +};
> +
> +/* A particular implementation of dpdk QoS operations.
> + *
> + * The functions below return 0 if successful or a positive errno value on
> + * failure, except where otherwise noted. All of them must be provided, except
> + * where otherwise noted.
> + */
> +struct dpdk_qos_ops {
> +
> +    /* Name of the QoS type */
> +    const char *qos_name;
> +
> +    /* Called to construct the QoS implementation on 'netdev'. The
> +     * implementation should make the appropriate calls to configure QoS
> +     * according to 'details'. The implementation may assume that any current
> +     * QoS configuration already installed should be destroyed before
> +     * constructing the new configuration.
> +     *
> +     * The contents of 'details' should be documented as valid for 'ovs_name'
> +     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
> +     * (which is built as ovs-vswitchd.conf.db(8)).
> +     *
> +     * This function must return 0 if and only if it sets 'netdev->qos_conf'
> +     * to an initialized 'struct qos_conf'.
> +     *
> +     * For all QoS implementations it should always be non-null.
> +     */
> +    int (*qos_construct)(struct netdev *netdev, const struct smap *details);
> +
> +    /* Destroys the data structures allocated by the implementation as part of
> +     * 'qos_conf.
> +     *
> +     * For all QoS implementations it should always be non-null.
> +     */
> +    void (*qos_destruct)(struct netdev *netdev, struct qos_conf *conf);
> +
> +    /* Retrieves details of 'netdev->qos_conf' configuration into 'details'.
> +     *
> +     * The contents of 'details' should be documented as valid for 'ovs_name'
> +     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
> +     * (which is built as ovs-vswitchd.conf.db(8)).
> +     */
> +    int (*qos_get)(const struct netdev *netdev, struct smap *details);
> +
> +    /* Reconfigures 'netdev->qos_conf' according to 'details', performing any
> +     * required calls to complete the reconfiguration.
> +     *
> +     * The contents of 'details' should be documented as valid for 'qos_name'
> +     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
> +     * (which is built as ovs-vswitchd.conf.db(8)).
> +     *
> +     * This function may be null if 'qos_conf' is not configurable.
> +     */
> +    int (*qos_set)(struct netdev *netdev, const struct smap *details);
> +
> +    /* Modify an array of rte_mbufs. The modification is specific to
> +     * each qos implementation.
> +     *
> +     * The function should take an array of mbufs and an int representing
> +     * the current number of mbufs present in the array.
> +     *
> +     * After the function has performed a qos modification to the array of
> +     * mbufs it returns an int representing the number of mbufs now present in
> +     * the array. This value can then be passed to the port send function
> +     * along with the modified array for transmission.
> +     *
> +     * For all QoS implementations it should always be non-null.
> +     */
> +    int (*qos_alg_process)(struct netdev *netdev, struct rte_mbuf **pkts,
> +                           int pkt_cnt);
> +};
> +
> +/* dpdk_qos_ops for each type of user space QoS implementation */
> +static const struct dpdk_qos_ops egress_policer_ops;
> +
> +/*
> + * Array of dpdk_qos_ops, contains pointer to all supported QoS
> + * operations.
> + */
> +static const struct dpdk_qos_ops *const qos_confs[] = {
> +    &egress_policer_ops,
> +    NULL
> +};
> +
> +/* Action that can be set for a packet for rte_meter */
> +enum egress_policer_action {
> +        GREEN = e_RTE_METER_GREEN,
> +        DROP = 1,
> +};
> +
>  /* Contains all 'struct dpdk_dev's. */
>  static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
>      = OVS_LIST_INITIALIZER(&dpdk_list);
> @@ -235,6 +339,11 @@ struct netdev_dpdk {
>  
>      /* In dpdk_list. */
>      struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
> +
> +    /* QoS configuration and lock for the device */
> +    struct qos_conf *qos_conf;
> +    rte_spinlock_t qos_lock;
> +
>  };
>  
>  struct netdev_rxq_dpdk {
> @@ -614,6 +723,10 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
>          goto unlock;
>      }
>  
> +    /* Initialise QoS configuration to NULL and qos lock to unlocked */
> +    netdev->qos_conf = NULL;
> +    rte_spinlock_init(&netdev->qos_lock);
> +
>      netdev_->n_txq = NR_QUEUE;
>      netdev_->n_rxq = NR_QUEUE;
>      netdev->real_n_txq = NR_QUEUE;
> @@ -1038,6 +1151,21 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet **packets,
>      return 0;
>  }
>  
> +static inline int
> +netdev_dpdk_qos_process__(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> +                        int cnt)
> +{
> +    struct netdev *netdev = &dev->up;
> +

People are working really hard to get the data path as fast
as possible and this seems to be too expensive:

> +    rte_spinlock_lock(&dev->qos_lock);
> +    if (dev->qos_conf != NULL) {
> +        cnt = dev->qos_conf->ops->qos_alg_process(netdev, pkts, cnt);
> +    }
> +    rte_spinlock_unlock(&dev->qos_lock);

Maybe we could do something like this:
 
    if (dev->qos_conf != NULL) {
        rte_spinlock_lock(&dev->qos_lock);
        if (dev->qos_conf != NULL) {
            cnt = dev->qos_conf->ops->qos_alg_process(netdev, pkts, cnt);
        }
        rte_spinlock_unlock(&dev->qos_lock);
    }

Though I still have to look at qos_alg_process() to see if the locking
could be done in another way.

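For illustration, the reader side of that double-checked pattern can be modelled in self-contained C11. This is a hypothetical sketch with made-up names ('conf', 'process'), not the OVS or DPDK API: the datapath does a single atomic load, and only the branch where a configuration is actually installed would touch the lock machinery.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of the fast path: 'confp' is the published QoS
 * configuration pointer.  One acquire load decides whether any QoS work
 * is needed; in the real code the spinlock would be taken, and the
 * pointer re-checked, only inside this branch. */
struct conf {
    int (*process)(int cnt);    /* stand-in for qos_alg_process() */
};

int
qos_process(_Atomic(struct conf *) *confp, int cnt)
{
    struct conf *conf = atomic_load_explicit(confp, memory_order_acquire);

    if (conf) {
        /* Real code: lock, re-check 'confp', then call the handler. */
        cnt = conf->process(cnt);
    }
    return cnt;
}
```

In the common case where no QoS is configured, this costs one load and a predictable branch instead of a lock acquisition per batch.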
> +
> +    return cnt;
> +}
> +
>  static inline void
>  netdev_dpdk_vhost_update_tx_counters(struct netdev_stats *stats,
>                                       struct dp_packet **packets,
> @@ -1072,6 +1200,9 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, struct dp_packet **pkts,
>          goto out;
>      }
>  
> +    /* Check if QoS has been configured for the netdev. */
> +    cnt = netdev_dpdk_qos_process__(vhost_dev, cur_pkts, cnt);
> +
>      /* There is vHost TX single queue, So we need to lock it for TX. */
>      rte_spinlock_lock(&vhost_dev->vhost_tx_lock);
>  
> @@ -1216,6 +1347,10 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet **pkts,
>      if (dev->type == DPDK_DEV_VHOST) {
>          __netdev_dpdk_vhost_send(netdev, (struct dp_packet **) mbufs, newcnt, true);
>      } else {
> +
> +        /* Check if QoS has been configured for this netdev. */
> +        newcnt = netdev_dpdk_qos_process__(dev, mbufs, newcnt);
> +
>          dpdk_queue_pkts(dev, qid, mbufs, newcnt);
>          dpdk_queue_flush(dev, qid);
>      }
> @@ -1275,9 +1410,15 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
>  
>              if (OVS_UNLIKELY(size > dev->max_packet_len)) {
>                  if (next_tx_idx != i) {
> +                    cnt = i - next_tx_idx;
> +
> +                    cnt = netdev_dpdk_qos_process__(dev,
> +                                        (struct rte_mbuf **)&pkts[next_tx_idx],
> +                                        cnt);


I have another concern at this point because for each batch it
will grab a lock pretty much serializing multiple queues.

> +
>                      dpdk_queue_pkts(dev, qid,
>                                      (struct rte_mbuf **)&pkts[next_tx_idx],
> -                                    i-next_tx_idx);
> +                                    cnt);
>                  }
>  
>                  VLOG_WARN_RL(&rl, "Too big size %d max_packet_len %d",
> @@ -1289,9 +1430,12 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
>              }
>          }
>          if (next_tx_idx != cnt) {
> -           dpdk_queue_pkts(dev, qid,
> -                            (struct rte_mbuf **)&pkts[next_tx_idx],
> -                            cnt-next_tx_idx);
> +            cnt -= next_tx_idx;
> +
> +            cnt = netdev_dpdk_qos_process__(dev,
> +                                            (struct rte_mbuf **)&pkts[next_tx_idx],
> +                                            cnt);

Same here.
And it seems it should update tx_dropped too.

Thanks,
fbl

> +
> +            dpdk_queue_pkts(dev, qid, (struct rte_mbuf **)&pkts[next_tx_idx],
> +                            cnt);
>          }
>  
>          if (OVS_UNLIKELY(dropped)) {
> @@ -2032,6 +2176,264 @@ unlock_dpdk:
>      return err;
>  }
>  
> +/* QoS Functions */
> +
> +/*
> + * Initialize QoS configuration operations.
> + */
> +static void
> +qos_conf_init(struct qos_conf *conf, const struct dpdk_qos_ops *ops)
> +{
> +    conf->ops = ops;
> +}
> +
> +/*
> + * Search the existing QoS operations in 'qos_confs' and compare each
> + * operation's 'qos_name' to 'name'. Return a pointer to the matching
> + * dpdk_qos_ops, or NULL if there is no match.
> + */
> +static const struct dpdk_qos_ops *
> +qos_lookup_name(const char *name)
> +{
> +    const struct dpdk_qos_ops *const *opsp;
> +
> +    for (opsp = qos_confs; *opsp != NULL; opsp++) {
> +        const struct dpdk_qos_ops *ops = *opsp;
> +        if (!strncmp(name, ops->qos_name, DPDK_MAX_QOS_NAME_SIZE)) {
> +            return ops;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +/*
> + * Call qos_destruct to clean up items associated with the netdev's
> + * qos_conf, then set the netdev's qos_conf to NULL.
> + */
> +static void
> +qos_delete_conf(struct netdev *netdev_)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +
> +    rte_spinlock_lock(&netdev->qos_lock);
> +    if (netdev->qos_conf) {
> +        if (netdev->qos_conf->ops->qos_destruct) {
> +            netdev->qos_conf->ops->qos_destruct(netdev_, netdev->qos_conf);
> +        }
> +        netdev->qos_conf = NULL;
> +    }
> +    rte_spinlock_unlock(&netdev->qos_lock);
> +}
> +
> +static int
> +netdev_dpdk_get_qos_types(const struct netdev *netdev OVS_UNUSED,
> +                           struct sset *types)
> +{
> +    const struct dpdk_qos_ops *const *opsp;
> +
> +    for (opsp = qos_confs; *opsp != NULL; opsp++) {
> +        const struct dpdk_qos_ops *ops = *opsp;
> +        if (ops->qos_construct && ops->qos_name[0] != '\0') {
> +            sset_add(types, ops->qos_name);
> +        }
> +    }
> +    return 0;
> +}
> +
> +static int
> +netdev_dpdk_get_qos(const struct netdev *netdev_,
> +                    const char **typep, struct smap *details)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    int error = 0;
> +
> +    ovs_mutex_lock(&netdev->mutex);
> +    if (netdev->qos_conf) {
> +        *typep = netdev->qos_conf->ops->qos_name;
> +        error = (netdev->qos_conf->ops->qos_get
> +                 ? netdev->qos_conf->ops->qos_get(netdev_, details): 0);
> +    }
> +    ovs_mutex_unlock(&netdev->mutex);
> +
> +    return error;
> +}
> +
> +static int
> +netdev_dpdk_set_qos(struct netdev *netdev_,
> +                    const char *type, const struct smap *details)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    const struct dpdk_qos_ops *new_ops = NULL;
> +    int error = 0;
> +
> +    /* If 'type' is empty, destroy any existing QoS configuration for the
> +     * dpdk-netdev and return success. */
> +    if (type[0] == '\0') {
> +        qos_delete_conf(netdev_);
> +        return 0;
> +    }
> +
> +    new_ops = qos_lookup_name(type);
> +    if (!new_ops || !new_ops->qos_construct) {
> +        return EOPNOTSUPP;
> +    }
> +
> +    ovs_mutex_lock(&netdev->mutex);
> +    if (netdev->qos_conf) {
> +        if (new_ops == netdev->qos_conf->ops) {
> +            error = new_ops->qos_set ? new_ops->qos_set(netdev_, details) : 0;
> +        } else {
> +            /* Delete existing QoS configuration. */
> +            qos_delete_conf(netdev_);
> +            ovs_assert(netdev->qos_conf == NULL);
> +
> +            /* Install new QoS configuration. */
> +            error = new_ops->qos_construct(netdev_, details);
> +            ovs_assert((error == 0) == (netdev->qos_conf != NULL));
> +        }
> +    } else {
> +        error = new_ops->qos_construct(netdev_, details);
> +        ovs_assert((error == 0) == (netdev->qos_conf != NULL));
> +    }
> +
> +    ovs_mutex_unlock(&netdev->mutex);
> +    return error;
> +}
> +
> +/* egress-policer details */
> +
> +struct egress_policer {
> +    struct qos_conf qos_conf;
> +    struct rte_meter_srtcm_params app_srtcm_params;
> +    struct rte_meter_srtcm test_meter;
> +};
> +
> +static struct egress_policer *
> +egress_policer_get__(const struct netdev *netdev_)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    return CONTAINER_OF(netdev->qos_conf, struct egress_policer, qos_conf);
> +}
> +
> +static int
> +egress_policer_qos_construct(struct netdev *netdev_,
> +                            const struct smap *details)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    struct egress_policer *policer;
> +    const char *cir_s;
> +    const char *cbs_s;
> +    int err = 0;
> +
> +    rte_spinlock_lock(&netdev->qos_lock);
> +    policer = xmalloc(sizeof *policer);
> +    qos_conf_init(&policer->qos_conf, &egress_policer_ops);
> +    cir_s = smap_get(details, "cir");
> +    cbs_s = smap_get(details, "cbs");
> +    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10) : 0;
> +    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10) : 0;
> +    policer->app_srtcm_params.ebs = 0;
> +    err = rte_meter_srtcm_config(&policer->test_meter,
> +                                 &policer->app_srtcm_params);
> +    if (!err) {
> +        /* Publish the configuration only on success, so that this function
> +         * returns 0 if and only if 'netdev->qos_conf' is set. */
> +        netdev->qos_conf = &policer->qos_conf;
> +    } else {
> +        free(policer);
> +    }
> +    rte_spinlock_unlock(&netdev->qos_lock);
> +
> +    return err;
> +}
> +
> +static void
> +egress_policer_qos_destruct(struct netdev *netdev_ OVS_UNUSED,
> +                        struct qos_conf *conf)
> +{
> +    struct egress_policer *policer = CONTAINER_OF(conf, struct egress_policer,
> +                                                qos_conf);
> +    free(policer);
> +}
> +
> +static int
> +egress_policer_qos_get(const struct netdev *netdev, struct smap *details)
> +{
> +    struct egress_policer *policer = egress_policer_get__(netdev);
> +    smap_add_format(details, "cir", "%llu",
> +                    1ULL * policer->app_srtcm_params.cir);
> +    smap_add_format(details, "cbs", "%llu",
> +                    1ULL * policer->app_srtcm_params.cbs);
> +    return 0;
> +}
> +
> +static int
> +egress_policer_qos_set(struct netdev *netdev_, const struct smap *details)
> +{
> +    struct egress_policer *policer;
> +    const char *cir_s;
> +    const char *cbs_s;
> +    int err = 0;
> +
> +    policer = egress_policer_get__(netdev_);
> +    cir_s = smap_get(details, "cir");
> +    cbs_s = smap_get(details, "cbs");
> +    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10) : 0;
> +    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10) : 0;
> +    policer->app_srtcm_params.ebs = 0;
> +    err = rte_meter_srtcm_config(&policer->test_meter,
> +                                    &policer->app_srtcm_params);
> +
> +    return err;
> +}
> +
> +static inline int
> +egress_policer_pkt_handle__(struct rte_meter_srtcm *meter,
> +                            struct rte_mbuf *pkt, uint64_t time)
> +{
> +    uint8_t output_color;
> +    uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct ether_hdr);
> +    enum egress_policer_action action = GREEN;
> +
> +    /* color input is not used for blind modes */
> +    output_color = (uint8_t) rte_meter_srtcm_color_blind_check(meter, time,
> +                                                                pkt_len);
> +
> +    /* Check output color, 0 corresponds to GREEN i.e. packet can be
> +     * transmitted. If greater than 0 then action is set to DROP. */
> +    if (output_color) {
> +        action = DROP;
> +    }
> +
> +    return action;
> +}
> +
> +static int
> +egress_policer_alg_process(struct netdev *netdev_, struct rte_mbuf **pkts,
> +                        int pkt_cnt)
> +{
> +    int i = 0;
> +    int cnt = 0;
> +    struct egress_policer *policer = egress_policer_get__(netdev_);
> +    struct rte_mbuf *pkt = NULL;
> +    uint64_t current_time = rte_rdtsc();
> +
> +    for (i = 0; i < pkt_cnt; i++) {
> +        pkt = pkts[i];
> +        /* Handle current packet */
> +        if (egress_policer_pkt_handle__(&policer->test_meter, pkt,
> +                                        current_time) == DROP) {
> +            rte_pktmbuf_free(pkt);
> +        } else {
> +            /* Compact the array in place so the first 'cnt' entries are the
> +             * surviving mbufs; otherwise freed mbufs could be handed to the
> +             * send path. */
> +            pkts[cnt++] = pkt;
> +        }
> +    }
> +
> +    return cnt;
> +}
> +
> +static const struct dpdk_qos_ops egress_policer_ops = {
> +    "egress-policer",    /* qos_name */
> +    egress_policer_qos_construct,
> +    egress_policer_qos_destruct,
> +    egress_policer_qos_get,
> +    egress_policer_qos_set,
> +    egress_policer_alg_process
> +};
> +
>  #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
>      GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
>  {                                                             \
> @@ -2069,10 +2471,10 @@ unlock_dpdk:
>      NULL,                       /* set_advertisements */      \
>                                                                \
>      NULL,                       /* set_policing */            \
> -    NULL,                       /* get_qos_types */           \
> +    netdev_dpdk_get_qos_types,                                \
>      NULL,                       /* get_qos_capabilities */    \
> -    NULL,                       /* get_qos */                 \
> -    NULL,                       /* set_qos */                 \
> +    netdev_dpdk_get_qos,                                      \
> +    netdev_dpdk_set_qos,                                      \
>      NULL,                       /* get_queue */               \
>      NULL,                       /* set_queue */               \
>      NULL,                       /* delete_queue */            \
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 4dbd9e3..5be8eb6 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3298,6 +3298,26 @@
>            for information on how this classifier works.
>          </dd>
>        </dl>
> +      <dl>
> +        <dt><code>egress-policer</code></dt>
> +        <dd>
> +          An egress policer algorithm. This implementation uses the DPDK
> +          rte_meter library. The rte_meter library provides an implementation
> +          of srTCM (RFC 2697), which allows the metering and policing of
> +          traffic. The implementation in OVS creates a color-blind srTCM
> +          meter with a single token bucket used to police traffic. Note that
> +          when rte_meter is configured as part of QoS there will be a
> +          performance overhead, as rte_meter itself consumes CPU cycles in
> +          order to police traffic. These CPU cycles are ordinarily used for
> +          packet processing. As such, the drop in performance will be
> +          noticed in terms of overall aggregate traffic throughput.
> +
> +          For more information regarding the usage of the DPDK rte_meter see
> +          <code>http://dpdk.org/doc/guides/sample_app_ug/qos_metering.html</code>.
> +          For more information regarding srTCM see
> +          <code>https://tools.ietf.org/html/rfc2697</code>.
> +        </dd>
> +      </dl>
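As a rough illustration of the behavior described above, a color-blind single-token-bucket policer can be modelled in a few lines of self-contained C. This is a simplified sketch with hypothetical names ('tb_policer', 'tb_admit'); the actual implementation is DPDK's rte_meter srTCM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified single-bucket, color-blind policer: 'cir' is the refill rate
 * in bytes per second, 'cbs' the bucket depth in bytes.  A packet is
 * admitted ("green") if enough tokens remain, otherwise it is dropped. */
struct tb_policer {
    uint64_t cir;       /* refill rate, bytes per second */
    uint64_t cbs;       /* bucket depth, bytes */
    uint64_t tokens;    /* currently available bytes */
    uint64_t last_ns;   /* time of last refill */
};

bool
tb_admit(struct tb_policer *p, uint64_t now_ns, uint32_t pkt_len)
{
    /* Refill for the elapsed time, capped at the bucket depth. */
    uint64_t add = (now_ns - p->last_ns) * p->cir / 1000000000ULL;

    p->last_ns = now_ns;
    p->tokens = p->tokens + add > p->cbs ? p->cbs : p->tokens + add;

    if (pkt_len <= p->tokens) {
        p->tokens -= pkt_len;   /* transmit: consume tokens */
        return true;
    }
    return false;               /* over rate: drop */
}
```

The real srTCM additionally tracks an excess bucket (EBS) for the yellow color; this patch configures ebs to 0, so only the green/drop cases remain.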
>      </column>
>  
>      <column name="queues">
> @@ -3334,6 +3354,38 @@
>        </column>
>      </group>
>  
> +    <group title="Configuration for egress-policer QoS">
> +      <p>
> +        <ref table="QoS"/> <ref table="QoS" column="type"/>
> +        <code>egress-policer</code> provides egress policing for userspace
> +        port types with DPDK.
> +
> +        It has the following key-value pairs defined.
> +      </p>
> +
> +      <column name="other_config" key="cir" type='{"type": "integer"}'>
> +        The Committed Information Rate (CIR) is measured in bytes of IP
> +        packets per second, i.e. it includes the IP header, but not link
> +        specific (e.g. Ethernet) headers. This represents the bytes per second
> +        rate at which the token bucket will be updated. The cir value is
> +        calculated by (pps x packet data size).  For example, to limit a
> +        stream consisting of 64 byte packets to 1 million packets per second,
> +        the CIR would be set to 46000000. This value can be broken into
> +        '1,000,000 x 46', where 1,000,000 is the policing rate in packets per
> +        second and 46 represents the size of the packet data for a 64 byte IP
> +        packet.
> +      </column>
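The cir arithmetic described above amounts to a single multiplication; as a hypothetical helper (not part of the patch):

```c
#include <stdint.h>

/* Hypothetical helper: compute the egress-policer "cir" value (bytes of
 * IP packet data per second) for a target rate of 'pps' packets per
 * second of 'ip_bytes' each.  For minimum-size 64-byte Ethernet frames,
 * ip_bytes is 46 (64 minus the 14-byte Ethernet header and 4-byte CRC). */
uint64_t
egress_policer_cir(uint64_t pps, uint64_t ip_bytes)
{
    return pps * ip_bytes;
}
```

So limiting 64-byte packets to 1 Mpps gives the cir=46000000 used in the INSTALL.DPDK.md example.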
> +      <column name="other_config" key="cbs" type='{"type": "integer"}'>
> +        The Committed Burst Size (CBS) is measured in bytes and represents a
> +        token bucket. At a minimum this value should be set to the expected
> +        largest size packet in the traffic stream. In practice larger values
> +        may be used to increase the size of the token bucket. If a packet can
> +        be transmitted then the cbs will be decremented by the number of
> +        bytes/tokens of the packet. If there are not enough tokens in the cbs
> +        bucket the packet will be dropped.
> +      </column>
> +    </group>
> +
>      <group title="Common Columns">
>        The overall purpose of these columns is described under <code>Common
>        Columns</code> at the beginning of this document.
> -- 
> 1.7.4.1
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
Stokes, Ian Nov. 30, 2015, 5:47 p.m. UTC | #5
> -----Original Message-----
> From: Flavio Leitner [mailto:fbl@sysclose.org]
> Sent: Thursday, November 26, 2015 8:58 PM
> To: Stokes, Ian
> Cc: dev@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v2] netdev_dpdk.c: Add QoS functionality.
> 
> 
> Funny, everyone is trying to get more performance out of vhost-user and
> you're adding qos... :-)
Performance versus functionality; it wouldn't be fun if it wasn't a challenge :)
> 
> I am still looking into it, so don't consider this a throughout review.
Thanks for taking the time to look at this, much appreciated. I've taken on board the suggestions you have below. I'm happy to wait for a thorough review when you get a chance, so that the rework will be more complete and substantial on my side.

Thanks
Ian
> 
> 
> On Fri, Oct 16, 2015 at 02:28:10PM +0100, Ian Stokes wrote:
> > This patch provides the modifications required in netdev-dpdk.c and
> > vswitch.xml to allow for a DPDK user space QoS algorithm.
> >
> > This patch adds a QoS configuration structure for netdev-dpdk and
> > expected QoS operations 'dpdk_qos_ops'. Various helper functions are
> > also supplied.
> >
> > Also included are the modifications required for vswitch.xml to allow
> > a new QoS implementation for netdev-dpdk devices. This includes a new
> > QoS type `egress-policer` as well as its expected QoS table entries.
> >
> > The QoS functionality implemented for DPDK devices is `egress-
> policer`.
> > This can be used to drop egress packets at a configurable rate.
> >
> > The INSTALL.DPDK.md guide has also been modified to provide an example
> > configuration of `egress-policer` QoS.
> >
> > Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> > ---
> >  INSTALL.DPDK.md      |   20 +++
> >  lib/netdev-dpdk.c    |  416
> +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  vswitchd/vswitch.xml |   52 +++++++
> >  3 files changed, 481 insertions(+), 7 deletions(-)
> >
> > diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md index 7bf110c..c64b8ce
> > 100644
> > --- a/INSTALL.DPDK.md
> > +++ b/INSTALL.DPDK.md
> > @@ -207,6 +207,26 @@ Using the DPDK with ovs-vswitchd:
> >     ./ovs-ofctl add-flow br0 in_port=2,action=output:1
> >     ```
> >
> > +8. QoS usage example
> > +
> > +   Assuming you have a vhost-user port transmitting traffic
> consisting of
> > +   packets of size 64 bytes, the following command would limit the
> egress
> > +   transmission rate of the port to ~1,000,000 packets per second:
> > +
> > +   `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create
> qos
> > +   type=egress-policer other-config:cir=46000000
> > + other-config:cbs=2048`
> > +
> > +   To examine the QoS configuration of the port:
> > +
> > +   `ovs-appctl -t ovs-vswitchd qos/show vhost-user0`
> > +
> > +   To clear the QoS configuration from the port and ovsdb use the
> following:
> > +
> > +   `ovs-vsctl -- destroy QoS vhost-user0 -- clear Port vhost-user0
> > + qos`
> > +
> > +   For more details regarding egress-policer parameters please refer
> to the
> > +   vswitch.xml.
> > +
> >  Performance Tuning:
> >  -------------------
> >
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
> > 93b0589..cd4c532 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -44,6 +44,7 @@
> >  #include "ovs-rcu.h"
> >  #include "packets.h"
> >  #include "shash.h"
> > +#include "smap.h"
> >  #include "sset.h"
> >  #include "unaligned.h"
> >  #include "timeval.h"
> > @@ -52,12 +53,14 @@
> >
> >  #include "rte_config.h"
> >  #include "rte_mbuf.h"
> > +#include "rte_meter.h"
> >  #include "rte_virtio_net.h"
> >
> >  VLOG_DEFINE_THIS_MODULE(dpdk);
> >  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> >
> >  #define DPDK_PORT_WATCHDOG_INTERVAL 5
> > +#define DPDK_MAX_QOS_NAME_SIZE 10
> >
> >  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE  #define OVS_VPORT_DPDK
> > "ovs_dpdk"
> > @@ -142,6 +145,107 @@ static int rte_eal_init_ret = ENODEV;
> >
> >  static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
> >
> > +/* Quality of Service */
> > +
> > +/* An instance of a QoS configuration.  Always associated with a
> > +particular
> > + * network device.
> > + *
> > + * Each QoS implementation subclasses this with whatever additional
> > +data it
> > + * needs.
> > + */
> > +struct qos_conf {
> > +    const struct dpdk_qos_ops *ops;
> > +};
> > +
> > +/* A particular implementation of dpdk QoS operations.
> > + *
> > + * The functions below return 0 if successful or a positive errno
> > +value on
> > + * failure, except where otherwise noted. All of them must be
> > +provided, except
> > + * where otherwise noted.
> > + */
> > +struct dpdk_qos_ops {
> > +
> > +    /* Name of the QoS type */
> > +    const char *qos_name;
> > +
> > +    /* Called to construct the QoS implementation on 'netdev'. The
> > +     * implementation should make the appropriate calls to configure
> QoS
> > +     * according to 'details'. The implementation may assume that any
> current
> > +     * QoS configuration already installed should be destroyed before
> > +     * constructing the new configuration.
> > +     *
> > +     * The contents of 'details' should be documented as valid for
> 'ovs_name'
> > +     * in the "other_config" column in the "QoS" table in
> vswitchd/vswitch.xml
> > +     * (which is built as ovs-vswitchd.conf.db(8)).
> > +     *
> > +     * This function must return 0 if and only if it sets 'netdev-
> >qos_conf'
> > +     * to an initialized 'struct qos_conf'.
> > +     *
> > +     * For all QoS implementations it should always be non-null.
> > +     */
> > +    int (*qos_construct)(struct netdev *netdev, const struct smap
> > + *details);
> > +
> > +    /* Destroys the data structures allocated by the implementation
> as part of
> > +     * 'qos_conf.
> > +     *
> > +     * For all QoS implementations it should always be non-null.
> > +     */
> > +    void (*qos_destruct)(struct netdev *netdev, struct qos_conf
> > + *conf);
> > +
> > +    /* Retrieves details of 'netdev->qos_conf' configuration into
> 'details'.
> > +     *
> > +     * The contents of 'details' should be documented as valid for
> 'ovs_name'
> > +     * in the "other_config" column in the "QoS" table in
> vswitchd/vswitch.xml
> > +     * (which is built as ovs-vswitchd.conf.db(8)).
> > +     */
> > +    int (*qos_get)(const struct netdev *netdev, struct smap
> > + *details);
> > +
> > +    /* Reconfigures 'netdev->qos_conf' according to 'details',
> performing any
> > +     * required calls to complete the reconfiguration.
> > +     *
> > +     * The contents of 'details' should be documented as valid for
> 'ovs_name'
> > +     * in the "other_config" column in the "QoS" table in
> vswitchd/vswitch.xml
> > +     * (which is built as ovs-vswitchd.conf.db(8)).
> > +     *
> > +     * This function may be null if 'qos_conf' is not configurable.
> > +     */
> > +    int (*qos_set)(struct netdev *netdev, const struct smap
> > + *details);
> > +
> > +    /* Modify an array of rte_mbufs. The modification is specific to
> > +     * each qos implementation.
> > +     *
> > +     * The function should take and array of mbufs and an int
> representing
> > +     * the current number of mbufs present in the array.
> > +     *
> > +     * After the function has performed a qos modification to the
> array of
> > +     * mbufs it returns an int representing the number of mbufs now
> present in
> > +     * the array. This value is can then be passed to the port send
> function
> > +     * along with the modified array for transmission.
> > +     *
> > +     * For all QoS implementations it should always be non-null.
> > +     */
> > +    int (*qos_alg_process)(struct netdev *netdev, struct rte_mbuf
> **pkts,
> > +                           int pkt_cnt); };
> > +
> > +/* dpdk_qos_ops for each type of user space QoS implementation */
> > +static const struct dpdk_qos_ops egress_policer_ops;
> > +
> > +/*
> > + * Array of dpdk_qos_ops, contains pointer to all supported QoS
> > + * operations.
> > + */
> > +static const struct dpdk_qos_ops *const qos_confs[] = {
> > +    &egress_policer_ops,
> > +    NULL
> > +};
> > +
> > +/* Action that can be set for a packet for rte_meter */ enum
> > +egress_policer_action {
> > +        GREEN = e_RTE_METER_GREEN,
> > +        DROP = 1,
> > +};
> > +
> >  /* Contains all 'struct dpdk_dev's. */  static struct ovs_list
> > dpdk_list OVS_GUARDED_BY(dpdk_mutex)
> >      = OVS_LIST_INITIALIZER(&dpdk_list); @@ -235,6 +339,11 @@ struct
> > netdev_dpdk {
> >
> >      /* In dpdk_list. */
> >      struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
> > +
> > +    /* QoS configuration and lock for the device */
> > +    struct qos_conf *qos_conf;
> > +    rte_spinlock_t qos_lock;
> > +
> >  };
> >
> >  struct netdev_rxq_dpdk {
> > @@ -614,6 +723,10 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
> int port_no,
> >          goto unlock;
> >      }
> >
> > +    /* Initialise QoS configuration to NULL and qos lock to unlocked
> */
> > +    netdev->qos_conf = NULL;
> > +    rte_spinlock_init(&netdev->qos_lock);
> > +
> >      netdev_->n_txq = NR_QUEUE;
> >      netdev_->n_rxq = NR_QUEUE;
> >      netdev->real_n_txq = NR_QUEUE;
> > @@ -1038,6 +1151,21 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_,
> struct dp_packet **packets,
> >      return 0;
> >  }
> >
> > +static inline int
> > +netdev_dpdk_qos_process__(struct netdev_dpdk *dev, struct rte_mbuf
> **pkts,
> > +                        int cnt)
> > +{
> > +    struct netdev *netdev = &dev->up;
> > +
> 
> People are working really hard to get the data path as fast as possible
> and this seems to be too expensive:
> 
> > +    rte_spinlock_lock(&dev->qos_lock);
> > +    if (dev->qos_conf != NULL) {
> > +        cnt = dev->qos_conf->ops->qos_alg_process(netdev, pkts, cnt);
> > +    }
> > +    rte_spinlock_unlock(&dev->qos_lock);
> 
> Maybe we could do something like this:
> 
>     if (dev->qos_conf != NULL) {
>         rte_spinlock_lock(&dev->qos_lock);
>         if (dev->qos_conf != NULL) {
>             cnt = dev->qos_conf->ops->qos_alg_process(netdev, pkts,
> cnt);
>         }
>         rte_spinlock_unlock(&dev->qos_lock);
>     }
> 
> Though I still have to look at qos_alg_process() to see if the locking
> could be done in another way.
> 
> > +
> > +    return cnt;
> > +}
> > +
> >  static inline void
> >  netdev_dpdk_vhost_update_tx_counters(struct netdev_stats *stats,
> >                                       struct dp_packet **packets, @@
> > -1072,6 +1200,9 @@ __netdev_dpdk_vhost_send(struct netdev *netdev,
> struct dp_packet **pkts,
> >          goto out;
> >      }
> >
> > +    /* Check has QoS has been configured for the netdev */
> > +    cnt = netdev_dpdk_qos_process__(vhost_dev, cur_pkts, cnt);
> > +
> >      /* There is vHost TX single queue, So we need to lock it for TX.
> */
> >      rte_spinlock_lock(&vhost_dev->vhost_tx_lock);
> >
> > @@ -1216,6 +1347,10 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
> struct dp_packet **pkts,
> >      if (dev->type == DPDK_DEV_VHOST) {
> >          __netdev_dpdk_vhost_send(netdev, (struct dp_packet **) mbufs,
> newcnt, true);
> >      } else {
> > +
> > +        /* Check if QoS has been configured for this netdev. */
> > +        newcnt = netdev_dpdk_qos_process__(dev, mbufs, newcnt);
> > +
> >          dpdk_queue_pkts(dev, qid, mbufs, newcnt);
> >          dpdk_queue_flush(dev, qid);
> >      }
> > @@ -1275,9 +1410,15 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int
> > qid,
> >
> >              if (OVS_UNLIKELY(size > dev->max_packet_len)) {
> >                  if (next_tx_idx != i) {
> > +                    cnt = i - next_tx_idx;
> > +
> > +                    cnt = netdev_dpdk_qos_process__(dev,
> > +                                                    (struct
> rte_mbuf**)pkts,
> > +                                                    cnt);
> 
> 
> I have another concern at this point because for each batch it will grab
> a lock pretty much serializing multiple queues.
> 
> > +
> >                      dpdk_queue_pkts(dev, qid,
> >                                      (struct rte_mbuf
> **)&pkts[next_tx_idx],
> > -                                    i-next_tx_idx);
> > +                                    cnt);
> >                  }
> >
> >                  VLOG_WARN_RL(&rl, "Too big size %d max_packet_len
> > %d", @@ -1289,9 +1430,12 @@ netdev_dpdk_send__(struct netdev_dpdk
> *dev, int qid,
> >              }
> >          }
> >          if (next_tx_idx != cnt) {
> > -           dpdk_queue_pkts(dev, qid,
> > -                            (struct rte_mbuf **)&pkts[next_tx_idx],
> > -                            cnt-next_tx_idx);
> > +            cnt -= next_tx_idx;
> > +
> > +            cnt = netdev_dpdk_qos_process__(dev, (struct
> > + rte_mbuf**)pkts, cnt);
> 
> Same here.
> and it seems it should update tx_dropped too.
> 
> Thanks,
> fbl
> 
> > +
> > +            dpdk_queue_pkts(dev, qid, (struct rte_mbuf
> **)&pkts[next_tx_idx],
> > +                            cnt);
> >          }
> >
> >          if (OVS_UNLIKELY(dropped)) {
> > @@ -2032,6 +2176,264 @@ unlock_dpdk:
> >      return err;
> >  }
> >
> > +/* QoS Functions */
> > +
> > +/*
> > + * Initialize QoS configuration operations.
> > + */
> > +static void
> > +qos_conf_init(struct qos_conf *conf, const struct dpdk_qos_ops *ops)
> > +{
> > +    conf->ops = ops;
> > +}
> > +
> > +/*
> > + * Search existing QoS operations in qos_ops and compare each set of
> > + * operations qos_name to name. Return a dpdk_qos_ops pointer to a
> > +match,
> > + * else return NULL
> > + */
> > +static const struct dpdk_qos_ops *
> > +qos_lookup_name(const char *name)
> > +{
> > +    const struct dpdk_qos_ops *const *opsp;
> > +
> > +    for (opsp = qos_confs; *opsp != NULL; opsp++) {
> > +        const struct dpdk_qos_ops *ops = *opsp;
> > +        if (!strncmp(name, ops->qos_name,DPDK_MAX_QOS_NAME_SIZE)) {
> > +            return ops;
> > +        }
> > +    }
> > +    return NULL;
> > +}
> > +
> > +/*
> > + * Call qos_destruct to clean up items associated with the netdevs
> > + * qos_conf. Set netdevs qos_conf to NULL.
> > + */
> > +static void
> > +qos_delete_conf(struct netdev *netdev_) {
> > +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> > +
> > +    rte_spinlock_lock(&netdev->qos_lock);
> > +    if (netdev->qos_conf) {
> > +        if (netdev->qos_conf->ops->qos_destruct) {
> > +            netdev->qos_conf->ops->qos_destruct(netdev_, netdev-
> >qos_conf);
> > +        }
> > +        netdev->qos_conf = NULL;
> > +    }
> > +    rte_spinlock_unlock(&netdev->qos_lock);
> > +}
> > +
> > +static int
> > +netdev_dpdk_get_qos_types(const struct netdev *netdev OVS_UNUSED,
> > +                           struct sset *types) {
> > +    const struct dpdk_qos_ops *const *opsp;
> > +
> > +    for (opsp = qos_confs; *opsp != NULL; opsp++) {
> > +        const struct dpdk_qos_ops *ops = *opsp;
> > +        if (ops->qos_construct && ops->qos_name[0] != '\0') {
> > +            sset_add(types, ops->qos_name);
> > +        }
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int
> > +netdev_dpdk_get_qos(const struct netdev *netdev_,
> > +                    const char **typep, struct smap *details) {
> > +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> > +    int error = 0;
> > +
> > +    ovs_mutex_lock(&netdev->mutex);
> > +    if(netdev->qos_conf) {
> > +        *typep = netdev->qos_conf->ops->qos_name;
> > +        error = (netdev->qos_conf->ops->qos_get
> > +                 ? netdev->qos_conf->ops->qos_get(netdev_, details):
> 0);
> > +    }
> > +    ovs_mutex_unlock(&netdev->mutex);
> > +
> > +    return error;
> > +}
> > +
> > +static int
> > +netdev_dpdk_set_qos(struct netdev *netdev_,
> > +                    const char *type, const struct smap *details) {
> > +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> > +    const struct dpdk_qos_ops *new_ops = NULL;
> > +    int error = 0;
> > +
> > +    /* If type is empty the current QoS configuration for the dpdk-
> netdev can
> > +     * be destroyed */
> > +    if(type[0] == '\0') {
> > +        qos_delete_conf(netdev_);
> > +    }
> > +
> > +    new_ops = qos_lookup_name(type);
> > +    if (!new_ops || !new_ops->qos_construct) {
> > +        return EOPNOTSUPP;
> > +    }
> > +
> > +    ovs_mutex_lock(&netdev->mutex);
> > +    if (netdev->qos_conf) {
> > +        if (new_ops == netdev->qos_conf->ops) {
> > +            error = new_ops->qos_set ? new_ops->qos_set(netdev_,
> details) : 0;
> > +        }
> > +        else {
> > +            /* Delete existing QoS configuration. */
> > +            qos_delete_conf(netdev_);
> > +            ovs_assert(netdev->qos_conf == NULL);
> > +
> > +            /* Install new QoS configuration. */
> > +            error = new_ops->qos_construct(netdev_, details);
> > +            ovs_assert((error == 0) == (netdev->qos_conf != NULL));
> > +        }
> > +    } else {
> > +        error = new_ops->qos_construct(netdev_, details);
> > +        ovs_assert((error == 0) == (netdev->qos_conf != NULL));
> > +    }
> > +
> > +    ovs_mutex_unlock(&netdev->mutex);
> > +    return error;
> > +}
> > +
> > +/* egress-policer details */
> > +
> > +struct egress_policer {
> > +    struct qos_conf qos_conf;
> > +    struct rte_meter_srtcm_params app_srtcm_params;
> > +    struct rte_meter_srtcm test_meter; };
> > +
> > +static struct egress_policer *
> > +egress_policer_get__(const struct netdev *netdev_) {
> > +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> > +    return CONTAINER_OF(netdev->qos_conf, struct egress_policer,
> > +qos_conf); }
> > +
> > +static int
> > +egress_policer_qos_construct(struct netdev *netdev_,
> > +                            const struct smap *details) {
> > +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> > +    struct egress_policer *policer;
> > +    const char *cir_s;
> > +    const char *cbs_s;
> > +    int err = 0;
> > +
> > +    rte_spinlock_lock(&netdev->qos_lock);
> > +    policer = xmalloc(sizeof *policer);
> > +    qos_conf_init(&policer->qos_conf, &egress_policer_ops);
> > +    netdev->qos_conf = &policer->qos_conf;
> > +    cir_s = smap_get(details, "cir");
> > +    cbs_s = smap_get(details, "cbs");
> > +    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10)
> : 0;
> > +    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10)
> : 0;
> > +    policer->app_srtcm_params.ebs = 0;
> > +    err = rte_meter_srtcm_config(&policer->test_meter,
> > +                                    &policer->app_srtcm_params);
> > +    rte_spinlock_unlock(&netdev->qos_lock);
> > +
> > +    return err;
> > +}
> > +
> > +static void
> > +egress_policer_qos_destruct(struct netdev *netdev_ OVS_UNUSED,
> > +                        struct qos_conf *conf) {
> > +    struct egress_policer *policer = CONTAINER_OF(conf, struct
> egress_policer,
> > +                                                qos_conf);
> > +    free(policer);
> > +}
> > +
> > +static int
> > +egress_policer_qos_get(const struct netdev *netdev, struct smap
> > +*details) {
> > +    struct egress_policer *policer = egress_policer_get__(netdev);
> > +    smap_add_format(details, "cir", "%llu",
> > +                    1ULL * policer->app_srtcm_params.cir);
> > +    smap_add_format(details, "cbs", "%llu",
> > +                    1ULL * policer->app_srtcm_params.cbs);
> > +    return 0;
> > +}
> > +
> > +static int
> > +egress_policer_qos_set(struct netdev *netdev_, const struct smap
> > +*details) {
> > +    struct egress_policer *policer;
> > +    const char *cir_s;
> > +    const char *cbs_s;
> > +    int err = 0;
> > +
> > +    policer = egress_policer_get__(netdev_);
> > +    cir_s = smap_get(details, "cir");
> > +    cbs_s = smap_get(details, "cbs");
> > +    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10)
> : 0;
> > +    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10)
> : 0;
> > +    policer->app_srtcm_params.ebs = 0;
> > +    err = rte_meter_srtcm_config(&policer->test_meter,
> > +                                    &policer->app_srtcm_params);
> > +
> > +    return err;
> > +}
> > +
> > +static inline int
> > +egress_policer_pkt_handle__(struct rte_meter_srtcm *meter,
> > +                            struct rte_mbuf *pkt, uint64_t time) {
> > +    uint8_t output_color;
> > +    uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct
> ether_hdr);
> > +    enum egress_policer_action action = GREEN;
> > +
> > +    /* color input is not used for blind modes */
> > +    output_color = (uint8_t) rte_meter_srtcm_color_blind_check(meter, time,
> > +                                                                pkt_len);
> > +
> > +    /* Check output color, 0 corresponds to GREEN i.e. packet can be
> > +     * transmitted. If greater than 0 then action is set to DROP. */
> > +    if (output_color) {
> > +        action = DROP;
> > +    }
> > +
> > +    return action;
> > +}
> > +
> > +static int
> > +egress_policer_alg_process(struct netdev *netdev_, struct rte_mbuf
> **pkts,
> > +                        int pkt_cnt)
> > +{
> > +    int i = 0;
> > +    int cnt = pkt_cnt;
> > +    struct egress_policer *policer = egress_policer_get__(netdev_);
> > +    struct rte_mbuf *pkt = NULL;
> > +    uint64_t current_time = rte_rdtsc();
> > +
> > +    for(i = 0; i < pkt_cnt; i++) {
> > +        pkt = pkts[i];
> > +        /* Handle current packet */
> > +        if (egress_policer_pkt_handle__(&policer->test_meter, pkt,
> > +                                        current_time) == DROP) {
> > +            rte_pktmbuf_free(pkt);
> > +            cnt = cnt - 1;
> > +        }
> > +    }
> > +
> > +    return cnt;
> > +}
> > +
> > +static const struct dpdk_qos_ops egress_policer_ops = {
> > +    "egress-policer",    /* qos_name */
> > +    egress_policer_qos_construct,
> > +    egress_policer_qos_destruct,
> > +    egress_policer_qos_get,
> > +    egress_policer_qos_set,
> > +    egress_policer_alg_process
> > +};
> > +
> >  #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ,
> SEND, \
> >      GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)
> \
> >  {                                                             \
> > @@ -2069,10 +2471,10 @@ unlock_dpdk:
> >      NULL,                       /* set_advertisements */      \
> >                                                                \
> >      NULL,                       /* set_policing */            \
> > -    NULL,                       /* get_qos_types */           \
> > +    netdev_dpdk_get_qos_types,                                \
> >      NULL,                       /* get_qos_capabilities */    \
> > -    NULL,                       /* get_qos */                 \
> > -    NULL,                       /* set_qos */                 \
> > +    netdev_dpdk_get_qos,                                      \
> > +    netdev_dpdk_set_qos,                                      \
> >      NULL,                       /* get_queue */               \
> >      NULL,                       /* set_queue */               \
> >      NULL,                       /* delete_queue */            \
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
> > 4dbd9e3..5be8eb6 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -3298,6 +3298,26 @@
> >            for information on how this classifier works.
> >          </dd>
> >        </dl>
> > +      <dl>
> > +        <dt><code>egress-policer</code></dt>
> > +        <dd>
> > +          An egress policer algorithm. This implementation uses the
> DPDK
> > +          rte_meter library. The rte_meter library provides an
> implementation
> > +          of srTCM (RFC2697) which allows the metering and policing
> of traffic.
> > +          The implementation in OVS creates a color blind srTCM meter
> with a
> > +          single token bucket used to police traffic. It should be
> noted that
> > +          when the rte_meter is configured as part of QoS there will
> be a
> > +          performance overhead as the rte_meter itself will consume
> CPU cycles
> > +          in order to police traffic. These CPU cycles ordinarily are
> used for
> > +          packet processing. As such the drop in performance will be
> noticed
> > +          in terms of overall aggregate traffic throughput.
> > +
> > +          For more information regarding the usage of the DPDK
> rte_meter see
> > +
> <code>http://dpdk.org/doc/guides/sample_app_ug/qos_metering.html</code>
> > +          For more information regarding srTCM see
> > +          <code>https://tools.ietf.org/html/rfc2697</code>
> > +        </dd>
> > +      </dl>
> >      </column>
> >
> >      <column name="queues">
> > @@ -3334,6 +3354,38 @@
> >        </column>
> >      </group>
> >
> > +    <group title="Configuration for egress-policer QoS">
> > +      <p>
> > +        <ref table="QoS"/> <ref table="QoS" column="type"/>
> > +        <code>egress-policer</code> provides egress policing for
> userspace
> > +        port types with DPDK.
> > +
> > +        It has the following key-value pairs defined.
> > +      </p>
> > +
> > +      <column name="other_config" key="cir" type='{"type":
> "integer"}'>
> > +        The Committed Information Rate (CIR) is measured in bytes of
> IP
> > +        packets per second, i.e. it includes the IP header, but not
> link
> > +        specific (e.g. Ethernet) headers. This represents the bytes
> per second
> > +        rate at which the token bucket will be updated. The cir value
> is
> > +        calculated by (pps x packet data size).  For example assuming
> a user
> > +        wishes to limit a stream consisting of 64 byte packets to 1
> million
> > +        packets per second the CIR would be set to 46000000.
> This value
> > +        can be broken into '1,000,000 x 46'. Where 1,000,000 is the
> policing
> > +        rate for the number of packets per second and 46 represents
> the size
> > +        of the packet data for a 64 byte IP packet.
> > +      </column>
> > +      <column name="other_config" key="cbs" type='{"type":
> "integer"}'>
> > +        The Committed Burst Size (CBS) is measured in bytes and
> represents a
> > +        token bucket. At a minimum this value should be set to the
> expected
> > +        largest size packet in the traffic stream. In practice larger
> values
> > +        may be used to increase the size of the token bucket. If a
> packet can
> > +        be transmitted then the cbs will be decremented by the number
> of
> > +        bytes/tokens of the packet. If there are not enough tokens in
> the cbs
> > +        bucket the packet will be dropped.
> > +      </column>
> > +    </group>
> > +
> >      <group title="Common Columns">
> >        The overall purpose of these columns is described under
> <code>Common
> >        Columns</code> at the beginning of this document.
> > --
> > 1.7.4.1
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > http://openvswitch.org/mailman/listinfo/dev

Patch

diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
index 7bf110c..c64b8ce 100644
--- a/INSTALL.DPDK.md
+++ b/INSTALL.DPDK.md
@@ -207,6 +207,26 @@  Using the DPDK with ovs-vswitchd:
    ./ovs-ofctl add-flow br0 in_port=2,action=output:1
    ```
 
+8. QoS usage example
+
+   Assuming you have a vhost-user port transmitting traffic consisting of
+   packets of size 64 bytes, the following command would limit the egress
+   transmission rate of the port to ~1,000,000 packets per second:
+
+   `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create qos
+   type=egress-policer other-config:cir=46000000 other-config:cbs=2048`
+
+   To examine the QoS configuration of the port:
+
+   `ovs-appctl -t ovs-vswitchd qos/show vhost-user0`
+
+   To clear the QoS configuration from the port and ovsdb use the following:
+
+   `ovs-vsctl -- destroy QoS vhost-user0 -- clear Port vhost-user0 qos`
+
+   For more details regarding egress-policer parameters please refer to the
+   vswitch.xml.
+
 Performance Tuning:
 -------------------
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 93b0589..cd4c532 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -44,6 +44,7 @@ 
 #include "ovs-rcu.h"
 #include "packets.h"
 #include "shash.h"
+#include "smap.h"
 #include "sset.h"
 #include "unaligned.h"
 #include "timeval.h"
@@ -52,12 +53,14 @@ 
 
 #include "rte_config.h"
 #include "rte_mbuf.h"
+#include "rte_meter.h"
 #include "rte_virtio_net.h"
 
 VLOG_DEFINE_THIS_MODULE(dpdk);
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 
 #define DPDK_PORT_WATCHDOG_INTERVAL 5
+#define DPDK_MAX_QOS_NAME_SIZE 10
 
 #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
 #define OVS_VPORT_DPDK "ovs_dpdk"
@@ -142,6 +145,107 @@  static int rte_eal_init_ret = ENODEV;
 
 static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
 
+/* Quality of Service */
+
+/* An instance of a QoS configuration.  Always associated with a particular
+ * network device.
+ *
+ * Each QoS implementation subclasses this with whatever additional data it
+ * needs.
+ */
+struct qos_conf {
+    const struct dpdk_qos_ops *ops;
+};
+
+/* A particular implementation of dpdk QoS operations.
+ *
+ * The functions below return 0 if successful or a positive errno value on
+ * failure, except where otherwise noted. All of them must be provided, except
+ * where otherwise noted.
+ */
+struct dpdk_qos_ops {
+
+    /* Name of the QoS type */
+    const char *qos_name;
+
+    /* Called to construct the QoS implementation on 'netdev'. The
+     * implementation should make the appropriate calls to configure QoS
+     * according to 'details'. The implementation may assume that any current
+     * QoS configuration already installed should be destroyed before
+     * constructing the new configuration.
+     *
+     * The contents of 'details' should be documented as valid for 'ovs_name'
+     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
+     * (which is built as ovs-vswitchd.conf.db(8)).
+     *
+     * This function must return 0 if and only if it sets 'netdev->qos_conf'
+     * to an initialized 'struct qos_conf'.
+     *
+     * For all QoS implementations it should always be non-null.
+     */
+    int (*qos_construct)(struct netdev *netdev, const struct smap *details);
+
+    /* Destroys the data structures allocated by the implementation as part of
+     * 'qos_conf'.
+     *
+     * For all QoS implementations it should always be non-null.
+     */
+    void (*qos_destruct)(struct netdev *netdev, struct qos_conf *conf);
+
+    /* Retrieves details of 'netdev->qos_conf' configuration into 'details'.
+     *
+     * The contents of 'details' should be documented as valid for 'ovs_name'
+     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
+     * (which is built as ovs-vswitchd.conf.db(8)).
+     */
+    int (*qos_get)(const struct netdev *netdev, struct smap *details);
+
+    /* Reconfigures 'netdev->qos_conf' according to 'details', performing any
+     * required calls to complete the reconfiguration.
+     *
+     * The contents of 'details' should be documented as valid for 'ovs_name'
+     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
+     * (which is built as ovs-vswitchd.conf.db(8)).
+     *
+     * This function may be null if 'qos_conf' is not configurable.
+     */
+    int (*qos_set)(struct netdev *netdev, const struct smap *details);
+
+    /* Modify an array of rte_mbufs. The modification is specific to
+     * each qos implementation.
+     *
+     * The function should take an array of mbufs and an int representing
+     * the current number of mbufs present in the array.
+     *
+     * After the function has performed a qos modification to the array of
+     * mbufs it returns an int representing the number of mbufs now present in
+     * the array. This value can then be passed to the port send function
+     * along with the modified array for transmission.
+     *
+     * For all QoS implementations it should always be non-null.
+     */
+    int (*qos_alg_process)(struct netdev *netdev, struct rte_mbuf **pkts,
+                           int pkt_cnt);
+};
+
+/* dpdk_qos_ops for each type of user space QoS implementation */
+static const struct dpdk_qos_ops egress_policer_ops;
+
+/*
+ * Array of dpdk_qos_ops, contains pointer to all supported QoS
+ * operations.
+ */
+static const struct dpdk_qos_ops *const qos_confs[] = {
+    &egress_policer_ops,
+    NULL
+};
+
+/* Action that can be set for a packet for rte_meter */
+enum egress_policer_action {
+        GREEN = e_RTE_METER_GREEN,
+        DROP = 1,
+};
+
 /* Contains all 'struct dpdk_dev's. */
 static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
     = OVS_LIST_INITIALIZER(&dpdk_list);
@@ -235,6 +339,11 @@  struct netdev_dpdk {
 
     /* In dpdk_list. */
     struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
+
+    /* QoS configuration and lock for the device */
+    struct qos_conf *qos_conf;
+    rte_spinlock_t qos_lock;
+
 };
 
 struct netdev_rxq_dpdk {
@@ -614,6 +723,10 @@  netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
         goto unlock;
     }
 
+    /* Initialise QoS configuration to NULL and qos lock to unlocked */
+    netdev->qos_conf = NULL;
+    rte_spinlock_init(&netdev->qos_lock);
+
     netdev_->n_txq = NR_QUEUE;
     netdev_->n_rxq = NR_QUEUE;
     netdev->real_n_txq = NR_QUEUE;
@@ -1038,6 +1151,21 @@  netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet **packets,
     return 0;
 }
 
+static inline int
+netdev_dpdk_qos_process__(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
+                        int cnt)
+{
+    struct netdev *netdev = &dev->up;
+
+    rte_spinlock_lock(&dev->qos_lock);
+    if (dev->qos_conf != NULL) {
+        cnt = dev->qos_conf->ops->qos_alg_process(netdev, pkts, cnt);
+    }
+    rte_spinlock_unlock(&dev->qos_lock);
+
+    return cnt;
+}
+
 static inline void
 netdev_dpdk_vhost_update_tx_counters(struct netdev_stats *stats,
                                      struct dp_packet **packets,
@@ -1072,6 +1200,9 @@  __netdev_dpdk_vhost_send(struct netdev *netdev, struct dp_packet **pkts,
         goto out;
     }
 
+    /* Check if QoS has been configured for the netdev. */
+    cnt = netdev_dpdk_qos_process__(vhost_dev, cur_pkts, cnt);
+
     /* There is vHost TX single queue, So we need to lock it for TX. */
     rte_spinlock_lock(&vhost_dev->vhost_tx_lock);
 
@@ -1216,6 +1347,10 @@  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet **pkts,
     if (dev->type == DPDK_DEV_VHOST) {
         __netdev_dpdk_vhost_send(netdev, (struct dp_packet **) mbufs, newcnt, true);
     } else {
+
+        /* Check if QoS has been configured for this netdev. */
+        newcnt = netdev_dpdk_qos_process__(dev, mbufs, newcnt);
+
         dpdk_queue_pkts(dev, qid, mbufs, newcnt);
         dpdk_queue_flush(dev, qid);
     }
@@ -1275,9 +1410,15 @@  netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
 
             if (OVS_UNLIKELY(size > dev->max_packet_len)) {
                 if (next_tx_idx != i) {
+                    cnt = i - next_tx_idx;
+
+                    cnt = netdev_dpdk_qos_process__(dev,
+                                                    (struct rte_mbuf**)pkts,
+                                                    cnt);
+
                     dpdk_queue_pkts(dev, qid,
                                     (struct rte_mbuf **)&pkts[next_tx_idx],
-                                    i-next_tx_idx);
+                                    cnt);
                 }
 
                 VLOG_WARN_RL(&rl, "Too big size %d max_packet_len %d",
@@ -1289,9 +1430,12 @@  netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
             }
         }
         if (next_tx_idx != cnt) {
-           dpdk_queue_pkts(dev, qid,
-                            (struct rte_mbuf **)&pkts[next_tx_idx],
-                            cnt-next_tx_idx);
+            cnt -= next_tx_idx;
+
+            cnt = netdev_dpdk_qos_process__(dev, (struct rte_mbuf**)pkts, cnt);
+
+            dpdk_queue_pkts(dev, qid, (struct rte_mbuf **)&pkts[next_tx_idx],
+                            cnt);
         }
 
         if (OVS_UNLIKELY(dropped)) {
@@ -2032,6 +2176,264 @@  unlock_dpdk:
     return err;
 }
 
+/* QoS Functions */
+
+/*
+ * Initialize QoS configuration operations.
+ */
+static void
+qos_conf_init(struct qos_conf *conf, const struct dpdk_qos_ops *ops)
+{
+    conf->ops = ops;
+}
+
+/*
+ * Search existing QoS operations in qos_ops and compare each set of
+ * operations qos_name to name. Return a dpdk_qos_ops pointer to a match,
+ * else return NULL
+ */
+static const struct dpdk_qos_ops *
+qos_lookup_name(const char *name)
+{
+    const struct dpdk_qos_ops *const *opsp;
+
+    for (opsp = qos_confs; *opsp != NULL; opsp++) {
+        const struct dpdk_qos_ops *ops = *opsp;
+        if (!strncmp(name, ops->qos_name, DPDK_MAX_QOS_NAME_SIZE)) {
+            return ops;
+        }
+    }
+    return NULL;
+}
+
+/*
+ * Call qos_destruct to clean up items associated with the netdevs
+ * qos_conf. Set netdevs qos_conf to NULL.
+ */
+static void
+qos_delete_conf(struct netdev *netdev_)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+
+    rte_spinlock_lock(&netdev->qos_lock);
+    if (netdev->qos_conf) {
+        if (netdev->qos_conf->ops->qos_destruct) {
+            netdev->qos_conf->ops->qos_destruct(netdev_, netdev->qos_conf);
+        }
+        netdev->qos_conf = NULL;
+    }
+    rte_spinlock_unlock(&netdev->qos_lock);
+}
+
+static int
+netdev_dpdk_get_qos_types(const struct netdev *netdev OVS_UNUSED,
+                           struct sset *types)
+{
+    const struct dpdk_qos_ops *const *opsp;
+
+    for (opsp = qos_confs; *opsp != NULL; opsp++) {
+        const struct dpdk_qos_ops *ops = *opsp;
+        if (ops->qos_construct && ops->qos_name[0] != '\0') {
+            sset_add(types, ops->qos_name);
+        }
+    }
+    return 0;
+}
+
+static int
+netdev_dpdk_get_qos(const struct netdev *netdev_,
+                    const char **typep, struct smap *details)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    int error = 0;
+
+    ovs_mutex_lock(&netdev->mutex);
+    if (netdev->qos_conf) {
+        *typep = netdev->qos_conf->ops->qos_name;
+        error = (netdev->qos_conf->ops->qos_get
+                 ? netdev->qos_conf->ops->qos_get(netdev_, details): 0);
+    }
+    ovs_mutex_unlock(&netdev->mutex);
+
+    return error;
+}
+
+static int
+netdev_dpdk_set_qos(struct netdev *netdev_,
+                    const char *type, const struct smap *details)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    const struct dpdk_qos_ops *new_ops = NULL;
+    int error = 0;
+
+    /* If type is empty the current QoS configuration for the dpdk-netdev can
+     * be destroyed */
+    if (type[0] == '\0') {
+        qos_delete_conf(netdev_);
+    }
+
+    new_ops = qos_lookup_name(type);
+    if (!new_ops || !new_ops->qos_construct) {
+        return EOPNOTSUPP;
+    }
+
+    ovs_mutex_lock(&netdev->mutex);
+    if (netdev->qos_conf) {
+        if (new_ops == netdev->qos_conf->ops) {
+            error = new_ops->qos_set ? new_ops->qos_set(netdev_, details) : 0;
+        }
+        else {
+            /* Delete existing QoS configuration. */
+            qos_delete_conf(netdev_);
+            ovs_assert(netdev->qos_conf == NULL);
+
+            /* Install new QoS configuration. */
+            error = new_ops->qos_construct(netdev_, details);
+            ovs_assert((error == 0) == (netdev->qos_conf != NULL));
+        }
+    } else {
+        error = new_ops->qos_construct(netdev_, details);
+        ovs_assert((error == 0) == (netdev->qos_conf != NULL));
+    }
+
+    ovs_mutex_unlock(&netdev->mutex);
+    return error;
+}
+
+/* egress-policer details */
+
+struct egress_policer {
+    struct qos_conf qos_conf;
+    struct rte_meter_srtcm_params app_srtcm_params;
+    struct rte_meter_srtcm test_meter;
+};
+
+static struct egress_policer *
+egress_policer_get__(const struct netdev *netdev_)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    return CONTAINER_OF(netdev->qos_conf, struct egress_policer, qos_conf);
+}
+
+static int
+egress_policer_qos_construct(struct netdev *netdev_,
+                            const struct smap *details)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    struct egress_policer *policer;
+    const char *cir_s;
+    const char *cbs_s;
+    int err = 0;
+
+    rte_spinlock_lock(&netdev->qos_lock);
+    policer = xmalloc(sizeof *policer);
+    qos_conf_init(&policer->qos_conf, &egress_policer_ops);
+    netdev->qos_conf = &policer->qos_conf;
+    cir_s = smap_get(details, "cir");
+    cbs_s = smap_get(details, "cbs");
+    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10) : 0;
+    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10) : 0;
+    policer->app_srtcm_params.ebs = 0;
+    err = rte_meter_srtcm_config(&policer->test_meter,
+                                    &policer->app_srtcm_params);
+    rte_spinlock_unlock(&netdev->qos_lock);
+
+    return err;
+}
+
+static void
+egress_policer_qos_destruct(struct netdev *netdev_ OVS_UNUSED,
+                        struct qos_conf *conf)
+{
+    struct egress_policer *policer = CONTAINER_OF(conf, struct egress_policer,
+                                                qos_conf);
+    free(policer);
+}
+
+static int
+egress_policer_qos_get(const struct netdev *netdev, struct smap *details)
+{
+    struct egress_policer *policer = egress_policer_get__(netdev);
+    smap_add_format(details, "cir", "%llu",
+                    1ULL * policer->app_srtcm_params.cir);
+    smap_add_format(details, "cbs", "%llu",
+                    1ULL * policer->app_srtcm_params.cbs);
+    return 0;
+}
+
+static int
+egress_policer_qos_set(struct netdev *netdev_, const struct smap *details)
+{
+    struct egress_policer *policer;
+    const char *cir_s;
+    const char *cbs_s;
+    int err = 0;
+
+    policer = egress_policer_get__(netdev_);
+    cir_s = smap_get(details, "cir");
+    cbs_s = smap_get(details, "cbs");
+    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10) : 0;
+    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10) : 0;
+    policer->app_srtcm_params.ebs = 0;
+    err = rte_meter_srtcm_config(&policer->test_meter,
+                                    &policer->app_srtcm_params);
+
+    return err;
+}
+
+static inline int
+egress_policer_pkt_handle__(struct rte_meter_srtcm *meter,
+                            struct rte_mbuf *pkt, uint64_t time)
+{
+    uint8_t output_color;
+    uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct ether_hdr);
+    enum egress_policer_action action = GREEN;
+
+    /* color input is not used for blind modes */
+    output_color = (uint8_t) rte_meter_srtcm_color_blind_check(meter, time,
+                                                                pkt_len);
+
+    /* Check output color, 0 corresponds to GREEN i.e. packet can be
+     * transmitted. If greater than 0 then action is set to DROP. */
+    if (output_color) {
+        action = DROP;
+    }
+
+    return action;
+}
+
+static int
+egress_policer_alg_process(struct netdev *netdev_, struct rte_mbuf **pkts,
+                        int pkt_cnt)
+{
+    int i = 0;
+    int cnt = pkt_cnt;
+    struct egress_policer *policer = egress_policer_get__(netdev_);
+    struct rte_mbuf *pkt = NULL;
+    uint64_t current_time = rte_rdtsc();
+
+    for (i = 0; i < pkt_cnt; i++) {
+        pkt = pkts[i];
+        /* Handle current packet */
+        if (egress_policer_pkt_handle__(&policer->test_meter, pkt,
+                                        current_time) == DROP) {
+            rte_pktmbuf_free(pkt);
+            cnt = cnt - 1;
+        }
+    }
+
+    return cnt;
+}
+
+static const struct dpdk_qos_ops egress_policer_ops = {
+    "egress-policer",    /* qos_name */
+    egress_policer_qos_construct,
+    egress_policer_qos_destruct,
+    egress_policer_qos_get,
+    egress_policer_qos_set,
+    egress_policer_alg_process
+};
+
 #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
     GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
 {                                                             \
@@ -2069,10 +2471,10 @@  unlock_dpdk:
     NULL,                       /* set_advertisements */      \
                                                               \
     NULL,                       /* set_policing */            \
-    NULL,                       /* get_qos_types */           \
+    netdev_dpdk_get_qos_types,                                \
     NULL,                       /* get_qos_capabilities */    \
-    NULL,                       /* get_qos */                 \
-    NULL,                       /* set_qos */                 \
+    netdev_dpdk_get_qos,                                      \
+    netdev_dpdk_set_qos,                                      \
     NULL,                       /* get_queue */               \
     NULL,                       /* set_queue */               \
     NULL,                       /* delete_queue */            \
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 4dbd9e3..5be8eb6 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3298,6 +3298,26 @@ 
           for information on how this classifier works.
         </dd>
       </dl>
+      <dl>
+        <dt><code>egress-policer</code></dt>
+        <dd>
+          An egress policer algorithm. This implementation uses the DPDK
+          rte_meter library. The rte_meter library provides an implementation
+          of srTCM (RFC 2697), which allows the metering and policing of
+          traffic. The implementation in OVS creates a color-blind srTCM
+          meter with a single token bucket used to police traffic. Note that
+          configuring rte_meter as part of QoS incurs a performance overhead,
+          as the meter itself consumes CPU cycles that would otherwise be
+          spent on packet processing; the cost shows up as a drop in overall
+          aggregate traffic throughput.
+
+          For more information regarding the usage of the DPDK rte_meter, see
+          <code>http://dpdk.org/doc/guides/sample_app_ug/qos_metering.html</code>.
+          For more information regarding srTCM, see
+          <code>https://tools.ietf.org/html/rfc2697</code>.
+        </dd>
+      </dl>
     </column>
 
     <column name="queues">
@@ -3334,6 +3354,38 @@ 
       </column>
     </group>
 
+    <group title="Configuration for egress-policer QoS">
+      <p>
+        <ref table="QoS"/> <ref table="QoS" column="type"/>
+        <code>egress-policer</code> provides egress policing for userspace
+        port types with DPDK.
+
+        It has the following key-value pairs defined.
+      </p>
+
+      <column name="other_config" key="cir" type='{"type": "integer"}'>
+        The Committed Information Rate (CIR) is measured in bytes of IP
+        packets per second, i.e. it includes the IP header, but not link
+        specific (e.g. Ethernet) headers. This is the rate, in bytes per
+        second, at which the token bucket is replenished. The CIR value is
+        calculated as (pps x packet data size). For example, to limit a
+        stream of 64-byte packets to 1 million packets per second, the CIR
+        would be set to 46000000, i.e. 1,000,000 x 46, where 1,000,000 is
+        the policing rate in packets per second and 46 is the IP packet data
+        size of a 64-byte Ethernet frame.
+      </column>
+      <column name="other_config" key="cbs" type='{"type": "integer"}'>
+        The Committed Burst Size (CBS) is measured in bytes and represents
+        the capacity of the token bucket. At a minimum this value should be
+        set to the largest packet size expected in the traffic stream; in
+        practice larger values may be used to increase the size of the token
+        bucket. When a packet is transmitted, the bucket is decremented by
+        the packet's size in bytes/tokens; if the bucket does not hold
+        enough tokens, the packet is dropped.
+      </column>
+    </group>
+
     <group title="Common Columns">
       The overall purpose of these columns is described under <code>Common
       Columns</code> at the beginning of this document.