[ovs-dev] netdev_dpdk.c: Add QoS functionality.
diff mbox

Message ID 1443617115-1987-1-git-send-email-ian.stokes@intel.com
State Changes Requested
Headers show

Commit Message

Stokes, Ian Sept. 30, 2015, 12:45 p.m. UTC
This patch provides the modifications required in netdev-dpdk.c and
vswitch.xml to allow for a DPDK user space QoS algorithm.

This patch adds a QoS configuration structure for netdev-dpdk and
expected QoS operations 'dpdk_qos_ops'. Various helper functions
are also supplied.

Also included are the modifications required for vswitch.xml to allow a
new QoS implementation for netdev-dpdk devices. This includes a new QoS type
`us-policer` as well as its expected QoS table entries.

The QoS functionality implemented for DPDK devices is `us-policer`.
This is an egress policer used to drop packets at configurable rate.

The INSTALL.DPDK.md guide has also been modified to provide an example
configuration of `us-policer` QoS.

Signed-off-by: Ian Stokes <ian.stokes@intel.com>
---
 INSTALL.DPDK.md      |   52 +++++++
 lib/netdev-dpdk.c    |  414 +++++++++++++++++++++++++++++++++++++++++++++++++-
 vswitchd/vswitch.xml |   40 +++++
 3 files changed, 500 insertions(+), 6 deletions(-)

Comments

Stokes, Ian Oct. 8, 2015, 8:59 a.m. UTC | #1
Hi All,

Just wondering if anyone has had time to review this?
Any feedback would be much appreciated.

Thanks
Ian

> -----Original Message-----
> From: Stokes, Ian
> Sent: Wednesday, September 30, 2015 1:45 PM
> To: dev@openvswitch.org
> Cc: Stokes, Ian
> Subject: [PATCH] netdev_dpdk.c: Add QoS functionality.
> 
> This patch provides the modifications required in netdev-dpdk.c and
> vswitch.xml to allow for a DPDK user space QoS algorithm.
> 
> This patch adds a QoS configuration structure for netdev-dpdk and
> expected QoS operations 'dpdk_qos_ops'. Various helper functions
> are also supplied.
> 
> Also included are the modifications required for vswitch.xml to allow a
> new QoS implementation for netdev-dpdk devices. This includes a new QoS
> type
> `us-policer` as well as its expected QoS table entries.
> 
> The QoS functionality implemented for DPDK devices is `us-policer`.
> This is an egress policer used to drop packets at configurable rate.
> 
> The INSTALL.DPDK.md guide has also been modified to provide an example
> configuration of `us-policer` QoS.
> 
> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> ---
>  INSTALL.DPDK.md      |   52 +++++++
>  lib/netdev-dpdk.c    |  414
> +++++++++++++++++++++++++++++++++++++++++++++++++-
>  vswitchd/vswitch.xml |   40 +++++
>  3 files changed, 500 insertions(+), 6 deletions(-)
> 
> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> index 7bf110c..61bca31 100644
> --- a/INSTALL.DPDK.md
> +++ b/INSTALL.DPDK.md
> @@ -207,6 +207,58 @@ Using the DPDK with ovs-vswitchd:
>     ./ovs-ofctl add-flow br0 in_port=2,action=output:1
>     ```
> 
> +8. QoS usage example
> +
> +   OVS supports egress QoS for physical and virtual DPDK port types.
> Currently
> +   there is only one supported QoS type 'us-policer'.
> +
> +   Assuming you have a vhost-user port transmitting traffic consisting
> of
> +   packets of size 64 bytes, the following command would limit the
> egress
> +   transmission rate of the port to ~1,000,000 packets per second:
> +
> +   `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create
> qos
> +   type=us-policer other-config:cir=46000000 other-config:cbs=2048
> +   other-config:ebs=2048`
> +
> +   An explanation of the us-policer specific parameters are detailed
> below.
> +
> +   `type=us-policer`: Sets the QoS type to userspace-policer. This is
> required
> +   so that the dpdk-netdev can be configured with the correct QoS type.
> +   The specifics of us-policer can be found in the vswitch.xml file.
> +
> +   `other-config:cir=46000000`: The Committed Information Rate (cir)
> +   represents the bytes per second rate at which the token buckets will
> be
> +   updated. See RFC2697 for specific details. The cir value is
> calculated by
> +   (pps x packet data size). For example the value 46000000 can be
> broken
> +   into '1,000,000 x 46'. Where 1,000,000 is the policing rate for the
> number
> +   of packets per second and 46 represents the size of the packet data
> for a 64
> +   byte ip packet.
> +
> +   `other-config:cbs=2048`: This represents the value of the Commited
> Burst Size
> +   (cbs) token bucket. At a minimum this value should be be set to the
> expected
> +   largest size packet in the traffic stream. In practice larger values
> may be
> +   used to increase the size of the token bucket. If a packet can be
> +   transmitted then the cbs will be decremented by the number of
> bytes/tokens of
> +   the packet. If there are not enough tokens in the cbs bucket then
> the Excess
> +   Burst Size (ebs) token bucket will be used too and excess bytes
> decremented
> +   in the ebs.
> +
> +   `other-config:ebs=2048`: This represents the value of the Excess
> Burst Size
> +   (ebs). At a minimum this value should be set to the expected largest
> size
> +   packet in the traffic stream. In practice larger values may be used
> to
> +   increase the size of the token bucket. In the us-policer
> implementation any
> +   packet that requires tokens from the ebs token bucket will be marked
> to be
> +   dropped. Any packet that has exhausted the ebs bucket will also be
> dropped.
> +
> +   To examine the QoS configuration of the port:
> +
> +   `ovs-appctl -t ovs-vswitchd qos/show vhost-user0`
> +
> +   To clear the QoS configuration from the port and ovsdb use the
> following:
> +
> +   `ovs-vsctl -- destroy QoS vhost-user0 -- clear Port vhost-user0 qos`
> +
> +
>  Performance Tuning:
>  -------------------
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index b72a33b..7e7fca1 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -44,6 +44,7 @@
>  #include "ovs-rcu.h"
>  #include "packets.h"
>  #include "shash.h"
> +#include "smap.h"
>  #include "sset.h"
>  #include "unaligned.h"
>  #include "timeval.h"
> @@ -52,12 +53,14 @@
> 
>  #include "rte_config.h"
>  #include "rte_mbuf.h"
> +#include "rte_meter.h"
>  #include "rte_virtio_net.h"
> 
>  VLOG_DEFINE_THIS_MODULE(dpdk);
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> 
>  #define DPDK_PORT_WATCHDOG_INTERVAL 5
> +#define DPDK_MAX_QOS_NAME_SIZE 10
> 
>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>  #define OVS_VPORT_DPDK "ovs_dpdk"
> @@ -142,6 +145,122 @@ static int rte_eal_init_ret = ENODEV;
> 
>  static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
> 
> +/* Quality of Service */
> +
> +/* An instance of a QoS configuration.  Always associated with a
> particular
> + * network device.
> + *
> + * Each QoS implementation subclasses this with whatever additional
> data it
> + * needs.
> + */
> +struct qos_conf{
> +    const struct dpdk_qos_ops *ops;
> +};
> +
> +/* A particular implementation of dpdk QoS operations.
> + *
> + * The functions below return 0 if successful or a positive errno value
> on
> + * failure, except where otherwise noted. All of them must be provided,
> except
> + * where otherwise noted.
> + */
> +struct dpdk_qos_ops{
> +
> +    /* Name of the QoS type */
> +    const char *qos_name;
> +
> +    /* Called to construct the QoS implementation on 'netdev'. The
> +     * implementation should make the appropriate calls to configure
> QoS
> +     * according to 'details'. The implementation may assume that any
> current
> +     * QoS configuration already installed should be destroyed before
> +     * constructing the new configuration.
> +     *
> +     * The contents of 'details' should be documented as valid for
> 'ovs_name'
> +     * in the "other_config" column in the "QoS" table in
> vswitchd/vswitch.xml
> +     * (which is built as ovs-vswitchd.conf.db(8)).
> +     *
> +     * This function must return 0 if and only if it sets 'netdev-
> >qos_conf'
> +     * to an initialized 'struct qos_conf'.
> +     *
> +     * For all QoS implementations it should always be non-null.
> +     */
> +    int (*qos_construct)(struct netdev *netdev, const struct smap
> *details);
> +
> +    /* Destroys the data structures allocated by the implementation as
> part of
> +     * 'qos_conf.
> +     *
> +     * For all QoS implementations it should always be non-null.
> +     */
> +    void (*qos_destruct)(struct netdev *netdev, struct qos_conf *conf);
> +
> +    /* Retrieves details of 'netdev->qos_conf' configuration into
> 'details'.
> +     *
> +     * The contents of 'details' should be documented as valid for
> 'ovs_name'
> +     * in the "other_config" column in the "QoS" table in
> vswitchd/vswitch.xml
> +     * (which is built as ovs-vswitchd.conf.db(8)).
> +     */
> +    int (*qos_get)(const struct netdev *netdev, struct smap *details);
> +
> +    /* Reconfigures 'netdev->qos_conf' according to 'details',
> performing any
> +     * required calls to complete the reconfiguration.
> +     *
> +     * The contents of 'details' should be documented as valid for
> 'ovs_name'
> +     * in the "other_config" column in the "QoS" table in
> vswitchd/vswitch.xml
> +     * (which is built as ovs-vswitchd.conf.db(8)).
> +     *
> +     * This function may be null if 'qos_conf' is not configurable.
> +     */
> +    int (*qos_set)(struct netdev *netdev, const struct smap *details);
> +
> +    /* Modify an array of rte_mbufs. The modification is specific to
> +     * each qos implementation.
> +     *
> +     * The function should take and array of mbufs and an int
> representing
> +     * the current number of mbufs present in the array.
> +     *
> +     * After the function has performed a qos modification to the array
> of mbufs
> +     * it returns an int representing the number of mbufs now present
> in the
> +     * array. This value is can then be passed to the port send
> function along
> +     * with the modified array for transmission.
> +     *
> +     * For all QoS implementations it should always be non-null.
> +     */
> +    int (*qos_alg_process)(struct netdev *netdev, struct rte_mbuf
> **pkts,
> +                           int pkt_cnt);
> +};
> +
> +/* dpdk_qos_ops for each type of user space QoS implementation */
> +static const struct dpdk_qos_ops us_policer_ops;
> +
> +/*
> + * Array of dpdk_qos_ops, contains pointer to all supported QoS
> + * operations.
> + */
> +static const struct dpdk_qos_ops *const qos_confs[] = {
> +    &us_policer_ops,               /*User Space Strict Priority
> Queuing*/
> +    NULL
> +};
> +
> +/* Action that can be set for a packet for rte_meter */
> +enum us_policer_action {
> +        GREEN = e_RTE_METER_GREEN,
> +        YELLOW = e_RTE_METER_YELLOW,
> +        RED = e_RTE_METER_RED,
> +        DROP = 3,
> +};
> +
> +/*
> + * Table representing the policing policy. Rows indicate the input
> color,
> + * columns indicate the output color, and the value that is stored in
> + * the table indicates the action to be taken for that particular case.
> + */
> +enum us_policer_action us_policer_table[e_RTE_METER_COLORS]
> +                                       [e_RTE_METER_COLORS] =
> +{
> +	{ GREEN, DROP, DROP},
> +	{ DROP, DROP, DROP},
> +	{ DROP, DROP, DROP}
> +};
> +
>  /* Contains all 'struct dpdk_dev's. */
>  static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
>      = OVS_LIST_INITIALIZER(&dpdk_list);
> @@ -235,6 +354,11 @@ struct netdev_dpdk {
> 
>      /* In dpdk_list. */
>      struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
> +
> +    /* QoS configuration and lock for the device */
> +    struct qos_conf *qos_conf;
> +    rte_spinlock_t qos_lock;
> +
>  };
> 
>  struct netdev_rxq_dpdk {
> @@ -614,6 +738,10 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
> int port_no,
>          goto unlock;
>      }
> 
> +    /* Initialise QoS configuration to NULL and qos lock to unlocked */
> +    netdev->qos_conf = NULL;
> +    rte_spinlock_init(&netdev->qos_lock);
> +
>      netdev_->n_txq = NR_QUEUE;
>      netdev_->n_rxq = NR_QUEUE;
>      netdev->real_n_txq = NR_QUEUE;
> @@ -1072,6 +1200,13 @@ __netdev_dpdk_vhost_send(struct netdev *netdev,
> struct dp_packet **pkts,
>          goto out;
>      }
> 
> +    /* Check has QoS has been configured for the netdev */
> +    rte_spinlock_lock(&vhost_dev->qos_lock);
> +    if (vhost_dev->qos_conf != NULL) {
> +        cnt = vhost_dev->qos_conf->ops->qos_alg_process(netdev,
> cur_pkts, cnt);
> +    }
> +    rte_spinlock_unlock(&vhost_dev->qos_lock);
> +
>      /* There is vHost TX single queue, So we need to lock it for TX. */
>      rte_spinlock_lock(&vhost_dev->vhost_tx_lock);
> 
> @@ -1216,6 +1351,14 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
> struct dp_packet **pkts,
>      if (dev->type == DPDK_DEV_VHOST) {
>          __netdev_dpdk_vhost_send(netdev, (struct dp_packet **) mbufs,
> newcnt, true);
>      } else {
> +
> +        /* Check if QoS has been configured for this netdev. */
> +        rte_spinlock_lock(&dev->qos_lock);
> +        if (dev->qos_conf != NULL) {
> +            newcnt = dev->qos_conf->ops->qos_alg_process(netdev, mbufs,
> newcnt);
> +        }
> +        rte_spinlock_unlock(&dev->qos_lock);
> +
>          dpdk_queue_pkts(dev, qid, mbufs, newcnt);
>          dpdk_queue_flush(dev, qid);
>      }
> @@ -1289,9 +1432,18 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int
> qid,
>              }
>          }
>          if (next_tx_idx != cnt) {
> -           dpdk_queue_pkts(dev, qid,
> -                            (struct rte_mbuf **)&pkts[next_tx_idx],
> -                            cnt-next_tx_idx);
> +            cnt -= next_tx_idx;
> +
> +            /* Check if QoS has been configured for this netdev. */
> +            rte_spinlock_lock(&dev->qos_lock);
> +            if (dev->qos_conf != NULL) {
> +                struct netdev *netdev = &dev->up;
> +                cnt = dev->qos_conf->ops->qos_alg_process(netdev,
> +                        (struct rte_mbuf**)pkts, cnt);
> +            }
> +            rte_spinlock_unlock(&dev->qos_lock);
> +
> +            dpdk_queue_pkts(dev, qid, (struct rte_mbuf
> **)&pkts[next_tx_idx], cnt);
>          }
> 
>          if (OVS_UNLIKELY(dropped)) {
> @@ -2032,6 +2184,256 @@ unlock_dpdk:
>      return err;
>  }
> 
> +/* QoS Functions */
> +
> +/*
> + * Initialize QoS configuration operations.
> + */
> +static void
> +qos_conf_init(struct qos_conf *conf, const struct dpdk_qos_ops *ops)
> +{
> +    conf->ops = ops;
> +}
> +
> +/*
> + * Search existing QoS operations in qos_ops and compare each set of
> operations
> + * qos_name to name. Return a dpdk_qos_ops pointer to a match, else
> return
> + * NULL
> + */
> +static const struct dpdk_qos_ops *
> +qos_lookup_name(const char *name)
> +{
> +    const struct dpdk_qos_ops *const *opsp;
> +
> +    for (opsp = qos_confs; *opsp != NULL; opsp++) {
> +        const struct dpdk_qos_ops *ops = *opsp;
> +        if (!strncmp(name, ops->qos_name,DPDK_MAX_QOS_NAME_SIZE)) {
> +            return ops;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +/*
> + * Call qos_destruct to clean up items associated with the netdevs
> + * qos_conf. Set netdevs qos_conf to NULL.
> + */
> +static void
> +qos_delete_conf(struct netdev *netdev_)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +
> +    rte_spinlock_lock(&netdev->qos_lock);
> +    if (netdev->qos_conf) {
> +        if (netdev->qos_conf->ops->qos_destruct) {
> +            netdev->qos_conf->ops->qos_destruct(netdev_, netdev-
> >qos_conf);
> +        }
> +        netdev->qos_conf = NULL;
> +    }
> +    rte_spinlock_unlock(&netdev->qos_lock);
> +}
> +
> +static int
> +netdev_dpdk_get_qos_types(const struct netdev *netdev OVS_UNUSED,
> +                           struct sset *types)
> +{
> +    const struct dpdk_qos_ops *const *opsp;
> +
> +    for (opsp = qos_confs; *opsp != NULL; opsp++) {
> +        const struct dpdk_qos_ops *ops = *opsp;
> +        if (ops->qos_construct && ops->qos_name[0] != '\0') {
> +            sset_add(types, ops->qos_name);
> +        }
> +    }
> +    return 0;
> +}
> +
> +static int
> +netdev_dpdk_get_qos(const struct netdev *netdev_,
> +                    const char **typep, struct smap *details)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    int error = 0;
> +
> +    ovs_mutex_lock(&netdev->mutex);
> +    if(netdev->qos_conf){
> +        *typep = netdev->qos_conf->ops->qos_name;
> +        error = (netdev->qos_conf->ops->qos_get
> +                 ? netdev->qos_conf->ops->qos_get(netdev_, details):
> 0);
> +    }
> +    ovs_mutex_unlock(&netdev->mutex);
> +
> +    return error;
> +}
> +
> +static int
> +netdev_dpdk_set_qos(struct netdev *netdev_,
> +                    const char *type, const struct smap *details)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    const struct dpdk_qos_ops *new_ops = NULL;
> +    int error = 0;
> +
> +    /* If type is empty the current QoS configuration
> +	   for the dpdk-netdev can be destroyed */
> +    if(type[0] == '\0') {
> +        qos_delete_conf(netdev_);
> +    }
> +
> +    new_ops = qos_lookup_name(type);
> +    if (!new_ops || !new_ops->qos_construct) {
> +        return EOPNOTSUPP;
> +    }
> +
> +    ovs_mutex_lock(&netdev->mutex);
> +    if (netdev->qos_conf) {
> +        if (new_ops == netdev->qos_conf->ops) {
> +            error = new_ops->qos_set ? new_ops->qos_set(netdev_,
> details) : 0;
> +        }
> +        else {
> +            /* Delete existing QoS configuration. */
> +            qos_delete_conf(netdev_);
> +            ovs_assert(netdev->qos_conf == NULL);
> +
> +            /* Install new QoS configuration. */
> +            error = new_ops->qos_construct(netdev_, details);
> +            ovs_assert((error == 0) == (netdev->qos_conf != NULL));
> +        }
> +    } else {
> +        error = new_ops->qos_construct(netdev_, details);
> +        ovs_assert((error == 0) == (netdev->qos_conf != NULL));
> +    }
> +
> +    ovs_mutex_unlock(&netdev->mutex);
> +    return error;
> +}
> +
> +/* us-policer details */
> +
> +struct us_policer {
> +    struct qos_conf qos_conf;
> +    struct rte_meter_srtcm_params app_srtcm_params;
> +    struct rte_meter_srtcm test_meter;
> +};
> +
> +static struct us_policer *
> +us_policer_get__(const struct netdev *netdev_)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    return CONTAINER_OF(netdev->qos_conf, struct us_policer, qos_conf);
> +}
> +
> +static int
> +us_policer_qos_construct(struct netdev *netdev_, const struct smap
> *details)
> +{
> +    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> +    struct us_policer *policer;
> +    const char *cir_s;
> +    const char *cbs_s;
> +    const char *ebs_s;
> +    int err = 0;
> +
> +    rte_spinlock_lock(&netdev->qos_lock);
> +    policer = xmalloc(sizeof *policer);
> +    qos_conf_init(&policer->qos_conf, &us_policer_ops);
> +    netdev->qos_conf = &policer->qos_conf;
> +    cir_s = smap_get(details, "cir");
> +    cbs_s = smap_get(details, "cbs");
> +    ebs_s = smap_get(details, "ebs");
> +    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10) :
> 0;
> +    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10) :
> 0;
> +    policer->app_srtcm_params.ebs = ebs_s ? strtoull(ebs_s, NULL, 10) :
> 0;
> +    err = rte_meter_srtcm_config(&policer->test_meter, &policer-
> >app_srtcm_params);
> +    rte_spinlock_unlock(&netdev->qos_lock);
> +
> +    return err;
> +}
> +
> +static void
> +us_policer_qos_destruct(struct netdev *netdev_ OVS_UNUSED, struct
> qos_conf *conf)
> +{
> +    struct us_policer *policer = CONTAINER_OF(conf, struct us_policer,
> qos_conf);
> +    free(policer);
> +}
> +
> +static int
> +us_policer_qos_get(const struct netdev *netdev, struct smap *details)
> +{
> +    struct us_policer *policer = us_policer_get__(netdev);
> +    smap_add_format(details, "cir", "%llu", 1ULL * policer-
> >app_srtcm_params.cir);
> +    smap_add_format(details, "cbs", "%llu", 1ULL * policer-
> >app_srtcm_params.cbs);
> +    smap_add_format(details, "ebs", "%llu", 1ULL * policer-
> >app_srtcm_params.ebs);
> +    return 0;
> +}
> +
> +static int
> +us_policer_qos_set(struct netdev *netdev_, const struct smap *details)
> +{
> +    struct us_policer *policer;
> +    const char *cir_s;
> +    const char *cbs_s;
> +    const char *ebs_s;
> +    int err = 0;
> +
> +    policer = us_policer_get__(netdev_);
> +    cir_s = smap_get(details, "cir");
> +    cbs_s = smap_get(details, "cbs");
> +    ebs_s = smap_get(details, "ebs");
> +    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10) :
> 0;
> +    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10) :
> 0;
> +    policer->app_srtcm_params.ebs = ebs_s ? strtoull(ebs_s, NULL, 10) :
> 0;
> +    err = rte_meter_srtcm_config(&policer->test_meter, &policer-
> >app_srtcm_params);
> +
> +    return err;
> +}
> +
> +static inline int
> +__us_policer_pkt_handle(struct rte_meter_srtcm *meter, struct rte_mbuf
> *pkt, uint64_t time)
> +{
> +	uint8_t input_color, output_color;
> +	uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct
> ether_hdr);
> +	input_color = 0;
> +	enum us_policer_action action;
> +
> +	/* color input is not used for blind modes */
> +	output_color = (uint8_t) rte_meter_srtcm_color_blind_check(meter,
> time, pkt_len);
> +
> +	/* Apply policing and set the output color */
> +	action = us_policer_table[input_color][output_color];
> +
> +	return action;
> +}
> +
> +static int
> +us_policer_alg_process(struct netdev *netdev_, struct rte_mbuf **pkts,
> int pkt_cnt)
> +{
> +	int i = 0;
> +    int cnt = pkt_cnt;
> +    struct us_policer *policer = us_policer_get__(netdev_);
> +    struct rte_mbuf *pkt = NULL;
> +    uint64_t current_time = rte_rdtsc();
> +
> +    for(i = 0; i < pkt_cnt; i++) {
> +        pkt = pkts[i];
> +        /* Handle current packet */
> +        if (__us_policer_pkt_handle(&policer->test_meter, pkt,
> current_time) == DROP) {
> +            rte_pktmbuf_free(pkt);
> +            cnt = cnt - 1;
> +        }
> +    }
> +
> +    return cnt;
> +}
> +
> +static const struct dpdk_qos_ops us_policer_ops = {
> +    "us-policer",    /* qos_name */
> +    us_policer_qos_construct,
> +	us_policer_qos_destruct,
> +    us_policer_qos_get,
> +    us_policer_qos_set,
> +	us_policer_alg_process
> +};
> +
>  #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ,
> SEND, \
>      GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)
> \
>  {                                                             \
> @@ -2069,10 +2471,10 @@ unlock_dpdk:
>      NULL,                       /* set_advertisements */      \
>                                                                \
>      NULL,                       /* set_policing */            \
> -    NULL,                       /* get_qos_types */           \
> +    netdev_dpdk_get_qos_types,                                \
>      NULL,                       /* get_qos_capabilities */    \
> -    NULL,                       /* get_qos */                 \
> -    NULL,                       /* set_qos */                 \
> +    netdev_dpdk_get_qos,                                      \
> +    netdev_dpdk_set_qos,                                      \
>      NULL,                       /* get_queue */               \
>      NULL,                       /* set_queue */               \
>      NULL,                       /* delete_queue */            \
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 0ab4a9a..77f70a4 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3283,6 +3283,18 @@
>            for information on how this classifier works.
>          </dd>
>        </dl>
> +      <dl>
> +        <dt><code>us-policer</code></dt>
> +        <dd>
> +          Userspace egress policer algorithm. This implementation uses
> the DPDK
> +          rte_meter library. The rte_meter library provides an
> implementation of
> +          srTCM (RFC2697) which allows the metering and policing of
> traffic.
> +          For more information regarding the usage of the DPDK
> rte_meter see
> +
> <code>http://dpdk.org/doc/guides/sample_app_ug/qos_metering.html</code>
> +          For more information regarding srTCM see
> +          <code>https://tools.ietf.org/html/rfc2697</code>
> +        </dd>
> +      </dl>
>      </column>
> 
>      <column name="queues">
> @@ -3319,6 +3331,34 @@
>        </column>
>      </group>
> 
> +    <group title="Configuration for us-policer QoS">
> +      <p>
> +        <ref table="QoS"/> <ref table="QoS" column="type"/>
> +        <code>us-policer</code> provides egress policing for userspace
> port
> +        types with DPDK.
> +
> +        It has the following key-value pairs defined.
> +      </p>
> +
> +      <column name="other_config" key="cir" type='{"type": "integer"}'>
> +        The Committed Information Rate (CIR) is measured in bytes of IP
> packets
> +        per second, i.e., it includes the IP header, but not link
> specific
> +        (e.g. Ethernet) headers. See RFC2697 for more details.
> +      </column>
> +      <column name="other_config" key="cbs" type='{"type": "integer"}'>
> +        The Committed Burst Size (CBS) is measured in bytes. It must be
> set to a
> +        value greater than 0. It is recommended that the minimum value
> for CBS
> +        be the size of the largest possible IP packet in the stream.
> +        See RFC2697 for more details.
> +      </column>
> +      <column name="other_config" key="ebs" type='{"type": "integer"}'>
> +        The Excess Burst Size (EBS) is measured in bytes. It must be
> set to a
> +        value greater than 0. It is recommended that the minimum value
> for EBS
> +        be the size of the largest possible IP packet in the stream.
> +        See RFC2697 for more details.
> +      </column>
> +    </group>
> +
>      <group title="Common Columns">
>        The overall purpose of these columns is described under
> <code>Common
>        Columns</code> at the beginning of this document.
> --
> 1.7.4.1
Daniele Di Proietto Oct. 9, 2015, 5:52 p.m. UTC | #2
Hi,

Thanks for the patch, the implementation looks simpler than I expected.

General questions:

* It appears that we're using the srCTM, but
  * We're only sending the green packets and dropping the rest.
  * The `input_color` is always green.

  Therefore
  * `other-config:ebs` is ignored (isn't it?). I would rather not put in
the
    database a value that is ignored.
  * The `us_policer_table` seems redundant
  * We just need two `us_policer_action`s, or perhaps just a boolean.

  Can we just call this a token bucket? Are there any good reasons not to?
  (Maybe you're planning to expand this, I'm not sure)

* Out of curiosity, are there plans to support different output queues?

* I did't experience any performance degradation when QoS is not
configured,
  but one can imagine that taking a spinlock might have some performance
impact
  when multiple threads insist on the same NIC. Have you considered using
RCU
  for the `qos_conf` pointer? (I'm not saying that we absolutely should at
this
  point)
 
Minor style issues:

* Lines should be limited to 79 characters
* We use 4 spaces to indent, not tabs.

That said the code looks good, a couple of more comments inline

Thanks!

Daniele

On 30/09/2015 13:45, "Ian Stokes" <ian.stokes@intel.com> wrote:

>This patch provides the modifications required in netdev-dpdk.c and
>vswitch.xml to allow for a DPDK user space QoS algorithm.
>
>This patch adds a QoS configuration structure for netdev-dpdk and
>expected QoS operations 'dpdk_qos_ops'. Various helper functions
>are also supplied.
>
>Also included are the modifications required for vswitch.xml to allow a
>new QoS implementation for netdev-dpdk devices. This includes a new QoS
>type
>`us-policer` as well as its expected QoS table entries.
>
>The QoS functionality implemented for DPDK devices is `us-policer`.
>This is an egress policer used to drop packets at configurable rate.
>
>The INSTALL.DPDK.md guide has also been modified to provide an example
>configuration of `us-policer` QoS.
>
>Signed-off-by: Ian Stokes <ian.stokes@intel.com>
>---
> INSTALL.DPDK.md      |   52 +++++++
> lib/netdev-dpdk.c    |  414
>+++++++++++++++++++++++++++++++++++++++++++++++++-
> vswitchd/vswitch.xml |   40 +++++
> 3 files changed, 500 insertions(+), 6 deletions(-)
[...]
>
>@@ -142,6 +145,122 @@ static int rte_eal_init_ret = ENODEV;
> 
> static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
> 
>+/* Quality of Service */
>+
>+/* An instance of a QoS configuration.  Always associated with a
>particular
>+ * network device.
>+ *
>+ * Each QoS implementation subclasses this with whatever additional data
>it
>+ * needs.
>+ */
>+struct qos_conf{

A space is needed before open brace

>+    const struct dpdk_qos_ops *ops;
>+};
>+
>+/* A particular implementation of dpdk QoS operations.
>+ *
>+ * The functions below return 0 if successful or a positive errno value
>on
>+ * failure, except where otherwise noted. All of them must be provided,
>except
>+ * where otherwise noted.
>+ */
>+struct dpdk_qos_ops{

Same here

[...]
>@@ -1289,9 +1432,18 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int
>qid,
>             }
>         }

Right before this, there is another call to dpdk_queue_pkts, that should
go through qos_alg_process as well. Have you considered adding a helper
function?

>         if (next_tx_idx != cnt) {
>-           dpdk_queue_pkts(dev, qid,
>-                            (struct rte_mbuf **)&pkts[next_tx_idx],
>-                            cnt-next_tx_idx);
>+            cnt -= next_tx_idx;
>+
>+            /* Check if QoS has been configured for this netdev. */
>+            rte_spinlock_lock(&dev->qos_lock);
>+            if (dev->qos_conf != NULL) {
>+                struct netdev *netdev = &dev->up;
>+                cnt = dev->qos_conf->ops->qos_alg_process(netdev,
>+                        (struct rte_mbuf**)pkts, cnt);
>+            }
>+            rte_spinlock_unlock(&dev->qos_lock);
>+
>+            dpdk_queue_pkts(dev, qid, (struct rte_mbuf
>**)&pkts[next_tx_idx], cnt);
>         }
> 
>         if (OVS_UNLIKELY(dropped)) {
[...]
>+    ovs_mutex_lock(&netdev->mutex);
>+    if(netdev->qos_conf){

A space is needed here

[...]
>+static inline int
>+__us_policer_pkt_handle(struct rte_meter_srtcm *meter, struct rte_mbuf
>*pkt, uint64_t time)

CodingStyle.md says:

  - Do not use names that begin with _.  If you need a name for
    "internal use only", use __ as a suffix instead of a prefix.

Unfortunately we have many exceptions, even in this file, but I would

prefer to have it changed
Stokes, Ian Oct. 12, 2015, 5:02 p.m. UTC | #3
Hi Daniele,

Thanks for providing feedback, answers inline.

> -----Original Message-----
> From: Daniele Di Proietto [mailto:diproiettod@vmware.com]
> Sent: Friday, October 09, 2015 6:53 PM
> To: Stokes, Ian
> Cc: dev@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH] netdev_dpdk.c: Add QoS functionality.
> 
> Hi,
> 
> Thanks for the patch, the implementation looks simpler than I expected.
> 
> General questions:
> 
> * It appears that we're using the srCTM, but
>   * We're only sending the green packets and dropping the rest.
>   * The `input_color` is always green.
> 
>   Therefore
>   * `other-config:ebs` is ignored (isn't it?). I would rather not put in
> the
>     database a value that is ignored.

Although we do not use the ebs value it is a requirement for the 
rte_meter initialization function. DPDK expects an ebs value 
greater than 0 or it fails to initialize the rte_meter. 

We can remove the other_config:ebs from the command line but
will have to add a define in netdev_dpdk so as to supply some ebs 
value.

Would this be acceptable?

>   * The `us_policer_table` seems redundant
>   * We just need two `us_policer_action`s, or perhaps just a boolean.
> 
>   Can we just call this a token bucket? Are there any good reasons not
> to?
>   (Maybe you're planning to expand this, I'm not sure)

Is this in relation to the name 'us-policer'or the name for the cbs and
ebs paramaters? ?

I think we should change the us-policer name anyhow to signify that it
is egress. Would 'egress_policer' be acceptable? Failing that we ca use
token_bucket can be used. Once it's clear to the end user.

> 
> * Out of curiosity, are there plans to support different output queues?

There are plans to introduce user space queues at a later date to allow 
buffering of traffic. This would be included as maybe a rate limiting type
implementation where traffic is not dropped but instead buffered. This 
will be a separate submission in the future however.
 
> 
> * I did't experience any performance degradation when QoS is not
> configured,
>   but one can imagine that taking a spinlock might have some performance
> impact
>   when multiple threads insist on the same NIC. Have you considered
> using
> RCU
>   for the `qos_conf` pointer? (I'm not saying that we absolutely should
> at
> this
>   point)
> 
I haven't tried an RCU for this. I can look at this but if it's not a
blocking issue perhaps it could be implemented at a later stage?
> Minor style issues:
> 
> * Lines should be limited to 79 characters
> * We use 4 spaces to indent, not tabs.
>
Noted, will make these changes for version 2 submission.
 
> That said the code looks good, a couple of more comments inline
> 
> Thanks!
> 
> Daniele
> 
> On 30/09/2015 13:45, "Ian Stokes" <ian.stokes@intel.com> wrote:
> 
> >This patch provides the modifications required in netdev-dpdk.c and
> >vswitch.xml to allow for a DPDK user space QoS algorithm.
> >
> >This patch adds a QoS configuration structure for netdev-dpdk and
> >expected QoS operations 'dpdk_qos_ops'. Various helper functions
> >are also supplied.
> >
> >Also included are the modifications required for vswitch.xml to allow a
> >new QoS implementation for netdev-dpdk devices. This includes a new QoS
> >type
> >`us-policer` as well as its expected QoS table entries.
> >
> >The QoS functionality implemented for DPDK devices is `us-policer`.
> >This is an egress policer used to drop packets at configurable rate.
> >
> >The INSTALL.DPDK.md guide has also been modified to provide an example
> >configuration of `us-policer` QoS.
> >
> >Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> >---
> > INSTALL.DPDK.md      |   52 +++++++
> > lib/netdev-dpdk.c    |  414
> >+++++++++++++++++++++++++++++++++++++++++++++++++-
> > vswitchd/vswitch.xml |   40 +++++
> > 3 files changed, 500 insertions(+), 6 deletions(-)
> [...]
> >
> >@@ -142,6 +145,122 @@ static int rte_eal_init_ret = ENODEV;
> >
> > static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
> >
> >+/* Quality of Service */
> >+
> >+/* An instance of a QoS configuration.  Always associated with a
> >particular
> >+ * network device.
> >+ *
> >+ * Each QoS implementation subclasses this with whatever additional
> data
> >it
> >+ * needs.
> >+ */
> >+struct qos_conf{
> 
> A space is needed before open brace
> 
> >+    const struct dpdk_qos_ops *ops;
> >+};
> >+
> >+/* A particular implementation of dpdk QoS operations.
> >+ *
> >+ * The functions below return 0 if successful or a positive errno
> value
> >on
> >+ * failure, except where otherwise noted. All of them must be
> provided,
> >except
> >+ * where otherwise noted.
> >+ */
> >+struct dpdk_qos_ops{
> 
> Same here
> 
> [...]
> >@@ -1289,9 +1432,18 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int
> >qid,
> >             }
> >         }
> 
> Right before this, there is another call to dpdk_queue_pkts, that should
> go through qos_alg_process as well. Have you considered adding a helper
> function?
> 
> >         if (next_tx_idx != cnt) {
> >-           dpdk_queue_pkts(dev, qid,
> >-                            (struct rte_mbuf **)&pkts[next_tx_idx],
> >-                            cnt-next_tx_idx);
> >+            cnt -= next_tx_idx;
> >+
> >+            /* Check if QoS has been configured for this netdev. */
> >+            rte_spinlock_lock(&dev->qos_lock);
> >+            if (dev->qos_conf != NULL) {
> >+                struct netdev *netdev = &dev->up;
> >+                cnt = dev->qos_conf->ops->qos_alg_process(netdev,
> >+                        (struct rte_mbuf**)pkts, cnt);
> >+            }
> >+            rte_spinlock_unlock(&dev->qos_lock);
> >+
> >+            dpdk_queue_pkts(dev, qid, (struct rte_mbuf
> >**)&pkts[next_tx_idx], cnt);
> >         }
> >
> >         if (OVS_UNLIKELY(dropped)) {
> [...]
> >+    ovs_mutex_lock(&netdev->mutex);
> >+    if(netdev->qos_conf){
> 
> A space is needed here
> 
> [...]
> >+static inline int
> >+__us_policer_pkt_handle(struct rte_meter_srtcm *meter, struct rte_mbuf
> >*pkt, uint64_t time)
> 
> CodingStyle.md says:
> 
>   - Do not use names that begin with _.  If you need a name for
>     "internal use only", use __ as a suffix instead of a prefix.
> 
> Unfortunately we have many exceptions, even in this file, but I would
> 
> prefer to have it changed

Thanks for these catches, will make these changes for version 2.

Regards
Ian
Daniele Di Proietto Oct. 13, 2015, 4:14 p.m. UTC | #4
On 12/10/2015 18:02, "Stokes, Ian" <ian.stokes@intel.com> wrote:

>Hi Daniele,
>
>Thanks for providing feedback, answers inline.
>
>> -----Original Message-----
>> From: Daniele Di Proietto [mailto:diproiettod@vmware.com]
>> Sent: Friday, October 09, 2015 6:53 PM
>> To: Stokes, Ian
>> Cc: dev@openvswitch.org
>> Subject: Re: [ovs-dev] [PATCH] netdev_dpdk.c: Add QoS functionality.
>> 
>> Hi,
>> 
>> Thanks for the patch, the implementation looks simpler than I expected.
>> 
>> General questions:
>> 
>> * It appears that we're using the srCTM, but
>>   * We're only sending the green packets and dropping the rest.
>>   * The `input_color` is always green.
>> 
>>   Therefore
>>   * `other-config:ebs` is ignored (isn't it?). I would rather not put in
>> the
>>     database a value that is ignored.
>
>Although we do not use the ebs value it is a requirement for the
>rte_meter initialization function. DPDK expects an ebs value
>greater than 0 or it fails to initialize the rte_meter.
>
>We can remove the other_config:ebs from the command line but
>will have to add a define in netdev_dpdk so as to supply some ebs
>value.
>
>Would this be acceptable?

If there aren't plans to use different input/output colors, that
would be fine for me.

Since that parameter is useless, I would even avoid a define and always
set it
to 0.

>>   * The `us_policer_table` seems redundant
>>   * We just need two `us_policer_action`s, or perhaps just a boolean.
>> 
>>   Can we just call this a token bucket? Are there any good reasons not
>> to?
>>   (Maybe you're planning to expand this, I'm not sure)
>
>Is this in relation to the name 'us-policer'or the name for the cbs and
>ebs paramaters? ?

The name is fine.  My point is:

This patch implements a token bucket, using (internally) srTCM.  Is it
necessary
to expose to the user the `ebs` parameter, or references to RFC2697, since
we're
not using the colors at all?

>
>I think we should change the us-policer name anyhow to signify that it
>is egress. Would 'egress_policer' be acceptable? Failing that we ca use
>token_bucket can be used. Once it's clear to the end user.
>
>> 
>> * Out of curiosity, are there plans to support different output queues?
>
>There are plans to introduce user space queues at a later date to allow
>buffering of traffic. This would be included as maybe a rate limiting type
>implementation where traffic is not dropped but instead buffered. This
>will be a separate submission in the future however.

Sorry, I wasn't clear enough.  By `different output queues` I meant having
different instances of a meter on the same port (that can be configured
using
the `Queue` table, and can be used via the `enqueue` OpenFlow action).

> 
>> 
>> * I did't experience any performance degradation when QoS is not
>> configured,
>>   but one can imagine that taking a spinlock might have some performance
>> impact
>>   when multiple threads insist on the same NIC. Have you considered
>> using
>> RCU
>>   for the `qos_conf` pointer? (I'm not saying that we absolutely should
>> at
>> this
>>   point)
>> 
>I haven't tried an RCU for this. I can look at this but if it's not a
>blocking issue perhaps it could be implemented at a later stage?

Absolutely, we can implement that later, if we find a bottleneck.

Thanks,

Daniele
Ben Pfaff Oct. 13, 2015, 4:24 p.m. UTC | #5
On Wed, Sep 30, 2015 at 01:45:15PM +0100, Ian Stokes wrote:
> This patch provides the modifications required in netdev-dpdk.c and
> vswitch.xml to allow for a DPDK user space QoS algorithm.
> 
> This patch adds a QoS configuration structure for netdev-dpdk and
> expected QoS operations 'dpdk_qos_ops'. Various helper functions
> are also supplied.
> 
> Also included are the modifications required for vswitch.xml to allow a
> new QoS implementation for netdev-dpdk devices. This includes a new QoS type
> `us-policer` as well as its expected QoS table entries.
> 
> The QoS functionality implemented for DPDK devices is `us-policer`.
> This is an egress policer used to drop packets at configurable rate.
> 
> The INSTALL.DPDK.md guide has also been modified to provide an example
> configuration of `us-policer` QoS.
> 
> Signed-off-by: Ian Stokes <ian.stokes@intel.com>

Hi Ian.  I've reviewed the documentation changes and the conceptual
design, but not the code itself.

First, I'm not sure why it's a good idea to introduce a policer to begin
with.  Our experience is that ingress policing does not produce useful
results.  Perhaps it will be more effective for egress, but I have my
doubts about that; otherwise why would shaping generally be preferred
for egress?

With what kinds of traffic has this implementation been tested?  How
accurate is it?  TCP and UDP respond quite differently to policing.
Additionally, policing can be especially hard on traffic that involves
IP fragments, since dropping a single fragment causes the entire IP
datagram to be discarded after soaking up considerable CPU time on the
destination host for reassembly.

Since DPDK is all about performance, I'd expect some commentary on the
performance of this technique.  How much CPU overhead does it require on
the sender?

I don't like this trend toward putting essentially all of the DPDK
documentation in INSTALL.DPDK.md.  This commit adds far more of the
configuration details to that file than to vswitch.xml.  I'd prefer it
to be the opposite, only adding an example or two to INSTALL.DPDK.md and
the bulk of the information to vswitch.xml.

I would have guessed that Intel NICs have hardware queuing and QoS
features.  If so, they are undoubtedly more CPU-efficient than software
versions.  Are there plans to make use of those features in DPDK?

Thanks,

Ben.
Stokes, Ian Oct. 14, 2015, 1:14 p.m. UTC | #6
> -----Original Message-----
> From: Ben Pfaff [mailto:blp@nicira.com]
> Sent: Tuesday, October 13, 2015 5:24 PM
> To: Stokes, Ian
> Cc: dev@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH] netdev_dpdk.c: Add QoS functionality.
> 
> On Wed, Sep 30, 2015 at 01:45:15PM +0100, Ian Stokes wrote:
> > This patch provides the modifications required in netdev-dpdk.c and
> > vswitch.xml to allow for a DPDK user space QoS algorithm.
> >
> > This patch adds a QoS configuration structure for netdev-dpdk and
> > expected QoS operations 'dpdk_qos_ops'. Various helper functions are
> > also supplied.
> >
> > Also included are the modifications required for vswitch.xml to allow
> > a new QoS implementation for netdev-dpdk devices. This includes a new
> > QoS type `us-policer` as well as its expected QoS table entries.
> >
> > The QoS functionality implemented for DPDK devices is `us-policer`.
> > This is an egress policer used to drop packets at configurable rate.
> >
> > The INSTALL.DPDK.md guide has also been modified to provide an example
> > configuration of `us-policer` QoS.
> >
> > Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> 
> Hi Ian.  I've reviewed the documentation changes and the conceptual
> design, but not the code itself.

Hi Ben, thanks for taking the time to review this.
> 
> First, I'm not sure why it's a good idea to introduce a policer to begin
> with.  Our experience is that ingress policing does not produce useful
> results.  Perhaps it will be more effective for egress, but I have my
> doubts about that; otherwise why would shaping generally be preferred
> for egress?

I understand and agree. From our side when we originally proposed the QoS API (netdev functions, QoS function pointers + helper functions) it was suggested that we provide a simple QoS algorithm along with it just to demonstrate how all these components are used.

The policer is a simple implementation and as such is limited in its application as you described above, this is certainly true for the physical port types, however there is a use case for virtual ports attached to virtual machines. 

It can be the case that the virtual machine performs some type of packet processing and will receive traffic from multiple ports to process. A user may make a conscious decision to limit traffic from one of these ports in order to allow traffic from other attached ports to be prioritized. This requires some pre-knowledge of ports and their traffic types but it is a use case we have heard echoed from both OVS and Openstack users.
   
Ultimately our goal will be to implement an egress rate limiter. However there are still design decisions regarding how it should be implemented that we need to discuss with the community first (user space queues for buffereing, algorithms for reading buffered packets etc). 

Our aim with this patch was to provide the QoS API required along with a simple QoS implementation so as to enable others in the community begin contributing on the QoS front as soon as possible.
> 
> With what kinds of traffic has this implementation been tested?  How
> accurate is it?  TCP and UDP respond quite differently to policing.
> Additionally, policing can be especially hard on traffic that involves
> IP fragments, since dropping a single fragment causes the entire IP
> datagram to be discarded after soaking up considerable CPU time on the
> destination host for reassembly.
>
We've tested with UDP and TCP. There are no provisions made for ip fragments and re-assembly cases. The situation you describe above can occur, but this is the nature of policing and as such that type of traffic will not be an ideal fit, the user should be aware(or made aware) of this, It can be flagged in the vswtich.xml maybe?
 
> Since DPDK is all about performance, I'd expect some commentary on the
> performance of this technique.  How much CPU overhead does it require on
> the sender?
> 
There is an overhead cost as the rte_meter will use cpu cyclces. How this can be measured is up for debate. The PMD will report 100% usage due to the nature of DPDK, so looking for the extra overhead here in terms of CPU % usage will not be immediately clear. We can provide the drop in expected performance in terms of traffic processing with the QoS on the port enabled and disabled? 

> I don't like this trend toward putting essentially all of the DPDK
> documentation in INSTALL.DPDK.md.  This commit adds far more of the
> configuration details to that file than to vswitch.xml.  I'd prefer it
> to be the opposite, only adding an example or two to INSTALL.DPDK.md and
> the bulk of the information to vswitch.xml.

Agreed, apologies for putting this info in the INSTALL.DPDK.md. I will move it to the vswitch.xml as per your recommendation.
> 
> I would have guessed that Intel NICs have hardware queuing and QoS
> features.  If so, they are undoubtedly more CPU-efficient than software
> versions.  Are there plans to make use of those features in DPDK?
> 
This would be correct for the physical ports at least; however this support is not present in DPDK as we speak. There are plans for this type of support but I'm not sure when it will be implemented. 

However a software solution can still be used for virtual port types and it could also support more basic physical port types that may not support QoS offload functionality to hardware.

> Thanks,
> 
> Ben.

Thanks
Ian
Stokes, Ian Oct. 14, 2015, 3:22 p.m. UTC | #7
> -----Original Message-----
> From: Daniele Di Proietto [mailto:diproiettod@vmware.com]
> Sent: Tuesday, October 13, 2015 5:14 PM
> To: Stokes, Ian
> Cc: dev@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH] netdev_dpdk.c: Add QoS functionality.
> 
> 
> 
> On 12/10/2015 18:02, "Stokes, Ian" <ian.stokes@intel.com> wrote:
> 
> >Hi Daniele,
> >
> >Thanks for providing feedback, answers inline.
> >
> >> -----Original Message-----
> >> From: Daniele Di Proietto [mailto:diproiettod@vmware.com]
> >> Sent: Friday, October 09, 2015 6:53 PM
> >> To: Stokes, Ian
> >> Cc: dev@openvswitch.org
> >> Subject: Re: [ovs-dev] [PATCH] netdev_dpdk.c: Add QoS functionality.
> >>
> >> Hi,
> >>
> >> Thanks for the patch, the implementation looks simpler than I
> expected.
> >>
> >> General questions:
> >>
> >> * It appears that we're using the srCTM, but
> >>   * We're only sending the green packets and dropping the rest.
> >>   * The `input_color` is always green.
> >>
> >>   Therefore
> >>   * `other-config:ebs` is ignored (isn't it?). I would rather not put
> >> in the
> >>     database a value that is ignored.
> >
> >Although we do not use the ebs value it is a requirement for the
> >rte_meter initialization function. DPDK expects an ebs value greater
> >than 0 or it fails to initialize the rte_meter.
> >
> >We can remove the other_config:ebs from the command line but will have
> >to add a define in netdev_dpdk so as to supply some ebs value.
> >
> >Would this be acceptable?
> 
> If there aren't plans to use different input/output colors, that would
> be fine for me.
> 
> Since that parameter is useless, I would even avoid a define and always
> set it to 0.

Agreed.
> >>   * The `us_policer_table` seems redundant
> >>   * We just need two `us_policer_action`s, or perhaps just a boolean.
> >>
> >>   Can we just call this a token bucket? Are there any good reasons
> >> not to?
> >>   (Maybe you're planning to expand this, I'm not sure)
> >
> >Is this in relation to the name 'us-policer'or the name for the cbs and
> >ebs paramaters? ?
> 
> The name is fine.  My point is:
> 
> This patch implements a token bucket, using (internally) srTCM.  Is it
> necessary to expose to the user the `ebs` parameter, or references to
> RFC2697, since we're not using the colors at all?
> 
> >
> >I think we should change the us-policer name anyhow to signify that it
> >is egress. Would 'egress_policer' be acceptable? Failing that we ca use
> >token_bucket can be used. Once it's clear to the end user.
> >
> >>
> >> * Out of curiosity, are there plans to support different output
> queues?
> >
> >There are plans to introduce user space queues at a later date to allow
> >buffering of traffic. This would be included as maybe a rate limiting
> >type implementation where traffic is not dropped but instead buffered.
> >This will be a separate submission in the future however.
> 
> Sorry, I wasn't clear enough.  By `different output queues` I meant
> having different instances of a meter on the same port (that can be
> configured using the `Queue` table, and can be used via the `enqueue`
> OpenFlow action).
> 
I haven't looked at this yet, although there is potential for work around this.

I had a similar thought to how multiple pmds could work with QoS? i.e. could there be a situation where to avoid the spin lock you could have a QoS conf per configured pmd (in this case there would be a meter per pmd). There's definitely some work to be done here in the future.
> >
> >>
> >> * I did't experience any performance degradation when QoS is not
> >> configured,
> >>   but one can imagine that taking a spinlock might have some
> >> performance impact
> >>   when multiple threads insist on the same NIC. Have you considered
> >> using RCU
> >>   for the `qos_conf` pointer? (I'm not saying that we absolutely
> >> should at this
> >>   point)
> >>
> >I haven't tried an RCU for this. I can look at this but if it's not a
> >blocking issue perhaps it could be implemented at a later stage?
> 
> Absolutely, we can implement that later, if we find a bottleneck.
> 
> Thanks,
> 
> Daniele

Patch
diff mbox

diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
index 7bf110c..61bca31 100644
--- a/INSTALL.DPDK.md
+++ b/INSTALL.DPDK.md
@@ -207,6 +207,58 @@  Using the DPDK with ovs-vswitchd:
    ./ovs-ofctl add-flow br0 in_port=2,action=output:1
    ```
 
+8. QoS usage example
+
+   OVS supports egress QoS for physical and virtual DPDK port types. Currently
+   there is only one supported QoS type 'us-policer'.
+
+   Assuming you have a vhost-user port transmitting traffic consisting of
+   packets of size 64 bytes, the following command would limit the egress
+   transmission rate of the port to ~1,000,000 packets per second:
+
+   `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create qos
+   type=us-policer other-config:cir=46000000 other-config:cbs=2048
+   other-config:ebs=2048`
+
+   An explanation of the us-policer specific parameters are detailed below.
+
+   `type=us-policer`: Sets the QoS type to userspace-policer. This is required
+   so that the dpdk-netdev can be configured with the correct QoS type.
+   The specifics of us-policer can be found in the vswitch.xml file.
+
+   `other-config:cir=46000000`: The Committed Information Rate (cir)
+   represents the bytes per second rate at which the token buckets will be
+   updated. See RFC2697 for specific details. The cir value is calculated by
+   (pps x packet data size). For example the value 46000000 can be broken
+   into '1,000,000 x 46'. Where 1,000,000 is the policing rate for the number
+   of packets per second and 46 represents the size of the packet data for a 64
+   byte ip packet.
+
+   `other-config:cbs=2048`: This represents the value of the Commited Burst Size
+   (cbs) token bucket. At a minimum this value should be be set to the expected
+   largest size packet in the traffic stream. In practice larger values may be
+   used to increase the size of the token bucket. If a packet can be
+   transmitted then the cbs will be decremented by the number of bytes/tokens of
+   the packet. If there are not enough tokens in the cbs bucket then the Excess
+   Burst Size (ebs) token bucket will be used too and excess bytes decremented
+   in the ebs.
+
+   `other-config:ebs=2048`: This represents the value of the Excess Burst Size
+   (ebs). At a minimum this value should be set to the expected largest size
+   packet in the traffic stream. In practice larger values may be used to
+   increase the size of the token bucket. In the us-policer implementation any
+   packet that requires tokens from the ebs token bucket will be marked to be
+   dropped. Any packet that has exhausted the ebs bucket will also be dropped.
+
+   To examine the QoS configuration of the port:
+
+   `ovs-appctl -t ovs-vswitchd qos/show vhost-user0`
+
+   To clear the QoS configuration from the port and ovsdb use the following:
+
+   `ovs-vsctl -- destroy QoS vhost-user0 -- clear Port vhost-user0 qos`
+
+
 Performance Tuning:
 -------------------
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index b72a33b..7e7fca1 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -44,6 +44,7 @@ 
 #include "ovs-rcu.h"
 #include "packets.h"
 #include "shash.h"
+#include "smap.h"
 #include "sset.h"
 #include "unaligned.h"
 #include "timeval.h"
@@ -52,12 +53,14 @@ 
 
 #include "rte_config.h"
 #include "rte_mbuf.h"
+#include "rte_meter.h"
 #include "rte_virtio_net.h"
 
 VLOG_DEFINE_THIS_MODULE(dpdk);
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 
 #define DPDK_PORT_WATCHDOG_INTERVAL 5
+#define DPDK_MAX_QOS_NAME_SIZE 10
 
 #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
 #define OVS_VPORT_DPDK "ovs_dpdk"
@@ -142,6 +145,122 @@  static int rte_eal_init_ret = ENODEV;
 
 static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
 
+/* Quality of Service */
+
+/* An instance of a QoS configuration.  Always associated with a particular
+ * network device.
+ *
+ * Each QoS implementation subclasses this with whatever additional data it
+ * needs.
+ */
+struct qos_conf{
+    const struct dpdk_qos_ops *ops;
+};
+
+/* A particular implementation of dpdk QoS operations.
+ *
+ * The functions below return 0 if successful or a positive errno value on
+ * failure, except where otherwise noted. All of them must be provided, except
+ * where otherwise noted.
+ */
+struct dpdk_qos_ops{
+
+    /* Name of the QoS type */
+    const char *qos_name;
+
+    /* Called to construct the QoS implementation on 'netdev'. The
+     * implementation should make the appropriate calls to configure QoS
+     * according to 'details'. The implementation may assume that any current
+     * QoS configuration already installed should be destroyed before
+     * constructing the new configuration.
+     *
+     * The contents of 'details' should be documented as valid for 'ovs_name'
+     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
+     * (which is built as ovs-vswitchd.conf.db(8)).
+     *
+     * This function must return 0 if and only if it sets 'netdev->qos_conf'
+     * to an initialized 'struct qos_conf'.
+     *
+     * For all QoS implementations it should always be non-null.
+     */
+    int (*qos_construct)(struct netdev *netdev, const struct smap *details);
+
+    /* Destroys the data structures allocated by the implementation as part of
+     * 'qos_conf.
+     *
+     * For all QoS implementations it should always be non-null.
+     */
+    void (*qos_destruct)(struct netdev *netdev, struct qos_conf *conf);
+
+    /* Retrieves details of 'netdev->qos_conf' configuration into 'details'.
+     *
+     * The contents of 'details' should be documented as valid for 'ovs_name'
+     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
+     * (which is built as ovs-vswitchd.conf.db(8)).
+     */
+    int (*qos_get)(const struct netdev *netdev, struct smap *details);
+
+    /* Reconfigures 'netdev->qos_conf' according to 'details', performing any
+     * required calls to complete the reconfiguration.
+     *
+     * The contents of 'details' should be documented as valid for 'ovs_name'
+     * in the "other_config" column in the "QoS" table in vswitchd/vswitch.xml
+     * (which is built as ovs-vswitchd.conf.db(8)).
+     *
+     * This function may be null if 'qos_conf' is not configurable.
+     */
+    int (*qos_set)(struct netdev *netdev, const struct smap *details);
+
+    /* Modify an array of rte_mbufs. The modification is specific to
+     * each qos implementation.
+     *
+     * The function should take and array of mbufs and an int representing
+     * the current number of mbufs present in the array.
+     *
+     * After the function has performed a qos modification to the array of mbufs
+     * it returns an int representing the number of mbufs now present in the
+     * array. This value is can then be passed to the port send function along
+     * with the modified array for transmission.
+     *
+     * For all QoS implementations it should always be non-null.
+     */
+    int (*qos_alg_process)(struct netdev *netdev, struct rte_mbuf **pkts,
+                           int pkt_cnt);
+};
+
+/* dpdk_qos_ops for each type of user space QoS implementation */
+static const struct dpdk_qos_ops us_policer_ops;
+
+/*
+ * Array of dpdk_qos_ops, contains pointer to all supported QoS
+ * operations.
+ */
+static const struct dpdk_qos_ops *const qos_confs[] = {
+    &us_policer_ops,               /*User Space Strict Priority Queuing*/
+    NULL
+};
+
+/* Action that can be set for a packet for rte_meter */
+enum us_policer_action {
+        GREEN = e_RTE_METER_GREEN,
+        YELLOW = e_RTE_METER_YELLOW,
+        RED = e_RTE_METER_RED,
+        DROP = 3,
+};
+
+/*
+ * Table representing the policing policy. Rows indicate the input color,
+ * columns indicate the output color, and the value that is stored in
+ * the table indicates the action to be taken for that particular case.
+ */
+enum us_policer_action us_policer_table[e_RTE_METER_COLORS]
+                                       [e_RTE_METER_COLORS] =
+{
+	{ GREEN, DROP, DROP},
+	{ DROP, DROP, DROP},
+	{ DROP, DROP, DROP}
+};
+
 /* Contains all 'struct dpdk_dev's. */
 static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
     = OVS_LIST_INITIALIZER(&dpdk_list);
@@ -235,6 +354,11 @@  struct netdev_dpdk {
 
     /* In dpdk_list. */
     struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
+
+    /* QoS configuration and lock for the device */
+    struct qos_conf *qos_conf;
+    rte_spinlock_t qos_lock;
+
 };
 
 struct netdev_rxq_dpdk {
@@ -614,6 +738,10 @@  netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
         goto unlock;
     }
 
+    /* Initialise QoS configuration to NULL and qos lock to unlocked */
+    netdev->qos_conf = NULL;
+    rte_spinlock_init(&netdev->qos_lock);
+
     netdev_->n_txq = NR_QUEUE;
     netdev_->n_rxq = NR_QUEUE;
     netdev->real_n_txq = NR_QUEUE;
@@ -1072,6 +1200,13 @@  __netdev_dpdk_vhost_send(struct netdev *netdev, struct dp_packet **pkts,
         goto out;
     }
 
+    /* Check has QoS has been configured for the netdev */
+    rte_spinlock_lock(&vhost_dev->qos_lock);
+    if (vhost_dev->qos_conf != NULL) {
+        cnt = vhost_dev->qos_conf->ops->qos_alg_process(netdev, cur_pkts, cnt);
+    }
+    rte_spinlock_unlock(&vhost_dev->qos_lock);
+
     /* There is vHost TX single queue, So we need to lock it for TX. */
     rte_spinlock_lock(&vhost_dev->vhost_tx_lock);
 
@@ -1216,6 +1351,14 @@  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet **pkts,
     if (dev->type == DPDK_DEV_VHOST) {
         __netdev_dpdk_vhost_send(netdev, (struct dp_packet **) mbufs, newcnt, true);
     } else {
+
+        /* Check if QoS has been configured for this netdev. */
+        rte_spinlock_lock(&dev->qos_lock);
+        if (dev->qos_conf != NULL) {
+            newcnt = dev->qos_conf->ops->qos_alg_process(netdev, mbufs, newcnt);
+        }
+        rte_spinlock_unlock(&dev->qos_lock);
+
         dpdk_queue_pkts(dev, qid, mbufs, newcnt);
         dpdk_queue_flush(dev, qid);
     }
@@ -1289,9 +1432,18 @@  netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
             }
         }
         if (next_tx_idx != cnt) {
-           dpdk_queue_pkts(dev, qid,
-                            (struct rte_mbuf **)&pkts[next_tx_idx],
-                            cnt-next_tx_idx);
+            cnt -= next_tx_idx;
+
+            /* Check if QoS has been configured for this netdev. */
+            rte_spinlock_lock(&dev->qos_lock);
+            if (dev->qos_conf != NULL) {
+                struct netdev *netdev = &dev->up;
+                cnt = dev->qos_conf->ops->qos_alg_process(netdev,
+                        (struct rte_mbuf**)pkts, cnt);
+            }
+            rte_spinlock_unlock(&dev->qos_lock);
+
+            dpdk_queue_pkts(dev, qid, (struct rte_mbuf **)&pkts[next_tx_idx], cnt);
         }
 
         if (OVS_UNLIKELY(dropped)) {
@@ -2032,6 +2184,256 @@  unlock_dpdk:
     return err;
 }
 
+/* QoS Functions */
+
+/*
+ * Initialize QoS configuration operations.
+ */
+static void
+qos_conf_init(struct qos_conf *conf, const struct dpdk_qos_ops *ops)
+{
+    conf->ops = ops;
+}
+
+/*
+ * Search existing QoS operations in qos_ops and compare each set of operations
+ * qos_name to name. Return a dpdk_qos_ops pointer to a match, else return
+ * NULL
+ */
+static const struct dpdk_qos_ops *
+qos_lookup_name(const char *name)
+{
+    const struct dpdk_qos_ops *const *opsp;
+
+    for (opsp = qos_confs; *opsp != NULL; opsp++) {
+        const struct dpdk_qos_ops *ops = *opsp;
+        if (!strncmp(name, ops->qos_name,DPDK_MAX_QOS_NAME_SIZE)) {
+            return ops;
+        }
+    }
+    return NULL;
+}
+
+/*
+ * Call qos_destruct to clean up items associated with the netdevs
+ * qos_conf. Set netdevs qos_conf to NULL.
+ */
+static void
+qos_delete_conf(struct netdev *netdev_)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+
+    rte_spinlock_lock(&netdev->qos_lock);
+    if (netdev->qos_conf) {
+        if (netdev->qos_conf->ops->qos_destruct) {
+            netdev->qos_conf->ops->qos_destruct(netdev_, netdev->qos_conf);
+        }
+        netdev->qos_conf = NULL;
+    }
+    rte_spinlock_unlock(&netdev->qos_lock);
+}
+
+static int
+netdev_dpdk_get_qos_types(const struct netdev *netdev OVS_UNUSED,
+                           struct sset *types)
+{
+    const struct dpdk_qos_ops *const *opsp;
+
+    for (opsp = qos_confs; *opsp != NULL; opsp++) {
+        const struct dpdk_qos_ops *ops = *opsp;
+        if (ops->qos_construct && ops->qos_name[0] != '\0') {
+            sset_add(types, ops->qos_name);
+        }
+    }
+    return 0;
+}
+
+static int
+netdev_dpdk_get_qos(const struct netdev *netdev_,
+                    const char **typep, struct smap *details)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    int error = 0;
+
+    ovs_mutex_lock(&netdev->mutex);
+    if(netdev->qos_conf){
+        *typep = netdev->qos_conf->ops->qos_name;
+        error = (netdev->qos_conf->ops->qos_get
+                 ? netdev->qos_conf->ops->qos_get(netdev_, details): 0);
+    }
+    ovs_mutex_unlock(&netdev->mutex);
+
+    return error;
+}
+
+static int
+netdev_dpdk_set_qos(struct netdev *netdev_,
+                    const char *type, const struct smap *details)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    const struct dpdk_qos_ops *new_ops = NULL;
+    int error = 0;
+
+    /* If type is empty the current QoS configuration
+	   for the dpdk-netdev can be destroyed */
+    if(type[0] == '\0') {
+        qos_delete_conf(netdev_);
+    }
+
+    new_ops = qos_lookup_name(type);
+    if (!new_ops || !new_ops->qos_construct) {
+        return EOPNOTSUPP;
+    }
+
+    ovs_mutex_lock(&netdev->mutex);
+    if (netdev->qos_conf) {
+        if (new_ops == netdev->qos_conf->ops) {
+            error = new_ops->qos_set ? new_ops->qos_set(netdev_, details) : 0;
+        }
+        else {
+            /* Delete existing QoS configuration. */
+            qos_delete_conf(netdev_);
+            ovs_assert(netdev->qos_conf == NULL);
+
+            /* Install new QoS configuration. */
+            error = new_ops->qos_construct(netdev_, details);
+            ovs_assert((error == 0) == (netdev->qos_conf != NULL));
+        }
+    } else {
+        error = new_ops->qos_construct(netdev_, details);
+        ovs_assert((error == 0) == (netdev->qos_conf != NULL));
+    }
+
+    ovs_mutex_unlock(&netdev->mutex);
+    return error;
+}
+
+/* us-policer details */
+
+struct us_policer {
+    struct qos_conf qos_conf;
+    struct rte_meter_srtcm_params app_srtcm_params;
+    struct rte_meter_srtcm test_meter;
+};
+
+static struct us_policer *
+us_policer_get__(const struct netdev *netdev_)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    return CONTAINER_OF(netdev->qos_conf, struct us_policer, qos_conf);
+}
+
+static int
+us_policer_qos_construct(struct netdev *netdev_, const struct smap *details)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+    struct us_policer *policer;
+    const char *cir_s;
+    const char *cbs_s;
+    const char *ebs_s;
+    int err = 0;
+
+    rte_spinlock_lock(&netdev->qos_lock);
+    policer = xmalloc(sizeof *policer);
+    qos_conf_init(&policer->qos_conf, &us_policer_ops);
+    netdev->qos_conf = &policer->qos_conf;
+    cir_s = smap_get(details, "cir");
+    cbs_s = smap_get(details, "cbs");
+    ebs_s = smap_get(details, "ebs");
+    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10) : 0;
+    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10) : 0;
+    policer->app_srtcm_params.ebs = ebs_s ? strtoull(ebs_s, NULL, 10) : 0;
+    err = rte_meter_srtcm_config(&policer->test_meter, &policer->app_srtcm_params);
+    rte_spinlock_unlock(&netdev->qos_lock);
+
+    return err;
+}
+
+static void
+us_policer_qos_destruct(struct netdev *netdev_ OVS_UNUSED, struct qos_conf *conf)
+{
+    struct us_policer *policer = CONTAINER_OF(conf, struct us_policer, qos_conf);
+    free(policer);
+}
+
+static int
+us_policer_qos_get(const struct netdev *netdev, struct smap *details)
+{
+    struct us_policer *policer = us_policer_get__(netdev);
+    smap_add_format(details, "cir", "%llu", 1ULL * policer->app_srtcm_params.cir);
+    smap_add_format(details, "cbs", "%llu", 1ULL * policer->app_srtcm_params.cbs);
+    smap_add_format(details, "ebs", "%llu", 1ULL * policer->app_srtcm_params.ebs);
+    return 0;
+}
+
+static int
+us_policer_qos_set(struct netdev *netdev_, const struct smap *details)
+{
+    struct us_policer *policer;
+    const char *cir_s;
+    const char *cbs_s;
+    const char *ebs_s;
+    int err = 0;
+
+    policer = us_policer_get__(netdev_);
+    cir_s = smap_get(details, "cir");
+    cbs_s = smap_get(details, "cbs");
+    ebs_s = smap_get(details, "ebs");
+    policer->app_srtcm_params.cir = cir_s ? strtoull(cir_s, NULL, 10) : 0;
+    policer->app_srtcm_params.cbs = cbs_s ? strtoull(cbs_s, NULL, 10) : 0;
+    policer->app_srtcm_params.ebs = ebs_s ? strtoull(ebs_s, NULL, 10) : 0;
+    err = rte_meter_srtcm_config(&policer->test_meter, &policer->app_srtcm_params);
+
+    return err;
+}
+
+static inline int
+__us_policer_pkt_handle(struct rte_meter_srtcm *meter, struct rte_mbuf *pkt, uint64_t time)
+{
+	uint8_t input_color, output_color;
+	uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct ether_hdr);
+	input_color = 0;
+	enum us_policer_action action;
+
+	/* color input is not used for blind modes */
+	output_color = (uint8_t) rte_meter_srtcm_color_blind_check(meter, time, pkt_len);
+
+	/* Apply policing and set the output color */
+	action = us_policer_table[input_color][output_color];
+
+	return action;
+}
+
+static int
+us_policer_alg_process(struct netdev *netdev_, struct rte_mbuf **pkts, int pkt_cnt)
+{
+	int i = 0;
+    int cnt = pkt_cnt;
+    struct us_policer *policer = us_policer_get__(netdev_);
+    struct rte_mbuf *pkt = NULL;
+    uint64_t current_time = rte_rdtsc();
+
+    for(i = 0; i < pkt_cnt; i++) {
+        pkt = pkts[i];
+        /* Handle current packet */
+        if (__us_policer_pkt_handle(&policer->test_meter, pkt, current_time) == DROP) {
+            rte_pktmbuf_free(pkt);
+            cnt = cnt - 1;
+        }
+    }
+
+    return cnt;
+}
+
+static const struct dpdk_qos_ops us_policer_ops = {
+    "us-policer",    /* qos_name */
+    us_policer_qos_construct,
+	us_policer_qos_destruct,
+    us_policer_qos_get,
+    us_policer_qos_set,
+	us_policer_alg_process
+};
+
 #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
     GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
 {                                                             \
@@ -2069,10 +2471,10 @@  unlock_dpdk:
     NULL,                       /* set_advertisements */      \
                                                               \
     NULL,                       /* set_policing */            \
-    NULL,                       /* get_qos_types */           \
+    netdev_dpdk_get_qos_types,                                \
     NULL,                       /* get_qos_capabilities */    \
-    NULL,                       /* get_qos */                 \
-    NULL,                       /* set_qos */                 \
+    netdev_dpdk_get_qos,                                      \
+    netdev_dpdk_set_qos,                                      \
     NULL,                       /* get_queue */               \
     NULL,                       /* set_queue */               \
     NULL,                       /* delete_queue */            \
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 0ab4a9a..77f70a4 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3283,6 +3283,18 @@ 
           for information on how this classifier works.
         </dd>
       </dl>
+      <dl>
+        <dt><code>us-policer</code></dt>
+        <dd>
+          Userspace egress policer algorithm. This implementation uses the DPDK
+          rte_meter library. The rte_meter library provides an implementation of
+          srTCM (RFC2697) which allows the metering and policing of traffic.
+          For more information regarding the usage of the DPDK rte_meter see
+          <code>http://dpdk.org/doc/guides/sample_app_ug/qos_metering.html</code>
+          For more information regarding srTCM see
+          <code>https://tools.ietf.org/html/rfc2697</code>
+        </dd>
+      </dl>
     </column>
 
     <column name="queues">
@@ -3319,6 +3331,34 @@ 
       </column>
     </group>
 
+    <group title="Configuration for us-policer QoS">
+      <p>
+        <ref table="QoS"/> <ref table="QoS" column="type"/>
+        <code>us-policer</code> provides egress policing for userspace port
+        types with DPDK.
+
+        It has the following key-value pairs defined.
+      </p>
+
+      <column name="other_config" key="cir" type='{"type": "integer"}'>
+        The Committed Information Rate (CIR) is measured in bytes of IP packets
+        per second, i.e., it includes the IP header, but not link specific
+        (e.g. Ethernet) headers. See RFC2697 for more details.
+      </column>
+      <column name="other_config" key="cbs" type='{"type": "integer"}'>
+        The Committed Burst Size (CBS) is measured in bytes. It must be set to a
+        value greater than 0. It is recommended that the minimum value for CBS
+        be the size of the largest possible IP packet in the stream.
+        See RFC2697 for more details.
+      </column>
+      <column name="other_config" key="ebs" type='{"type": "integer"}'>
+        The Excess Burst Size (EBS) is measured in bytes. It must be set to a
+        value greater than 0. It is recommended that the minimum value for EBS
+        be the size of the largest possible IP packet in the stream.
+        See RFC2697 for more details.
+      </column>
+    </group>
+
     <group title="Common Columns">
       The overall purpose of these columns is described under <code>Common
       Columns</code> at the beginning of this document.