[ovs-dev,v5,2/2] netdev-dpdk: Add new DPDK RFC 4115 egress policer
diff mbox series

Message ID 20200114161247.19950.22828.stgit@netdev64
State Accepted
Headers show
Series
  • netdev-dpdk: Add new DPDK RFC 4115 egress policer
Related show

Commit Message

Eelco Chaudron Jan. 14, 2020, 4:12 p.m. UTC
This patch adds a new policer to the DPDK datapath based on RFC 4115's
Two-Rate, Three-Color marker. It's a two-level hierarchical policer
which first does a color-blind marking of the traffic at the queue
level, followed by a color-aware marking at the port level. At the end
traffic marked as Green or Yellow is forwarded, Red is dropped. For
details on how traffic is marked, see RFC 4115.

This egress policer can be used to limit traffic at different rated
based on the queues the traffic is in. In addition, it can also be used
to prioritize certain traffic over others at a port level.

For example, the following configuration will limit the traffic rate at a
port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
Information Rate). High priority traffic is routed to queue 10, which marks
all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
marked as EIR, i.e. Yellow.

ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
  --id=@myqos create qos type=trtcm-policer \
  other-config:cir=52000 other-config:cbs=2048 \
  other-config:eir=52000 other-config:ebs=2048  \
  queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
  --id=@dpdk1Q10 create queue \
    other-config:cir=41600000 other-config:cbs=2048 \
    other-config:eir=0 other-config:ebs=0 -- \
  --id=@dpdk1Q20 create queue \
    other-config:cir=0 other-config:cbs=0 \
    other-config:eir=41600000 other-config:ebs=2048 \

This configuration accomplishes that the high priority traffic has a
guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
use the EIR, so a total of 2000pps at max. These additional 1000pps is
shared with the low priority traffic. The low priority traffic can use at
maximum 1000pps.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
---
 Documentation/topics/dpdk/qos.rst |   43 ++++
 lib/netdev-dpdk.c                 |  359 +++++++++++++++++++++++++++++++++++--
 vswitchd/vswitch.xml              |   34 ++++
 3 files changed, 421 insertions(+), 15 deletions(-)

Comments

Stokes, Ian Jan. 15, 2020, 11:28 a.m. UTC | #1
On 1/14/2020 4:12 PM, Eelco Chaudron wrote:
> This patch adds a new policer to the DPDK datapath based on RFC 4115's
> Two-Rate, Three-Color marker. It's a two-level hierarchical policer
> which first does a color-blind marking of the traffic at the queue
> level, followed by a color-aware marking at the port level. At the end
> traffic marked as Green or Yellow is forwarded, Red is dropped. For
> details on how traffic is marked, see RFC 4115.
> 
> This egress policer can be used to limit traffic at different rated
> based on the queues the traffic is in. In addition, it can also be used
> to prioritize certain traffic over others at a port level.
> 
> For example, the following configuration will limit the traffic rate at a
> port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
> 100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
> Information Rate). High priority traffic is routed to queue 10, which marks
> all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
> marked as EIR, i.e. Yellow.
> 
> ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
>    --id=@myqos create qos type=trtcm-policer \
>    other-config:cir=52000 other-config:cbs=2048 \
>    other-config:eir=52000 other-config:ebs=2048  \
>    queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
>    --id=@dpdk1Q10 create queue \
>      other-config:cir=41600000 other-config:cbs=2048 \
>      other-config:eir=0 other-config:ebs=0 -- \
>    --id=@dpdk1Q20 create queue \
>      other-config:cir=0 other-config:cbs=0 \
>      other-config:eir=41600000 other-config:ebs=2048 \
> 
> This configuration accomplishes that the high priority traffic has a
> guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
> use the EIR, so a total of 2000pps at max. These additional 1000pps is
> shared with the low priority traffic. The low priority traffic can use at
> maximum 1000pps.
> 
Thanks for the patch Eelco, minor comment below.

<snip>

>   
>   Rate Limiting (Ingress Policing)
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 128963f..1ed4a47 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -26,6 +26,12 @@
>   #include <sys/socket.h>
>   #include <linux/if.h>
>   
> +/* Include rte_compat.h first to allow experimental API's needed for the
> + * rte_meter.h rfc4115 functions. Once they are no longer marked as
> + * experimental the #define and rte_compat.h include can be removed.
> + */
> +#define ALLOW_EXPERIMENTAL_API
> +#include <rte_compat.h>

I guess the risk here, from what I understand, this approach is all or 
nothing in terms of experimental APIs from the included headers, other 
experimental APIs could be used from DPDK going forward without causing 
warning?

If so, there would have to be extra dilligence taken when reviewing 
future patches and a discussion if an API is expereimental, should it 
wait until it is marked as non experimental. (In this case The TRTCM 
looks stable and is highly unlikey to be removed so it's not an issue IMO).

@Ilya/Kevin: Would you agree with above? Thoughts?

Ian
>   #include <rte_bus_pci.h>
>   #include <rte_config.h>
>   #include <rte_cycles.h>
> @@ -329,8 +335,9 @@ struct dpdk_qos_ops {
>                                        struct netdev_dpdk_queue_state *state);
>   };
>   
> -/* dpdk_qos_ops for each type of user space QoS implementation */
> +/* dpdk_qos_ops for each type of user space QoS implementation. */
>   static const struct dpdk_qos_ops egress_policer_ops;
> +static const struct dpdk_qos_ops trtcm_policer_ops;
>   
>   /*
>    * Array of dpdk_qos_ops, contains pointer to all supported QoS
> @@ -338,6 +345,7 @@ static const struct dpdk_qos_ops egress_policer_ops;
>    */
>   static const struct dpdk_qos_ops *const qos_confs[] = {
>       &egress_policer_ops,
> +    &trtcm_policer_ops,
>       NULL
>   };
>   
> @@ -2162,9 +2170,9 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
>   }
>   
>   static inline bool
> -netdev_dpdk_policer_pkt_handle(struct rte_meter_srtcm *meter,
> -                               struct rte_meter_srtcm_profile *profile,
> -                               struct rte_mbuf *pkt, uint64_t time)
> +netdev_dpdk_srtcm_policer_pkt_handle(struct rte_meter_srtcm *meter,
> +                                     struct rte_meter_srtcm_profile *profile,
> +                                     struct rte_mbuf *pkt, uint64_t time)
>   {
>       uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct rte_ether_hdr);
>   
> @@ -2173,10 +2181,10 @@ netdev_dpdk_policer_pkt_handle(struct rte_meter_srtcm *meter,
>   }
>   
>   static int
> -netdev_dpdk_policer_run(struct rte_meter_srtcm *meter,
> -                        struct rte_meter_srtcm_profile *profile,
> -                        struct rte_mbuf **pkts, int pkt_cnt,
> -                        bool should_steal)
> +srtcm_policer_run_single_packet(struct rte_meter_srtcm *meter,
> +                                struct rte_meter_srtcm_profile *profile,
> +                                struct rte_mbuf **pkts, int pkt_cnt,
> +                                bool should_steal)
>   {
>       int i = 0;
>       int cnt = 0;
> @@ -2186,8 +2194,8 @@ netdev_dpdk_policer_run(struct rte_meter_srtcm *meter,
>       for (i = 0; i < pkt_cnt; i++) {
>           pkt = pkts[i];
>           /* Handle current packet */
> -        if (netdev_dpdk_policer_pkt_handle(meter, profile,
> -                                           pkt, current_time)) {
> +        if (netdev_dpdk_srtcm_policer_pkt_handle(meter, profile,
> +                                                 pkt, current_time)) {
>               if (cnt != i) {
>                   pkts[cnt] = pkt;
>               }
> @@ -2209,8 +2217,9 @@ ingress_policer_run(struct ingress_policer *policer, struct rte_mbuf **pkts,
>       int cnt = 0;
>   
>       rte_spinlock_lock(&policer->policer_lock);
> -    cnt = netdev_dpdk_policer_run(&policer->in_policer, &policer->in_prof,
> -                                  pkts, pkt_cnt, should_steal);
> +    cnt = srtcm_policer_run_single_packet(&policer->in_policer,
> +                                          &policer->in_prof,
> +                                          pkts, pkt_cnt, should_steal);
>       rte_spinlock_unlock(&policer->policer_lock);
>   
>       return cnt;
> @@ -4480,9 +4489,9 @@ egress_policer_run(struct qos_conf *conf, struct rte_mbuf **pkts, int pkt_cnt,
>       struct egress_policer *policer =
>           CONTAINER_OF(conf, struct egress_policer, qos_conf);
>   
> -    cnt = netdev_dpdk_policer_run(&policer->egress_meter,
> -                                  &policer->egress_prof, pkts,
> -                                  pkt_cnt, should_steal);
> +    cnt = srtcm_policer_run_single_packet(&policer->egress_meter,
> +                                          &policer->egress_prof, pkts,
> +                                          pkt_cnt, should_steal);
>   
>       return cnt;
>   }
> @@ -4496,6 +4505,326 @@ static const struct dpdk_qos_ops egress_policer_ops = {
>       .qos_run = egress_policer_run
>   };
>   
> +/* trtcm-policer details */
> +
> +struct trtcm_policer {
> +    struct qos_conf qos_conf;
> +    struct rte_meter_trtcm_rfc4115_params meter_params;
> +    struct rte_meter_trtcm_rfc4115_profile meter_profile;
> +    struct rte_meter_trtcm_rfc4115 meter;
> +    struct netdev_queue_stats stats;
> +    struct hmap queues;
> +};
> +
> +struct trtcm_policer_queue {
> +    struct hmap_node hmap_node;
> +    uint32_t queue_id;
> +    struct rte_meter_trtcm_rfc4115_params meter_params;
> +    struct rte_meter_trtcm_rfc4115_profile meter_profile;
> +    struct rte_meter_trtcm_rfc4115 meter;
> +    struct netdev_queue_stats stats;
> +};
> +
> +static void
> +trtcm_policer_details_to_param(const struct smap *details,
> +                               struct rte_meter_trtcm_rfc4115_params *params)
> +{
> +    memset(params, 0, sizeof *params);
> +    params->cir = smap_get_ullong(details, "cir", 0);
> +    params->eir = smap_get_ullong(details, "eir", 0);
> +    params->cbs = smap_get_ullong(details, "cbs", 0);
> +    params->ebs = smap_get_ullong(details, "ebs", 0);
> +}
> +
> +static void
> +trtcm_policer_param_to_detail(
> +    const struct rte_meter_trtcm_rfc4115_params *params,
> +    struct smap *details)
> +{
> +    smap_add_format(details, "cir", "%"PRIu64, params->cir);
> +    smap_add_format(details, "eir", "%"PRIu64, params->eir);
> +    smap_add_format(details, "cbs", "%"PRIu64, params->cbs);
> +    smap_add_format(details, "ebs", "%"PRIu64, params->ebs);
> +}
> +
> +
> +static int
> +trtcm_policer_qos_construct(const struct smap *details,
> +                            struct qos_conf **conf)
> +{
> +    struct trtcm_policer *policer;
> +    int err = 0;
> +
> +    policer = xmalloc(sizeof *policer);
> +    qos_conf_init(&policer->qos_conf, &trtcm_policer_ops);
> +    trtcm_policer_details_to_param(details, &policer->meter_params);
> +    err = rte_meter_trtcm_rfc4115_profile_config(&policer->meter_profile,
> +                                                 &policer->meter_params);
> +    if (!err) {
> +        err = rte_meter_trtcm_rfc4115_config(&policer->meter,
> +                                             &policer->meter_profile);
> +    }
> +
> +    if (!err) {
> +        *conf = &policer->qos_conf;
> +        memset(&policer->stats, 0, sizeof policer->stats);
> +        hmap_init(&policer->queues);
> +    } else {
> +        free(policer);
> +        *conf = NULL;
> +        err = -err;
> +    }
> +
> +    return err;
> +}
> +
> +static void
> +trtcm_policer_qos_destruct(struct qos_conf *conf)
> +{
> +    struct trtcm_policer_queue *queue, *next_queue;
> +    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
> +                                                 qos_conf);
> +
> +    HMAP_FOR_EACH_SAFE (queue, next_queue, hmap_node, &policer->queues) {
> +        hmap_remove(&policer->queues, &queue->hmap_node);
> +        free(queue);
> +    }
> +    hmap_destroy(&policer->queues);
> +    free(policer);
> +}
> +
> +static int
> +trtcm_policer_qos_get(const struct qos_conf *conf, struct smap *details)
> +{
> +    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
> +                                                 qos_conf);
> +
> +    trtcm_policer_param_to_detail(&policer->meter_params, details);
> +    return 0;
> +}
> +
> +static bool
> +trtcm_policer_qos_is_equal(const struct qos_conf *conf,
> +                           const struct smap *details)
> +{
> +    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
> +                                                 qos_conf);
> +    struct rte_meter_trtcm_rfc4115_params params;
> +
> +    trtcm_policer_details_to_param(details, &params);
> +
> +    return !memcmp(&params, &policer->meter_params, sizeof params);
> +}
> +
> +static struct trtcm_policer_queue *
> +trtcm_policer_qos_find_queue(struct trtcm_policer *policer, uint32_t queue_id)
> +{
> +    struct trtcm_policer_queue *queue;
> +    HMAP_FOR_EACH_WITH_HASH (queue, hmap_node, hash_2words(queue_id, 0),
> +                             &policer->queues) {
> +        if (queue->queue_id == queue_id) {
> +            return queue;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +static inline bool
> +trtcm_policer_run_single_packet(struct trtcm_policer *policer,
> +                                struct rte_mbuf *pkt, uint64_t time)
> +{
> +    enum rte_color pkt_color;
> +    struct trtcm_policer_queue *queue;
> +    uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct rte_ether_hdr);
> +    struct dp_packet *dpkt = CONTAINER_OF(pkt, struct dp_packet, mbuf);
> +
> +    queue = trtcm_policer_qos_find_queue(policer, dpkt->md.skb_priority);
> +    if (!queue) {
> +        /* If no queue is found, use the default queue, which MUST exist. */
> +        queue = trtcm_policer_qos_find_queue(policer, 0);
> +        if (!queue) {
> +            return false;
> +        }
> +    }
> +
> +    pkt_color = rte_meter_trtcm_rfc4115_color_blind_check(&queue->meter,
> +                                                          &queue->meter_profile,
> +                                                          time,
> +                                                          pkt_len);
> +
> +    if (pkt_color == RTE_COLOR_RED) {
> +        queue->stats.tx_errors++;
> +    } else {
> +        queue->stats.tx_bytes += pkt_len;
> +        queue->stats.tx_packets++;
> +    }
> +
> +    pkt_color = rte_meter_trtcm_rfc4115_color_aware_check(&policer->meter,
> +                                                     &policer->meter_profile,
> +                                                     time, pkt_len,
> +                                                     pkt_color);
> +
> +    if (pkt_color == RTE_COLOR_RED) {
> +        policer->stats.tx_errors++;
> +        return false;
> +    }
> +
> +    policer->stats.tx_bytes += pkt_len;
> +    policer->stats.tx_packets++;
> +    return true;
> +}
> +
> +static int
> +trtcm_policer_run(struct qos_conf *conf, struct rte_mbuf **pkts, int pkt_cnt,
> +                  bool should_steal)
> +{
> +    int i = 0;
> +    int cnt = 0;
> +    struct rte_mbuf *pkt = NULL;
> +    uint64_t current_time = rte_rdtsc();
> +
> +    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
> +                                                 qos_conf);
> +
> +    for (i = 0; i < pkt_cnt; i++) {
> +        pkt = pkts[i];
> +
> +        if (trtcm_policer_run_single_packet(policer, pkt, current_time)) {
> +            if (cnt != i) {
> +                pkts[cnt] = pkt;
> +            }
> +            cnt++;
> +        } else {
> +            if (should_steal) {
> +                rte_pktmbuf_free(pkt);
> +            }
> +        }
> +    }
> +    return cnt;
> +}
> +
> +static int
> +trtcm_policer_qos_queue_construct(const struct smap *details,
> +                                  uint32_t queue_id, struct qos_conf *conf)
> +{
> +    int err = 0;
> +    struct trtcm_policer_queue *queue;
> +    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
> +                                                 qos_conf);
> +
> +    queue = trtcm_policer_qos_find_queue(policer, queue_id);
> +    if (!queue) {
> +        queue = xmalloc(sizeof *queue);
> +        queue->queue_id = queue_id;
> +        memset(&queue->stats, 0, sizeof queue->stats);
> +        queue->stats.created = time_msec();
> +        hmap_insert(&policer->queues, &queue->hmap_node,
> +                    hash_2words(queue_id, 0));
> +    }
> +    if (queue_id == 0 && smap_is_empty(details)) {
> +        /* No default queue configured, use port values */
> +        memcpy(&queue->meter_params, &policer->meter_params,
> +               sizeof queue->meter_params);
> +    } else {
> +        trtcm_policer_details_to_param(details, &queue->meter_params);
> +    }
> +
> +    err = rte_meter_trtcm_rfc4115_profile_config(&queue->meter_profile,
> +                                                 &queue->meter_params);
> +
> +    if (!err) {
> +        err = rte_meter_trtcm_rfc4115_config(&queue->meter,
> +                                             &queue->meter_profile);
> +    }
> +    if (err) {
> +        hmap_remove(&policer->queues, &queue->hmap_node);
> +        free(queue);
> +        err = -err;
> +    }
> +    return err;
> +}
> +
> +static void
> +trtcm_policer_qos_queue_destruct(struct qos_conf *conf, uint32_t queue_id)
> +{
> +    struct trtcm_policer_queue *queue;
> +    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
> +                                                 qos_conf);
> +
> +    queue = trtcm_policer_qos_find_queue(policer, queue_id);
> +    if (queue) {
> +        hmap_remove(&policer->queues, &queue->hmap_node);
> +        free(queue);
> +    }
> +}
> +
> +static int
> +trtcm_policer_qos_queue_get(struct smap *details, uint32_t queue_id,
> +                            const struct qos_conf *conf)
> +{
> +    struct trtcm_policer_queue *queue;
> +    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
> +                                                 qos_conf);
> +
> +    queue = trtcm_policer_qos_find_queue(policer, queue_id);
> +    if (!queue) {
> +        return EINVAL;
> +    }
> +
> +    trtcm_policer_param_to_detail(&queue->meter_params, details);
> +    return 0;
> +}
> +
> +static int
> +trtcm_policer_qos_queue_get_stats(const struct qos_conf *conf,
> +                                  uint32_t queue_id,
> +                                  struct netdev_queue_stats *stats)
> +{
> +    struct trtcm_policer_queue *queue;
> +    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
> +                                                 qos_conf);
> +
> +    queue = trtcm_policer_qos_find_queue(policer, queue_id);
> +    if (!queue) {
> +        return EINVAL;
> +    }
> +    memcpy(stats, &queue->stats, sizeof *stats);
> +    return 0;
> +}
> +
> +static int
> +trtcm_policer_qos_queue_dump_state_init(const struct qos_conf *conf,
> +                                        struct netdev_dpdk_queue_state *state)
> +{
> +    uint32_t i = 0;
> +    struct trtcm_policer_queue *queue;
> +    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
> +                                                 qos_conf);
> +
> +    state->n_queues = hmap_count(&policer->queues);
> +    state->cur_queue = 0;
> +    state->queues = xmalloc(state->n_queues * sizeof *state->queues);
> +
> +    HMAP_FOR_EACH (queue, hmap_node, &policer->queues) {
> +        state->queues[i++] = queue->queue_id;
> +    }
> +    return 0;
> +}
> +
> +static const struct dpdk_qos_ops trtcm_policer_ops = {
> +    .qos_name = "trtcm-policer",
> +    .qos_construct = trtcm_policer_qos_construct,
> +    .qos_destruct = trtcm_policer_qos_destruct,
> +    .qos_get = trtcm_policer_qos_get,
> +    .qos_is_equal = trtcm_policer_qos_is_equal,
> +    .qos_run = trtcm_policer_run,
> +    .qos_queue_construct = trtcm_policer_qos_queue_construct,
> +    .qos_queue_destruct = trtcm_policer_qos_queue_destruct,
> +    .qos_queue_get = trtcm_policer_qos_queue_get,
> +    .qos_queue_get_stats = trtcm_policer_qos_queue_get_stats,
> +    .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
> +};
> +
>   static int
>   netdev_dpdk_reconfigure(struct netdev *netdev)
>   {
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 0ec726c..c43cb1a 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -4441,6 +4441,19 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>             in performance will be noticed in terms of overall aggregate traffic
>             throughput.
>           </dd>
> +        <dt><code>trtcm-policer</code></dt>
> +        <dd>
> +          A DPDK egress policer algorithm using RFC 4115's Two-Rate,
> +          Three-Color marker. It's a two-level hierarchical policer
> +          which first does a color-blind marking of the traffic at the queue
> +          level, followed by a color-aware marking at the port level. At the
> +          end traffic marked as Green or Yellow is forwarded, Red is dropped.
> +          For details on how traffic is marked, see RFC 4115.
> +
> +          If the ``default queue'', 0, is not configured it's automatically
> +          created with the same <code>other_config</code> values as the
> +          physical port.
> +        </dd>
>         </dl>
>       </column>
>   
> @@ -4508,6 +4521,27 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>           bytes/tokens of the packet. If there are not enough tokens in the cbs
>           bucket the packet will be dropped.
>         </column>
> +      <column name="other_config" key="eir" type='{"type": "integer"}'>
> +        The Excess Information Rate (EIR) is measured in bytes of IP
> +        packets per second, i.e. it includes the IP header, but not link
> +        specific (e.g. Ethernet) headers. This represents the bytes per second
> +        rate at which the token bucket will be updated. The eir value is
> +        calculated by (pps x packet data size).  For example assuming a user
> +        wishes to limit a stream consisting of 64 byte packets to 1 million
> +        packets per second the EIR would be set to to to 46000000. This value
> +        can be broken into '1,000,000 x 46'. Where 1,000,000 is the policing
> +        rate for the number of packets per second and 46 represents the size
> +        of the packet data for a 64 byte ip packet.
> +      </column>
> +      <column name="other_config" key="ebs" type='{"type": "integer"}'>
> +        The Excess Burst Size (EBS) is measured in bytes and represents a
> +        token bucket. At a minimum this value should be be set to the expected
> +        largest size packet in the traffic stream. In practice larger values
> +        may be used to increase the size of the token bucket. If a packet can
> +        be transmitted then the ebs will be decremented by the number of
> +        bytes/tokens of the packet. If there are not enough tokens in the cbs
> +        bucket the packet might be dropped.
> +      </column>
>       </group>
>   
>       <group title="Configuration for linux-sfq">
>
Ilya Maximets Jan. 15, 2020, 12:19 p.m. UTC | #2
On 15.01.2020 12:28, Stokes, Ian wrote:
> 
> 
> On 1/14/2020 4:12 PM, Eelco Chaudron wrote:
>> This patch adds a new policer to the DPDK datapath based on RFC 4115's
>> Two-Rate, Three-Color marker. It's a two-level hierarchical policer
>> which first does a color-blind marking of the traffic at the queue
>> level, followed by a color-aware marking at the port level. At the end
>> traffic marked as Green or Yellow is forwarded, Red is dropped. For
>> details on how traffic is marked, see RFC 4115.
>>
>> This egress policer can be used to limit traffic at different rated
>> based on the queues the traffic is in. In addition, it can also be used
>> to prioritize certain traffic over others at a port level.
>>
>> For example, the following configuration will limit the traffic rate at a
>> port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
>> 100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
>> Information Rate). High priority traffic is routed to queue 10, which marks
>> all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
>> marked as EIR, i.e. Yellow.
>>
>> ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
>>    --id=@myqos create qos type=trtcm-policer \
>>    other-config:cir=52000 other-config:cbs=2048 \
>>    other-config:eir=52000 other-config:ebs=2048  \
>>    queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
>>    --id=@dpdk1Q10 create queue \
>>      other-config:cir=41600000 other-config:cbs=2048 \
>>      other-config:eir=0 other-config:ebs=0 -- \
>>    --id=@dpdk1Q20 create queue \
>>      other-config:cir=0 other-config:cbs=0 \
>>      other-config:eir=41600000 other-config:ebs=2048 \
>>
>> This configuration accomplishes that the high priority traffic has a
>> guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
>> use the EIR, so a total of 2000pps at max. These additional 1000pps is
>> shared with the low priority traffic. The low priority traffic can use at
>> maximum 1000pps.
>>
> Thanks for the patch Eelco, minor comment below.
> 
> <snip>
> 
>>     Rate Limiting (Ingress Policing)
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 128963f..1ed4a47 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -26,6 +26,12 @@
>>   #include <sys/socket.h>
>>   #include <linux/if.h>
>>   +/* Include rte_compat.h first to allow experimental API's needed for the
>> + * rte_meter.h rfc4115 functions. Once they are no longer marked as
>> + * experimental the #define and rte_compat.h include can be removed.
>> + */
>> +#define ALLOW_EXPERIMENTAL_API
>> +#include <rte_compat.h>
> 
> I guess the risk here, from what I understand, this approach is all or nothing in terms of experimental APIs from the included headers, other experimental APIs could be used from DPDK going forward without causing warning?
> 
> If so, there would have to be extra dilligence taken when reviewing future patches and a discussion if an API is expereimental, should it wait until it is marked as non experimental. (In this case The TRTCM looks stable and is highly unlikey to be removed so it's not an issue IMO).
> 
> @Ilya/Kevin: Would you agree with above? Thoughts?

I think it make sense to wait for API to become non-experimental
if we have no easy way to enable it only on functions we need.
I agree that having widly enabled experimental api might produce
additional issues and will require more careful review of all the
DPDK related patches.

BTW, another thought I have in mind about all the release management is:
Shouldn't we hold OVS updates to new DPDK LTS until the first correction
release is out?  I mean, for example, Ubuntu triggers updates from one
LTS release to another only after .1 correcting relese is out (users
of Ubuntu 18.04 will receive upgrade notifications only after 20.04.1
is released).  Shouldn't we do the same thing?  Shouldn't we upgrade
to the next DPDK LTS only after XX.11.1 is ready?  This might make sense
in order to not have obviously broken functionality in OVS releases but
at the same time might just defer actual revealing of DPDK issues, so
I'm not fully sure about this.  Since OVS is not the only user of DPDK,
this still might make sense anyway.  Would like to hear some thoughts.

Best regards, Ilya Maximets.
Stokes, Ian Jan. 15, 2020, 1:19 p.m. UTC | #3
On 1/15/2020 12:19 PM, Ilya Maximets wrote:
> On 15.01.2020 12:28, Stokes, Ian wrote:
>>
>>
>> On 1/14/2020 4:12 PM, Eelco Chaudron wrote:
>>> This patch adds a new policer to the DPDK datapath based on RFC 4115's
>>> Two-Rate, Three-Color marker. It's a two-level hierarchical policer
>>> which first does a color-blind marking of the traffic at the queue
>>> level, followed by a color-aware marking at the port level. At the end
>>> traffic marked as Green or Yellow is forwarded, Red is dropped. For
>>> details on how traffic is marked, see RFC 4115.
>>>
>>> This egress policer can be used to limit traffic at different rated
>>> based on the queues the traffic is in. In addition, it can also be used
>>> to prioritize certain traffic over others at a port level.
>>>
>>> For example, the following configuration will limit the traffic rate at a
>>> port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
>>> 100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
>>> Information Rate). High priority traffic is routed to queue 10, which marks
>>> all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
>>> marked as EIR, i.e. Yellow.
>>>
>>> ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
>>>     --id=@myqos create qos type=trtcm-policer \
>>>     other-config:cir=52000 other-config:cbs=2048 \
>>>     other-config:eir=52000 other-config:ebs=2048  \
>>>     queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
>>>     --id=@dpdk1Q10 create queue \
>>>       other-config:cir=41600000 other-config:cbs=2048 \
>>>       other-config:eir=0 other-config:ebs=0 -- \
>>>     --id=@dpdk1Q20 create queue \
>>>       other-config:cir=0 other-config:cbs=0 \
>>>       other-config:eir=41600000 other-config:ebs=2048 \
>>>
>>> This configuration accomplishes that the high priority traffic has a
>>> guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
>>> use the EIR, so a total of 2000pps at max. These additional 1000pps is
>>> shared with the low priority traffic. The low priority traffic can use at
>>> maximum 1000pps.
>>>
>> Thanks for the patch Eelco, minor comment below.
>>
>> <snip>
>>
>>>      Rate Limiting (Ingress Policing)
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index 128963f..1ed4a47 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -26,6 +26,12 @@
>>>    #include <sys/socket.h>
>>>    #include <linux/if.h>
>>>    +/* Include rte_compat.h first to allow experimental API's needed for the
>>> + * rte_meter.h rfc4115 functions. Once they are no longer marked as
>>> + * experimental the #define and rte_compat.h include can be removed.
>>> + */
>>> +#define ALLOW_EXPERIMENTAL_API
>>> +#include <rte_compat.h>
>>
>> I guess the risk here, from what I understand, this approach is all or nothing in terms of experimental APIs from the included headers, other experimental APIs could be used from DPDK going forward without causing warning?
>>
>> If so, there would have to be extra dilligence taken when reviewing future patches and a discussion if an API is expereimental, should it wait until it is marked as non experimental. (In this case The TRTCM looks stable and is highly unlikey to be removed so it's not an issue IMO).
>>
>> @Ilya/Kevin: Would you agree with above? Thoughts?
> 
> I think it make sense to wait for API to become non-experimental
> if we have no easy way to enable it only on functions we need.
> I agree that having widly enabled experimental api might produce
> additional issues and will require more careful review of all the
> DPDK related patches.

So I'm ok with making an exception in this case for the feature, I think 
it was an oversight but the API is stable. But what I'm hearing is in 
general the rule should be for new features the API should not be 
experimental and if it is then it would have to have it's experimental 
tag removed by the next DPDK upgrade (in this case 20.11 for arguments 
sake) and then can be upstreamed to OVS master.

> 
> BTW, another thought I have in mind about all the release management is:
> Shouldn't we hold OVS updates to new DPDK LTS until the first correction
> release is out?  I mean, for example, Ubuntu triggers updates from one
> LTS release to another only after .1 correcting relese is out (users
> of Ubuntu 18.04 will receive upgrade notifications only after 20.04.1
> is released).  Shouldn't we do the same thing?  Shouldn't we upgrade
> to the next DPDK LTS only after XX.11.1 is ready?  This might make sense
> in order to not have obviously broken functionality in OVS releases but
> at the same time might just defer actual revealing of DPDK issues, so
> I'm not fully sure about this.  Since OVS is not the only user of DPDK,
> this still might make sense anyway.  Would like to hear some thoughts.
> 

Yes, I've thought about this as well. There certainly is advantage to 
moving to a .1 release in terms of stability instead of .0. However when 
I thought about the release two things came to mind.

(i) By moving to the .0 release, is OVS in a position to better 
contribute feedback for the .1 release and ensure relevant patches for 
OVS with DPDK fixes are upstreamed in the .1 (rather than a .2). 
Feedback coming not from just the developers but also from end users.

(ii) With the timeing of the OVS to .1 releases, does it make a massive 
differemce? For example with OVS 2.13 being released late February and 
19.11.1 being released early March, I would think OVS would move to .1 
pretty quickly to benefit from the latest fixes on OVS master. This 
still allows our OVS release to use the latest stable DPDK without 
waiting for OVS 2.14 in August.

There's definitely advantages to both approaches and worth further 
discussion. I'd be interested in hearing what others think?

BR
Ian

> Best regards, Ilya Maximets.
>
Kevin Traynor Jan. 15, 2020, 4:49 p.m. UTC | #4
On 15/01/2020 13:19, Stokes, Ian wrote:
> 
> 
> On 1/15/2020 12:19 PM, Ilya Maximets wrote:
>> On 15.01.2020 12:28, Stokes, Ian wrote:
>>>
>>>
>>> On 1/14/2020 4:12 PM, Eelco Chaudron wrote:
>>>> This patch adds a new policer to the DPDK datapath based on RFC 4115's
>>>> Two-Rate, Three-Color marker. It's a two-level hierarchical policer
>>>> which first does a color-blind marking of the traffic at the queue
>>>> level, followed by a color-aware marking at the port level. At the end
>>>> traffic marked as Green or Yellow is forwarded, Red is dropped. For
>>>> details on how traffic is marked, see RFC 4115.
>>>>
>>>> This egress policer can be used to limit traffic at different rated
>>>> based on the queues the traffic is in. In addition, it can also be used
>>>> to prioritize certain traffic over others at a port level.
>>>>
>>>> For example, the following configuration will limit the traffic rate at a
>>>> port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
>>>> 100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
>>>> Information Rate). High priority traffic is routed to queue 10, which marks
>>>> all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
>>>> marked as EIR, i.e. Yellow.
>>>>
>>>> ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
>>>>     --id=@myqos create qos type=trtcm-policer \
>>>>     other-config:cir=52000 other-config:cbs=2048 \
>>>>     other-config:eir=52000 other-config:ebs=2048  \
>>>>     queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
>>>>     --id=@dpdk1Q10 create queue \
>>>>       other-config:cir=41600000 other-config:cbs=2048 \
>>>>       other-config:eir=0 other-config:ebs=0 -- \
>>>>     --id=@dpdk1Q20 create queue \
>>>>       other-config:cir=0 other-config:cbs=0 \
>>>>       other-config:eir=41600000 other-config:ebs=2048 \
>>>>
>>>> This configuration accomplishes that the high priority traffic has a
>>>> guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
>>>> use the EIR, so a total of 2000pps at max. These additional 1000pps is
>>>> shared with the low priority traffic. The low priority traffic can use at
>>>> maximum 1000pps.
>>>>
>>> Thanks for the patch Eelco, minor comment below.
>>>
>>> <snip>
>>>
>>>>      Rate Limiting (Ingress Policing)
>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>> index 128963f..1ed4a47 100644
>>>> --- a/lib/netdev-dpdk.c
>>>> +++ b/lib/netdev-dpdk.c
>>>> @@ -26,6 +26,12 @@
>>>>    #include <sys/socket.h>
>>>>    #include <linux/if.h>
>>>>    +/* Include rte_compat.h first to allow experimental API's needed for the
>>>> + * rte_meter.h rfc4115 functions. Once they are no longer marked as
>>>> + * experimental the #define and rte_compat.h include can be removed.
>>>> + */
>>>> +#define ALLOW_EXPERIMENTAL_API
>>>> +#include <rte_compat.h>
>>>
>>> I guess the risk here, from what I understand, this approach is all or nothing in terms of experimental APIs from the included headers, other experimental APIs could be used from DPDK going forward without causing warning?
>>>
>>> If so, there would have to be extra dilligence taken when reviewing future patches and a discussion if an API is expereimental, should it wait until it is marked as non experimental. (In this case The TRTCM looks stable and is highly unlikey to be removed so it's not an issue IMO).
>>>
>>> @Ilya/Kevin: Would you agree with above? Thoughts?
>>
>> I think it make sense to wait for API to become non-experimental
>> if we have no easy way to enable it only on functions we need.
>> I agree that having widly enabled experimental api might produce
>> additional issues and will require more careful review of all the
>> DPDK related patches.
> 
> So I'm ok with making an exception in this case for the feature, I think 
> it was an oversight but the API is stable. But what I'm hearing is in 
> general the rule should be for new features the API should not be 
> experimental and if it is then it would have to have it's experimental 
> tag removed by the next DPDK upgrade (in this case 20.11 for arguments 
> sake) and then can be upstreamed to OVS master.
> 

In this case, it sounds ok to me. We know it will be non-experimental in
the next hours/days and we will not be left with some zombie feature on
upgrade. Plus, there might be an acceptable way of adding a
non-experimental symbol for this to the v19.11 stable branch too (not
guaranteed though), and if so we could turn this off on upgrade to
v19.11.X. OTOH, I don't know how important it is to have this feature in
OVS 2.13.

>>
>> BTW, another thought I have in mind about all the release management is:
>> Shouldn't we hold OVS updates to new DPDK LTS until the first correction
>> release is out?  I mean, for example, Ubuntu triggers updates from one
>> LTS release to another only after .1 correcting relese is out (users
>> of Ubuntu 18.04 will receive upgrade notifications only after 20.04.1
>> is released).  Shouldn't we do the same thing?  Shouldn't we upgrade
>> to the next DPDK LTS only after XX.11.1 is ready?  This might make sense
>> in order to not have obviously broken functionality in OVS releases but
>> at the same time might just defer actual revealing of DPDK issues, so
>> I'm not fully sure about this.  Since OVS is not the only user of DPDK,
>> this still might make sense anyway.  Would like to hear some thoughts.
>>
> 
> Yes, I've thought about this as well. There certainly is advantage to 
> moving to a .1 release in terms of stability instead of .0. However when 
> I thought about the release two things came to mind.
> 

Hmm, I naturally think this about stability too, but then I look at the
commits for the one I am most familiar with (excluding CVE releases),
18.11.1: 232
18.11.2: 361
18.11.3: 329
18.11.6-rc2: 280

It's hard to know how to interpret. Maybe it can be that "big" bugs are
found quickly? Maybe it takes a while for integrations and bugfixes to
feed back into DPDK? But at least we cannot say that everything gets
fixed in .1 and then it's stable.

> (i) By moving to the .0 release, is OVS in a position to better 
> contribute feedback for the .1 release and ensure relevant patches for 
> OVS with DPDK fixes are upstreamed in the .1 (rather than a .2). 
> Feedback coming not from just the developers but also from end users.
> 

Well, the sooner it is integrated and used, the sooner we will find bugs
that is sure. For example, iirc the non-experimental API issue was only
fully realised because we upgraded and wanted to apply this patch.

> (ii) With the timeing of the OVS to .1 releases, does it make a massive 
> differemce? For example with OVS 2.13 being released late February and 
> 19.11.1 being released early March, I would think OVS would move to .1 
> pretty quickly to benefit from the latest fixes on OVS master. This 
> still allows our OVS release to use the latest stable DPDK without 
> waiting for OVS 2.14 in August.
> 

It is a long time for some having to wait up to 12 months for the
feature or driver they put in DPDK to be usable in OVS as it is, so I
wouldn't be in favour of extending that.

Also, it seems to have settled into a cadence of Feb release of OVS
getting Nov LTS release of DPDK. I know that at least some people plan
around this and then integration into Openstack etc. so any change would
have an impact.

> There's definitely advantages to both approaches and worth further 
> discussion. I'd be interested in hearing what others think?

I don't think there is a real issue with the current model. And nothing
that more testing of intermediate DPDK releases (.02/.05/.08) and DPDK
.11-RC could not iron out.

Just my 0.02c,
Kevin.

> 
> BR
> Ian
> 
>> Best regards, Ilya Maximets.
>>
>
Stokes, Ian Jan. 15, 2020, 7:55 p.m. UTC | #5
On 1/15/2020 4:49 PM, Kevin Traynor wrote:
> On 15/01/2020 13:19, Stokes, Ian wrote:
>>
>>
>> On 1/15/2020 12:19 PM, Ilya Maximets wrote:
>>> On 15.01.2020 12:28, Stokes, Ian wrote:
>>>>
>>>>
>>>> On 1/14/2020 4:12 PM, Eelco Chaudron wrote:
>>>>> This patch adds a new policer to the DPDK datapath based on RFC 4115's
>>>>> Two-Rate, Three-Color marker. It's a two-level hierarchical policer
>>>>> which first does a color-blind marking of the traffic at the queue
>>>>> level, followed by a color-aware marking at the port level. At the end
>>>>> traffic marked as Green or Yellow is forwarded, Red is dropped. For
>>>>> details on how traffic is marked, see RFC 4115.
>>>>>
>>>>> This egress policer can be used to limit traffic at different rated
>>>>> based on the queues the traffic is in. In addition, it can also be used
>>>>> to prioritize certain traffic over others at a port level.
>>>>>
>>>>> For example, the following configuration will limit the traffic rate at a
>>>>> port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
>>>>> 100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
>>>>> Information Rate). High priority traffic is routed to queue 10, which marks
>>>>> all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
>>>>> marked as EIR, i.e. Yellow.
>>>>>
>>>>> ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
>>>>>      --id=@myqos create qos type=trtcm-policer \
>>>>>      other-config:cir=52000 other-config:cbs=2048 \
>>>>>      other-config:eir=52000 other-config:ebs=2048  \
>>>>>      queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
>>>>>      --id=@dpdk1Q10 create queue \
>>>>>        other-config:cir=41600000 other-config:cbs=2048 \
>>>>>        other-config:eir=0 other-config:ebs=0 -- \
>>>>>      --id=@dpdk1Q20 create queue \
>>>>>        other-config:cir=0 other-config:cbs=0 \
>>>>>        other-config:eir=41600000 other-config:ebs=2048 \
>>>>>
>>>>> This configuration accomplishes that the high priority traffic has a
>>>>> guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
>>>>> use the EIR, so a total of 2000pps at max. These additional 1000pps is
>>>>> shared with the low priority traffic. The low priority traffic can use at
>>>>> maximum 1000pps.
>>>>>
>>>> Thanks for the patch Eelco, minor comment below.

Ok, with the below discussion in mind I've pushed the series master. The 
understanding is that a better approach will be sought so that we can 
remove the experimental API exception in the future. I can raise this 
with the DPDK community call tomorrow and see what possibilities we have 
for removing the tag. If in the meantime, a new feature is submitted for 
OVS with DPDK that has an experimental tag for an API call it will be 
pushed back on until the experimental tag is removed.

Thanks
Ian

>>>>
>>>> <snip>
>>>>
>>>>>       Rate Limiting (Ingress Policing)
>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>> index 128963f..1ed4a47 100644
>>>>> --- a/lib/netdev-dpdk.c
>>>>> +++ b/lib/netdev-dpdk.c
>>>>> @@ -26,6 +26,12 @@
>>>>>     #include <sys/socket.h>
>>>>>     #include <linux/if.h>
>>>>>     +/* Include rte_compat.h first to allow experimental API's needed for the
>>>>> + * rte_meter.h rfc4115 functions. Once they are no longer marked as
>>>>> + * experimental the #define and rte_compat.h include can be removed.
>>>>> + */
>>>>> +#define ALLOW_EXPERIMENTAL_API
>>>>> +#include <rte_compat.h>
>>>>
>>>> I guess the risk here, from what I understand, this approach is all or nothing in terms of experimental APIs from the included headers, other experimental APIs could be used from DPDK going forward without causing warning?
>>>>
>>>> If so, there would have to be extra dilligence taken when reviewing future patches and a discussion if an API is expereimental, should it wait until it is marked as non experimental. (In this case The TRTCM looks stable and is highly unlikey to be removed so it's not an issue IMO).
>>>>
>>>> @Ilya/Kevin: Would you agree with above? Thoughts?
>>>
>>> I think it make sense to wait for API to become non-experimental
>>> if we have no easy way to enable it only on functions we need.
>>> I agree that having widly enabled experimental api might produce
>>> additional issues and will require more careful review of all the
>>> DPDK related patches.
>>
>> So I'm ok with making an exception in this case for the feature, I think
>> it was an oversight but the API is stable. But what I'm hearing is in
>> general the rule should be for new features the API should not be
>> experimental and if it is then it would have to have it's experimental
>> tag removed by the next DPDK upgrade (in this case 20.11 for arguments
>> sake) and then can be upstreamed to OVS master.
>>
> 
> In this case, it sounds ok to me. We know it will be non-experimental in
> the next hours/days and we will not be left with some zombie feature on
> upgrade. Plus, there might be an acceptable way of adding a
> non-experimental symbol for this to the v19.11 stable branch too (not
> guaranteed though), and if so we could turn this off on upgrade to
> v19.11.X. OTOH, I don't know how important it is to have this feature in
> OVS 2.13. >
>>>
>>> BTW, another thought I have in mind about all the release management is:
>>> Shouldn't we hold OVS updates to new DPDK LTS until the first correction
>>> release is out?  I mean, for example, Ubuntu triggers updates from one
>>> LTS release to another only after .1 correcting relese is out (users
>>> of Ubuntu 18.04 will receive upgrade notifications only after 20.04.1
>>> is released).  Shouldn't we do the same thing?  Shouldn't we upgrade
>>> to the next DPDK LTS only after XX.11.1 is ready?  This might make sense
>>> in order to not have obviously broken functionality in OVS releases but
>>> at the same time might just defer actual revealing of DPDK issues, so
>>> I'm not fully sure about this.  Since OVS is not the only user of DPDK,
>>> this still might make sense anyway.  Would like to hear some thoughts.
>>>
>>
>> Yes, I've thought about this as well. There certainly is advantage to
>> moving to a .1 release in terms of stability instead of .0. However when
>> I thought about the release two things came to mind.
>>
> 
> Hmm, I naturally think this about stability too, but then I look at the
> commits for the one I am most familiar with (excluding CVE releases),
> 18.11.1: 232
> 18.11.2: 361
> 18.11.3: 329
> 18.11.6-rc2: 280
> 
> It's hard to know how to interpret. Maybe it can be that "big" bugs are
> found quickly? Maybe it takes a while for integrations and bugfixes to
> feed back into DPDK? But at least we cannot say that everything gets
> fixed in .1 and then it's stable.
> 
>> (i) By moving to the .0 release, is OVS in a position to better
>> contribute feedback for the .1 release and ensure relevant patches for
>> OVS with DPDK fixes are upstreamed in the .1 (rather than a .2).
>> Feedback coming not from just the developers but also from end users.
>>
> 
> Well, the sooner it is integrated and used, the sooner we will find bugs
> that is sure. For example, iirc the non-experimental API issue was only
> fully realised because we upgraded and wanted to apply this patch.
> 
>> (ii) With the timeing of the OVS to .1 releases, does it make a massive
>> differemce? For example with OVS 2.13 being released late February and
>> 19.11.1 being released early March, I would think OVS would move to .1
>> pretty quickly to benefit from the latest fixes on OVS master. This
>> still allows our OVS release to use the latest stable DPDK without
>> waiting for OVS 2.14 in August.
>>
> 
> It is a long time for some having to wait up to 12 months for the
> feature or driver they put in DPDK to be usable in OVS as it is, so I
> wouldn't be in favour of extending that.
> 
> Also, it seems to have settled into a cadence of Feb release of OVS
> getting Nov LTS release of DPDK. I know that at least some people plan
> around this and then integration into Openstack etc. so any change would
> have an impact.
> 
>> There's definitely advantages to both approaches and worth further
>> discussion. I'd be interested in hearing what others think?
> 
> I don't think there is a real issue with the current model. And nothing
> that more testing of intermediate DPDK releases (.02/.05/.08) and DPDK
> .11-RC could not iron out.
> 
> Just my 0.02c,
> Kevin.
> 
>>
>> BR
>> Ian
>>
>>> Best regards, Ilya Maximets.
>>>
>>
>

Patch
diff mbox series

diff --git a/Documentation/topics/dpdk/qos.rst b/Documentation/topics/dpdk/qos.rst
index c0aec5d..1034954 100644
--- a/Documentation/topics/dpdk/qos.rst
+++ b/Documentation/topics/dpdk/qos.rst
@@ -33,6 +33,9 @@  datapath. These are referred to as *QoS* and *Rate Limiting*, respectively.
 QoS (Egress Policing)
 ---------------------
 
+Single Queue Policer
+~~~~~~~~~~~~~~~~~~~~
+
 Assuming you have a :doc:`vhost-user port <vhost-user>` transmitting traffic
 consisting of packets of size 64 bytes, the following command would limit the
 egress transmission rate of the port to ~1,000,000 packets per second::
@@ -49,6 +52,46 @@  To clear the QoS configuration from the port and ovsdb, run::
 
     $ ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos
 
+
+Multi Queue Policer
+~~~~~~~~~~~~~~~~~~~
+
+In addition to the egress-policer OVS-DPDK also has support for a RFC
+4115's Two-Rate, Three-Color marker meter. It's a two-level hierarchical
+policer which first does a color-blind marking of the traffic at the queue
+level, followed by a color-aware marking at the port level. At the end
+traffic marked as Green or Yellow is forwarded, Red is dropped. For
+details on how traffic is marked, see RFC 4115.
+
+This egress policer can be used to limit traffic at different rated
+based on the queues the traffic is in. In addition, it can also be used
+to prioritize certain traffic over others at a port level.
+
+For example, the following configuration will limit the traffic rate at a
+port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
+100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
+Information Rate). High priority traffic is routed to queue 10, which marks
+all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
+marked as EIR, i.e. Yellow::
+
+    $ ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
+        --id=@myqos create qos type=trtcm-policer \
+        other-config:cir=52000 other-config:cbs=2048 \
+        other-config:eir=52000 other-config:ebs=2048  \
+        queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
+         --id=@dpdk1Q10 create queue \
+          other-config:cir=41600000 other-config:cbs=2048 \
+          other-config:eir=0 other-config:ebs=0 -- \
+         --id=@dpdk1Q20 create queue \
+           other-config:cir=0 other-config:cbs=0 \
+           other-config:eir=41600000 other-config:ebs=2048 \
+
+This configuration accomplishes that the high priority traffic has a
+guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
+use the EIR, so a total of 2000pps at max. These additional 1000pps is
+shared with the low priority traffic. The low priority traffic can use at
+maximum 1000pps.
+
 Refer to ``vswitch.xml`` for more details on egress policer.
 
 Rate Limiting (Ingress Policing)
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 128963f..1ed4a47 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -26,6 +26,12 @@ 
 #include <sys/socket.h>
 #include <linux/if.h>
 
+/* Include rte_compat.h first to allow experimental API's needed for the
+ * rte_meter.h rfc4115 functions. Once they are no longer marked as
+ * experimental the #define and rte_compat.h include can be removed.
+ */
+#define ALLOW_EXPERIMENTAL_API
+#include <rte_compat.h>
 #include <rte_bus_pci.h>
 #include <rte_config.h>
 #include <rte_cycles.h>
@@ -329,8 +335,9 @@  struct dpdk_qos_ops {
                                      struct netdev_dpdk_queue_state *state);
 };
 
-/* dpdk_qos_ops for each type of user space QoS implementation */
+/* dpdk_qos_ops for each type of user space QoS implementation. */
 static const struct dpdk_qos_ops egress_policer_ops;
+static const struct dpdk_qos_ops trtcm_policer_ops;
 
 /*
  * Array of dpdk_qos_ops, contains pointer to all supported QoS
@@ -338,6 +345,7 @@  static const struct dpdk_qos_ops egress_policer_ops;
  */
 static const struct dpdk_qos_ops *const qos_confs[] = {
     &egress_policer_ops,
+    &trtcm_policer_ops,
     NULL
 };
 
@@ -2162,9 +2170,9 @@  netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
 }
 
 static inline bool
-netdev_dpdk_policer_pkt_handle(struct rte_meter_srtcm *meter,
-                               struct rte_meter_srtcm_profile *profile,
-                               struct rte_mbuf *pkt, uint64_t time)
+netdev_dpdk_srtcm_policer_pkt_handle(struct rte_meter_srtcm *meter,
+                                     struct rte_meter_srtcm_profile *profile,
+                                     struct rte_mbuf *pkt, uint64_t time)
 {
     uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct rte_ether_hdr);
 
@@ -2173,10 +2181,10 @@  netdev_dpdk_policer_pkt_handle(struct rte_meter_srtcm *meter,
 }
 
 static int
-netdev_dpdk_policer_run(struct rte_meter_srtcm *meter,
-                        struct rte_meter_srtcm_profile *profile,
-                        struct rte_mbuf **pkts, int pkt_cnt,
-                        bool should_steal)
+srtcm_policer_run_single_packet(struct rte_meter_srtcm *meter,
+                                struct rte_meter_srtcm_profile *profile,
+                                struct rte_mbuf **pkts, int pkt_cnt,
+                                bool should_steal)
 {
     int i = 0;
     int cnt = 0;
@@ -2186,8 +2194,8 @@  netdev_dpdk_policer_run(struct rte_meter_srtcm *meter,
     for (i = 0; i < pkt_cnt; i++) {
         pkt = pkts[i];
         /* Handle current packet */
-        if (netdev_dpdk_policer_pkt_handle(meter, profile,
-                                           pkt, current_time)) {
+        if (netdev_dpdk_srtcm_policer_pkt_handle(meter, profile,
+                                                 pkt, current_time)) {
             if (cnt != i) {
                 pkts[cnt] = pkt;
             }
@@ -2209,8 +2217,9 @@  ingress_policer_run(struct ingress_policer *policer, struct rte_mbuf **pkts,
     int cnt = 0;
 
     rte_spinlock_lock(&policer->policer_lock);
-    cnt = netdev_dpdk_policer_run(&policer->in_policer, &policer->in_prof,
-                                  pkts, pkt_cnt, should_steal);
+    cnt = srtcm_policer_run_single_packet(&policer->in_policer,
+                                          &policer->in_prof,
+                                          pkts, pkt_cnt, should_steal);
     rte_spinlock_unlock(&policer->policer_lock);
 
     return cnt;
@@ -4480,9 +4489,9 @@  egress_policer_run(struct qos_conf *conf, struct rte_mbuf **pkts, int pkt_cnt,
     struct egress_policer *policer =
         CONTAINER_OF(conf, struct egress_policer, qos_conf);
 
-    cnt = netdev_dpdk_policer_run(&policer->egress_meter,
-                                  &policer->egress_prof, pkts,
-                                  pkt_cnt, should_steal);
+    cnt = srtcm_policer_run_single_packet(&policer->egress_meter,
+                                          &policer->egress_prof, pkts,
+                                          pkt_cnt, should_steal);
 
     return cnt;
 }
@@ -4496,6 +4505,326 @@  static const struct dpdk_qos_ops egress_policer_ops = {
     .qos_run = egress_policer_run
 };
 
+/* trtcm-policer details */
+
+struct trtcm_policer {
+    struct qos_conf qos_conf;
+    struct rte_meter_trtcm_rfc4115_params meter_params;
+    struct rte_meter_trtcm_rfc4115_profile meter_profile;
+    struct rte_meter_trtcm_rfc4115 meter;
+    struct netdev_queue_stats stats;
+    struct hmap queues;
+};
+
+struct trtcm_policer_queue {
+    struct hmap_node hmap_node;
+    uint32_t queue_id;
+    struct rte_meter_trtcm_rfc4115_params meter_params;
+    struct rte_meter_trtcm_rfc4115_profile meter_profile;
+    struct rte_meter_trtcm_rfc4115 meter;
+    struct netdev_queue_stats stats;
+};
+
+static void
+trtcm_policer_details_to_param(const struct smap *details,
+                               struct rte_meter_trtcm_rfc4115_params *params)
+{
+    memset(params, 0, sizeof *params);
+    params->cir = smap_get_ullong(details, "cir", 0);
+    params->eir = smap_get_ullong(details, "eir", 0);
+    params->cbs = smap_get_ullong(details, "cbs", 0);
+    params->ebs = smap_get_ullong(details, "ebs", 0);
+}
+
+static void
+trtcm_policer_param_to_detail(
+    const struct rte_meter_trtcm_rfc4115_params *params,
+    struct smap *details)
+{
+    smap_add_format(details, "cir", "%"PRIu64, params->cir);
+    smap_add_format(details, "eir", "%"PRIu64, params->eir);
+    smap_add_format(details, "cbs", "%"PRIu64, params->cbs);
+    smap_add_format(details, "ebs", "%"PRIu64, params->ebs);
+}
+
+
+static int
+trtcm_policer_qos_construct(const struct smap *details,
+                            struct qos_conf **conf)
+{
+    struct trtcm_policer *policer;
+    int err = 0;
+
+    policer = xmalloc(sizeof *policer);
+    qos_conf_init(&policer->qos_conf, &trtcm_policer_ops);
+    trtcm_policer_details_to_param(details, &policer->meter_params);
+    err = rte_meter_trtcm_rfc4115_profile_config(&policer->meter_profile,
+                                                 &policer->meter_params);
+    if (!err) {
+        err = rte_meter_trtcm_rfc4115_config(&policer->meter,
+                                             &policer->meter_profile);
+    }
+
+    if (!err) {
+        *conf = &policer->qos_conf;
+        memset(&policer->stats, 0, sizeof policer->stats);
+        hmap_init(&policer->queues);
+    } else {
+        free(policer);
+        *conf = NULL;
+        err = -err;
+    }
+
+    return err;
+}
+
+static void
+trtcm_policer_qos_destruct(struct qos_conf *conf)
+{
+    struct trtcm_policer_queue *queue, *next_queue;
+    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
+                                                 qos_conf);
+
+    HMAP_FOR_EACH_SAFE (queue, next_queue, hmap_node, &policer->queues) {
+        hmap_remove(&policer->queues, &queue->hmap_node);
+        free(queue);
+    }
+    hmap_destroy(&policer->queues);
+    free(policer);
+}
+
+static int
+trtcm_policer_qos_get(const struct qos_conf *conf, struct smap *details)
+{
+    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
+                                                 qos_conf);
+
+    trtcm_policer_param_to_detail(&policer->meter_params, details);
+    return 0;
+}
+
+static bool
+trtcm_policer_qos_is_equal(const struct qos_conf *conf,
+                           const struct smap *details)
+{
+    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
+                                                 qos_conf);
+    struct rte_meter_trtcm_rfc4115_params params;
+
+    trtcm_policer_details_to_param(details, &params);
+
+    return !memcmp(&params, &policer->meter_params, sizeof params);
+}
+
+static struct trtcm_policer_queue *
+trtcm_policer_qos_find_queue(struct trtcm_policer *policer, uint32_t queue_id)
+{
+    struct trtcm_policer_queue *queue;
+    HMAP_FOR_EACH_WITH_HASH (queue, hmap_node, hash_2words(queue_id, 0),
+                             &policer->queues) {
+        if (queue->queue_id == queue_id) {
+            return queue;
+        }
+    }
+    return NULL;
+}
+
+static inline bool
+trtcm_policer_run_single_packet(struct trtcm_policer *policer,
+                                struct rte_mbuf *pkt, uint64_t time)
+{
+    enum rte_color pkt_color;
+    struct trtcm_policer_queue *queue;
+    uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct rte_ether_hdr);
+    struct dp_packet *dpkt = CONTAINER_OF(pkt, struct dp_packet, mbuf);
+
+    queue = trtcm_policer_qos_find_queue(policer, dpkt->md.skb_priority);
+    if (!queue) {
+        /* If no queue is found, use the default queue, which MUST exist. */
+        queue = trtcm_policer_qos_find_queue(policer, 0);
+        if (!queue) {
+            return false;
+        }
+    }
+
+    pkt_color = rte_meter_trtcm_rfc4115_color_blind_check(&queue->meter,
+                                                          &queue->meter_profile,
+                                                          time,
+                                                          pkt_len);
+
+    if (pkt_color == RTE_COLOR_RED) {
+        queue->stats.tx_errors++;
+    } else {
+        queue->stats.tx_bytes += pkt_len;
+        queue->stats.tx_packets++;
+    }
+
+    pkt_color = rte_meter_trtcm_rfc4115_color_aware_check(&policer->meter,
+                                                     &policer->meter_profile,
+                                                     time, pkt_len,
+                                                     pkt_color);
+
+    if (pkt_color == RTE_COLOR_RED) {
+        policer->stats.tx_errors++;
+        return false;
+    }
+
+    policer->stats.tx_bytes += pkt_len;
+    policer->stats.tx_packets++;
+    return true;
+}
+
+static int
+trtcm_policer_run(struct qos_conf *conf, struct rte_mbuf **pkts, int pkt_cnt,
+                  bool should_steal)
+{
+    int i = 0;
+    int cnt = 0;
+    struct rte_mbuf *pkt = NULL;
+    uint64_t current_time = rte_rdtsc();
+
+    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
+                                                 qos_conf);
+
+    for (i = 0; i < pkt_cnt; i++) {
+        pkt = pkts[i];
+
+        if (trtcm_policer_run_single_packet(policer, pkt, current_time)) {
+            if (cnt != i) {
+                pkts[cnt] = pkt;
+            }
+            cnt++;
+        } else {
+            if (should_steal) {
+                rte_pktmbuf_free(pkt);
+            }
+        }
+    }
+    return cnt;
+}
+
+static int
+trtcm_policer_qos_queue_construct(const struct smap *details,
+                                  uint32_t queue_id, struct qos_conf *conf)
+{
+    int err = 0;
+    struct trtcm_policer_queue *queue;
+    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
+                                                 qos_conf);
+
+    queue = trtcm_policer_qos_find_queue(policer, queue_id);
+    if (!queue) {
+        queue = xmalloc(sizeof *queue);
+        queue->queue_id = queue_id;
+        memset(&queue->stats, 0, sizeof queue->stats);
+        queue->stats.created = time_msec();
+        hmap_insert(&policer->queues, &queue->hmap_node,
+                    hash_2words(queue_id, 0));
+    }
+    if (queue_id == 0 && smap_is_empty(details)) {
+        /* No default queue configured, use port values */
+        memcpy(&queue->meter_params, &policer->meter_params,
+               sizeof queue->meter_params);
+    } else {
+        trtcm_policer_details_to_param(details, &queue->meter_params);
+    }
+
+    err = rte_meter_trtcm_rfc4115_profile_config(&queue->meter_profile,
+                                                 &queue->meter_params);
+
+    if (!err) {
+        err = rte_meter_trtcm_rfc4115_config(&queue->meter,
+                                             &queue->meter_profile);
+    }
+    if (err) {
+        hmap_remove(&policer->queues, &queue->hmap_node);
+        free(queue);
+        err = -err;
+    }
+    return err;
+}
+
+static void
+trtcm_policer_qos_queue_destruct(struct qos_conf *conf, uint32_t queue_id)
+{
+    struct trtcm_policer_queue *queue;
+    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
+                                                 qos_conf);
+
+    queue = trtcm_policer_qos_find_queue(policer, queue_id);
+    if (queue) {
+        hmap_remove(&policer->queues, &queue->hmap_node);
+        free(queue);
+    }
+}
+
+static int
+trtcm_policer_qos_queue_get(struct smap *details, uint32_t queue_id,
+                            const struct qos_conf *conf)
+{
+    struct trtcm_policer_queue *queue;
+    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
+                                                 qos_conf);
+
+    queue = trtcm_policer_qos_find_queue(policer, queue_id);
+    if (!queue) {
+        return EINVAL;
+    }
+
+    trtcm_policer_param_to_detail(&queue->meter_params, details);
+    return 0;
+}
+
+static int
+trtcm_policer_qos_queue_get_stats(const struct qos_conf *conf,
+                                  uint32_t queue_id,
+                                  struct netdev_queue_stats *stats)
+{
+    struct trtcm_policer_queue *queue;
+    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
+                                                 qos_conf);
+
+    queue = trtcm_policer_qos_find_queue(policer, queue_id);
+    if (!queue) {
+        return EINVAL;
+    }
+    memcpy(stats, &queue->stats, sizeof *stats);
+    return 0;
+}
+
+static int
+trtcm_policer_qos_queue_dump_state_init(const struct qos_conf *conf,
+                                        struct netdev_dpdk_queue_state *state)
+{
+    uint32_t i = 0;
+    struct trtcm_policer_queue *queue;
+    struct trtcm_policer *policer = CONTAINER_OF(conf, struct trtcm_policer,
+                                                 qos_conf);
+
+    state->n_queues = hmap_count(&policer->queues);
+    state->cur_queue = 0;
+    state->queues = xmalloc(state->n_queues * sizeof *state->queues);
+
+    HMAP_FOR_EACH (queue, hmap_node, &policer->queues) {
+        state->queues[i++] = queue->queue_id;
+    }
+    return 0;
+}
+
+static const struct dpdk_qos_ops trtcm_policer_ops = {
+    .qos_name = "trtcm-policer",
+    .qos_construct = trtcm_policer_qos_construct,
+    .qos_destruct = trtcm_policer_qos_destruct,
+    .qos_get = trtcm_policer_qos_get,
+    .qos_is_equal = trtcm_policer_qos_is_equal,
+    .qos_run = trtcm_policer_run,
+    .qos_queue_construct = trtcm_policer_qos_queue_construct,
+    .qos_queue_destruct = trtcm_policer_qos_queue_destruct,
+    .qos_queue_get = trtcm_policer_qos_queue_get,
+    .qos_queue_get_stats = trtcm_policer_qos_queue_get_stats,
+    .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
+};
+
 static int
 netdev_dpdk_reconfigure(struct netdev *netdev)
 {
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 0ec726c..c43cb1a 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -4441,6 +4441,19 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
           in performance will be noticed in terms of overall aggregate traffic
           throughput.
         </dd>
+        <dt><code>trtcm-policer</code></dt>
+        <dd>
+          A DPDK egress policer algorithm using RFC 4115's Two-Rate,
+          Three-Color marker. It's a two-level hierarchical policer
+          which first does a color-blind marking of the traffic at the queue
+          level, followed by a color-aware marking at the port level. At the
+          end traffic marked as Green or Yellow is forwarded, Red is dropped.
+          For details on how traffic is marked, see RFC 4115.
+
+          If the ``default queue'', 0, is not configured it's automatically
+          created with the same <code>other_config</code> values as the
+          physical port.
+        </dd>
       </dl>
     </column>
 
@@ -4508,6 +4521,27 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         bytes/tokens of the packet. If there are not enough tokens in the cbs
         bucket the packet will be dropped.
       </column>
+      <column name="other_config" key="eir" type='{"type": "integer"}'>
+        The Excess Information Rate (EIR) is measured in bytes of IP
+        packets per second, i.e. it includes the IP header, but not link
+        specific (e.g. Ethernet) headers. This represents the bytes per second
+        rate at which the token bucket will be updated. The eir value is
+        calculated by (pps x packet data size).  For example assuming a user
+        wishes to limit a stream consisting of 64 byte packets to 1 million
+        packets per second the EIR would be set to to to 46000000. This value
+        can be broken into '1,000,000 x 46'. Where 1,000,000 is the policing
+        rate for the number of packets per second and 46 represents the size
+        of the packet data for a 64 byte ip packet.
+      </column>
+      <column name="other_config" key="ebs" type='{"type": "integer"}'>
+        The Excess Burst Size (EBS) is measured in bytes and represents a
+        token bucket. At a minimum this value should be be set to the expected
+        largest size packet in the traffic stream. In practice larger values
+        may be used to increase the size of the token bucket. If a packet can
+        be transmitted then the ebs will be decremented by the number of
+        bytes/tokens of the packet. If there are not enough tokens in the cbs
+        bucket the packet might be dropped.
+      </column>
     </group>
 
     <group title="Configuration for linux-sfq">