diff mbox series

[ovs-dev,v8,3/3] revalidator: Rebalance offloaded flows based on the pps rate

Message ID 20181018161314.27854-4-sriharsha.basavapatna@broadcom.com
State Deferred
Headers show
Series Support dynamic rebalancing of offloaded flows | expand

Commit Message

Li,Rongqing via dev Oct. 18, 2018, 4:13 p.m. UTC
This is the third patch in the patch-set to support dynamic rebalancing
of offloaded flows.

The dynamic rebalancing functionality is implemented in this patch. The
ukeys that are not scheduled for deletion are obtained and passed as input
to the rebalancing routine. The rebalancing is done in the context of
revalidation leader thread, after all other revalidator threads are
done with gathering rebalancing data for flows.

For each netdev that is in OOR state, a list of flows - both offloaded
and non-offloaded (pending) - is obtained using the ukeys. For each netdev
that is in OOR state, the flows are grouped and sorted into offloaded and
pending flows.  The offloaded flows are sorted in descending order of
pps-rate, while pending flows are sorted in ascending order of pps-rate.

The rebalancing is done in two phases. In the first phase, we try to
offload all pending flows and if that succeeds, the OOR state on the device
is cleared. If some (or none) of the pending flows could not be offloaded,
then we start replacing an offloaded flow that has a lower pps-rate than
a pending flow, until there are no more pending flows with a higher rate
than an offloaded flow. The flows that are replaced from the device are
added into kernel datapath.

A new OVS configuration parameter "offload-rebalance", is added to ovsdb.
The default value of this is "false". To enable this feature, set the
value of this parameter to "true", which provides packets-per-second
rate based policy to dynamically offload and un-offload flows.

Note: This option can be enabled only when 'hw-offload' policy is enabled.
It also requires 'tc-policy' to be set to 'skip_sw'; otherwise, flow
offload errors (specifically ENOSPC error this feature depends on) reported
by an offloaded device are supressed by TC-Flower kernel module.

Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Co-authored-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Reviewed-by: Sathya Perla <sathya.perla@broadcom.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Ben Pfaff <blp@ovn.org>
---
 NEWS                          |   3 +
 lib/dpif-netdev.c             |   3 +-
 lib/dpif-netlink.c            |  29 ++-
 lib/dpif-provider.h           |   8 +-
 lib/dpif.c                    |  30 ++-
 lib/dpif.h                    |  12 +-
 lib/netdev-provider.h         |   7 +-
 lib/netdev.c                  |  62 ++++-
 lib/netdev.h                  |   1 +
 ofproto/ofproto-dpif-upcall.c | 446 +++++++++++++++++++++++++++++++++-
 vswitchd/vswitch.xml          |  21 ++
 11 files changed, 592 insertions(+), 30 deletions(-)

Comments

0-day Robot Oct. 18, 2018, 5 p.m. UTC | #1
Bleep bloop.  Greetings Sriharsha Basavapatna via dev, I am a robot and I have tried out your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


git-am:
fatal: sha1 information is lacking or useless (lib/dpif-netlink.c).
Repository lacks necessary blobs to fall back on 3-way merge.
Cannot fall back to three-way merge.
Patch failed at 0001 revalidator: Rebalance offloaded flows based on the pps rate
The copy of the patch that failed is found in:
   /var/lib/jenkins/jobs/upstream_build_from_pw/workspace/.git/rebase-apply/patch
When you have resolved this problem, run "git am --resolved".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


Please check this out.  If you feel there has been an error, please email aconole@bytheb.org

Thanks,
0-day Robot
Eelco Chaudron Oct. 19, 2018, 2:23 p.m. UTC | #2
On 18 Oct 2018, at 18:13, Sriharsha Basavapatna via dev wrote:

> This is the third patch in the patch-set to support dynamic 
> rebalancing
> of offloaded flows.
>
> The dynamic rebalancing functionality is implemented in this patch. 
> The
> ukeys that are not scheduled for deletion are obtained and passed as 
> input
> to the rebalancing routine. The rebalancing is done in the context of
> revalidation leader thread, after all other revalidator threads are
> done with gathering rebalancing data for flows.
>
> For each netdev that is in OOR state, a list of flows - both offloaded
> and non-offloaded (pending) - is obtained using the ukeys. For each 
> netdev
> that is in OOR state, the flows are grouped and sorted into offloaded 
> and
> pending flows.  The offloaded flows are sorted in descending order of
> pps-rate, while pending flows are sorted in ascending order of 
> pps-rate.
>
> The rebalancing is done in two phases. In the first phase, we try to
> offload all pending flows and if that succeeds, the OOR state on the 
> device
> is cleared. If some (or none) of the pending flows could not be 
> offloaded,
> then we start replacing an offloaded flow that has a lower pps-rate 
> than
> a pending flow, until there are no more pending flows with a higher 
> rate
> than an offloaded flow. The flows that are replaced from the device 
> are
> added into kernel datapath.
>
> A new OVS configuration parameter "offload-rebalance", is added to 
> ovsdb.
> The default value of this is "false". To enable this feature, set the
> value of this parameter to "true", which provides packets-per-second
> rate based policy to dynamically offload and un-offload flows.
>
> Note: This option can be enabled only when 'hw-offload' policy is 
> enabled.
> It also requires 'tc-policy' to be set to 'skip_sw'; otherwise, flow
> offload errors (specifically ENOSPC error this feature depends on) 
> reported
> by an offloaded device are supressed by TC-Flower kernel module.
>
> Signed-off-by: Sriharsha Basavapatna 
> <sriharsha.basavapatna@broadcom.com>
> Co-authored-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
> Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
> Reviewed-by: Sathya Perla <sathya.perla@broadcom.com>
> Reviewed-by: Simon Horman <simon.horman@netronome.com>
> Reviewed-by: Ben Pfaff <blp@ovn.org>
> ---
>  NEWS                          |   3 +
>  lib/dpif-netdev.c             |   3 +-
>  lib/dpif-netlink.c            |  29 ++-
>  lib/dpif-provider.h           |   8 +-
>  lib/dpif.c                    |  30 ++-
>  lib/dpif.h                    |  12 +-
>  lib/netdev-provider.h         |   7 +-
>  lib/netdev.c                  |  62 ++++-
>  lib/netdev.h                  |   1 +
>  ofproto/ofproto-dpif-upcall.c | 446 
> +++++++++++++++++++++++++++++++++-
>  vswitchd/vswitch.xml          |  21 ++
>  11 files changed, 592 insertions(+), 30 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index 33b4d8a23..846b46fb5 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -8,6 +8,9 @@ Post-v2.10.0
>       as the default timeout for control utilities.
>     - ovn:
>       * ovn-ctl: allow passing user:group ids to the OVN daemons.
> +   - ovs-vswitchd:
> +     * New configuration option "offload-rebalance", that enables 
> dynamic
> +       rebalancing of offloaded flows.
>
>
>  v2.10.0 - xx xxx xxxx
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 7c0300cc5..1c01d2278 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3689,7 +3689,8 @@ dpif_netdev_execute(struct dpif *dpif, struct 
> dpif_execute *execute)
>  }
>
>  static void
> -dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t 
> n_ops)
> +dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t 
> n_ops,
> +                    enum dpif_offload_type offload_type OVS_UNUSED)
>  {
>      size_t i;
>
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index b9ce9cbe2..2e01f5750 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -2281,7 +2281,8 @@ dpif_netlink_operate_chunks(struct dpif_netlink 
> *dpif, struct dpif_op **ops,
>  }
>
>  static void
> -dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t 
> n_ops)
> +dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t 
> n_ops,
> +                     enum dpif_offload_type offload_type)
>  {
>      struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
>      struct dpif_op *new_ops[OPERATE_MAX_OPS];
> @@ -2289,7 +2290,12 @@ dpif_netlink_operate(struct dpif *dpif_, struct 
> dpif_op **ops, size_t n_ops)
>      int i = 0;
>      int err = 0;
>
> -    if (netdev_is_flow_api_enabled()) {
> +    if (offload_type == DPIF_OFFLOAD_ALWAYS && 
> !netdev_is_flow_api_enabled()) {
> +        VLOG_DBG("Invalid offload_type: %d", offload_type);

Here we are not returning any errors just a silent return, should we 
return EINVAL instead?

> +        return;
> +    }
> +
> +    if (offload_type != DPIF_OFFLOAD_NEVER && 
> netdev_is_flow_api_enabled()) {
>          while (n_ops > 0) {
>              count = 0;
>
> @@ -2298,6 +2304,23 @@ dpif_netlink_operate(struct dpif *dpif_, struct 
> dpif_op **ops, size_t n_ops)
>
>                  err = try_send_to_netdev(dpif, op);
>                  if (err && err != EEXIST) {
> +                    if (offload_type == DPIF_OFFLOAD_ALWAYS) {
> +                        /* We got an error while offloading an op. 
> Since
> +                         * OFFLOAD_ALWAYS is specified, we stop 
> further
> +                         * processing and return to the caller 
> without
> +                         * invoking kernel datapath as fallback. But 
> the
> +                         * interface requires us to process all 
> n_ops; so
> +                         * return the same error in the remaining ops 
> too.
> +                         */

Why are we ok with failing all possible operations here? What if we are 
doing a FLOW_PUT and FLOW_DEL in a set?
Also one add might fail but another might pass, due to specific 
resources, i.e. seperate hw resources for excact match, and masked 
matches?

> +                        op->error = err;
> +                        n_ops--;
> +                        while (n_ops > 0) {
> +                            op = ops[i++];
> +                            op->error = err;
> +                            n_ops--;
> +                        }
> +                        return;
> +                    }
>                      new_ops[count++] = op;
>                  } else {
>                      op->error = err;
> @@ -2308,7 +2331,7 @@ dpif_netlink_operate(struct dpif *dpif_, struct 
> dpif_op **ops, size_t n_ops)
>
>              dpif_netlink_operate_chunks(dpif, new_ops, count);
>          }
> -    } else {
> +    } else if (offload_type != DPIF_OFFLOAD_ALWAYS) {
>          dpif_netlink_operate_chunks(dpif, ops, n_ops);

          } else {
          What should happen in this case? silently pass, if so we might 
want to clear all the errors just in case?
        }

>      }
>  }
> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
> index 7a71f5c0a..a30de740f 100644
> --- a/lib/dpif-provider.h
> +++ b/lib/dpif-provider.h
> @@ -296,12 +296,14 @@ struct dpif_class {
>
>      int (*flow_dump_next)(struct dpif_flow_dump_thread *thread,
>                            struct dpif_flow *flows, int max_flows);
> -
>      /* Executes each of the 'n_ops' operations in 'ops' on 'dpif', in 
> the order
>       * in which they are specified, placing each operation's results 
> in the
>       * "output" members documented in comments and the 'error' member 
> of each
> -     * dpif_op. */
> -    void (*operate)(struct dpif *dpif, struct dpif_op **ops, size_t 
> n_ops);
> +     * dpif_op. The offload_type argument tells the provider if 'ops' 
> should
> +     * be submitted to to a netdev (only offload) or to the kernel 
> datapath
> +     * (never offload) or to both (offload if possible; software 
> fallback). */
> +    void (*operate)(struct dpif *dpif, struct dpif_op **ops, size_t 
> n_ops,
> +                    enum dpif_offload_type offload_type);
>
>      /* Enables or disables receiving packets with dpif_recv() for 
> 'dpif'.
>       * Turning packet receive off and then back on is allowed to 
> change Netlink
> diff --git a/lib/dpif.c b/lib/dpif.c
> index d799f972c..65880b86a 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -49,6 +49,7 @@
>  #include "valgrind.h"
>  #include "openvswitch/ofp-errors.h"
>  #include "openvswitch/vlog.h"
> +#include "lib/netdev-provider.h"
>
>  VLOG_DEFINE_THIS_MODULE(dpif);
>
> @@ -1015,7 +1016,7 @@ dpif_flow_get(struct dpif *dpif,
>      op.flow_get.flow->key_len = key_len;
>
>      opp = &op;
> -    dpif_operate(dpif, &opp, 1);
> +    dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>
>      return op.error;
>  }
> @@ -1045,7 +1046,7 @@ dpif_flow_put(struct dpif *dpif, enum 
> dpif_flow_put_flags flags,
>      op.flow_put.stats = stats;
>
>      opp = &op;
> -    dpif_operate(dpif, &opp, 1);
> +    dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>
>      return op.error;
>  }
> @@ -1068,7 +1069,7 @@ dpif_flow_del(struct dpif *dpif,
>      op.flow_del.terse = false;
>
>      opp = &op;
> -    dpif_operate(dpif, &opp, 1);
> +    dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>
>      return op.error;
>  }
> @@ -1325,7 +1326,7 @@ dpif_execute(struct dpif *dpif, struct 
> dpif_execute *execute)
>          op.execute = *execute;
>
>          opp = &op;
> -        dpif_operate(dpif, &opp, 1);
> +        dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>
>          return op.error;
>      } else {
> @@ -1336,10 +1337,21 @@ dpif_execute(struct dpif *dpif, struct 
> dpif_execute *execute)
>  /* Executes each of the 'n_ops' operations in 'ops' on 'dpif', in the 
> order in
>   * which they are specified.  Places each operation's results in the 
> "output"
>   * members documented in comments, and 0 in the 'error' member on 
> success or a
> - * positive errno on failure. */
> + * positive errno on failure.
> + */

Guess this should be undone as all comments in this file use this style 
of closing.

>  void
> -dpif_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops)
> -{
> +dpif_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
> +             enum dpif_offload_type offload_type)
> +{
> +    if (offload_type == DPIF_OFFLOAD_ALWAYS && 
> !netdev_is_flow_api_enabled()) {
> +        size_t i;
> +        for (i = 0; i < n_ops; i++) {
> +            struct dpif_op *op = ops[i];
> +            op->error = EINVAL;
> +        }
> +        return;
> +    }
> +
>      while (n_ops > 0) {
>          size_t chunk;
>
> @@ -1360,7 +1372,7 @@ dpif_operate(struct dpif *dpif, struct dpif_op 
> **ops, size_t n_ops)
>               * handle itself, without help. */
>              size_t i;
>
> -            dpif->dpif_class->operate(dpif, ops, chunk);
> +            dpif->dpif_class->operate(dpif, ops, chunk, 
> offload_type);
>
>              for (i = 0; i < chunk; i++) {
>                  struct dpif_op *op = ops[i];
> @@ -1657,7 +1669,7 @@ dpif_queue_to_priority(const struct dpif *dpif, 
> uint32_t queue_id,
>      log_operation(dpif, "queue_to_priority", error);
>      return error;
>  }
> -
> +
Removed by accident?

>  void
>  dpif_init(struct dpif *dpif, const struct dpif_class *dpif_class,
>            const char *name,
> diff --git a/lib/dpif.h b/lib/dpif.h
> index bbdc3eb6c..0675ab19f 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -614,6 +614,13 @@ enum dpif_op_type {
>      DPIF_OP_FLOW_GET,
>  };
>
> +/* offload_type argument types to (*operate) interface */
> +enum dpif_offload_type {
> +    DPIF_OFFLOAD_AUTO,         /* Offload if possible, fallback to 
> software. */
> +    DPIF_OFFLOAD_NEVER,        /* Never offload to hardware. */
> +    DPIF_OFFLOAD_ALWAYS,       /* Always offload to hardware. */
> +};
> +
>  /* Add or modify a flow.
>   *
>   * The flow is specified by the Netlink attributes with types 
> OVS_KEY_ATTR_* in
> @@ -768,8 +775,9 @@ struct dpif_op {
>      };
>  };
>
> -void dpif_operate(struct dpif *, struct dpif_op **ops, size_t n_ops);
> -

Same here?

> +void dpif_operate(struct dpif *, struct dpif_op **ops, size_t n_ops,
> +                  enum dpif_offload_type);
> +
>  /* Upcalls. */
>
>  enum dpif_upcall_type {
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index e320dad61..fb0c27e6e 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -38,10 +38,14 @@ struct netdev_tnl_build_header_params;
>  /* Offload-capable (HW) netdev information */
>  struct netdev_hw_info {
>      bool oor;		/* Out of Offload Resources ? */
> +    int offload_count;  /* Pending (non-offloaded) flow count */
> +    int pending_count;  /* Offloaded flow count */

Guess they should be uint32_t assuming we do not expect negative counts.

>  };
>
>  enum hw_info_type {
> -    HW_INFO_TYPE_OOR = 1	/* OOR state */
> +    HW_INFO_TYPE_OOR = 1,		/* OOR state */
> +    HW_INFO_TYPE_PEND_COUNT = 2,	/* Pending(non-offloaded) flow count 
> */
> +    HW_INFO_TYPE_OFFL_COUNT = 3		/* Offloaded flow count */
>  };
>
>  /* A network device (e.g. an Ethernet device).
> @@ -89,7 +93,6 @@ struct netdev {
>      int n_rxq;
>      struct shash_node *node;            /* Pointer to element in 
> global map. */
>      struct ovs_list saved_flags_list; /* Contains "struct 
> netdev_saved_flags". */
> -
>      struct netdev_hw_info hw_info;	/* offload-capable netdev info */
>  };
>
> diff --git a/lib/netdev.c b/lib/netdev.c
> index f3fa08ca3..5d7f9c89b 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -2260,11 +2260,23 @@ netdev_get_block_id(struct netdev *netdev)
>  int
>  netdev_get_hw_info(struct netdev *netdev, int type)
>  {
> -    if (type == HW_INFO_TYPE_OOR) {
> -        return netdev->hw_info.oor;
> +    int val = -1;
> +
> +    switch (type) {
> +    case HW_INFO_TYPE_OOR:
> +        val = netdev->hw_info.oor;
> +        break;
> +    case HW_INFO_TYPE_PEND_COUNT:
> +        val = netdev->hw_info.pending_count;
> +        break;
> +    case HW_INFO_TYPE_OFFL_COUNT:
> +        val = netdev->hw_info.offload_count;
> +        break;
> +    default:
> +        break;
>      }
>
> -    return -1;
> +    return val;
>  }
>
>  /*
> @@ -2273,9 +2285,47 @@ netdev_get_hw_info(struct netdev *netdev, int 
> type)
>  void
>  netdev_set_hw_info(struct netdev *netdev, int type, int val)
>  {
> -    if (type == HW_INFO_TYPE_OOR) {
> +    switch (type) {
> +    case HW_INFO_TYPE_OOR:
> +        if (val == 0) {
> +            VLOG_DBG("Offload rebalance: netdev: %s is not OOR", 
> netdev->name);
> +        }
>          netdev->hw_info.oor = val;
> +        break;
> +    case HW_INFO_TYPE_PEND_COUNT:
> +        netdev->hw_info.pending_count = val;
> +        break;
> +    case HW_INFO_TYPE_OFFL_COUNT:
> +        netdev->hw_info.offload_count = val;
> +        break;
> +    default:
> +        break;
> +    }
> +}

See comment in first patchset about creating separate functions for 
this.

> +
> +/*
> + * Find if any netdev is in OOR state. Return true if there's at 
> least
> + * one netdev that's in OOR state; otherwise return false.
> + */
> +bool
> +netdev_any_oor(void)
> +    OVS_EXCLUDED(netdev_mutex)
> +{
> +    struct shash_node *node;
> +    bool oor = false;
> +
> +    ovs_mutex_lock(&netdev_mutex);
> +    SHASH_FOR_EACH (node, &netdev_shash) {
> +        struct netdev *dev = node->data;
> +
> +        if (dev->hw_info.oor) {
> +            oor = true;
> +            break;
> +        }
>      }
> +    ovs_mutex_unlock(&netdev_mutex);
> +
> +    return oor;
>  }
>
>  bool
> @@ -2549,6 +2599,10 @@ netdev_set_flow_api_enabled(const struct smap 
> *ovs_other_config)
>              tc_set_policy(smap_get_def(ovs_other_config, "tc-policy",
>                                         TC_POLICY_DEFAULT));
>
> +            if (smap_get_bool(ovs_other_config, "offload-rebalance", 
> false)) {
> +                netdev_offload_rebalance_policy = true;
> +            }
> +
>              netdev_ports_flow_init();
>
>              ovsthread_once_done(&once);
> diff --git a/lib/netdev.h b/lib/netdev.h
> index b0e5c5b72..373be7cc0 100644
> --- a/lib/netdev.h
> +++ b/lib/netdev.h
> @@ -229,6 +229,7 @@ int netdev_init_flow_api(struct netdev *);
>  uint32_t netdev_get_block_id(struct netdev *);
>  int netdev_get_hw_info(struct netdev *, int);
>  void netdev_set_hw_info(struct netdev *, int, int);
> +bool netdev_any_oor(void);
>  bool netdev_is_flow_api_enabled(void);
>  void netdev_set_flow_api_enabled(const struct smap 
> *ovs_other_config);
>  bool netdev_is_offload_rebalance_policy_enabled(void);
> diff --git a/ofproto/ofproto-dpif-upcall.c 
> b/ofproto/ofproto-dpif-upcall.c
> index a372d6252..bb9e61b7c 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -22,6 +22,7 @@
>  #include "connmgr.h"
>  #include "coverage.h"
>  #include "cmap.h"
> +#include "lib/dpif-provider.h"
>  #include "dpif.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "fail-open.h"
> @@ -42,7 +43,6 @@
>  #include "tunnel.h"
>  #include "unixctl.h"
>  #include "openvswitch/vlog.h"
> -#include "lib/dpif-provider.h"
>  #include "lib/netdev-provider.h"
>
>  #define MAX_QUEUE_LENGTH 512
> @@ -182,6 +182,8 @@ struct udpif {
>      uint64_t conn_seq;                 /* Corresponds to 'dump_seq' 
> when
>                                            conns[n_conns-1] was 
> stored. */
>      size_t n_conns;                    /* Number of connections 
> waiting. */
> +
> +    long long int offload_rebalance_time;  /* Time of last offload 
> rebalance */
>  };
>
>  enum upcall_type {
> @@ -308,6 +310,7 @@ struct udpif_key {
>      struct recirc_refs recircs;  /* Action recirc IDs with references 
> held. */
>
>  #define OFFL_REBAL_INTVL_MSEC  3000	/* dynamic offload rebalance freq 
> */
> +    struct netdev *in_netdev;		/* in_odp_port's netdev */
>      bool offloaded;			/* True if flow is offloaded */
>      uint64_t flow_pps_rate;		/* Packets-Per-Second rate */
>      long long int flow_time;		/* last pps update time */
> @@ -396,6 +399,12 @@ static int upcall_receive(struct upcall *, const 
> struct dpif_backer *,
>                            const ovs_u128 *ufid, const unsigned 
> pmd_id);
>  static void upcall_uninit(struct upcall *);
>
> +static void udpif_flow_rebalance(struct udpif *udpif);
> +static int udpif_flow_program(struct udpif *udpif, struct udpif_key 
> *ukey,
> +                              enum dpif_offload_type offload_type);
> +static int udpif_flow_unprogram(struct udpif *udpif, struct udpif_key 
> *ukey,
> +                                enum dpif_offload_type offload_type);
> +
>  static upcall_callback upcall_cb;
>  static dp_purge_callback dp_purge_cb;
>
> @@ -567,6 +576,7 @@ udpif_start_threads(struct udpif *udpif, size_t 
> n_handlers_,
>          ovs_barrier_init(&udpif->pause_barrier, udpif->n_revalidators 
> + 1);
>          udpif->reval_exit = false;
>          udpif->pause = false;
> +        udpif->offload_rebalance_time = time_msec();
>          udpif->revalidators = xzalloc(udpif->n_revalidators
>                                        * sizeof *udpif->revalidators);
>          for (size_t i = 0; i < udpif->n_revalidators; i++) {
> @@ -859,6 +869,26 @@ free_dupcall:
>      return n_upcalls;
>  }
>
> +static void
> +udpif_run_flow_rebalance(struct udpif *udpif)
> +{
> +    long long int now = 0;
> +
> +    /* Don't rebalance if OFFL_REBAL_INTVL_MSEC have not elapsed */
> +    now = time_msec();
> +    if (now < udpif->offload_rebalance_time + OFFL_REBAL_INTVL_MSEC) 
> {
> +        return;
> +    }
> +
> +    if (!netdev_any_oor()) {
> +        return;
> +    }
> +
> +    VLOG_DBG("Offload rebalance: Found OOR netdevs");

Can we add a coverage counter here?

        COVERAGE_INC(offload_flow_rebalance);

> +    udpif->offload_rebalance_time = now;
> +    udpif_flow_rebalance(udpif);
> +}
> +
>  static void *
>  udpif_revalidator(void *arg)
>  {
> @@ -933,6 +963,9 @@ udpif_revalidator(void *arg)
>
>              dpif_flow_dump_destroy(udpif->dump);
>              seq_change(udpif->dump_seq);
> +            if (netdev_is_offload_rebalance_policy_enabled()) {
> +                udpif_run_flow_rebalance(udpif);
> +            }
>
>              duration = MAX(time_msec() - start_time, 1);
>              udpif->dump_duration = duration;
> @@ -977,7 +1010,7 @@ udpif_revalidator(void *arg)
>
>      return NULL;
>  }
> -

> +
>  static enum upcall_type
>  classify_upcall(enum dpif_upcall_type type, const struct nlattr 
> *userdata,
>                  struct user_action_cookie *cookie)
> @@ -1579,7 +1612,7 @@ handle_upcalls(struct udpif *udpif, struct 
> upcall *upcalls,
>      for (i = 0; i < n_ops; i++) {
>          opsp[n_opsp++] = &ops[i].dop;
>      }
> -    dpif_operate(udpif->dpif, opsp, n_opsp);
> +    dpif_operate(udpif->dpif, opsp, n_opsp, DPIF_OFFLOAD_AUTO);
>      for (i = 0; i < n_ops; i++) {
>          struct udpif_key *ukey = ops[i].ukey;
>
> @@ -1671,13 +1704,13 @@ ukey_create__(const struct nlattr *key, size_t 
> key_len,
>      ukey->state = UKEY_CREATED;
>      ukey->state_thread = ovsthread_id_self();
>      ukey->state_where = OVS_SOURCE_LOCATOR;
> -    ukey->created = time_msec();
> +    ukey->created = ukey->flow_time = time_msec();
>      memset(&ukey->stats, 0, sizeof ukey->stats);
>      ukey->stats.used = used;
>      ukey->xcache = NULL;
>
>      ukey->offloaded = false;
> -    ukey->flow_time = 0;
> +    ukey->in_netdev = NULL;
>      ukey->flow_packets = ukey->flow_backlog_packets = 0;
>
>      ukey->key_recirc_id = key_recirc_id;
> @@ -2329,7 +2362,7 @@ push_dp_ops(struct udpif *udpif, struct ukey_op 
> *ops, size_t n_ops)
>      for (i = 0; i < n_ops; i++) {
>          opsp[i] = &ops[i].dop;
>      }
> -    dpif_operate(udpif->dpif, opsp, n_ops);
> +    dpif_operate(udpif->dpif, opsp, n_ops, DPIF_OFFLOAD_AUTO);
>
>      for (i = 0; i < n_ops; i++) {
>          struct ukey_op *op = &ops[i];
> @@ -2455,6 +2488,57 @@ reval_op_init(struct ukey_op *op, enum 
> reval_result result,
>      }
>  }
>
> +static void
> +ukey_netdev_unref(struct udpif_key *ukey)
> +{
> +    if (!ukey->in_netdev) {
> +        return;
> +    }
> +    netdev_close(ukey->in_netdev);
> +    ukey->in_netdev = NULL;
> +}
> +
> +/*
> + * Given a udpif_key, get its input port (netdev) by parsing the flow 
> keys
> + * and actions. The flow may not contain flow attributes if it is a 
> terse
> + * dump; read its attributes from the ukey and then parse the flow to 
> get
> + * the port info. Save them in udpif_key.
> + */
> +static void
> +ukey_to_flow_netdev(struct udpif *udpif, struct udpif_key *ukey)
> +{
> +    const struct dpif *dpif = udpif->dpif;
> +    const struct dpif_class *dpif_class = dpif->dpif_class;
> +    const struct nlattr *k;
> +    unsigned int left;
> +
> +    /* Remove existing references to netdev */
> +    ukey_netdev_unref(ukey);

If for some reason we already have a reference, we should just return 
it, as the netdev can not change.

> +
> +    /* Find the input port and get a reference to its netdev */
> +    NL_ATTR_FOR_EACH (k, left, ukey->key, ukey->key_len) {
> +        enum ovs_key_attr type = nl_attr_type(k);
> +
> +        if (type == OVS_KEY_ATTR_IN_PORT) {
> +            ukey->in_netdev = 
> netdev_ports_get(nl_attr_get_odp_port(k),
> +                                               dpif_class);
> +        } else if (type == OVS_KEY_ATTR_TUNNEL) {
> +            struct flow_tnl tnl;
> +            enum odp_key_fitness res;
> +
> +            if (ukey->in_netdev) {
> +                netdev_close(ukey->in_netdev);
> +                ukey->in_netdev = NULL;
> +            }
> +            res = odp_tun_key_from_attr(k, &tnl);
> +            if (res != ODP_FIT_ERROR) {
> +                ukey->in_netdev = flow_get_tunnel_netdev(&tnl);
> +                break;
> +            }
> +        }
> +    }
> +}
> +
>  static uint64_t
>  udpif_flow_packet_delta(struct udpif_key *ukey, const struct 
> dpif_flow *f)
>  {
> @@ -2468,6 +2552,16 @@ udpif_flow_time_delta(struct udpif *udpif, 
> struct udpif_key *ukey)
>      return (udpif->dpif->current_ms - ukey->flow_time) / 1000;
>  }
>
> +/*
> + * Save backlog packet count while switching modes
> + * between offloaded and kernel datapaths.
> + */
> +static void
> +udpif_set_ukey_backlog_packets(struct udpif_key *ukey)
> +{
> +    ukey->flow_backlog_packets = ukey->flow_packets;
> +}
> +
>  /* Gather pps-rate for the given dpif_flow and save it in its ukey */
>  static void
>  udpif_update_flow_pps(struct udpif *udpif, struct udpif_key *ukey,
> @@ -2539,6 +2633,7 @@ revalidate(struct revalidator *revalidator)
>          kill_them_all = n_dp_flows > flow_limit * 2;
>          max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle;
>
> +        udpif->dpif->current_ms = time_msec();

Can we not use udpif->dpif->current_ms = now, as it's set two lines 
above.

>          for (f = flows; f < &flows[n_dumped]; f++) {
>              long long int used = f->stats.used;
>              struct recirc_refs recircs = 
> RECIRC_REFS_EMPTY_INITIALIZER;
> @@ -2915,3 +3010,342 @@ upcall_unixctl_purge(struct unixctl_conn 
> *conn, int argc OVS_UNUSED,
>      }
>      unixctl_command_reply(conn, "");
>  }
> +
> +/* Flows are sorted in the following order:
> + * netdev, flow state (offloaded/kernel path), flow_pps_rate.
> + */
> +static int
> +flow_compare_rebalance(const void *elem1, const void *elem2)
> +{
> +    const struct udpif_key *f1 = *(struct udpif_key **)elem1;
> +    const struct udpif_key *f2 = *(struct udpif_key **)elem2;
> +    int64_t diff;
> +
> +    if (f1->in_netdev < f2->in_netdev) {
> +        return -1;
> +    } else if (f1->in_netdev > f2->in_netdev) {
> +        return 1;
> +    }
> +
> +    if (f1->offloaded != f2->offloaded) {
> +        return f2->offloaded - f1->offloaded;
> +    }
> +
> +    diff = (f1->offloaded == true) ?
> +        f1->flow_pps_rate - f2->flow_pps_rate :
> +        f2->flow_pps_rate - f1->flow_pps_rate;
> +
> +    return (diff < 0) ? -1 : 1;
> +}
> +
> +/* Insert flows from pending array during rebalancing */
> +static int
> +rebalance_insert_pending(struct udpif *udpif, struct udpif_key 
> **pending_flows,
> +                         int pending_count, int insert_count,
> +                         uint64_t rate_threshold)
> +{
> +    int count = 0;
> +
> +    for (int i = 0; i < pending_count; i++) {
> +        struct udpif_key *flow = pending_flows[i];
> +        int err;
> +
> +        /* Stop offloading pending flows if the insert count is
> +         * reached and the flow rate is less than the threshold
> +         */
> +        if (count >= insert_count && flow->flow_pps_rate < 
> rate_threshold) {
> +                break;
> +        }
> +
> +        /* Offload the flow to netdev */
> +        err = udpif_flow_program(udpif, flow, DPIF_OFFLOAD_ALWAYS);
> +
> +        if (err == ENOSPC) {
> +            /* Stop if we are out of resources */

Are we sure we want to stop? We did release X flows, maybe some others 
might fit (See other comments on different type of hw tables).

> +            break;
> +        }
> +
> +        if (err) {
> +            continue;
> +        }
> +
> +        /* Offload succeeded; delete it from the kernel datapath */
> +        udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_NEVER);

No error checking here? for the other order you do it.

> +
> +        /* Change the state of the flow, adjust dpif counters */
> +        flow->offloaded = true;
> +
> +        udpif_set_ukey_backlog_packets(flow);

Are ok with not clearing the flow_packets here? As now some of the 
previous count is used for the new mode while calculating.

> +        count++;
> +    }
> +
> +    return count;
> +}
> +
> +/* Remove flows from offloaded array during rebalancing */
> +static void
> +rebalance_remove_offloaded(struct udpif *udpif,
> +                           struct udpif_key **offloaded_flows,
> +                           int offload_count)
> +{
> +    for (int i = 0; i < offload_count; i++) {
> +        struct udpif_key *flow = offloaded_flows[i];
> +        int err;
> +
> +        /* Install the flow into kernel path first */
> +        err = udpif_flow_program(udpif, flow, DPIF_OFFLOAD_NEVER);
> +        if (err) {
> +            continue;
> +        }
> +
> +        /* Success; now remove offloaded flow from netdev */
> +        err = udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_ALWAYS);
> +        if (err) {
> +            udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_NEVER);
> +            continue;
> +        }
> +        udpif_set_ukey_backlog_packets(flow);

Same here, are ok with not clearing the flow_packets here? As now some 
of the previous count is used for the new mode while calculating.

> +        flow->offloaded = false;
> +    }
> +}
> +
> +/*
> + * Rebalance offloaded flows on a netdev that's in OOR state.
> + *
> + * The rebalancing is done in two phases. In the first phase, we 
> check if
> + * the pending flows can be offloaded (if some resources became 
> available
> + * in the meantime) by trying to offload each pending flow. If all 
> pending
> + * flows get successfully offloaded, the OOR state is cleared on the 
> netdev
> + * and there's nothing to rebalance.
> + *
> + * If some of the pending flows could not be offloaded, i.e, we still 
> see
> + * the OOR error, then we move to the second phase of rebalancing. In 
> this
> + * phase, the rebalancer compares pps-rate of an offloaded flow with 
> the
> + * least pps-rate with that of a pending flow with the highest 
> pps-rate from
> + * their respective sorted arrays. If pps-rate of the offloaded flow 
> is less
> + * than the pps-rate of the pending flow, then it deletes the 
> offloaded flow
> + * from the HW/netdev and adds it to kernel datapath and then 
> offloads pending
> + * to HW/netdev. This process is repeated for every pair of offloaded 
> and
> + * pending flows in the ordered list. The process stops when we 
> encounter an
> + * offloaded flow that has a higher pps-rate than the corresponding 
> pending
> + * flow. The entire rebalancing process is repeated in the next 
> iteration.
> + */
> +static bool
> +rebalance_device(struct udpif *udpif, struct udpif_key 
> **offloaded_flows,
> +                 int offload_count, struct udpif_key **pending_flows,
> +                 int pending_count)
> +{
> +
> +    /* Phase 1 */
> +    int num_inserted = rebalance_insert_pending(udpif, pending_flows,
> +                                                pending_count, 
> pending_count,
> +                                                0);
> +    if (num_inserted) {
> +        VLOG_DBG("Offload rebalance: Phase1: inserted %d pending 
> flows",
> +                  num_inserted);
> +    }
> +
> +    /* Adjust pending array */
> +    pending_flows = &pending_flows[num_inserted];
> +    pending_count -= num_inserted;
> +
> +    if (!pending_count) {
> +        /*
> +         * Successfully offloaded all pending flows. The device
> +         * is no longer in OOR state; done rebalancing this device.
> +         */
> +        return false;
> +    }
> +
> +    /*
> +     * Phase 2; determine how many offloaded flows to churn.
> +     */
> +#define	OFFL_REBAL_MAX_CHURN    1024
> +    int churn_count = 0;
> +    while (churn_count < OFFL_REBAL_MAX_CHURN && churn_count < 
> offload_count
> +           && churn_count < pending_count) {
> +        if (pending_flows[churn_count]->flow_pps_rate <=
> +            offloaded_flows[churn_count]->flow_pps_rate)
                                                             {
> +                break;
             }
> +        churn_count++;
> +    }
> +
> +    if (churn_count) {
> +        VLOG_DBG("Offload rebalance: Phase2: removing %d offloaded 
> flows",
> +                  churn_count);
> +    }
> +
> +    /* Bail early if nothing to churn */
> +    if (!churn_count) {
> +        return true;
> +    }
> +
> +    /* Remove offloaded flows */
> +    rebalance_remove_offloaded(udpif, offloaded_flows, churn_count);
> +
> +    /* Adjust offloaded array */
> +    offloaded_flows = &offloaded_flows[churn_count];
> +    offload_count -= churn_count;
> +
> +    /* Replace offloaded flows with pending flows */
> +    num_inserted = rebalance_insert_pending(udpif, pending_flows,
> +                                            pending_count, 
> churn_count,
> +                                            offload_count ?
> +                                            
> offloaded_flows[0]->flow_pps_rate :
> +                                            0);
> +    if (num_inserted) {
> +        VLOG_DBG("Offload rebalance: Phase2: inserted %d pending 
> flows",
> +                  num_inserted);
> +    }
> +
> +    return true;
> +}
> +
> +static struct udpif_key **
> +udpif_add_oor_flows(struct udpif_key **sort_flows, size_t 
> *total_flow_count,
> +                    size_t *alloc_flow_count, struct udpif_key *ukey)
> +{
> +    if (*total_flow_count >= *alloc_flow_count) {

I might be missing the clue here, but alloc_flow_count is always 0, and 
can't find the place where it is incremented. So are we always writing 
to unallocated memory?

In addition in the sake of performance, we might want to start with 
allocating X number, and increment by X, not by 1.

> +        sort_flows = x2nrealloc(sort_flows, alloc_flow_count, sizeof 
> ukey);
> +    }
> +    sort_flows[(*total_flow_count)++] = ukey;
> +    return sort_flows;
> +}
> +
> +/*
> + * Build sort_flows[] initially with flows that
> + * reference an 'OOR' netdev as their input port.
> + */
> +static struct udpif_key **
> +udpif_build_oor_flows(struct udpif_key **sort_flows, size_t 
> *total_flow_count,
> +                      size_t *alloc_flow_count, struct udpif_key 
> *ukey,
> +                      int *oor_netdev_count)
> +{
> +    struct netdev *netdev;
> +    int count;
> +
> +    /* Input netdev must be available for the flow */
> +    netdev = ukey->in_netdev;
> +    if (!netdev) {
> +        return sort_flows;
> +    }
> +
> +    /* Is the in-netdev for this flow in OOR state ? */
> +    if (!netdev_get_hw_info(netdev, HW_INFO_TYPE_OOR)) {
> +        ukey_netdev_unref(ukey);
> +        return sort_flows;
> +    }
> +
> +    /* Add the flow to sort_flows[] */
> +    sort_flows = udpif_add_oor_flows(sort_flows, total_flow_count,
> +                                      alloc_flow_count, ukey);
> +    if (ukey->offloaded) {
> +        count = netdev_get_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT);
> +        ovs_assert(count >= 0);
> +        if (count++ == 0) {
> +            (*oor_netdev_count)++;
> +        }
> +        netdev_set_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT, count);
> +    } else {
> +        count = netdev_get_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT);
> +        ovs_assert(count >= 0);

We can loose the asserts if we make the value unsigned

> +        netdev_set_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT, ++count);
> +    }
> +
> +    return sort_flows;
> +}
> +
> +/*
> + * Rebalance offloaded flows on HW netdevs that are in OOR state.
> + */
> +static void
> +udpif_flow_rebalance(struct udpif *udpif)
> +{
> +    struct udpif_key **sort_flows = NULL;
> +    size_t alloc_flow_count = 0;
> +    size_t total_flow_count = 0;
> +    int oor_netdev_count = 0;
> +    int offload_index = 0;
> +    int pending_index;
> +
> +    /* Collect flows (offloaded and pending) that reference OOR 
> netdevs */
> +    for (size_t i = 0; i < N_UMAPS; i++) {
> +        struct udpif_key *ukey;
> +        struct umap *umap = &udpif->ukeys[i];
> +
> +        CMAP_FOR_EACH (ukey, cmap_node, &umap->cmap) {
> +            ukey_to_flow_netdev(udpif, ukey);
> +            sort_flows = udpif_build_oor_flows(Sort_flows, 
> &total_flow_count,
> +                                               &alloc_flow_count, 
> ukey,
> +                                               &oor_netdev_count);
> +        }
> +    }
> +
> +    /* Sort flows by OOR netdevs, state (offloaded/pending) and 
> pps-rate  */
> +    qsort(sort_flows, total_flow_count, sizeof(struct udpif_key *),
> +          flow_compare_rebalance);
> +
> +    /*
> +     * We now have flows referencing OOR netdevs, that are sorted. We 
> also
> +     * have a count of offloaded and pending flows on each of the 
> netdevs
> +     * that are in OOR state. Now rebalance each oor-netdev.
> +     */
> +    while (oor_netdev_count) {
> +        struct netdev *netdev;
> +        int offload_count;
> +        int pending_count;
> +        bool oor;
> +
> +        netdev = sort_flows[offload_index]->in_netdev;
> +        ovs_assert(netdev_get_hw_info(netdev, HW_INFO_TYPE_OOR) == 
> true);

Think you should loose the assert, as you just checked this in 
udpif_build_oor_flows()

If you are trying to safeguard stuff here I would make sure that none of 
the indexes used are above the total_flow_count value.

> +        VLOG_DBG("Offload rebalance: netdev: %s is OOR", 
> netdev->name);
> +
> +        offload_count = netdev_get_hw_info(netdev, 
> HW_INFO_TYPE_OFFL_COUNT);
> +        pending_count = netdev_get_hw_info(netdev, 
> HW_INFO_TYPE_PEND_COUNT);
> +        pending_index = offload_index + offload_count;
> +
> +        oor = rebalance_device(udpif,
> +                               &sort_flows[offload_index], 
> offload_count,
> +                               &sort_flows[pending_index], 
> pending_count);
> +        netdev_set_hw_info(netdev, HW_INFO_TYPE_OOR, oor);
> +
> +        offload_index = pending_index + pending_count;
> +        netdev_set_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT, 0);
> +        netdev_set_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT, 0);
> +        oor_netdev_count--;
> +    }
> +
> +    for (int i = 0; i < total_flow_count; i++) {
> +        struct udpif_key *ukey = sort_flows[i];
> +        ukey_netdev_unref(ukey);
> +    }
> +    free(sort_flows);
> +}
> +
> +static int
> +udpif_flow_program(struct udpif *udpif, struct udpif_key *ukey,
> +                   enum dpif_offload_type offload_type)
> +{
> +    struct dpif_op *opsp;
> +    struct ukey_op uop;
> +
> +    opsp = &uop.dop;
> +    put_op_init(&uop, ukey, DPIF_FP_CREATE);
> +    dpif_operate(udpif->dpif, &opsp, 1, offload_type);
> +
> +    return opsp->error;
> +}
> +
> +static int
> +udpif_flow_unprogram(struct udpif *udpif, struct udpif_key *ukey,
> +                     enum dpif_offload_type offload_type)
> +{
> +    struct dpif_op *opsp;
> +    struct ukey_op uop;
> +
> +    opsp = &uop.dop;
> +    delete_op_init(udpif, &uop, ukey);
> +    dpif_operate(udpif->dpif, &opsp, 1, offload_type);
> +
> +    return opsp->error;
> +}
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index f05f616fe..2bfe4ff24 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -519,6 +519,27 @@
>          </p>
>      </column>
>
> +      <column name="other_config" key="offload-rebalance"
> +              type='{"type": "boolean"}'>
> +        <p>
> +            Configures HW offload rebalancing, that allows to 
> dynamically
> +            offload and un-offload flows while an offload-device is 
> out of
> +            resources (OOR). This policy allows flows to be selected 
> for
> +            offloading based on the packets-per-second (pps) rate of 
> flows.
> +        </p>
> +        <p>
> +          Set this value to <code>true</code> to enable this option.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>. Changing this 
> value requires
> +          restarting the daemon.
> +        </p>
> +        <p>
> +            This is only relevant if HW offloading is enabled 
> (hw-offload).
> +            When this policy is enabled, it also requires 'tc-policy' 
> to
> +            be set to 'skip_sw'.
> +        </p>
> +      </column>
>      </group>
>
>      <group title="Status">
> -- 
> 2.18.0.rc1.1.g6f333ff
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Li,Rongqing via dev Oct. 25, 2018, 2:02 p.m. UTC | #3
Hi Eelco,

Thanks for your review comments. Please see my response below.
On Fri, Oct 19, 2018 at 7:53 PM Eelco Chaudron <echaudro@redhat.com> wrote:
>
> On 18 Oct 2018, at 18:13, Sriharsha Basavapatna via dev wrote:
>
> This is the third patch in the patch-set to support dynamic rebalancing
> of offloaded flows.
>
> The dynamic rebalancing functionality is implemented in this patch. The
> ukeys that are not scheduled for deletion are obtained and passed as input
> to the rebalancing routine. The rebalancing is done in the context of
> revalidation leader thread, after all other revalidator threads are
> done with gathering rebalancing data for flows.
>
> For each netdev that is in OOR state, a list of flows - both offloaded
> and non-offloaded (pending) - is obtained using the ukeys. For each netdev
> that is in OOR state, the flows are grouped and sorted into offloaded and
> pending flows. The offloaded flows are sorted in descending order of
> pps-rate, while pending flows are sorted in ascending order of pps-rate.
>
> The rebalancing is done in two phases. In the first phase, we try to
> offload all pending flows and if that succeeds, the OOR state on the device
> is cleared. If some (or none) of the pending flows could not be offloaded,
> then we start replacing an offloaded flow that has a lower pps-rate than
> a pending flow, until there are no more pending flows with a higher rate
> than an offloaded flow. The flows that are replaced from the device are
> added into kernel datapath.
>
> A new OVS configuration parameter "offload-rebalance", is added to ovsdb.
> The default value of this is "false". To enable this feature, set the
> value of this parameter to "true", which provides packets-per-second
> rate based policy to dynamically offload and un-offload flows.
>
> Note: This option can be enabled only when 'hw-offload' policy is enabled.
> It also requires 'tc-policy' to be set to 'skip_sw'; otherwise, flow
> offload errors (specifically ENOSPC error this feature depends on) reported
> by an offloaded device are supressed by TC-Flower kernel module.
>
> Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
> Co-authored-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
> Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
> Reviewed-by: Sathya Perla <sathya.perla@broadcom.com>
> Reviewed-by: Simon Horman <simon.horman@netronome.com>
> Reviewed-by: Ben Pfaff <blp@ovn.org>
> ---
> NEWS | 3 +
> lib/dpif-netdev.c | 3 +-
> lib/dpif-netlink.c | 29 ++-
> lib/dpif-provider.h | 8 +-
> lib/dpif.c | 30 ++-
> lib/dpif.h | 12 +-
> lib/netdev-provider.h | 7 +-
> lib/netdev.c | 62 ++++-
> lib/netdev.h | 1 +
> ofproto/ofproto-dpif-upcall.c | 446 +++++++++++++++++++++++++++++++++-
> vswitchd/vswitch.xml | 21 ++
> 11 files changed, 592 insertions(+), 30 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index 33b4d8a23..846b46fb5 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -8,6 +8,9 @@ Post-v2.10.0
> as the default timeout for control utilities.
> - ovn:
> * ovn-ctl: allow passing user:group ids to the OVN daemons.
> + - ovs-vswitchd:
> + * New configuration option "offload-rebalance", that enables dynamic
> + rebalancing of offloaded flows.
>
>
> v2.10.0 - xx xxx xxxx
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 7c0300cc5..1c01d2278 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3689,7 +3689,8 @@ dpif_netdev_execute(struct dpif *dpif, struct dpif_execute *execute)
> }
>
> static void
> -dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops)
> +dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
> + enum dpif_offload_type offload_type OVS_UNUSED)
> {
> size_t i;
>
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index b9ce9cbe2..2e01f5750 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -2281,7 +2281,8 @@ dpif_netlink_operate_chunks(struct dpif_netlink *dpif, struct dpif_op **ops,
> }
>
> static void
> -dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops)
> +dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops,
> + enum dpif_offload_type offload_type)
> {
> struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
> struct dpif_op *new_ops[OPERATE_MAX_OPS];
> @@ -2289,7 +2290,12 @@ dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops)
> int i = 0;
> int err = 0;
>
> - if (netdev_is_flow_api_enabled()) {
> + if (offload_type == DPIF_OFFLOAD_ALWAYS && !netdev_is_flow_api_enabled()) {
> + VLOG_DBG("Invalid offload_type: %d", offload_type);
>
> Here we are not returning any errors just a silent return, should we return EINVAL instead?
The interface has no return value (void).
>
> + return;
> + }
> +
> + if (offload_type != DPIF_OFFLOAD_NEVER && netdev_is_flow_api_enabled()) {
> while (n_ops > 0) {
> count = 0;
>
> @@ -2298,6 +2304,23 @@ dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops)
>
> err = try_send_to_netdev(dpif, op);
> if (err && err != EEXIST) {
> + if (offload_type == DPIF_OFFLOAD_ALWAYS) {
> + /* We got an error while offloading an op. Since
> + * OFFLOAD_ALWAYS is specified, we stop further
> + * processing and return to the caller without
> + * invoking kernel datapath as fallback. But the
> + * interface requires us to process all n_ops; so
> + * return the same error in the remaining ops too.
> + */
>
> Why are we ok with failing all possible operations here? What if we are doing a FLOW_PUT and FLOW_DEL in a set?
> Also one add might fail but another might pass, due to specific resources, i.e. seperate hw resources for excact match, and masked matches?

Currently we don't support combining PUT/DEL operations when offload_type is
DPIF_OFFLOAD_ALWAYS. May be we should add a check or document the interface.
Regarding separate hw resource tables, I responded to your comment on the
cover letter patch.
>
> + op->error = err;
> + n_ops--;
> + while (n_ops > 0) {
> + op = ops[i++];
> + op->error = err;
> + n_ops--;
> + }
> + return;
> + }
> new_ops[count++] = op;
> } else {
> op->error = err;
> @@ -2308,7 +2331,7 @@ dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops)
>
> dpif_netlink_operate_chunks(dpif, new_ops, count);
> }
> - } else {
> + } else if (offload_type != DPIF_OFFLOAD_ALWAYS) {
> dpif_netlink_operate_chunks(dpif, ops, n_ops);
>
>      } else {
>      What should happen in this case? silently pass, if so we might want to clear all the errors just in case?
>    }

There's no need for this else case; it is handled by the if() condition at
the beginning of the function (if (offload_type == DPIF_OFFLOAD_ALWAYS && ...))
>
> }
> }
> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
> index 7a71f5c0a..a30de740f 100644
> --- a/lib/dpif-provider.h
> +++ b/lib/dpif-provider.h
> @@ -296,12 +296,14 @@ struct dpif_class {
>
> int (*flow_dump_next)(struct dpif_flow_dump_thread *thread,
> struct dpif_flow *flows, int max_flows);
> -
> /* Executes each of the 'n_ops' operations in 'ops' on 'dpif', in the order
> * in which they are specified, placing each operation's results in the
> * "output" members documented in comments and the 'error' member of each
> - * dpif_op. */
> - void (*operate)(struct dpif *dpif, struct dpif_op **ops, size_t n_ops);
> + * dpif_op. The offload_type argument tells the provider if 'ops' should
> + * be submitted to to a netdev (only offload) or to the kernel datapath
> + * (never offload) or to both (offload if possible; software fallback). */
> + void (*operate)(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
> + enum dpif_offload_type offload_type);
>
> /* Enables or disables receiving packets with dpif_recv() for 'dpif'.
> * Turning packet receive off and then back on is allowed to change Netlink
> diff --git a/lib/dpif.c b/lib/dpif.c
> index d799f972c..65880b86a 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -49,6 +49,7 @@
> #include "valgrind.h"
> #include "openvswitch/ofp-errors.h"
> #include "openvswitch/vlog.h"
> +#include "lib/netdev-provider.h"
>
> VLOG_DEFINE_THIS_MODULE(dpif);
>
> @@ -1015,7 +1016,7 @@ dpif_flow_get(struct dpif *dpif,
> op.flow_get.flow->key_len = key_len;
>
> opp = &op;
> - dpif_operate(dpif, &opp, 1);
> + dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>
> return op.error;
> }
> @@ -1045,7 +1046,7 @@ dpif_flow_put(struct dpif *dpif, enum dpif_flow_put_flags flags,
> op.flow_put.stats = stats;
>
> opp = &op;
> - dpif_operate(dpif, &opp, 1);
> + dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>
> return op.error;
> }
> @@ -1068,7 +1069,7 @@ dpif_flow_del(struct dpif *dpif,
> op.flow_del.terse = false;
>
> opp = &op;
> - dpif_operate(dpif, &opp, 1);
> + dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>
> return op.error;
> }
> @@ -1325,7 +1326,7 @@ dpif_execute(struct dpif *dpif, struct dpif_execute *execute)
> op.execute = *execute;
>
> opp = &op;
> - dpif_operate(dpif, &opp, 1);
> + dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>
> return op.error;
> } else {
> @@ -1336,10 +1337,21 @@ dpif_execute(struct dpif *dpif, struct dpif_execute *execute)
> /* Executes each of the 'n_ops' operations in 'ops' on 'dpif', in the order in
> * which they are specified. Places each operation's results in the "output"
> * members documented in comments, and 0 in the 'error' member on success or a
> - * positive errno on failure. */
> + * positive errno on failure.
> + */
>
> Guess this should be undone as all comments in this file use this style of closing.

agreed.
>
> void
> -dpif_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops)
> -{
> +dpif_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
> + enum dpif_offload_type offload_type)
> +{
> + if (offload_type == DPIF_OFFLOAD_ALWAYS && !netdev_is_flow_api_enabled()) {
> + size_t i;
> + for (i = 0; i < n_ops; i++) {
> + struct dpif_op *op = ops[i];
> + op->error = EINVAL;
> + }
> + return;
> + }
> +
> while (n_ops > 0) {
> size_t chunk;
>
> @@ -1360,7 +1372,7 @@ dpif_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops)
> * handle itself, without help. */
> size_t i;
>
> - dpif->dpif_class->operate(dpif, ops, chunk);
> + dpif->dpif_class->operate(dpif, ops, chunk, offload_type);
>
> for (i = 0; i < chunk; i++) {
> struct dpif_op *op = ops[i];
> @@ -1657,7 +1669,7 @@ dpif_queue_to_priority(const struct dpif *dpif, uint32_t queue_id,
> log_operation(dpif, "queue_to_priority", error);
> return error;
> }
> -
> +
>
> Removed by accident?
There was ^L char here; I deleted it.
>
> void
> dpif_init(struct dpif *dpif, const struct dpif_class *dpif_class,
> const char *name,
> diff --git a/lib/dpif.h b/lib/dpif.h
> index bbdc3eb6c..0675ab19f 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -614,6 +614,13 @@ enum dpif_op_type {
> DPIF_OP_FLOW_GET,
> };
>
> +/* offload_type argument types to (*operate) interface */
> +enum dpif_offload_type {
> + DPIF_OFFLOAD_AUTO, /* Offload if possible, fallback to software. */
> + DPIF_OFFLOAD_NEVER, /* Never offload to hardware. */
> + DPIF_OFFLOAD_ALWAYS, /* Always offload to hardware. */
> +};
> +
> /* Add or modify a flow.
> *
> * The flow is specified by the Netlink attributes with types OVS_KEY_ATTR_* in
> @@ -768,8 +775,9 @@ struct dpif_op {
> };
> };
>
> -void dpif_operate(struct dpif *, struct dpif_op **ops, size_t n_ops);
> -
>
> Same here?
>
> +void dpif_operate(struct dpif *, struct dpif_op **ops, size_t n_ops,
> + enum dpif_offload_type);
> +
> /* Upcalls. */
>
> enum dpif_upcall_type {
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index e320dad61..fb0c27e6e 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -38,10 +38,14 @@ struct netdev_tnl_build_header_params;
> /* Offload-capable (HW) netdev information */
> struct netdev_hw_info {
> bool oor; /* Out of Offload Resources ? */
> + int offload_count; /* Pending (non-offloaded) flow count */
> + int pending_count; /* Offloaded flow count */
>
> Guess they should be uint32_t assuming we do not expect negative counts.

They could be; but we are doing some subtractions (& comparisons), I'd
prefer to have them as signed ints to debug any related issues (i.e,
avoid confusion from unsigned wrap around).
>
> };
>
> enum hw_info_type {
> - HW_INFO_TYPE_OOR = 1 /* OOR state */
> + HW_INFO_TYPE_OOR = 1, /* OOR state */
> + HW_INFO_TYPE_PEND_COUNT = 2, /* Pending(non-offloaded) flow count */
> + HW_INFO_TYPE_OFFL_COUNT = 3 /* Offloaded flow count */
> };
>
> /* A network device (e.g. an Ethernet device).
> @@ -89,7 +93,6 @@ struct netdev {
> int n_rxq;
> struct shash_node *node; /* Pointer to element in global map. */
> struct ovs_list saved_flags_list; /* Contains "struct netdev_saved_flags". */
> -
> struct netdev_hw_info hw_info; /* offload-capable netdev info */
> };
>
> diff --git a/lib/netdev.c b/lib/netdev.c
> index f3fa08ca3..5d7f9c89b 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -2260,11 +2260,23 @@ netdev_get_block_id(struct netdev *netdev)
> int
> netdev_get_hw_info(struct netdev *netdev, int type)
> {
> - if (type == HW_INFO_TYPE_OOR) {
> - return netdev->hw_info.oor;
> + int val = -1;
> +
> + switch (type) {
> + case HW_INFO_TYPE_OOR:
> + val = netdev->hw_info.oor;
> + break;
> + case HW_INFO_TYPE_PEND_COUNT:
> + val = netdev->hw_info.pending_count;
> + break;
> + case HW_INFO_TYPE_OFFL_COUNT:
> + val = netdev->hw_info.offload_count;
> + break;
> + default:
> + break;
> }
>
> - return -1;
> + return val;
> }
>
> /*
> @@ -2273,9 +2285,47 @@ netdev_get_hw_info(struct netdev *netdev, int type)
> void
> netdev_set_hw_info(struct netdev *netdev, int type, int val)
> {
> - if (type == HW_INFO_TYPE_OOR) {
> + switch (type) {
> + case HW_INFO_TYPE_OOR:
> + if (val == 0) {
> + VLOG_DBG("Offload rebalance: netdev: %s is not OOR", netdev->name);
> + }
> netdev->hw_info.oor = val;
> + break;
> + case HW_INFO_TYPE_PEND_COUNT:
> + netdev->hw_info.pending_count = val;
> + break;
> + case HW_INFO_TYPE_OFFL_COUNT:
> + netdev->hw_info.offload_count = val;
> + break;
> + default:
> + break;
> + }
> +}
>
> See comment in first patchset about creating separate functions for this.
>
> +
> +/*
> + * Find if any netdev is in OOR state. Return true if there's at least
> + * one netdev that's in OOR state; otherwise return false.
> + */
> +bool
> +netdev_any_oor(void)
> + OVS_EXCLUDED(netdev_mutex)
> +{
> + struct shash_node *node;
> + bool oor = false;
> +
> + ovs_mutex_lock(&netdev_mutex);
> + SHASH_FOR_EACH (node, &netdev_shash) {
> + struct netdev *dev = node->data;
> +
> + if (dev->hw_info.oor) {
> + oor = true;
> + break;
> + }
> }
> + ovs_mutex_unlock(&netdev_mutex);
> +
> + return oor;
> }
>
> bool
> @@ -2549,6 +2599,10 @@ netdev_set_flow_api_enabled(const struct smap *ovs_other_config)
> tc_set_policy(smap_get_def(ovs_other_config, "tc-policy",
> TC_POLICY_DEFAULT));
>
> + if (smap_get_bool(ovs_other_config, "offload-rebalance", false)) {
> + netdev_offload_rebalance_policy = true;
> + }
> +
> netdev_ports_flow_init();
>
> ovsthread_once_done(&once);
> diff --git a/lib/netdev.h b/lib/netdev.h
> index b0e5c5b72..373be7cc0 100644
> --- a/lib/netdev.h
> +++ b/lib/netdev.h
> @@ -229,6 +229,7 @@ int netdev_init_flow_api(struct netdev *);
> uint32_t netdev_get_block_id(struct netdev *);
> int netdev_get_hw_info(struct netdev *, int);
> void netdev_set_hw_info(struct netdev *, int, int);
> +bool netdev_any_oor(void);
> bool netdev_is_flow_api_enabled(void);
> void netdev_set_flow_api_enabled(const struct smap *ovs_other_config);
> bool netdev_is_offload_rebalance_policy_enabled(void);
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index a372d6252..bb9e61b7c 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -22,6 +22,7 @@
> #include "connmgr.h"
> #include "coverage.h"
> #include "cmap.h"
> +#include "lib/dpif-provider.h"
> #include "dpif.h"
> #include "openvswitch/dynamic-string.h"
> #include "fail-open.h"
> @@ -42,7 +43,6 @@
> #include "tunnel.h"
> #include "unixctl.h"
> #include "openvswitch/vlog.h"
> -#include "lib/dpif-provider.h"
> #include "lib/netdev-provider.h"
>
> #define MAX_QUEUE_LENGTH 512
> @@ -182,6 +182,8 @@ struct udpif {
> uint64_t conn_seq; /* Corresponds to 'dump_seq' when
> conns[n_conns-1] was stored. */
> size_t n_conns; /* Number of connections waiting. */
> +
> + long long int offload_rebalance_time; /* Time of last offload rebalance */
> };
>
> enum upcall_type {
> @@ -308,6 +310,7 @@ struct udpif_key {
> struct recirc_refs recircs; /* Action recirc IDs with references held. */
>
> #define OFFL_REBAL_INTVL_MSEC 3000 /* dynamic offload rebalance freq */
> + struct netdev *in_netdev; /* in_odp_port's netdev */
> bool offloaded; /* True if flow is offloaded */
> uint64_t flow_pps_rate; /* Packets-Per-Second rate */
> long long int flow_time; /* last pps update time */
> @@ -396,6 +399,12 @@ static int upcall_receive(struct upcall *, const struct dpif_backer *,
> const ovs_u128 *ufid, const unsigned pmd_id);
> static void upcall_uninit(struct upcall *);
>
> +static void udpif_flow_rebalance(struct udpif *udpif);
> +static int udpif_flow_program(struct udpif *udpif, struct udpif_key *ukey,
> + enum dpif_offload_type offload_type);
> +static int udpif_flow_unprogram(struct udpif *udpif, struct udpif_key *ukey,
> + enum dpif_offload_type offload_type);
> +
> static upcall_callback upcall_cb;
> static dp_purge_callback dp_purge_cb;
>
> @@ -567,6 +576,7 @@ udpif_start_threads(struct udpif *udpif, size_t n_handlers_,
> ovs_barrier_init(&udpif->pause_barrier, udpif->n_revalidators + 1);
> udpif->reval_exit = false;
> udpif->pause = false;
> + udpif->offload_rebalance_time = time_msec();
> udpif->revalidators = xzalloc(udpif->n_revalidators
> * sizeof *udpif->revalidators);
> for (size_t i = 0; i < udpif->n_revalidators; i++) {
> @@ -859,6 +869,26 @@ free_dupcall:
> return n_upcalls;
> }
>
> +static void
> +udpif_run_flow_rebalance(struct udpif *udpif)
> +{
> + long long int now = 0;
> +
> + /* Don't rebalance if OFFL_REBAL_INTVL_MSEC have not elapsed */
> + now = time_msec();
> + if (now < udpif->offload_rebalance_time + OFFL_REBAL_INTVL_MSEC) {
> + return;
> + }
> +
> + if (!netdev_any_oor()) {
> + return;
> + }
> +
> + VLOG_DBG("Offload rebalance: Found OOR netdevs");
>
> Can we add a coverage counter here?
>
>    COVERAGE_INC(offload_flow_rebalance);

Sure; thanks for suggesting this.
>
> + udpif->offload_rebalance_time = now;
> + udpif_flow_rebalance(udpif);
> +}
> +
> static void *
> udpif_revalidator(void *arg)
> {
> @@ -933,6 +963,9 @@ udpif_revalidator(void *arg)
>
> dpif_flow_dump_destroy(udpif->dump);
> seq_change(udpif->dump_seq);
> + if (netdev_is_offload_rebalance_policy_enabled()) {
> + udpif_run_flow_rebalance(udpif);
> + }
>
> duration = MAX(time_msec() - start_time, 1);
> udpif->dump_duration = duration;
> @@ -977,7 +1010,7 @@ udpif_revalidator(void *arg)
>
> return NULL;
> }
> -
>
> +
> static enum upcall_type
> classify_upcall(enum dpif_upcall_type type, const struct nlattr *userdata,
> struct user_action_cookie *cookie)
> @@ -1579,7 +1612,7 @@ handle_upcalls(struct udpif *udpif, struct upcall *upcalls,
> for (i = 0; i < n_ops; i++) {
> opsp[n_opsp++] = &ops[i].dop;
> }
> - dpif_operate(udpif->dpif, opsp, n_opsp);
> + dpif_operate(udpif->dpif, opsp, n_opsp, DPIF_OFFLOAD_AUTO);
> for (i = 0; i < n_ops; i++) {
> struct udpif_key *ukey = ops[i].ukey;
>
> @@ -1671,13 +1704,13 @@ ukey_create__(const struct nlattr *key, size_t key_len,
> ukey->state = UKEY_CREATED;
> ukey->state_thread = ovsthread_id_self();
> ukey->state_where = OVS_SOURCE_LOCATOR;
> - ukey->created = time_msec();
> + ukey->created = ukey->flow_time = time_msec();
> memset(&ukey->stats, 0, sizeof ukey->stats);
> ukey->stats.used = used;
> ukey->xcache = NULL;
>
> ukey->offloaded = false;
> - ukey->flow_time = 0;
> + ukey->in_netdev = NULL;
> ukey->flow_packets = ukey->flow_backlog_packets = 0;
>
> ukey->key_recirc_id = key_recirc_id;
> @@ -2329,7 +2362,7 @@ push_dp_ops(struct udpif *udpif, struct ukey_op *ops, size_t n_ops)
> for (i = 0; i < n_ops; i++) {
> opsp[i] = &ops[i].dop;
> }
> - dpif_operate(udpif->dpif, opsp, n_ops);
> + dpif_operate(udpif->dpif, opsp, n_ops, DPIF_OFFLOAD_AUTO);
>
> for (i = 0; i < n_ops; i++) {
> struct ukey_op *op = &ops[i];
> @@ -2455,6 +2488,57 @@ reval_op_init(struct ukey_op *op, enum reval_result result,
> }
> }
>
> +static void
> +ukey_netdev_unref(struct udpif_key *ukey)
> +{
> + if (!ukey->in_netdev) {
> + return;
> + }
> + netdev_close(ukey->in_netdev);
> + ukey->in_netdev = NULL;
> +}
> +
> +/*
> + * Given a udpif_key, get its input port (netdev) by parsing the flow keys
> + * and actions. The flow may not contain flow attributes if it is a terse
> + * dump; read its attributes from the ukey and then parse the flow to get
> + * the port info. Save them in udpif_key.
> + */
> +static void
> +ukey_to_flow_netdev(struct udpif *udpif, struct udpif_key *ukey)
> +{
> + const struct dpif *dpif = udpif->dpif;
> + const struct dpif_class *dpif_class = dpif->dpif_class;
> + const struct nlattr *k;
> + unsigned int left;
> +
> + /* Remove existing references to netdev */
> + ukey_netdev_unref(ukey);
>
> If for some reason we already have a reference, we should just return it, as the netdev can not change.

We release any previous reference held by us (rebalancing code). If you see
the implementation, we just return if in_netdev is NULL:
    if (!ukey->in_netdev) {
        return;
    }
We should hit this condition since we would have released references
at the end of previous iteration of rebalance. So I agree, we can remove this
call to ukey_netdev_unref() here, but let me do some experiments and confirm
it.

>
> +
> + /* Find the input port and get a reference to its netdev */
> + NL_ATTR_FOR_EACH (k, left, ukey->key, ukey->key_len) {
> + enum ovs_key_attr type = nl_attr_type(k);
> +
> + if (type == OVS_KEY_ATTR_IN_PORT) {
> + ukey->in_netdev = netdev_ports_get(nl_attr_get_odp_port(k),
> + dpif_class);
> + } else if (type == OVS_KEY_ATTR_TUNNEL) {
> + struct flow_tnl tnl;
> + enum odp_key_fitness res;
> +
> + if (ukey->in_netdev) {
> + netdev_close(ukey->in_netdev);
> + ukey->in_netdev = NULL;
> + }
> + res = odp_tun_key_from_attr(k, &tnl);
> + if (res != ODP_FIT_ERROR) {
> + ukey->in_netdev = flow_get_tunnel_netdev(&tnl);
> + break;
> + }
> + }
> + }
> +}
> +
> static uint64_t
> udpif_flow_packet_delta(struct udpif_key *ukey, const struct dpif_flow *f)
> {
> @@ -2468,6 +2552,16 @@ udpif_flow_time_delta(struct udpif *udpif, struct udpif_key *ukey)
> return (udpif->dpif->current_ms - ukey->flow_time) / 1000;
> }
>
> +/*
> + * Save backlog packet count while switching modes
> + * between offloaded and kernel datapaths.
> + */
> +static void
> +udpif_set_ukey_backlog_packets(struct udpif_key *ukey)
> +{
> + ukey->flow_backlog_packets = ukey->flow_packets;
> +}
> +
> /* Gather pps-rate for the given dpif_flow and save it in its ukey */
> static void
> udpif_update_flow_pps(struct udpif *udpif, struct udpif_key *ukey,
> @@ -2539,6 +2633,7 @@ revalidate(struct revalidator *revalidator)
> kill_them_all = n_dp_flows > flow_limit * 2;
> max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle;
>
> + udpif->dpif->current_ms = time_msec();
>
> Can we not use udpif->dpif->current_ms = now, as it's set two lines above.

Yes.
>
> for (f = flows; f < &flows[n_dumped]; f++) {
> long long int used = f->stats.used;
> struct recirc_refs recircs = RECIRC_REFS_EMPTY_INITIALIZER;
> @@ -2915,3 +3010,342 @@ upcall_unixctl_purge(struct unixctl_conn *conn, int argc OVS_UNUSED,
> }
> unixctl_command_reply(conn, "");
> }
> +
> +/* Flows are sorted in the following order:
> + * netdev, flow state (offloaded/kernel path), flow_pps_rate.
> + */
> +static int
> +flow_compare_rebalance(const void *elem1, const void *elem2)
> +{
> + const struct udpif_key *f1 = *(struct udpif_key **)elem1;
> + const struct udpif_key *f2 = *(struct udpif_key **)elem2;
> + int64_t diff;
> +
> + if (f1->in_netdev < f2->in_netdev) {
> + return -1;
> + } else if (f1->in_netdev > f2->in_netdev) {
> + return 1;
> + }
> +
> + if (f1->offloaded != f2->offloaded) {
> + return f2->offloaded - f1->offloaded;
> + }
> +
> + diff = (f1->offloaded == true) ?
> + f1->flow_pps_rate - f2->flow_pps_rate :
> + f2->flow_pps_rate - f1->flow_pps_rate;
> +
> + return (diff < 0) ? -1 : 1;
> +}
> +
> +/* Insert flows from pending array during rebalancing */
> +static int
> +rebalance_insert_pending(struct udpif *udpif, struct udpif_key **pending_flows,
> + int pending_count, int insert_count,
> + uint64_t rate_threshold)
> +{
> + int count = 0;
> +
> + for (int i = 0; i < pending_count; i++) {
> + struct udpif_key *flow = pending_flows[i];
> + int err;
> +
> + /* Stop offloading pending flows if the insert count is
> + * reached and the flow rate is less than the threshold
> + */
> + if (count >= insert_count && flow->flow_pps_rate < rate_threshold) {
> + break;
> + }
> +
> + /* Offload the flow to netdev */
> + err = udpif_flow_program(udpif, flow, DPIF_OFFLOAD_ALWAYS);
> +
> + if (err == ENOSPC) {
> + /* Stop if we are out of resources */
>
> Are we sure we want to stop? We did release X flows, maybe some others might fit (See other comments on different type of hw tables).

That's a good point. I agree, it might be possible to continue adding flows
when we have different hw tables. This can be taken up as a part of further
additions to this feature.
>
> + break;
> + }
> +
> + if (err) {
> + continue;
> + }
> +
> + /* Offload succeeded; delete it from the kernel datapath */
> + udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_NEVER);
>
> No error checking here? for the other order you do it.

Yes, we can add error checking here.
>
> +
> + /* Change the state of the flow, adjust dpif counters */
> + flow->offloaded = true;
> +
> + udpif_set_ukey_backlog_packets(flow);
>
> Are ok with not clearing the flow_packets here? As now some of the previous count is used for the new mode while calculating.

This is ok since we update ukey->flow_packets in the next iteration in
udpif_update_flow_pps().
>
> + count++;
> + }
> +
> + return count;
> +}
> +
> +/* Remove flows from offloaded array during rebalancing */
> +static void
> +rebalance_remove_offloaded(struct udpif *udpif,
> + struct udpif_key **offloaded_flows,
> + int offload_count)
> +{
> + for (int i = 0; i < offload_count; i++) {
> + struct udpif_key *flow = offloaded_flows[i];
> + int err;
> +
> + /* Install the flow into kernel path first */
> + err = udpif_flow_program(udpif, flow, DPIF_OFFLOAD_NEVER);
> + if (err) {
> + continue;
> + }
> +
> + /* Success; now remove offloaded flow from netdev */
> + err = udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_ALWAYS);
> + if (err) {
> + udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_NEVER);
> + continue;
> + }
> + udpif_set_ukey_backlog_packets(flow);
>
> Same here, are ok with not clearing the flow_packets here? As now some of the previous count is used for the new mode while calculating.
>
> + flow->offloaded = false;
> + }
> +}
> +
> +/*
> + * Rebalance offloaded flows on a netdev that's in OOR state.
> + *
> + * The rebalancing is done in two phases. In the first phase, we check if
> + * the pending flows can be offloaded (if some resources became available
> + * in the meantime) by trying to offload each pending flow. If all pending
> + * flows get successfully offloaded, the OOR state is cleared on the netdev
> + * and there's nothing to rebalance.
> + *
> + * If some of the pending flows could not be offloaded, i.e, we still see
> + * the OOR error, then we move to the second phase of rebalancing. In this
> + * phase, the rebalancer compares pps-rate of an offloaded flow with the
> + * least pps-rate with that of a pending flow with the highest pps-rate from
> + * their respective sorted arrays. If pps-rate of the offloaded flow is less
> + * than the pps-rate of the pending flow, then it deletes the offloaded flow
> + * from the HW/netdev and adds it to kernel datapath and then offloads pending
> + * to HW/netdev. This process is repeated for every pair of offloaded and
> + * pending flows in the ordered list. The process stops when we encounter an
> + * offloaded flow that has a higher pps-rate than the corresponding pending
> + * flow. The entire rebalancing process is repeated in the next iteration.
> + */
> +static bool
> +rebalance_device(struct udpif *udpif, struct udpif_key **offloaded_flows,
> + int offload_count, struct udpif_key **pending_flows,
> + int pending_count)
> +{
> +
> + /* Phase 1 */
> + int num_inserted = rebalance_insert_pending(udpif, pending_flows,
> + pending_count, pending_count,
> + 0);
> + if (num_inserted) {
> + VLOG_DBG("Offload rebalance: Phase1: inserted %d pending flows",
> + num_inserted);
> + }
> +
> + /* Adjust pending array */
> + pending_flows = &pending_flows[num_inserted];
> + pending_count -= num_inserted;
> +
> + if (!pending_count) {
> + /*
> + * Successfully offloaded all pending flows. The device
> + * is no longer in OOR state; done rebalancing this device.
> + */
> + return false;
> + }
> +
> + /*
> + * Phase 2; determine how many offloaded flows to churn.
> + */
> +#define OFFL_REBAL_MAX_CHURN 1024
> + int churn_count = 0;
> + while (churn_count < OFFL_REBAL_MAX_CHURN && churn_count < offload_count
> + && churn_count < pending_count) {
> + if (pending_flows[churn_count]->flow_pps_rate <=
> + offloaded_flows[churn_count]->flow_pps_rate)
>
>                                                         {
>
> + break;
>
>         }
>
> + churn_count++;
> + }
> +
> + if (churn_count) {
> + VLOG_DBG("Offload rebalance: Phase2: removing %d offloaded flows",
> + churn_count);
> + }
> +
> + /* Bail early if nothing to churn */
> + if (!churn_count) {
> + return true;
> + }
> +
> + /* Remove offloaded flows */
> + rebalance_remove_offloaded(udpif, offloaded_flows, churn_count);
> +
> + /* Adjust offloaded array */
> + offloaded_flows = &offloaded_flows[churn_count];
> + offload_count -= churn_count;
> +
> + /* Replace offloaded flows with pending flows */
> + num_inserted = rebalance_insert_pending(udpif, pending_flows,
> + pending_count, churn_count,
> + offload_count ?
> + offloaded_flows[0]->flow_pps_rate :
> + 0);
> + if (num_inserted) {
> + VLOG_DBG("Offload rebalance: Phase2: inserted %d pending flows",
> + num_inserted);
> + }
> +
> + return true;
> +}
> +
> +static struct udpif_key **
> +udpif_add_oor_flows(struct udpif_key **sort_flows, size_t *total_flow_count,
> + size_t *alloc_flow_count, struct udpif_key *ukey)
> +{
> + if (*total_flow_count >= *alloc_flow_count) {
>
> I might be missing the clue here, but alloc_flow_count is always 0, and can't find the place where it is incremented. So are we always writing to unallocated memory?

There is no problem here; &alloc_flow_count (pointer) is passed as an input
argument and is also used as an output parameter by x2nrealloc(); see other
callers for example or its implementation.
>
> In addition in the sake of performance, we might want to start with allocating X number, and increment by X, not by 1.

we can explore this.
>
> + sort_flows = x2nrealloc(sort_flows, alloc_flow_count, sizeof ukey);
> + }
> + sort_flows[(*total_flow_count)++] = ukey;
> + return sort_flows;
> +}
> +
> +/*
> + * Build sort_flows[] initially with flows that
> + * reference an 'OOR' netdev as their input port.
> + */
> +static struct udpif_key **
> +udpif_build_oor_flows(struct udpif_key **sort_flows, size_t *total_flow_count,
> + size_t *alloc_flow_count, struct udpif_key *ukey,
> + int *oor_netdev_count)
> +{
> + struct netdev *netdev;
> + int count;
> +
> + /* Input netdev must be available for the flow */
> + netdev = ukey->in_netdev;
> + if (!netdev) {
> + return sort_flows;
> + }
> +
> + /* Is the in-netdev for this flow in OOR state ? */
> + if (!netdev_get_hw_info(netdev, HW_INFO_TYPE_OOR)) {
> + ukey_netdev_unref(ukey);
> + return sort_flows;
> + }
> +
> + /* Add the flow to sort_flows[] */
> + sort_flows = udpif_add_oor_flows(sort_flows, total_flow_count,
> + alloc_flow_count, ukey);
> + if (ukey->offloaded) {
> + count = netdev_get_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT);
> + ovs_assert(count >= 0);
> + if (count++ == 0) {
> + (*oor_netdev_count)++;
> + }
> + netdev_set_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT, count);
> + } else {
> + count = netdev_get_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT);
> + ovs_assert(count >= 0);
>
> We can loose the asserts if we make the value unsigned

I'd like to use signed int for the count; the assert can be removed since it
is being set only in 2 places and there's no code (yet) that decrements it.
>
> + netdev_set_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT, ++count);
> + }
> +
> + return sort_flows;
> +}
> +
> +/*
> + * Rebalance offloaded flows on HW netdevs that are in OOR state.
> + */
> +static void
> +udpif_flow_rebalance(struct udpif *udpif)
> +{
> + struct udpif_key **sort_flows = NULL;
> + size_t alloc_flow_count = 0;
> + size_t total_flow_count = 0;
> + int oor_netdev_count = 0;
> + int offload_index = 0;
> + int pending_index;
> +
> + /* Collect flows (offloaded and pending) that reference OOR netdevs */
> + for (size_t i = 0; i < N_UMAPS; i++) {
> + struct udpif_key *ukey;
> + struct umap *umap = &udpif->ukeys[i];
> +
> + CMAP_FOR_EACH (ukey, cmap_node, &umap->cmap) {
> + ukey_to_flow_netdev(udpif, ukey);
> + sort_flows = udpif_build_oor_flows(Sort_flows, &total_flow_count,
> + &alloc_flow_count, ukey,
> + &oor_netdev_count);
> + }
> + }
> +
> + /* Sort flows by OOR netdevs, state (offloaded/pending) and pps-rate */
> + qsort(sort_flows, total_flow_count, sizeof(struct udpif_key *),
> + flow_compare_rebalance);
> +
> + /*
> + * We now have flows referencing OOR netdevs, that are sorted. We also
> + * have a count of offloaded and pending flows on each of the netdevs
> + * that are in OOR state. Now rebalance each oor-netdev.
> + */
> + while (oor_netdev_count) {
> + struct netdev *netdev;
> + int offload_count;
> + int pending_count;
> + bool oor;
> +
> + netdev = sort_flows[offload_index]->in_netdev;
> + ovs_assert(netdev_get_hw_info(netdev, HW_INFO_TYPE_OOR) == true);
>
> Think you should loose the assert, as you just checked this in udpif_build_oor_flows()

yes, this can be removed.
>
> If you are trying to safeguard stuff here I would make sure that none of the indexes used are above the total_flow_count value.

agree, can be added.

Thanks,
-Harsha
>
> + VLOG_DBG("Offload rebalance: netdev: %s is OOR", netdev->name);
> +
> + offload_count = netdev_get_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT);
> + pending_count = netdev_get_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT);
> + pending_index = offload_index + offload_count;
> +
> + oor = rebalance_device(udpif,
> + &sort_flows[offload_index], offload_count,
> + &sort_flows[pending_index], pending_count);
> + netdev_set_hw_info(netdev, HW_INFO_TYPE_OOR, oor);
> +
> + offload_index = pending_index + pending_count;
> + netdev_set_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT, 0);
> + netdev_set_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT, 0);
> + oor_netdev_count--;
> + }
> +
> + for (int i = 0; i < total_flow_count; i++) {
> + struct udpif_key *ukey = sort_flows[i];
> + ukey_netdev_unref(ukey);
> + }
> + free(sort_flows);
> +}
> +
> +static int
> +udpif_flow_program(struct udpif *udpif, struct udpif_key *ukey,
> + enum dpif_offload_type offload_type)
> +{
> + struct dpif_op *opsp;
> + struct ukey_op uop;
> +
> + opsp = &uop.dop;
> + put_op_init(&uop, ukey, DPIF_FP_CREATE);
> + dpif_operate(udpif->dpif, &opsp, 1, offload_type);
> +
> + return opsp->error;
> +}
> +
> +static int
> +udpif_flow_unprogram(struct udpif *udpif, struct udpif_key *ukey,
> + enum dpif_offload_type offload_type)
> +{
> + struct dpif_op *opsp;
> + struct ukey_op uop;
> +
> + opsp = &uop.dop;
> + delete_op_init(udpif, &uop, ukey);
> + dpif_operate(udpif->dpif, &opsp, 1, offload_type);
> +
> + return opsp->error;
> +}
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index f05f616fe..2bfe4ff24 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -519,6 +519,27 @@
> </p>
> </column>
>
> + <column name="other_config" key="offload-rebalance"
> + type='{"type": "boolean"}'>
> + <p>
> + Configures HW offload rebalancing, that allows to dynamically
> + offload and un-offload flows while an offload-device is out of
> + resources (OOR). This policy allows flows to be selected for
> + offloading based on the packets-per-second (pps) rate of flows.
> + </p>
> + <p>
> + Set this value to <code>true</code> to enable this option.
> + </p>
> + <p>
> + The default value is <code>false</code>. Changing this value requires
> + restarting the daemon.
> + </p>
> + <p>
> + This is only relevant if HW offloading is enabled (hw-offload).
> + When this policy is enabled, it also requires 'tc-policy' to
> + be set to 'skip_sw'.
> + </p>
> + </column>
> </group>
>
> <group title="Status">
> --
> 2.18.0.rc1.1.g6f333ff
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Ben Pfaff Oct. 25, 2018, 4:36 p.m. UTC | #4
On Thu, Oct 25, 2018 at 07:32:27PM +0530, Sriharsha Basavapatna via dev wrote:
> > Removed by accident?
> There was ^L char here; I deleted it.

Don't do that.  The coding style talks about the use of page breaks in code:

    Within a file, non-static functions should come first, in the order
    that they are declared in the header file, followed by static
    functions.  Static functions should be in one or more separate pages
    (separated by form feed characters) in logical groups. A commonly
    useful way to divide groups is by "level", with high-level functions
    first, followed by groups of progressively lower-level
    functions. This makes it easy for the program's reader to see the
    top-down structure by reading from top to bottom.
Eelco Chaudron Oct. 26, 2018, 9:24 a.m. UTC | #5
On 25 Oct 2018, at 16:02, Sriharsha Basavapatna wrote:

> Hi Eelco,
>
> Thanks for your review comments. Please see my response below.
> On Fri, Oct 19, 2018 at 7:53 PM Eelco Chaudron <echaudro@redhat.com> 
> wrote:
>>
>> On 18 Oct 2018, at 18:13, Sriharsha Basavapatna via dev wrote:
>>
>> This is the third patch in the patch-set to support dynamic 
>> rebalancing
>> of offloaded flows.
>>
>> The dynamic rebalancing functionality is implemented in this patch. 
>> The
>> ukeys that are not scheduled for deletion are obtained and passed as 
>> input
>> to the rebalancing routine. The rebalancing is done in the context of
>> revalidation leader thread, after all other revalidator threads are
>> done with gathering rebalancing data for flows.
>>
>> For each netdev that is in OOR state, a list of flows - both 
>> offloaded
>> and non-offloaded (pending) - is obtained using the ukeys. For each 
>> netdev
>> that is in OOR state, the flows are grouped and sorted into offloaded 
>> and
>> pending flows. The offloaded flows are sorted in descending order of
>> pps-rate, while pending flows are sorted in ascending order of 
>> pps-rate.
>>
>> The rebalancing is done in two phases. In the first phase, we try to
>> offload all pending flows and if that succeeds, the OOR state on the 
>> device
>> is cleared. If some (or none) of the pending flows could not be 
>> offloaded,
>> then we start replacing an offloaded flow that has a lower pps-rate 
>> than
>> a pending flow, until there are no more pending flows with a higher 
>> rate
>> than an offloaded flow. The flows that are replaced from the device 
>> are
>> added into kernel datapath.
>>
>> A new OVS configuration parameter "offload-rebalance", is added to 
>> ovsdb.
>> The default value of this is "false". To enable this feature, set the
>> value of this parameter to "true", which provides packets-per-second
>> rate based policy to dynamically offload and un-offload flows.
>>
>> Note: This option can be enabled only when 'hw-offload' policy is 
>> enabled.
>> It also requires 'tc-policy' to be set to 'skip_sw'; otherwise, flow
>> offload errors (specifically ENOSPC error this feature depends on) 
>> reported
>> by an offloaded device are supressed by TC-Flower kernel module.
>>
>> Signed-off-by: Sriharsha Basavapatna 
>> <sriharsha.basavapatna@broadcom.com>
>> Co-authored-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
>> Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
>> Reviewed-by: Sathya Perla <sathya.perla@broadcom.com>
>> Reviewed-by: Simon Horman <simon.horman@netronome.com>
>> Reviewed-by: Ben Pfaff <blp@ovn.org>
>> ---
>> NEWS | 3 +
>> lib/dpif-netdev.c | 3 +-
>> lib/dpif-netlink.c | 29 ++-
>> lib/dpif-provider.h | 8 +-
>> lib/dpif.c | 30 ++-
>> lib/dpif.h | 12 +-
>> lib/netdev-provider.h | 7 +-
>> lib/netdev.c | 62 ++++-
>> lib/netdev.h | 1 +
>> ofproto/ofproto-dpif-upcall.c | 446 
>> +++++++++++++++++++++++++++++++++-
>> vswitchd/vswitch.xml | 21 ++
>> 11 files changed, 592 insertions(+), 30 deletions(-)
>>
>> diff --git a/NEWS b/NEWS
>> index 33b4d8a23..846b46fb5 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -8,6 +8,9 @@ Post-v2.10.0
>> as the default timeout for control utilities.
>> - ovn:
>> * ovn-ctl: allow passing user:group ids to the OVN daemons.
>> + - ovs-vswitchd:
>> + * New configuration option "offload-rebalance", that enables 
>> dynamic
>> + rebalancing of offloaded flows.
>>
>>
>> v2.10.0 - xx xxx xxxx
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 7c0300cc5..1c01d2278 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -3689,7 +3689,8 @@ dpif_netdev_execute(struct dpif *dpif, struct 
>> dpif_execute *execute)
>> }
>>
>> static void
>> -dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t 
>> n_ops)
>> +dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t 
>> n_ops,
>> + enum dpif_offload_type offload_type OVS_UNUSED)
>> {
>> size_t i;
>>
>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
>> index b9ce9cbe2..2e01f5750 100644
>> --- a/lib/dpif-netlink.c
>> +++ b/lib/dpif-netlink.c
>> @@ -2281,7 +2281,8 @@ dpif_netlink_operate_chunks(struct dpif_netlink 
>> *dpif, struct dpif_op **ops,
>> }
>>
>> static void
>> -dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, 
>> size_t n_ops)
>> +dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, 
>> size_t n_ops,
>> + enum dpif_offload_type offload_type)
>> {
>> struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
>> struct dpif_op *new_ops[OPERATE_MAX_OPS];
>> @@ -2289,7 +2290,12 @@ dpif_netlink_operate(struct dpif *dpif_, 
>> struct dpif_op **ops, size_t n_ops)
>> int i = 0;
>> int err = 0;
>>
>> - if (netdev_is_flow_api_enabled()) {
>> + if (offload_type == DPIF_OFFLOAD_ALWAYS && 
>> !netdev_is_flow_api_enabled()) {
>> + VLOG_DBG("Invalid offload_type: %d", offload_type);
>>
>> Here we are not returning any errors just a silent return, should we 
>> return EINVAL instead?
> The interface has no return value (void).

I mean in op->error

>>
>> + return;
>> + }
>> +
>> + if (offload_type != DPIF_OFFLOAD_NEVER && 
>> netdev_is_flow_api_enabled()) {
>> while (n_ops > 0) {
>> count = 0;
>>
>> @@ -2298,6 +2304,23 @@ dpif_netlink_operate(struct dpif *dpif_, 
>> struct dpif_op **ops, size_t n_ops)
>>
>> err = try_send_to_netdev(dpif, op);
>> if (err && err != EEXIST) {
>> + if (offload_type == DPIF_OFFLOAD_ALWAYS) {
>> + /* We got an error while offloading an op. Since
>> + * OFFLOAD_ALWAYS is specified, we stop further
>> + * processing and return to the caller without
>> + * invoking kernel datapath as fallback. But the
>> + * interface requires us to process all n_ops; so
>> + * return the same error in the remaining ops too.
>> + */
>>
>> Why are we ok with failing all possible operations here? What if we 
>> are doing a FLOW_PUT and FLOW_DEL in a set?
>> Also one add might fail but another might pass, due to specific 
>> resources, i.e. seperate hw resources for excact match, and masked 
>> matches?
>
> Currently we don't support combining PUT/DEL operations when 
> offload_type is
> DPIF_OFFLOAD_ALWAYS. May be we should add a check or document the 
> interface.
> Regarding separate hw resource tables, I responded to your comment on 
> the
> cover letter patch.
>>
>> + op->error = err;
>> + n_ops--;
>> + while (n_ops > 0) {
>> + op = ops[i++];
>> + op->error = err;
>> + n_ops--;
>> + }
>> + return;
>> + }
>> new_ops[count++] = op;
>> } else {
>> op->error = err;
>> @@ -2308,7 +2331,7 @@ dpif_netlink_operate(struct dpif *dpif_, struct 
>> dpif_op **ops, size_t n_ops)
>>
>> dpif_netlink_operate_chunks(dpif, new_ops, count);
>> }
>> - } else {
>> + } else if (offload_type != DPIF_OFFLOAD_ALWAYS) {
>> dpif_netlink_operate_chunks(dpif, ops, n_ops);
>>
>>      } else {
>>      What should happen in this case? silently pass, if so we might 
>> want to clear all the errors just in case?
>>    }
>
> There's no need for this else case; it is handled by the if() 
> condition at
> the beginning of the function (if (offload_type == DPIF_OFFLOAD_ALWAYS 
> && ...))

If there is not possibility of the else case, then there is not need for 
the if either.

>>
>> }
>> }
>> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
>> index 7a71f5c0a..a30de740f 100644
>> --- a/lib/dpif-provider.h
>> +++ b/lib/dpif-provider.h
>> @@ -296,12 +296,14 @@ struct dpif_class {
>>
>> int (*flow_dump_next)(struct dpif_flow_dump_thread *thread,
>> struct dpif_flow *flows, int max_flows);
>> -
>> /* Executes each of the 'n_ops' operations in 'ops' on 'dpif', in the 
>> order
>> * in which they are specified, placing each operation's results in 
>> the
>> * "output" members documented in comments and the 'error' member of 
>> each
>> - * dpif_op. */
>> - void (*operate)(struct dpif *dpif, struct dpif_op **ops, size_t 
>> n_ops);
>> + * dpif_op. The offload_type argument tells the provider if 'ops' 
>> should
>> + * be submitted to to a netdev (only offload) or to the kernel 
>> datapath
>> + * (never offload) or to both (offload if possible; software 
>> fallback). */
>> + void (*operate)(struct dpif *dpif, struct dpif_op **ops, size_t 
>> n_ops,
>> + enum dpif_offload_type offload_type);
>>
>> /* Enables or disables receiving packets with dpif_recv() for 'dpif'.
>> * Turning packet receive off and then back on is allowed to change 
>> Netlink
>> diff --git a/lib/dpif.c b/lib/dpif.c
>> index d799f972c..65880b86a 100644
>> --- a/lib/dpif.c
>> +++ b/lib/dpif.c
>> @@ -49,6 +49,7 @@
>> #include "valgrind.h"
>> #include "openvswitch/ofp-errors.h"
>> #include "openvswitch/vlog.h"
>> +#include "lib/netdev-provider.h"
>>
>> VLOG_DEFINE_THIS_MODULE(dpif);
>>
>> @@ -1015,7 +1016,7 @@ dpif_flow_get(struct dpif *dpif,
>> op.flow_get.flow->key_len = key_len;
>>
>> opp = &op;
>> - dpif_operate(dpif, &opp, 1);
>> + dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>>
>> return op.error;
>> }
>> @@ -1045,7 +1046,7 @@ dpif_flow_put(struct dpif *dpif, enum 
>> dpif_flow_put_flags flags,
>> op.flow_put.stats = stats;
>>
>> opp = &op;
>> - dpif_operate(dpif, &opp, 1);
>> + dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>>
>> return op.error;
>> }
>> @@ -1068,7 +1069,7 @@ dpif_flow_del(struct dpif *dpif,
>> op.flow_del.terse = false;
>>
>> opp = &op;
>> - dpif_operate(dpif, &opp, 1);
>> + dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>>
>> return op.error;
>> }
>> @@ -1325,7 +1326,7 @@ dpif_execute(struct dpif *dpif, struct 
>> dpif_execute *execute)
>> op.execute = *execute;
>>
>> opp = &op;
>> - dpif_operate(dpif, &opp, 1);
>> + dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
>>
>> return op.error;
>> } else {
>> @@ -1336,10 +1337,21 @@ dpif_execute(struct dpif *dpif, struct 
>> dpif_execute *execute)
>> /* Executes each of the 'n_ops' operations in 'ops' on 'dpif', in the 
>> order in
>> * which they are specified. Places each operation's results in the 
>> "output"
>> * members documented in comments, and 0 in the 'error' member on 
>> success or a
>> - * positive errno on failure. */
>> + * positive errno on failure.
>> + */
>>
>> Guess this should be undone as all comments in this file use this 
>> style of closing.
>
> agreed.
>>
>> void
>> -dpif_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops)
>> -{
>> +dpif_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
>> + enum dpif_offload_type offload_type)
>> +{
>> + if (offload_type == DPIF_OFFLOAD_ALWAYS && 
>> !netdev_is_flow_api_enabled()) {
>> + size_t i;
>> + for (i = 0; i < n_ops; i++) {
>> + struct dpif_op *op = ops[i];
>> + op->error = EINVAL;
>> + }
>> + return;
>> + }
>> +
>> while (n_ops > 0) {
>> size_t chunk;
>>
>> @@ -1360,7 +1372,7 @@ dpif_operate(struct dpif *dpif, struct dpif_op 
>> **ops, size_t n_ops)
>> * handle itself, without help. */
>> size_t i;
>>
>> - dpif->dpif_class->operate(dpif, ops, chunk);
>> + dpif->dpif_class->operate(dpif, ops, chunk, offload_type);
>>
>> for (i = 0; i < chunk; i++) {
>> struct dpif_op *op = ops[i];
>> @@ -1657,7 +1669,7 @@ dpif_queue_to_priority(const struct dpif *dpif, 
>> uint32_t queue_id,
>> log_operation(dpif, "queue_to_priority", error);
>> return error;
>> }
>> -
>> +
>>
>> Removed by accident?
> There was ^L char here; I deleted it.

See Ben’s response you might want to re-insert the ones you deleted.

>>
>> void
>> dpif_init(struct dpif *dpif, const struct dpif_class *dpif_class,
>> const char *name,
>> diff --git a/lib/dpif.h b/lib/dpif.h
>> index bbdc3eb6c..0675ab19f 100644
>> --- a/lib/dpif.h
>> +++ b/lib/dpif.h
>> @@ -614,6 +614,13 @@ enum dpif_op_type {
>> DPIF_OP_FLOW_GET,
>> };
>>
>> +/* offload_type argument types to (*operate) interface */
>> +enum dpif_offload_type {
>> + DPIF_OFFLOAD_AUTO, /* Offload if possible, fallback to software. */
>> + DPIF_OFFLOAD_NEVER, /* Never offload to hardware. */
>> + DPIF_OFFLOAD_ALWAYS, /* Always offload to hardware. */
>> +};
>> +
>> /* Add or modify a flow.
>> *
>> * The flow is specified by the Netlink attributes with types 
>> OVS_KEY_ATTR_* in
>> @@ -768,8 +775,9 @@ struct dpif_op {
>> };
>> };
>>
>> -void dpif_operate(struct dpif *, struct dpif_op **ops, size_t 
>> n_ops);
>> -
>>
>> Same here?
>>
>> +void dpif_operate(struct dpif *, struct dpif_op **ops, size_t n_ops,
>> + enum dpif_offload_type);
>> +
>> /* Upcalls. */
>>
>> enum dpif_upcall_type {
>> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
>> index e320dad61..fb0c27e6e 100644
>> --- a/lib/netdev-provider.h
>> +++ b/lib/netdev-provider.h
>> @@ -38,10 +38,14 @@ struct netdev_tnl_build_header_params;
>> /* Offload-capable (HW) netdev information */
>> struct netdev_hw_info {
>> bool oor; /* Out of Offload Resources ? */
>> + int offload_count; /* Pending (non-offloaded) flow count */
>> + int pending_count; /* Offloaded flow count */
>>
>> Guess they should be uint32_t assuming we do not expect negative 
>> counts.
>
> They could be; but we are doing some subtractions (& comparisons), I'd
> prefer to have them as signed ints to debug any related issues (i.e,
> avoid confusion from unsigned wrap around).

Ok, it was just a personal preference.

>>
>> };
>>
>> enum hw_info_type {
>> - HW_INFO_TYPE_OOR = 1 /* OOR state */
>> + HW_INFO_TYPE_OOR = 1, /* OOR state */
>> + HW_INFO_TYPE_PEND_COUNT = 2, /* Pending(non-offloaded) flow count 
>> */
>> + HW_INFO_TYPE_OFFL_COUNT = 3 /* Offloaded flow count */
>> };
>>
>> /* A network device (e.g. an Ethernet device).
>> @@ -89,7 +93,6 @@ struct netdev {
>> int n_rxq;
>> struct shash_node *node; /* Pointer to element in global map. */
>> struct ovs_list saved_flags_list; /* Contains "struct 
>> netdev_saved_flags". */
>> -
>> struct netdev_hw_info hw_info; /* offload-capable netdev info */
>> };
>>
>> diff --git a/lib/netdev.c b/lib/netdev.c
>> index f3fa08ca3..5d7f9c89b 100644
>> --- a/lib/netdev.c
>> +++ b/lib/netdev.c
>> @@ -2260,11 +2260,23 @@ netdev_get_block_id(struct netdev *netdev)
>> int
>> netdev_get_hw_info(struct netdev *netdev, int type)
>> {
>> - if (type == HW_INFO_TYPE_OOR) {
>> - return netdev->hw_info.oor;
>> + int val = -1;
>> +
>> + switch (type) {
>> + case HW_INFO_TYPE_OOR:
>> + val = netdev->hw_info.oor;
>> + break;
>> + case HW_INFO_TYPE_PEND_COUNT:
>> + val = netdev->hw_info.pending_count;
>> + break;
>> + case HW_INFO_TYPE_OFFL_COUNT:
>> + val = netdev->hw_info.offload_count;
>> + break;
>> + default:
>> + break;
>> }
>>
>> - return -1;
>> + return val;
>> }
>>
>> /*
>> @@ -2273,9 +2285,47 @@ netdev_get_hw_info(struct netdev *netdev, int 
>> type)
>> void
>> netdev_set_hw_info(struct netdev *netdev, int type, int val)
>> {
>> - if (type == HW_INFO_TYPE_OOR) {
>> + switch (type) {
>> + case HW_INFO_TYPE_OOR:
>> + if (val == 0) {
>> + VLOG_DBG("Offload rebalance: netdev: %s is not OOR", netdev->name);
>> + }
>> netdev->hw_info.oor = val;
>> + break;
>> + case HW_INFO_TYPE_PEND_COUNT:
>> + netdev->hw_info.pending_count = val;
>> + break;
>> + case HW_INFO_TYPE_OFFL_COUNT:
>> + netdev->hw_info.offload_count = val;
>> + break;
>> + default:
>> + break;
>> + }
>> +}
>>
>> See comment in first patchset about creating separate functions for 
>> this.
>>
>> +
>> +/*
>> + * Find if any netdev is in OOR state. Return true if there's at 
>> least
>> + * one netdev that's in OOR state; otherwise return false.
>> + */
>> +bool
>> +netdev_any_oor(void)
>> + OVS_EXCLUDED(netdev_mutex)
>> +{
>> + struct shash_node *node;
>> + bool oor = false;
>> +
>> + ovs_mutex_lock(&netdev_mutex);
>> + SHASH_FOR_EACH (node, &netdev_shash) {
>> + struct netdev *dev = node->data;
>> +
>> + if (dev->hw_info.oor) {
>> + oor = true;
>> + break;
>> + }
>> }
>> + ovs_mutex_unlock(&netdev_mutex);
>> +
>> + return oor;
>> }
>>
>> bool
>> @@ -2549,6 +2599,10 @@ netdev_set_flow_api_enabled(const struct smap 
>> *ovs_other_config)
>> tc_set_policy(smap_get_def(ovs_other_config, "tc-policy",
>> TC_POLICY_DEFAULT));
>>
>> + if (smap_get_bool(ovs_other_config, "offload-rebalance", false)) {
>> + netdev_offload_rebalance_policy = true;
>> + }
>> +
>> netdev_ports_flow_init();
>>
>> ovsthread_once_done(&once);
>> diff --git a/lib/netdev.h b/lib/netdev.h
>> index b0e5c5b72..373be7cc0 100644
>> --- a/lib/netdev.h
>> +++ b/lib/netdev.h
>> @@ -229,6 +229,7 @@ int netdev_init_flow_api(struct netdev *);
>> uint32_t netdev_get_block_id(struct netdev *);
>> int netdev_get_hw_info(struct netdev *, int);
>> void netdev_set_hw_info(struct netdev *, int, int);
>> +bool netdev_any_oor(void);
>> bool netdev_is_flow_api_enabled(void);
>> void netdev_set_flow_api_enabled(const struct smap 
>> *ovs_other_config);
>> bool netdev_is_offload_rebalance_policy_enabled(void);
>> diff --git a/ofproto/ofproto-dpif-upcall.c 
>> b/ofproto/ofproto-dpif-upcall.c
>> index a372d6252..bb9e61b7c 100644
>> --- a/ofproto/ofproto-dpif-upcall.c
>> +++ b/ofproto/ofproto-dpif-upcall.c
>> @@ -22,6 +22,7 @@
>> #include "connmgr.h"
>> #include "coverage.h"
>> #include "cmap.h"
>> +#include "lib/dpif-provider.h"
>> #include "dpif.h"
>> #include "openvswitch/dynamic-string.h"
>> #include "fail-open.h"
>> @@ -42,7 +43,6 @@
>> #include "tunnel.h"
>> #include "unixctl.h"
>> #include "openvswitch/vlog.h"
>> -#include "lib/dpif-provider.h"
>> #include "lib/netdev-provider.h"
>>
>> #define MAX_QUEUE_LENGTH 512
>> @@ -182,6 +182,8 @@ struct udpif {
>> uint64_t conn_seq; /* Corresponds to 'dump_seq' when
>> conns[n_conns-1] was stored. */
>> size_t n_conns; /* Number of connections waiting. */
>> +
>> + long long int offload_rebalance_time; /* Time of last offload 
>> rebalance */
>> };
>>
>> enum upcall_type {
>> @@ -308,6 +310,7 @@ struct udpif_key {
>> struct recirc_refs recircs; /* Action recirc IDs with references 
>> held. */
>>
>> #define OFFL_REBAL_INTVL_MSEC 3000 /* dynamic offload rebalance freq 
>> */
>> + struct netdev *in_netdev; /* in_odp_port's netdev */
>> bool offloaded; /* True if flow is offloaded */
>> uint64_t flow_pps_rate; /* Packets-Per-Second rate */
>> long long int flow_time; /* last pps update time */
>> @@ -396,6 +399,12 @@ static int upcall_receive(struct upcall *, const 
>> struct dpif_backer *,
>> const ovs_u128 *ufid, const unsigned pmd_id);
>> static void upcall_uninit(struct upcall *);
>>
>> +static void udpif_flow_rebalance(struct udpif *udpif);
>> +static int udpif_flow_program(struct udpif *udpif, struct udpif_key 
>> *ukey,
>> + enum dpif_offload_type offload_type);
>> +static int udpif_flow_unprogram(struct udpif *udpif, struct 
>> udpif_key *ukey,
>> + enum dpif_offload_type offload_type);
>> +
>> static upcall_callback upcall_cb;
>> static dp_purge_callback dp_purge_cb;
>>
>> @@ -567,6 +576,7 @@ udpif_start_threads(struct udpif *udpif, size_t 
>> n_handlers_,
>> ovs_barrier_init(&udpif->pause_barrier, udpif->n_revalidators + 1);
>> udpif->reval_exit = false;
>> udpif->pause = false;
>> + udpif->offload_rebalance_time = time_msec();
>> udpif->revalidators = xzalloc(udpif->n_revalidators
>> * sizeof *udpif->revalidators);
>> for (size_t i = 0; i < udpif->n_revalidators; i++) {
>> @@ -859,6 +869,26 @@ free_dupcall:
>> return n_upcalls;
>> }
>>
>> +static void
>> +udpif_run_flow_rebalance(struct udpif *udpif)
>> +{
>> + long long int now = 0;
>> +
>> + /* Don't rebalance if OFFL_REBAL_INTVL_MSEC have not elapsed */
>> + now = time_msec();
>> + if (now < udpif->offload_rebalance_time + OFFL_REBAL_INTVL_MSEC) {
>> + return;
>> + }
>> +
>> + if (!netdev_any_oor()) {
>> + return;
>> + }
>> +
>> + VLOG_DBG("Offload rebalance: Found OOR netdevs");
>>
>> Can we add a coverage counter here?
>>
>>    COVERAGE_INC(offload_flow_rebalance);
>
> Sure; thanks for suggesting this.
>>
>> + udpif->offload_rebalance_time = now;
>> + udpif_flow_rebalance(udpif);
>> +}
>> +
>> static void *
>> udpif_revalidator(void *arg)
>> {
>> @@ -933,6 +963,9 @@ udpif_revalidator(void *arg)
>>
>> dpif_flow_dump_destroy(udpif->dump);
>> seq_change(udpif->dump_seq);
>> + if (netdev_is_offload_rebalance_policy_enabled()) {
>> + udpif_run_flow_rebalance(udpif);
>> + }
>>
>> duration = MAX(time_msec() - start_time, 1);
>> udpif->dump_duration = duration;
>> @@ -977,7 +1010,7 @@ udpif_revalidator(void *arg)
>>
>> return NULL;
>> }
>> -
>>
>> +
>> static enum upcall_type
>> classify_upcall(enum dpif_upcall_type type, const struct nlattr 
>> *userdata,
>> struct user_action_cookie *cookie)
>> @@ -1579,7 +1612,7 @@ handle_upcalls(struct udpif *udpif, struct 
>> upcall *upcalls,
>> for (i = 0; i < n_ops; i++) {
>> opsp[n_opsp++] = &ops[i].dop;
>> }
>> - dpif_operate(udpif->dpif, opsp, n_opsp);
>> + dpif_operate(udpif->dpif, opsp, n_opsp, DPIF_OFFLOAD_AUTO);
>> for (i = 0; i < n_ops; i++) {
>> struct udpif_key *ukey = ops[i].ukey;
>>
>> @@ -1671,13 +1704,13 @@ ukey_create__(const struct nlattr *key, 
>> size_t key_len,
>> ukey->state = UKEY_CREATED;
>> ukey->state_thread = ovsthread_id_self();
>> ukey->state_where = OVS_SOURCE_LOCATOR;
>> - ukey->created = time_msec();
>> + ukey->created = ukey->flow_time = time_msec();
>> memset(&ukey->stats, 0, sizeof ukey->stats);
>> ukey->stats.used = used;
>> ukey->xcache = NULL;
>>
>> ukey->offloaded = false;
>> - ukey->flow_time = 0;
>> + ukey->in_netdev = NULL;
>> ukey->flow_packets = ukey->flow_backlog_packets = 0;
>>
>> ukey->key_recirc_id = key_recirc_id;
>> @@ -2329,7 +2362,7 @@ push_dp_ops(struct udpif *udpif, struct ukey_op 
>> *ops, size_t n_ops)
>> for (i = 0; i < n_ops; i++) {
>> opsp[i] = &ops[i].dop;
>> }
>> - dpif_operate(udpif->dpif, opsp, n_ops);
>> + dpif_operate(udpif->dpif, opsp, n_ops, DPIF_OFFLOAD_AUTO);
>>
>> for (i = 0; i < n_ops; i++) {
>> struct ukey_op *op = &ops[i];
>> @@ -2455,6 +2488,57 @@ reval_op_init(struct ukey_op *op, enum 
>> reval_result result,
>> }
>> }
>>
>> +static void
>> +ukey_netdev_unref(struct udpif_key *ukey)
>> +{
>> + if (!ukey->in_netdev) {
>> + return;
>> + }
>> + netdev_close(ukey->in_netdev);
>> + ukey->in_netdev = NULL;
>> +}
>> +
>> +/*
>> + * Given a udpif_key, get its input port (netdev) by parsing the 
>> flow keys
>> + * and actions. The flow may not contain flow attributes if it is a 
>> terse
>> + * dump; read its attributes from the ukey and then parse the flow 
>> to get
>> + * the port info. Save them in udpif_key.
>> + */
>> +static void
>> +ukey_to_flow_netdev(struct udpif *udpif, struct udpif_key *ukey)
>> +{
>> + const struct dpif *dpif = udpif->dpif;
>> + const struct dpif_class *dpif_class = dpif->dpif_class;
>> + const struct nlattr *k;
>> + unsigned int left;
>> +
>> + /* Remove existing references to netdev */
>> + ukey_netdev_unref(ukey);
>>
>> If for some reason we already have a reference, we should just return 
>> it, as the netdev can not change.
>
> We release any previous reference held by us (rebalancing code). If 
> you see
> the implementation, we just return if in_netdev is NULL:
>     if (!ukey->in_netdev) {
>         return;
>     }
> We should hit this condition since we would have released references
> at the end of previous iteration of rebalance. So I agree, we can 
> remove this
> call to ukey_netdev_unref() here, but let me do some experiments and 
> confirm
> it.

Yes I did see you always free it, so in theory it should not happen. But 
if it does, at this stage (maybe some future feature might also use it) 
we should return the current value.

>>
>> +
>> + /* Find the input port and get a reference to its netdev */
>> + NL_ATTR_FOR_EACH (k, left, ukey->key, ukey->key_len) {
>> + enum ovs_key_attr type = nl_attr_type(k);
>> +
>> + if (type == OVS_KEY_ATTR_IN_PORT) {
>> + ukey->in_netdev = netdev_ports_get(nl_attr_get_odp_port(k),
>> + dpif_class);
>> + } else if (type == OVS_KEY_ATTR_TUNNEL) {
>> + struct flow_tnl tnl;
>> + enum odp_key_fitness res;
>> +
>> + if (ukey->in_netdev) {
>> + netdev_close(ukey->in_netdev);
>> + ukey->in_netdev = NULL;
>> + }
>> + res = odp_tun_key_from_attr(k, &tnl);
>> + if (res != ODP_FIT_ERROR) {
>> + ukey->in_netdev = flow_get_tunnel_netdev(&tnl);
>> + break;
>> + }
>> + }
>> + }
>> +}
>> +
>> static uint64_t
>> udpif_flow_packet_delta(struct udpif_key *ukey, const struct 
>> dpif_flow *f)
>> {
>> @@ -2468,6 +2552,16 @@ udpif_flow_time_delta(struct udpif *udpif, 
>> struct udpif_key *ukey)
>> return (udpif->dpif->current_ms - ukey->flow_time) / 1000;
>> }
>>
>> +/*
>> + * Save backlog packet count while switching modes
>> + * between offloaded and kernel datapaths.
>> + */
>> +static void
>> +udpif_set_ukey_backlog_packets(struct udpif_key *ukey)
>> +{
>> + ukey->flow_backlog_packets = ukey->flow_packets;
>> +}
>> +
>> /* Gather pps-rate for the given dpif_flow and save it in its ukey */
>> static void
>> udpif_update_flow_pps(struct udpif *udpif, struct udpif_key *ukey,
>> @@ -2539,6 +2633,7 @@ revalidate(struct revalidator *revalidator)
>> kill_them_all = n_dp_flows > flow_limit * 2;
>> max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle;
>>
>> + udpif->dpif->current_ms = time_msec();
>>
>> Can we not use udpif->dpif->current_ms = now, as it's set two lines 
>> above.
>
> Yes.
>>
>> for (f = flows; f < &flows[n_dumped]; f++) {
>> long long int used = f->stats.used;
>> struct recirc_refs recircs = RECIRC_REFS_EMPTY_INITIALIZER;
>> @@ -2915,3 +3010,342 @@ upcall_unixctl_purge(struct unixctl_conn 
>> *conn, int argc OVS_UNUSED,
>> }
>> unixctl_command_reply(conn, "");
>> }
>> +
>> +/* Flows are sorted in the following order:
>> + * netdev, flow state (offloaded/kernel path), flow_pps_rate.
>> + */
>> +static int
>> +flow_compare_rebalance(const void *elem1, const void *elem2)
>> +{
>> + const struct udpif_key *f1 = *(struct udpif_key **)elem1;
>> + const struct udpif_key *f2 = *(struct udpif_key **)elem2;
>> + int64_t diff;
>> +
>> + if (f1->in_netdev < f2->in_netdev) {
>> + return -1;
>> + } else if (f1->in_netdev > f2->in_netdev) {
>> + return 1;
>> + }
>> +
>> + if (f1->offloaded != f2->offloaded) {
>> + return f2->offloaded - f1->offloaded;
>> + }
>> +
>> + diff = (f1->offloaded == true) ?
>> + f1->flow_pps_rate - f2->flow_pps_rate :
>> + f2->flow_pps_rate - f1->flow_pps_rate;
>> +
>> + return (diff < 0) ? -1 : 1;
>> +}
>> +
>> +/* Insert flows from pending array during rebalancing */
>> +static int
>> +rebalance_insert_pending(struct udpif *udpif, struct udpif_key 
>> **pending_flows,
>> + int pending_count, int insert_count,
>> + uint64_t rate_threshold)
>> +{
>> + int count = 0;
>> +
>> + for (int i = 0; i < pending_count; i++) {
>> + struct udpif_key *flow = pending_flows[i];
>> + int err;
>> +
>> + /* Stop offloading pending flows if the insert count is
>> + * reached and the flow rate is less than the threshold
>> + */
>> + if (count >= insert_count && flow->flow_pps_rate < rate_threshold) 
>> {
>> + break;
>> + }
>> +
>> + /* Offload the flow to netdev */
>> + err = udpif_flow_program(udpif, flow, DPIF_OFFLOAD_ALWAYS);
>> +
>> + if (err == ENOSPC) {
>> + /* Stop if we are out of resources */
>>
>> Are we sure we want to stop? We did release X flows, maybe some 
>> others might fit (See other comments on different type of hw tables).
>
> That's a good point. I agree, it might be possible to continue adding 
> flows
> when we have different hw tables. This can be taken up as a part of 
> further
> additions to this feature.
>>
>> + break;
>> + }
>> +
>> + if (err) {
>> + continue;
>> + }
>> +
>> + /* Offload succeeded; delete it from the kernel datapath */
>> + udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_NEVER);
>>
>> No error checking here? for the other order you do it.
>
> Yes, we can add error checking here.
>>
>> +
>> + /* Change the state of the flow, adjust dpif counters */
>> + flow->offloaded = true;
>> +
>> + udpif_set_ukey_backlog_packets(flow);
>>
>> Are ok with not clearing the flow_packets here? As now some of the 
>> previous count is used for the new mode while calculating.
>
> This is ok since we update ukey->flow_packets in the next iteration in
> udpif_update_flow_pps().

But only after calculating the pps?
>>
>> + count++;
>> + }
>> +
>> + return count;
>> +}
>> +
>> +/* Remove flows from offloaded array during rebalancing */
>> +static void
>> +rebalance_remove_offloaded(struct udpif *udpif,
>> + struct udpif_key **offloaded_flows,
>> + int offload_count)
>> +{
>> + for (int i = 0; i < offload_count; i++) {
>> + struct udpif_key *flow = offloaded_flows[i];
>> + int err;
>> +
>> + /* Install the flow into kernel path first */
>> + err = udpif_flow_program(udpif, flow, DPIF_OFFLOAD_NEVER);
>> + if (err) {
>> + continue;
>> + }
>> +
>> + /* Success; now remove offloaded flow from netdev */
>> + err = udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_ALWAYS);
>> + if (err) {
>> + udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_NEVER);
>> + continue;
>> + }
>> + udpif_set_ukey_backlog_packets(flow);
>>
>> Same here, are ok with not clearing the flow_packets here? As now 
>> some of the previous count is used for the new mode while 
>> calculating.
>>
>> + flow->offloaded = false;
>> + }
>> +}
>> +
>> +/*
>> + * Rebalance offloaded flows on a netdev that's in OOR state.
>> + *
>> + * The rebalancing is done in two phases. In the first phase, we 
>> check if
>> + * the pending flows can be offloaded (if some resources became 
>> available
>> + * in the meantime) by trying to offload each pending flow. If all 
>> pending
>> + * flows get successfully offloaded, the OOR state is cleared on the 
>> netdev
>> + * and there's nothing to rebalance.
>> + *
>> + * If some of the pending flows could not be offloaded, i.e, we 
>> still see
>> + * the OOR error, then we move to the second phase of rebalancing. 
>> In this
>> + * phase, the rebalancer compares pps-rate of an offloaded flow with 
>> the
>> + * least pps-rate with that of a pending flow with the highest 
>> pps-rate from
>> + * their respective sorted arrays. If pps-rate of the offloaded flow 
>> is less
>> + * than the pps-rate of the pending flow, then it deletes the 
>> offloaded flow
>> + * from the HW/netdev and adds it to kernel datapath and then 
>> offloads pending
>> + * to HW/netdev. This process is repeated for every pair of 
>> offloaded and
>> + * pending flows in the ordered list. The process stops when we 
>> encounter an
>> + * offloaded flow that has a higher pps-rate than the corresponding 
>> pending
>> + * flow. The entire rebalancing process is repeated in the next 
>> iteration.
>> + */
>> +static bool
>> +rebalance_device(struct udpif *udpif, struct udpif_key 
>> **offloaded_flows,
>> + int offload_count, struct udpif_key **pending_flows,
>> + int pending_count)
>> +{
>> +
>> + /* Phase 1 */
>> + int num_inserted = rebalance_insert_pending(udpif, pending_flows,
>> + pending_count, pending_count,
>> + 0);
>> + if (num_inserted) {
>> + VLOG_DBG("Offload rebalance: Phase1: inserted %d pending flows",
>> + num_inserted);
>> + }
>> +
>> + /* Adjust pending array */
>> + pending_flows = &pending_flows[num_inserted];
>> + pending_count -= num_inserted;
>> +
>> + if (!pending_count) {
>> + /*
>> + * Successfully offloaded all pending flows. The device
>> + * is no longer in OOR state; done rebalancing this device.
>> + */
>> + return false;
>> + }
>> +
>> + /*
>> + * Phase 2; determine how many offloaded flows to churn.
>> + */
>> +#define OFFL_REBAL_MAX_CHURN 1024
>> + int churn_count = 0;
>> + while (churn_count < OFFL_REBAL_MAX_CHURN && churn_count < 
>> offload_count
>> + && churn_count < pending_count) {
>> + if (pending_flows[churn_count]->flow_pps_rate <=
>> + offloaded_flows[churn_count]->flow_pps_rate)
>>
>>                                                         {
>>
>> + break;
>>
>>         }
>>
>> + churn_count++;
>> + }
>> +
>> + if (churn_count) {
>> + VLOG_DBG("Offload rebalance: Phase2: removing %d offloaded flows",
>> + churn_count);
>> + }
>> +
>> + /* Bail early if nothing to churn */
>> + if (!churn_count) {
>> + return true;
>> + }
>> +
>> + /* Remove offloaded flows */
>> + rebalance_remove_offloaded(udpif, offloaded_flows, churn_count);
>> +
>> + /* Adjust offloaded array */
>> + offloaded_flows = &offloaded_flows[churn_count];
>> + offload_count -= churn_count;
>> +
>> + /* Replace offloaded flows with pending flows */
>> + num_inserted = rebalance_insert_pending(udpif, pending_flows,
>> + pending_count, churn_count,
>> + offload_count ?
>> + offloaded_flows[0]->flow_pps_rate :
>> + 0);
>> + if (num_inserted) {
>> + VLOG_DBG("Offload rebalance: Phase2: inserted %d pending flows",
>> + num_inserted);
>> + }
>> +
>> + return true;
>> +}
>> +
>> +static struct udpif_key **
>> +udpif_add_oor_flows(struct udpif_key **sort_flows, size_t 
>> *total_flow_count,
>> + size_t *alloc_flow_count, struct udpif_key *ukey)
>> +{
>> + if (*total_flow_count >= *alloc_flow_count) {
>>
>> I might be missing the clue here, but alloc_flow_count is always 0, 
>> and can't find the place where it is incremented. So are we always 
>> writing to unallocated memory?
>
> There is no problem here; &alloc_flow_count (pointer) is passed as an 
> input
> argument and is also used as an output parameter by x2nrealloc(); see 
> other
> callers for example or its implementation.

Ack, how could I have missed this :)

>>
>> In addition in the sake of performance, we might want to start with 
>> allocating X number, and increment by X, not by 1.
>
> we can explore this.
>>
>> + sort_flows = x2nrealloc(sort_flows, alloc_flow_count, sizeof ukey);
>> + }
>> + sort_flows[(*total_flow_count)++] = ukey;
>> + return sort_flows;
>> +}
>> +
>> +/*
>> + * Build sort_flows[] initially with flows that
>> + * reference an 'OOR' netdev as their input port.
>> + */
>> +static struct udpif_key **
>> +udpif_build_oor_flows(struct udpif_key **sort_flows, size_t 
>> *total_flow_count,
>> + size_t *alloc_flow_count, struct udpif_key *ukey,
>> + int *oor_netdev_count)
>> +{
>> + struct netdev *netdev;
>> + int count;
>> +
>> + /* Input netdev must be available for the flow */
>> + netdev = ukey->in_netdev;
>> + if (!netdev) {
>> + return sort_flows;
>> + }
>> +
>> + /* Is the in-netdev for this flow in OOR state ? */
>> + if (!netdev_get_hw_info(netdev, HW_INFO_TYPE_OOR)) {
>> + ukey_netdev_unref(ukey);
>> + return sort_flows;
>> + }
>> +
>> + /* Add the flow to sort_flows[] */
>> + sort_flows = udpif_add_oor_flows(sort_flows, total_flow_count,
>> + alloc_flow_count, ukey);
>> + if (ukey->offloaded) {
>> + count = netdev_get_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT);
>> + ovs_assert(count >= 0);
>> + if (count++ == 0) {
>> + (*oor_netdev_count)++;
>> + }
>> + netdev_set_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT, count);
>> + } else {
>> + count = netdev_get_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT);
>> + ovs_assert(count >= 0);
>>
>> We can loose the asserts if we make the value unsigned
>
> I'd like to use signed int for the count; the assert can be removed 
> since it
> is being set only in 2 places and there's no code (yet) that 
> decrements it.
>>
>> + netdev_set_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT, ++count);
>> + }
>> +
>> + return sort_flows;
>> +}
>> +
>> +/*
>> + * Rebalance offloaded flows on HW netdevs that are in OOR state.
>> + */
>> +static void
>> +udpif_flow_rebalance(struct udpif *udpif)
>> +{
>> + struct udpif_key **sort_flows = NULL;
>> + size_t alloc_flow_count = 0;
>> + size_t total_flow_count = 0;
>> + int oor_netdev_count = 0;
>> + int offload_index = 0;
>> + int pending_index;
>> +
>> + /* Collect flows (offloaded and pending) that reference OOR netdevs 
>> */
>> + for (size_t i = 0; i < N_UMAPS; i++) {
>> + struct udpif_key *ukey;
>> + struct umap *umap = &udpif->ukeys[i];
>> +
>> + CMAP_FOR_EACH (ukey, cmap_node, &umap->cmap) {
>> + ukey_to_flow_netdev(udpif, ukey);
>> + sort_flows = udpif_build_oor_flows(Sort_flows, &total_flow_count,
>> + &alloc_flow_count, ukey,
>> + &oor_netdev_count);
>> + }
>> + }
>> +
>> + /* Sort flows by OOR netdevs, state (offloaded/pending) and 
>> pps-rate */
>> + qsort(sort_flows, total_flow_count, sizeof(struct udpif_key *),
>> + flow_compare_rebalance);
>> +
>> + /*
>> + * We now have flows referencing OOR netdevs, that are sorted. We 
>> also
>> + * have a count of offloaded and pending flows on each of the 
>> netdevs
>> + * that are in OOR state. Now rebalance each oor-netdev.
>> + */
>> + while (oor_netdev_count) {
>> + struct netdev *netdev;
>> + int offload_count;
>> + int pending_count;
>> + bool oor;
>> +
>> + netdev = sort_flows[offload_index]->in_netdev;
>> + ovs_assert(netdev_get_hw_info(netdev, HW_INFO_TYPE_OOR) == true);
>>
>> Think you should loose the assert, as you just checked this in 
>> udpif_build_oor_flows()
>
> yes, this can be removed.
>>
>> If you are trying to safeguard stuff here I would make sure that none 
>> of the indexes used are above the total_flow_count value.
>
> agree, can be added.
>
> Thanks,
> -Harsha
>>
>> + VLOG_DBG("Offload rebalance: netdev: %s is OOR", netdev->name);
>> +
>> + offload_count = netdev_get_hw_info(netdev, 
>> HW_INFO_TYPE_OFFL_COUNT);
>> + pending_count = netdev_get_hw_info(netdev, 
>> HW_INFO_TYPE_PEND_COUNT);
>> + pending_index = offload_index + offload_count;
>> +
>> + oor = rebalance_device(udpif,
>> + &sort_flows[offload_index], offload_count,
>> + &sort_flows[pending_index], pending_count);
>> + netdev_set_hw_info(netdev, HW_INFO_TYPE_OOR, oor);
>> +
>> + offload_index = pending_index + pending_count;
>> + netdev_set_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT, 0);
>> + netdev_set_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT, 0);
>> + oor_netdev_count--;
>> + }
>> +
>> + for (int i = 0; i < total_flow_count; i++) {
>> + struct udpif_key *ukey = sort_flows[i];
>> + ukey_netdev_unref(ukey);
>> + }
>> + free(sort_flows);
>> +}
>> +
>> +static int
>> +udpif_flow_program(struct udpif *udpif, struct udpif_key *ukey,
>> + enum dpif_offload_type offload_type)
>> +{
>> + struct dpif_op *opsp;
>> + struct ukey_op uop;
>> +
>> + opsp = &uop.dop;
>> + put_op_init(&uop, ukey, DPIF_FP_CREATE);
>> + dpif_operate(udpif->dpif, &opsp, 1, offload_type);
>> +
>> + return opsp->error;
>> +}
>> +
>> +static int
>> +udpif_flow_unprogram(struct udpif *udpif, struct udpif_key *ukey,
>> + enum dpif_offload_type offload_type)
>> +{
>> + struct dpif_op *opsp;
>> + struct ukey_op uop;
>> +
>> + opsp = &uop.dop;
>> + delete_op_init(udpif, &uop, ukey);
>> + dpif_operate(udpif->dpif, &opsp, 1, offload_type);
>> +
>> + return opsp->error;
>> +}
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index f05f616fe..2bfe4ff24 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -519,6 +519,27 @@
>> </p>
>> </column>
>>
>> + <column name="other_config" key="offload-rebalance"
>> + type='{"type": "boolean"}'>
>> + <p>
>> + Configures HW offload rebalancing, that allows to dynamically
>> + offload and un-offload flows while an offload-device is out of
>> + resources (OOR). This policy allows flows to be selected for
>> + offloading based on the packets-per-second (pps) rate of flows.
>> + </p>
>> + <p>
>> + Set this value to <code>true</code> to enable this option.
>> + </p>
>> + <p>
>> + The default value is <code>false</code>. Changing this value 
>> requires
>> + restarting the daemon.
>> + </p>
>> + <p>
>> + This is only relevant if HW offloading is enabled (hw-offload).
>> + When this policy is enabled, it also requires 'tc-policy' to
>> + be set to 'skip_sw'.
>> + </p>
>> + </column>
>> </group>
>>
>> <group title="Status">
>> --
>> 2.18.0.rc1.1.g6f333ff
>>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
diff mbox series

Patch

diff --git a/NEWS b/NEWS
index 33b4d8a23..846b46fb5 100644
--- a/NEWS
+++ b/NEWS
@@ -8,6 +8,9 @@  Post-v2.10.0
      as the default timeout for control utilities.
    - ovn:
      * ovn-ctl: allow passing user:group ids to the OVN daemons.
+   - ovs-vswitchd:
+     * New configuration option "offload-rebalance", that enables dynamic
+       rebalancing of offloaded flows.
 
 
 v2.10.0 - xx xxx xxxx
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 7c0300cc5..1c01d2278 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -3689,7 +3689,8 @@  dpif_netdev_execute(struct dpif *dpif, struct dpif_execute *execute)
 }
 
 static void
-dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops)
+dpif_netdev_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
+                    enum dpif_offload_type offload_type OVS_UNUSED)
 {
     size_t i;
 
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index b9ce9cbe2..2e01f5750 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -2281,7 +2281,8 @@  dpif_netlink_operate_chunks(struct dpif_netlink *dpif, struct dpif_op **ops,
 }
 
 static void
-dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops)
+dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops,
+                     enum dpif_offload_type offload_type)
 {
     struct dpif_netlink *dpif = dpif_netlink_cast(dpif_);
     struct dpif_op *new_ops[OPERATE_MAX_OPS];
@@ -2289,7 +2290,12 @@  dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops)
     int i = 0;
     int err = 0;
 
-    if (netdev_is_flow_api_enabled()) {
+    if (offload_type == DPIF_OFFLOAD_ALWAYS && !netdev_is_flow_api_enabled()) {
+        VLOG_DBG("Invalid offload_type: %d", offload_type);
+        return;
+    }
+
+    if (offload_type != DPIF_OFFLOAD_NEVER && netdev_is_flow_api_enabled()) {
         while (n_ops > 0) {
             count = 0;
 
@@ -2298,6 +2304,23 @@  dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops)
 
                 err = try_send_to_netdev(dpif, op);
                 if (err && err != EEXIST) {
+                    if (offload_type == DPIF_OFFLOAD_ALWAYS) {
+                        /* We got an error while offloading an op. Since
+                         * OFFLOAD_ALWAYS is specified, we stop further
+                         * processing and return to the caller without
+                         * invoking kernel datapath as fallback. But the
+                         * interface requires us to process all n_ops; so
+                         * return the same error in the remaining ops too.
+                         */
+                        op->error = err;
+                        n_ops--;
+                        while (n_ops > 0) {
+                            op = ops[i++];
+                            op->error = err;
+                            n_ops--;
+                        }
+                        return;
+                    }
                     new_ops[count++] = op;
                 } else {
                     op->error = err;
@@ -2308,7 +2331,7 @@  dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops)
 
             dpif_netlink_operate_chunks(dpif, new_ops, count);
         }
-    } else {
+    } else if (offload_type != DPIF_OFFLOAD_ALWAYS) {
         dpif_netlink_operate_chunks(dpif, ops, n_ops);
     }
 }
diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
index 7a71f5c0a..a30de740f 100644
--- a/lib/dpif-provider.h
+++ b/lib/dpif-provider.h
@@ -296,12 +296,14 @@  struct dpif_class {
 
     int (*flow_dump_next)(struct dpif_flow_dump_thread *thread,
                           struct dpif_flow *flows, int max_flows);
-
     /* Executes each of the 'n_ops' operations in 'ops' on 'dpif', in the order
      * in which they are specified, placing each operation's results in the
      * "output" members documented in comments and the 'error' member of each
-     * dpif_op. */
-    void (*operate)(struct dpif *dpif, struct dpif_op **ops, size_t n_ops);
+     * dpif_op. The offload_type argument tells the provider if 'ops' should
+     * be submitted to to a netdev (only offload) or to the kernel datapath
+     * (never offload) or to both (offload if possible; software fallback). */
+    void (*operate)(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
+                    enum dpif_offload_type offload_type);
 
     /* Enables or disables receiving packets with dpif_recv() for 'dpif'.
      * Turning packet receive off and then back on is allowed to change Netlink
diff --git a/lib/dpif.c b/lib/dpif.c
index d799f972c..65880b86a 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -49,6 +49,7 @@ 
 #include "valgrind.h"
 #include "openvswitch/ofp-errors.h"
 #include "openvswitch/vlog.h"
+#include "lib/netdev-provider.h"
 
 VLOG_DEFINE_THIS_MODULE(dpif);
 
@@ -1015,7 +1016,7 @@  dpif_flow_get(struct dpif *dpif,
     op.flow_get.flow->key_len = key_len;
 
     opp = &op;
-    dpif_operate(dpif, &opp, 1);
+    dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
 
     return op.error;
 }
@@ -1045,7 +1046,7 @@  dpif_flow_put(struct dpif *dpif, enum dpif_flow_put_flags flags,
     op.flow_put.stats = stats;
 
     opp = &op;
-    dpif_operate(dpif, &opp, 1);
+    dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
 
     return op.error;
 }
@@ -1068,7 +1069,7 @@  dpif_flow_del(struct dpif *dpif,
     op.flow_del.terse = false;
 
     opp = &op;
-    dpif_operate(dpif, &opp, 1);
+    dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
 
     return op.error;
 }
@@ -1325,7 +1326,7 @@  dpif_execute(struct dpif *dpif, struct dpif_execute *execute)
         op.execute = *execute;
 
         opp = &op;
-        dpif_operate(dpif, &opp, 1);
+        dpif_operate(dpif, &opp, 1, DPIF_OFFLOAD_AUTO);
 
         return op.error;
     } else {
@@ -1336,10 +1337,21 @@  dpif_execute(struct dpif *dpif, struct dpif_execute *execute)
 /* Executes each of the 'n_ops' operations in 'ops' on 'dpif', in the order in
  * which they are specified.  Places each operation's results in the "output"
  * members documented in comments, and 0 in the 'error' member on success or a
- * positive errno on failure. */
+ * positive errno on failure.
+ */
 void
-dpif_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops)
-{
+dpif_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
+             enum dpif_offload_type offload_type)
+{
+    if (offload_type == DPIF_OFFLOAD_ALWAYS && !netdev_is_flow_api_enabled()) {
+        size_t i;
+        for (i = 0; i < n_ops; i++) {
+            struct dpif_op *op = ops[i];
+            op->error = EINVAL;
+        }
+        return;
+    }
+
     while (n_ops > 0) {
         size_t chunk;
 
@@ -1360,7 +1372,7 @@  dpif_operate(struct dpif *dpif, struct dpif_op **ops, size_t n_ops)
              * handle itself, without help. */
             size_t i;
 
-            dpif->dpif_class->operate(dpif, ops, chunk);
+            dpif->dpif_class->operate(dpif, ops, chunk, offload_type);
 
             for (i = 0; i < chunk; i++) {
                 struct dpif_op *op = ops[i];
@@ -1657,7 +1669,7 @@  dpif_queue_to_priority(const struct dpif *dpif, uint32_t queue_id,
     log_operation(dpif, "queue_to_priority", error);
     return error;
 }
-
+
 void
 dpif_init(struct dpif *dpif, const struct dpif_class *dpif_class,
           const char *name,
diff --git a/lib/dpif.h b/lib/dpif.h
index bbdc3eb6c..0675ab19f 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -614,6 +614,13 @@  enum dpif_op_type {
     DPIF_OP_FLOW_GET,
 };
 
+/* offload_type argument types to (*operate) interface */
+enum dpif_offload_type {
+    DPIF_OFFLOAD_AUTO,         /* Offload if possible, fallback to software. */
+    DPIF_OFFLOAD_NEVER,        /* Never offload to hardware. */
+    DPIF_OFFLOAD_ALWAYS,       /* Always offload to hardware. */
+};
+
 /* Add or modify a flow.
  *
  * The flow is specified by the Netlink attributes with types OVS_KEY_ATTR_* in
@@ -768,8 +775,9 @@  struct dpif_op {
     };
 };
 
-void dpif_operate(struct dpif *, struct dpif_op **ops, size_t n_ops);
-
+void dpif_operate(struct dpif *, struct dpif_op **ops, size_t n_ops,
+                  enum dpif_offload_type);
+
 /* Upcalls. */
 
 enum dpif_upcall_type {
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index e320dad61..fb0c27e6e 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -38,10 +38,14 @@  struct netdev_tnl_build_header_params;
 /* Offload-capable (HW) netdev information */
 struct netdev_hw_info {
     bool oor;		/* Out of Offload Resources ? */
+    int offload_count;  /* Pending (non-offloaded) flow count */
+    int pending_count;  /* Offloaded flow count */
 };
 
 enum hw_info_type {
-    HW_INFO_TYPE_OOR = 1	/* OOR state */
+    HW_INFO_TYPE_OOR = 1,		/* OOR state */
+    HW_INFO_TYPE_PEND_COUNT = 2,	/* Pending(non-offloaded) flow count */
+    HW_INFO_TYPE_OFFL_COUNT = 3		/* Offloaded flow count */
 };
 
 /* A network device (e.g. an Ethernet device).
@@ -89,7 +93,6 @@  struct netdev {
     int n_rxq;
     struct shash_node *node;            /* Pointer to element in global map. */
     struct ovs_list saved_flags_list; /* Contains "struct netdev_saved_flags". */
-
     struct netdev_hw_info hw_info;	/* offload-capable netdev info */
 };
 
diff --git a/lib/netdev.c b/lib/netdev.c
index f3fa08ca3..5d7f9c89b 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -2260,11 +2260,23 @@  netdev_get_block_id(struct netdev *netdev)
 int
 netdev_get_hw_info(struct netdev *netdev, int type)
 {
-    if (type == HW_INFO_TYPE_OOR) {
-        return netdev->hw_info.oor;
+    int val = -1;
+
+    switch (type) {
+    case HW_INFO_TYPE_OOR:
+        val = netdev->hw_info.oor;
+        break;
+    case HW_INFO_TYPE_PEND_COUNT:
+        val = netdev->hw_info.pending_count;
+        break;
+    case HW_INFO_TYPE_OFFL_COUNT:
+        val = netdev->hw_info.offload_count;
+        break;
+    default:
+        break;
     }
 
-    return -1;
+    return val;
 }
 
 /*
@@ -2273,9 +2285,47 @@  netdev_get_hw_info(struct netdev *netdev, int type)
 void
 netdev_set_hw_info(struct netdev *netdev, int type, int val)
 {
-    if (type == HW_INFO_TYPE_OOR) {
+    switch (type) {
+    case HW_INFO_TYPE_OOR:
+        if (val == 0) {
+            VLOG_DBG("Offload rebalance: netdev: %s is not OOR", netdev->name);
+        }
         netdev->hw_info.oor = val;
+        break;
+    case HW_INFO_TYPE_PEND_COUNT:
+        netdev->hw_info.pending_count = val;
+        break;
+    case HW_INFO_TYPE_OFFL_COUNT:
+        netdev->hw_info.offload_count = val;
+        break;
+    default:
+        break;
+    }
+}
+
+/*
+ * Find if any netdev is in OOR state. Return true if there's at least
+ * one netdev that's in OOR state; otherwise return false.
+ */
+bool
+netdev_any_oor(void)
+    OVS_EXCLUDED(netdev_mutex)
+{
+    struct shash_node *node;
+    bool oor = false;
+
+    ovs_mutex_lock(&netdev_mutex);
+    SHASH_FOR_EACH (node, &netdev_shash) {
+        struct netdev *dev = node->data;
+
+        if (dev->hw_info.oor) {
+            oor = true;
+            break;
+        }
     }
+    ovs_mutex_unlock(&netdev_mutex);
+
+    return oor;
 }
 
 bool
@@ -2549,6 +2599,10 @@  netdev_set_flow_api_enabled(const struct smap *ovs_other_config)
             tc_set_policy(smap_get_def(ovs_other_config, "tc-policy",
                                        TC_POLICY_DEFAULT));
 
+            if (smap_get_bool(ovs_other_config, "offload-rebalance", false)) {
+                netdev_offload_rebalance_policy = true;
+            }
+
             netdev_ports_flow_init();
 
             ovsthread_once_done(&once);
diff --git a/lib/netdev.h b/lib/netdev.h
index b0e5c5b72..373be7cc0 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -229,6 +229,7 @@  int netdev_init_flow_api(struct netdev *);
 uint32_t netdev_get_block_id(struct netdev *);
 int netdev_get_hw_info(struct netdev *, int);
 void netdev_set_hw_info(struct netdev *, int, int);
+bool netdev_any_oor(void);
 bool netdev_is_flow_api_enabled(void);
 void netdev_set_flow_api_enabled(const struct smap *ovs_other_config);
 bool netdev_is_offload_rebalance_policy_enabled(void);
diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index a372d6252..bb9e61b7c 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -22,6 +22,7 @@ 
 #include "connmgr.h"
 #include "coverage.h"
 #include "cmap.h"
+#include "lib/dpif-provider.h"
 #include "dpif.h"
 #include "openvswitch/dynamic-string.h"
 #include "fail-open.h"
@@ -42,7 +43,6 @@ 
 #include "tunnel.h"
 #include "unixctl.h"
 #include "openvswitch/vlog.h"
-#include "lib/dpif-provider.h"
 #include "lib/netdev-provider.h"
 
 #define MAX_QUEUE_LENGTH 512
@@ -182,6 +182,8 @@  struct udpif {
     uint64_t conn_seq;                 /* Corresponds to 'dump_seq' when
                                           conns[n_conns-1] was stored. */
     size_t n_conns;                    /* Number of connections waiting. */
+
+    long long int offload_rebalance_time;  /* Time of last offload rebalance */
 };
 
 enum upcall_type {
@@ -308,6 +310,7 @@  struct udpif_key {
     struct recirc_refs recircs;  /* Action recirc IDs with references held. */
 
 #define OFFL_REBAL_INTVL_MSEC  3000	/* dynamic offload rebalance freq */
+    struct netdev *in_netdev;		/* in_odp_port's netdev */
     bool offloaded;			/* True if flow is offloaded */
     uint64_t flow_pps_rate;		/* Packets-Per-Second rate */
     long long int flow_time;		/* last pps update time */
@@ -396,6 +399,12 @@  static int upcall_receive(struct upcall *, const struct dpif_backer *,
                           const ovs_u128 *ufid, const unsigned pmd_id);
 static void upcall_uninit(struct upcall *);
 
+static void udpif_flow_rebalance(struct udpif *udpif);
+static int udpif_flow_program(struct udpif *udpif, struct udpif_key *ukey,
+                              enum dpif_offload_type offload_type);
+static int udpif_flow_unprogram(struct udpif *udpif, struct udpif_key *ukey,
+                                enum dpif_offload_type offload_type);
+
 static upcall_callback upcall_cb;
 static dp_purge_callback dp_purge_cb;
 
@@ -567,6 +576,7 @@  udpif_start_threads(struct udpif *udpif, size_t n_handlers_,
         ovs_barrier_init(&udpif->pause_barrier, udpif->n_revalidators + 1);
         udpif->reval_exit = false;
         udpif->pause = false;
+        udpif->offload_rebalance_time = time_msec();
         udpif->revalidators = xzalloc(udpif->n_revalidators
                                       * sizeof *udpif->revalidators);
         for (size_t i = 0; i < udpif->n_revalidators; i++) {
@@ -859,6 +869,26 @@  free_dupcall:
     return n_upcalls;
 }
 
+static void
+udpif_run_flow_rebalance(struct udpif *udpif)
+{
+    long long int now = 0;
+
+    /* Don't rebalance if OFFL_REBAL_INTVL_MSEC have not elapsed */
+    now = time_msec();
+    if (now < udpif->offload_rebalance_time + OFFL_REBAL_INTVL_MSEC) {
+        return;
+    }
+
+    if (!netdev_any_oor()) {
+        return;
+    }
+
+    VLOG_DBG("Offload rebalance: Found OOR netdevs");
+    udpif->offload_rebalance_time = now;
+    udpif_flow_rebalance(udpif);
+}
+
 static void *
 udpif_revalidator(void *arg)
 {
@@ -933,6 +963,9 @@  udpif_revalidator(void *arg)
 
             dpif_flow_dump_destroy(udpif->dump);
             seq_change(udpif->dump_seq);
+            if (netdev_is_offload_rebalance_policy_enabled()) {
+                udpif_run_flow_rebalance(udpif);
+            }
 
             duration = MAX(time_msec() - start_time, 1);
             udpif->dump_duration = duration;
@@ -977,7 +1010,7 @@  udpif_revalidator(void *arg)
 
     return NULL;
 }
-
+
 static enum upcall_type
 classify_upcall(enum dpif_upcall_type type, const struct nlattr *userdata,
                 struct user_action_cookie *cookie)
@@ -1579,7 +1612,7 @@  handle_upcalls(struct udpif *udpif, struct upcall *upcalls,
     for (i = 0; i < n_ops; i++) {
         opsp[n_opsp++] = &ops[i].dop;
     }
-    dpif_operate(udpif->dpif, opsp, n_opsp);
+    dpif_operate(udpif->dpif, opsp, n_opsp, DPIF_OFFLOAD_AUTO);
     for (i = 0; i < n_ops; i++) {
         struct udpif_key *ukey = ops[i].ukey;
 
@@ -1671,13 +1704,13 @@  ukey_create__(const struct nlattr *key, size_t key_len,
     ukey->state = UKEY_CREATED;
     ukey->state_thread = ovsthread_id_self();
     ukey->state_where = OVS_SOURCE_LOCATOR;
-    ukey->created = time_msec();
+    ukey->created = ukey->flow_time = time_msec();
     memset(&ukey->stats, 0, sizeof ukey->stats);
     ukey->stats.used = used;
     ukey->xcache = NULL;
 
     ukey->offloaded = false;
-    ukey->flow_time = 0;
+    ukey->in_netdev = NULL;
     ukey->flow_packets = ukey->flow_backlog_packets = 0;
 
     ukey->key_recirc_id = key_recirc_id;
@@ -2329,7 +2362,7 @@  push_dp_ops(struct udpif *udpif, struct ukey_op *ops, size_t n_ops)
     for (i = 0; i < n_ops; i++) {
         opsp[i] = &ops[i].dop;
     }
-    dpif_operate(udpif->dpif, opsp, n_ops);
+    dpif_operate(udpif->dpif, opsp, n_ops, DPIF_OFFLOAD_AUTO);
 
     for (i = 0; i < n_ops; i++) {
         struct ukey_op *op = &ops[i];
@@ -2455,6 +2488,57 @@  reval_op_init(struct ukey_op *op, enum reval_result result,
     }
 }
 
+static void
+ukey_netdev_unref(struct udpif_key *ukey)
+{
+    if (!ukey->in_netdev) {
+        return;
+    }
+    netdev_close(ukey->in_netdev);
+    ukey->in_netdev = NULL;
+}
+
+/*
+ * Given a udpif_key, get its input port (netdev) by parsing the flow keys
+ * and actions. The flow may not contain flow attributes if it is a terse
+ * dump; read its attributes from the ukey and then parse the flow to get
+ * the port info. Save them in udpif_key.
+ */
+static void
+ukey_to_flow_netdev(struct udpif *udpif, struct udpif_key *ukey)
+{
+    const struct dpif *dpif = udpif->dpif;
+    const struct dpif_class *dpif_class = dpif->dpif_class;
+    const struct nlattr *k;
+    unsigned int left;
+
+    /* Remove existing references to netdev */
+    ukey_netdev_unref(ukey);
+
+    /* Find the input port and get a reference to its netdev */
+    NL_ATTR_FOR_EACH (k, left, ukey->key, ukey->key_len) {
+        enum ovs_key_attr type = nl_attr_type(k);
+
+        if (type == OVS_KEY_ATTR_IN_PORT) {
+            ukey->in_netdev = netdev_ports_get(nl_attr_get_odp_port(k),
+                                               dpif_class);
+        } else if (type == OVS_KEY_ATTR_TUNNEL) {
+            struct flow_tnl tnl;
+            enum odp_key_fitness res;
+
+            if (ukey->in_netdev) {
+                netdev_close(ukey->in_netdev);
+                ukey->in_netdev = NULL;
+            }
+            res = odp_tun_key_from_attr(k, &tnl);
+            if (res != ODP_FIT_ERROR) {
+                ukey->in_netdev = flow_get_tunnel_netdev(&tnl);
+                break;
+            }
+        }
+    }
+}
+
 static uint64_t
 udpif_flow_packet_delta(struct udpif_key *ukey, const struct dpif_flow *f)
 {
@@ -2468,6 +2552,16 @@  udpif_flow_time_delta(struct udpif *udpif, struct udpif_key *ukey)
     return (udpif->dpif->current_ms - ukey->flow_time) / 1000;
 }
 
+/*
+ * Save backlog packet count while switching modes
+ * between offloaded and kernel datapaths.
+ */
+static void
+udpif_set_ukey_backlog_packets(struct udpif_key *ukey)
+{
+    ukey->flow_backlog_packets = ukey->flow_packets;
+}
+
 /* Gather pps-rate for the given dpif_flow and save it in its ukey */
 static void
 udpif_update_flow_pps(struct udpif *udpif, struct udpif_key *ukey,
@@ -2539,6 +2633,7 @@  revalidate(struct revalidator *revalidator)
         kill_them_all = n_dp_flows > flow_limit * 2;
         max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle;
 
+        udpif->dpif->current_ms = time_msec();
         for (f = flows; f < &flows[n_dumped]; f++) {
             long long int used = f->stats.used;
             struct recirc_refs recircs = RECIRC_REFS_EMPTY_INITIALIZER;
@@ -2915,3 +3010,342 @@  upcall_unixctl_purge(struct unixctl_conn *conn, int argc OVS_UNUSED,
     }
     unixctl_command_reply(conn, "");
 }
+
+/* Flows are sorted in the following order:
+ * netdev, flow state (offloaded/kernel path), flow_pps_rate.
+ */
+static int
+flow_compare_rebalance(const void *elem1, const void *elem2)
+{
+    const struct udpif_key *f1 = *(struct udpif_key **)elem1;
+    const struct udpif_key *f2 = *(struct udpif_key **)elem2;
+    int64_t diff;
+
+    if (f1->in_netdev < f2->in_netdev) {
+        return -1;
+    } else if (f1->in_netdev > f2->in_netdev) {
+        return 1;
+    }
+
+    if (f1->offloaded != f2->offloaded) {
+        return f2->offloaded - f1->offloaded;
+    }
+
+    diff = (f1->offloaded == true) ?
+        f1->flow_pps_rate - f2->flow_pps_rate :
+        f2->flow_pps_rate - f1->flow_pps_rate;
+
+    return (diff < 0) ? -1 : 1;
+}
+
+/* Insert flows from pending array during rebalancing */
+static int
+rebalance_insert_pending(struct udpif *udpif, struct udpif_key **pending_flows,
+                         int pending_count, int insert_count,
+                         uint64_t rate_threshold)
+{
+    int count = 0;
+
+    for (int i = 0; i < pending_count; i++) {
+        struct udpif_key *flow = pending_flows[i];
+        int err;
+
+        /* Stop offloading pending flows if the insert count is
+         * reached and the flow rate is less than the threshold
+         */
+        if (count >= insert_count && flow->flow_pps_rate < rate_threshold) {
+                break;
+        }
+
+        /* Offload the flow to netdev */
+        err = udpif_flow_program(udpif, flow, DPIF_OFFLOAD_ALWAYS);
+
+        if (err == ENOSPC) {
+            /* Stop if we are out of resources */
+            break;
+        }
+
+        if (err) {
+            continue;
+        }
+
+        /* Offload succeeded; delete it from the kernel datapath */
+        udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_NEVER);
+
+        /* Change the state of the flow, adjust dpif counters */
+        flow->offloaded = true;
+
+        udpif_set_ukey_backlog_packets(flow);
+        count++;
+    }
+
+    return count;
+}
+
+/* Remove flows from offloaded array during rebalancing */
+static void
+rebalance_remove_offloaded(struct udpif *udpif,
+                           struct udpif_key **offloaded_flows,
+                           int offload_count)
+{
+    for (int i = 0; i < offload_count; i++) {
+        struct udpif_key *flow = offloaded_flows[i];
+        int err;
+
+        /* Install the flow into kernel path first */
+        err = udpif_flow_program(udpif, flow, DPIF_OFFLOAD_NEVER);
+        if (err) {
+            continue;
+        }
+
+        /* Success; now remove offloaded flow from netdev */
+        err = udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_ALWAYS);
+        if (err) {
+            udpif_flow_unprogram(udpif, flow, DPIF_OFFLOAD_NEVER);
+            continue;
+        }
+        udpif_set_ukey_backlog_packets(flow);
+        flow->offloaded = false;
+    }
+}
+
+/*
+ * Rebalance offloaded flows on a netdev that's in OOR state.
+ *
+ * The rebalancing is done in two phases. In the first phase, we check if
+ * the pending flows can be offloaded (if some resources became available
+ * in the meantime) by trying to offload each pending flow. If all pending
+ * flows get successfully offloaded, the OOR state is cleared on the netdev
+ * and there's nothing to rebalance.
+ *
+ * If some of the pending flows could not be offloaded, i.e, we still see
+ * the OOR error, then we move to the second phase of rebalancing. In this
+ * phase, the rebalancer compares pps-rate of an offloaded flow with the
+ * least pps-rate with that of a pending flow with the highest pps-rate from
+ * their respective sorted arrays. If pps-rate of the offloaded flow is less
+ * than the pps-rate of the pending flow, then it deletes the offloaded flow
+ * from the HW/netdev and adds it to kernel datapath and then offloads pending
+ * to HW/netdev. This process is repeated for every pair of offloaded and
+ * pending flows in the ordered list. The process stops when we encounter an
+ * offloaded flow that has a higher pps-rate than the corresponding pending
+ * flow. The entire rebalancing process is repeated in the next iteration.
+ */
+static bool
+rebalance_device(struct udpif *udpif, struct udpif_key **offloaded_flows,
+                 int offload_count, struct udpif_key **pending_flows,
+                 int pending_count)
+{
+
+    /* Phase 1 */
+    int num_inserted = rebalance_insert_pending(udpif, pending_flows,
+                                                pending_count, pending_count,
+                                                0);
+    if (num_inserted) {
+        VLOG_DBG("Offload rebalance: Phase1: inserted %d pending flows",
+                  num_inserted);
+    }
+
+    /* Adjust pending array */
+    pending_flows = &pending_flows[num_inserted];
+    pending_count -= num_inserted;
+
+    if (!pending_count) {
+        /*
+         * Successfully offloaded all pending flows. The device
+         * is no longer in OOR state; done rebalancing this device.
+         */
+        return false;
+    }
+
+    /*
+     * Phase 2; determine how many offloaded flows to churn.
+     */
+#define	OFFL_REBAL_MAX_CHURN    1024
+    int churn_count = 0;
+    while (churn_count < OFFL_REBAL_MAX_CHURN && churn_count < offload_count
+           && churn_count < pending_count) {
+        if (pending_flows[churn_count]->flow_pps_rate <=
+            offloaded_flows[churn_count]->flow_pps_rate)
+                break;
+        churn_count++;
+    }
+
+    if (churn_count) {
+        VLOG_DBG("Offload rebalance: Phase2: removing %d offloaded flows",
+                  churn_count);
+    }
+
+    /* Bail early if nothing to churn */
+    if (!churn_count) {
+        return true;
+    }
+
+    /* Remove offloaded flows */
+    rebalance_remove_offloaded(udpif, offloaded_flows, churn_count);
+
+    /* Adjust offloaded array */
+    offloaded_flows = &offloaded_flows[churn_count];
+    offload_count -= churn_count;
+
+    /* Replace offloaded flows with pending flows */
+    num_inserted = rebalance_insert_pending(udpif, pending_flows,
+                                            pending_count, churn_count,
+                                            offload_count ?
+                                            offloaded_flows[0]->flow_pps_rate :
+                                            0);
+    if (num_inserted) {
+        VLOG_DBG("Offload rebalance: Phase2: inserted %d pending flows",
+                  num_inserted);
+    }
+
+    return true;
+}
+
+static struct udpif_key **
+udpif_add_oor_flows(struct udpif_key **sort_flows, size_t *total_flow_count,
+                    size_t *alloc_flow_count, struct udpif_key *ukey)
+{
+    if (*total_flow_count >= *alloc_flow_count) {
+        sort_flows = x2nrealloc(sort_flows, alloc_flow_count, sizeof ukey);
+    }
+    sort_flows[(*total_flow_count)++] = ukey;
+    return sort_flows;
+}
+
+/*
+ * Build sort_flows[] initially with flows that
+ * reference an 'OOR' netdev as their input port.
+ */
+static struct udpif_key **
+udpif_build_oor_flows(struct udpif_key **sort_flows, size_t *total_flow_count,
+                      size_t *alloc_flow_count, struct udpif_key *ukey,
+                      int *oor_netdev_count)
+{
+    struct netdev *netdev;
+    int count;
+
+    /* Input netdev must be available for the flow */
+    netdev = ukey->in_netdev;
+    if (!netdev) {
+        return sort_flows;
+    }
+
+    /* Is the in-netdev for this flow in OOR state ? */
+    if (!netdev_get_hw_info(netdev, HW_INFO_TYPE_OOR)) {
+        ukey_netdev_unref(ukey);
+        return sort_flows;
+    }
+
+    /* Add the flow to sort_flows[] */
+    sort_flows = udpif_add_oor_flows(sort_flows, total_flow_count,
+                                      alloc_flow_count, ukey);
+    if (ukey->offloaded) {
+        count = netdev_get_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT);
+        ovs_assert(count >= 0);
+        if (count++ == 0) {
+            (*oor_netdev_count)++;
+        }
+        netdev_set_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT, count);
+    } else {
+        count = netdev_get_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT);
+        ovs_assert(count >= 0);
+        netdev_set_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT, ++count);
+    }
+
+    return sort_flows;
+}
+
+/*
+ * Rebalance offloaded flows on HW netdevs that are in OOR state.
+ */
+static void
+udpif_flow_rebalance(struct udpif *udpif)
+{
+    struct udpif_key **sort_flows = NULL;
+    size_t alloc_flow_count = 0;
+    size_t total_flow_count = 0;
+    int oor_netdev_count = 0;
+    int offload_index = 0;
+    int pending_index;
+
+    /* Collect flows (offloaded and pending) that reference OOR netdevs */
+    for (size_t i = 0; i < N_UMAPS; i++) {
+        struct udpif_key *ukey;
+        struct umap *umap = &udpif->ukeys[i];
+
+        CMAP_FOR_EACH (ukey, cmap_node, &umap->cmap) {
+            ukey_to_flow_netdev(udpif, ukey);
+            sort_flows = udpif_build_oor_flows(sort_flows, &total_flow_count,
+                                               &alloc_flow_count, ukey,
+                                               &oor_netdev_count);
+        }
+    }
+
+    /* Sort flows by OOR netdevs, state (offloaded/pending) and pps-rate  */
+    qsort(sort_flows, total_flow_count, sizeof(struct udpif_key *),
+          flow_compare_rebalance);
+
+    /*
+     * We now have flows referencing OOR netdevs, that are sorted. We also
+     * have a count of offloaded and pending flows on each of the netdevs
+     * that are in OOR state. Now rebalance each oor-netdev.
+     */
+    while (oor_netdev_count) {
+        struct netdev *netdev;
+        int offload_count;
+        int pending_count;
+        bool oor;
+
+        netdev = sort_flows[offload_index]->in_netdev;
+        ovs_assert(netdev_get_hw_info(netdev, HW_INFO_TYPE_OOR) == true);
+        VLOG_DBG("Offload rebalance: netdev: %s is OOR", netdev->name);
+
+        offload_count = netdev_get_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT);
+        pending_count = netdev_get_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT);
+        pending_index = offload_index + offload_count;
+
+        oor = rebalance_device(udpif,
+                               &sort_flows[offload_index], offload_count,
+                               &sort_flows[pending_index], pending_count);
+        netdev_set_hw_info(netdev, HW_INFO_TYPE_OOR, oor);
+
+        offload_index = pending_index + pending_count;
+        netdev_set_hw_info(netdev, HW_INFO_TYPE_OFFL_COUNT, 0);
+        netdev_set_hw_info(netdev, HW_INFO_TYPE_PEND_COUNT, 0);
+        oor_netdev_count--;
+    }
+
+    for (int i = 0; i < total_flow_count; i++) {
+        struct udpif_key *ukey = sort_flows[i];
+        ukey_netdev_unref(ukey);
+    }
+    free(sort_flows);
+}
+
+static int
+udpif_flow_program(struct udpif *udpif, struct udpif_key *ukey,
+                   enum dpif_offload_type offload_type)
+{
+    struct dpif_op *opsp;
+    struct ukey_op uop;
+
+    opsp = &uop.dop;
+    put_op_init(&uop, ukey, DPIF_FP_CREATE);
+    dpif_operate(udpif->dpif, &opsp, 1, offload_type);
+
+    return opsp->error;
+}
+
+static int
+udpif_flow_unprogram(struct udpif *udpif, struct udpif_key *ukey,
+                     enum dpif_offload_type offload_type)
+{
+    struct dpif_op *opsp;
+    struct ukey_op uop;
+
+    opsp = &uop.dop;
+    delete_op_init(udpif, &uop, ukey);
+    dpif_operate(udpif->dpif, &opsp, 1, offload_type);
+
+    return opsp->error;
+}
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index f05f616fe..2bfe4ff24 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -519,6 +519,27 @@ 
         </p>
     </column>
 
+      <column name="other_config" key="offload-rebalance"
+              type='{"type": "boolean"}'>
+        <p>
+            Configures HW offload rebalancing, that allows to dynamically
+            offload and un-offload flows while an offload-device is out of
+            resources (OOR). This policy allows flows to be selected for
+            offloading based on the packets-per-second (pps) rate of flows.
+        </p>
+        <p>
+          Set this value to <code>true</code> to enable this option.
+        </p>
+        <p>
+          The default value is <code>false</code>. Changing this value requires
+          restarting the daemon.
+        </p>
+        <p>
+            This is only relevant if HW offloading is enabled (hw-offload).
+            When this policy is enabled, it also requires 'tc-policy' to
+            be set to 'skip_sw'.
+        </p>
+      </column>
     </group>
 
     <group title="Status">