mbox series

[ovs-dev,v7,0/3] Support dynamic rebalancing of offloaded flows

Message ID 20180927064827.26256-1-sriharsha.basavapatna@broadcom.com
Headers show
Series Support dynamic rebalancing of offloaded flows | expand

Message

Li,Rongqing via dev Sept. 27, 2018, 6:48 a.m. UTC
With the current OVS offload design, when an offload-device fails to add a
flow rule and returns an error, OVS adds the rule to the kernel datapath.
The flow gets processed by the kernel datapath for the entire life of that
flow. This is fine when an error is returned by the device due to lack of
support for certain keys or actions.

But when an error is returned due to temporary conditions such as lack of
resources to add a flow rule, the flow continues to be processed by kernel
even when resources become available later. That is, those flows never get
offloaded again. This problem becomes more pronounced when a flow that has
been initially offloaded may have a smaller packet rate than a later flow
that could not be offloaded due to lack of resources. This leads to
inefficient use of HW resources and wastage of host CPU cycles.

This patch-set addresses this issue by providing a way to detect temporary
offload resource constraints (Out-Of-Resource or OOR condition) and to
selectively and dynamically offload flows with a higher packets-per-second
(pps) rate. This dynamic rebalancing is done periodically on netdevs that
are in OOR state until resources become available to offload all pending
flows.

The patch-set involves the following changes at a high level:

1. Detection of Out-Of-Resources (OOR) condition on an offload-capable 
   netdev.
2. Gathering flow offload selection criteria for all flows on an OOR netdev;
   i.e, packets-per-second (pps) rate of flows for offloaded and
   non-offloaded (pending) flows.
3. Dynamically replacing offloaded flows with a lower pps-rate, with
   non-offloaded flows with a higher pps-rate, on an OOR netdev. A new
   OpenvSwitch configuration option - "offload-rebalance" to enable
   this policy.

Cost/benefits data points:

1. Rough cost of the new rebalancing, in terms of CPU time:

   Ran a test that replaced 256 low pps-rate flows(pings) with 256 high
   pps-rate flows(iperf), in a system with 4 cpus (Intel Xeon E5 @ 2.40GHz;
   2 cores with hw threads enabled, rest disabled). The data showed that cpu
   utilization increased by about ~20%. This increase occurs during the
   specific second in which rebalancing is done. And subsequently (from the
   next second), cpu utilization decreases significantly due to offloading
   of higher pps-rate flows. So effectively there's a bump in cpu utilization
   at the time of rebalancing, that is more than compensated by reduced cpu
   utilization once the right flows get offloaded.

2. Rough benefits to the user in terms of offload performance:

   The benefits to the user is reduced cpu utilization in the host, since
   higher pps-rate flows get offloaded, replacing lower pps-rate flows.
   Replacing a single offloaded flood ping flow with an iperf flow (multiple
   connections), shows that the cpu %used that was originally 100% on a
   single cpu (rebalancing disabled) goes down to 35% (rebalancing enabled).
   That is, cpu utilization decreased 65% after rebalancing.

3. Circumstances under which the benefits would show up:

   The rebalancing benefits would show up once offload resources are
   exhausted and new flows with higher pps-rate are initiated, that would
   otherwise be handled by kernel datapath costing host cpu cycles.

   This can be observed using 'ovs appctl dpctl/ dump-flows' command. Prior
   to rebalancing, any high pps-rate flows that couldn't be offloaded due to
   resource crunch would show up in the output of 'dump-flows type=ovs' and
   after rebalancing such flows would appear in the output of
   'dump-flows type=offloaded'.

******

v6-->v7:
 - Fixed comments from Simon Horman. Key changes:
   - Removed out_netdev and related code which is not needed.
   - Fixed a potential race between un-offloading and adding a rule to kernel.
   - Moved config option parameter check to respective patches.
   - Removed a couple of assertions.
   - Added a few comments.
   - Added a workaround to avoid a checkpatch warning for a pre-decrement op.

v5-->v6:
  - Fixed a build error reported by 0-day robot in Patch-2

v4-->v5:
 - Fixed v4 comments from Ben Pfaff. Key changes:
   - Updated patch-0 comments to reflect cost/benefit of rebalancing
   - Release references to netdevs after rebalancing
   - Reduce ref holding window to rebalancing routine
   - Avoid collecting all flows before rebalancing (save time/space)
   - Skip rebalancing when no OOR netdev (new function netdev_any_oor())
   - Avoid direct access to netdev members (add new netdev interfaces)
   - Handle tunnel-netdev as a special case of in-netdev (no separate var)
   - Remove floating point type for pps-rate (not needed, per-second based)
   - Simplify ukey_to_flow_netdevs() (remove tmp dpif_flow)
   - Update skip_offload code to more meaningful offload_type
   - dpif_operate() should validate offload_types
   - dpif_netlink_operate() should process all ops and set error if needed
   - Save time of last rebalancing in udpif (remove static vars)
   - Declare variables at point of first use where possible
   - Merged patch-4 into patch-3
   - Simplified offload-rebalance config parameter to bool type
   - Removed feature enable/disable log messages

v3-->v4:
  - Updated parse_flow_put() with the following changes:
    - Fixed outdev memory leak with multiple output actions
    - Moved variables closer to their first use
    - Removed outdev check while setting oor, since indev is sufficient

v2-->v3:
  - Removed some VLOG_DBG() in patches 2 and 3
  - Reworded a few VLOG_DBG() in patch 3
  - Fixed a comment line in patch 3

v1-->v2:
  - Fixed build errors reported by 0-day robot
  - Updated patch prefixes with relevant subsystem names

******

Sriharsha Basavapatna (3):
  dpif-netlink: Detect Out-Of-Resource condition on a netdev
  revalidator: Gather packets-per-second rate of flows
  revalidator: Rebalance offloaded flows based on the pps rate

 NEWS                          |   3 +
 lib/dpif-netdev.c             |   3 +-
 lib/dpif-netlink.c            |  47 +++-
 lib/dpif-provider.h           |   9 +-
 lib/dpif.c                    |  30 ++-
 lib/dpif.h                    |  12 +-
 lib/flow.c                    |  25 ++
 lib/flow.h                    |   1 +
 lib/netdev-provider.h         |  14 +
 lib/netdev.c                  |  89 ++++++
 lib/netdev.h                  |   4 +
 ofproto/ofproto-dpif-upcall.c | 494 +++++++++++++++++++++++++++++++++-
 vswitchd/vswitch.xml          |  21 ++
 13 files changed, 729 insertions(+), 23 deletions(-)

Comments

Ben Pfaff Oct. 11, 2018, 11:01 p.m. UTC | #1
Thanks for the revision.

This seems basically OK at a glance but I'd like a second set of eyes.
Simon, are you willing to review this?  It seems roughly in your area
too.

On Thu, Sep 27, 2018 at 12:18:24PM +0530, Sriharsha Basavapatna via dev wrote:
> With the current OVS offload design, when an offload-device fails to add a
> flow rule and returns an error, OVS adds the rule to the kernel datapath.
> The flow gets processed by the kernel datapath for the entire life of that
> flow. This is fine when an error is returned by the device due to lack of
> support for certain keys or actions.
> 
> But when an error is returned due to temporary conditions such as lack of
> resources to add a flow rule, the flow continues to be processed by kernel
> even when resources become available later. That is, those flows never get
> offloaded again. This problem becomes more pronounced when a flow that has
> been initially offloaded may have a smaller packet rate than a later flow
> that could not be offloaded due to lack of resources. This leads to
> inefficient use of HW resources and wastage of host CPU cycles.
> 
> This patch-set addresses this issue by providing a way to detect temporary
> offload resource constraints (Out-Of-Resource or OOR condition) and to
> selectively and dynamically offload flows with a higher packets-per-second
> (pps) rate. This dynamic rebalancing is done periodically on netdevs that
> are in OOR state until resources become available to offload all pending
> flows.
> 
> The patch-set involves the following changes at a high level:
> 
> 1. Detection of Out-Of-Resources (OOR) condition on an offload-capable 
>    netdev.
> 2. Gathering flow offload selection criteria for all flows on an OOR netdev;
>    i.e, packets-per-second (pps) rate of flows for offloaded and
>    non-offloaded (pending) flows.
> 3. Dynamically replacing offloaded flows with a lower pps-rate, with
>    non-offloaded flows with a higher pps-rate, on an OOR netdev. A new
>    OpenvSwitch configuration option - "offload-rebalance" to enable
>    this policy.
> 
> Cost/benefits data points:
> 
> 1. Rough cost of the new rebalancing, in terms of CPU time:
> 
>    Ran a test that replaced 256 low pps-rate flows(pings) with 256 high
>    pps-rate flows(iperf), in a system with 4 cpus (Intel Xeon E5 @ 2.40GHz;
>    2 cores with hw threads enabled, rest disabled). The data showed that cpu
>    utilization increased by about ~20%. This increase occurs during the
>    specific second in which rebalancing is done. And subsequently (from the
>    next second), cpu utilization decreases significantly due to offloading
>    of higher pps-rate flows. So effectively there's a bump in cpu utilization
>    at the time of rebalancing, that is more than compensated by reduced cpu
>    utilization once the right flows get offloaded.
> 
> 2. Rough benefits to the user in terms of offload performance:
> 
>    The benefits to the user is reduced cpu utilization in the host, since
>    higher pps-rate flows get offloaded, replacing lower pps-rate flows.
>    Replacing a single offloaded flood ping flow with an iperf flow (multiple
>    connections), shows that the cpu %used that was originally 100% on a
>    single cpu (rebalancing disabled) goes down to 35% (rebalancing enabled).
>    That is, cpu utilization decreased 65% after rebalancing.
> 
> 3. Circumstances under which the benefits would show up:
> 
>    The rebalancing benefits would show up once offload resources are
>    exhausted and new flows with higher pps-rate are initiated, that would
>    otherwise be handled by kernel datapath costing host cpu cycles.
> 
>    This can be observed using 'ovs appctl dpctl/ dump-flows' command. Prior
>    to rebalancing, any high pps-rate flows that couldn't be offloaded due to
>    resource crunch would show up in the output of 'dump-flows type=ovs' and
>    after rebalancing such flows would appear in the output of
>    'dump-flows type=offloaded'.
> 
> ******
> 
> v6-->v7:
>  - Fixed comments from Simon Horman. Key changes:
>    - Removed out_netdev and related code which is not needed.
>    - Fixed a potential race between un-offloading and adding a rule to kernel.
>    - Moved config option parameter check to respective patches.
>    - Removed a couple of assertions.
>    - Added a few comments.
>    - Added a workaround to avoid a checkpatch warning for a pre-decrement op.
> 
> v5-->v6:
>   - Fixed a build error reported by 0-day robot in Patch-2
> 
> v4-->v5:
>  - Fixed v4 comments from Ben Pfaff. Key changes:
>    - Updated patch-0 comments to reflect cost/benefit of rebalancing
>    - Release references to netdevs after rebalancing
>    - Reduce ref holding window to rebalancing routine
>    - Avoid collecting all flows before rebalancing (save time/space)
>    - Skip rebalancing when no OOR netdev (new function netdev_any_oor())
>    - Avoid direct access to netdev members (add new netdev interfaces)
>    - Handle tunnel-netdev as a special case of in-netdev (no separate var)
>    - Remove floating point type for pps-rate (not needed, per-second based)
>    - Simplify ukey_to_flow_netdevs() (remove tmp dpif_flow)
>    - Update skip_offload code to more meaningful offload_type
>    - dpif_operate() should validate offload_types
>    - dpif_netlink_operate() should process all ops and set error if needed
>    - Save time of last rebalancing in udpif (remove static vars)
>    - Declare variables at point of first use where possible
>    - Merged patch-4 into patch-3
>    - Simplified offload-rebalance config parameter to bool type
>    - Removed feature enable/disable log messages
> 
> v3-->v4:
>   - Updated parse_flow_put() with the following changes:
>     - Fixed outdev memory leak with multiple output actions
>     - Moved variables closer to their first use
>     - Removed outdev check while setting oor, since indev is sufficient
> 
> v2-->v3:
>   - Removed some VLOG_DBG() in patches 2 and 3
>   - Reworded a few VLOG_DBG() in patch 3
>   - Fixed a comment line in patch 3
> 
> v1-->v2:
>   - Fixed build errors reported by 0-day robot
>   - Updated patch prefixes with relevant subsystem names
> 
> ******
> 
> Sriharsha Basavapatna (3):
>   dpif-netlink: Detect Out-Of-Resource condition on a netdev
>   revalidator: Gather packets-per-second rate of flows
>   revalidator: Rebalance offloaded flows based on the pps rate
> 
>  NEWS                          |   3 +
>  lib/dpif-netdev.c             |   3 +-
>  lib/dpif-netlink.c            |  47 +++-
>  lib/dpif-provider.h           |   9 +-
>  lib/dpif.c                    |  30 ++-
>  lib/dpif.h                    |  12 +-
>  lib/flow.c                    |  25 ++
>  lib/flow.h                    |   1 +
>  lib/netdev-provider.h         |  14 +
>  lib/netdev.c                  |  89 ++++++
>  lib/netdev.h                  |   4 +
>  ofproto/ofproto-dpif-upcall.c | 494 +++++++++++++++++++++++++++++++++-
>  vswitchd/vswitch.xml          |  21 ++
>  13 files changed, 729 insertions(+), 23 deletions(-)
> 
> -- 
> 2.18.0.rc1.1.g6f333ff
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Simon Horman Oct. 12, 2018, 5:53 a.m. UTC | #2
On Thu, Oct 11, 2018 at 04:01:40PM -0700, Ben Pfaff wrote:
> Thanks for the revision.
> 
> This seems basically OK at a glance but I'd like a second set of eyes.
> Simon, are you willing to review this?  It seems roughly in your area
> too.

Thanks Ben,

this has been on my todo list for a little too long.
I'll try and get to it today.
Simon Horman Oct. 12, 2018, 9:44 a.m. UTC | #3
On Thu, Oct 11, 2018 at 04:01:40PM -0700, Ben Pfaff wrote:
> Thanks for the revision.
> 
> This seems basically OK at a glance but I'd like a second set of eyes.
> Simon, are you willing to review this?  It seems roughly in your area
> too.

Thanks Ben, Thanks Sriharsha,

I am very pleased to see work in this area.

I have a few lingering concerns, which I noted in separate emails regarding
* Extra CPU cost of processing OOR flows and;
* Correctly detecting the offload device of a tunnel
but I think they can be treated as possible further work rather
than holding up this patchset.

I am also not entirely comfortable with the use of ovs_assert() (in general)
but this may be just a matter of personal taste and again I don't think
it needs to hold up this patchset.

In all, as the feature will be disabled by default and should have negligible
impact when disabled I think it would be good to merge in its current form
to allow further testing (and ideally evolution) of this feature.

Acked-by: Simon Horman <simon.horman@netronome.com>


Ben, I would be happy to apply this series but I'd rather
do so once master travis-ci clean for master. Something I am looking at
separately. I'm also happy for you to apply this series.