Message ID | 20180927064827.26256-1-sriharsha.basavapatna@broadcom.com |
---|---|
Headers | show |
Series | Support dynamic rebalancing of offloaded flows | expand |
Thanks for the revision. This seems basically OK at a glance but I'd like a second set of eyes. Simon, are you willing to review this? It seems roughly in your area too. On Thu, Sep 27, 2018 at 12:18:24PM +0530, Sriharsha Basavapatna via dev wrote: > With the current OVS offload design, when an offload-device fails to add a > flow rule and returns an error, OVS adds the rule to the kernel datapath. > The flow gets processed by the kernel datapath for the entire life of that > flow. This is fine when an error is returned by the device due to lack of > support for certain keys or actions. > > But when an error is returned due to temporary conditions such as lack of > resources to add a flow rule, the flow continues to be processed by kernel > even when resources become available later. That is, those flows never get > offloaded again. This problem becomes more pronounced when a flow that has > been initially offloaded may have a smaller packet rate than a later flow > that could not be offloaded due to lack of resources. This leads to > inefficient use of HW resources and wastage of host CPU cycles. > > This patch-set addresses this issue by providing a way to detect temporary > offload resource constraints (Out-Of-Resource or OOR condition) and to > selectively and dynamically offload flows with a higher packets-per-second > (pps) rate. This dynamic rebalancing is done periodically on netdevs that > are in OOR state until resources become available to offload all pending > flows. > > The patch-set involves the following changes at a high level: > > 1. Detection of Out-Of-Resources (OOR) condition on an offload-capable > netdev. > 2. Gathering flow offload selection criteria for all flows on an OOR netdev; > i.e, packets-per-second (pps) rate of flows for offloaded and > non-offloaded (pending) flows. > 3. Dynamically replacing offloaded flows with a lower pps-rate, with > non-offloaded flows with a higher pps-rate, on an OOR netdev. A new > OpenvSwitch configuration option - "offload-rebalance" to enable > this policy. > > Cost/benefits data points: > > 1. Rough cost of the new rebalancing, in terms of CPU time: > > Ran a test that replaced 256 low pps-rate flows(pings) with 256 high > pps-rate flows(iperf), in a system with 4 cpus (Intel Xeon E5 @ 2.40GHz; > 2 cores with hw threads enabled, rest disabled). The data showed that cpu > utilization increased by about ~20%. This increase occurs during the > specific second in which rebalancing is done. And subsequently (from the > next second), cpu utilization decreases significantly due to offloading > of higher pps-rate flows. So effectively there's a bump in cpu utilization > at the time of rebalancing, that is more than compensated by reduced cpu > utilization once the right flows get offloaded. > > 2. Rough benefits to the user in terms of offload performance: > > The benefits to the user is reduced cpu utilization in the host, since > higher pps-rate flows get offloaded, replacing lower pps-rate flows. > Replacing a single offloaded flood ping flow with an iperf flow (multiple > connections), shows that the cpu %used that was originally 100% on a > single cpu (rebalancing disabled) goes down to 35% (rebalancing enabled). > That is, cpu utilization decreased 65% after rebalancing. > > 3. Circumstances under which the benefits would show up: > > The rebalancing benefits would show up once offload resources are > exhausted and new flows with higher pps-rate are initiated, that would > otherwise be handled by kernel datapath costing host cpu cycles. > > This can be observed using 'ovs appctl dpctl/ dump-flows' command. Prior > to rebalancing, any high pps-rate flows that couldn't be offloaded due to > resource crunch would show up in the output of 'dump-flows type=ovs' and > after rebalancing such flows would appear in the output of > 'dump-flows type=offloaded'. > > ****** > > v6-->v7: > - Fixed comments from Simon Horman. Key changes: > - Removed out_netdev and related code which is not needed. > - Fixed a potential race between un-offloading and adding a rule to kernel. > - Moved config option parameter check to respective patches. > - Removed a couple of assertions. > - Added a few comments. > - Added a workaround to avoid a checkpatch warning for a pre-decrement op. > > v5-->v6: > - Fixed a build error reported by 0-day robot in Patch-2 > > v4-->v5: > - Fixed v4 comments from Ben Pfaff. Key changes: > - Updated patch-0 comments to reflect cost/benefit of rebalancing > - Release references to netdevs after rebalancing > - Reduce ref holding window to rebalancing routine > - Avoid collecting all flows before rebalancing (save time/space) > - Skip rebalancing when no OOR netdev (new function netdev_any_oor()) > - Avoid direct access to netdev members (add new netdev interfaces) > - Handle tunnel-netdev as a special case of in-netdev (no separate var) > - Remove floating point type for pps-rate (not needed, per-second based) > - Simplify ukey_to_flow_netdevs() (remove tmp dpif_flow) > - Update skip_offload code to more meaningful offload_type > - dpif_operate() should validate offload_types > - dpif_netlink_operate() should process all ops and set error if needed > - Save time of last rebalancing in udpif (remove static vars) > - Declare variables at point of first use where possible > - Merged patch-4 into patch-3 > - Simplified offload-rebalance config parameter to bool type > - Removed feature enable/disable log messages > > v3-->v4: > - Updated parse_flow_put() with the following changes: > - Fixed outdev memory leak with multiple output actions > - Moved variables closer to their first use > - Removed outdev check while setting oor, since indev is sufficient > > v2-->v3: > - Removed some VLOG_DBG() in patches 2 and 3 > - Reworded a few VLOG_DBG() in patch 3 > - Fixed a comment line in patch 3 > > v1-->v2: > - Fixed build errors reported by 0-day robot > - Updated patch prefixes with relevant subsystem names > > ****** > > Sriharsha Basavapatna (3): > dpif-netlink: Detect Out-Of-Resource condition on a netdev > revalidator: Gather packets-per-second rate of flows > revalidator: Rebalance offloaded flows based on the pps rate > > NEWS | 3 + > lib/dpif-netdev.c | 3 +- > lib/dpif-netlink.c | 47 +++- > lib/dpif-provider.h | 9 +- > lib/dpif.c | 30 ++- > lib/dpif.h | 12 +- > lib/flow.c | 25 ++ > lib/flow.h | 1 + > lib/netdev-provider.h | 14 + > lib/netdev.c | 89 ++++++ > lib/netdev.h | 4 + > ofproto/ofproto-dpif-upcall.c | 494 +++++++++++++++++++++++++++++++++- > vswitchd/vswitch.xml | 21 ++ > 13 files changed, 729 insertions(+), 23 deletions(-) > > -- > 2.18.0.rc1.1.g6f333ff > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
On Thu, Oct 11, 2018 at 04:01:40PM -0700, Ben Pfaff wrote: > Thanks for the revision. > > This seems basically OK at a glance but I'd like a second set of eyes. > Simon, are you willing to review this? It seems roughly in your area > too. Thanks Ben, this has been on my todo list for a little too long. I'll try and get to it today.
On Thu, Oct 11, 2018 at 04:01:40PM -0700, Ben Pfaff wrote: > Thanks for the revision. > > This seems basically OK at a glance but I'd like a second set of eyes. > Simon, are you willing to review this? It seems roughly in your area > too. Thanks Ben, Thanks Sriharsha, I am very pleased to see work in this area. I have a few lingering concerns, which I noted in separate emails regarding * Extra CPU cost of processing OOR flows and; * Correctly detecting the offload device of a tunnel but I think they can be treated as possible further work rather than holding up this patchset. I am also not entirely comfortable with the use of ovs_assert() (in general) but this may be just a matter of personal taste and again I don't think it needs to hold up this patchset. In all, as the feature will be disabled by default and should have negligible impact when disabled I think it would be good to merge in its current form to allow further testing (and ideally evolution) of this feature. Acked-by: Simon Horman <simon.horman@netronome.com> Ben, I would be happy to apply this series but I'd rather do so once master travis-ci clean for master. Something I am looking at separately. I'm also happy for you to apply this series.