
[ovs-dev,RFC] netlink-conntrack: optimize flushing ct zone

Message ID ZVeHQbMvsLgct1On@SIT-SDELAP4051.int.lidl.net
State Superseded
Delegated to: Simon Horman
Series [ovs-dev,RFC] netlink-conntrack: optimize flushing ct zone

Checks

Context                                Check     Description
ovsrobot/apply-robot                   warning   apply and check: warning
ovsrobot/github-robot-_Build_and_Test  success   github build: passed
ovsrobot/intel-ovs-compilation         success   test: success

Commit Message

Felix Huettner Nov. 17, 2023, 3:31 p.m. UTC
NOTE: the improvements in this change depend on a kernel change that is
currently proposed on the netdev mailing list (see [1]).

Previously the kernel did not provide a netlink interface to flush or list
only the conntrack entries matching a specific zone. With [1] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels that do not support this feature ignore the filter.
For the list request, this means that all entries are returned, which we
can then filter in userspace as before.
For the flush request, this means that all conntrack entries are deleted.

This significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an OpenFlow command, it blocks the main
ovs-vswitchd thread and thereby also prevents new flows from being
applied. Most of the benefit (a 90-95% reduction in flush time) can
already be achieved by keeping the existing dump-and-delete logic and
adding the zone filter to the dump request. Using the new logic to flush
directly by zone brings an additional 10-15% on top of that (more
numbers below).

In combination with OVN, the creation of a Logical_Router (which causes
a ct zone to be flushed) can block other operations, e.g. the failover
of Logical_Routers (as they cause new flows to be created).
From a user perspective this is visible as an ovn-controller that is
idle (as it waits for ovs-vswitchd) while ovs-vswitchd reports
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests were run in a QEMU VM with 500,000
conntrack entries distributed evenly over 500 ct zones (1,000 entries
per zone) using `ovstest test-netlink-conntrack flush zone=<zoneid>`.

With this patch and the respective kernel patch applied, but
OVS_NETLINK_CONNTRAK_FLUSH_ZONE_SUPPORTED unset:

-----------------------------------------------------------------------------------------------------------------------------------------------------
                               Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
-----------------------------------------------------------------------------------------------------------------------------------------------------
flush zone with 1000 entries   0.309           0.372           0.393           0.467           0.516           0.374           93.597          250
flush zone with no entry       0.265           0.305           0.333           0.352           0.393           0.307           76.770          250
-----------------------------------------------------------------------------------------------------------------------------------------------------

With this patch and the respective kernel patch applied, and
OVS_NETLINK_CONNTRAK_FLUSH_ZONE_SUPPORTED set:

-----------------------------------------------------------------------------------------------------------------------------------------------------
                               Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
-----------------------------------------------------------------------------------------------------------------------------------------------------
flush zone with 1000 entries   0.256           0.323           0.341           0.367           0.389           0.322           80.729          250
flush zone with no entry       0.225           0.265           0.317           0.336           0.351           0.274           68.659          250
-----------------------------------------------------------------------------------------------------------------------------------------------------

Before this patch and/or without the respective kernel patch:

-----------------------------------------------------------------------------------------------------------------------------------------------------
                               Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
-----------------------------------------------------------------------------------------------------------------------------------------------------
flush zone with 1000 entries   2.499           4.990           5.209           6.435           7.150           5.008           1252.158        250
flush zone with no entry       4.120           4.572           4.783           5.156           5.364           4.559           1139.786        250
-----------------------------------------------------------------------------------------------------------------------------------------------------

[1]: https://lore.kernel.org/netdev/ZVeGFP2x-Wx6duYs@SIT-SDELAP4051.int.lidl.net/T/#u
---
 lib/netlink-conntrack.c | 49 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)

--
2.42.0

This e-mail may contain confidential content and is intended only for the specified recipient/s.
If you are not the intended recipient, please inform the sender immediately and delete this e-mail.

Information on data protection can be found here<https://www.datenschutz.schwarz>.

Patch

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..32be0d122 100644
--- a/lib/netlink-conntrack.c
+++ b/lib/netlink-conntrack.c
@@ -141,6 +141,9 @@  nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,

     nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
                         IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
+    if (zone) {
+        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
+    }
     nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
     ofpbuf_clear(&state->buf);

@@ -283,23 +286,65 @@  nl_ct_flush_zone(uint16_t flush_zone)
     return err;
 }
 #else
+
+static bool netlink_flush_supports_zone(void) {
+    static bool valid, supported = false;
+    if (!valid) {
+        char *env = getenv("OVS_NETLINK_CONNTRAK_FLUSH_ZONE_SUPPORTED");
+        if (env && env[0]) {
+            if (env[0] == 'T' || env[0] == 't') {
+                supported = true;
+            }
+        }
+        valid = true;
+    }
+    return supported;
+}
+
 int
 nl_ct_flush_zone(uint16_t flush_zone)
 {
-    /* Apparently, there's no netlink interface to flush a specific zone.
+    /* In older kernels, there was no netlink interface to flush a specific
+     * conntrack zone.
      * This code dumps every connection, checks the zone and eventually
      * delete the entry.
+     * In newer kernels there is the option to specify a zone for filtering
+     * during dumps. Older kernels ignore this option. We set it here in the
+     * hope we only get relevant entries back, but fall back to filtering here
+     * to keep compatibility.
      *
-     * This is race-prone, but it is better than using shell scripts. */
+     * This is race-prone, but it is better than using shell scripts.
+     *
+     * Additionally, newer kernels also support flushing a zone without listing
+     * it first. However, it is not easily possible to discover if the kernel
+     * supports this feature or if it will flush the complete conntrack table.
+     * We therefore rely on an environment variable, allowing the user to
+     * provide us this information. In the future we can use kernel version
+     * numbers. */

     struct nl_dump dump;
     struct ofpbuf buf, reply, delete;
+    int err;
+
+    if (netlink_flush_supports_zone()) {
+        ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
+
+        nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
+                            IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST);
+        nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
+
+        err = nl_transact(NETLINK_NETFILTER, &buf, NULL);
+        ofpbuf_uninit(&buf);
+
+        return err;
+    }

     ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
     ofpbuf_init(&delete, NL_DUMP_BUFSIZE);

     nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
                         IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
+    nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
     nl_dump_start(&dump, NETLINK_NETFILTER, &buf);
     ofpbuf_clear(&buf);