diff mbox series

[ovs-dev,v6,2/2] netlink-conntrack: Optimize flushing ct zone.

Message ID f1120a8c35b85a06f3fc58676faa2508aca381d5.1709537628.git.felix.huettner@mail.schwarz
State Changes Requested
Headers show
Series [ovs-dev,v6,1/2] util: Support checking for kernel versions. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Felix Huettner March 4, 2024, 8:22 a.m. UTC
Previously the kernel did not provide a netlink interface to flush/list
only conntrack entries matching a specific zone. With [1] and [2] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels not yet supporting this feature will ignore the filter.
For the list request that means just returning all entries (which we can
then filter in userspace as before).
For the flush request that means deleting all conntrack entries.

The implementation is now identical to the windows one, so we combine
them.

These significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an openflow command it blocks the main ovs thread
and thereby also blocks new flows from being applied. Using this new
feature we can reduce the flushing time for zones by around 93%.

In combination with OVN the creation of a Logical_Router (which causes
the flushing of a ct zone) could block other operations, e.g. the
failover of Logical_Routers (as they cause new flows to be created).
This is visible from a user perspective as a ovn-controller that is idle
(as it waits for vswitchd) and vswitchd reporting:
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests where run in a qemu vm with 500.000
conntrack entries distributed evenly over 500 ct zones using `ovstest
test-netlink-conntrack flush zone=<zoneid>`.

          |  flush zone with 1000 entries  |   flush zone with no entry     |
          +---------------------+----------+---------------------+----------|
          |   with the patch    | without  |   with the patch    | without  |
          +----------+----------+----------+----------+----------+----------|
          | v6.8-rc4 |  v6.7.1  | v6.8-rc4 | v6.8-rc4 |  v6.7.1  | v6.8-rc4 |
+---------+----------+----------+----------+----------+----------+----------|
| Min     |  0.260   |  3.946   |  3.497   |  0.228   |  3.462   |  3.212   |
| Median  |  0.319   |  4.237   |  4.349   |  0.298   |  4.460   |  4.010   |
| 90%ile  |  0.335   |  4.367   |  4.522   |  0.325   |  4.662   |  4.572   |
| 99%ile  |  0.348   |  4.495   |  4.773   |  0.340   |  4.931   |  6.003   |
| Max     |  0.362   |  4.543   |  5.054   |  0.348   |  5.390   |  6.396   |
| Mean    |  0.320   |  4.236   |  4.331   |  0.296   |  4.430   |  4.071   |
| Total   |  80.02   |  1058    |  1082    |  73.93   |  1107    |  1017    |

[1]: https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
[2]: https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a

Co-Authored-By: Luca Czesla <luca.czesla@mail.schwarz>
Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
Co-Authored-By: Max Lamprecht <max.lamprecht@mail.schwarz>
Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
---
v5->v6: none
v4->v5: none
v3->v4:
- combine the flush logic with windows implementation
v2->v3:
- update description to include upstream fix (Thanks to Ilya for finding
  that issue)
v1->v2:
- fixed wrong signed-off-by

 lib/netlink-conntrack.c | 52 ++++++++++++++++++++++++++++++++++++-----
 tests/system-traffic.at |  8 +++++++
 2 files changed, 54 insertions(+), 6 deletions(-)

Comments

Mike Pattrick March 6, 2024, 2:01 p.m. UTC | #1
On Mon, Mar 4, 2024 at 3:22 AM Felix Huettner via dev
<ovs-dev@openvswitch.org> wrote:
>
> Previously the kernel did not provide a netlink interface to flush/list
> only conntrack entries matching a specific zone. With [1] and [2] it is now
> possible to flush and list conntrack entries filtered by zone. Older
> kernels not yet supporting this feature will ignore the filter.
> For the list request that means just returning all entries (which we can
> then filter in userspace as before).
> For the flush request that means deleting all conntrack entries.
>
> The implementation is now identical to the windows one, so we combine
> them.
>
> These significantly improves the performance of flushing conntrack zones
> when the conntrack table is large. Since flushing a conntrack zone is
> normally triggered via an openflow command it blocks the main ovs thread
> and thereby also blocks new flows from being applied. Using this new
> feature we can reduce the flushing time for zones by around 93%.
>
> In combination with OVN the creation of a Logical_Router (which causes
> the flushing of a ct zone) could block other operations, e.g. the
> failover of Logical_Routers (as they cause new flows to be created).
> This is visible from a user perspective as a ovn-controller that is idle
> (as it waits for vswitchd) and vswitchd reporting:
> "blocked 1000 ms waiting for main to quiesce" (potentially with ever
> increasing times).
>
> The following performance tests where run in a qemu vm with 500.000
> conntrack entries distributed evenly over 500 ct zones using `ovstest
> test-netlink-conntrack flush zone=<zoneid>`.
>
>           |  flush zone with 1000 entries  |   flush zone with no entry     |
>           +---------------------+----------+---------------------+----------|
>           |   with the patch    | without  |   with the patch    | without  |
>           +----------+----------+----------+----------+----------+----------|
>           | v6.8-rc4 |  v6.7.1  | v6.8-rc4 | v6.8-rc4 |  v6.7.1  | v6.8-rc4 |
> +---------+----------+----------+----------+----------+----------+----------|
> | Min     |  0.260   |  3.946   |  3.497   |  0.228   |  3.462   |  3.212   |
> | Median  |  0.319   |  4.237   |  4.349   |  0.298   |  4.460   |  4.010   |
> | 90%ile  |  0.335   |  4.367   |  4.522   |  0.325   |  4.662   |  4.572   |
> | 99%ile  |  0.348   |  4.495   |  4.773   |  0.340   |  4.931   |  6.003   |
> | Max     |  0.362   |  4.543   |  5.054   |  0.348   |  5.390   |  6.396   |
> | Mean    |  0.320   |  4.236   |  4.331   |  0.296   |  4.430   |  4.071   |
> | Total   |  80.02   |  1058    |  1082    |  73.93   |  1107    |  1017    |
>
> [1]: https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
> [2]: https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a
>
> Co-Authored-By: Luca Czesla <luca.czesla@mail.schwarz>
> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
> Co-Authored-By: Max Lamprecht <max.lamprecht@mail.schwarz>
> Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
> Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
> ---

Acked-by: Mike Pattrick <mkp@redhat.com>
Ilya Maximets March 8, 2024, 11:08 a.m. UTC | #2
On 3/4/24 09:22, Felix Huettner via dev wrote:
> Previously the kernel did not provide a netlink interface to flush/list
> only conntrack entries matching a specific zone. With [1] and [2] it is now
> possible to flush and list conntrack entries filtered by zone. Older
> kernels not yet supporting this feature will ignore the filter.
> For the list request that means just returning all entries (which we can
> then filter in userspace as before).
> For the flush request that means deleting all conntrack entries.
> 
> The implementation is now identical to the windows one, so we combine
> them.
> 
> These significantly improves the performance of flushing conntrack zones
> when the conntrack table is large. Since flushing a conntrack zone is
> normally triggered via an openflow command it blocks the main ovs thread
> and thereby also blocks new flows from being applied. Using this new
> feature we can reduce the flushing time for zones by around 93%.
> 
> In combination with OVN the creation of a Logical_Router (which causes
> the flushing of a ct zone) could block other operations, e.g. the
> failover of Logical_Routers (as they cause new flows to be created).
> This is visible from a user perspective as a ovn-controller that is idle
> (as it waits for vswitchd) and vswitchd reporting:
> "blocked 1000 ms waiting for main to quiesce" (potentially with ever
> increasing times).
> 
> The following performance tests where run in a qemu vm with 500.000
> conntrack entries distributed evenly over 500 ct zones using `ovstest
> test-netlink-conntrack flush zone=<zoneid>`.
> 
>           |  flush zone with 1000 entries  |   flush zone with no entry     |
>           +---------------------+----------+---------------------+----------|
>           |   with the patch    | without  |   with the patch    | without  |
>           +----------+----------+----------+----------+----------+----------|
>           | v6.8-rc4 |  v6.7.1  | v6.8-rc4 | v6.8-rc4 |  v6.7.1  | v6.8-rc4 |
> +---------+----------+----------+----------+----------+----------+----------|
> | Min     |  0.260   |  3.946   |  3.497   |  0.228   |  3.462   |  3.212   |
> | Median  |  0.319   |  4.237   |  4.349   |  0.298   |  4.460   |  4.010   |
> | 90%ile  |  0.335   |  4.367   |  4.522   |  0.325   |  4.662   |  4.572   |
> | 99%ile  |  0.348   |  4.495   |  4.773   |  0.340   |  4.931   |  6.003   |
> | Max     |  0.362   |  4.543   |  5.054   |  0.348   |  5.390   |  6.396   |
> | Mean    |  0.320   |  4.236   |  4.331   |  0.296   |  4.430   |  4.071   |
> | Total   |  80.02   |  1058    |  1082    |  73.93   |  1107    |  1017    |
> 
> [1]: https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
> [2]: https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a
> 
> Co-Authored-By: Luca Czesla <luca.czesla@mail.schwarz>
> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
> Co-Authored-By: Max Lamprecht <max.lamprecht@mail.schwarz>
> Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
> Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
> ---
> v5->v6: none
> v4->v5: none
> v3->v4:
> - combine the flush logic with windows implementation
> v2->v3:
> - update description to include upstream fix (Thanks to Ilya for finding
>   that issue)
> v1->v2:
> - fixed wrong signed-off-by
> 
>  lib/netlink-conntrack.c | 52 ++++++++++++++++++++++++++++++++++++-----
>  tests/system-traffic.at |  8 +++++++
>  2 files changed, 54 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
> index 492bfcffb..d9015d9f0 100644
> --- a/lib/netlink-conntrack.c
> +++ b/lib/netlink-conntrack.c
> @@ -141,6 +141,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
>  
>      nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
>                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
> +    if (zone) {
> +        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
> +    }
>      nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
>      ofpbuf_clear(&state->buf);
>  
> @@ -263,11 +266,9 @@ out:
>      return err;
>  }
>  
> -#ifdef _WIN32
> -int
> -nl_ct_flush_zone(uint16_t flush_zone)
> +static int
> +nl_ct_flush_zone_with_cta_zone(uint16_t flush_zone)
>  {
> -    /* Windows can flush a specific zone */
>      struct ofpbuf buf;
>      int err;
>  
> @@ -282,24 +283,63 @@ nl_ct_flush_zone(uint16_t flush_zone)
>  
>      return err;
>  }
> +
> +#ifdef _WIN32
> +int
> +nl_ct_flush_zone(uint16_t flush_zone)
> +{
> +    return nl_ct_flush_zone_with_cta_zone(flush_zone);
> +}
>  #else
> +
> +static bool
> +netlink_flush_supports_zone(void)
> +{
> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> +    static bool supported = false;
> +
> +    if (ovsthread_once_start(&once)) {
> +        if (ovs_kernel_is_version_or_newer(6, 8)) {
> +            supported = true;
> +        } else {
> +            VLOG_INFO("disabling conntrack flush by zone. "
> +                      "Not supported in Linux kernel");
> +        }
> +        ovsthread_once_done(&once);
> +    }
> +    return supported;
> +}
> +
>  int
>  nl_ct_flush_zone(uint16_t flush_zone)
>  {
> -    /* Apparently, there's no netlink interface to flush a specific zone.
> +    /* In older kernels, there was no netlink interface to flush a specific
> +     * conntrack zone.
>       * This code dumps every connection, checks the zone and eventually
>       * delete the entry.
> +     * In newer kernels there is the option to specify a zone for filtering
> +     * during dumps. Older kernels ignore this option. We set it here in the

Nit: Double spaces between sentences.

> +     * hope we only get relevant entries back, but fall back to filtering here
> +     * to keep compatibility.
>       *
> -     * This is race-prone, but it is better than using shell scripts. */
> +     * This is race-prone, but it is better than using shell scripts.
> +     *
> +     * Additionally newer kernels also support flushing a zone without listing
> +     * it first. */
>  
>      struct nl_dump dump;
>      struct ofpbuf buf, reply, delete;
>  
> +    if (netlink_flush_supports_zone()) {
> +        return nl_ct_flush_zone_with_cta_zone(flush_zone);
> +    }
> +
>      ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
>      ofpbuf_init(&delete, NL_DUMP_BUFSIZE);
>  
>      nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
>                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
> +    nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
>      nl_dump_start(&dump, NETLINK_NETFILTER, &buf);
>      ofpbuf_clear(&buf);
>  
> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> index 07d09b912..d87d6914d 100644
> --- a/tests/system-traffic.at
> +++ b/tests/system-traffic.at
> @@ -3376,6 +3376,14 @@ AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.4)], [0], [dnl
>  tcp,orig=(src=10.1.1.3,dst=10.1.1.4,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.4,dst=10.1.1.3,sport=<cleared>,dport=<cleared>),zone=2,protoinfo=(state=<cleared>)
>  ])
>  
> +dnl flushing one zone should leave the others intact

Nit: Capital letter and a period at the end.

> +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=2])
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=1 | FORMAT_CT(10.1.1.2)], [0], [dnl
> +tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.2,dst=10.1.1.1,sport=<cleared>,dport=<cleared>),zone=1,protoinfo=(state=<cleared>)
> +])
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=2 | FORMAT_CT(10.1.1.4)], [0], [dnl
> +])

Also this test seems out of place here.  I think, it should be part
of the 'conntrack - ct flush' test and use FLUSH_CMD instead.  We
should not mix the general zone testing with the flush testing.

We may add another port and an OpenFlow rule in order to have more
zones in 'conntrack - ct flush' test if needed.

Best regards, Ilya Maximets.
Felix Huettner March 11, 2024, 12:49 p.m. UTC | #3
> > diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> > index 07d09b912..d87d6914d 100644
> > --- a/tests/system-traffic.at
> > +++ b/tests/system-traffic.at
> > @@ -3376,6 +3376,14 @@ AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.4)], [0], [dnl
> >  tcp,orig=(src=10.1.1.3,dst=10.1.1.4,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.4,dst=10.1.1.3,sport=<cleared>,dport=<cleared>),zone=2,protoinfo=(state=<cleared>)
> >  ])
> >  
> > +dnl flushing one zone should leave the others intact
> 
> Nit: Capital letter and a period at the end.
> 
> > +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=2])
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=1 | FORMAT_CT(10.1.1.2)], [0], [dnl
> > +tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.2,dst=10.1.1.1,sport=<cleared>,dport=<cleared>),zone=1,protoinfo=(state=<cleared>)
> > +])
> > +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=2 | FORMAT_CT(10.1.1.4)], [0], [dnl
> > +])
> 
> Also this test seems out of place here.  I think, it should be part
> of the 'conntrack - ct flush' test and use FLUSH_CMD instead.  We
> should not mix the general zone testing with the flush testing.
> 
> We may add another port and an OpenFlow rule in order to have more
> zones in 'conntrack - ct flush' test if needed.
> 
> Best regards, Ilya Maximets.

Thanks i'll move it there and post a v7
diff mbox series

Patch

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..d9015d9f0 100644
--- a/lib/netlink-conntrack.c
+++ b/lib/netlink-conntrack.c
@@ -141,6 +141,9 @@  nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
 
     nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
                         IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
+    if (zone) {
+        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
+    }
     nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
     ofpbuf_clear(&state->buf);
 
@@ -263,11 +266,9 @@  out:
     return err;
 }
 
-#ifdef _WIN32
-int
-nl_ct_flush_zone(uint16_t flush_zone)
+static int
+nl_ct_flush_zone_with_cta_zone(uint16_t flush_zone)
 {
-    /* Windows can flush a specific zone */
     struct ofpbuf buf;
     int err;
 
@@ -282,24 +283,63 @@  nl_ct_flush_zone(uint16_t flush_zone)
 
     return err;
 }
+
+#ifdef _WIN32
+int
+nl_ct_flush_zone(uint16_t flush_zone)
+{
+    return nl_ct_flush_zone_with_cta_zone(flush_zone);
+}
 #else
+
+static bool
+netlink_flush_supports_zone(void)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    static bool supported = false;
+
+    if (ovsthread_once_start(&once)) {
+        if (ovs_kernel_is_version_or_newer(6, 8)) {
+            supported = true;
+        } else {
+            VLOG_INFO("disabling conntrack flush by zone. "
+                      "Not supported in Linux kernel");
+        }
+        ovsthread_once_done(&once);
+    }
+    return supported;
+}
+
 int
 nl_ct_flush_zone(uint16_t flush_zone)
 {
-    /* Apparently, there's no netlink interface to flush a specific zone.
+    /* In older kernels, there was no netlink interface to flush a specific
+     * conntrack zone.
      * This code dumps every connection, checks the zone and eventually
      * delete the entry.
+     * In newer kernels there is the option to specify a zone for filtering
+     * during dumps. Older kernels ignore this option. We set it here in the
+     * hope we only get relevant entries back, but fall back to filtering here
+     * to keep compatibility.
      *
-     * This is race-prone, but it is better than using shell scripts. */
+     * This is race-prone, but it is better than using shell scripts.
+     *
+     * Additionally newer kernels also support flushing a zone without listing
+     * it first. */
 
     struct nl_dump dump;
     struct ofpbuf buf, reply, delete;
 
+    if (netlink_flush_supports_zone()) {
+        return nl_ct_flush_zone_with_cta_zone(flush_zone);
+    }
+
     ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
     ofpbuf_init(&delete, NL_DUMP_BUFSIZE);
 
     nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
                         IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
+    nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
     nl_dump_start(&dump, NETLINK_NETFILTER, &buf);
     ofpbuf_clear(&buf);
 
diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index 07d09b912..d87d6914d 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -3376,6 +3376,14 @@  AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.4)], [0], [dnl
 tcp,orig=(src=10.1.1.3,dst=10.1.1.4,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.4,dst=10.1.1.3,sport=<cleared>,dport=<cleared>),zone=2,protoinfo=(state=<cleared>)
 ])
 
+dnl flushing one zone should leave the others intact
+AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=2])
+AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=1 | FORMAT_CT(10.1.1.2)], [0], [dnl
+tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.2,dst=10.1.1.1,sport=<cleared>,dport=<cleared>),zone=1,protoinfo=(state=<cleared>)
+])
+AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=2 | FORMAT_CT(10.1.1.4)], [0], [dnl
+])
+
 OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP