diff mbox series

[ovs-dev,v3] netlink-conntrack: Optimize flushing ct zone.

Message ID Zcn1FwvYi3TfLsqJ@SIT-SDELAP4051.int.lidl.net
State Changes Requested
Headers show
Series [ovs-dev,v3] netlink-conntrack: Optimize flushing ct zone. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Felix Huettner Feb. 12, 2024, 10:38 a.m. UTC
Previously the kernel did not provide a netlink interface to flush/list
only conntrack entries matching a specific zone. With [1] and [2] it is now
possible to flush and list conntrack entries filtered by zone. Older
kernels not yet supporting this feature will ignore the filter.
For the list request that means just returning all entries (which we can
then filter in userspace as before).
For the flush request that means deleting all conntrack entries.

These significantly improves the performance of flushing conntrack zones
when the conntrack table is large. Since flushing a conntrack zone is
normally triggered via an openflow command it blocks the main ovs thread
and thereby also blocks new flows from being applied. Using this new
feature we can reduce the flushing time for zones by around 93%.

In combination with OVN the creation of a Logical_Router (which causes
the flushing of a ct zone) could block other operations, e.g. the
failover of Logical_Routers (as they cause new flows to be created).
This is visible from a user perspective as a ovn-controller that is idle
(as it waits for vswitchd) and vswitchd reporting:
"blocked 1000 ms waiting for main to quiesce" (potentially with ever
increasing times).

The following performance tests where run in a qemu vm with 500.000
conntrack entries distributed evenly over 500 ct zones using `ovstest
test-netlink-conntrack flush zone=<zoneid>`.

With this patch and kernel v6.8-rc4:

-----------------------------------------------------------------------------------------------------------------------------------------------------
                               Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
-----------------------------------------------------------------------------------------------------------------------------------------------------
flush zone with 1000 entries   0.260           0.319           0.335           0.348           0.362           0.320           80.02           250
flush zone with no entry       0.228           0.298           0.325           0.340           0.348           0.296           73.93           250
-----------------------------------------------------------------------------------------------------------------------------------------------------

With this patch and kernel v6.7.1:

-----------------------------------------------------------------------------------------------------------------------------------------------------
                               Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
-----------------------------------------------------------------------------------------------------------------------------------------------------
flush zone with 1000 entries   3.946           4.237           4.367           4.495           4.543           4.236           1058.992        250
flush zone with no entry       3.462           4.460           4.662           4.931           5.390           4.430           1107.479        250
-----------------------------------------------------------------------------------------------------------------------------------------------------

Without this patch and kernel v6.8-rc4:

-----------------------------------------------------------------------------------------------------------------------------------------------------
                               Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
-----------------------------------------------------------------------------------------------------------------------------------------------------
flush zone with 1000 entries   3.497           4.349           4.522           4.773           5.054           4.331           1082.802        250
flush zone with no entry       3.212           4.010           4.572           6.003           6.396           4.071           1017.838        250
-----------------------------------------------------------------------------------------------------------------------------------------------------

[1]: https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
[2]: https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a

Co-Authored-By: Luca Czesla <luca.czesla@mail.schwarz>
Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
Co-Authored-By: Max Lamprecht <max.lamprecht@mail.schwarz>
Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
---

v2->v3:
- update description to include upstream fix (Thanks to Ilya for finding
  that issue)
v1->v2:
- fixed wrong signed-off-by

 lib/netlink-conntrack.c | 57 +++++++++++++++++++++++++++++++++++++++--
 tests/system-traffic.at |  8 ++++++
 2 files changed, 63 insertions(+), 2 deletions(-)


base-commit: b3fc822208e89c6ba04e1fc972da0bf4426a5847

Comments

Aaron Conole Feb. 15, 2024, 2:43 p.m. UTC | #1
Felix Huettner via dev <ovs-dev@openvswitch.org> writes:

> Previously the kernel did not provide a netlink interface to flush/list
> only conntrack entries matching a specific zone. With [1] and [2] it is now
> possible to flush and list conntrack entries filtered by zone. Older
> kernels not yet supporting this feature will ignore the filter.
> For the list request that means just returning all entries (which we can
> then filter in userspace as before).
> For the flush request that means deleting all conntrack entries.
>
> These significantly improves the performance of flushing conntrack zones
> when the conntrack table is large. Since flushing a conntrack zone is
> normally triggered via an openflow command it blocks the main ovs thread
> and thereby also blocks new flows from being applied. Using this new
> feature we can reduce the flushing time for zones by around 93%.
>
> In combination with OVN the creation of a Logical_Router (which causes
> the flushing of a ct zone) could block other operations, e.g. the
> failover of Logical_Routers (as they cause new flows to be created).
> This is visible from a user perspective as a ovn-controller that is idle
> (as it waits for vswitchd) and vswitchd reporting:
> "blocked 1000 ms waiting for main to quiesce" (potentially with ever
> increasing times).
>
> The following performance tests where run in a qemu vm with 500.000
> conntrack entries distributed evenly over 500 ct zones using `ovstest
> test-netlink-conntrack flush zone=<zoneid>`.
>
> With this patch and kernel v6.8-rc4:
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>                                Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> flush zone with 1000 entries   0.260           0.319           0.335           0.348           0.362           0.320           80.02           250
> flush zone with no entry       0.228           0.298           0.325           0.340           0.348           0.296           73.93           250
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
> With this patch and kernel v6.7.1:
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>                                Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> flush zone with 1000 entries   3.946           4.237           4.367           4.495           4.543           4.236           1058.992        250
> flush zone with no entry       3.462           4.460           4.662           4.931           5.390           4.430           1107.479        250
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Without this patch and kernel v6.8-rc4:
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>                                Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> flush zone with 1000 entries   3.497           4.349           4.522           4.773           5.054           4.331           1082.802        250
> flush zone with no entry       3.212           4.010           4.572           6.003           6.396           4.071           1017.838        250
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
> [1]: https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
> [2]: https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a
>
> Co-Authored-By: Luca Czesla <luca.czesla@mail.schwarz>
> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
> Co-Authored-By: Max Lamprecht <max.lamprecht@mail.schwarz>
> Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
> Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
> ---

Thanks so much - I haven't gotten a chance to test on a new kernel, but
it's on my agenda.

> v2->v3:
> - update description to include upstream fix (Thanks to Ilya for finding
>   that issue)
> v1->v2:
> - fixed wrong signed-off-by
>
>  lib/netlink-conntrack.c | 57 +++++++++++++++++++++++++++++++++++++++--
>  tests/system-traffic.at |  8 ++++++
>  2 files changed, 63 insertions(+), 2 deletions(-)
>
> diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
> index 492bfcffb..1b050894d 100644
> --- a/lib/netlink-conntrack.c
> +++ b/lib/netlink-conntrack.c
> @@ -25,6 +25,7 @@
>  #include <linux/netfilter/nf_conntrack_tcp.h>
>  #include <linux/netfilter/nf_conntrack_ftp.h>
>  #include <linux/netfilter/nf_conntrack_sctp.h>
> +#include <sys/utsname.h>
>  
>  #include "byte-order.h"
>  #include "compiler.h"
> @@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
>  
>      nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
>                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
> +    if (zone) {
> +        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
> +    }
>      nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
>      ofpbuf_clear(&state->buf);
>  
> @@ -283,23 +287,72 @@ nl_ct_flush_zone(uint16_t flush_zone)
>      return err;
>  }
>  #else
> +
> +static bool
> +netlink_flush_supports_zone(void)
> +{
> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> +    static bool supported = false;
> +
> +    if (ovsthread_once_start(&once)) {
> +        struct utsname utsname;
> +        int major, minor;
> +
> +        if (uname(&utsname) == -1) {
> +            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
> +        } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) {
> +            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);

Do these WARN calls need to be severe like this?  There isn't much for
the user to do about this, and it won't impact the functioning of the
system in such a major way.  Maybe they can be INFO instead?  Otherwise,
we may need to change most of the selftests to ignore the "uname failed"
warnings.

More of a nit, because it shouldn't generally fail on any systems.

> +        } else if (major < 6 || (major == 6 && minor < 8)) {
> +            VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s",
> +                      utsname.release);
> +        } else {
> +            supported = true;
> +        }
> +        ovsthread_once_done(&once);
> +    }
> +    return supported;
> +}
> +
>  int
>  nl_ct_flush_zone(uint16_t flush_zone)
>  {
> -    /* Apparently, there's no netlink interface to flush a specific zone.
> +    /* In older kernels, there was no netlink interface to flush a specific
> +     * conntrack zone.
>       * This code dumps every connection, checks the zone and eventually
>       * delete the entry.
> +     * In newer kernels there is the option to specifiy a zone for filtering
> +     * during dumps. Older kernels ignore this option. We set it here in the
> +     * hope we only get relevant entries back, but fall back to filtering here
> +     * to keep compatibility.
>       *
> -     * This is race-prone, but it is better than using shell scripts. */
> +     * This is race-prone, but it is better than using shell scripts.
> +     *
> +     * Additionally newer kernels also support flushing a zone without listing
> +     * it first. */
>  
>      struct nl_dump dump;
>      struct ofpbuf buf, reply, delete;
> +    int err;
> +
> +    if (netlink_flush_supports_zone()) {
> +        ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
> +
> +        nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
> +                            IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST);
> +        nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
> +
> +        err = nl_transact(NETLINK_NETFILTER, &buf, NULL);
> +        ofpbuf_uninit(&buf);
> +
> +        return err;
> +    }
>  
>      ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
>      ofpbuf_init(&delete, NL_DUMP_BUFSIZE);
>  
>      nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
>                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
> +    nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
>      nl_dump_start(&dump, NETLINK_NETFILTER, &buf);
>      ofpbuf_clear(&buf);
>  
> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> index 4fd5dbe59..14010dc41 100644
> --- a/tests/system-traffic.at
> +++ b/tests/system-traffic.at
> @@ -3297,6 +3297,14 @@ AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.4)], [0], [dnl
>  tcp,orig=(src=10.1.1.3,dst=10.1.1.4,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.4,dst=10.1.1.3,sport=<cleared>,dport=<cleared>),zone=2,protoinfo=(state=<cleared>)
>  ])
>  
> +dnl flushing one zone should leave the others intact
> +AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=2])
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=1 | FORMAT_CT(10.1.1.2)], [0], [dnl
> +tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.2,dst=10.1.1.1,sport=<cleared>,dport=<cleared>),zone=1,protoinfo=(state=<cleared>)
> +])
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=2 | FORMAT_CT(10.1.1.4)], [0], [dnl
> +])
> +
>  OVS_TRAFFIC_VSWITCHD_STOP
>  AT_CLEANUP
>  
>
> base-commit: b3fc822208e89c6ba04e1fc972da0bf4426a5847
Felix Huettner Feb. 15, 2024, 3:04 p.m. UTC | #2
> > diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
> > index 492bfcffb..1b050894d 100644
> > --- a/lib/netlink-conntrack.c
> > +++ b/lib/netlink-conntrack.c
> > @@ -25,6 +25,7 @@
> >  #include <linux/netfilter/nf_conntrack_tcp.h>
> >  #include <linux/netfilter/nf_conntrack_ftp.h>
> >  #include <linux/netfilter/nf_conntrack_sctp.h>
> > +#include <sys/utsname.h>
> >  
> >  #include "byte-order.h"
> >  #include "compiler.h"
> > @@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
> >  
> >      nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
> >                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
> > +    if (zone) {
> > +        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
> > +    }
> >      nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
> >      ofpbuf_clear(&state->buf);
> >  
> > @@ -283,23 +287,72 @@ nl_ct_flush_zone(uint16_t flush_zone)
> >      return err;
> >  }
> >  #else
> > +
> > +static bool
> > +netlink_flush_supports_zone(void)
> > +{
> > +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> > +    static bool supported = false;
> > +
> > +    if (ovsthread_once_start(&once)) {
> > +        struct utsname utsname;
> > +        int major, minor;
> > +
> > +        if (uname(&utsname) == -1) {
> > +            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
> > +        } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) {
> > +            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
> 
> Do these WARN calls need to be severe like this?  There isn't much for
> the user to do about this, and it won't impact the functioning of the
> system in such a major way.  Maybe they can be INFO instead?  Otherwise,
> we may need to change most of the selftests to ignore the "uname failed"
> warnings.
> 
> More of a nit, because it shouldn't generally fail on any systems.

That was also my thought. I actually copied most of the logic from
`getqdisc_is_safe` in `netdev-linux.c`.

I am completely fine with changing it to something else if that makes
everyones lives easier, since hitting it is so unrealistic.

Let me know what you would prefer.

Thanks
Felix

> 
> > +        } else if (major < 6 || (major == 6 && minor < 8)) {
> > +            VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s",
> > +                      utsname.release);
> > +        } else {
> > +            supported = true;
> > +        }
> > +        ovsthread_once_done(&once);
> > +    }
> > +    return supported;
> > +}
> > +
> >  int
> >  nl_ct_flush_zone(uint16_t flush_zone)
> >  {
> > -    /* Apparently, there's no netlink interface to flush a specific zone.
> > +    /* In older kernels, there was no netlink interface to flush a specific
> > +     * conntrack zone.
> >       * This code dumps every connection, checks the zone and eventually
> >       * delete the entry.
> > +     * In newer kernels there is the option to specifiy a zone for filtering
> > +     * during dumps. Older kernels ignore this option. We set it here in the
> > +     * hope we only get relevant entries back, but fall back to filtering here
> > +     * to keep compatibility.
> >       *
> > -     * This is race-prone, but it is better than using shell scripts. */
> > +     * This is race-prone, but it is better than using shell scripts.
> > +     *
> > +     * Additionally newer kernels also support flushing a zone without listing
> > +     * it first. */
> >  
> >      struct nl_dump dump;
> >      struct ofpbuf buf, reply, delete;
> > +    int err;
> > +
> > +    if (netlink_flush_supports_zone()) {
> > +        ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
> > +
> > +        nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
> > +                            IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST);
> > +        nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
> > +
> > +        err = nl_transact(NETLINK_NETFILTER, &buf, NULL);
> > +        ofpbuf_uninit(&buf);
> > +
> > +        return err;
> > +    }
> >  
> >      ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
> >      ofpbuf_init(&delete, NL_DUMP_BUFSIZE);
> >  
> >      nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
> >                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
> > +    nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
> >      nl_dump_start(&dump, NETLINK_NETFILTER, &buf);
> >      ofpbuf_clear(&buf);
> >
Aaron Conole Feb. 19, 2024, 6:24 p.m. UTC | #3
Felix Huettner <felix.huettner@mail.schwarz> writes:

>> > diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
>> > index 492bfcffb..1b050894d 100644
>> > --- a/lib/netlink-conntrack.c
>> > +++ b/lib/netlink-conntrack.c
>> > @@ -25,6 +25,7 @@
>> >  #include <linux/netfilter/nf_conntrack_tcp.h>
>> >  #include <linux/netfilter/nf_conntrack_ftp.h>
>> >  #include <linux/netfilter/nf_conntrack_sctp.h>
>> > +#include <sys/utsname.h>
>> >  
>> >  #include "byte-order.h"
>> >  #include "compiler.h"
>> > @@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
>> >  
>> >      nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
>> >                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
>> > +    if (zone) {
>> > +        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
>> > +    }
>> >      nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
>> >      ofpbuf_clear(&state->buf);
>> >  
>> > @@ -283,23 +287,72 @@ nl_ct_flush_zone(uint16_t flush_zone)
>> >      return err;
>> >  }
>> >  #else
>> > +
>> > +static bool
>> > +netlink_flush_supports_zone(void)
>> > +{
>> > +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> > +    static bool supported = false;
>> > +
>> > +    if (ovsthread_once_start(&once)) {
>> > +        struct utsname utsname;
>> > +        int major, minor;
>> > +
>> > +        if (uname(&utsname) == -1) {
>> > +            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
>> > +        } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) {
>> > +            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
>> 
>> Do these WARN calls need to be severe like this?  There isn't much for
>> the user to do about this, and it won't impact the functioning of the
>> system in such a major way.  Maybe they can be INFO instead?  Otherwise,
>> we may need to change most of the selftests to ignore the "uname failed"
>> warnings.
>> 
>> More of a nit, because it shouldn't generally fail on any systems.
>
> That was also my thought. I actually copied most of the logic from
> `getqdisc_is_safe` in `netdev-linux.c`.
>
> I am completely fine with changing it to something else if that makes
> everyones lives easier, since hitting it is so unrealistic.

I think it makes sense to just have a single function that the tc and
conntrack layers can call for this functionality, while we can just make
the logs at INFO level - then no issues with strange test environments.
And it doesn't prevent anything from running, or result in 'degraded'
performance - just prevents an optimization.

WDYT?

> Let me know what you would prefer.
>
> Thanks
> Felix
>
>> 
>> > +        } else if (major < 6 || (major == 6 && minor < 8)) {
>> > +            VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s",
>> > +                      utsname.release);
>> > +        } else {
>> > +            supported = true;
>> > +        }
>> > +        ovsthread_once_done(&once);
>> > +    }
>> > +    return supported;
>> > +}
>> > +
>> >  int
>> >  nl_ct_flush_zone(uint16_t flush_zone)
>> >  {
>> > -    /* Apparently, there's no netlink interface to flush a specific zone.
>> > +    /* In older kernels, there was no netlink interface to flush a specific
>> > +     * conntrack zone.
>> >       * This code dumps every connection, checks the zone and eventually
>> >       * delete the entry.
>> > +     * In newer kernels there is the option to specifiy a zone for filtering
>> > +     * during dumps. Older kernels ignore this option. We set it here in the
>> > +     * hope we only get relevant entries back, but fall back to filtering here
>> > +     * to keep compatibility.
>> >       *
>> > -     * This is race-prone, but it is better than using shell scripts. */
>> > +     * This is race-prone, but it is better than using shell scripts.
>> > +     *
>> > +     * Additionally newer kernels also support flushing a zone without listing
>> > +     * it first. */
>> >  
>> >      struct nl_dump dump;
>> >      struct ofpbuf buf, reply, delete;
>> > +    int err;
>> > +
>> > +    if (netlink_flush_supports_zone()) {
>> > +        ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
>> > +
>> > +        nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
>> > +                            IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST);
>> > +        nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
>> > +
>> > +        err = nl_transact(NETLINK_NETFILTER, &buf, NULL);
>> > +        ofpbuf_uninit(&buf);
>> > +
>> > +        return err;
>> > +    }
>> >  
>> >      ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
>> >      ofpbuf_init(&delete, NL_DUMP_BUFSIZE);
>> >  
>> >      nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
>> >                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
>> > +    nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
>> >      nl_dump_start(&dump, NETLINK_NETFILTER, &buf);
>> >      ofpbuf_clear(&buf);
>> >
Ilya Maximets Feb. 19, 2024, 7:10 p.m. UTC | #4
On 2/12/24 11:38, Felix Huettner via dev wrote:
> Previously the kernel did not provide a netlink interface to flush/list
> only conntrack entries matching a specific zone. With [1] and [2] it is now
> possible to flush and list conntrack entries filtered by zone. Older
> kernels not yet supporting this feature will ignore the filter.
> For the list request that means just returning all entries (which we can
> then filter in userspace as before).
> For the flush request that means deleting all conntrack entries.
> 
> These significantly improves the performance of flushing conntrack zones
> when the conntrack table is large. Since flushing a conntrack zone is
> normally triggered via an openflow command it blocks the main ovs thread
> and thereby also blocks new flows from being applied. Using this new
> feature we can reduce the flushing time for zones by around 93%.
> 
> In combination with OVN the creation of a Logical_Router (which causes
> the flushing of a ct zone) could block other operations, e.g. the
> failover of Logical_Routers (as they cause new flows to be created).
> This is visible from a user perspective as a ovn-controller that is idle
> (as it waits for vswitchd) and vswitchd reporting:
> "blocked 1000 ms waiting for main to quiesce" (potentially with ever
> increasing times).
> 
> The following performance tests where run in a qemu vm with 500.000
> conntrack entries distributed evenly over 500 ct zones using `ovstest
> test-netlink-conntrack flush zone=<zoneid>`.
> 
> With this patch and kernel v6.8-rc4:
> 
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>                                Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> flush zone with 1000 entries   0.260           0.319           0.335           0.348           0.362           0.320           80.02           250
> flush zone with no entry       0.228           0.298           0.325           0.340           0.348           0.296           73.93           250
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> 
> With this patch and kernel v6.7.1:
> 
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>                                Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> flush zone with 1000 entries   3.946           4.237           4.367           4.495           4.543           4.236           1058.992        250
> flush zone with no entry       3.462           4.460           4.662           4.931           5.390           4.430           1107.479        250
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> 
> Without this patch and kernel v6.8-rc4:
> 
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>                                Min (s)         Median (s)      90%ile (s)      99%ile (s)      Max (s)         Mean (s)        Total (s)       Count
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> flush zone with 1000 entries   3.497           4.349           4.522           4.773           5.054           4.331           1082.802        250
> flush zone with no entry       3.212           4.010           4.572           6.003           6.396           4.071           1017.838        250
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> 

Maybe transpose these tables?  They will not look good in a git log.
Frankly, they barely fit into my email client window. :)

Potentially, collect all 3 tables into one.  A quick mock up for
example:

        |  flush zone with 1000 entries  |    flush zone with no entry 
        +---------------------+----------+---------------------+----------
        |    with the patch   | without  |    with the patch   | without
        +----------+----------+----------+---------------------+----------
        | v6.8-rc4 |  v6.7.1  | v6.8-rc4 | v6.8-rc4 |  v6.7.1  | v6.8-rc4
 -------+----------+----------+----------+----------+----------+----------
 Min    |          |          |          |
 Median |          |          |          |         ....
 90%ile |          |          |          |
 ...    |          |          |          |         ....


Can be prettier of course.  Ideally, would be nice if it doesn't exceed
72-80 in width, but can be a little larger.

Also, some tools may be confused by '---...' lines, so it's better to
start them not from the beginning of a line.

> [1]: https://github.com/torvalds/linux/commit/eff3c558bb7e61c41b53e4c8130e514a5a4df9ba
> [2]: https://github.com/torvalds/linux/commit/fa173a1b4e3fd1ab5451cbc57de6fc624c824b0a
> 
> Co-Authored-By: Luca Czesla <luca.czesla@mail.schwarz>
> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
> Co-Authored-By: Max Lamprecht <max.lamprecht@mail.schwarz>
> Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
> Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
> ---
> 
> v2->v3:
> - update description to include upstream fix (Thanks to Ilya for finding
>   that issue)
> v1->v2:
> - fixed wrong signed-off-by
> 
>  lib/netlink-conntrack.c | 57 +++++++++++++++++++++++++++++++++++++++--
>  tests/system-traffic.at |  8 ++++++
>  2 files changed, 63 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
> index 492bfcffb..1b050894d 100644
> --- a/lib/netlink-conntrack.c
> +++ b/lib/netlink-conntrack.c
> @@ -25,6 +25,7 @@
>  #include <linux/netfilter/nf_conntrack_tcp.h>
>  #include <linux/netfilter/nf_conntrack_ftp.h>
>  #include <linux/netfilter/nf_conntrack_sctp.h>
> +#include <sys/utsname.h>
>  
>  #include "byte-order.h"
>  #include "compiler.h"
> @@ -141,6 +142,9 @@ nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
>  
>      nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
>                          IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
> +    if (zone) {
> +        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
> +    }
>      nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
>      ofpbuf_clear(&state->buf);
>  
> @@ -283,23 +287,72 @@ nl_ct_flush_zone(uint16_t flush_zone)
>      return err;
>  }
>  #else
> +
> +static bool
> +netlink_flush_supports_zone(void)
> +{
> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> +    static bool supported = false;
> +
> +    if (ovsthread_once_start(&once)) {
> +        struct utsname utsname;
> +        int major, minor;
> +
> +        if (uname(&utsname) == -1) {
> +            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
> +        } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) {
> +            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
> +        } else if (major < 6 || (major == 6 && minor < 8)) {
> +            VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s",
> +                      utsname.release);
> +        } else {
> +            supported = true;
> +        }
> +        ovsthread_once_done(&once);
> +    }
> +    return supported;
> +}
> +
>  int
>  nl_ct_flush_zone(uint16_t flush_zone)
>  {
> -    /* Apparently, there's no netlink interface to flush a specific zone.
> +    /* In older kernels, there was no netlink interface to flush a specific
> +     * conntrack zone.
>       * This code dumps every connection, checks the zone and eventually
>       * delete the entry.
> +     * In newer kernels there is the option to specifiy a zone for filtering

typo: specify

> +     * during dumps. Older kernels ignore this option. We set it here in the
> +     * hope we only get relevant entries back, but fall back to filtering here
> +     * to keep compatibility.
>       *
> -     * This is race-prone, but it is better than using shell scripts. */
> +     * This is race-prone, but it is better than using shell scripts.
> +     *
> +     * Additionally newer kernels also support flushing a zone without listing
> +     * it first. */
>  
>      struct nl_dump dump;
>      struct ofpbuf buf, reply, delete;
> +    int err;
> +
> +    if (netlink_flush_supports_zone()) {
> +        ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
> +
> +        nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
> +                            IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST);
> +        nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
> +
> +        err = nl_transact(NETLINK_NETFILTER, &buf, NULL);
> +        ofpbuf_uninit(&buf);
> +
> +        return err;

Not a full review.  But this part seem to be exactly the same as
the WIN32 implementation of the flush.  We should try to re-use
the code, I think, i.e. move the implementation from WIN32 function
into a common helper and call it from two places.

Best regards, Ilya Maximets.
diff mbox series

Patch

diff --git a/lib/netlink-conntrack.c b/lib/netlink-conntrack.c
index 492bfcffb..1b050894d 100644
--- a/lib/netlink-conntrack.c
+++ b/lib/netlink-conntrack.c
@@ -25,6 +25,7 @@ 
 #include <linux/netfilter/nf_conntrack_tcp.h>
 #include <linux/netfilter/nf_conntrack_ftp.h>
 #include <linux/netfilter/nf_conntrack_sctp.h>
+#include <sys/utsname.h>
 
 #include "byte-order.h"
 #include "compiler.h"
@@ -141,6 +142,9 @@  nl_ct_dump_start(struct nl_ct_dump_state **statep, const uint16_t *zone,
 
     nl_msg_put_nfgenmsg(&state->buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
                         IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
+    if (zone) {
+        nl_msg_put_be16(&state->buf, CTA_ZONE, htons(*zone));
+    }
     nl_dump_start(&state->dump, NETLINK_NETFILTER, &state->buf);
     ofpbuf_clear(&state->buf);
 
@@ -283,23 +287,72 @@  nl_ct_flush_zone(uint16_t flush_zone)
     return err;
 }
 #else
+
+static bool
+netlink_flush_supports_zone(void)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    static bool supported = false;
+
+    if (ovsthread_once_start(&once)) {
+        struct utsname utsname;
+        int major, minor;
+
+        if (uname(&utsname) == -1) {
+            VLOG_WARN("uname failed (%s)", ovs_strerror(errno));
+        } else if (!ovs_scan(utsname.release, "%d.%d", &major, &minor)) {
+            VLOG_WARN("uname reported bad OS release (%s)", utsname.release);
+        } else if (major < 6 || (major == 6 && minor < 8)) {
+            VLOG_INFO("disabling conntrack flush by zone in Linux kernel %s",
+                      utsname.release);
+        } else {
+            supported = true;
+        }
+        ovsthread_once_done(&once);
+    }
+    return supported;
+}
+
 int
 nl_ct_flush_zone(uint16_t flush_zone)
 {
-    /* Apparently, there's no netlink interface to flush a specific zone.
+    /* In older kernels, there was no netlink interface to flush a specific
+     * conntrack zone.
      * This code dumps every connection, checks the zone and eventually
      * delete the entry.
+     * In newer kernels there is the option to specifiy a zone for filtering
+     * during dumps. Older kernels ignore this option. We set it here in the
+     * hope we only get relevant entries back, but fall back to filtering here
+     * to keep compatibility.
      *
-     * This is race-prone, but it is better than using shell scripts. */
+     * This is race-prone, but it is better than using shell scripts.
+     *
+     * Additionally newer kernels also support flushing a zone without listing
+     * it first. */
 
     struct nl_dump dump;
     struct ofpbuf buf, reply, delete;
+    int err;
+
+    if (netlink_flush_supports_zone()) {
+        ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
+
+        nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
+                            IPCTNL_MSG_CT_DELETE, NLM_F_REQUEST);
+        nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
+
+        err = nl_transact(NETLINK_NETFILTER, &buf, NULL);
+        ofpbuf_uninit(&buf);
+
+        return err;
+    }
 
     ofpbuf_init(&buf, NL_DUMP_BUFSIZE);
     ofpbuf_init(&delete, NL_DUMP_BUFSIZE);
 
     nl_msg_put_nfgenmsg(&buf, 0, AF_UNSPEC, NFNL_SUBSYS_CTNETLINK,
                         IPCTNL_MSG_CT_GET, NLM_F_REQUEST);
+    nl_msg_put_be16(&buf, CTA_ZONE, htons(flush_zone));
     nl_dump_start(&dump, NETLINK_NETFILTER, &buf);
     ofpbuf_clear(&buf);
 
diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index 4fd5dbe59..14010dc41 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -3297,6 +3297,14 @@  AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(10.1.1.4)], [0], [dnl
 tcp,orig=(src=10.1.1.3,dst=10.1.1.4,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.4,dst=10.1.1.3,sport=<cleared>,dport=<cleared>),zone=2,protoinfo=(state=<cleared>)
 ])
 
+dnl flushing one zone should leave the others intact
+AT_CHECK([ovs-appctl dpctl/flush-conntrack zone=2])
+AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=1 | FORMAT_CT(10.1.1.2)], [0], [dnl
+tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=<cleared>,dport=<cleared>),reply=(src=10.1.1.2,dst=10.1.1.1,sport=<cleared>,dport=<cleared>),zone=1,protoinfo=(state=<cleared>)
+])
+AT_CHECK([ovs-appctl dpctl/dump-conntrack zone=2 | FORMAT_CT(10.1.1.4)], [0], [dnl
+])
+
 OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP