
[ovs-dev,v2] Don't blindly save original dst IP and Port to avoid megaflow unwildcarding.

Message ID 20220824064051.2525214-1-hzhou@ovn.org
State Accepted
Series [ovs-dev,v2] Don't blindly save original dst IP and Port to avoid megaflow unwildcarding.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test fail github build: failed
ovsrobot/github-robot-_ovn-kubernetes success github build: passed

Commit Message

Han Zhou Aug. 24, 2022, 6:40 a.m. UTC
The ls_in_pre_stateful priority-120 flow that saves dst IP and Port to
registers is causing a critical dataplane performance impact for
short-lived connections, because it unwildcards megaflows with an exact
match on dst IP and L4 ports. Any new connection with a different
client-side L4 port will encounter a datapath flow miss and an upcall to
ovs-vswitchd, which makes typical use cases such as HTTP/1.0-based
RESTful API calls suffer big performance degradations.

These fields (dst IP and port) were saved to registers to solve a
problem in the LB hairpin use case when different VIPs share an
overlapping backend+port [0]. The change [0] might not have had as wide
a performance impact as it does now, because at that time one of the
match conditions, "REGBIT_CONNTRACK_NAT == 1", was set only for
established and natted traffic. The impact is more obvious now because,
since commit [1] and several other indirectly related optimizations and
refactors, REGBIT_CONNTRACK_NAT is set for all IP traffic (if any VIP is
configured on the LS).

This patch fixes the problem by modifying the priority-120 flows in
ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
traffic with REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4 port
only for traffic matching the LB VIPs, because these are the only ones
that need to be saved for hairpin handling. The existing priority-110
flows will match the rest of the traffic just like before but will not
save dst IP and L4 port, so any server->client traffic would not
unwildcard megaflows with client-side L4 ports.
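
Schematically, with <VIP> and <port> as placeholders and tcp as the example
protocol (udp/sctp are handled analogously, and ct_lb is used on hosts where
ct_lb_mark is not available), the ls_in_pre_stateful flows after this patch
look like:

  priority 120 (one per LB VIP, built by build_lb_rules_pre_stateful()):
    match : ip4.dst == <VIP> && tcp.dst == <port>
    action: REG_ORIG_DIP_IPV4 = <VIP>; REG_ORIG_TP_DPORT = <port>; ct_lb_mark;
  priority 110 (fallback, same match as before):
    match : REGBIT_CONNTRACK_NAT == 1
    action: ct_lb_mark;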

[0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
[1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")

Signed-off-by: Han Zhou <hzhou@ovn.org>
---
v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot to commit
          before sending v1.

 northd/northd.c         | 125 +++++++++++++++++++++++++---------------
 northd/ovn-northd.8.xml |  14 ++---
 tests/ovn-northd.at     |  87 ++++++++++------------------
 tests/ovn.at            |  14 ++---
 4 files changed, 118 insertions(+), 122 deletions(-)

Comments

Dumitru Ceara Aug. 30, 2022, 3:18 p.m. UTC | #1
Hi Han,

On 8/24/22 08:40, Han Zhou wrote:
> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
> registers is causing a critical dataplane performance impact to
> short-lived connections, because it unwildcards megaflows with exact
> match on dst IP and L4 ports. Any new connections with a different
> client side L4 port will encounter datapath flow miss and upcall to
> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> RESTful API calls suffer big performance degredations.
> 
> These fields (dst IP and port) were saved to registers to solve a
> problem of LB hairpin use case when different VIPs are sharing
> overlapping backend+port [0]. The change [0] might not have as wide
> performance impact as it is now because at that time one of the match
> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
> natted traffic, while now the impact is more obvious because
> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> configured on the LS) since commit [1], after several other indirectly
> related optimizations and refactors.
> 
> This patch fixes the problem by modifying the priority-120 flows in
> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
> traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
> port only for traffic matching the LB VIPs, because these are the ones
> that need to be saved for the hairpin purpose. The existed priority-110
> flows will match the rest of the traffic just like before but wouldn't
> not save dst IP and L4 port, so any server->client traffic would not
> unwildcard megaflows with client side L4 ports.
> 

Just to be 100% sure, the client->server traffic will still generate up
to V x H flows where V is "the number of VIP:port tuples" and H is "the
number of potential (dp_)hash values", right?

> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
> 
> Signed-off-by: Han Zhou <hzhou@ovn.org>
> ---
> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot to commit
>           before sending v1.
> 
>  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
>  northd/ovn-northd.8.xml |  14 ++---
>  tests/ovn-northd.at     |  87 ++++++++++------------------
>  tests/ovn.at            |  14 ++---
>  4 files changed, 118 insertions(+), 122 deletions(-)
> 
> diff --git a/northd/northd.c b/northd/northd.c
> index 7e2681865..860641936 100644
> --- a/northd/northd.c
> +++ b/northd/northd.c
> @@ -273,15 +273,15 @@ enum ovn_stage {
>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
>   * |    |     REGBIT_ACL_LABEL                         | X |                  |
>   * +----+----------------------------------------------+ X |                  |
> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |                  |
> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |                  |
>   * +----+----------------------------------------------+ E |                  |
> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |                  |
> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |                  |
>   * +----+----------------------------------------------+ 0 |                  |
>   * | R3 |                  ACL LABEL                   |   |                  |
>   * +----+----------------------------------------------+---+------------------+
>   * | R4 |                   UNUSED                     |   |                  |
> - * +----+----------------------------------------------+ X |   ORIG_DIP_IPV6  |
> - * | R5 |                   UNUSED                     | X | (>= IN_STATEFUL) |
> + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
> + * | R5 |                   UNUSED                     | X | IN_PRE_STATEFUL) |
>   * +----+----------------------------------------------+ R |                  |
>   * | R6 |                   UNUSED                     | E |                  |
>   * +----+----------------------------------------------+ G |                  |
> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
>  
> -    const char *ct_lb_action = features->ct_no_masked_label
> -                               ? "ct_lb_mark"
> -                               : "ct_lb";
> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
> -    struct ds actions = DS_EMPTY_INITIALIZER;
> -    struct ds match = DS_EMPTY_INITIALIZER;
> -
> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
> -        ds_clear(&match);
> -        ds_clear(&actions);
> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
> -                      lb_protocols[i]);
> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> -                      lb_protocols[i], ct_lb_action);
> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> -                      ds_cstr(&match), ds_cstr(&actions));
> -
> -        ds_clear(&match);
> -        ds_clear(&actions);
> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
> -                      lb_protocols[i]);
> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> -                      lb_protocols[i], ct_lb_action);
> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> -                      ds_cstr(&match), ds_cstr(&actions));
> -    }
> +    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
>  
> -    ds_clear(&actions);
> -    ds_put_format(&actions, "%s;", ct_lb_action);
> +    const char *ct_lb_action = features->ct_no_masked_label
> +                               ? "ct_lb_mark;"
> +                               : "ct_lb;";
>  
>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>  
>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>  
>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should be
>       * sent to conntrack for tracking and defragmentation. */
> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
>  
> -    ds_destroy(&actions);
> -    ds_destroy(&match);
>  }
>  
>  static void
> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap *lflows) {
>  }
>  
>  static void
> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
> -               struct ds *match, struct ds *action,
> -               const struct shash *meter_groups)
> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb *lb,
> +                            bool ct_lb_mark, struct ds *match,
> +                            struct ds *action)
>  {
>      for (size_t i = 0; i < lb->n_vips; i++) {
>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
> -        const char *ip_match = NULL;
> -
>          ds_clear(action);
>          ds_clear(match);
> +        const char *ip_match = NULL;
>  
> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag.  Otherwise
> -         * the load balanced packet will be committed again in
> -         * S_SWITCH_IN_STATEFUL. */
> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
>          /* Store the original destination IP to be used when generating
>           * hairpin flows.
>           */
> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
>                            lb_vip->vip_port);
>          }
> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" : "ct_lb");
> +
> +        ds_put_format(match, "%s.dst == %s", ip_match, lb_vip->vip_str);
> +        if (lb_vip->vip_port) {
> +            ds_put_format(match, " && %s.dst == %d", proto, lb_vip->vip_port);
> +        }
> +
> +        struct ovn_lflow *lflow_ref = NULL;
> +        uint32_t hash = ovn_logical_flow_hash(
> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
> +                ds_cstr(match), ds_cstr(action));
> +
> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
> +            struct ovn_datapath *od = lb->nb_ls[j];
> +
> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
> +                lflow_ref = ovn_lflow_add_at_with_hash(
> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> +                        ds_cstr(match), ds_cstr(action),
> +                        NULL, NULL, &lb->nlb->header_,
> +                        OVS_SOURCE_LOCATOR, hash);
> +            }
> +        }
> +    }

I see how this can fix the dataplane issue because we're limiting the
number of dp flows, but this now induces a measurable penalty on the
control plane side because we essentially double the number of logical
flows that ovn-northd generates for load balancer VIPs.

Using a NB database [2] from a 250-node density-heavy test [3] that we ran
in-house, I see the following:

a. before this patch (main):
- number of SB lflows:         146233
- SB size on disk (compacted): 70MB
- northd poll loop intervals:  8525ms

b. with this patch:
- number of SB lflows:         163303
- SB size on disk (compacted): 76MB
- northd poll loop intervals:  9958ms

[2]
https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
[3]
https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml

This raises some concerns.  However, I went ahead and also ran a test
with Ilya's load balancer optimization series [4] (up for review) and I got:

c. with this patch + Ilya's series:
- number of SB lflows:         163303  << Didn't change compared to "b"
- SB size on disk (compacted): 76MB    << Didn't change compared to "b"
- northd poll loop intervals:  6437ms  << 25% better than "a"

Then I ran a test with main + Ilya's series.

d. main + Ilya's series:
- number of SB lflows:         146233  << Didn't change compared to "a"
- SB size on disk (compacted): 70MB    << Didn't change compared to "a"
- northd poll loop intervals:  5413ms  << 35% better than "a"

[4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*

I guess what I'm trying to say is that if we go ahead with the approach you
proposed, and especially if we want to backport it to the LTS, we should
consider accepting/backporting other optimizations too (such as Ilya's)
that would compensate (and more) for the control plane performance hit.

I tried to think of an alternative that wouldn't require doubling the
number of VIP logical flows but couldn't come up with one.  Maybe you
have more ideas?

Thanks,
Dumitru
Ilya Maximets Aug. 30, 2022, 4:35 p.m. UTC | #2
On 8/24/22 08:40, Han Zhou wrote:
> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
> registers is causing a critical dataplane performance impact to
> short-lived connections, because it unwildcards megaflows with exact
> match on dst IP and L4 ports. Any new connections with a different
> client side L4 port will encounter datapath flow miss and upcall to
> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> RESTful API calls suffer big performance degredations.
> 
> These fields (dst IP and port) were saved to registers to solve a
> problem of LB hairpin use case when different VIPs are sharing
> overlapping backend+port [0]. The change [0] might not have as wide
> performance impact as it is now because at that time one of the match
> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
> natted traffic, while now the impact is more obvious because
> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> configured on the LS) since commit [1], after several other indirectly
> related optimizations and refactors.
> 
> This patch fixes the problem by modifying the priority-120 flows in
> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
> traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
> port only for traffic matching the LB VIPs, because these are the ones
> that need to be saved for the hairpin purpose. The existed priority-110
> flows will match the rest of the traffic just like before but wouldn't
> not save dst IP and L4 port, so any server->client traffic would not
> unwildcard megaflows with client side L4 ports.

Hmm, but if higher-priority flows have matches on these fields, datapath
flows will have them unwildcarded anyway.  So, why exactly is that better
than the current approach?

I see how that can help for the case where VIPs have no ports specified,
because we will not have ports unwildcarded in this case, but I thought
it's a very unlikely scenario for, e.g., ovn-kubernetes setups.  And if
even one VIP has a port, all the datapath flows will have a port
match.  Or am I missing something?

Best regards, Ilya Maximets.
Han Zhou Aug. 30, 2022, 11:32 p.m. UTC | #3
On Tue, Aug 30, 2022 at 9:35 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 8/24/22 08:40, Han Zhou wrote:
> > The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
> > registers is causing a critical dataplane performance impact to
> > short-lived connections, because it unwildcards megaflows with exact
> > match on dst IP and L4 ports. Any new connections with a different
> > client side L4 port will encounter datapath flow miss and upcall to
> > ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> > RESTful API calls suffer big performance degredations.
> >
> > These fields (dst IP and port) were saved to registers to solve a
> > problem of LB hairpin use case when different VIPs are sharing
> > overlapping backend+port [0]. The change [0] might not have as wide
> > performance impact as it is now because at that time one of the match
> > condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
> > natted traffic, while now the impact is more obvious because
> > REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> > configured on the LS) since commit [1], after several other indirectly
> > related optimizations and refactors.
> >
> > This patch fixes the problem by modifying the priority-120 flows in
> > ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
> > traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
> > port only for traffic matching the LB VIPs, because these are the ones
> > that need to be saved for the hairpin purpose. The existed priority-110
> > flows will match the rest of the traffic just like before but wouldn't
> > not save dst IP and L4 port, so any server->client traffic would not
> > unwildcard megaflows with client side L4 ports.
>
> Hmm, but if higher priority flows have matches on these fields, datapath
> flows will have them unwildcarded anyway.  So, why exactly that is better
> than the current approach?
>
Hi Ilya,

The problem with the current approach is that it blindly saves the L4 dst
port for any traffic in any direction, as long as there are VIPs configured
on the datapath.
So consider the most typical scenario of a client sending API requests to
server backends behind a VIP. On the server side, any *reply* packets would
hit the flow that saves the client-side L4 port, because in the
server->client direction the client port is the dst. If the client sends 10
requests, each with a different source port, the server side will end up
with unwildcarded DP flows like the ones below (192.168.1.2 is the client IP):
recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51224), packets:5, bytes:2475, used:1.118s, flags:FP., actions:ct(zone=8,nat),recirc(0x20)
recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51226), packets:5, bytes:2475, used:1.105s, flags:FP., actions:ct(zone=8,nat),recirc(0x21)
recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=37798), packets:5, bytes:2475, used:0.574s, flags:FP., actions:ct(zone=8,nat),recirc(0x40)
recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51250), packets:5, bytes:2475, used:0.872s, flags:FP., actions:ct(zone=8,nat),recirc(0x2d)
recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=46940), packets:5, bytes:2475, used:0.109s, flags:FP., actions:ct(zone=8,nat),recirc(0x60)
recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=46938), packets:5, bytes:2475, used:0.118s, flags:FP., actions:ct(zone=8,nat),recirc(0x5f)
recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51236), packets:5, bytes:2475, used:0.938s, flags:FP., actions:ct(zone=8,nat),recirc(0x26)
...

As a result, DP flows explode and every new request is going to be a miss
and an upcall to userspace, which is very inefficient. Even worse, since the
flow is so generic, even traffic unrelated to the VIP would have the same
impact, as long as a server on an LS with any VIP configured is replying to
client requests.
With the fix, only the client->VIP packets would hit such flows, and in
those cases the dst port is the server's (well-known) port, which is
expected to be matched in megaflows anyway, while the client-side port is
not unwildcarded, so new requests/replies will match megaflows in the fast
path.
The above megaflows become:
recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=128.0.0.0/128.0.0.0,frag=no), packets:263, bytes:112082, used:0.013s, flags:SFP., actions:ct(zone=8,nat),recirc(0xd)
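
(In case anyone wants to double-check this on their own setup, one simple way
is to watch the chassis hosting the server while a client issues many
short-lived requests; the commands are standard OVS tools and the grep
pattern just matches the flow format shown above:

  ovs-appctl dpctl/dump-flows | grep 'tcp(dst='   # one megaflow per client port before the fix
  ovs-appctl upcall/show                          # handler/upcall statistics

Before the fix the first command keeps growing as new client ports show up;
with the fix it stays flat.)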

Thanks,
Han

> I see how that can help for the case where vIPs has no ports specified,
> because we will not have ports unwildcarded in this case, but I thought
> it's a very unlikely scenario for, e.g., ovn-kubernetes setups.  And if
> even one vIP will have a port, all the datapath flows will have a port
> match.  Or am I missing something?
>
> Best regards, Ilya Maximets.
Han Zhou Aug. 31, 2022, 12:17 a.m. UTC | #4
On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> Hi Han,
>
> On 8/24/22 08:40, Han Zhou wrote:
> > The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
> > registers is causing a critical dataplane performance impact to
> > short-lived connections, because it unwildcards megaflows with exact
> > match on dst IP and L4 ports. Any new connections with a different
> > client side L4 port will encounter datapath flow miss and upcall to
> > ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> > RESTful API calls suffer big performance degredations.
> >
> > These fields (dst IP and port) were saved to registers to solve a
> > problem of LB hairpin use case when different VIPs are sharing
> > overlapping backend+port [0]. The change [0] might not have as wide
> > performance impact as it is now because at that time one of the match
> > condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
> > natted traffic, while now the impact is more obvious because
> > REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> > configured on the LS) since commit [1], after several other indirectly
> > related optimizations and refactors.
> >
> > This patch fixes the problem by modifying the priority-120 flows in
> > ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
> > traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
> > port only for traffic matching the LB VIPs, because these are the ones
> > that need to be saved for the hairpin purpose. The existed priority-110
> > flows will match the rest of the traffic just like before but wouldn't
> > not save dst IP and L4 port, so any server->client traffic would not
> > unwildcard megaflows with client side L4 ports.
> >
>
> Just to be 100% sure, the client->server traffic will still generate up
> to V x H flows where V is "the number of VIP:port tuples" and H is "the
> number of potential (dp_)hash values", right?

Yes, if all the VIPs and backends are being accessed at the same time. This
is expected. For any given client<->server pair, regardless of the
connections (source port number changes), the packets should match the same
megaflow and be handled by the fast path (instead of going to userspace,
which is the behavior before this patch).
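
(A hypothetical worked example, with numbers not taken from any test here:
with V = 50 VIP:port tuples and H = 64 possible dp_hash values, that is at
most V x H = 50 x 64 = 3200 client->server megaflows per datapath, a bound
set by the configuration rather than by the number of client connections.)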

>
> > [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared
backends.")
> > [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
> >
> > Signed-off-by: Han Zhou <hzhou@ovn.org>
> > ---
> > v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot
to commit
> >           before sending v1.
> >
> >  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
> >  northd/ovn-northd.8.xml |  14 ++---
> >  tests/ovn-northd.at     |  87 ++++++++++------------------
> >  tests/ovn.at            |  14 ++---
> >  4 files changed, 118 insertions(+), 122 deletions(-)
> >
> > diff --git a/northd/northd.c b/northd/northd.c
> > index 7e2681865..860641936 100644
> > --- a/northd/northd.c
> > +++ b/northd/northd.c
> > @@ -273,15 +273,15 @@ enum ovn_stage {
> >   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |
       |
> >   * |    |     REGBIT_ACL_LABEL                         | X |
       |
> >   * +----+----------------------------------------------+ X |
       |
> > - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |
       |
> > + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |
       |
> >   * +----+----------------------------------------------+ E |
       |
> > - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |
       |
> > + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |
       |
> >   * +----+----------------------------------------------+ 0 |
       |
> >   * | R3 |                  ACL LABEL                   |   |
       |
> >   *
+----+----------------------------------------------+---+------------------+
> >   * | R4 |                   UNUSED                     |   |
       |
> > - * +----+----------------------------------------------+ X |
ORIG_DIP_IPV6  |
> > - * | R5 |                   UNUSED                     | X | (>=
IN_STATEFUL) |
> > + * +----+----------------------------------------------+ X |
ORIG_DIP_IPV6(>= |
> > + * | R5 |                   UNUSED                     | X |
IN_PRE_STATEFUL) |
> >   * +----+----------------------------------------------+ R |
       |
> >   * | R6 |                   UNUSED                     | E |
       |
> >   * +----+----------------------------------------------+ G |
       |
> > @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
> >      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1",
"next;");
> >      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1",
"next;");
> >
> > -    const char *ct_lb_action = features->ct_no_masked_label
> > -                               ? "ct_lb_mark"
> > -                               : "ct_lb";
> > -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
> > -    struct ds actions = DS_EMPTY_INITIALIZER;
> > -    struct ds match = DS_EMPTY_INITIALIZER;
> > -
> > -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
> > -        ds_clear(&match);
> > -        ds_clear(&actions);
> > -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
> > -                      lb_protocols[i]);
> > -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
> > -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> > -                      lb_protocols[i], ct_lb_action);
> > -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> > -                      ds_cstr(&match), ds_cstr(&actions));
> > -
> > -        ds_clear(&match);
> > -        ds_clear(&actions);
> > -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
> > -                      lb_protocols[i]);
> > -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
> > -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> > -                      lb_protocols[i], ct_lb_action);
> > -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> > -                      ds_cstr(&match), ds_cstr(&actions));
> > -    }
> > +    /* Note: priority-120 flows are added in
build_lb_rules_pre_stateful(). */
> >
> > -    ds_clear(&actions);
> > -    ds_put_format(&actions, "%s;", ct_lb_action);
> > +    const char *ct_lb_action = features->ct_no_masked_label
> > +                               ? "ct_lb_mark;"
> > +                               : "ct_lb;";
> >
> >      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
> > -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> > +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >
> >      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
> > -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> > +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >
> >      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should
be
> >       * sent to conntrack for tracking and defragmentation. */
> > @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
> >      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
> >                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
> >
> > -    ds_destroy(&actions);
> > -    ds_destroy(&match);
> >  }
> >
> >  static void
> > @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap
*lflows) {
> >  }
> >
> >  static void
> > -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool
ct_lb_mark,
> > -               struct ds *match, struct ds *action,
> > -               const struct shash *meter_groups)
> > +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb
*lb,
> > +                            bool ct_lb_mark, struct ds *match,
> > +                            struct ds *action)
> >  {
> >      for (size_t i = 0; i < lb->n_vips; i++) {
> >          struct ovn_lb_vip *lb_vip = &lb->vips[i];
> > -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
> > -        const char *ip_match = NULL;
> > -
> >          ds_clear(action);
> >          ds_clear(match);
> > +        const char *ip_match = NULL;
> >
> > -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag.
Otherwise
> > -         * the load balanced packet will be committed again in
> > -         * S_SWITCH_IN_STATEFUL. */
> > -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
> >          /* Store the original destination IP to be used when generating
> >           * hairpin flows.
> >           */
> > @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct
ovn_northd_lb *lb, bool ct_lb_mark,
> >              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
> >                            lb_vip->vip_port);
> >          }
> > +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" :
"ct_lb");
> > +
> > +        ds_put_format(match, "%s.dst == %s", ip_match,
lb_vip->vip_str);
> > +        if (lb_vip->vip_port) {
> > +            ds_put_format(match, " && %s.dst == %d", proto,
lb_vip->vip_port);
> > +        }
> > +
> > +        struct ovn_lflow *lflow_ref = NULL;
> > +        uint32_t hash = ovn_logical_flow_hash(
> > +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
> > +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
> > +                ds_cstr(match), ds_cstr(action));
> > +
> > +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
> > +            struct ovn_datapath *od = lb->nb_ls[j];
> > +
> > +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
> > +                lflow_ref = ovn_lflow_add_at_with_hash(
> > +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> > +                        ds_cstr(match), ds_cstr(action),
> > +                        NULL, NULL, &lb->nlb->header_,
> > +                        OVS_SOURCE_LOCATOR, hash);
> > +            }
> > +        }
> > +    }
>
> I see how this can fix the dataplane issue because we're limiting the
> number of dp flows but this now induces a measurable penalty on the
> control plane side because we essentially double the number of logical
> flows that ovn-northd generates for load balancer VIPs
>
> Using a NB database [2] from a 250 node density heavy test [3] we ran
> in-house I see the following:
>
> a. before this patch (main):
> - number of SB lflows:         146233
> - SB size on disk (compacted): 70MB
> - northd poll loop intervals:  8525ms
>
> b. with this patch:
> - number of SB lflows:         163303
> - SB size on disk (compacted): 76MB
> - northd poll loop intervals:  9958ms
>
> [2]
>
https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
> [3]
>
https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
>
> This raises some concerns.  However, I went ahead and also ran a test
> with Ilya's load balancer optimization series [4] (up for review) and I
got:
>
> c. with this patch + Ilya's series:
> - number of SB lflows:         163303  << Didn't change compared to "b"
> - SB size on disk (compacted): 76MB    << Didn't change compared to "b"
> - northd poll loop intervals:  6437ms  << 25% better than "a"
>
> Then I ran a test with main + Ilya's series.
>
> d. main + Ilya's series:
> - number of SB lflows:         146233  << Didn't change compared to "a"
> - SB size on disk (compacted): 70MB    << Didn't change compared to "a"
> - northd poll loop intervals:  5413ms  << 35% better than "a"
>
> [4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
>
> I guess what I'm trying to say is that if go ahead with the approach you
> proposed and especially if we want to backport it to the LTS we should
> consider accepting/backporting other optimizations too (such as Ilya's)
> that would compensate (and more) for the control plane performance hit.
>
Thanks for the scale test!
I do expect some cost in the control plane, and your test result is aligned
with my expectation. It is unpleasant to see a ~10% SB size increase (in the
density-heavy and LB/service-heavy test scenario), but if that is what it
takes to fix the dataplane performance bug I think it is worth it. After
all, everything the control plane does is to serve the dataplane, right?
The resulting dataplane performance boost is at least 10x for short-lived
connection scenarios.
It's great to see that Ilya's patches help the LB scale, and I have no
objection to backporting them to stable branches if that is considered
necessary, but I'd consider that independent of the current patch.

> I tried to think of an alternative that wouldn't require doubling the
> number of VIP logical flows but couldn't come up with one.  Maybe you
> have more ideas?
>
This is the approach with the least impact on, and fewest changes to, the
existing pipelines that I have thought of so far. The earlier attempt
wouldn't increase the number of flows, but it had to compromise by either
failing to support a corner case (which we are still not sure is really
useful at all) or losing HW offloading capability. I'll keep thinking about
more approaches, but the NAT/LB/ACL pipelines are very complex now and I'd
avoid big changes in case some corner case gets broken. I'd spend more time
sorting out the different use cases and the documentation before making
more optimizations, but I do think we should fix this performance bug as
soon as we can. (I am also curious why we haven't heard other OVN users
complain about this yet ...)

Thanks,
Han

> Thanks,
> Dumitru
>
Dumitru Ceara Aug. 31, 2022, 8:37 a.m. UTC | #5
On 8/31/22 02:17, Han Zhou wrote:
> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> Hi Han,
>>
>> On 8/24/22 08:40, Han Zhou wrote:
>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
>>> registers is causing a critical dataplane performance impact to
>>> short-lived connections, because it unwildcards megaflows with exact
>>> match on dst IP and L4 ports. Any new connections with a different
>>> client side L4 port will encounter datapath flow miss and upcall to
>>> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
>>> RESTful API calls suffer big performance degredations.
>>>
>>> These fields (dst IP and port) were saved to registers to solve a
>>> problem of LB hairpin use case when different VIPs are sharing
>>> overlapping backend+port [0]. The change [0] might not have as wide
>>> performance impact as it is now because at that time one of the match
>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
>>> natted traffic, while now the impact is more obvious because
>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>>> configured on the LS) since commit [1], after several other indirectly
>>> related optimizations and refactors.
>>>
>>> This patch fixes the problem by modifying the priority-120 flows in
>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
>>> traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
>>> port only for traffic matching the LB VIPs, because these are the ones
>>> that need to be saved for the hairpin purpose. The existed priority-110
>>> flows will match the rest of the traffic just like before but wouldn't
>>> not save dst IP and L4 port, so any server->client traffic would not
>>> unwildcard megaflows with client side L4 ports.
>>>
>>
>> Just to be 100% sure, the client->server traffic will still generate up
>> to V x H flows where V is "the number of VIP:port tuples" and H is "the
>> number of potential (dp_)hash values", right?
> 
> Yes, if all the VIPs and backends are being accessed at the same time. This
> is expected. For any given pair of client<->server, regardless of the
> connections (source port number changes), the packets should match the same
> mega-flow and be handled by fastpath (instead of going to userspace, which
> is the behavior before this patch).
> 

Thanks for the confirmation!

>>
>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared
> backends.")
>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
>>>
>>> Signed-off-by: Han Zhou <hzhou@ovn.org>
>>> ---
>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot
> to commit
>>>           before sending v1.
>>>
>>>  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
>>>  northd/ovn-northd.8.xml |  14 ++---
>>>  tests/ovn-northd.at     |  87 ++++++++++------------------
>>>  tests/ovn.at            |  14 ++---
>>>  4 files changed, 118 insertions(+), 122 deletions(-)
>>>
>>> diff --git a/northd/northd.c b/northd/northd.c
>>> index 7e2681865..860641936 100644
>>> --- a/northd/northd.c
>>> +++ b/northd/northd.c
>>> @@ -273,15 +273,15 @@ enum ovn_stage {
>>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |
>        |
>>>   * |    |     REGBIT_ACL_LABEL                         | X |
>        |
>>>   * +----+----------------------------------------------+ X |
>        |
>>> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |
>        |
>>> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |
>        |
>>>   * +----+----------------------------------------------+ E |
>        |
>>> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |
>        |
>>> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |
>        |
>>>   * +----+----------------------------------------------+ 0 |
>        |
>>>   * | R3 |                  ACL LABEL                   |   |
>        |
>>>   *
> +----+----------------------------------------------+---+------------------+
>>>   * | R4 |                   UNUSED                     |   |
>        |
>>> - * +----+----------------------------------------------+ X |
> ORIG_DIP_IPV6  |
>>> - * | R5 |                   UNUSED                     | X | (>=
> IN_STATEFUL) |
>>> + * +----+----------------------------------------------+ X |
> ORIG_DIP_IPV6(>= |
>>> + * | R5 |                   UNUSED                     | X |
> IN_PRE_STATEFUL) |
>>>   * +----+----------------------------------------------+ R |
>        |
>>>   * | R6 |                   UNUSED                     | E |
>        |
>>>   * +----+----------------------------------------------+ G |
>        |
>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1",
> "next;");
>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1",
> "next;");
>>>
>>> -    const char *ct_lb_action = features->ct_no_masked_label
>>> -                               ? "ct_lb_mark"
>>> -                               : "ct_lb";
>>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
>>> -    struct ds actions = DS_EMPTY_INITIALIZER;
>>> -    struct ds match = DS_EMPTY_INITIALIZER;
>>> -
>>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
>>> -        ds_clear(&match);
>>> -        ds_clear(&actions);
>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
>>> -                      lb_protocols[i]);
>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>> -                      lb_protocols[i], ct_lb_action);
>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>> -
>>> -        ds_clear(&match);
>>> -        ds_clear(&actions);
>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
>>> -                      lb_protocols[i]);
>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>> -                      lb_protocols[i], ct_lb_action);
>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>> -    }
>>> +    /* Note: priority-120 flows are added in
> build_lb_rules_pre_stateful(). */
>>>
>>> -    ds_clear(&actions);
>>> -    ds_put_format(&actions, "%s;", ct_lb_action);
>>> +    const char *ct_lb_action = features->ct_no_masked_label
>>> +                               ? "ct_lb_mark;"
>>> +                               : "ct_lb;";
>>>
>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>
>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>
>>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should
> be
>>>       * sent to conntrack for tracking and defragmentation. */
>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
>>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
>>>
>>> -    ds_destroy(&actions);
>>> -    ds_destroy(&match);
>>>  }
>>>
>>>  static void
>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap
> *lflows) {
>>>  }
>>>
>>>  static void
>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool
> ct_lb_mark,
>>> -               struct ds *match, struct ds *action,
>>> -               const struct shash *meter_groups)
>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb
> *lb,
>>> +                            bool ct_lb_mark, struct ds *match,
>>> +                            struct ds *action)
>>>  {
>>>      for (size_t i = 0; i < lb->n_vips; i++) {
>>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
>>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
>>> -        const char *ip_match = NULL;
>>> -
>>>          ds_clear(action);
>>>          ds_clear(match);
>>> +        const char *ip_match = NULL;
>>>
>>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag.
> Otherwise
>>> -         * the load balanced packet will be committed again in
>>> -         * S_SWITCH_IN_STATEFUL. */
>>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
>>>          /* Store the original destination IP to be used when generating
>>>           * hairpin flows.
>>>           */
>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct
> ovn_northd_lb *lb, bool ct_lb_mark,
>>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
>>>                            lb_vip->vip_port);
>>>          }
>>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" :
> "ct_lb");
>>> +
>>> +        ds_put_format(match, "%s.dst == %s", ip_match,
> lb_vip->vip_str);
>>> +        if (lb_vip->vip_port) {
>>> +            ds_put_format(match, " && %s.dst == %d", proto,
> lb_vip->vip_port);
>>> +        }
>>> +
>>> +        struct ovn_lflow *lflow_ref = NULL;
>>> +        uint32_t hash = ovn_logical_flow_hash(
>>> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
>>> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
>>> +                ds_cstr(match), ds_cstr(action));
>>> +
>>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
>>> +            struct ovn_datapath *od = lb->nb_ls[j];
>>> +
>>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
>>> +                lflow_ref = ovn_lflow_add_at_with_hash(
>>> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>> +                        ds_cstr(match), ds_cstr(action),
>>> +                        NULL, NULL, &lb->nlb->header_,
>>> +                        OVS_SOURCE_LOCATOR, hash);
>>> +            }
>>> +        }
>>> +    }
>>
>> I see how this can fix the dataplane issue because we're limiting the
>> number of dp flows but this now induces a measurable penalty on the
>> control plane side because we essentially double the number of logical
>> flows that ovn-northd generates for load balancer VIPs
>>
>> Using a NB database [2] from a 250 node density heavy test [3] we ran
>> in-house I see the following:
>>
>> a. before this patch (main):
>> - number of SB lflows:         146233
>> - SB size on disk (compacted): 70MB
>> - northd poll loop intervals:  8525ms
>>
>> b. with this patch:
>> - number of SB lflows:         163303
>> - SB size on disk (compacted): 76MB
>> - northd poll loop intervals:  9958ms
>>
>> [2]
>>
> https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
>> [3]
>>
> https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
>>
>> This raises some concerns.  However, I went ahead and also ran a test
>> with Ilya's load balancer optimization series [4] (up for review) and I
> got:
>>
>> c. with this patch + Ilya's series:
>> - number of SB lflows:         163303  << Didn't change compared to "b"
>> - SB size on disk (compacted): 76MB    << Didn't change compared to "b"
>> - northd poll loop intervals:  6437ms  << 25% better than "a"
>>
>> Then I ran a test with main + Ilya's series.
>>
>> d. main + Ilya's series:
>> - number of SB lflows:         146233  << Didn't change compared to "a"
>> - SB size on disk (compacted): 70MB    << Didn't change compared to "a"
>> - northd poll loop intervals:  5413ms  << 35% better than "a"
>>
>> [4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
>>
>> I guess what I'm trying to say is that if go ahead with the approach you
>> proposed and especially if we want to backport it to the LTS we should
>> consider accepting/backporting other optimizations too (such as Ilya's)
>> that would compensate (and more) for the control plane performance hit.
>>
> Thanks for the scale test!
> I do expect some cost in the control plane, and your test result is aligned
> with my expectation. It is unpleasant to see ~10% SB size increase (in the
> density-heavy and LB/service-heavy test scenario), but if it is necessary
> to fix the dataplane performance bug I think it is worth it. After all,
> everything the control plane does is to serve the dataplane, right? The
> resulting dataplane performance boost is at least 10x for short-lived
> connection scenarios.
> It's great to see Ilya's patch helps the LB scale, and I have no objection
> to backport them if it is considered necessary to stable branches, but I'd
> consider it independent of the current patch.
> 

They are independent.  But the problem is that your fix will visibly impact
control plane latency (~16% on my setup).  So, if we can avoid some of
that hit by considering both patchsets together, I'd vote for that.

Given that Ilya's changes seem very contained I don't think this should
be a problem.  They just need a proper review first.

>> I tried to think of an alternative that wouldn't require doubling the
>> number of VIP logical flows but couldn't come up with one.  Maybe you
>> have more ideas?
>>
> This is the approach that has the least impact and changes to existing
> pipelines I have thought about so far. The earlier attempt wouldn't
> increase the number of flows but it had to compromise to either failing to
> support a corner case (that we are still not sure if it is really useful at
> all) or losing HW offloading capability. I'll still think about some more
> approaches but the NAT/LB/ACL pipelines are very complex now and I'd avoid
> big changes in case any corner case is broken. I'd spend more time sorting
> out different use cases and the documentation before making more
> optimizations, but I don't think we should also fix this performance bug as
> soon as we can. (I am also curious why we haven't heard other OVN users
> complain about this yet ...)

It's a guess, but we probably didn't notice this issue because it should
be more visible when there are lots of short-lived connections.

Thanks,
Dumitru
Ilya Maximets Aug. 31, 2022, 9:06 a.m. UTC | #6
On 8/31/22 01:32, Han Zhou wrote:
> 
> 
On Tue, Aug 30, 2022 at 9:35 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 8/24/22 08:40, Han Zhou wrote:
>> > The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
>> > registers is causing a critical dataplane performance impact to
>> > short-lived connections, because it unwildcards megaflows with exact
>> > match on dst IP and L4 ports. Any new connections with a different
>> > client side L4 port will encounter datapath flow miss and upcall to
>> > ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
>> > RESTful API calls suffer big performance degredations.
>> >
>> > These fields (dst IP and port) were saved to registers to solve a
>> > problem of LB hairpin use case when different VIPs are sharing
>> > overlapping backend+port [0]. The change [0] might not have as wide
>> > performance impact as it is now because at that time one of the match
>> > condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
>> > natted traffic, while now the impact is more obvious because
>> > REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>> > configured on the LS) since commit [1], after several other indirectly
>> > related optimizations and refactors.
>> >
>> > This patch fixes the problem by modifying the priority-120 flows in
>> > ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
>> > traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
>> > port only for traffic matching the LB VIPs, because these are the ones
>> > that need to be saved for the hairpin purpose. The existed priority-110
>> > flows will match the rest of the traffic just like before but wouldn't
>> > not save dst IP and L4 port, so any server->client traffic would not
>> > unwildcard megaflows with client side L4 ports.
>>
>> Hmm, but if higher priority flows have matches on these fields, datapath
>> flows will have them unwildcarded anyway.  So, why exactly that is better
>> than the current approach?
>>
> Hi Ilya,
> 
> The problem of the current approach is that it blindly saves the L4 dst port for any traffic in any direction, as long as there are VIPs configured on the datapath.
> So consider the most typical scenario of a client sending API requests to server backends behind a VIP. On the server side, any *reply* packets would hit the flow that saves the client side L4 port because for server->client direction the client port is the dst. If the client sends 10 requests, each with a different source port, the server side will end up with unwildcarded DP flows like below: (192.168.1.2 is client IP)
> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51224), packets:5, bytes:2475, used:1.118s, flags:FP., actions:ct(zone=8,nat),recirc(0x20)
> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51226), packets:5, bytes:2475, used:1.105s, flags:FP., actions:ct(zone=8,nat),recirc(0x21)
> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=37798), packets:5, bytes:2475, used:0.574s, flags:FP., actions:ct(zone=8,nat),recirc(0x40)
> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51250), packets:5, bytes:2475, used:0.872s, flags:FP., actions:ct(zone=8,nat),recirc(0x2d)
> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=46940), packets:5, bytes:2475, used:0.109s, flags:FP., actions:ct(zone=8,nat),recirc(0x60)
> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=46938), packets:5, bytes:2475, used:0.118s, flags:FP., actions:ct(zone=8,nat),recirc(0x5f)
> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51236), packets:5, bytes:2475, used:0.938s, flags:FP., actions:ct(zone=8,nat),recirc(0x26)
> ...
> 
> As a result, DP flows explode and every new request is going to be a miss and upcall to userspace, which is very inefficient. Even worse, as the flow is so generic, even traffic unrelated to the VIP would have the same impact, as long as a server on a LS with any VIP configuration is replying client requests.
> With the fix, only the client->VIP packets would hit such flows, and in those cases the dst port is the server (well known) port, which is expected to be matched in megaflows anyway, while the client side port is not unwildcarded, so new requests/replies will match megaflows in fast path.
> The above megaflows become:
> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=128.0.0.0/128.0.0.0,frag=no), packets:263, bytes:112082, used:0.013s, flags:SFP., actions:ct(zone=8,nat),recirc(0xd)


Oh, OK.  Thanks for the explanation!

So, it's reply traffic, and it will not have the L3-level matches unwildcarded
too much since, I suppose, it typically has a destination address in a different
subnet.  So, the ipv4 trie on addresses cuts off the rest of the L3/L4 headers,
including the source IP and the ports, from the match criteria.

Did I get that right?

> 
> Thanks,
> Han
> 
>> I see how that can help for the case where vIPs has no ports specified,
>> because we will not have ports unwildcarded in this case, but I thought
>> it's a very unlikely scenario for, e.g., ovn-kubernetes setups.  And if
>> even one vIP will have a port, all the datapath flows will have a port
>> match.  Or am I missing something?
>>
>> Best regards, Ilya Maximets.
Han Zhou Aug. 31, 2022, 2:59 p.m. UTC | #7
On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 8/31/22 02:17, Han Zhou wrote:
> > On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >> Hi Han,
> >>
> >> On 8/24/22 08:40, Han Zhou wrote:
> >>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
> >>> registers is causing a critical dataplane performance impact to
> >>> short-lived connections, because it unwildcards megaflows with exact
> >>> match on dst IP and L4 ports. Any new connections with a different
> >>> client side L4 port will encounter datapath flow miss and upcall to
> >>> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> >>> RESTful API calls suffer big performance degredations.
> >>>
> >>> These fields (dst IP and port) were saved to registers to solve a
> >>> problem of LB hairpin use case when different VIPs are sharing
> >>> overlapping backend+port [0]. The change [0] might not have as wide
> >>> performance impact as it is now because at that time one of the match
> >>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
> >>> natted traffic, while now the impact is more obvious because
> >>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> >>> configured on the LS) since commit [1], after several other indirectly
> >>> related optimizations and refactors.
> >>>
> >>> This patch fixes the problem by modifying the priority-120 flows in
> >>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
> >>> traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
> >>> port only for traffic matching the LB VIPs, because these are the ones
> >>> that need to be saved for the hairpin purpose. The existed priority-110
> >>> flows will match the rest of the traffic just like before but wouldn't
> >>> not save dst IP and L4 port, so any server->client traffic would not
> >>> unwildcard megaflows with client side L4 ports.
> >>>
> >>
> >> Just to be 100% sure, the client->server traffic will still generate up
> >> to V x H flows where V is "the number of VIP:port tuples" and H is "the
> >> number of potential (dp_)hash values", right?
> >
> > Yes, if all the VIPs and backends are being accessed at the same time. This
> > is expected. For any given pair of client<->server, regardless of the
> > connections (source port number changes), the packets should match the same
> > mega-flow and be handled by fastpath (instead of going to userspace, which
> > is the behavior before this patch).
> >
>
> Thanks for the confirmation!
>
> >>
> >>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
> >>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
> >>>
> >>> Signed-off-by: Han Zhou <hzhou@ovn.org>
> >>> ---
> >>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot to commit
> >>>           before sending v1.
> >>>
> >>>  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
> >>>  northd/ovn-northd.8.xml |  14 ++---
> >>>  tests/ovn-northd.at     |  87 ++++++++++------------------
> >>>  tests/ovn.at            |  14 ++---
> >>>  4 files changed, 118 insertions(+), 122 deletions(-)
> >>>
> >>> diff --git a/northd/northd.c b/northd/northd.c
> >>> index 7e2681865..860641936 100644
> >>> --- a/northd/northd.c
> >>> +++ b/northd/northd.c
> >>> @@ -273,15 +273,15 @@ enum ovn_stage {
> >>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
> >>>   * |    |     REGBIT_ACL_LABEL                         | X |                  |
> >>>   * +----+----------------------------------------------+ X |                  |
> >>> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |                  |
> >>> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |                  |
> >>>   * +----+----------------------------------------------+ E |                  |
> >>> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |                  |
> >>> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |                  |
> >>>   * +----+----------------------------------------------+ 0 |                  |
> >>>   * | R3 |                  ACL LABEL                   |   |                  |
> >>>   * +----+----------------------------------------------+---+------------------+
> >>>   * | R4 |                   UNUSED                     |   |                  |
> >>> - * +----+----------------------------------------------+ X |   ORIG_DIP_IPV6  |
> >>> - * | R5 |                   UNUSED                     | X | (>= IN_STATEFUL) |
> >>> + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
> >>> + * | R5 |                   UNUSED                     | X | IN_PRE_STATEFUL) |
> >>>   * +----+----------------------------------------------+ R |                  |
> >>>   * | R6 |                   UNUSED                     | E |                  |
> >>>   * +----+----------------------------------------------+ G |                  |
> >>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
> >>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1",
> > "next;");
> >>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1",
> > "next;");
> >>>
> >>> -    const char *ct_lb_action = features->ct_no_masked_label
> >>> -                               ? "ct_lb_mark"
> >>> -                               : "ct_lb";
> >>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
> >>> -    struct ds actions = DS_EMPTY_INITIALIZER;
> >>> -    struct ds match = DS_EMPTY_INITIALIZER;
> >>> -
> >>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
> >>> -        ds_clear(&match);
> >>> -        ds_clear(&actions);
> >>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 &&
%s",
> >>> -                      lb_protocols[i]);
> >>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
> >>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> >>> -                      lb_protocols[i], ct_lb_action);
> >>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>> -                      ds_cstr(&match), ds_cstr(&actions));
> >>> -
> >>> -        ds_clear(&match);
> >>> -        ds_clear(&actions);
> >>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 &&
%s",
> >>> -                      lb_protocols[i]);
> >>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
> >>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> >>> -                      lb_protocols[i], ct_lb_action);
> >>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>> -                      ds_cstr(&match), ds_cstr(&actions));
> >>> -    }
> >>> +    /* Note: priority-120 flows are added in
> > build_lb_rules_pre_stateful(). */
> >>>
> >>> -    ds_clear(&actions);
> >>> -    ds_put_format(&actions, "%s;", ct_lb_action);
> >>> +    const char *ct_lb_action = features->ct_no_masked_label
> >>> +                               ? "ct_lb_mark;"
> >>> +                               : "ct_lb;";
> >>>
> >>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
> >>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> >>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >>>
> >>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
> >>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> >>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >>>
> >>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets
should
> > be
> >>>       * sent to conntrack for tracking and defragmentation. */
> >>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
> >>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
> >>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
> >>>
> >>> -    ds_destroy(&actions);
> >>> -    ds_destroy(&match);
> >>>  }
> >>>
> >>>  static void
> >>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap
> > *lflows) {
> >>>  }
> >>>
> >>>  static void
> >>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool
> > ct_lb_mark,
> >>> -               struct ds *match, struct ds *action,
> >>> -               const struct shash *meter_groups)
> >>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb
> > *lb,
> >>> +                            bool ct_lb_mark, struct ds *match,
> >>> +                            struct ds *action)
> >>>  {
> >>>      for (size_t i = 0; i < lb->n_vips; i++) {
> >>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
> >>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
> >>> -        const char *ip_match = NULL;
> >>> -
> >>>          ds_clear(action);
> >>>          ds_clear(match);
> >>> +        const char *ip_match = NULL;
> >>>
> >>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag.
> > Otherwise
> >>> -         * the load balanced packet will be committed again in
> >>> -         * S_SWITCH_IN_STATEFUL. */
> >>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
> >>>          /* Store the original destination IP to be used when
generating
> >>>           * hairpin flows.
> >>>           */
> >>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct
> > ovn_northd_lb *lb, bool ct_lb_mark,
> >>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
> >>>                            lb_vip->vip_port);
> >>>          }
> >>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" :
> > "ct_lb");
> >>> +
> >>> +        ds_put_format(match, "%s.dst == %s", ip_match,
> > lb_vip->vip_str);
> >>> +        if (lb_vip->vip_port) {
> >>> +            ds_put_format(match, " && %s.dst == %d", proto,
> > lb_vip->vip_port);
> >>> +        }
> >>> +
> >>> +        struct ovn_lflow *lflow_ref = NULL;
> >>> +        uint32_t hash = ovn_logical_flow_hash(
> >>> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
> >>> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL),
120,
> >>> +                ds_cstr(match), ds_cstr(action));
> >>> +
> >>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
> >>> +            struct ovn_datapath *od = lb->nb_ls[j];
> >>> +
> >>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
> >>> +                lflow_ref = ovn_lflow_add_at_with_hash(
> >>> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>> +                        ds_cstr(match), ds_cstr(action),
> >>> +                        NULL, NULL, &lb->nlb->header_,
> >>> +                        OVS_SOURCE_LOCATOR, hash);
> >>> +            }
> >>> +        }
> >>> +    }
> >>
> >> I see how this can fix the dataplane issue because we're limiting the
> >> number of dp flows but this now induces a measurable penalty on the
> >> control plane side because we essentially double the number of logical
> >> flows that ovn-northd generates for load balancer VIPs
> >>
> >> Using a NB database [2] from a 250 node density heavy test [3] we ran
> >> in-house I see the following:
> >>
> >> a. before this patch (main):
> >> - number of SB lflows:         146233
> >> - SB size on disk (compacted): 70MB
> >> - northd poll loop intervals:  8525ms
> >>
> >> b. with this patch:
> >> - number of SB lflows:         163303
> >> - SB size on disk (compacted): 76MB
> >> - northd poll loop intervals:  9958ms
> >>
> >> [2]
> >> https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
> >> [3]
> >> https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
> >>
> >> This raises some concerns.  However, I went ahead and also ran a test
> >> with Ilya's load balancer optimization series [4] (up for review) and I got:
> >>
> >> c. with this patch + Ilya's series:
> >> - number of SB lflows:         163303  << Didn't change compared to "b"
> >> - SB size on disk (compacted): 76MB    << Didn't change compared to "b"
> >> - northd poll loop intervals:  6437ms  << 25% better than "a"
> >>
> >> Then I ran a test with main + Ilya's series.
> >>
> >> d. main + Ilya's series:
> >> - number of SB lflows:         146233  << Didn't change compared to "a"
> >> - SB size on disk (compacted): 70MB    << Didn't change compared to "a"
> >> - northd poll loop intervals:  5413ms  << 35% better than "a"
> >>
> >> [4]
> >> https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
> >>
> >> I guess what I'm trying to say is that if go ahead with the approach you
> >> proposed and especially if we want to backport it to the LTS we should
> >> consider accepting/backporting other optimizations too (such as Ilya's)
> >> that would compensate (and more) for the control plane performance hit.
> >>
> > Thanks for the scale test!
> > I do expect some cost in the control plane, and your test result is aligned
> > with my expectation. It is unpleasant to see ~10% SB size increase (in the
> > density-heavy and LB/service-heavy test scenario), but if it is necessary
> > to fix the dataplane performance bug I think it is worth it. After all,
> > everything the control plane does is to serve the dataplane, right? The
> > resulting dataplane performance boost is at least 10x for short-lived
> > connection scenarios.
> > It's great to see Ilya's patch helps the LB scale, and I have no objection
> > to backport them if it is considered necessary to stable branches, but I'd
> > consider it independent of the current patch.
> >
>
> They are independent.  But the problem is your fix will visibily impact
> control plane latency (~16% on my setup).  So, if we can avoid some of
> that hit by considering both patchsets together, I'd vote for that.
>
Again, no objection to backporting Ilya's patches, but these are the criteria
in my mind:

1. For this patch, it improves dataplane performance (to me it is in fact
fixing a performance bug) at the cost of the control plane. This is not
uncommon in the OVN project history - adding new features sometimes adds
control plane cost. When this happens, whether to accept the new feature is
based on the value of the feature vs. the cost to the control plane - a
tradeoff has to be made in many cases. In this case, for a 16% control plane
latency increase we get a 10x dataplane speed up, so I would vote for it.

2. For backporting Ilya's optimization, I'd evaluate it using the criteria for
any optimization backport. Say the patch gives a 10-20% speed up, we think
that speed up is critical for a branch for some reason, and the changes are
well contained - then I'd vote for the backport.

Does this make sense?
@Numan Siddique <numans@ovn.org> @Mark Michelson <mmichels@redhat.com> any
comments?

Thanks,
Han

> Given that Ilya's changes seem very contained I don't think this should
> be a problem.  They just need a proper review first.
>
> >> I tried to think of an alternative that wouldn't require doubling the
> >> number of VIP logical flows but couldn't come up with one.  Maybe you
> >> have more ideas?
> >>
> > This is the approach that has the least impact and changes to existing
> > pipelines I have thought about so far. The earlier attempt wouldn't
> > increase the number of flows but it had to compromise to either failing to
> > support a corner case (that we are still not sure if it is really useful at
> > all) or losing HW offloading capability. I'll still think about some more
> > approaches but the NAT/LB/ACL pipelines are very complex now and I'd avoid
> > big changes in case any corner case is broken. I'd spend more time sorting
> > out different use cases and the documentation before making more
> > optimizations, but I don't think we should also fix this performance bug as
> > soon as we can. (I am also curious why we haven't heard other OVN users
> > complain about this yet ...)
>
> It's a guess, but we probably didn't notice this issue because it should
> be more visible when there are lots of short lived connections.
>
> Thanks,
> Dumitru
>
Han Zhou Aug. 31, 2022, 3:12 p.m. UTC | #8
On Wed, Aug 31, 2022 at 2:06 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 8/31/22 01:32, Han Zhou wrote:
> >
> >
> > On Tue, Aug 30, 2022 at 9:35 AM Ilya Maximets <i.maximets@ovn.org
<mailto:i.maximets@ovn.org>> wrote:
> >>
> >> On 8/24/22 08:40, Han Zhou wrote:
> >> > The ls_in_pre_stateful priority 120 flow that saves dst IP and Port
to
> >> > registers is causing a critical dataplane performance impact to
> >> > short-lived connections, because it unwildcards megaflows with exact
> >> > match on dst IP and L4 ports. Any new connections with a different
> >> > client side L4 port will encounter datapath flow miss and upcall to
> >> > ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> >> > RESTful API calls suffer big performance degredations.
> >> >
> >> > These fields (dst IP and port) were saved to registers to solve a
> >> > problem of LB hairpin use case when different VIPs are sharing
> >> > overlapping backend+port [0]. The change [0] might not have as wide
> >> > performance impact as it is now because at that time one of the match
> >> > condition "REGBIT_CONNTRACK_NAT == 1" was set only for established
and
> >> > natted traffic, while now the impact is more obvious because
> >> > REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> >> > configured on the LS) since commit [1], after several other
indirectly
> >> > related optimizations and refactors.
> >> >
> >> > This patch fixes the problem by modifying the priority-120 flows in
> >> > ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for
any
> >> > traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
> >> > port only for traffic matching the LB VIPs, because these are the
ones
> >> > that need to be saved for the hairpin purpose. The existed
priority-110
> >> > flows will match the rest of the traffic just like before but
wouldn't
> >> > not save dst IP and L4 port, so any server->client traffic would not
> >> > unwildcard megaflows with client side L4 ports.
> >>
> >> Hmm, but if higher priority flows have matches on these fields,
datapath
> >> flows will have them unwildcarded anyway.  So, why exactly that is
better
> >> than the current approach?
> >>
> > Hi Ilya,
> >
> > The problem of the current approach is that it blindly saves the L4 dst port for any traffic in any direction, as long as there are VIPs configured on the datapath.
> > So consider the most typical scenario of a client sending API requests to server backends behind a VIP. On the server side, any *reply* packets would hit the flow that saves the client side L4 port because for server->client direction the client port is the dst. If the client sends 10 requests, each with a different source port, the server side will end up with unwildcarded DP flows like below: (192.168.1.2 is client IP)
> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51224), packets:5, bytes:2475, used:1.118s, flags:FP., actions:ct(zone=8,nat),recirc(0x20)
> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51226), packets:5, bytes:2475, used:1.105s, flags:FP., actions:ct(zone=8,nat),recirc(0x21)
> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=37798), packets:5, bytes:2475, used:0.574s, flags:FP., actions:ct(zone=8,nat),recirc(0x40)
> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51250), packets:5, bytes:2475, used:0.872s, flags:FP., actions:ct(zone=8,nat),recirc(0x2d)
> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=46940), packets:5, bytes:2475, used:0.109s, flags:FP., actions:ct(zone=8,nat),recirc(0x60)
> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=46938), packets:5, bytes:2475, used:0.118s, flags:FP., actions:ct(zone=8,nat),recirc(0x5f)
> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51236), packets:5, bytes:2475, used:0.938s, flags:FP., actions:ct(zone=8,nat),recirc(0x26)
> > ...
> >
> > As a result, DP flows explode and every new request is going to be a miss and upcall to userspace, which is very inefficient. Even worse, as the flow is so generic, even traffic unrelated to the VIP would have the same impact, as long as a server on a LS with any VIP configuration is replying client requests.
> > With the fix, only the client->VIP packets would hit such flows, and in those cases the dst port is the server (well known) port, which is expected to be matched in megaflows anyway, while the client side port is not unwildcarded, so new requests/replies will match megaflows in fast path.
> > The above megaflows become:
> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=128.0.0.0/128.0.0.0,frag=no <http://128.0.0.0/128.0.0.0,frag=no>), packets:263, bytes:112082, used:0.013s, flags:SFP., actions:ct(zone=8,nat),recirc(0xd)
>
>
> Oh, OK.  Thanks for the explanation!
>
> So, it's a reply traffic, and it will not have matches on L3 level unwildcarded
> too much since, I suppose, it has a destination address typically in a different
> subnet.

After the fix, yes. Before the fix, no, because of the flow that saves dst
IP and port to registers.

> So, the ipv4 trie on addresses cuts off the rest of the L3/L4 headers
> including source ip and the ports from the match criteria.

Sorry I am not sure if I understand your question here.
If you are talking about the server(source)->client(destination) direction,
for the source/server ip and port, this is correct (before and after the
fix).
If you are talking about the client ip and ports, it is the case after the
fix, but not before the fix.

Thanks,
Han
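
For reference, the wildcarding discussed above can be checked directly on a
chassis by dumping the installed datapath flows; a minimal sketch using plain
ovs-appctl commands (output details vary by datapath):

# Dump the megaflows installed by ovs-vswitchd.  A field printed as
# value/mask, e.g. ipv4(dst=128.0.0.0/128.0.0.0), is only partially matched,
# and fields absent from the match are fully wildcarded.
ovs-appctl dpctl/dump-flows
# Count how many megaflows match on a TCP destination port; one per client
# source port would indicate the explosion described above.
ovs-appctl dpctl/dump-flows | grep -c 'tcp(dst='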

>
> Did I get that right?
>
> >
> > Thanks,
> > Han
> >
> >> I see how that can help for the case where vIPs has no ports specified,
> >> because we will not have ports unwildcarded in this case, but I thought
> >> it's a very unlikely scenario for, e.g., ovn-kubernetes setups.  And if
> >> even one vIP will have a port, all the datapath flows will have a port
> >> match.  Or am I missing something?
> >>
> >> Best regards, Ilya Maximets.
>
Ilya Maximets Aug. 31, 2022, 3:14 p.m. UTC | #9
On 8/31/22 17:12, Han Zhou wrote:
> 
> 
> On Wed, Aug 31, 2022 at 2:06 AM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
>>
>> On 8/31/22 01:32, Han Zhou wrote:
>> >
>> >
>> > On Tue, Aug 30, 2022 at 9:35 AM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org> <mailto:i.maximets@ovn.org <mailto:i.maximets@ovn.org>>> wrote:
>> >>
>> >> On 8/24/22 08:40, Han Zhou wrote:
>> >> > The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
>> >> > registers is causing a critical dataplane performance impact to
>> >> > short-lived connections, because it unwildcards megaflows with exact
>> >> > match on dst IP and L4 ports. Any new connections with a different
>> >> > client side L4 port will encounter datapath flow miss and upcall to
>> >> > ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
>> >> > RESTful API calls suffer big performance degredations.
>> >> >
>> >> > These fields (dst IP and port) were saved to registers to solve a
>> >> > problem of LB hairpin use case when different VIPs are sharing
>> >> > overlapping backend+port [0]. The change [0] might not have as wide
>> >> > performance impact as it is now because at that time one of the match
>> >> > condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
>> >> > natted traffic, while now the impact is more obvious because
>> >> > REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>> >> > configured on the LS) since commit [1], after several other indirectly
>> >> > related optimizations and refactors.
>> >> >
>> >> > This patch fixes the problem by modifying the priority-120 flows in
>> >> > ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
>> >> > traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
>> >> > port only for traffic matching the LB VIPs, because these are the ones
>> >> > that need to be saved for the hairpin purpose. The existed priority-110
>> >> > flows will match the rest of the traffic just like before but wouldn't
>> >> > not save dst IP and L4 port, so any server->client traffic would not
>> >> > unwildcard megaflows with client side L4 ports.
>> >>
>> >> Hmm, but if higher priority flows have matches on these fields, datapath
>> >> flows will have them unwildcarded anyway.  So, why exactly that is better
>> >> than the current approach?
>> >>
>> > Hi Ilya,
>> >
>> > The problem of the current approach is that it blindly saves the L4 dst port for any traffic in any direction, as long as there are VIPs configured on the datapath.
>> > So consider the most typical scenario of a client sending API requests to server backends behind a VIP. On the server side, any *reply* packets would hit the flow that saves the client side L4 port because for server->client direction the client port is the dst. If the client sends 10 requests, each with a different source port, the server side will end up with unwildcarded DP flows like below: (192.168.1.2 is client IP)
>> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51224), packets:5, bytes:2475, used:1.118s, flags:FP., actions:ct(zone=8,nat),recirc(0x20)
>> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51226), packets:5, bytes:2475, used:1.105s, flags:FP., actions:ct(zone=8,nat),recirc(0x21)
>> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=37798), packets:5, bytes:2475, used:0.574s, flags:FP., actions:ct(zone=8,nat),recirc(0x40)
>> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51250), packets:5, bytes:2475, used:0.872s, flags:FP., actions:ct(zone=8,nat),recirc(0x2d)
>> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=46940), packets:5, bytes:2475, used:0.109s, flags:FP., actions:ct(zone=8,nat),recirc(0x60)
>> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=46938), packets:5, bytes:2475, used:0.118s, flags:FP., actions:ct(zone=8,nat),recirc(0x5f)
>> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51236), packets:5, bytes:2475, used:0.938s, flags:FP., actions:ct(zone=8,nat),recirc(0x26)
>> > ...
>> >
>> > As a result, DP flows explode and every new request is going to be a miss and upcall to userspace, which is very inefficient. Even worse, as the flow is so generic, even traffic unrelated to the VIP would have the same impact, as long as a server on a LS with any VIP configuration is replying client requests.
>> > With the fix, only the client->VIP packets would hit such flows, and in those cases the dst port is the server (well known) port, which is expected to be matched in megaflows anyway, while the client side port is not unwildcarded, so new requests/replies will match megaflows in fast path.
>> > The above megaflows become:
>> > recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=128.0.0.0/128.0.0.0,frag=no <http://128.0.0.0/128.0.0.0,frag=no> <http://128.0.0.0/128.0.0.0,frag=no <http://128.0.0.0/128.0.0.0,frag=no>>), packets:263, bytes:112082, used:0.013s, flags:SFP., actions:ct(zone=8,nat),recirc(0xd)
>>
>>
>> Oh, OK.  Thanks for the explanation!
>>
>> So, it's a reply traffic, and it will not have matches on L3 level unwildcarded
>> too much since, I suppose, it has a destination address typically in a different
>> subnet.
> 
> After the fix, yes. Before the fix, no, because of the flow that saves dst IP and port to registers.
> 
>> So, the ipv4 trie on addresses cuts off the rest of the L3/L4 headers
>> including source ip and the ports from the match criteria.
> 
> Sorry I am not sure if I understand your question here.
> If you are talking about the server(source)->client(destination) direction, for the source/server ip and port, this is correct (before and after the fix).
> If you are talking about the client ip and ports, it is the case after the fix, but not before the fix.


I meant after the fix.  Ack.
Thanks for clarification!

> 
> Thanks,
> Han
> 
>>
>> Did I get that right?
>>
>> >
>> > Thanks,
>> > Han
>> >
>> >> I see how that can help for the case where vIPs has no ports specified,
>> >> because we will not have ports unwildcarded in this case, but I thought
>> >> it's a very unlikely scenario for, e.g., ovn-kubernetes setups.  And if
>> >> even one vIP will have a port, all the datapath flows will have a port
>> >> match.  Or am I missing something?
>> >>
>> >> Best regards, Ilya Maximets.
>>
Ilya Maximets Aug. 31, 2022, 3:17 p.m. UTC | #10
On 8/31/22 17:14, Ilya Maximets wrote:
> On 8/31/22 17:12, Han Zhou wrote:
>>
>>
>> On Wed, Aug 31, 2022 at 2:06 AM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
>>>
>>> On 8/31/22 01:32, Han Zhou wrote:
>>>>
>>>>
>>>> On Tue, Aug 30, 2022 at 9:35 AM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org> <mailto:i.maximets@ovn.org <mailto:i.maximets@ovn.org>>> wrote:
>>>>>
>>>>> On 8/24/22 08:40, Han Zhou wrote:
>>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
>>>>>> registers is causing a critical dataplane performance impact to
>>>>>> short-lived connections, because it unwildcards megaflows with exact
>>>>>> match on dst IP and L4 ports. Any new connections with a different
>>>>>> client side L4 port will encounter datapath flow miss and upcall to
>>>>>> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
>>>>>> RESTful API calls suffer big performance degredations.
>>>>>>
>>>>>> These fields (dst IP and port) were saved to registers to solve a
>>>>>> problem of LB hairpin use case when different VIPs are sharing
>>>>>> overlapping backend+port [0]. The change [0] might not have as wide
>>>>>> performance impact as it is now because at that time one of the match
>>>>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
>>>>>> natted traffic, while now the impact is more obvious because
>>>>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>>>>>> configured on the LS) since commit [1], after several other indirectly
>>>>>> related optimizations and refactors.
>>>>>>
>>>>>> This patch fixes the problem by modifying the priority-120 flows in
>>>>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
>>>>>> traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
>>>>>> port only for traffic matching the LB VIPs, because these are the ones
>>>>>> that need to be saved for the hairpin purpose. The existed priority-110
>>>>>> flows will match the rest of the traffic just like before but wouldn't
>>>>>> not save dst IP and L4 port, so any server->client traffic would not
>>>>>> unwildcard megaflows with client side L4 ports.
>>>>>
>>>>> Hmm, but if higher priority flows have matches on these fields, datapath
>>>>> flows will have them unwildcarded anyway.  So, why exactly that is better
>>>>> than the current approach?
>>>>>
>>>> Hi Ilya,
>>>>
>>>> The problem of the current approach is that it blindly saves the L4 dst port for any traffic in any direction, as long as there are VIPs configured on the datapath.
>>>> So consider the most typical scenario of a client sending API requests to server backends behind a VIP. On the server side, any *reply* packets would hit the flow that saves the client side L4 port because for server->client direction the client port is the dst. If the client sends 10 requests, each with a different source port, the server side will end up with unwildcarded DP flows like below: (192.168.1.2 is client IP)
>>>> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51224), packets:5, bytes:2475, used:1.118s, flags:FP., actions:ct(zone=8,nat),recirc(0x20)
>>>> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51226), packets:5, bytes:2475, used:1.105s, flags:FP., actions:ct(zone=8,nat),recirc(0x21)
>>>> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=37798), packets:5, bytes:2475, used:0.574s, flags:FP., actions:ct(zone=8,nat),recirc(0x40)
>>>> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51250), packets:5, bytes:2475, used:0.872s, flags:FP., actions:ct(zone=8,nat),recirc(0x2d)
>>>> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=46940), packets:5, bytes:2475, used:0.109s, flags:FP., actions:ct(zone=8,nat),recirc(0x60)
>>>> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=46938), packets:5, bytes:2475, used:0.118s, flags:FP., actions:ct(zone=8,nat),recirc(0x5f)
>>>> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=192.168.1.2,proto=6,frag=no),tcp(dst=51236), packets:5, bytes:2475, used:0.938s, flags:FP., actions:ct(zone=8,nat),recirc(0x26)
>>>> ...
>>>>
>>>> As a result, DP flows explode and every new request is going to be a miss and upcall to userspace, which is very inefficient. Even worse, as the flow is so generic, even traffic unrelated to the VIP would have the same impact, as long as a server on a LS with any VIP configuration is replying client requests.
>>>> With the fix, only the client->VIP packets would hit such flows, and in those cases the dst port is the server (well known) port, which is expected to be matched in megaflows anyway, while the client side port is not unwildcarded, so new requests/replies will match megaflows in fast path.
>>>> The above megaflows become:
>>>> recirc_id(0),in_port(4),eth(src=00:00:00:00:00:00/01:00:00:00:00:00,dst=00:00:01:01:02:04),eth_type(0x0800),ipv4(dst=128.0.0.0/128.0.0.0,frag=no <http://128.0.0.0/128.0.0.0,frag=no> <http://128.0.0.0/128.0.0.0,frag=no <http://128.0.0.0/128.0.0.0,frag=no>>), packets:263, bytes:112082, used:0.013s, flags:SFP., actions:ct(zone=8,nat),recirc(0xd)
>>>
>>>
>>> Oh, OK.  Thanks for the explanation!
>>>
>>> So, it's a reply traffic, and it will not have matches on L3 level unwildcarded
>>> too much since, I suppose, it has a destination address typically in a different
>>> subnet.
>>
>> After the fix, yes. Before the fix, no, because of the flow that saves dst IP and port to registers.
>>
>>> So, the ipv4 trie on addresses cuts off the rest of the L3/L4 headers
>>> including source ip and the ports from the match criteria.
>>
>> Sorry I am not sure if I understand your question here.
>> If you are talking about the server(source)->client(destination) direction, for the source/server ip and port, this is correct (before and after the fix).
>> If you are talking about the client ip and ports, it is the case after the fix, but not before the fix.
> 
> 
> I meant after the fix.  Ack.

"Ack" as if I got the explanation.  I didn't review the code itself. :)

> Thanks for clarification!
> 
>>
>> Thanks,
>> Han
>>
>>>
>>> Did I get that right?
>>>
>>>>
>>>> Thanks,
>>>> Han
>>>>
>>>>> I see how that can help for the case where vIPs has no ports specified,
>>>>> because we will not have ports unwildcarded in this case, but I thought
>>>>> it's a very unlikely scenario for, e.g., ovn-kubernetes setups.  And if
>>>>> even one vIP will have a port, all the datapath flows will have a port
>>>>> match.  Or am I missing something?
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>
>
Dumitru Ceara Sept. 6, 2022, 1:42 p.m. UTC | #11
On 8/31/22 16:59, Han Zhou wrote:
> On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> On 8/31/22 02:17, Han Zhou wrote:
>>> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>>>
>>>> Hi Han,
>>>>
>>>> On 8/24/22 08:40, Han Zhou wrote:
>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
>>>>> registers is causing a critical dataplane performance impact to
>>>>> short-lived connections, because it unwildcards megaflows with exact
>>>>> match on dst IP and L4 ports. Any new connections with a different
>>>>> client side L4 port will encounter datapath flow miss and upcall to
>>>>> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
>>>>> RESTful API calls suffer big performance degredations.
>>>>>
>>>>> These fields (dst IP and port) were saved to registers to solve a
>>>>> problem of LB hairpin use case when different VIPs are sharing
>>>>> overlapping backend+port [0]. The change [0] might not have as wide
>>>>> performance impact as it is now because at that time one of the match
>>>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
>>>>> natted traffic, while now the impact is more obvious because
>>>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>>>>> configured on the LS) since commit [1], after several other indirectly
>>>>> related optimizations and refactors.
>>>>>
>>>>> This patch fixes the problem by modifying the priority-120 flows in
>>>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for
> any
>>>>> traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4
>>>>> port only for traffic matching the LB VIPs, because these are the ones
>>>>> that need to be saved for the hairpin purpose. The existed
> priority-110
>>>>> flows will match the rest of the traffic just like before but wouldn't
>>>>> not save dst IP and L4 port, so any server->client traffic would not
>>>>> unwildcard megaflows with client side L4 ports.
>>>>>
>>>>
>>>> Just to be 100% sure, the client->server traffic will still generate up
>>>> to V x H flows where V is "the number of VIP:port tuples" and H is "the
>>>> number of potential (dp_)hash values", right?
>>>
>>> Yes, if all the VIPs and backends are being accessed at the same time.
> This
>>> is expected. For any given pair of client<->server, regardless of the
>>> connections (source port number changes), the packets should match the
> same
>>> mega-flow and be handled by fastpath (instead of going to userspace,
> which
>>> is the behavior before this patch).
>>>
>>
>> Thanks for the confirmation!
>>
>>>>
>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> shared
>>> backends.")
>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> traffic.")
>>>>>
>>>>> Signed-off-by: Han Zhou <hzhou@ovn.org>
>>>>> ---
>>>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot
>>> to commit
>>>>>           before sending v1.
>>>>>
>>>>>  northd/northd.c         | 125
> +++++++++++++++++++++++++---------------
>>>>>  northd/ovn-northd.8.xml |  14 ++---
>>>>>  tests/ovn-northd.at     |  87 ++++++++++------------------
>>>>>  tests/ovn.at            |  14 ++---
>>>>>  4 files changed, 118 insertions(+), 122 deletions(-)
>>>>>
>>>>> diff --git a/northd/northd.c b/northd/northd.c
>>>>> index 7e2681865..860641936 100644
>>>>> --- a/northd/northd.c
>>>>> +++ b/northd/northd.c
>>>>> @@ -273,15 +273,15 @@ enum ovn_stage {
>>>>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |
>>>        |
>>>>>   * |    |     REGBIT_ACL_LABEL                         | X |
>>>        |
>>>>>   * +----+----------------------------------------------+ X |
>>>        |
>>>>> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |
>>>        |
>>>>> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |
>>>        |
>>>>>   * +----+----------------------------------------------+ E |
>>>        |
>>>>> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |
>>>        |
>>>>> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |
>>>        |
>>>>>   * +----+----------------------------------------------+ 0 |
>>>        |
>>>>>   * | R3 |                  ACL LABEL                   |   |
>>>        |
>>>>>   *
>>>
> +----+----------------------------------------------+---+------------------+
>>>>>   * | R4 |                   UNUSED                     |   |
>>>        |
>>>>> - * +----+----------------------------------------------+ X |
>>> ORIG_DIP_IPV6  |
>>>>> - * | R5 |                   UNUSED                     | X | (>=
>>> IN_STATEFUL) |
>>>>> + * +----+----------------------------------------------+ X |
>>> ORIG_DIP_IPV6(>= |
>>>>> + * | R5 |                   UNUSED                     | X |
>>> IN_PRE_STATEFUL) |
>>>>>   * +----+----------------------------------------------+ R |
>>>        |
>>>>>   * | R6 |                   UNUSED                     | E |
>>>        |
>>>>>   * +----+----------------------------------------------+ G |
>>>        |
>>>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1",
>>> "next;");
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1",
>>> "next;");
>>>>>
>>>>> -    const char *ct_lb_action = features->ct_no_masked_label
>>>>> -                               ? "ct_lb_mark"
>>>>> -                               : "ct_lb";
>>>>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
>>>>> -    struct ds actions = DS_EMPTY_INITIALIZER;
>>>>> -    struct ds match = DS_EMPTY_INITIALIZER;
>>>>> -
>>>>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
>>>>> -        ds_clear(&match);
>>>>> -        ds_clear(&actions);
>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 &&
> %s",
>>>>> -                      lb_protocols[i]);
>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
>>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>>>> -                      lb_protocols[i], ct_lb_action);
>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>>>> -
>>>>> -        ds_clear(&match);
>>>>> -        ds_clear(&actions);
>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 &&
> %s",
>>>>> -                      lb_protocols[i]);
>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
>>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>>>> -                      lb_protocols[i], ct_lb_action);
>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>>>> -    }
>>>>> +    /* Note: priority-120 flows are added in
>>> build_lb_rules_pre_stateful(). */
>>>>>
>>>>> -    ds_clear(&actions);
>>>>> -    ds_put_format(&actions, "%s;", ct_lb_action);
>>>>> +    const char *ct_lb_action = features->ct_no_masked_label
>>>>> +                               ? "ct_lb_mark;"
>>>>> +                               : "ct_lb;";
>>>>>
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>>>
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>>>
>>>>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets
> should
>>> be
>>>>>       * sent to conntrack for tracking and defragmentation. */
>>>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
>>>>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
>>>>>
>>>>> -    ds_destroy(&actions);
>>>>> -    ds_destroy(&match);
>>>>>  }
>>>>>
>>>>>  static void
>>>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap
>>> *lflows) {
>>>>>  }
>>>>>
>>>>>  static void
>>>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool
>>> ct_lb_mark,
>>>>> -               struct ds *match, struct ds *action,
>>>>> -               const struct shash *meter_groups)
>>>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb
>>> *lb,
>>>>> +                            bool ct_lb_mark, struct ds *match,
>>>>> +                            struct ds *action)
>>>>>  {
>>>>>      for (size_t i = 0; i < lb->n_vips; i++) {
>>>>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
>>>>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
>>>>> -        const char *ip_match = NULL;
>>>>> -
>>>>>          ds_clear(action);
>>>>>          ds_clear(match);
>>>>> +        const char *ip_match = NULL;
>>>>>
>>>>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag.
>>> Otherwise
>>>>> -         * the load balanced packet will be committed again in
>>>>> -         * S_SWITCH_IN_STATEFUL. */
>>>>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
>>>>>          /* Store the original destination IP to be used when
> generating
>>>>>           * hairpin flows.
>>>>>           */
>>>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct
>>> ovn_northd_lb *lb, bool ct_lb_mark,
>>>>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
>>>>>                            lb_vip->vip_port);
>>>>>          }
>>>>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" :
>>> "ct_lb");
>>>>> +
>>>>> +        ds_put_format(match, "%s.dst == %s", ip_match,
>>> lb_vip->vip_str);
>>>>> +        if (lb_vip->vip_port) {
>>>>> +            ds_put_format(match, " && %s.dst == %d", proto,
>>> lb_vip->vip_port);
>>>>> +        }
>>>>> +
>>>>> +        struct ovn_lflow *lflow_ref = NULL;
>>>>> +        uint32_t hash = ovn_logical_flow_hash(
>>>>> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
>>>>> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL),
> 120,
>>>>> +                ds_cstr(match), ds_cstr(action));
>>>>> +
>>>>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
>>>>> +            struct ovn_datapath *od = lb->nb_ls[j];
>>>>> +
>>>>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
>>>>> +                lflow_ref = ovn_lflow_add_at_with_hash(
>>>>> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>> +                        ds_cstr(match), ds_cstr(action),
>>>>> +                        NULL, NULL, &lb->nlb->header_,
>>>>> +                        OVS_SOURCE_LOCATOR, hash);
>>>>> +            }
>>>>> +        }
>>>>> +    }
>>>>
>>>> I see how this can fix the dataplane issue because we're limiting the
>>>> number of dp flows but this now induces a measurable penalty on the
>>>> control plane side because we essentially double the number of logical
>>>> flows that ovn-northd generates for load balancer VIPs
>>>>
>>>> Using a NB database [2] from a 250 node density heavy test [3] we ran
>>>> in-house I see the following:
>>>>
>>>> a. before this patch (main):
>>>> - number of SB lflows:         146233
>>>> - SB size on disk (compacted): 70MB
>>>> - northd poll loop intervals:  8525ms
>>>>
>>>> b. with this patch:
>>>> - number of SB lflows:         163303
>>>> - SB size on disk (compacted): 76MB
>>>> - northd poll loop intervals:  9958ms
>>>>
>>>> [2]
>>>>
>>>
> https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
>>>> [3]
>>>>
>>>
> https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
>>>>
>>>> This raises some concerns.  However, I went ahead and also ran a test
>>>> with Ilya's load balancer optimization series [4] (up for review) and I
>>> got:
>>>>
>>>> c. with this patch + Ilya's series:
>>>> - number of SB lflows:         163303  << Didn't change compared to "b"
>>>> - SB size on disk (compacted): 76MB    << Didn't change compared to "b"
>>>> - northd poll loop intervals:  6437ms  << 25% better than "a"
>>>>
>>>> Then I ran a test with main + Ilya's series.
>>>>
>>>> d. main + Ilya's series:
>>>> - number of SB lflows:         146233  << Didn't change compared to "a"
>>>> - SB size on disk (compacted): 70MB    << Didn't change compared to "a"
>>>> - northd poll loop intervals:  5413ms  << 35% better than "a"
>>>>
>>>> [4]
> https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
>>>>
>>>> I guess what I'm trying to say is that if go ahead with the approach
> you
>>>> proposed and especially if we want to backport it to the LTS we should
>>>> consider accepting/backporting other optimizations too (such as Ilya's)
>>>> that would compensate (and more) for the control plane performance hit.
>>>>
>>> Thanks for the scale test!
>>> I do expect some cost in the control plane, and your test result is
> aligned
>>> with my expectation. It is unpleasant to see ~10% SB size increase (in
> the
>>> density-heavy and LB/service-heavy test scenario), but if it is
> necessary
>>> to fix the dataplane performance bug I think it is worth it. After all,
>>> everything the control plane does is to serve the dataplane, right? The
>>> resulting dataplane performance boost is at least 10x for short-lived
>>> connection scenarios.
>>> It's great to see Ilya's patch helps the LB scale, and I have no
> objection
>>> to backport them if it is considered necessary to stable branches, but
> I'd
>>> consider it independent of the current patch.
>>>
>>
>> They are independent.  But the problem is your fix will visibily impact
>> control plane latency (~16% on my setup).  So, if we can avoid some of
>> that hit by considering both patchsets together, I'd vote for that.
>>
> Again, no objection to backporting Ilya's patches, but this is the criteria
> in my mind:
> 
> 1. For this patch, it improves dataplane performance (to me it is in fact
> fixing a performance bug) at the cost of the control plane. This is not
> uncommon in the OVN project history - when adding some new features
> sometimes it does add control plane cost. When this happens, whether to
> accept the new feature is based on the value of the feature v.s. the cost
> of the control plane - a tradeoff has to be made in many cases. In this
> case, for 16% control plane latency increase we get 10x dataplane speed up,
> I would  vote for it.

Hi Han,

I'm not arguing that we don't get dataplane speed up.  I'm sure we will
and it should be a decent amount.  But I've been trying to replicate
this 10x boost with an ovn-kubernetes kind deployment and I didn't
really manage to.

I did this on a 28 core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz in our
lab.

a. build an ovnkube image (with and without your patch); essentially
what our ovn-kubernetes CI jobs do:

https://github.com/ovn-org/ovn/blob/main/.github/workflows/ovn-kubernetes.yml#L39

b. start a kind cluster with the image we built at step "a".

c. edit the ovs-node daemonset definition and increase the resource
limits to 16 CPUs and 2GB RSS (the kind defaults are very low: 0.2 CPU
and 300MB).  This restarts the ovs-daemon pods; wait for those to come
back up.

d. start an nginx deployment with 50 replicas: this is essentially 50
backends accepting http requests on port 80.

e. create a service load balancing http requests received on a port,
e.g. 42424, to port 80 on the pods in the nginx deployment above.

f. create a clients deployment with 100 replicas: 100 pods running the
following bash script, a hack to simulate 40 new connections with
different ephemeral source ports every 2 seconds:

# $SERVICE_VIP is the service VIP from step "e"; 42424 is its port.  Each nc
# opens (and immediately closes) a connection from a fresh ephemeral source
# port; pause 2 seconds after every 40 connections.
i=0
while true; do
  i=$((i+1))
  [ $((i%40)) -eq 0 ] && sleep 2
  nc -z -v $SERVICE_VIP 42424 &
done
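
For reference, a rough sketch of how the per-node numbers below can be
collected in such a kind cluster; the ovn-kubernetes namespace and the
app=ovs-node pod label are assumptions based on the ovs-node daemonset
mentioned in step "c":

for pod in $(kubectl -n ovn-kubernetes get pods -l app=ovs-node -o name); do
  echo "== $pod =="
  # Number of datapath flows currently installed on this node.
  kubectl -n ovn-kubernetes exec "$pod" -- ovs-appctl dpctl/dump-flows | wc -l
  # Flow counts and upcall/revalidator statistics.
  kubectl -n ovn-kubernetes exec "$pod" -- ovs-appctl upcall/show
done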

Comparing the runs with and without your patch applied I see improvement:

1. without this patch applied:
- 17K, 18K, 27K datapath flows on the 3 nodes of the cluster
- OVS cpu usage on the 3 nodes: 30-50%

2. with this patch applied:
- 9K, 9.5K, 9.K datapath flows on the 3 nodes of the cluster
- OVS cpu usage on the 3 nodes: 10-20%

In both cases the whole system is not overloaded and OVS doesn't reach its
CPU limits.

I see clear improvement but I was expecting to see more than this.  It
does look like my test doesn't generate enough unique sources per second
so I was wondering if you could share your test methodology?

For the OVN code changes, they look good to me.  I think we should apply
them to the main branch.  For that:

Acked-by: Dumitru Ceara <dceara@redhat.com>

For backporting the change to stable branches I'd really like to wait until
we first understand the exact scenario in which the megaflow unwildcarding
created such a big performance regression in ovn-kubernetes.

Thanks,
Dumitru

> 
> 2. For backporting Ilya's optimization, I'd evaluate using the criteria for
> any optimization backporting. Say the patch has 10-20% speed up, and we
> think that speed up is critical for a branch for some reason, and the
> changes are well contained, I'd vote for the backport.
> 
> Does this make sense?
> @Numan Siddique <numans@ovn.org> @Mark Michelson <mmichels@redhat.com> any
> comments?
> 
> Thanks,
> Han
> 
>> Given that Ilya's changes seem very contained I don't think this should
>> be a problem.  They just need a proper review first.
>>
>>>> I tried to think of an alternative that wouldn't require doubling the
>>>> number of VIP logical flows but couldn't come up with one.  Maybe you
>>>> have more ideas?
>>>>
>>> This is the approach that has the least impact and changes to existing
>>> pipelines I have thought about so far. The earlier attempt wouldn't
>>> increase the number of flows but it had to compromise to either failing
> to
>>> support a corner case (that we are still not sure if it is really
> useful at
>>> all) or losing HW offloading capability. I'll still think about some
> more
>>> approaches but the NAT/LB/ACL pipelines are very complex now and I'd
> avoid
>>> big changes in case any corner case is broken. I'd spend more time
> sorting
>>> out different use cases and the documentation before making more
>>> optimizations, but I don't think we should also fix this performance
> bug as
>>> soon as we can. (I am also curious why we haven't heard other OVN users
>>> complain about this yet ...)
>>
>> It's a guess, but we probably didn't notice this issue because it should
>> be more visible when there are lots of short lived connections.
>>
>> Thanks,
>> Dumitru
>>
>
Han Zhou Sept. 6, 2022, 8:03 p.m. UTC | #12
On Tue, Sep 6, 2022 at 6:42 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 8/31/22 16:59, Han Zhou wrote:
> > On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >> On 8/31/22 02:17, Han Zhou wrote:
> >>> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <dceara@redhat.com>
wrote:
> >>>>
> >>>> Hi Han,
> >>>>
> >>>> On 8/24/22 08:40, Han Zhou wrote:
> >>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port
to
> >>>>> registers is causing a critical dataplane performance impact to
> >>>>> short-lived connections, because it unwildcards megaflows with exact
> >>>>> match on dst IP and L4 ports. Any new connections with a different
> >>>>> client side L4 port will encounter datapath flow miss and upcall to
> >>>>> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> >>>>> RESTful API calls suffer big performance degredations.
> >>>>>
> >>>>> These fields (dst IP and port) were saved to registers to solve a
> >>>>> problem of LB hairpin use case when different VIPs are sharing
> >>>>> overlapping backend+port [0]. The change [0] might not have as wide
> >>>>> performance impact as it is now because at that time one of the
match
> >>>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established
and
> >>>>> natted traffic, while now the impact is more obvious because
> >>>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> >>>>> configured on the LS) since commit [1], after several other
indirectly
> >>>>> related optimizations and refactors.
> >>>>>
> >>>>> This patch fixes the problem by modifying the priority-120 flows in
> >>>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for
> > any
> >>>>> traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and
L4
> >>>>> port only for traffic matching the LB VIPs, because these are the
ones
> >>>>> that need to be saved for the hairpin purpose. The existed
> > priority-110
> >>>>> flows will match the rest of the traffic just like before but
wouldn't
> >>>>> not save dst IP and L4 port, so any server->client traffic would not
> >>>>> unwildcard megaflows with client side L4 ports.
> >>>>>
> >>>>
> >>>> Just to be 100% sure, the client->server traffic will still generate
up
> >>>> to V x H flows where V is "the number of VIP:port tuples" and H is
"the
> >>>> number of potential (dp_)hash values", right?
> >>>
> >>> Yes, if all the VIPs and backends are being accessed at the same time.
> > This
> >>> is expected. For any given pair of client<->server, regardless of the
> >>> connections (source port number changes), the packets should match the
> > same
> >>> mega-flow and be handled by fastpath (instead of going to userspace,
> > which
> >>> is the behavior before this patch).
> >>>
> >>
> >> Thanks for the confirmation!
> >>
> >>>>
> >>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> > shared
> >>> backends.")
> >>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> > traffic.")
> >>>>>
> >>>>> Signed-off-by: Han Zhou <hzhou@ovn.org>
> >>>>> ---
> >>>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I
forgot
> >>> to commit
> >>>>>           before sending v1.
> >>>>>
> >>>>>  northd/northd.c         | 125
> > +++++++++++++++++++++++++---------------
> >>>>>  northd/ovn-northd.8.xml |  14 ++---
> >>>>>  tests/ovn-northd.at     |  87 ++++++++++------------------
> >>>>>  tests/ovn.at            |  14 ++---
> >>>>>  4 files changed, 118 insertions(+), 122 deletions(-)
> >>>>>
> >>>>> diff --git a/northd/northd.c b/northd/northd.c
> >>>>> index 7e2681865..860641936 100644
> >>>>> --- a/northd/northd.c
> >>>>> +++ b/northd/northd.c
> >>>>> @@ -273,15 +273,15 @@ enum ovn_stage {
> >>>>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |
> >>>        |
> >>>>>   * |    |     REGBIT_ACL_LABEL                         | X |
> >>>        |
> >>>>>   * +----+----------------------------------------------+ X |
> >>>        |
> >>>>> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |
> >>>        |
> >>>>> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |
> >>>        |
> >>>>>   * +----+----------------------------------------------+ E |
> >>>        |
> >>>>> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |
> >>>        |
> >>>>> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |
> >>>        |
> >>>>>   * +----+----------------------------------------------+ 0 |
> >>>        |
> >>>>>   * | R3 |                  ACL LABEL                   |   |
> >>>        |
> >>>>>   *
> >>>
> >
+----+----------------------------------------------+---+------------------+
> >>>>>   * | R4 |                   UNUSED                     |   |
> >>>        |
> >>>>> - * +----+----------------------------------------------+ X |
> >>> ORIG_DIP_IPV6  |
> >>>>> - * | R5 |                   UNUSED                     | X | (>=
> >>> IN_STATEFUL) |
> >>>>> + * +----+----------------------------------------------+ X |
> >>> ORIG_DIP_IPV6(>= |
> >>>>> + * | R5 |                   UNUSED                     | X |
> >>> IN_PRE_STATEFUL) |
> >>>>>   * +----+----------------------------------------------+ R |
> >>>        |
> >>>>>   * | R6 |                   UNUSED                     | E |
> >>>        |
> >>>>>   * +----+----------------------------------------------+ G |
> >>>        |
> >>>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
> >>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1",
> >>> "next;");
> >>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1",
> >>> "next;");
> >>>>>
> >>>>> -    const char *ct_lb_action = features->ct_no_masked_label
> >>>>> -                               ? "ct_lb_mark"
> >>>>> -                               : "ct_lb";
> >>>>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
> >>>>> -    struct ds actions = DS_EMPTY_INITIALIZER;
> >>>>> -    struct ds match = DS_EMPTY_INITIALIZER;
> >>>>> -
> >>>>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
> >>>>> -        ds_clear(&match);
> >>>>> -        ds_clear(&actions);
> >>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 &&
> > %s",
> >>>>> -                      lb_protocols[i]);
> >>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
> >>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> >>>>> -                      lb_protocols[i], ct_lb_action);
> >>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>>>> -                      ds_cstr(&match), ds_cstr(&actions));
> >>>>> -
> >>>>> -        ds_clear(&match);
> >>>>> -        ds_clear(&actions);
> >>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 &&
> > %s",
> >>>>> -                      lb_protocols[i]);
> >>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
> >>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> >>>>> -                      lb_protocols[i], ct_lb_action);
> >>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>>>> -                      ds_cstr(&match), ds_cstr(&actions));
> >>>>> -    }
> >>>>> +    /* Note: priority-120 flows are added in
> >>> build_lb_rules_pre_stateful(). */
> >>>>>
> >>>>> -    ds_clear(&actions);
> >>>>> -    ds_put_format(&actions, "%s;", ct_lb_action);
> >>>>> +    const char *ct_lb_action = features->ct_no_masked_label
> >>>>> +                               ? "ct_lb_mark;"
> >>>>> +                               : "ct_lb;";
> >>>>>
> >>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
> >>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> >>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >>>>>
> >>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
> >>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> >>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >>>>>
> >>>>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets
> > should
> >>> be
> >>>>>       * sent to conntrack for tracking and defragmentation. */
> >>>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
> >>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
> >>>>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
> >>>>>
> >>>>> -    ds_destroy(&actions);
> >>>>> -    ds_destroy(&match);
> >>>>>  }
> >>>>>
> >>>>>  static void
> >>>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct
hmap
> >>> *lflows) {
> >>>>>  }
> >>>>>
> >>>>>  static void
> >>>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool
> >>> ct_lb_mark,
> >>>>> -               struct ds *match, struct ds *action,
> >>>>> -               const struct shash *meter_groups)
> >>>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct
ovn_northd_lb
> >>> *lb,
> >>>>> +                            bool ct_lb_mark, struct ds *match,
> >>>>> +                            struct ds *action)
> >>>>>  {
> >>>>>      for (size_t i = 0; i < lb->n_vips; i++) {
> >>>>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
> >>>>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
> >>>>> -        const char *ip_match = NULL;
> >>>>> -
> >>>>>          ds_clear(action);
> >>>>>          ds_clear(match);
> >>>>> +        const char *ip_match = NULL;
> >>>>>
> >>>>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT
flag.
> >>> Otherwise
> >>>>> -         * the load balanced packet will be committed again in
> >>>>> -         * S_SWITCH_IN_STATEFUL. */
> >>>>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
> >>>>>          /* Store the original destination IP to be used when
> > generating
> >>>>>           * hairpin flows.
> >>>>>           */
> >>>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct
> >>> ovn_northd_lb *lb, bool ct_lb_mark,
> >>>>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16";
",
> >>>>>                            lb_vip->vip_port);
> >>>>>          }
> >>>>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" :
> >>> "ct_lb");
> >>>>> +
> >>>>> +        ds_put_format(match, "%s.dst == %s", ip_match,
> >>> lb_vip->vip_str);
> >>>>> +        if (lb_vip->vip_port) {
> >>>>> +            ds_put_format(match, " && %s.dst == %d", proto,
> >>> lb_vip->vip_port);
> >>>>> +        }
> >>>>> +
> >>>>> +        struct ovn_lflow *lflow_ref = NULL;
> >>>>> +        uint32_t hash = ovn_logical_flow_hash(
> >>>>> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
> >>>>> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL),
> > 120,
> >>>>> +                ds_cstr(match), ds_cstr(action));
> >>>>> +
> >>>>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
> >>>>> +            struct ovn_datapath *od = lb->nb_ls[j];
> >>>>> +
> >>>>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
> >>>>> +                lflow_ref = ovn_lflow_add_at_with_hash(
> >>>>> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>>>> +                        ds_cstr(match), ds_cstr(action),
> >>>>> +                        NULL, NULL, &lb->nlb->header_,
> >>>>> +                        OVS_SOURCE_LOCATOR, hash);
> >>>>> +            }
> >>>>> +        }
> >>>>> +    }
> >>>>
> >>>> I see how this can fix the dataplane issue because we're limiting the
> >>>> number of dp flows but this now induces a measurable penalty on the
> >>>> control plane side because we essentially double the number of
logical
> >>>> flows that ovn-northd generates for load balancer VIPs
> >>>>
> >>>> Using a NB database [2] from a 250 node density heavy test [3] we ran
> >>>> in-house I see the following:
> >>>>
> >>>> a. before this patch (main):
> >>>> - number of SB lflows:         146233
> >>>> - SB size on disk (compacted): 70MB
> >>>> - northd poll loop intervals:  8525ms
> >>>>
> >>>> b. with this patch:
> >>>> - number of SB lflows:         163303
> >>>> - SB size on disk (compacted): 76MB
> >>>> - northd poll loop intervals:  9958ms
> >>>>
> >>>> [2]
> >>>>
> >>>
> >
https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
> >>>> [3]
> >>>>
> >>>
> >
https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
> >>>>
> >>>> This raises some concerns.  However, I went ahead and also ran a test
> >>>> with Ilya's load balancer optimization series [4] (up for review)
and I
> >>> got:
> >>>>
> >>>> c. with this patch + Ilya's series:
> >>>> - number of SB lflows:         163303  << Didn't change compared to
"b"
> >>>> - SB size on disk (compacted): 76MB    << Didn't change compared to
"b"
> >>>> - northd poll loop intervals:  6437ms  << 25% better than "a"
> >>>>
> >>>> Then I ran a test with main + Ilya's series.
> >>>>
> >>>> d. main + Ilya's series:
> >>>> - number of SB lflows:         146233  << Didn't change compared to
"a"
> >>>> - SB size on disk (compacted): 70MB    << Didn't change compared to
"a"
> >>>> - northd poll loop intervals:  5413ms  << 35% better than "a"
> >>>>
> >>>> [4]
> > https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
> >>>>
> >>>> I guess what I'm trying to say is that if go ahead with the approach
> > you
> >>>> proposed and especially if we want to backport it to the LTS we
should
> >>>> consider accepting/backporting other optimizations too (such as
Ilya's)
> >>>> that would compensate (and more) for the control plane performance
hit.
> >>>>
> >>> Thanks for the scale test!
> >>> I do expect some cost in the control plane, and your test result is
> > aligned
> >>> with my expectation. It is unpleasant to see ~10% SB size increase (in
> > the
> >>> density-heavy and LB/service-heavy test scenario), but if it is
> > necessary
> >>> to fix the dataplane performance bug I think it is worth it. After
all,
> >>> everything the control plane does is to serve the dataplane, right?
The
> >>> resulting dataplane performance boost is at least 10x for short-lived
> >>> connection scenarios.
> >>> It's great to see Ilya's patch helps the LB scale, and I have no
> > objection
> >>> to backport them if it is considered necessary to stable branches, but
> > I'd
> >>> consider it independent of the current patch.
> >>>
> >>
> >> They are independent.  But the problem is your fix will visibily impact
> >> control plane latency (~16% on my setup).  So, if we can avoid some of
> >> that hit by considering both patchsets together, I'd vote for that.
> >>
> > Again, no objection to backporting Ilya's patches, but this is the
criteria
> > in my mind:
> >
> > 1. For this patch, it improves dataplane performance (to me it is in
fact
> > fixing a performance bug) at the cost of the control plane. This is not
> > uncommon in the OVN project history - when adding some new features
> > sometimes it does add control plane cost. When this happens, whether to
> > accept the new feature is based on the value of the feature v.s. the
cost
> > of the control plane - a tradeoff has to be made in many cases. In this
> > case, for 16% control plane latency increase we get 10x dataplane speed
up,
> > I would  vote for it.
>
> Hi Han,
>
> I'm not arguing that we don't get dataplane speed up.  I'm sure we will
> and it should be a decent amount.  But I've been trying to replicate
> this 10x boost with an ovn-kubernetes kind deployment and I didn't
> really manage to.
>
> I did this on a 28 core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz in our
> lab.
>
> a. build an ovnkube image (with and without your patch); essentially
> what our ovn-kubernetes CI jobs do:
>
>
https://github.com/ovn-org/ovn/blob/main/.github/workflows/ovn-kubernetes.yml#L39
>
> b. start a kind cluster with the image we built at step "a".
>
> c. edit the ovs-node daemonset definition and increase the resource
> limits to 16 CPUs and 2GB RSS (the kind defaults are very low: 0.2CPU
> and 300MB).  This restarts the ovs-daemon pods, wait for those to come
> back up.
>
> d. start an nginx deployment with 50 replicas: this is essentially 50
> backends accepting http requests on port 80.
>
> e. create a service load balancing http requests received on a port,
> e.g. 42424, to port 80 on the pods in the nginx deployment above.
>
> f. create a clients deployment with 100 replicas: 100 pods running the
> following bash script, a hack to simulate 40 new connections with
> different ephemeral source ports every 2 seconds:
>
> i=0
> while true; do
>   i=$((i+1))
>   [ $((i%40)) -eq 0 ] && sleep 2
>   nc -z -v $SERVICE_VIP 42424 &
> done
>
> Comparing the runs with and without your patch applied I see improvement:
>
> 1. without this patch applied:
> - 17K, 18K, 27K datapath flows on the 3 nodes of the cluster
> - OVS cpu usage on the 3 nodes: 30-50%
>
> 2. with this patch applied:
> - 9K, 9.5K, 9.K datapath flows on the 3 nodes of the cluster
> - OVS cpu usage on the 3 nodes: 10-20%
>
> In both cases the whole system is not overloaded, OVS doesn't reach it's
> CPU limits.
>
> I see clear improvement but I was expecting to see more than this.  It
> does look like my test doesn't generate enough unique sources per second
> so I was wondering if you could share your test methodology?
>

Thanks Dumitru for the performance test. I guess if you reduce the number
of clients while increasing the connection rate per client you would see
a bigger difference. And I am not sure why you saw 10-20% OVS CPU usage
with the patch applied. If all the 100 clients keep sending packets you
should see a constant 100xN (N=2 or 3) megaflows instead of 9K (unless
there are other problems causing flow misses). BTW, I assume you don't
have HW-offload enabled, right? Could you also try sending to a backend
directly instead of the VIP and see the difference?
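
A rough way to double check this on one of the nodes (assuming the kernel
datapath and that ovs-appctl is available in the ovs pod) would be
something like:

ovs-appctl dpctl/dump-flows | wc -l   # current megaflow count in the datapath
ovs-appctl upcall/show                # datapath flow stats as seen by the revalidators

With the patch I would expect the dump-flows count to stay roughly
constant while the clients are running, instead of growing with every
new client source port.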

Here is a brief test I ran, with the patch below applied on top of the
system test case "load-balancing". It is basically a single client
sending requests to one of the LB backends using "ab" (the Apache
benchmark tool).
-------------------------------------------------------------------------
diff --git a/tests/system-ovn.at b/tests/system-ovn.at
index 992813614..3adc72da1 100644
--- a/tests/system-ovn.at
+++ b/tests/system-ovn.at
@@ -1430,10 +1430,12 @@ ovn-nbctl lsp-add bar bar3 \
 # Config OVN load-balancer with a VIP.
 ovn-nbctl lb-add lb1 30.0.0.1 "172.16.1.2,172.16.1.3,172.16.1.4"
 ovn-nbctl ls-lb-add foo lb1
+ovn-nbctl ls-lb-add bar lb1

 # Create another load-balancer with another VIP.
 lb2_uuid=`ovn-nbctl create load_balancer name=lb2 vips:30.0.0.3="172.16.1.2,172.16.1.3,172.16.1.4"`
 ovn-nbctl ls-lb-add foo lb2
+ovn-nbctl ls-lb-add bar lb2

 # Config OVN load-balancer with another VIP (this time with ports).
 ovn-nbctl set load_balancer $lb2_uuid vips:'"30.0.0.2:8000"'='"172.16.1.2:80,172.16.1.3:80,172.16.1.4:80"'
@@ -1503,6 +1505,10 @@ tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(s
 tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>)
 ])

+#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4 && ip4.dst == {172.16.1.2,172.16.1.3,172.16.1.4} && tcp.dst == 80" allow-related
+#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4" allow-related
+#check ovn-nbctl --wait=hv sync
+
 dnl Test load-balancing that includes L4 ports in NAT.
 dnl Each server should have at least one connection.
 OVS_WAIT_FOR_OUTPUT([
@@ -1516,6 +1522,15 @@ tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(s
 tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>)
 ])

+ip netns exec foo1 ab -n 10000 -c 1 -q 172.16.1.2:80/
+
 # Configure selection_fields.
 ovn-nbctl set load_balancer $lb2_uuid selection_fields="ip_src,ip_dst,tp_src,tp_dst"
 OVS_WAIT_UNTIL([
-----------------------------------------------------------------------

Tests were run on 24 cores x Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz.
The test results are:

// Before the change
Concurrency Level:      1
Time taken for tests:   12.913 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      21450000 bytes
HTML transferred:       19590000 bytes
Requests per second:    774.41 [#/sec] (mean)
Time per request:       1.291 [ms] (mean)
Time per request:       1.291 [ms] (mean, across all concurrent requests)
Transfer rate:          1622.18 [Kbytes/sec] received

OVS CPU: ~280%

// After the change
Concurrency Level:      1
Time taken for tests:   5.629 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      21450000 bytes
HTML transferred:       19590000 bytes
Requests per second:    1776.62 [#/sec] (mean)
Time per request:       0.563 [ms] (mean)
Time per request:       0.563 [ms] (mean, across all concurrent requests)
Transfer rate:          3721.54 [Kbytes/sec] received

OVS CPU: ~0%

We can see that the RPS is 2.5x higher (1776 vs. 774) even with a single
thread (-c 1). It is not the 10x I guessed earlier, but keep in mind that
each "ab" transaction includes a complete TCP handshake, the
request/response, and finally the teardown (usually 6 packets per
connection), so at least half of the packets were already going through
the fast path even before the change. But for the "first" packet of each
connection the speed-up should be at the 10x level, which we can easily
tell with a ping test: compare the RTT of the first packet, which goes
through the slow path, against the later packets that only take the fast
path.
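
For example, as a quick sketch in the same namespace setup:

ip netns exec foo1 ping -c 5 172.16.1.2

The first reply typically shows a much larger RTT because the first
packet of a new flow is handled in the slow path (upcall to
ovs-vswitchd), while the later replies only take the fast path.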

And the CPU utilization for OVS is huge before the change, while after
the change it is ~0. So in more complex scenarios, with the system
overloaded, I believe the difference would be even bigger.

> For the OVN code changes, they look good to me.  I think we should apply
> them to the main branch.  For that:
>
> Acked-by: Dumitru Ceara <dceara@redhat.com>
>
> For backporting the change to stable branches I'd really like wait to
> first understand the exact scenario in which the megaflow undwildcarding
> created such a big performance regression in ovn-kubernetes.

Thank you for all the reviews and tests. I applied the patch to main and
am waiting for your confirmation for the backports.
For branch-22.09, I think we should probably backport it now, before it
is released?

Han

>
> Thanks,
> Dumitru
>
> >
> > 2. For backporting Ilya's optimization, I'd evaluate using the criteria
for
> > any optimization backporting. Say the patch has 10-20% speed up, and we
> > think that speed up is critical for a branch for some reason, and the
> > changes are well contained, I'd vote for the backport.
> >
> > Does this make sense?
> > @Numan Siddique <numans@ovn.org> @Mark Michelson <mmichels@redhat.com>
any
> > comments?
> >
> > Thanks,
> > Han
> >
> >> Given that Ilya's changes seem very contained I don't think this should
> >> be a problem.  They just need a proper review first.
> >>
> >>>> I tried to think of an alternative that wouldn't require doubling the
> >>>> number of VIP logical flows but couldn't come up with one.  Maybe you
> >>>> have more ideas?
> >>>>
> >>> This is the approach that has the least impact and changes to existing
> >>> pipelines I have thought about so far. The earlier attempt wouldn't
> >>> increase the number of flows but it had to compromise to either
failing
> > to
> >>> support a corner case (that we are still not sure if it is really
> > useful at
> >>> all) or losing HW offloading capability. I'll still think about some
> > more
> >>> approaches but the NAT/LB/ACL pipelines are very complex now and I'd
> > avoid
> >>> big changes in case any corner case is broken. I'd spend more time
> > sorting
> >>> out different use cases and the documentation before making more
> >>> optimizations, but I don't think we should also fix this performance
> > bug as
> >>> soon as we can. (I am also curious why we haven't heard other OVN
users
> >>> complain about this yet ...)
> >>
> >> It's a guess, but we probably didn't notice this issue because it
should
> >> be more visible when there are lots of short lived connections.
> >>
> >> Thanks,
> >> Dumitru
> >>
> >
>
Dumitru Ceara Sept. 7, 2022, 3:47 p.m. UTC | #13
On 9/6/22 22:03, Han Zhou wrote:
> On Tue, Sep 6, 2022 at 6:42 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> On 8/31/22 16:59, Han Zhou wrote:
>>> On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>>>
>>>> On 8/31/22 02:17, Han Zhou wrote:
>>>>> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <dceara@redhat.com>
> wrote:
>>>>>>
>>>>>> Hi Han,
>>>>>>
>>>>>> On 8/24/22 08:40, Han Zhou wrote:
>>>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port
> to
>>>>>>> registers is causing a critical dataplane performance impact to
>>>>>>> short-lived connections, because it unwildcards megaflows with exact
>>>>>>> match on dst IP and L4 ports. Any new connections with a different
>>>>>>> client side L4 port will encounter datapath flow miss and upcall to
>>>>>>> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
>>>>>>> RESTful API calls suffer big performance degredations.
>>>>>>>
>>>>>>> These fields (dst IP and port) were saved to registers to solve a
>>>>>>> problem of LB hairpin use case when different VIPs are sharing
>>>>>>> overlapping backend+port [0]. The change [0] might not have as wide
>>>>>>> performance impact as it is now because at that time one of the
> match
>>>>>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established
> and
>>>>>>> natted traffic, while now the impact is more obvious because
>>>>>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>>>>>>> configured on the LS) since commit [1], after several other
> indirectly
>>>>>>> related optimizations and refactors.
>>>>>>>
>>>>>>> This patch fixes the problem by modifying the priority-120 flows in
>>>>>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for
>>> any
>>>>>>> traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and
> L4
>>>>>>> port only for traffic matching the LB VIPs, because these are the
> ones
>>>>>>> that need to be saved for the hairpin purpose. The existed
>>> priority-110
>>>>>>> flows will match the rest of the traffic just like before but
> wouldn't
>>>>>>> not save dst IP and L4 port, so any server->client traffic would not
>>>>>>> unwildcard megaflows with client side L4 ports.
>>>>>>>
>>>>>>
>>>>>> Just to be 100% sure, the client->server traffic will still generate
> up
>>>>>> to V x H flows where V is "the number of VIP:port tuples" and H is
> "the
>>>>>> number of potential (dp_)hash values", right?
>>>>>
>>>>> Yes, if all the VIPs and backends are being accessed at the same time.
>>> This
>>>>> is expected. For any given pair of client<->server, regardless of the
>>>>> connections (source port number changes), the packets should match the
>>> same
>>>>> mega-flow and be handled by fastpath (instead of going to userspace,
>>> which
>>>>> is the behavior before this patch).
>>>>>
>>>>
>>>> Thanks for the confirmation!
>>>>
>>>>>>
>>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
>>> shared
>>>>> backends.")
>>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
>>> traffic.")
>>>>>>>
>>>>>>> Signed-off-by: Han Zhou <hzhou@ovn.org>
>>>>>>> ---
>>>>>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I
> forgot
>>>>> to commit
>>>>>>>           before sending v1.
>>>>>>>
>>>>>>>  northd/northd.c         | 125
>>> +++++++++++++++++++++++++---------------
>>>>>>>  northd/ovn-northd.8.xml |  14 ++---
>>>>>>>  tests/ovn-northd.at     |  87 ++++++++++------------------
>>>>>>>  tests/ovn.at            |  14 ++---
>>>>>>>  4 files changed, 118 insertions(+), 122 deletions(-)
>>>>>>>
>>>>>>> diff --git a/northd/northd.c b/northd/northd.c
>>>>>>> index 7e2681865..860641936 100644
>>>>>>> --- a/northd/northd.c
>>>>>>> +++ b/northd/northd.c
>>>>>>> @@ -273,15 +273,15 @@ enum ovn_stage {
>>>>>>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |
>>>>>        |
>>>>>>>   * |    |     REGBIT_ACL_LABEL                         | X |
>>>>>        |
>>>>>>>   * +----+----------------------------------------------+ X |
>>>>>        |
>>>>>>> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |
>>>>>        |
>>>>>>> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |
>>>>>        |
>>>>>>>   * +----+----------------------------------------------+ E |
>>>>>        |
>>>>>>> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |
>>>>>        |
>>>>>>> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |
>>>>>        |
>>>>>>>   * +----+----------------------------------------------+ 0 |
>>>>>        |
>>>>>>>   * | R3 |                  ACL LABEL                   |   |
>>>>>        |
>>>>>>>   *
>>>>>
>>>
> +----+----------------------------------------------+---+------------------+
>>>>>>>   * | R4 |                   UNUSED                     |   |
>>>>>        |
>>>>>>> - * +----+----------------------------------------------+ X |
>>>>> ORIG_DIP_IPV6  |
>>>>>>> - * | R5 |                   UNUSED                     | X | (>=
>>>>> IN_STATEFUL) |
>>>>>>> + * +----+----------------------------------------------+ X |
>>>>> ORIG_DIP_IPV6(>= |
>>>>>>> + * | R5 |                   UNUSED                     | X |
>>>>> IN_PRE_STATEFUL) |
>>>>>>>   * +----+----------------------------------------------+ R |
>>>>>        |
>>>>>>>   * | R6 |                   UNUSED                     | E |
>>>>>        |
>>>>>>>   * +----+----------------------------------------------+ G |
>>>>>        |
>>>>>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
>>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1",
>>>>> "next;");
>>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1",
>>>>> "next;");
>>>>>>>
>>>>>>> -    const char *ct_lb_action = features->ct_no_masked_label
>>>>>>> -                               ? "ct_lb_mark"
>>>>>>> -                               : "ct_lb";
>>>>>>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
>>>>>>> -    struct ds actions = DS_EMPTY_INITIALIZER;
>>>>>>> -    struct ds match = DS_EMPTY_INITIALIZER;
>>>>>>> -
>>>>>>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
>>>>>>> -        ds_clear(&match);
>>>>>>> -        ds_clear(&actions);
>>>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 &&
>>> %s",
>>>>>>> -                      lb_protocols[i]);
>>>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
>>>>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>>>>>> -                      lb_protocols[i], ct_lb_action);
>>>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>>>>>> -
>>>>>>> -        ds_clear(&match);
>>>>>>> -        ds_clear(&actions);
>>>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 &&
>>> %s",
>>>>>>> -                      lb_protocols[i]);
>>>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
>>>>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
>>>>>>> -                      lb_protocols[i], ct_lb_action);
>>>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
>>>>>>> -    }
>>>>>>> +    /* Note: priority-120 flows are added in
>>>>> build_lb_rules_pre_stateful(). */
>>>>>>>
>>>>>>> -    ds_clear(&actions);
>>>>>>> -    ds_put_format(&actions, "%s;", ct_lb_action);
>>>>>>> +    const char *ct_lb_action = features->ct_no_masked_label
>>>>>>> +                               ? "ct_lb_mark;"
>>>>>>> +                               : "ct_lb;";
>>>>>>>
>>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
>>>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>>>>>
>>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
>>>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
>>>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
>>>>>>>
>>>>>>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets
>>> should
>>>>> be
>>>>>>>       * sent to conntrack for tracking and defragmentation. */
>>>>>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
>>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
>>>>>>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
>>>>>>>
>>>>>>> -    ds_destroy(&actions);
>>>>>>> -    ds_destroy(&match);
>>>>>>>  }
>>>>>>>
>>>>>>>  static void
>>>>>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct
> hmap
>>>>> *lflows) {
>>>>>>>  }
>>>>>>>
>>>>>>>  static void
>>>>>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool
>>>>> ct_lb_mark,
>>>>>>> -               struct ds *match, struct ds *action,
>>>>>>> -               const struct shash *meter_groups)
>>>>>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct
> ovn_northd_lb
>>>>> *lb,
>>>>>>> +                            bool ct_lb_mark, struct ds *match,
>>>>>>> +                            struct ds *action)
>>>>>>>  {
>>>>>>>      for (size_t i = 0; i < lb->n_vips; i++) {
>>>>>>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
>>>>>>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
>>>>>>> -        const char *ip_match = NULL;
>>>>>>> -
>>>>>>>          ds_clear(action);
>>>>>>>          ds_clear(match);
>>>>>>> +        const char *ip_match = NULL;
>>>>>>>
>>>>>>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT
> flag.
>>>>> Otherwise
>>>>>>> -         * the load balanced packet will be committed again in
>>>>>>> -         * S_SWITCH_IN_STATEFUL. */
>>>>>>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
>>>>>>>          /* Store the original destination IP to be used when
>>> generating
>>>>>>>           * hairpin flows.
>>>>>>>           */
>>>>>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct
>>>>> ovn_northd_lb *lb, bool ct_lb_mark,
>>>>>>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16";
> ",
>>>>>>>                            lb_vip->vip_port);
>>>>>>>          }
>>>>>>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" :
>>>>> "ct_lb");
>>>>>>> +
>>>>>>> +        ds_put_format(match, "%s.dst == %s", ip_match,
>>>>> lb_vip->vip_str);
>>>>>>> +        if (lb_vip->vip_port) {
>>>>>>> +            ds_put_format(match, " && %s.dst == %d", proto,
>>>>> lb_vip->vip_port);
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        struct ovn_lflow *lflow_ref = NULL;
>>>>>>> +        uint32_t hash = ovn_logical_flow_hash(
>>>>>>> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
>>>>>>> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL),
>>> 120,
>>>>>>> +                ds_cstr(match), ds_cstr(action));
>>>>>>> +
>>>>>>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
>>>>>>> +            struct ovn_datapath *od = lb->nb_ls[j];
>>>>>>> +
>>>>>>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
>>>>>>> +                lflow_ref = ovn_lflow_add_at_with_hash(
>>>>>>> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
>>>>>>> +                        ds_cstr(match), ds_cstr(action),
>>>>>>> +                        NULL, NULL, &lb->nlb->header_,
>>>>>>> +                        OVS_SOURCE_LOCATOR, hash);
>>>>>>> +            }
>>>>>>> +        }
>>>>>>> +    }
>>>>>>
>>>>>> I see how this can fix the dataplane issue because we're limiting the
>>>>>> number of dp flows but this now induces a measurable penalty on the
>>>>>> control plane side because we essentially double the number of
> logical
>>>>>> flows that ovn-northd generates for load balancer VIPs
>>>>>>
>>>>>> Using a NB database [2] from a 250 node density heavy test [3] we ran
>>>>>> in-house I see the following:
>>>>>>
>>>>>> a. before this patch (main):
>>>>>> - number of SB lflows:         146233
>>>>>> - SB size on disk (compacted): 70MB
>>>>>> - northd poll loop intervals:  8525ms
>>>>>>
>>>>>> b. with this patch:
>>>>>> - number of SB lflows:         163303
>>>>>> - SB size on disk (compacted): 76MB
>>>>>> - northd poll loop intervals:  9958ms
>>>>>>
>>>>>> [2]
>>>>>>
>>>>>
>>>
> https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
>>>>>> [3]
>>>>>>
>>>>>
>>>
> https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
>>>>>>
>>>>>> This raises some concerns.  However, I went ahead and also ran a test
>>>>>> with Ilya's load balancer optimization series [4] (up for review)
> and I
>>>>> got:
>>>>>>
>>>>>> c. with this patch + Ilya's series:
>>>>>> - number of SB lflows:         163303  << Didn't change compared to
> "b"
>>>>>> - SB size on disk (compacted): 76MB    << Didn't change compared to
> "b"
>>>>>> - northd poll loop intervals:  6437ms  << 25% better than "a"
>>>>>>
>>>>>> Then I ran a test with main + Ilya's series.
>>>>>>
>>>>>> d. main + Ilya's series:
>>>>>> - number of SB lflows:         146233  << Didn't change compared to
> "a"
>>>>>> - SB size on disk (compacted): 70MB    << Didn't change compared to
> "a"
>>>>>> - northd poll loop intervals:  5413ms  << 35% better than "a"
>>>>>>
>>>>>> [4]
>>> https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
>>>>>>
>>>>>> I guess what I'm trying to say is that if go ahead with the approach
>>> you
>>>>>> proposed and especially if we want to backport it to the LTS we
> should
>>>>>> consider accepting/backporting other optimizations too (such as
> Ilya's)
>>>>>> that would compensate (and more) for the control plane performance
> hit.
>>>>>>
>>>>> Thanks for the scale test!
>>>>> I do expect some cost in the control plane, and your test result is
>>> aligned
>>>>> with my expectation. It is unpleasant to see ~10% SB size increase (in
>>> the
>>>>> density-heavy and LB/service-heavy test scenario), but if it is
>>> necessary
>>>>> to fix the dataplane performance bug I think it is worth it. After
> all,
>>>>> everything the control plane does is to serve the dataplane, right?
> The
>>>>> resulting dataplane performance boost is at least 10x for short-lived
>>>>> connection scenarios.
>>>>> It's great to see Ilya's patch helps the LB scale, and I have no
>>> objection
>>>>> to backport them if it is considered necessary to stable branches, but
>>> I'd
>>>>> consider it independent of the current patch.
>>>>>
>>>>
>>>> They are independent.  But the problem is your fix will visibily impact
>>>> control plane latency (~16% on my setup).  So, if we can avoid some of
>>>> that hit by considering both patchsets together, I'd vote for that.
>>>>
>>> Again, no objection to backporting Ilya's patches, but this is the
> criteria
>>> in my mind:
>>>
>>> 1. For this patch, it improves dataplane performance (to me it is in
> fact
>>> fixing a performance bug) at the cost of the control plane. This is not
>>> uncommon in the OVN project history - when adding some new features
>>> sometimes it does add control plane cost. When this happens, whether to
>>> accept the new feature is based on the value of the feature v.s. the
> cost
>>> of the control plane - a tradeoff has to be made in many cases. In this
>>> case, for 16% control plane latency increase we get 10x dataplane speed
> up,
>>> I would  vote for it.
>>
>> Hi Han,
>>
>> I'm not arguing that we don't get dataplane speed up.  I'm sure we will
>> and it should be a decent amount.  But I've been trying to replicate
>> this 10x boost with an ovn-kubernetes kind deployment and I didn't
>> really manage to.
>>
>> I did this on a 28 core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz in our
>> lab.
>>
>> a. build an ovnkube image (with and without your patch); essentially
>> what our ovn-kubernetes CI jobs do:
>>
>>
> https://github.com/ovn-org/ovn/blob/main/.github/workflows/ovn-kubernetes.yml#L39
>>
>> b. start a kind cluster with the image we built at step "a".
>>
>> c. edit the ovs-node daemonset definition and increase the resource
>> limits to 16 CPUs and 2GB RSS (the kind defaults are very low: 0.2CPU
>> and 300MB).  This restarts the ovs-daemon pods, wait for those to come
>> back up.
>>
>> d. start an nginx deployment with 50 replicas: this is essentially 50
>> backends accepting http requests on port 80.
>>
>> e. create a service load balancing http requests received on a port,
>> e.g. 42424, to port 80 on the pods in the nginx deployment above.
>>
>> f. create a clients deployment with 100 replicas: 100 pods running the
>> following bash script, a hack to simulate 40 new connections with
>> different ephemeral source ports every 2 seconds:
>>
>> i=0
>> while true; do
>>   i=$((i+1))
>>   [ $((i%40)) -eq 0 ] && sleep 2
>>   nc -z -v $SERVICE_VIP 42424 &
>> done
>>
>> Comparing the runs with and without your patch applied I see improvement:
>>
>> 1. without this patch applied:
>> - 17K, 18K, 27K datapath flows on the 3 nodes of the cluster
>> - OVS cpu usage on the 3 nodes: 30-50%
>>
>> 2. with this patch applied:
>> - 9K, 9.5K, 9.K datapath flows on the 3 nodes of the cluster
>> - OVS cpu usage on the 3 nodes: 10-20%
>>
>> In both cases the whole system is not overloaded, OVS doesn't reach it's
>> CPU limits.
>>
>> I see clear improvement but I was expecting to see more than this.  It
>> does look like my test doesn't generate enough unique sources per second
>> so I was wondering if you could share your test methodology?
>>
> 
> Thanks Dumitru for the performance test. I guess if you reduce the number
> of clients while increasing the connection speed per-client you would see
> more differences. And I am not sure why you saw 10-20% OVS cpu usage with
> the patch applied. If all the 100 clients keep sending packets you should
> have constant 100xN (N=2 or 3) megaflows instead of 9K (unless there are
> other problems causing flow miss). BTW, I assume you don't have HW-offload
> enabled, right? Could you also try sending to a backend directly instead of
> VIP and see the difference?

No HWOL enabled on my platform.  I think my setup was still too
overloaded (too many containers and virtual nodes on the same physical
machine).

> 
> Here is a brief test I had, with the below patch based on the system test
> case "load-balancing". It is basically a single client sending requests to
> one of the LB backends using the "ab - apache benchmark" test.
> -------------------------------------------------------------------------
> diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> index 992813614..3adc72da1 100644
> --- a/tests/system-ovn.at
> +++ b/tests/system-ovn.at
> @@ -1430,10 +1430,12 @@ ovn-nbctl lsp-add bar bar3 \
>  # Config OVN load-balancer with a VIP.
>  ovn-nbctl lb-add lb1 30.0.0.1 "172.16.1.2,172.16.1.3,172.16.1.4"
>  ovn-nbctl ls-lb-add foo lb1
> +ovn-nbctl ls-lb-add bar lb1
> 
>  # Create another load-balancer with another VIP.
>  lb2_uuid=`ovn-nbctl create load_balancer name=lb2
> vips:30.0.0.3="172.16.1.2,172.16.1.3,172.16.1.4"`
>  ovn-nbctl ls-lb-add foo lb2
> +ovn-nbctl ls-lb-add bar lb2
> 
>  # Config OVN load-balancer with another VIP (this time with ports).
>  ovn-nbctl set load_balancer $lb2_uuid vips:'"30.0.0.2:8000"'='"
> 172.16.1.2:80,172.16.1.3:80,172.16.1.4:80"'
> @@ -1503,6 +1505,10 @@
> tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(s
>  tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>)
>  ])
> 
> +#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4 &&
> ip4.dst == {172.16.1.2,172.16.1.3,172.16.1.4} && tcp.dst == 80"
> allow-related
> +#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4"
> allow-related
> +#check ovn-nbctl --wait=hv sync
> +
>  dnl Test load-balancing that includes L4 ports in NAT.
>  dnl Each server should have at least one connection.
>  OVS_WAIT_FOR_OUTPUT([
> @@ -1516,6 +1522,15 @@
> tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(s
>  tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>)
>  ])
> 
> +ip netns exec foo1 ab -n 10000 -c 1 -q 172.16.1.2:80/
> +
>  # Configure selection_fields.
>  ovn-nbctl set load_balancer $lb2_uuid
> selection_fields="ip_src,ip_dst,tp_src,tp_dst"
>  OVS_WAIT_UNTIL([
> -----------------------------------------------------------------------
> 
> Tests on 24 cores x Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
> The test results are:
> 
> // Before the change
> Concurrency Level:      1
> Time taken for tests:   12.913 seconds
> Complete requests:      10000
> Failed requests:        0
> Total transferred:      21450000 bytes
> HTML transferred:       19590000 bytes
> Requests per second:    774.41 [#/sec] (mean)
> Time per request:       1.291 [ms] (mean)
> Time per request:       1.291 [ms] (mean, across all concurrent requests)
> Transfer rate:          1622.18 [Kbytes/sec] received
> 
> OVS CPU: ~280%
> 
> // After the change
> Concurrency Level:      1
> Time taken for tests:   5.629 seconds
> Complete requests:      10000
> Failed requests:        0
> Total transferred:      21450000 bytes
> HTML transferred:       19590000 bytes
> Requests per second:    1776.62 [#/sec] (mean)
> Time per request:       0.563 [ms] (mean)
> Time per request:       0.563 [ms] (mean, across all concurrent requests)
> Transfer rate:          3721.54 [Kbytes/sec] received
> 
> OVS CPU: ~0%
> 
> We can see that the RPS is 2.5x higher (1776 v.s. 774) even with a single
> thread (-c 1). It is not 10x as what I guessed earlier, but keep in mind
> that the "ab" test has a complete TCP handshake and then request/response,
> and finally tear down, including usually 6 packets per connection, so at
> least half of the packets are still going through fast-path even before the
> change. But for the "first" packets of each connection, the speed up should
> be at 10x level, which we can easily tell by a ping test and see the RTT
> difference between the first packet that goes through the slow-path v.s.
> the later packets that go through the fast path only.

I changed the test to use netperf TCP_CRR (connect + request + response)
on an ovn-kubernetes kind cluster:

https://github.com/jtaleric/k8s-netperf/pull/8
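
Conceptually the measurement is just netperf's TCP_CRR transaction rate
between a client pod and the server, something along the lines of the
following (the exact options the k8s-netperf wrapper passes may differ,
and $SERVER_IP is just a placeholder here):

netperf -H $SERVER_IP -t TCP_CRR -l 30

where each transaction is a full connect + request + response.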

> 
> And the CPU utilization for OVS is huge before the change while after the
> change it is 0. So if we combine more complex scenarios with the system
> overloaded I believe the difference should also be even bigger.
> 

I see a big difference in CPU usage too.  It goes up to ~90% CPU (almost
one whole CPU) just for processing CRR between a client and a server,
while with your patch applied the CPU usage for OVS is negligible.

The number of DP flows also drops drastically, as expected.

Comparing the number of "transactions" (connect + request + response)
between a single pair of client + server I see on my machine:

a. without your patch: ~2K TPS
b. with your patch: ~6K TPS

Probably a coincidence, because our platforms and tests are very
different, but I also see a 2-3x boost in transactions per second.

>> For the OVN code changes, they look good to me.  I think we should apply
>> them to the main branch.  For that:
>>
>> Acked-by: Dumitru Ceara <dceara@redhat.com>
>>
>> For backporting the change to stable branches I'd really like wait to
>> first understand the exact scenario in which the megaflow undwildcarding
>> created such a big performance regression in ovn-kubernetes.
> 
> Thank you for all the reviews and tests. I applied to main and am waiting
> for your confirmation for backporting.
> For branch-22.09 I think maybe we should backport for now before it is
> released?
> 

In light of your results and based on the findings above I think we
should accept the backport even if it might cause a small control plane
degradation.  This is also keeping in mind Ilya's northd patches, which I
think are good candidates for stable branches too.  But I'll let Mark
and Numan, from their maintainer perspective, share their thoughts on
this matter.

Regards,
Dumitru

> Han
> 
>>
>> Thanks,
>> Dumitru
>>
>>>
>>> 2. For backporting Ilya's optimization, I'd evaluate using the criteria
> for
>>> any optimization backporting. Say the patch has 10-20% speed up, and we
>>> think that speed up is critical for a branch for some reason, and the
>>> changes are well contained, I'd vote for the backport.
>>>
>>> Does this make sense?
>>> @Numan Siddique <numans@ovn.org> @Mark Michelson <mmichels@redhat.com>
> any
>>> comments?
>>>
>>> Thanks,
>>> Han
>>>
>>>> Given that Ilya's changes seem very contained I don't think this should
>>>> be a problem.  They just need a proper review first.
>>>>
>>>>>> I tried to think of an alternative that wouldn't require doubling the
>>>>>> number of VIP logical flows but couldn't come up with one.  Maybe you
>>>>>> have more ideas?
>>>>>>
>>>>> This is the approach that has the least impact and changes to existing
>>>>> pipelines I have thought about so far. The earlier attempt wouldn't
>>>>> increase the number of flows but it had to compromise to either
> failing
>>> to
>>>>> support a corner case (that we are still not sure if it is really
>>> useful at
>>>>> all) or losing HW offloading capability. I'll still think about some
>>> more
>>>>> approaches but the NAT/LB/ACL pipelines are very complex now and I'd
>>> avoid
>>>>> big changes in case any corner case is broken. I'd spend more time
>>> sorting
>>>>> out different use cases and the documentation before making more
>>>>> optimizations, but I don't think we should also fix this performance
>>> bug as
>>>>> soon as we can. (I am also curious why we haven't heard other OVN
> users
>>>>> complain about this yet ...)
>>>>
>>>> It's a guess, but we probably didn't notice this issue because it
> should
>>>> be more visible when there are lots of short lived connections.
>>>>
>>>> Thanks,
>>>> Dumitru
>>>>
>>>
>>
>
Han Zhou Sept. 9, 2022, 12:24 a.m. UTC | #14
On Wed, Sep 7, 2022 at 8:47 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 9/6/22 22:03, Han Zhou wrote:
> > On Tue, Sep 6, 2022 at 6:42 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >> On 8/31/22 16:59, Han Zhou wrote:
> >>> On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <dceara@redhat.com>
wrote:
> >>>>
> >>>> On 8/31/22 02:17, Han Zhou wrote:
> >>>>> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <dceara@redhat.com>
> > wrote:
> >>>>>>
> >>>>>> Hi Han,
> >>>>>>
> >>>>>> On 8/24/22 08:40, Han Zhou wrote:
> >>>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and
Port
> > to
> >>>>>>> registers is causing a critical dataplane performance impact to
> >>>>>>> short-lived connections, because it unwildcards megaflows with
exact
> >>>>>>> match on dst IP and L4 ports. Any new connections with a different
> >>>>>>> client side L4 port will encounter datapath flow miss and upcall
to
> >>>>>>> ovs-vswitchd, which makes typical use cases such as HTTP1.0 based
> >>>>>>> RESTful API calls suffer big performance degredations.
> >>>>>>>
> >>>>>>> These fields (dst IP and port) were saved to registers to solve a
> >>>>>>> problem of LB hairpin use case when different VIPs are sharing
> >>>>>>> overlapping backend+port [0]. The change [0] might not have as
wide
> >>>>>>> performance impact as it is now because at that time one of the
> > match
> >>>>>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established
> > and
> >>>>>>> natted traffic, while now the impact is more obvious because
> >>>>>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> >>>>>>> configured on the LS) since commit [1], after several other
> > indirectly
> >>>>>>> related optimizations and refactors.
> >>>>>>>
> >>>>>>> This patch fixes the problem by modifying the priority-120 flows
in
> >>>>>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port
for
> >>> any
> >>>>>>> traffic with the REGBIT_CONNTRACK_NAT == 1, we now save dst IP and
> > L4
> >>>>>>> port only for traffic matching the LB VIPs, because these are the
> > ones
> >>>>>>> that need to be saved for the hairpin purpose. The existed
> >>> priority-110
> >>>>>>> flows will match the rest of the traffic just like before but
> > wouldn't
> >>>>>>> not save dst IP and L4 port, so any server->client traffic would
not
> >>>>>>> unwildcard megaflows with client side L4 ports.
> >>>>>>>
> >>>>>>
> >>>>>> Just to be 100% sure, the client->server traffic will still
generate
> > up
> >>>>>> to V x H flows where V is "the number of VIP:port tuples" and H is
> > "the
> >>>>>> number of potential (dp_)hash values", right?
> >>>>>
> >>>>> Yes, if all the VIPs and backends are being accessed at the same
time.
> >>> This
> >>>>> is expected. For any given pair of client<->server, regardless of
the
> >>>>> connections (source port number changes), the packets should match
the
> >>> same
> >>>>> mega-flow and be handled by fastpath (instead of going to userspace,
> >>> which
> >>>>> is the behavior before this patch).
> >>>>>
> >>>>
> >>>> Thanks for the confirmation!
> >>>>
> >>>>>>
> >>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> >>> shared
> >>>>> backends.")
> >>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> >>> traffic.")
> >>>>>>>
> >>>>>>> Signed-off-by: Han Zhou <hzhou@ovn.org>
> >>>>>>> ---
> >>>>>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I
> > forgot
> >>>>> to commit
> >>>>>>>           before sending v1.
> >>>>>>>
> >>>>>>>  northd/northd.c         | 125
> >>> +++++++++++++++++++++++++---------------
> >>>>>>>  northd/ovn-northd.8.xml |  14 ++---
> >>>>>>>  tests/ovn-northd.at     |  87 ++++++++++------------------
> >>>>>>>  tests/ovn.at            |  14 ++---
> >>>>>>>  4 files changed, 118 insertions(+), 122 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/northd/northd.c b/northd/northd.c
> >>>>>>> index 7e2681865..860641936 100644
> >>>>>>> --- a/northd/northd.c
> >>>>>>> +++ b/northd/northd.c
> >>>>>>> @@ -273,15 +273,15 @@ enum ovn_stage {
> >>>>>>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
> >>>>>>>   * |    |     REGBIT_ACL_LABEL                         | X |                  |
> >>>>>>>   * +----+----------------------------------------------+ X |                  |
> >>>>>>> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |                  |
> >>>>>>> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |                  |
> >>>>>>>   * +----+----------------------------------------------+ E |                  |
> >>>>>>> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |                  |
> >>>>>>> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |                  |
> >>>>>>>   * +----+----------------------------------------------+ 0 |                  |
> >>>>>>>   * | R3 |                  ACL LABEL                   |   |                  |
> >>>>>>>   * +----+----------------------------------------------+---+------------------+
> >>>>>>>   * | R4 |                   UNUSED                     |   |                  |
> >>>>>>> - * +----+----------------------------------------------+ X |   ORIG_DIP_IPV6  |
> >>>>>>> - * | R5 |                   UNUSED                     | X | (>= IN_STATEFUL) |
> >>>>>>> + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
> >>>>>>> + * | R5 |                   UNUSED                     | X | IN_PRE_STATEFUL) |
> >>>>>>>   * +----+----------------------------------------------+ R |                  |
> >>>>>>>   * | R6 |                   UNUSED                     | E |                  |
> >>>>>>>   * +----+----------------------------------------------+ G |                  |
> >>>>>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
> >>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
> >>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
> >>>>>>>
> >>>>>>> -    const char *ct_lb_action = features->ct_no_masked_label
> >>>>>>> -                               ? "ct_lb_mark"
> >>>>>>> -                               : "ct_lb";
> >>>>>>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
> >>>>>>> -    struct ds actions = DS_EMPTY_INITIALIZER;
> >>>>>>> -    struct ds match = DS_EMPTY_INITIALIZER;
> >>>>>>> -
> >>>>>>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
> >>>>>>> -        ds_clear(&match);
> >>>>>>> -        ds_clear(&actions);
> >>>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
> >>>>>>> -                      lb_protocols[i]);
> >>>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
> >>>>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> >>>>>>> -                      lb_protocols[i], ct_lb_action);
> >>>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
> >>>>>>> -
> >>>>>>> -        ds_clear(&match);
> >>>>>>> -        ds_clear(&actions);
> >>>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
> >>>>>>> -                      lb_protocols[i]);
> >>>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
> >>>>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> >>>>>>> -                      lb_protocols[i], ct_lb_action);
> >>>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
> >>>>>>> -    }
> >>>>>>> +    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
> >>>>>>>
> >>>>>>> -    ds_clear(&actions);
> >>>>>>> -    ds_put_format(&actions, "%s;", ct_lb_action);
> >>>>>>> +    const char *ct_lb_action = features->ct_no_masked_label
> >>>>>>> +                               ? "ct_lb_mark;"
> >>>>>>> +                               : "ct_lb;";
> >>>>>>>
> >>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
> >>>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> >>>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >>>>>>>
> >>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
> >>>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> >>>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >>>>>>>
> >>>>>>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should be
> >>>>>>>       * sent to conntrack for tracking and defragmentation. */
> >>>>>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
> >>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
> >>>>>>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
> >>>>>>>
> >>>>>>> -    ds_destroy(&actions);
> >>>>>>> -    ds_destroy(&match);
> >>>>>>>  }
> >>>>>>>
> >>>>>>>  static void
> >>>>>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap *lflows) {
> >>>>>>>  }
> >>>>>>>
> >>>>>>>  static void
> >>>>>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
> >>>>>>> -               struct ds *match, struct ds *action,
> >>>>>>> -               const struct shash *meter_groups)
> >>>>>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb *lb,
> >>>>>>> +                            bool ct_lb_mark, struct ds *match,
> >>>>>>> +                            struct ds *action)
> >>>>>>>  {
> >>>>>>>      for (size_t i = 0; i < lb->n_vips; i++) {
> >>>>>>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
> >>>>>>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
> >>>>>>> -        const char *ip_match = NULL;
> >>>>>>> -
> >>>>>>>          ds_clear(action);
> >>>>>>>          ds_clear(match);
> >>>>>>> +        const char *ip_match = NULL;
> >>>>>>>
> >>>>>>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag.  Otherwise
> >>>>>>> -         * the load balanced packet will be committed again in
> >>>>>>> -         * S_SWITCH_IN_STATEFUL. */
> >>>>>>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
> >>>>>>>          /* Store the original destination IP to be used when generating
> >>>>>>>           * hairpin flows.
> >>>>>>>           */
> >>>>>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
> >>>>>>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
> >>>>>>>                            lb_vip->vip_port);
> >>>>>>>          }
> >>>>>>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" : "ct_lb");
> >>>>>>> +
> >>>>>>> +        ds_put_format(match, "%s.dst == %s", ip_match, lb_vip->vip_str);
> >>>>>>> +        if (lb_vip->vip_port) {
> >>>>>>> +            ds_put_format(match, " && %s.dst == %d", proto, lb_vip->vip_port);
> >>>>>>> +        }
> >>>>>>> +
> >>>>>>> +        struct ovn_lflow *lflow_ref = NULL;
> >>>>>>> +        uint32_t hash = ovn_logical_flow_hash(
> >>>>>>> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
> >>>>>>> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
> >>>>>>> +                ds_cstr(match), ds_cstr(action));
> >>>>>>> +
> >>>>>>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
> >>>>>>> +            struct ovn_datapath *od = lb->nb_ls[j];
> >>>>>>> +
> >>>>>>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
> >>>>>>> +                lflow_ref = ovn_lflow_add_at_with_hash(
> >>>>>>> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>>>>>> +                        ds_cstr(match), ds_cstr(action),
> >>>>>>> +                        NULL, NULL, &lb->nlb->header_,
> >>>>>>> +                        OVS_SOURCE_LOCATOR, hash);
> >>>>>>> +            }
> >>>>>>> +        }
> >>>>>>> +    }
> >>>>>>
> >>>>>> I see how this can fix the dataplane issue because we're limiting the
> >>>>>> number of dp flows but this now induces a measurable penalty on the
> >>>>>> control plane side because we essentially double the number of logical
> >>>>>> flows that ovn-northd generates for load balancer VIPs.
> >>>>>>
> >>>>>> Using a NB database [2] from a 250 node density heavy test [3] we ran
> >>>>>> in-house I see the following:
> >>>>>>
> >>>>>> a. before this patch (main):
> >>>>>> - number of SB lflows:         146233
> >>>>>> - SB size on disk (compacted): 70MB
> >>>>>> - northd poll loop intervals:  8525ms
> >>>>>>
> >>>>>> b. with this patch:
> >>>>>> - number of SB lflows:         163303
> >>>>>> - SB size on disk (compacted): 76MB
> >>>>>> - northd poll loop intervals:  9958ms
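> >>>>>>
> >>>>>> (For anyone who wants to reproduce these numbers, roughly the commands
> >>>>>> below should do. They assume you run them on the host where ovn-northd
> >>>>>> and the SB database live and that the SB db file is ovnsb_db.db -- the
> >>>>>> exact path is deployment specific.)
> >>>>>>
> >>>>>> # number of SB logical flows
> >>>>>> ovn-sbctl list Logical_Flow | grep -c '^_uuid'
> >>>>>> # on-disk SB size (the figures above are after compaction)
> >>>>>> du -h ovnsb_db.db
> >>>>>> # per-iteration northd processing times, if your build exposes stopwatches
> >>>>>> ovn-appctl -t ovn-northd stopwatch/show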
> >>>>>>
> >>>>>> [2]
> >>>>>> https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
> >>>>>> [3]
> >>>>>> https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
> >>>>>>
> >>>>>> This raises some concerns.  However, I went ahead and also ran a test
> >>>>>> with Ilya's load balancer optimization series [4] (up for review) and I
> >>>>>> got:
> >>>>>>
> >>>>>> c. with this patch + Ilya's series:
> >>>>>> - number of SB lflows:         163303  << Didn't change compared to "b"
> >>>>>> - SB size on disk (compacted): 76MB    << Didn't change compared to "b"
> >>>>>> - northd poll loop intervals:  6437ms  << 25% better than "a"
> >>>>>>
> >>>>>> Then I ran a test with main + Ilya's series.
> >>>>>>
> >>>>>> d. main + Ilya's series:
> >>>>>> - number of SB lflows:         146233  << Didn't change compared to "a"
> >>>>>> - SB size on disk (compacted): 70MB    << Didn't change compared to "a"
> >>>>>> - northd poll loop intervals:  5413ms  << 35% better than "a"
> >>>>>>
> >>>>>> [4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
> >>>>>>
> >>>>>> I guess what I'm trying to say is that if we go ahead with the approach
> >>>>>> you proposed, and especially if we want to backport it to the LTS, we
> >>>>>> should consider accepting/backporting other optimizations too (such as
> >>>>>> Ilya's) that would compensate (and more) for the control plane
> >>>>>> performance hit.
> >>>>>>
> >>>>> Thanks for the scale test!
> >>>>> I do expect some cost in the control plane, and your test result is
> >>>>> aligned with my expectation. It is unpleasant to see a ~10% SB size
> >>>>> increase (in the density-heavy and LB/service-heavy test scenario), but
> >>>>> if it is necessary to fix the dataplane performance bug I think it is
> >>>>> worth it. After all, everything the control plane does is to serve the
> >>>>> dataplane, right? The resulting dataplane performance boost is at least
> >>>>> 10x for short-lived connection scenarios.
> >>>>> It's great to see Ilya's patches help the LB scale, and I have no
> >>>>> objection to backporting them if it is considered necessary for stable
> >>>>> branches, but I'd consider that independent of the current patch.
> >>>>>
> >>>>
> >>>> They are independent.  But the problem is your fix will visibly impact
> >>>> control plane latency (~16% on my setup).  So, if we can avoid some of
> >>>> that hit by considering both patchsets together, I'd vote for that.
> >>>>
> >>> Again, no objection to backporting Ilya's patches, but this is the
> >>> criteria in my mind:
> >>>
> >>> 1. For this patch, it improves dataplane performance (to me it is in fact
> >>> fixing a performance bug) at the cost of the control plane. This is not
> >>> uncommon in the OVN project history - when adding some new features
> >>> sometimes it does add control plane cost. When this happens, whether to
> >>> accept the new feature is based on the value of the feature v.s. the cost
> >>> of the control plane - a tradeoff has to be made in many cases. In this
> >>> case, for a 16% control plane latency increase we get a 10x dataplane
> >>> speed up, so I would vote for it.
> >>
> >> Hi Han,
> >>
> >> I'm not arguing that we don't get dataplane speed up.  I'm sure we will
> >> and it should be a decent amount.  But I've been trying to replicate
> >> this 10x boost with an ovn-kubernetes kind deployment and I didn't
> >> really manage to.
> >>
> >> I did this on a 28 core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz in our
> >> lab.
> >>
> >> a. build an ovnkube image (with and without your patch); essentially
> >> what our ovn-kubernetes CI jobs do:
> >>
> >> https://github.com/ovn-org/ovn/blob/main/.github/workflows/ovn-kubernetes.yml#L39
> >>
> >> b. start a kind cluster with the image we built at step "a".
> >>
> >> c. edit the ovs-node daemonset definition and increase the resource
> >> limits to 16 CPUs and 2GB RSS (the kind defaults are very low: 0.2CPU
> >> and 300MB).  This restarts the ovs-daemon pods, wait for those to come
> >> back up.
> >>
> >> d. start an nginx deployment with 50 replicas: this is essentially 50
> >> backends accepting http requests on port 80.
> >>
> >> e. create a service load balancing http requests received on a port,
> >> e.g. 42424, to port 80 on the pods in the nginx deployment above.
> >>
> >> f. create a clients deployment with 100 replicas: 100 pods running the
> >> following bash script, a hack to simulate 40 new connections with
> >> different ephemeral source ports every 2 seconds:
> >>
> >> i=0
> >> while true; do
> >>   i=$((i+1))
> >>   [ $((i%40)) -eq 0 ] && sleep 2
> >>   nc -z -v $SERVICE_VIP 42424 &
> >> done
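> >>
> >> (To watch the effect while the clients run, something along these lines,
> >> executed inside each ovs pod, gives a rough view; the exact dump format
> >> varies a bit between OVS versions.)
> >>
> >> # total number of installed datapath flows (megaflows)
> >> watch -n1 'ovs-appctl dpctl/dump-flows | wc -l'
> >> # upcall / handler statistics
> >> ovs-appctl upcall/show
> >> # inspect TCP megaflow keys, e.g. how source ports end up being matched
> >> ovs-appctl dpctl/dump-flows | grep 'tcp(' | head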
> >>
> >> Comparing the runs with and without your patch applied I see improvement:
> >>
> >> 1. without this patch applied:
> >> - 17K, 18K, 27K datapath flows on the 3 nodes of the cluster
> >> - OVS cpu usage on the 3 nodes: 30-50%
> >>
> >> 2. with this patch applied:
> >> - 9K, 9.5K, 9.K datapath flows on the 3 nodes of the cluster
> >> - OVS cpu usage on the 3 nodes: 10-20%
> >>
> >> In both cases the whole system is not overloaded, OVS doesn't reach its
> >> CPU limits.
> >>
> >> I see clear improvement but I was expecting to see more than this.  It
> >> does look like my test doesn't generate enough unique sources per second
> >> so I was wondering if you could share your test methodology?
> >>
> >
> > Thanks Dumitru for the performance test. I guess if you reduce the number
> > of clients while increasing the connection speed per-client you would see
> > more differences. And I am not sure why you saw 10-20% OVS cpu usage with
> > the patch applied. If all the 100 clients keep sending packets you should
> > have constant 100xN (N=2 or 3) megaflows instead of 9K (unless there are
> > other problems causing flow miss). BTW, I assume you don't have HW-offload
> > enabled, right? Could you also try sending to a backend directly instead of
> > VIP and see the difference?
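> >
> > (Concretely, something like the following two runs from one of the client
> > pods should show the gap -- 42424 is the service port from your step e.,
> > and BACKEND_POD_IP stands for any one of the nginx pod IPs:)
> >
> > ab -n 1000 -c 1 http://$SERVICE_VIP:42424/
> > ab -n 1000 -c 1 http://$BACKEND_POD_IP:80/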
>
> No HWOL enabled on my platform.  I think my setup was still too
> overloaded (too many containers and virtual nodes on the same physical
> machine).
>
> >
> > Here is a brief test I had, with the below patch based on the system test
> > case "load-balancing". It is basically a single client sending requests to
> > one of the LB backends using the "ab - apache benchmark" test.
> >
-------------------------------------------------------------------------
> > diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> > index 992813614..3adc72da1 100644
> > --- a/tests/system-ovn.at
> > +++ b/tests/system-ovn.at
> > @@ -1430,10 +1430,12 @@ ovn-nbctl lsp-add bar bar3 \
> >  # Config OVN load-balancer with a VIP.
> >  ovn-nbctl lb-add lb1 30.0.0.1 "172.16.1.2,172.16.1.3,172.16.1.4"
> >  ovn-nbctl ls-lb-add foo lb1
> > +ovn-nbctl ls-lb-add bar lb1
> >
> >  # Create another load-balancer with another VIP.
> >  lb2_uuid=`ovn-nbctl create load_balancer name=lb2
> > vips:30.0.0.3="172.16.1.2,172.16.1.3,172.16.1.4"`
> >  ovn-nbctl ls-lb-add foo lb2
> > +ovn-nbctl ls-lb-add bar lb2
> >
> >  # Config OVN load-balancer with another VIP (this time with ports).
> >  ovn-nbctl set load_balancer $lb2_uuid vips:'"30.0.0.2:8000"'='"
> > 172.16.1.2:80,172.16.1.3:80,172.16.1.4:80"'
> > @@ -1503,6 +1505,10 @@
> >
tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(s
> >
 tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>)
> >  ])
> >
> > +#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4 &&
> > ip4.dst == {172.16.1.2,172.16.1.3,172.16.1.4} && tcp.dst == 80"
> > allow-related
> > +#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4"
> > allow-related
> > +#check ovn-nbctl --wait=hv sync
> > +
> >  dnl Test load-balancing that includes L4 ports in NAT.
> >  dnl Each server should have at least one connection.
> >  OVS_WAIT_FOR_OUTPUT([
> > @@ -1516,6 +1522,15 @@
> >
tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(s
> >
 tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>)
> >  ])
> >
> > +ip netns exec foo1 ab -n 10000 -c 1 -q 172.16.1.2:80/
> > +
> >  # Configure selection_fields.
> >  ovn-nbctl set load_balancer $lb2_uuid
> > selection_fields="ip_src,ip_dst,tp_src,tp_dst"
> >  OVS_WAIT_UNTIL([
> > -----------------------------------------------------------------------
> >
> > Tests on 24 cores x Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
> > The test results are:
> >
> > // Before the change
> > Concurrency Level:      1
> > Time taken for tests:   12.913 seconds
> > Complete requests:      10000
> > Failed requests:        0
> > Total transferred:      21450000 bytes
> > HTML transferred:       19590000 bytes
> > Requests per second:    774.41 [#/sec] (mean)
> > Time per request:       1.291 [ms] (mean)
> > Time per request:       1.291 [ms] (mean, across all concurrent requests)
> > Transfer rate:          1622.18 [Kbytes/sec] received
> >
> > OVS CPU: ~280%
> >
> > // After the change
> > Concurrency Level:      1
> > Time taken for tests:   5.629 seconds
> > Complete requests:      10000
> > Failed requests:        0
> > Total transferred:      21450000 bytes
> > HTML transferred:       19590000 bytes
> > Requests per second:    1776.62 [#/sec] (mean)
> > Time per request:       0.563 [ms] (mean)
> > Time per request:       0.563 [ms] (mean, across all concurrent requests)
> > Transfer rate:          3721.54 [Kbytes/sec] received
> >
> > OVS CPU: ~0%
> >
> > We can see that the RPS is 2.5x higher (1776 v.s. 774) even with a single
> > thread (-c 1). It is not 10x as I guessed earlier, but keep in mind that
> > the "ab" test has a complete TCP handshake, then request/response, and
> > finally tear down, usually 6 packets per connection, so at least half of
> > the packets are still going through the fast-path even before the change.
> > But for the "first" packets of each connection the speed up should be at
> > the 10x level, which we can easily tell with a ping test by comparing the
> > RTT of the first packet, which goes through the slow-path, with that of
> > the later packets that go through the fast path only.
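> >
> > (For illustration, reusing the addresses from the test above -- foo1 is
> > the client netns and 30.0.0.1 the VIP:)
> >
> > ip netns exec foo1 ping -c3 30.0.0.1
> > # the first reply is handled via the slow path (upcall to ovs-vswitchd)
> > # and is visibly slower; later replies only hit the installed megaflow.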
>
> I changed the test, on an ovn-kubernetes kind cluster, with netperf
> TCP_CRR (connection + request + response):
>
> https://github.com/jtaleric/k8s-netperf/pull/8
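>
> (If you want to try the same thing outside the k8s harness, plain netperf
> against a backend running netserver is an alternative, e.g.:)
>
> netperf -H $SERVER_IP -t TCP_CRR -l 60 -- -r 1,1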
>
> >
> > And the CPU utilization for OVS is huge before the change while after the
> > change it is 0. So if we combine more complex scenarios with the system
> > overloaded I believe the difference should be even bigger.
> >
>
> I see a big difference in CPU usage too.  It goes up to ~90% CPU (almost
> one whole CPU) just for processing CRR between a client and a server.
> While with your patch applied the CPU usage for OVS is negligible.
>
> The number of DP flows also drops drastically, as expected.
>
> Comparing the number of "transactions" (connect + request + response)
> between a single pair of client + server I see on my machine:
>
> a. without your patch: ~2K TPS
> b. with your patch: ~6K TPS
>
> Probably a coincidence because our platforms and tests are very
> different but I also see 2-3x boost in transactions per second.
>
> >> For the OVN code changes, they look good to me.  I think we should apply
> >> them to the main branch.  For that:
> >>
> >> Acked-by: Dumitru Ceara <dceara@redhat.com>
> >>
> >> For backporting the change to stable branches I'd really like to wait
> >> until we first understand the exact scenario in which the megaflow
> >> unwildcarding created such a big performance regression in ovn-kubernetes.
> >
> > Thank you for all the reviews and tests. I applied it to main and am
> > waiting for your confirmation for backporting.
> > For branch-22.09 I think maybe we should backport for now before it is
> > released?
> >
>
> In light of your results and based on the findings above I think we
> should accept the backport even if it might cause a small control plane
> degradation.  This is also keeping in mind Ilya's northd patches, which I
> think are a good candidate for stable branches too.  But I'll let Mark
> and Numan, from their maintainer perspective, share their thoughts on
> this matter.
>
> Regards,
> Dumitru

Thanks Dumitru for confirming. In today's OVN meeting I got confirmation
from Mark that he is in favor of backporting it as well, so I just
backported down to branch-22.03.

Han

>
> > Han
> >
> >>
> >> Thanks,
> >> Dumitru
> >>
> >>>
> >>> 2. For backporting Ilya's optimization, I'd evaluate using the criteria
> >>> for any optimization backporting. Say the patch has 10-20% speed up, and
> >>> we think that speed up is critical for a branch for some reason, and the
> >>> changes are well contained, I'd vote for the backport.
> >>>
> >>> Does this make sense?
> >>> @Numan Siddique <numans@ovn.org> @Mark Michelson <mmichels@redhat.com>
> >>> any comments?
> >>>
> >>> Thanks,
> >>> Han
> >>>
> >>>> Given that Ilya's changes seem very contained I don't think this should
> >>>> be a problem.  They just need a proper review first.
> >>>>
> >>>>>> I tried to think of an alternative that wouldn't require doubling the
> >>>>>> number of VIP logical flows but couldn't come up with one.  Maybe you
> >>>>>> have more ideas?
> >>>>>>
> >>>>> This is the approach that has the least impact and changes to existing
> >>>>> pipelines I have thought about so far. The earlier attempt wouldn't
> >>>>> increase the number of flows, but it had to compromise by either failing
> >>>>> to support a corner case (that we are still not sure is really useful at
> >>>>> all) or losing HW offloading capability. I'll still think about some
> >>>>> more approaches, but the NAT/LB/ACL pipelines are very complex now and
> >>>>> I'd avoid big changes in case any corner case is broken. I'd spend more
> >>>>> time sorting out different use cases and the documentation before making
> >>>>> more optimizations, but I do think we should fix this performance bug as
> >>>>> soon as we can. (I am also curious why we haven't heard other OVN users
> >>>>> complain about this yet ...)
> >>>>
> >>>> It's a guess, but we probably didn't notice this issue because it should
> >>>> be more visible when there are lots of short lived connections.
> >>>>
> >>>> Thanks,
> >>>> Dumitru
> >>>>
> >>>
> >>
> >
>
diff mbox series

Patch

diff --git a/northd/northd.c b/northd/northd.c
index 7e2681865..860641936 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -273,15 +273,15 @@  enum ovn_stage {
  * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
  * |    |     REGBIT_ACL_LABEL                         | X |                  |
  * +----+----------------------------------------------+ X |                  |
- * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |                  |
+ * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |                  |
  * +----+----------------------------------------------+ E |                  |
- * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |                  |
+ * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |                  |
  * +----+----------------------------------------------+ 0 |                  |
  * | R3 |                  ACL LABEL                   |   |                  |
  * +----+----------------------------------------------+---+------------------+
  * | R4 |                   UNUSED                     |   |                  |
- * +----+----------------------------------------------+ X |   ORIG_DIP_IPV6  |
- * | R5 |                   UNUSED                     | X | (>= IN_STATEFUL) |
+ * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
+ * | R5 |                   UNUSED                     | X | IN_PRE_STATEFUL) |
  * +----+----------------------------------------------+ R |                  |
  * | R6 |                   UNUSED                     | E |                  |
  * +----+----------------------------------------------+ G |                  |
@@ -5899,43 +5899,17 @@  build_pre_stateful(struct ovn_datapath *od,
     ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
     ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
 
-    const char *ct_lb_action = features->ct_no_masked_label
-                               ? "ct_lb_mark"
-                               : "ct_lb";
-    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
-    struct ds actions = DS_EMPTY_INITIALIZER;
-    struct ds match = DS_EMPTY_INITIALIZER;
-
-    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
-        ds_clear(&match);
-        ds_clear(&actions);
-        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
-                      lb_protocols[i]);
-        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
-                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
-                      lb_protocols[i], ct_lb_action);
-        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
-                      ds_cstr(&match), ds_cstr(&actions));
-
-        ds_clear(&match);
-        ds_clear(&actions);
-        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
-                      lb_protocols[i]);
-        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
-                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
-                      lb_protocols[i], ct_lb_action);
-        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
-                      ds_cstr(&match), ds_cstr(&actions));
-    }
+    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
 
-    ds_clear(&actions);
-    ds_put_format(&actions, "%s;", ct_lb_action);
+    const char *ct_lb_action = features->ct_no_masked_label
+                               ? "ct_lb_mark;"
+                               : "ct_lb;";
 
     ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
-                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
+                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
 
     ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
-                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
+                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
 
     /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should be
      * sent to conntrack for tracking and defragmentation. */
@@ -5945,8 +5919,6 @@  build_pre_stateful(struct ovn_datapath *od,
     ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
                   REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
 
-    ds_destroy(&actions);
-    ds_destroy(&match);
 }
 
 static void
@@ -6841,22 +6813,16 @@  build_qos(struct ovn_datapath *od, struct hmap *lflows) {
 }
 
 static void
-build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
-               struct ds *match, struct ds *action,
-               const struct shash *meter_groups)
+build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb *lb,
+                            bool ct_lb_mark, struct ds *match,
+                            struct ds *action)
 {
     for (size_t i = 0; i < lb->n_vips; i++) {
         struct ovn_lb_vip *lb_vip = &lb->vips[i];
-        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
-        const char *ip_match = NULL;
-
         ds_clear(action);
         ds_clear(match);
+        const char *ip_match = NULL;
 
-        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag.  Otherwise
-         * the load balanced packet will be committed again in
-         * S_SWITCH_IN_STATEFUL. */
-        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
         /* Store the original destination IP to be used when generating
          * hairpin flows.
          */
@@ -6887,6 +6853,67 @@  build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
             ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
                           lb_vip->vip_port);
         }
+        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" : "ct_lb");
+
+        ds_put_format(match, "%s.dst == %s", ip_match, lb_vip->vip_str);
+        if (lb_vip->vip_port) {
+            ds_put_format(match, " && %s.dst == %d", proto, lb_vip->vip_port);
+        }
+
+        struct ovn_lflow *lflow_ref = NULL;
+        uint32_t hash = ovn_logical_flow_hash(
+                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
+                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
+                ds_cstr(match), ds_cstr(action));
+
+        for (size_t j = 0; j < lb->n_nb_ls; j++) {
+            struct ovn_datapath *od = lb->nb_ls[j];
+
+            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
+                lflow_ref = ovn_lflow_add_at_with_hash(
+                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
+                        ds_cstr(match), ds_cstr(action),
+                        NULL, NULL, &lb->nlb->header_,
+                        OVS_SOURCE_LOCATOR, hash);
+            }
+        }
+    }
+}
+
+static void
+build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
+               struct ds *match, struct ds *action,
+               const struct shash *meter_groups)
+{
+    for (size_t i = 0; i < lb->n_vips; i++) {
+        struct ovn_lb_vip *lb_vip = &lb->vips[i];
+        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
+        const char *ip_match = NULL;
+        if (IN6_IS_ADDR_V4MAPPED(&lb_vip->vip)) {
+            ip_match = "ip4";
+        } else {
+            ip_match = "ip6";
+        }
+
+        const char *proto = NULL;
+        if (lb_vip->vip_port) {
+            proto = "tcp";
+            if (lb->nlb->protocol) {
+                if (!strcmp(lb->nlb->protocol, "udp")) {
+                    proto = "udp";
+                } else if (!strcmp(lb->nlb->protocol, "sctp")) {
+                    proto = "sctp";
+                }
+            }
+        }
+
+        ds_clear(action);
+        ds_clear(match);
+
+        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag.  Otherwise
+         * the load balanced packet will be committed again in
+         * S_SWITCH_IN_STATEFUL. */
+        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
 
         /* New connections in Ingress table. */
         const char *meter = NULL;
@@ -10170,6 +10197,8 @@  build_lswitch_flows_for_lb(struct ovn_northd_lb *lb, struct hmap *lflows,
      * a higher priority rule for load balancing below also commits the
      * connection, so it is okay if we do not hit the above match on
      * REGBIT_CONNTRACK_COMMIT. */
+    build_lb_rules_pre_stateful(lflows, lb, features->ct_no_masked_label,
+                                match, action);
     build_lb_rules(lflows, lb, features->ct_no_masked_label,
                    match, action, meter_groups);
 }
diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index b8f871394..58ee9f27a 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -538,10 +538,8 @@ 
       <li>
         Priority-120 flows that send the packets to connection tracker using
         <code>ct_lb_mark;</code> as the action so that the already established
-        traffic destined to the load balancer VIP gets DNATted based on a hint
-        provided by the previous tables (with a match
-        for <code>reg0[2] == 1</code> and on supported load balancer protocols
-        and address families).  For IPv4 traffic the flows also load the
+        traffic destined to the load balancer VIP gets DNATted. These flows
+        match each VIPs IP and port. For IPv4 traffic the flows also load the
         original destination IP and transport port in registers
         <code>reg1</code> and <code>reg2</code>.  For IPv6 traffic the flows
         also load the original destination IP and transport port in
@@ -549,12 +547,10 @@ 
       </li>
 
       <li>
-         A priority-110 flow sends the packets to connection tracker based
-         on a hint provided by the previous tables
+         A priority-110 flow sends the packets that don't match the above flows
+         to connection tracker based on a hint provided by the previous tables
          (with a match for <code>reg0[2] == 1</code>) by using the
-         <code>ct_lb_mark;</code> action.  This flow is added to handle
-         the traffic for load balancer VIPs whose protocol is not defined
-         (mainly for ICMP traffic).
+         <code>ct_lb_mark;</code> action.
       </li>
 
       <li>
diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
index 157f9f60c..19ae0a362 100644
--- a/tests/ovn-northd.at
+++ b/tests/ovn-northd.at
@@ -1227,7 +1227,7 @@  check ovn-nbctl --wait=sb ls-lb-add sw0 lb1
 AT_CAPTURE_FILE([sbflows])
 OVS_WAIT_FOR_OUTPUT(
   [ovn-sbctl dump-flows sw0 | tee sbflows | grep 'priority=120.*backends' | sed 's/table=..//'], 0, [dnl
-  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
+  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
 ])
 
 AS_BOX([Delete the Load_Balancer_Health_Check])
@@ -1237,7 +1237,7 @@  wait_row_count Service_Monitor 0
 AT_CAPTURE_FILE([sbflows2])
 OVS_WAIT_FOR_OUTPUT(
   [ovn-sbctl dump-flows sw0 | tee sbflows2 | grep 'priority=120.*backends' | sed 's/table=..//'], [0],
-[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
+[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
 ])
 
 AS_BOX([Create the Load_Balancer_Health_Check again.])
@@ -1249,7 +1249,7 @@  check ovn-nbctl --wait=sb sync
 
 ovn-sbctl dump-flows sw0 | grep backends | grep priority=120 > lflows.txt
 AT_CHECK([cat lflows.txt | sed 's/table=..//'], [0], [dnl
-  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
+  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
 ])
 
 AS_BOX([Get the uuid of both the service_monitor])
@@ -1259,7 +1259,7 @@  sm_sw1_p1=$(fetch_column Service_Monitor _uuid logical_port=sw1-p1)
 AT_CAPTURE_FILE([sbflows3])
 OVS_WAIT_FOR_OUTPUT(
   [ovn-sbctl dump-flows sw0 | tee sbflows 3 | grep 'priority=120.*backends' | sed 's/table=..//'], [0],
-[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
+[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
 ])
 
 AS_BOX([Set the service monitor for sw1-p1 to offline])
@@ -1270,7 +1270,7 @@  check ovn-nbctl --wait=sb sync
 AT_CAPTURE_FILE([sbflows4])
 OVS_WAIT_FOR_OUTPUT(
   [ovn-sbctl dump-flows sw0 | tee sbflows4 | grep 'priority=120.*backends' | sed 's/table=..//'], [0],
-[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb(backends=10.0.0.3:80);)
+[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:80);)
 ])
 
 AS_BOX([Set the service monitor for sw0-p1 to offline])
@@ -1285,7 +1285,7 @@  OVS_WAIT_FOR_OUTPUT(
 
 AT_CAPTURE_FILE([sbflows6])
 OVS_WAIT_FOR_OUTPUT(
-  [ovn-sbctl dump-flows sw0 | tee sbflows6 | grep "ip4.dst == 10.0.0.10 && tcp.dst == 80" | grep priority=120 | sed 's/table=..//'], [0], [dnl
+  [ovn-sbctl dump-flows sw0 | tee sbflows6 | grep "ip4.dst == 10.0.0.10 && tcp.dst == 80" | grep priority=120 | grep ls_in_lb | sed 's/table=..//'], [0], [dnl
   (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(drop;)
 ])
 
@@ -1299,7 +1299,7 @@  check ovn-nbctl --wait=sb sync
 AT_CAPTURE_FILE([sbflows7])
 OVS_WAIT_FOR_OUTPUT(
   [ovn-sbctl dump-flows sw0 | tee sbflows7 | grep backends | grep priority=120 | sed 's/table=..//'], 0,
-[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
+[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
 ])
 
 AS_BOX([Set the service monitor for sw1-p1 to error])
@@ -1309,8 +1309,8 @@  check ovn-nbctl --wait=sb sync
 
 ovn-sbctl dump-flows sw0 | grep "ip4.dst == 10.0.0.10 && tcp.dst == 80" \
 | grep priority=120 > lflows.txt
-AT_CHECK([cat lflows.txt | sed 's/table=..//'], [0], [dnl
-  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb(backends=10.0.0.3:80);)
+AT_CHECK([cat lflows.txt | grep ls_in_lb | sed 's/table=..//'], [0], [dnl
+  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:80);)
 ])
 
 AS_BOX([Add one more vip to lb1])
@@ -1336,8 +1336,8 @@  AT_CAPTURE_FILE([sbflows9])
 OVS_WAIT_FOR_OUTPUT(
   [ovn-sbctl dump-flows sw0 | tee sbflows9 | grep backends | grep priority=120 | sed 's/table=..//' | sort],
   0,
-[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb(backends=10.0.0.3:80);)
-  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.40 && tcp.dst == 1000), action=(reg0[[1]] = 0; reg1 = 10.0.0.40; reg2[[0..15]] = 1000; ct_lb(backends=10.0.0.3:1000);)
+[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:80);)
+  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.40 && tcp.dst == 1000), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:1000);)
 ])
 
 AS_BOX([Set the service monitor for sw1-p1 to online])
@@ -1350,8 +1350,8 @@  AT_CAPTURE_FILE([sbflows10])
 OVS_WAIT_FOR_OUTPUT(
   [ovn-sbctl dump-flows sw0 | tee sbflows10 | grep backends | grep priority=120 | sed 's/table=..//' | sort],
   0,
-[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
-  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.40 && tcp.dst == 1000), action=(reg0[[1]] = 0; reg1 = 10.0.0.40; reg2[[0..15]] = 1000; ct_lb(backends=10.0.0.3:1000,20.0.0.3:80);)
+[  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
+  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.40 && tcp.dst == 1000), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:1000,20.0.0.3:80);)
 ])
 
 AS_BOX([Associate lb1 to sw1])
@@ -1360,8 +1360,8 @@  AT_CAPTURE_FILE([sbflows11])
 OVS_WAIT_FOR_OUTPUT(
   [ovn-sbctl dump-flows sw1 | tee sbflows11 | grep backends | grep priority=120 | sed 's/table=..//' | sort],
   0, [dnl
-  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
-  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.40 && tcp.dst == 1000), action=(reg0[[1]] = 0; reg1 = 10.0.0.40; reg2[[0..15]] = 1000; ct_lb(backends=10.0.0.3:1000,20.0.0.3:80);)
+  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:80,20.0.0.3:80);)
+  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.40 && tcp.dst == 1000), action=(reg0[[1]] = 0; ct_lb(backends=10.0.0.3:1000,20.0.0.3:80);)
 ])
 
 AS_BOX([Now create lb2 same as lb1 but udp protocol.])
@@ -1417,7 +1417,7 @@  ovn-sbctl set service_monitor $sm_sw1_p1 status=offline
 
 AT_CAPTURE_FILE([sbflows12])
 OVS_WAIT_FOR_OUTPUT(
-  [ovn-sbctl dump-flows sw0 | tee sbflows12 | grep "ip4.dst == 10.0.0.10 && tcp.dst == 80" | grep priority=120 | sed 's/table=..//'], [0], [dnl
+  [ovn-sbctl dump-flows sw0 | tee sbflows12 | grep "ip4.dst == 10.0.0.10 && tcp.dst == 80" | grep priority=120 | grep ls_in_lb | sed 's/table=..//'], [0], [dnl
   (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0 = 0; reject { outport <-> inport; next(pipeline=egress,table=5);};)
 ])
 
@@ -3889,18 +3889,14 @@  check_stateful_flows() {
   table=? (ls_in_pre_stateful ), priority=0    , match=(1), action=(next;)
   table=? (ls_in_pre_stateful ), priority=100  , match=(reg0[[0]] == 1), action=(ct_next;)
   table=? (ls_in_pre_stateful ), priority=110  , match=(reg0[[2]] == 1), action=(ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && sctp), action=(reg1 = ip4.dst; reg2[[0..15]] = sctp.dst; ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && tcp), action=(reg1 = ip4.dst; reg2[[0..15]] = tcp.dst; ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && udp), action=(reg1 = ip4.dst; reg2[[0..15]] = udp.dst; ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && sctp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = sctp.dst; ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && tcp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = tcp.dst; ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && udp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = udp.dst; ct_lb_mark;)
+  table=? (ls_in_pre_stateful ), priority=120  , match=(ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb_mark;)
+  table=? (ls_in_pre_stateful ), priority=120  , match=(ip4.dst == 10.0.0.20 && tcp.dst == 80), action=(reg1 = 10.0.0.20; reg2[[0..15]] = 80; ct_lb_mark;)
 ])
 
     AT_CHECK([grep "ls_in_lb" sw0flows | sort | sed 's/table=../table=??/'], [0], [dnl
   table=??(ls_in_lb           ), priority=0    , match=(1), action=(next;)
-  table=??(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb_mark(backends=10.0.0.4:8080);)
-  table=??(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.20 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.20; reg2[[0..15]] = 80; ct_lb_mark(backends=10.0.0.40:8080);)
+  table=??(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb_mark(backends=10.0.0.4:8080);)
+  table=??(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.20 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb_mark(backends=10.0.0.40:8080);)
 ])
 
     AT_CHECK([grep "ls_in_stateful" sw0flows | sort | sed 's/table=../table=??/'], [0], [dnl
@@ -3961,12 +3957,6 @@  AT_CHECK([grep "ls_in_pre_stateful" sw0flows | sort | sed 's/table=./table=?/'],
   table=? (ls_in_pre_stateful ), priority=0    , match=(1), action=(next;)
   table=? (ls_in_pre_stateful ), priority=100  , match=(reg0[[0]] == 1), action=(ct_next;)
   table=? (ls_in_pre_stateful ), priority=110  , match=(reg0[[2]] == 1), action=(ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && sctp), action=(reg1 = ip4.dst; reg2[[0..15]] = sctp.dst; ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && tcp), action=(reg1 = ip4.dst; reg2[[0..15]] = tcp.dst; ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && udp), action=(reg1 = ip4.dst; reg2[[0..15]] = udp.dst; ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && sctp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = sctp.dst; ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && tcp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = tcp.dst; ct_lb_mark;)
-  table=? (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && udp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = udp.dst; ct_lb_mark;)
 ])
 
 AT_CHECK([grep "ls_in_lb" sw0flows | sort | sed 's/table=../table=??/'], [0], [dnl
@@ -6415,7 +6405,7 @@  AT_CHECK([grep -e "ls_in_acl" lsflows | sed 's/table=../table=??/' | sort], [0],
 
 AT_CHECK([grep -e "ls_in_lb" lsflows | sed 's/table=../table=??/' | sort], [0], [dnl
   table=??(ls_in_lb           ), priority=0    , match=(1), action=(next;)
-  table=??(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 10.0.0.2), action=(reg0[[1]] = 0; reg1 = 10.0.0.2; ct_lb_mark(backends=10.0.0.10);)
+  table=??(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 10.0.0.2), action=(reg0[[1]] = 0; ct_lb_mark(backends=10.0.0.10);)
 ])
 
 AT_CHECK([grep -e "ls_in_stateful" lsflows | sed 's/table=../table=??/' | sort], [0], [dnl
@@ -6468,7 +6458,7 @@  AT_CHECK([grep -e "ls_in_acl" lsflows | sed 's/table=../table=??/' | sort], [0],
 
 AT_CHECK([grep -e "ls_in_lb" lsflows | sed 's/table=../table=??/' | sort], [0], [dnl
   table=??(ls_in_lb           ), priority=0    , match=(1), action=(next;)
-  table=??(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 10.0.0.2), action=(reg0[[1]] = 0; reg1 = 10.0.0.2; ct_lb_mark(backends=10.0.0.10);)
+  table=??(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 10.0.0.2), action=(reg0[[1]] = 0; ct_lb_mark(backends=10.0.0.10);)
 ])
 
 AT_CHECK([grep -e "ls_in_stateful" lsflows | sed 's/table=../table=??/' | sort], [0], [dnl
@@ -6521,7 +6511,7 @@  AT_CHECK([grep -e "ls_in_acl" lsflows | sed 's/table=../table=??/' | sort], [0],
 
 AT_CHECK([grep -e "ls_in_lb" lsflows | sed 's/table=../table=??/' | sort], [0], [dnl
   table=??(ls_in_lb           ), priority=0    , match=(1), action=(next;)
-  table=??(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 10.0.0.2), action=(reg0[[1]] = 0; reg1 = 10.0.0.2; ct_lb_mark(backends=10.0.0.10);)
+  table=??(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 10.0.0.2), action=(reg0[[1]] = 0; ct_lb_mark(backends=10.0.0.10);)
 ])
 
 AT_CHECK([grep -e "ls_in_stateful" lsflows | sed 's/table=../table=??/' | sort], [0], [dnl
@@ -7480,14 +7470,9 @@  check ovn-nbctl --wait=sb sync
 AT_CHECK([ovn-sbctl lflow-list | grep -e natted -e ct_lb], [0], [dnl
   table=6 (lr_in_dnat         ), priority=110  , match=(ct.est && ip4 && reg0 == 66.66.66.66 && ct_mark.natted == 1), action=(next;)
   table=6 (lr_in_dnat         ), priority=110  , match=(ct.new && ip4 && reg0 == 66.66.66.66), action=(ct_lb_mark(backends=42.42.42.2);)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && sctp), action=(reg1 = ip4.dst; reg2[[0..15]] = sctp.dst; ct_lb_mark;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && tcp), action=(reg1 = ip4.dst; reg2[[0..15]] = tcp.dst; ct_lb_mark;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && udp), action=(reg1 = ip4.dst; reg2[[0..15]] = udp.dst; ct_lb_mark;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && sctp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = sctp.dst; ct_lb_mark;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && tcp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = tcp.dst; ct_lb_mark;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && udp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = udp.dst; ct_lb_mark;)
+  table=6 (ls_in_pre_stateful ), priority=120  , match=(ip4.dst == 66.66.66.66), action=(reg1 = 66.66.66.66; ct_lb_mark;)
   table=6 (ls_in_pre_stateful ), priority=110  , match=(reg0[[2]] == 1), action=(ct_lb_mark;)
-  table=11(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 66.66.66.66), action=(reg0[[1]] = 0; reg1 = 66.66.66.66; ct_lb_mark(backends=42.42.42.2);)
+  table=11(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 66.66.66.66), action=(reg0[[1]] = 0; ct_lb_mark(backends=42.42.42.2);)
   table=2 (ls_out_pre_stateful), priority=110  , match=(reg0[[2]] == 1), action=(ct_lb_mark;)
 ])
 
@@ -7497,14 +7482,9 @@  check ovn-nbctl --wait=sb sync
 AT_CHECK([ovn-sbctl lflow-list | grep -e natted -e ct_lb], [0], [dnl
   table=6 (lr_in_dnat         ), priority=110  , match=(ct.est && ip4 && reg0 == 66.66.66.66 && ct_label.natted == 1), action=(next;)
   table=6 (lr_in_dnat         ), priority=110  , match=(ct.new && ip4 && reg0 == 66.66.66.66), action=(ct_lb(backends=42.42.42.2);)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && sctp), action=(reg1 = ip4.dst; reg2[[0..15]] = sctp.dst; ct_lb;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && tcp), action=(reg1 = ip4.dst; reg2[[0..15]] = tcp.dst; ct_lb;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && udp), action=(reg1 = ip4.dst; reg2[[0..15]] = udp.dst; ct_lb;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && sctp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = sctp.dst; ct_lb;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && tcp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = tcp.dst; ct_lb;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && udp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = udp.dst; ct_lb;)
+  table=6 (ls_in_pre_stateful ), priority=120  , match=(ip4.dst == 66.66.66.66), action=(reg1 = 66.66.66.66; ct_lb;)
   table=6 (ls_in_pre_stateful ), priority=110  , match=(reg0[[2]] == 1), action=(ct_lb;)
-  table=11(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 66.66.66.66), action=(reg0[[1]] = 0; reg1 = 66.66.66.66; ct_lb(backends=42.42.42.2);)
+  table=11(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 66.66.66.66), action=(reg0[[1]] = 0; ct_lb(backends=42.42.42.2);)
   table=2 (ls_out_pre_stateful), priority=110  , match=(reg0[[2]] == 1), action=(ct_lb;)
 ])
 
@@ -7514,14 +7494,9 @@  check ovn-nbctl --wait=sb sync
 AT_CHECK([ovn-sbctl lflow-list | grep -e natted -e ct_lb], [0], [dnl
   table=6 (lr_in_dnat         ), priority=110  , match=(ct.est && ip4 && reg0 == 66.66.66.66 && ct_mark.natted == 1), action=(next;)
   table=6 (lr_in_dnat         ), priority=110  , match=(ct.new && ip4 && reg0 == 66.66.66.66), action=(ct_lb_mark(backends=42.42.42.2);)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && sctp), action=(reg1 = ip4.dst; reg2[[0..15]] = sctp.dst; ct_lb_mark;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && tcp), action=(reg1 = ip4.dst; reg2[[0..15]] = tcp.dst; ct_lb_mark;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && udp), action=(reg1 = ip4.dst; reg2[[0..15]] = udp.dst; ct_lb_mark;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && sctp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = sctp.dst; ct_lb_mark;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && tcp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = tcp.dst; ct_lb_mark;)
-  table=6 (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && udp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = udp.dst; ct_lb_mark;)
+  table=6 (ls_in_pre_stateful ), priority=120  , match=(ip4.dst == 66.66.66.66), action=(reg1 = 66.66.66.66; ct_lb_mark;)
   table=6 (ls_in_pre_stateful ), priority=110  , match=(reg0[[2]] == 1), action=(ct_lb_mark;)
-  table=11(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 66.66.66.66), action=(reg0[[1]] = 0; reg1 = 66.66.66.66; ct_lb_mark(backends=42.42.42.2);)
+  table=11(ls_in_lb           ), priority=110  , match=(ct.new && ip4.dst == 66.66.66.66), action=(reg0[[1]] = 0; ct_lb_mark(backends=42.42.42.2);)
   table=2 (ls_out_pre_stateful), priority=110  , match=(reg0[[2]] == 1), action=(ct_lb_mark;)
 ])
 
@@ -7692,11 +7667,11 @@  AT_CAPTURE_FILE([S1flows])
 
 AT_CHECK([grep "ls_in_lb" S0flows | sort], [0], [dnl
   table=11(ls_in_lb           ), priority=0    , match=(1), action=(next;)
-  table=11(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 172.16.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 172.16.0.10; reg2[[0..15]] = 80; ct_lb_mark(backends=10.0.0.2:80);)
+  table=11(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 172.16.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb_mark(backends=10.0.0.2:80);)
 ])
 AT_CHECK([grep "ls_in_lb" S1flows | sort], [0], [dnl
   table=11(ls_in_lb           ), priority=0    , match=(1), action=(next;)
-  table=11(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 172.16.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 172.16.0.10; reg2[[0..15]] = 80; ct_lb_mark(backends=10.0.0.2:80);)
+  table=11(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 172.16.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb_mark(backends=10.0.0.2:80);)
 ])
 
 ovn-sbctl get datapath S0 _uuid > dp_uuids
diff --git a/tests/ovn.at b/tests/ovn.at
index bba2c9c1d..0875fecc2 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -23586,13 +23586,8 @@  OVS_WAIT_FOR_OUTPUT(
   [ovn-sbctl dump-flows > sbflows
    ovn-sbctl dump-flows sw0 | grep ct_lb_mark | grep priority=120 | sed 's/table=..//'], 0,
   [dnl
-  (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && sctp), action=(reg1 = ip4.dst; reg2[[0..15]] = sctp.dst; ct_lb_mark;)
-  (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && tcp), action=(reg1 = ip4.dst; reg2[[0..15]] = tcp.dst; ct_lb_mark;)
-  (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip4 && udp), action=(reg1 = ip4.dst; reg2[[0..15]] = udp.dst; ct_lb_mark;)
-  (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && sctp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = sctp.dst; ct_lb_mark;)
-  (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && tcp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = tcp.dst; ct_lb_mark;)
-  (ls_in_pre_stateful ), priority=120  , match=(reg0[[2]] == 1 && ip6 && udp), action=(xxreg1 = ip6.dst; reg2[[0..15]] = udp.dst; ct_lb_mark;)
-  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb_mark(backends=10.0.0.3:80,20.0.0.3:80; hash_fields="ip_dst,ip_src,tcp_dst,tcp_src");)
+  (ls_in_pre_stateful ), priority=120  , match=(ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb_mark;)
+  (ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg0[[1]] = 0; ct_lb_mark(backends=10.0.0.3:80,20.0.0.3:80; hash_fields="ip_dst,ip_src,tcp_dst,tcp_src");)
 ])
 
 AT_CAPTURE_FILE([sbflows2])
@@ -23633,8 +23628,9 @@  AT_CAPTURE_FILE([sbflows3])
 ovn-sbctl dump-flows sw0 > sbflows3
 AT_CHECK(
   [grep "ip4.dst == 10.0.0.10 && tcp.dst == 80" sbflows3 | grep priority=120 |\
-   sed 's/table=../table=??/'], [0],
-  [  table=??(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(drop;)
+   sed 's/table=../table=??/'], [0], [dnl
+  table=??(ls_in_pre_stateful ), priority=120  , match=(ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(reg1 = 10.0.0.10; reg2[[0..15]] = 80; ct_lb_mark;)
+  table=??(ls_in_lb           ), priority=120  , match=(ct.new && ip4.dst == 10.0.0.10 && tcp.dst == 80), action=(drop;)
 ])
 
 AT_CAPTURE_FILE([sbflows4])