
[ovs-dev] northd: Don't skip the unSNAT stage for traffic towards VIPs.

Message ID 20240226125830.2173125-1-dceara@redhat.com
State Accepted
Series [ovs-dev] northd: Don't skip the unSNAT stage for traffic towards VIPs.

Checks

Context                                 Check    Description
ovsrobot/apply-robot                    success  apply and check: success
ovsrobot/github-robot-_Build_and_Test   success  github build: passed
ovsrobot/github-robot-_ovn-kubernetes   success  github build: passed

Commit Message

Dumitru Ceara Feb. 26, 2024, 12:58 p.m. UTC
Otherwise, in case there's also a SNAT rule that uses the VIP as
external IP, we break sessions initiated from behind the VIP.

This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for
load balancer vips in router ingress pipeline").  That's OK because
commit 384a7c6237da ("northd: Refactor Logical Flows for routers with
DNAT/Load Balancers") addressed the original issue in a better way:

    In the reply direction, the order of traversal of the tables
    "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
    datapath flows that check ct_state in the wrong conntrack zone.
    This is illustrated below where reply traffic enters the physical host
    port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
    DNAT zone and then on to Logical Switch Port zone (22). The third
    flow is incorrectly checking the state from the SNAT zone instead
    of the DNAT zone.

We also add a system test to ensure traffic initiated from behind a VIP
+ SNAT is not broken.

Another nice side effect is that the northd I-P is slightly simplified
because we don't need to track NAT external IPs anymore.

Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
Reported-at: https://issues.redhat.com/browse/FDP-291
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
 northd/en-lflow.c       |  3 +-
 northd/en-lr-nat.c      |  5 ---
 northd/en-lr-nat.h      |  3 --
 northd/en-lr-stateful.c | 51 --------------------------
 northd/en-lr-stateful.h |  9 +----
 northd/northd.c         | 33 +----------------
 tests/ovn-northd.at     | 24 +++++--------
 tests/system-ovn.at     | 80 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 91 insertions(+), 117 deletions(-)

Comments

Numan Siddique March 5, 2024, 2:56 p.m. UTC | #1
On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> Otherwise, in case there's also a SNAT rule that uses the VIP as
> external IP, we break sessions initiated from behind the VIP.
>
> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for
> load balancer vips in router ingress pipeline").  That's OK because
> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with
> DNAT/Load Balancers") addressed the original issue in a better way:
>
>     In the reply direction, the order of traversal of the tables
>     "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
>     datapath flows that check ct_state in the wrong conntrack zone.
>     This is illustrated below where reply traffic enters the physical host
>     port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
>     DNAT zone and then on to Logical Switch Port zone (22). The third
>     flow is incorrectly checking the state from the SNAT zone instead
>     of the DNAT zone.
>
> We also add a system test to ensure traffic initiated from behind a VIP
> + SNAT is not broken.
>
> Another nice side effect is that the northd I-P is slightly simplified
> because we don't need to track NAT external IPs anymore.
>
> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
> Reported-at: https://issues.redhat.com/browse/FDP-291
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>


Thanks for the fix.  It also simplified the lr-nat-stateful code.

Acked-by: Numan Siddique <numans@ovn.org>

Numan

> ---
>  northd/en-lflow.c       |  3 +-
>  northd/en-lr-nat.c      |  5 ---
>  northd/en-lr-nat.h      |  3 --
>  northd/en-lr-stateful.c | 51 --------------------------
>  northd/en-lr-stateful.h |  9 +----
>  northd/northd.c         | 33 +----------------
>  tests/ovn-northd.at     | 24 +++++--------
>  tests/system-ovn.at     | 80 +++++++++++++++++++++++++++++++++++++++++
>  8 files changed, 91 insertions(+), 117 deletions(-)
>
> diff --git a/northd/en-lflow.c b/northd/en-lflow.c
> index f1a83839df..c4b927fb8c 100644
> --- a/northd/en-lflow.c
> +++ b/northd/en-lflow.c
> @@ -174,8 +174,7 @@ lflow_lr_stateful_handler(struct engine_node *node, void *data)
>      struct ed_type_lr_stateful *lr_sful_data =
>          engine_get_input_data("lr_stateful", node);
>
> -    if (!lr_stateful_has_tracked_data(&lr_sful_data->trk_data)
> -        || lr_sful_data->trk_data.vip_nats_changed) {
> +    if (!lr_stateful_has_tracked_data(&lr_sful_data->trk_data)) {
>          return false;
>      }
>
> diff --git a/northd/en-lr-nat.c b/northd/en-lr-nat.c
> index ad11025c69..215d924e42 100644
> --- a/northd/en-lr-nat.c
> +++ b/northd/en-lr-nat.c
> @@ -219,10 +219,6 @@ lr_nat_record_init(struct lr_nat_record *lrnat_rec,
>      lrnat_rec->nbr_uuid = od->nbr->header_.uuid;
>
>      shash_init(&lrnat_rec->snat_ips);
> -    sset_init(&lrnat_rec->external_ips);
> -    for (size_t i = 0; i < od->nbr->n_nat; i++) {
> -        sset_add(&lrnat_rec->external_ips, od->nbr->nat[i]->external_ip);
> -    }
>
>      sset_init(&lrnat_rec->external_macs);
>      lrnat_rec->has_distributed_nat = false;
> @@ -343,7 +339,6 @@ lr_nat_record_clear(struct lr_nat_record *lrnat_rec)
>      }
>
>      free(lrnat_rec->nat_entries);
> -    sset_destroy(&lrnat_rec->external_ips);
>      sset_destroy(&lrnat_rec->external_macs);
>  }
>
> diff --git a/northd/en-lr-nat.h b/northd/en-lr-nat.h
> index 71d3a1782d..81a7b0abd7 100644
> --- a/northd/en-lr-nat.h
> +++ b/northd/en-lr-nat.h
> @@ -64,9 +64,6 @@ struct lr_nat_record {
>
>      bool has_distributed_nat;
>
> -    /* Set of nat external ips on the router. */
> -    struct sset external_ips;
> -
>      /* Set of nat external macs on the router. */
>      struct sset external_macs;
>
> diff --git a/northd/en-lr-stateful.c b/northd/en-lr-stateful.c
> index 6d0192487c..baf1bd2f89 100644
> --- a/northd/en-lr-stateful.c
> +++ b/northd/en-lr-stateful.c
> @@ -82,7 +82,6 @@ static void remove_lrouter_lb_reachable_ips(struct lr_stateful_record *,
>                                              enum lb_neighbor_responder_mode,
>                                              const struct sset *lb_ips_v4,
>                                              const struct sset *lb_ips_v6);
> -static bool lr_stateful_rebuild_vip_nats(struct lr_stateful_record *);
>
>  /* 'lr_stateful' engine node manages the NB logical router LB data.
>   */
> @@ -110,7 +109,6 @@ en_lr_stateful_clear_tracked_data(void *data_)
>      struct ed_type_lr_stateful *data = data_;
>
>      hmapx_clear(&data->trk_data.crupdated);
> -    data->trk_data.vip_nats_changed = false;
>  }
>
>  void
> @@ -198,10 +196,6 @@ lr_stateful_lb_data_handler(struct engine_node *node, void *data_)
>
>              /* Add the lr_stateful_rec rec to the tracking data. */
>              hmapx_add(&data->trk_data.crupdated, lr_stateful_rec);
> -
> -            if (!sset_is_empty(&lr_stateful_rec->vip_nats)) {
> -                data->trk_data.vip_nats_changed = true;
> -            }
>              continue;
>          }
>
> @@ -319,9 +313,6 @@ lr_stateful_lb_data_handler(struct engine_node *node, void *data_)
>
>          HMAPX_FOR_EACH (hmapx_node, &data->trk_data.crupdated) {
>              struct lr_stateful_record *lr_stateful_rec = hmapx_node->data;
> -            if (lr_stateful_rebuild_vip_nats(lr_stateful_rec)) {
> -                data->trk_data.vip_nats_changed = true;
> -            }
>              const struct ovn_datapath *od =
>                  ovn_datapaths_find_by_index(input_data.lr_datapaths,
>                                              lr_stateful_rec->lr_index);
> @@ -360,13 +351,6 @@ lr_stateful_lr_nat_handler(struct engine_node *node, void *data_)
>                  lr_stateful_record_create(&data->table, lrnat_rec, od,
>                                            input_data.lb_datapaths_map,
>                                            input_data.lbgrp_datapaths_map);
> -            if (!sset_is_empty(&lr_stateful_rec->vip_nats)) {
> -                data->trk_data.vip_nats_changed = true;
> -            }
> -        } else {
> -            if (lr_stateful_rebuild_vip_nats(lr_stateful_rec)) {
> -                data->trk_data.vip_nats_changed = true;
> -            }
>          }
>
>          /* Add the lr_stateful_rec rec to the tracking data. */
> @@ -525,11 +509,6 @@ lr_stateful_record_create(struct lr_stateful_table *table,
>          build_lrouter_lb_reachable_ips(lr_stateful_rec, od, lb_dps->lb);
>      }
>
> -    sset_init(&lr_stateful_rec->vip_nats);
> -
> -    if (nbr->n_nat) {
> -        lr_stateful_rebuild_vip_nats(lr_stateful_rec);
> -    }
>      lr_stateful_rec->has_lb_vip = od_has_lb_vip(od);
>
>      hmap_insert(&table->entries, &lr_stateful_rec->key_node,
> @@ -556,7 +535,6 @@ static void
>  lr_stateful_record_destroy(struct lr_stateful_record *lr_stateful_rec)
>  {
>      ovn_lb_ip_set_destroy(lr_stateful_rec->lb_ips);
> -    sset_destroy(&lr_stateful_rec->vip_nats);
>      lflow_ref_destroy(lr_stateful_rec->lflow_ref);
>      free(lr_stateful_rec);
>  }
> @@ -671,32 +649,3 @@ remove_lrouter_lb_reachable_ips(struct lr_stateful_record *lr_stateful_rec,
>                               ip_address);
>      }
>  }
> -
> -static bool
> -lr_stateful_rebuild_vip_nats(struct lr_stateful_record *lr_stateful_rec)
> -{
> -    struct sset old_vip_nats = SSET_INITIALIZER(&old_vip_nats);
> -    sset_swap(&lr_stateful_rec->vip_nats, &old_vip_nats);
> -
> -    const char *external_ip;
> -    SSET_FOR_EACH (external_ip, &lr_stateful_rec->lrnat_rec->external_ips) {
> -        bool is_vip_nat = false;
> -        if (addr_is_ipv6(external_ip)) {
> -            is_vip_nat = sset_contains(&lr_stateful_rec->lb_ips->ips_v6,
> -                                       external_ip);
> -        } else {
> -            is_vip_nat = sset_contains(&lr_stateful_rec->lb_ips->ips_v4,
> -                                       external_ip);
> -        }
> -
> -        if (is_vip_nat) {
> -            sset_add(&lr_stateful_rec->vip_nats, external_ip);
> -        }
> -    }
> -
> -    bool vip_nats_changed = !sset_equals(&lr_stateful_rec->vip_nats,
> -                                         &old_vip_nats);
> -    sset_destroy(&old_vip_nats);
> -
> -    return vip_nats_changed;
> -}
> diff --git a/northd/en-lr-stateful.h b/northd/en-lr-stateful.h
> index 4a07174c31..e15c61e575 100644
> --- a/northd/en-lr-stateful.h
> +++ b/northd/en-lr-stateful.h
> @@ -62,9 +62,6 @@ struct lr_stateful_record {
>      /* Load Balancer vIPs relevant for this datapath. */
>      struct ovn_lb_ip_set *lb_ips;
>
> -    /* sset of vips which are also part of lr nats. */
> -    struct sset vip_nats;
> -
>      /* 'lflow_ref' is used to reference logical flows generated for
>       * this lr_stateful_record.
>       *
> @@ -107,10 +104,6 @@ struct lr_stateful_table {
>  struct lr_stateful_tracked_data {
>      /* Created or updated logical router with LB and/or NAT data. */
>      struct hmapx crupdated; /* Stores 'struct lr_stateful_record'. */
> -
> -    /* Indicates if any router's NATs changed which were also LB vips
> -     * or vice versa. */
> -    bool vip_nats_changed;
>  };
>
>  struct ed_type_lr_stateful {
> @@ -142,7 +135,7 @@ const struct lr_stateful_record *lr_stateful_table_find_by_index(
>  static inline bool
>  lr_stateful_has_tracked_data(struct lr_stateful_tracked_data *trk_data)
>  {
> -    return !hmapx_is_empty(&trk_data->crupdated) || trk_data->vip_nats_changed;
> +    return !hmapx_is_empty(&trk_data->crupdated);
>  }
>
>  static inline bool
> diff --git a/northd/northd.c b/northd/northd.c
> index 2c3560ce2d..4b39137e7c 100644
> --- a/northd/northd.c
> +++ b/northd/northd.c
> @@ -11061,7 +11061,6 @@ build_lrouter_nat_flows_for_lb(
>      struct ds skip_snat_act = DS_EMPTY_INITIALIZER;
>      struct ds force_snat_act = DS_EMPTY_INITIALIZER;
>      struct ds undnat_match = DS_EMPTY_INITIALIZER;
> -    struct ds unsnat_match = DS_EMPTY_INITIALIZER;
>      struct ds gw_redir_action = DS_EMPTY_INITIALIZER;
>
>      ds_clear(match);
> @@ -11107,13 +11106,6 @@ build_lrouter_nat_flows_for_lb(
>      /* Remove the trailing " || ". */
>      ds_truncate(&undnat_match, undnat_match.length - 4);
>
> -    ds_put_format(&unsnat_match, "%s && %s.dst == %s && %s",
> -                  ip_match, ip_match, lb_vip->vip_str, lb->proto);
> -    if (lb_vip->port_str) {
> -        ds_put_format(&unsnat_match, " && %s.dst == %s", lb->proto,
> -                      lb_vip->port_str);
> -    }
> -
>      struct lrouter_nat_lb_flows_ctx ctx = {
>          .lb_vip = lb_vip,
>          .lb = lb,
> @@ -11171,23 +11163,6 @@ build_lrouter_nat_flows_for_lb(
>          if (lb->affinity_timeout) {
>              bitmap_set1(aff_dp_bitmap[type], index);
>          }
> -
> -        if (sset_contains(&lrnat_rec->external_ips, lb_vip->vip_str)) {
> -            /* The load balancer vip is also present in the NAT entries.
> -             * So add a high priority lflow to advance the the packet
> -             * destined to the vip (and the vip port if defined)
> -             * in the S_ROUTER_IN_UNSNAT stage.
> -             * There seems to be an issue with ovs-vswitchd. When the new
> -             * connection packet destined for the lb vip is received,
> -             * it is dnat'ed in the S_ROUTER_IN_DNAT stage in the dnat
> -             * conntrack zone. For the next packet, if it goes through
> -             * unsnat stage, the conntrack flags are not set properly, and
> -             * it doesn't hit the established state flows in
> -             * S_ROUTER_IN_DNAT stage. */
> -            ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_UNSNAT, 120,
> -                                    ds_cstr(&unsnat_match), "next;",
> -                                    &lb->nlb->header_, lb_dps->lflow_ref);
> -        }
>      }
>
>      for (size_t type = 0; type < LROUTER_NAT_LB_FLOW_MAX; type++) {
> @@ -11199,7 +11174,6 @@ build_lrouter_nat_flows_for_lb(
>                                     lr_datapaths, lb_dps->lflow_ref);
>      }
>
> -    ds_destroy(&unsnat_match);
>      ds_destroy(&undnat_match);
>      ds_destroy(&skip_snat_act);
>      ds_destroy(&force_snat_act);
> @@ -15070,12 +15044,7 @@ build_lrouter_nat_defrag_and_lb(
>                                     l3dgw_port, stateless, lflow_ref);
>
>          /* ARP resolve for NAT IPs. */
> -        if (od->is_gw_router) {
> -            /* Add the NAT external_ip to the nat_entries for
> -             * gateway routers. This is required for adding load balancer
> -             * flows.*/
> -            sset_add(&nat_entries, nat->external_ip);
> -        } else {
> +        if (!od->is_gw_router) {
>              if (!sset_contains(&nat_entries, nat->external_ip)) {
>                  /* Drop packets coming in from external that still has
>                   * destination IP equals to the NAT external IP, to avoid loop.
> diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
> index 6fdd761dac..c7f65b9666 100644
> --- a/tests/ovn-northd.at
> +++ b/tests/ovn-northd.at
> @@ -1610,9 +1610,6 @@ AT_CAPTURE_FILE([sbflows])
>  # dnat_and_snat or snat entry.
>  AT_CHECK([grep "lr_in_unsnat" sbflows | ovn_strip_lflows], [0], [dnl
>    table=??(lr_in_unsnat       ), priority=0    , match=(1), action=(next;)
> -  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.1 && tcp && tcp.dst == 8080), action=(next;)
> -  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.4 && udp && udp.dst == 8080), action=(next;)
> -  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.5 && tcp && tcp.dst == 8080), action=(next;)
>    table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 192.168.2.1), action=(ct_snat;)
>    table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 192.168.2.4), action=(ct_snat;)
>  ])
> @@ -1643,9 +1640,6 @@ AT_CAPTURE_FILE([sbflows])
>  # dnat_and_snat or snat entry.
>  AT_CHECK([grep "lr_in_unsnat" sbflows | ovn_strip_lflows], [0], [dnl
>    table=??(lr_in_unsnat       ), priority=0    , match=(1), action=(next;)
> -  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.1 && tcp && tcp.dst == 8080), action=(next;)
> -  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.4 && udp && udp.dst == 8080), action=(next;)
> -  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.5 && tcp && tcp.dst == 8080), action=(next;)
>    table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 192.168.2.1), action=(ct_snat;)
>    table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 192.168.2.4), action=(ct_snat;)
>  ])
> @@ -5868,7 +5862,6 @@ AT_CHECK([grep "lr_in_unsnat" lr0flows | ovn_strip_lflows], [0], [dnl
>    table=??(lr_in_unsnat       ), priority=0    , match=(1), action=(next;)
>    table=??(lr_in_unsnat       ), priority=110  , match=(inport == "lr0-public" && ip4.dst == 172.168.0.10), action=(ct_snat;)
>    table=??(lr_in_unsnat       ), priority=110  , match=(inport == "lr0-sw0" && ip4.dst == 10.0.0.1), action=(ct_snat;)
> -  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 172.168.0.10 && tcp && tcp.dst == 9082), action=(next;)
>    table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.10), action=(ct_snat;)
>    table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.20), action=(ct_snat;)
>    table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.30), action=(ct_snat;)
> @@ -5945,7 +5938,6 @@ AT_CHECK([grep "lr_in_unsnat" lr0flows | ovn_strip_lflows], [0], [dnl
>    table=??(lr_in_unsnat       ), priority=110  , match=(inport == "lr0-public" && ip6.dst == def0::10), action=(ct_snat;)
>    table=??(lr_in_unsnat       ), priority=110  , match=(inport == "lr0-sw0" && ip4.dst == 10.0.0.1), action=(ct_snat;)
>    table=??(lr_in_unsnat       ), priority=110  , match=(inport == "lr0-sw0" && ip6.dst == aef0::1), action=(ct_snat;)
> -  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 172.168.0.10 && tcp && tcp.dst == 9082), action=(next;)
>    table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.10), action=(ct_snat;)
>    table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.20), action=(ct_snat;)
>    table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.30), action=(ct_snat;)
> @@ -12010,7 +12002,7 @@ check ovn-nbctl lb-add lb2 172.168.0.150:80 10.0.0.40:8080
>  check ovn-nbctl lr-lb-add lr0 lb1
>  check ovn-nbctl lr-lb-add lr0 lb2
>
> -# lflow engine should recompute since the nat ip 172.168.0.140
> +# lflow engine should not recompute even if the nat ip 172.168.0.140
>  # is a lb vip.
>  check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
>  check ovn-nbctl --wait=sb lr-nat-add lr0 dnat_and_snat 172.168.0.140 10.0.0.20
> @@ -12019,10 +12011,10 @@ check_engine_stats lr_nat norecompute compute
>  check_engine_stats lr_stateful norecompute compute
>  check_engine_stats sync_to_sb_pb recompute nocompute
>  check_engine_stats sync_to_sb_lb norecompute compute
> -check_engine_stats lflow recompute nocompute
> +check_engine_stats lflow norecompute compute
>  CHECK_NO_CHANGE_AFTER_RECOMPUTE
>
> -# lflow engine should recompute since the nat ip 172.168.0.150
> +# lflow engine should not recompute even if the nat ip 172.168.0.150
>  # is a lb vip.
>  check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
>  check ovn-nbctl --wait=sb lr-nat-add lr0 dnat_and_snat 172.168.0.150 10.0.0.41
> @@ -12031,10 +12023,10 @@ check_engine_stats lr_nat norecompute compute
>  check_engine_stats lr_stateful norecompute compute
>  check_engine_stats sync_to_sb_pb recompute nocompute
>  check_engine_stats sync_to_sb_lb norecompute compute
> -check_engine_stats lflow recompute nocompute
> +check_engine_stats lflow norecompute compute
>  CHECK_NO_CHANGE_AFTER_RECOMPUTE
>
> -# lflow engine should recompute since the deleted nat ip 172.168.0.150
> +# lflow engine should not recompute even if the deleted nat ip 172.168.0.150
>  # is a lb vip.
>  check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
>  check ovn-nbctl --wait=sb lr-nat-del lr0 dnat_and_snat 172.168.0.150
> @@ -12043,10 +12035,10 @@ check_engine_stats lr_nat norecompute compute
>  check_engine_stats lr_stateful norecompute compute
>  check_engine_stats sync_to_sb_pb recompute nocompute
>  check_engine_stats sync_to_sb_lb norecompute compute
> -check_engine_stats lflow recompute nocompute
> +check_engine_stats lflow norecompute compute
>  CHECK_NO_CHANGE_AFTER_RECOMPUTE
>
> -# lflow engine should recompute since the deleted nat ip 172.168.0.140
> +# lflow engine should not recompute even if the deleted nat ip 172.168.0.140
>  # is a lb vip.
>  check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
>  check ovn-nbctl --wait=sb lr-nat-del lr0 dnat_and_snat 172.168.0.140
> @@ -12055,7 +12047,7 @@ check_engine_stats lr_nat norecompute compute
>  check_engine_stats lr_stateful norecompute compute
>  check_engine_stats sync_to_sb_pb recompute nocompute
>  check_engine_stats sync_to_sb_lb norecompute compute
> -check_engine_stats lflow recompute nocompute
> +check_engine_stats lflow norecompute compute
>  CHECK_NO_CHANGE_AFTER_RECOMPUTE
>
>  # Delete the NAT
> diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> index c22c7882fc..2411b0267e 100644
> --- a/tests/system-ovn.at
> +++ b/tests/system-ovn.at
> @@ -12096,6 +12096,86 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
>  /connection dropped.*/d"])
>  AT_CLEANUP
>
> +OVN_FOR_EACH_NORTHD([
> +AT_SETUP([load balancing in gateway router - client behind LB with SNAT])
> +AT_SKIP_IF([test $HAVE_NC = no])
> +AT_KEYWORDS([lb])
> +
> +ovn_start
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-int])
> +
> +check ovs-vsctl \
> +        -- set Open_vSwitch . external-ids:system-id=hv1 \
> +        -- set Open_vSwitch . external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
> +        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
> +        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
> +        -- set bridge br-int fail-mode=secure other-config:disable-in-band=true
> +
> +start_daemon ovn-controller
> +
> +check ovn-nbctl lr-add lr \
> +    -- set logical_router lr options:chassis=hv1
> +check ovn-nbctl lrp-add lr lr-ls1 00:00:00:00:01:00 41.41.41.2/24
> +check ovn-nbctl lrp-add lr lr-ls2 00:00:00:00:02:00 42.42.42.2/24
> +check ovn-nbctl ls-add ls1
> +check ovn-nbctl ls-add ls2
> +
> +check ovn-nbctl lsp-add ls1 ls1-lr
> +check ovn-nbctl lsp-set-addresses ls1-lr 00:00:00:00:01:00
> +check ovn-nbctl lsp-set-type ls1-lr router
> +check ovn-nbctl lsp-set-options ls1-lr router-port=lr-ls1
> +check ovn-nbctl lsp-add ls1 vm1
> +check ovn-nbctl lsp-set-addresses vm1 00:00:00:00:00:01
> +
> +check ovn-nbctl lsp-add ls2 ls2-lr
> +check ovn-nbctl lsp-set-addresses ls2-lr 00:00:00:00:02:00
> +check ovn-nbctl lsp-set-type ls2-lr router
> +check ovn-nbctl lsp-set-options ls2-lr router-port=lr-ls2
> +check ovn-nbctl lsp-add ls2 vm2
> +check ovn-nbctl lsp-set-addresses vm2 00:00:00:00:00:02
> +
> +dnl LB using the router IP connected to vm2 as VIP.
> +check ovn-nbctl lb-add lb-test 42.42.42.2:8080 41.41.41.1:8080 tcp \
> +    -- lr-lb-add lr lb-test
> +
> +dnl SNAT everything coming from vm1 to the router IP (towards vm2).
> +check ovn-nbctl lr-nat-add lr snat 42.42.42.2 41.41.41.1
> +
> +ADD_NAMESPACES(vm1)
> +ADD_VETH(vm1, vm1, br-int, "41.41.41.1/24", "00:00:00:00:00:01", "41.41.41.2")
> +
> +ADD_NAMESPACES(vm2)
> +ADD_VETH(vm2, vm2, br-int, "42.42.42.1/24", "00:00:00:00:00:02", "42.42.42.2")
> +
> +dnl Start a server on vm2.
> +NETNS_DAEMONIZE([vm2], [nc -l -k 42.42.42.1 80], [vm2.pid])
> +
> +dnl Wait for ovn-controller to catch up.
> +wait_for_ports_up
> +check ovn-nbctl --wait=hv sync
> +
> +dnl Test the connection originating something that uses the same source port
> +dnl as the LB VIP.
> +NS_CHECK_EXEC([vm1], [nc -z -p 8080 42.42.42.1 80], 0, [ignore], [ignore])
> +
> +OVS_APP_EXIT_AND_WAIT([ovn-controller])
> +
> +as ovn-sb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as ovn-nb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as northd
> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> +
> +as
> +OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
> +/connection dropped.*/d"])
> +AT_CLEANUP
> +])
> +
>  OVN_FOR_EACH_NORTHD([
>  AT_SETUP([load balancing affinity sessions - auto clear learnt flows])
>  AT_SKIP_IF([test $HAVE_NC = no])
> --
> 2.39.3
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Dumitru Ceara March 6, 2024, 7:13 p.m. UTC | #2
On 3/5/24 15:56, Numan Siddique wrote:
> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> Otherwise, in case there's also a SNAT rule that uses the VIP as
>> external IP, we break sessions initiated from behind the VIP.
>>
>> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for
>> load balancer vips in router ingress pipeline").  That's OK because
>> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with
>> DNAT/Load Balancers") addressed the original issue in a better way:
>>
>>     In the reply direction, the order of traversal of the tables
>>     "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
>>     datapath flows that check ct_state in the wrong conntrack zone.
>>     This is illustrated below where reply traffic enters the physical host
>>     port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
>>     DNAT zone and then on to Logical Switch Port zone (22). The third
>>     flow is incorrectly checking the state from the SNAT zone instead
>>     of the DNAT zone.
>>
>> We also add a system test to ensure traffic initiated from behind a VIP
>> + SNAT is not broken.
>>
>> Another nice side effect is that the northd I-P is slightly simplified
>> because we don't need to track NAT external IPs anymore.
>>
>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
>> Reported-at: https://issues.redhat.com/browse/FDP-291
>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> 
> 
> Thanks for the fix.  It also simplified the lr-nat-stateful code.
> 
> Acked-by: Numan Siddique <numans@ovn.org>
> 

Thanks, Numan!

Applied to main and backported to all branches down to 22.03.

Regards,
Dumitru
Han Zhou Aug. 29, 2024, 4:14 p.m. UTC | #3
On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 3/5/24 15:56, Numan Siddique wrote:
> > On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >> Otherwise, in case there's also a SNAT rule that uses the VIP as
> >> external IP, we break sessions initiated from behind the VIP.
> >>
> >> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for
> >> load balancer vips in router ingress pipeline").  That's OK because
> >> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with
> >> DNAT/Load Balancers") addressed the original issue in a better way:
> >>
> >>     In the reply direction, the order of traversal of the tables
> >>     "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
> >>     datapath flows that check ct_state in the wrong conntrack zone.
> >>     This is illustrated below where reply traffic enters the physical host
> >>     port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
> >>     DNAT zone and then on to Logical Switch Port zone (22). The third
> >>     flow is incorrectly checking the state from the SNAT zone instead
> >>     of the DNAT zone.
> >>
> >> We also add a system test to ensure traffic initiated from behind a VIP
> >> + SNAT is not broken.
> >>
> >> Another nice side effect is that the northd I-P is slightly simplified
> >> because we don't need to track NAT external IPs anymore.
> >>
> >> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
> >> Reported-at: https://issues.redhat.com/browse/FDP-291
> >> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >
> >
> > Thanks for the fix.  It also simplified the lr-nat-stateful code.
> >
> > Acked-by: Numan Siddique <numans@ovn.org>
> >
>
> Thanks, Numan!
>
> Applied to main and backported to all branches down to 22.03.
>
> Regards,
> Dumitru
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Hi Dumitru, Numan, Tim and folks,

I noticed that the HW offload of k8s nodePort traffic is broken due to
this change. The reason is that for client to nodePort (LB with VIP
being the node IP) traffic, when the packet is going through the
unSNAT stage in the SNAT CT zone, since the entry is never committed
to the SNAT zone, it will have CT state returned as "new", which
prevents the HW offload from working for such packets.
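
A rough way to see this on a node (just a sketch; assuming a recent OVS,
and the exact match fields can differ per datapath):

# Datapath flows for the VIP traffic keep matching ct_state(+new-est)
# in the SNAT zone recirculation; such flows are not offloaded to
# hardware.
ovs-appctl dpctl/dump-flows -m | grep 'ct_state(+new'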

At the moment I have to revert this change in our downstream. For the
problem that was fixed by this change [0], I think we can avoid it by
separating the port range of SNAT and DNAT. For DNAT, the nodePort
range in k8s is configured by API-server option:

--service-node-port-range <a string in the form 'N1-N2'> Default: 30000-32767

For SNAT, it can be configured in the OVN's NAT table's
external_port_range column, and we can choose something like
10000-30000.

An extra benefit of this is that it reduces a CT recirc.
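
For illustration, the SNAT side of the separation could be configured
roughly like this (a sketch; the router name and IPs are hypothetical):

# SNAT pod traffic to the node IP, as usual.
ovn-nbctl lr-nat-add gr0 snat 172.16.0.10 10.244.0.0/24
# Constrain SNAT source ports so they stay below the nodePort range.
nat=$(ovn-nbctl --bare --columns=_uuid find nat type=snat external_ip=172.16.0.10)
ovn-nbctl set nat "$nat" external_port_range="10000-29999"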

Alternative solutions are:
Alternative1: Always commit to both SNAT and DNAT zones. This would
introduce unnecessary cost of CT entries and extra CT recirc.
Alternative2: Use a common zone for SNAT and DNAT. But there are other
issues reported for using the common zone [1]

Could you let me know if you have other thoughts on this?

[0] https://issues.redhat.com/browse/FDP-291
[1] b8c40e7593 https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4

Thanks,
Han
Dumitru Ceara Aug. 30, 2024, 9:26 a.m. UTC | #4
On 8/29/24 18:14, Han Zhou wrote:
> On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> On 3/5/24 15:56, Numan Siddique wrote:
>>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>>>
>>>> Otherwise, in case there's also a SNAT rule that uses the VIP as
>>>> external IP, we break sessions initiated from behind the VIP.
>>>>
>>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for
>>>> load balancer vips in router ingress pipeline").  That's OK because
>>>> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with
>>>> DNAT/Load Balancers") addressed the original issue in a better way:
>>>>
>>>>     In the reply direction, the order of traversal of the tables
>>>>     "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
>>>>     datapath flows that check ct_state in the wrong conntrack zone.
>>>>     This is illustrated below where reply traffic enters the physical host
>>>>     port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
>>>>     DNAT zone and then on to Logical Switch Port zone (22). The third
>>>>     flow is incorrectly checking the state from the SNAT zone instead
>>>>     of the DNAT zone.
>>>>
>>>> We also add a system test to ensure traffic initiated from behind a VIP
>>>> + SNAT is not broken.
>>>>
>>>> Another nice side effect is that the northd I-P is slightly simplified
>>>> because we don't need to track NAT external IPs anymore.
>>>>
>>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
>>>> Reported-at: https://issues.redhat.com/browse/FDP-291
>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>>
>>>
>>> Thanks for the fix.  It also simplified the lr-nat-stateful code.
>>>
>>> Acked-by: Numan Siddique <numans@ovn.org>
>>>
>>
>> Thanks, Numan!
>>
>> Applied to main and backported to all branches down to 22.03.
>>
>> Regards,
>> Dumitru
>>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 
> Hi Dumitru, Numan, Tim and folks,
> 

Hi Han,

> I noticed that the HW offload of k8s nodePort traffic is broken due to
> this change. The reason is that for client to nodePort (LB with VIP
> being the node IP) traffic, when the packet is going through the
> unSNAT stage in the SNAT CT zone, since the entry is never committed
> to the SNAT zone, it will have CT state returned as "new", which
> prevents the HW offload from working for such packets.
> 

Sorry about that, I forgot we don't always commit in the SNAT zone.

> At the moment I have to revert this change in our downstream. For the
> problem that was fixed by this change [0], I think we can avoid it by
> separating the port range of SNAT and DNAT. For DNAT, the nodePort
> range in k8s is configured by API-server option:
> 
> --service-node-port-range <a string in the form 'N1-N2'> Default: 30000-32767
> 
> For SNAT, it can be configured in the OVN's NAT table's
> external_port_range column, and we can choose something like
> 10000-30000.
> 

Tim, does this look OK to you?  If it's acceptable to limit the SNAT
port range this workaround should be fine.

> An extra benefit of this is that it reduces a CT recirc.
> 
> Alternative solutions are:
> Alternative1: Always commit to both SNAT and DNAT zones. This would
> introduce unnecessary cost of CT entries and extra CT recirc.

Extra recirculation aside, I would actually love it if we could use this
alternative.  I think it's the "most correct" option.  I think it would
allow us to avoid other workarounds like:

https://github.com/ovn-org/ovn/commit/40136a2f2c8

or

https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/

I do understand the worry about the extra recirculation though.  In the
HWOL context does that cause visible performance impact?

We'd probably have to do more performance testing without HWOL to figure
out the impact in the software datapath.

> Alternative2: Use a common zone for SNAT and DNAT. But there are other
> issues reported for using the common zone [1]
> 
> Could you let me know if you have other thoughts on this?
> 

On a related note, I know it has been discussed in different settings
but I don't think this ever moved forward: would it be possible for
NVIDIA to help out with automatically testing HWOL impact for incoming
patches?

Maybe we could add some "simple" system-like tests that ensure that traffic
is correctly offloaded in common scenarios?  Alternatively, I guess we
could also tag a subset of the existing system tests and just run those
on actual hardware?
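
Something like this, maybe (sketch only; the "hwol" keyword is made up):

AT_SETUP([load balancing in gateway router - client behind LB with SNAT])
AT_KEYWORDS([lb hwol])

and then a hardware CI could run just the tagged subset with something
like:

make check-kernel TESTSUITEFLAGS='-k hwol'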

It's quite simple (AFAIU) for external CIs to report status on each OVN
patch posted on patchwork.  That would at least allow us to flag this
kind of breakage early (even before they get merged).

What do you think?

> [0] https://issues.redhat.com/browse/FDP-291
> [1] b8c40e7593 https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
> 
> Thanks,
> Han
> 

Thanks,
Dumitru
Tim Rozet Aug. 30, 2024, 8:04 p.m. UTC | #5
Hi Han,
My understanding is that the unSNAT stage was originally being skipped for
LB ports, then Dumitru changed it to attempt unSNAT, and this breaks HWOL
for ingress connections that never need to get unSNAT'ed and therefore not
committed. In the OVNK use case for pods simply connecting egress, almost
all packets will be SNAT'ed to the node IP. Therefore on egress reply, we
always enter the unSNAT stage for these packets. So this skipping unSNAT
stage seems to be specific to LB related traffic. While I agree splitting
the port range would work, I think it makes more sense to just always
commit in the unSNAT stage for all traffic. I get there is a performance
hit there, but the datapath pipeline seems more consistent and I think
that outweighs the cost of committing LB traffic.

Thanks,
Tim Rozet
Red Hat OpenShift Networking Team


On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dceara@redhat.com> wrote:

> On 8/29/24 18:14, Han Zhou wrote:
> > On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >> On 3/5/24 15:56, Numan Siddique wrote:
> >>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>>>
> >>>> Otherwise, in case there's also a SNAT rule that uses the VIP as
> >>>> external IP, we break sessions initiated from behind the VIP.
> >>>>
> >>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for
> >>>> load balancer vips in router ingress pipeline").  That's OK because
> >>>> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with
> >>>> DNAT/Load Balancers") addressed the original issue in a better way:
> >>>>
> >>>>     In the reply direction, the order of traversal of the tables
> >>>>     "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
> >>>>     datapath flows that check ct_state in the wrong conntrack zone.
> >>>>     This is illustrated below where reply traffic enters the physical host
> >>>>     port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
> >>>>     DNAT zone and then on to Logical Switch Port zone (22). The third
> >>>>     flow is incorrectly checking the state from the SNAT zone instead
> >>>>     of the DNAT zone.
> >>>>
> >>>> We also add a system test to ensure traffic initiated from behind a VIP
> >>>> + SNAT is not broken.
> >>>>
> >>>> Another nice side effect is that the northd I-P is slightly simplified
> >>>> because we don't need to track NAT external IPs anymore.
> >>>>
> >>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
> >>>> Reported-at: https://issues.redhat.com/browse/FDP-291
> >>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >>>
> >>>
> >>> Thanks for the fix.  It also simplified the lr-nat-stateful code.
> >>>
> >>> Acked-by: Numan Siddique <numans@ovn.org>
> >>>
> >>
> >> Thanks, Numan!
> >>
> >> Applied to main and backported to all branches down to 22.03.
> >>
> >> Regards,
> >> Dumitru
> >>
> >> _______________________________________________
> >> dev mailing list
> >> dev@openvswitch.org
> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
> > Hi Dumitru, Numan, Tim and folks,
> >
>
> Hi Han,
>
> > I noticed that the HW offload of k8s nodePort traffic is broken due to
> > this change. The reason is that for client to nodePort (LB with VIP
> > being the node IP) traffic, when the packet is going through the
> > unSNAT stage in the SNAT CT zone, since the entry is never committed
> > to the SNAT zone, it will have CT state returned as "new", which
> > prevents the HW offload from working for such packets.
> >
>
> Sorry about that, I forgot we don't always commit in the SNAT zone.
>
> > At the moment I have to revert this change in our downstream. For the
> > problem that was fixed by this change [0], I think we can avoid it by
> > separating the port range of SNAT and DNAT. For DNAT, the nodePort
> > range in k8s is configured by API-server option:
> >
> > --service-node-port-range <a string in the form 'N1-N2'> Default: 30000-32767
> >
> > For SNAT, it can be configured in the OVN's NAT table's
> > external_port_range column, and we can choose something like
> > 10000-30000.
> >
>
> Tim, does this look OK to you?  If it's acceptable to limit the SNAT
> port range this workaround should be fine.
>
> > An extra benefit of this is that it reduces a CT recirc.
> >
> > Alternative solutions are:
> > Alternative1: Always commit to both SNAT and DNAT zones. This would
> > introduce unnecessary cost of CT entries and extra CT recirc.
>
> Extra recirculation aside, I would actually love it if we could use this
> alternative.  I think it's the "most correct" option.  I think it would
> allow us to avoid other workarounds like:
>
> https://github.com/ovn-org/ovn/commit/40136a2f2c8
>
> or
>
> https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/
>
> I do understand the worry about the extra recirculation though.  In the
> HWOL context does that cause visible performance impact?
>
> We'd probably have to do more performance testing without HWOL to figure
> out the impact in the software datapath.
>
> > Alternative2: Use a common zone for SNAT and DNAT. But there are other
> > issues reported for using the common zone [1]
> >
> > Could you let me know if you have other thoughts on this?
> >
>
> On a related note, I know it has been discussed in different settings
> but I don't think this ever moved forward: would it be possible for
> NVIDIA to help out with automatically testing HWOL impact for incoming
> patches?
>
> Maybe we could add some "simple" system-like tests that ensure that traffic
> is correctly offloaded in common scenarios?  Alternatively, I guess we
> could also tag a subset of the existing system tests and just run those
> on actual hardware?
>
> It's quite simple (AFAIU) for external CIs to report status on each OVN
> patch posted on patchwork.  That would at least allow us to flag this
> kind of breakage early (even before they get merged).
>
> What do you think?
>
> > [0] https://issues.redhat.com/browse/FDP-291
> > [1] b8c40e7593 https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
> >
> > Thanks,
> > Han
> >
>
> Thanks,
> Dumitru
>
>
Han Zhou May 15, 2025, 4:24 p.m. UTC | #6
On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <trozet@redhat.com> wrote:
>
> Hi Han,
> My understanding is that the unSNAT stage was originally being skipped
> for LB ports, then Dumitru changed it to attempt unSNAT, and this breaks
> HWOL for ingress connections that never need to get unSNAT'ed and therefore
> not committed. In the OVNK use case for pods simply connecting egress,
> almost all packets will be SNAT'ed to the node IP. Therefore on egress
> reply, we always enter the unSNAT stage for these packets. So this skipping
> unSNAT stage seems to be specific to LB related traffic. While I agree
> splitting the port range would work, I think it makes more sense to just
> always commit in the unSNAT stage for all traffic. I get there is a
> performance hit there, but the datapath pipeline seems more consistent and
> I think that outweighs the cost of committing LB traffic.

Hi Tim and folks,

+ @Ales Musil <amusil@redhat.com>

We've spent a lot of effort on the alternative solution: Always commit to
both SNAT and DNAT zones. Unfortunately, the HW offload issue discussed in
this thread is still unresolved. I'd like to discuss the current situation
and ask for opinions.

A while ago, Ales tried to fix it with this commit: 800fd0681579 ("northd:
Add LR option to commit all traffic."). When the ct-commit-all option is
enabled, it commits the connection to both SNAT and DNAT zones. However, in
the ovn-k8s nodeport scenario (a very typical use case) the HW offload is
still broken. In that scenario, a packet may come from external network
with dest IP being a node's IP, which is also used as an OVN gateway router
IP. The dest IP is firstly DNATed to a LB backend IP, and then the src IP
is SNATed to another LRP's IP on this router. The src/dst IPs in different
stages are:

Original: EXT IP <-> GR IP1
After DNAT: EXT IP <-> BE IP
After SNAT: GR IP2 <-> BE IP

With the ct-commit-all option, it commits after DNAT and after SNAT:
DNAT zone: EXT IP <-> GR IP1/BE IP
SNAT zone: EXT IP/GR IP2 <-> BE IP

So, at the UNSNAT stage, the packet traverses the SNAT zone with the
original header, which will still always be "new" state, which breaks HW
offload.
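
To spell out the gap (a sketch; the zone numbers are hypothetical):

# Committed by ct-commit-all:
#   DNAT zone: orig=(EXT IP -> GR IP1), reply=(BE IP -> EXT IP)
#   SNAT zone: orig=(EXT IP -> BE IP),  reply=(BE IP -> GR IP2)
# Tuple looked up at the UNSNAT stage, never committed, hence "new":
#   SNAT zone: orig=(EXT IP -> GR IP1)
ovs-appctl dpctl/dump-conntrack | grep '<EXT IP>'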

I tried to fix it by also committing to the SNAT zone before DNAT (at the
POST_UNSNAT stage). Now there are 3 entries committed:
SNAT zone: EXT IP <-> GR IP1
DNAT zone: EXT IP <-> GR IP1/BE IP
SNAT zone: EXT IP/GR IP2 <-> BE IP

HW offload worked, but it breaks at least the scenario of "DNAT and SNAT on
distributed router - E/W", which tests communication between two workloads
using "floating IPs". In this scenario, both workloads are under different
LSes behind the same router, and both have their own floating IPs
(dnat_and_snat) on the public side of the LR. When one of them sends
packets to the other using the floating IP, what's expected is:

Original: overlay IP1 <-> floating IP2
After SNAT: floating IP1 <-> floating IP2
After DNAT: floating IP1 <-> overlay IP2

With my fix, the packet is firstly SNATed with CT entry in SNAT zone:
SNAT: overlay IP1/floating IP1 <-> floating IP2

This happens in the egress pipeline of the LR, and then it enters into the
ingress pipeline again because this is like a hairpin (the dest IP is a
dnat_and_snat IP on the same router). It will hit the UNSNAT stage first,
traverse the CT zone, and for the first packet the reply is not seen yet,
so the CT state is still new, so in the POST_UNSNAT we will try to commit
it (because of the issue I am trying to fix above). Because the connection
floating IP1 <-> floating IP2 is conflicting with the above committed entry
"SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT action fails and
the packet is dropped.
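
For concreteness, a minimal sketch of this scenario (names and IPs are
hypothetical):

# Two workloads behind the same distributed router, each with its own
# floating IP:
ovn-nbctl lr-nat-add lr0 dnat_and_snat 172.16.1.10 10.0.1.2
ovn-nbctl lr-nat-add lr0 dnat_and_snat 172.16.1.20 10.0.2.2
# Traffic from 10.0.1.2 to 172.16.1.20 is SNATed to 172.16.1.10 and
# hairpins back into the router ingress pipeline, where the POST_UNSNAT
# commit of 172.16.1.10 <-> 172.16.1.20 conflicts with the SNAT zone
# entry committed on the way out, so the first packet is dropped.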

There is still a way to solve this, probably by skipping the unnecessary
UNSNAT traversal and the conflict commit if the source IP is one of the
snat/dnat_and_snat IPs of this same router.

However, as we can see, the NAT-related pipelines are already very complex
today, and with the above changes they would become even more complex. I
wonder if this is maintainable in the long run.

In addition, I am not even sure if this is the last problem of my patch. It
is not easy to tell because today the ct-commit-all feature is an option
and by default disabled. So all our test cases by default don't test for
this option except the 4 test cases added by Ales's patch. I had to
manually set this option to true as default and of course many test cases
will fail. Most of them fail because the expected flows won't match, but
I can't tell for sure unless I update every case with new flows and then
retest them. Before doing that I just examined the ones I think might be
impacted by the patch and the "DNAT and SNAT on distributed router - E/W"
is the first broken one I found. So there can be other issues not
discovered yet.

Moreover, we should also keep in mind the performance penalty of these
extra CT commit & traversal.

I'd like to listen to your opinion before moving forward. Is this worth the
effort, or shall we just keep the pipeline simple (relatively) and avoid
the original problem (https://issues.redhat.com/browse/FDP-291) by
separating the port range of SNAT and DNAT, the first proposal at the
beginning of this thread?

Thanks,
Han

>
> Thanks,
> Tim Rozet
> Red Hat OpenShift Networking Team
>
>
> On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> On 8/29/24 18:14, Han Zhou wrote:
>> > On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote:
>> >>
>> >> On 3/5/24 15:56, Numan Siddique wrote:
>> >>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote:
>> >>>>
>> >>>> Otherwise, in case there's also a SNAT rule that uses the VIP as
>> >>>> external IP, we break sessions initiated from behind the VIP.
>> >>>>
>> >>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for
>> >>>> load balancer vips in router ingress pipeline").  That's OK because
>> >>>> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with
>> >>>> DNAT/Load Balancers") addressed the original issue in a better way:
>> >>>>
>> >>>>     In the reply direction, the order of traversal of the tables
>> >>>>     "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
>> >>>>     datapath flows that check ct_state in the wrong conntrack zone.
>> >>>>     This is illustrated below where reply traffic enters the physical host
>> >>>>     port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
>> >>>>     DNAT zone and then on to Logical Switch Port zone (22). The third
>> >>>>     flow is incorrectly checking the state from the SNAT zone instead
>> >>>>     of the DNAT zone.
>> >>>>
>> >>>> We also add a system test to ensure traffic initiated from behind a VIP
>> >>>> + SNAT is not broken.
>> >>>>
>> >>>> Another nice side effect is that the northd I-P is slightly simplified
>> >>>> because we don't need to track NAT external IPs anymore.
>> >>>>
>> >>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
>> >>>> Reported-at: https://issues.redhat.com/browse/FDP-291
>> >>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>> >>>
>> >>>
>> >>> Thanks for the fix.  It also simplified the lr-nat-stateful code.
>> >>>
>> >>> Acked-by: Numan Siddique <numans@ovn.org>
>> >>>
>> >>
>> >> Thanks, Numan!
>> >>
>> >> Applied to main and backported to all branches down to 22.03.
>> >>
>> >> Regards,
>> >> Dumitru
>> >>
>> >> _______________________________________________
>> >> dev mailing list
>> >> dev@openvswitch.org
>> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>> >
>> > Hi Dumitru, Numan, Tim and folks,
>> >
>>
>> Hi Han,
>>
>> > I noticed that the HW offload of k8s nodePort traffic is broken due to
>> > this change. The reason is that for client to nodePort (LB with VIP
>> > being the node IP) traffic, when the packet is going through the
>> > unSNAT stage in the SNAT CT zone, since the entry is never committed
>> > to the SNAT zone, it will have CT state returned as "new", which
>> > prevents the HW offload from working for such packets.
>> >
>>
>> Sorry about that, I forgot we don't always commit in the SNAT zone.
>>
>> > At the moment I have to revert this change in our downstream. For the
>> > problem that was fixed by this change [0], I think we can avoid it by
>> > separating the port range of SNAT and DNAT. For DNAT, the nodePort
>> > range in k8s is configured by API-server option:
>> >
>> > --service-node-port-range <a string in the form 'N1-N2'> Default: 30000-32767
>> >
>> > For SNAT, it can be configured in the OVN's NAT table's
>> > external_port_range column, and we can choose something like
>> > 10000-30000.
>> >
>>
>> Tim, does this look OK to you?  If it's acceptable to limit the SNAT
>> port range this workaround should be fine.
>>
>> > An extra benefit of this is that it reduces a CT recirc.
>> >
>> > Alternative solutions are:
>> > Alternative1: Always commit to both SNAT and DNAT zones. This would
>> > introduce unnecessary cost of CT entries and extra CT recirc.
>>
>> Extra recirculation aside, I would actually love it if we could use this
>> alternative.  I think it's the "most correct" option.  I think it would
>> allow us to avoid other workarounds like:
>>
>> https://github.com/ovn-org/ovn/commit/40136a2f2c8
>>
>> or
>>
>> https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/
>>
>> I do understand the worry about the extra recirculation though.  In the
>> HWOL context does that cause visible performance impact?
>>
>> We'd probably have to do more performance testing without HWOL to figure
>> out the impact in the software datapath.
>>
>> > Alternative2: Use a common zone for SNAT and DNAT. But there are other
>> > issues reported for using the common zone [1]
>> >
>> > Could you let me know if you have other thoughts on this?
>> >
>>
>> On a related note, I know it has been discussed in different settings
>> but I don't think this ever moved forward: would it be possible for
>> NVIDIA to help out with automatically testing HWOL impact for incoming
>> patches?
>>
>> Maybe we could add some "simple" system-like tests that ensure that traffic
>> is correctly offloaded in common scenarios?  Alternatively, I guess we
>> could also tag a subset of the existing system tests and just run those
>> on actual hardware?
>>
>> It's quite simple (AFAIU) for external CIs to report status on each OVN
>> patch posted on patchwork.  That would at least allow us to flag this
>> kind of breakage early (even before they get merged).
>>
>> What do you think?
>>
>> > [0] https://issues.redhat.com/browse/FDP-291
>> > [1] b8c40e7593 https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
>> >
>> > Thanks,
>> > Han
>> >
>>
>> Thanks,
>> Dumitru
Dumitru Ceara May 16, 2025, 11:22 a.m. UTC | #7
On 5/15/25 6:24 PM, Han Zhou wrote:
> On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <trozet@redhat.com> wrote:
>>
>> Hi Han,
>> My understanding is that the unSNAT stage was originally being skipped
>> for LB ports, then Dumitru changed it to attempt unSNAT, and this breaks
>> HWOL for ingress connections that never need to get unSNAT'ed and therefore
>> not committed. In the OVNK use case for pods simply connecting egress,
>> almost all packets will be SNAT'ed to the node IP. Therefore on egress
>> reply, we always enter the unSNAT stage for these packets. So this skipping
>> unSNAT stage seems to be specific to LB related traffic. While I agree
>> splitting the port range would work, I think it makes more sense to just
>> always commit in the unSNAT stage for all traffic. I get there is a
>> performance hit there, but the datapath pipeline seems more consistent and
>> I think that outweighs the cost of committing LB traffic.
> 
> Hi Tim and folks,
> 

Hi Han,

> + @Ales Musil <amusil@redhat.com>
> 
> We've spent a lot of effort on the alternative solution: Always commit to
> both SNAT and DNAT zones. Unfortunately, the HW offload issue discussed in
> this thread is still unresolved. I'd like to discuss the current situation
> and ask for opinions.
> 

Thanks again for working on this!

> A while ago, Ales tried to fix it with this commit: 800fd0681579 ("northd:
> Add LR option to commit all traffic."). When the ct-commit-all option is
> enabled, it commits the connection to both SNAT and DNAT zones. However, in
> the ovn-k8s nodeport scenario (a very typical use case) the HW offload is
> still broken. In that scenario, a packet may come from external network
> with dest IP being a node's IP, which is also used as an OVN gateway router
> IP. The dest IP is firstly DNATed to a LB backend IP, and then the src IP
> is SNATed to another LRP's IP on this router. The src/dst IPs in different
> stages are:
> 
> Original: EXT IP <-> GR IP1
> After DNAT: EXT IP <-> BE IP
> After SNAT: GR IP2 <-> BE IP
> 
> With the ct-commit-all option, it commits after DNAT and after SNAT:
> DNAT zone: EXT IP <-> GR IP1/BE IP
> SNAT zone: EXT IP/GR IP2 <-> BE IP
> 
> So, at the UNSNAT stage, the packet traverses the SNAT zone with the
> original header, which will still always be "new" state, which breaks HW
> offload.
> 

Ah, now I understand the problem you were facing.

> I tried to fix it by also committing to the SNAT zone before DNAT (at the
> POST_UNSNAT stage). Now there are 3 entries committed:
> SNAT zone: EXT IP <-> GR IP1
> DNAT zone: EXT IP <-> GR IP1/BE IP
> SNAT zone: EXT IP/GR IP2 <-> BE IP
> 
> HW offload worked, but it breaks at least the scenario of "DNAT and SNAT on
> distributed router - E/W", which tests communication between two workloads
> using "floating IPs". In this scenario, both workloads are under different
> LSes behind the same router, and both have their own floating IPs
> (dnat_and_snat) on the public side of the LR. When one of them sends
> packets to the other using the floating IP, what's expected is:
> 
> Original: overlay IP1 <-> floating IP2
> After SNAT: floating IP1 <-> floating IP2
> After DNAT: floating IP1 <-> overlay IP2
> 
> With my fix, the packet is firstly SNATed with CT entry in SNAT zone:
> SNAT: overlay IP1/floating IP1 <-> floating IP2
> 
> This happens in the egress pipeline of the LR, and then it enters the
> ingress pipeline again because this is like a hairpin (the dest IP is a
> dnat_and_snat IP on the same router). It will hit the UNSNAT stage first
> and traverse the CT zone; for the first packet the reply has not been
> seen yet, so the CT state is still "new", and in the POST_UNSNAT stage we
> will try to commit it (because of the issue I am trying to fix above).
> Because the connection floating IP1 <-> floating IP2 conflicts with the
> committed entry "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT
> action fails and the packet is dropped.
> 
> There is still a way to solve this, probably by skipping the unnecessary
> UNSNAT traversal and the conflicting commit if the source IP is one of the
> snat/dnat_and_snat IPs of this same router.
> 
> However, as we see, the NAT-related pipelines are already very complex
> today, and with the above changes they would become even more complex. I
> wonder if this is maintainable in the long run.
> 
> In addition, I am not even sure if this is the last problem with my patch.
> It is not easy to tell because today the ct-commit-all feature is optional
> and disabled by default, so none of our test cases exercise it except the
> 4 test cases added by Ales's patch. I had to manually set this option to
> true as the default, and of course many test cases then fail. Most of them
> fail because the expected flows no longer match, but I can't tell for sure
> without updating every case with new flows and retesting. Before doing
> that I just examined the ones I thought might be impacted by the patch,
> and "DNAT and SNAT on distributed router - E/W" is the first broken one I
> found. So there can be other issues not discovered yet.
> 
> Moreover, we should also keep in mind the performance penalty of these
> extra CT commits and traversals.
> 

But basically this happens because we use the same SNAT zone for all
router ports.
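
Spelling that E/W scenario out (made-up names and addresses, just for
illustration):

    # Two workloads behind the same distributed router, each with its own
    # floating IP on the public side:
    ovn-nbctl lr-nat-add lr0 dnat_and_snat 172.16.0.101 10.0.1.2
    ovn-nbctl lr-nat-add lr0 dnat_and_snat 172.16.0.102 10.0.2.2
    # Traffic from 10.0.1.2 to 172.16.0.102 is SNATed to 172.16.0.101 on
    # the way out and hairpins back into the router's ingress pipeline.

Both NAT rules end up in that single shared SNAT zone, which is where the
conflicting commit comes from.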

In an ideal world, wouldn't using a unique CT SNAT and CT DNAT zone per
router port fix it?  E.g.:

host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE: 12) [LR] LRP2 (SNAT ZONE: 21,
DNAT ZONE: 22) ---- host2

That'd also imply that we commit that session (from host1 to host2 -
potentially via LB and SNAT) to 4 different zones (SNAT and DNAT for
LRP1 and SNAT and DNAT for LRP2).  And we'd also have to do 4 different
CT lookups for all packets traversing the router.

I didn't try any of this but I have the impression that this would solve
the HWOL issue: all subsequent packets are part of known and committed
sessions in all 4 conntrack zones.

So what if we change the implementation of the "commit-all" feature to
add the missing ct commits?  Wouldn't that work?

It feels like (lookup performance hit aside) this is the correct way to
forward traffic, i.e., never traverse a CT zone without having a
committed entry in that zone.
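
To illustrate with hand-written OpenFlow (not flows ovn-northd generates
today, and ignoring the ct_state matches real flows would need; zone
numbers match the diagram above):

    # host1 -> host2: traverse and commit in all four per-LRP zones, so
    # replies are never looked up in a zone with no committed entry.
    ovs-ofctl add-flow br0 "table=0,ip, actions=ct(commit,zone=11,nat,table=1)"
    ovs-ofctl add-flow br0 "table=1,ip, actions=ct(commit,zone=12,nat,table=2)"
    ovs-ofctl add-flow br0 "table=2,ip, actions=ct(commit,zone=22,nat,table=3)"
    ovs-ofctl add-flow br0 "table=3,ip, actions=ct(commit,zone=21,nat),normal"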

> I'd like to hear your opinions before moving forward. Is this worth the
> effort, or shall we just keep the pipeline simple (relatively) and avoid
> the original problem (https://issues.redhat.com/browse/FDP-291) by
> separating the port range of SNAT and DNAT, the first proposal at the
> beginning of this thread?
> 

Now that I understand the problem you're describing better, it might be
worth reverting the fix for FDP-291 like you said, until we find a proper
generic solution.

However, we can't afford to do that without separating the port range of
SNAT and DNAT like you said [0] because we'd be regressing (at least
from OpenShift perspective).

Because selecting the nodeport (DNAT) range is done through a k8s API
server cmdline argument, users can in theory choose whatever range they
want.  I guess we'd need a command line argument in ovn-kubernetes to
allow users (operators, etc.) to inform ovn-kubernetes what SNAT port
range to use so that it doesn't conflict with the KAPI nodeport range.

Tim, you're the expert on this side, what are your thoughts on the matter?

Regards,
Dumitru

[0] https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html

Dumitru Ceara May 16, 2025, 11:59 a.m. UTC | #8
On 5/16/25 1:22 PM, Dumitru Ceara wrote:
> Now that I understand the problem you're describing better, it might be
> worth reverting the fix for FDP-291 like you said, until we find a proper
> generic solution.
> 
> However, we can't afford to do that without separating the port range of
> SNAT and DNAT like you said [0] because we'd be regressing (at least
> from OpenShift perspective).
> 

For reference, Tim, this is the OpenShift bug that needed the fix for
FDP-291:

https://issues.redhat.com/browse/OCPBUGS-25889

Tim Rozet May 19, 2025, 3:41 p.m. UTC | #9
Hi All,
Sorry Han that you are still facing issues with this problem. It seems like
the best option would be to configure the range of ephemeral ports for OVN
to use when doing SNAT. Note, in OpenShift the nodeport service range is
configured as 30000-32767, and the ephemeral port range in the kernel is
set to:
net.ipv4.ip_local_port_range = 32768 60999

I think it would make sense to revert and add an option to make this range
configurable in OVN. Then OVN-Kubernetes should pass along this config to
OVN.
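
(For reference, the kernel side of this can be read back with a quick
sanity check on a node; the nodeport range is whatever
--service-node-port-range was set to on the kube-apiserver:

    sysctl net.ipv4.ip_local_port_range
    # net.ipv4.ip_local_port_range = 32768    60999

and whatever SNAT range OVN ends up using would have to stay clear of at
least the nodeport range.)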

As Dumitru mentioned, we can't just revert the previous patch without a
substitute fix in place; otherwise we would risk regressing.

Thanks,

Tim Rozet
Red Hat OpenShift Networking Team


Han Zhou May 20, 2025, 5:37 a.m. UTC | #10
On Mon, May 19, 2025 at 8:41 AM Tim Rozet <trozet@redhat.com> wrote:
>
> Hi All,
> Sorry Han that you are still facing issues with this problem. It seems
> like the best option would be to configure the range of ephemeral ports
> for OVN to use when doing SNAT. Note, in OpenShift the nodeport service
> range is configured as 30000-32767, and the ephemeral port range in the
> kernel is set to:
> net.ipv4.ip_local_port_range = 32768 60999
>
> I think it would make sense to revert and add an option to make this
> range configurable in OVN. Then OVN-Kubernetes should pass along this
> config to OVN.
>
> As Dumitru mentioned, we can't just revert the previous patch without a
> substitute fix in place; otherwise we would risk regressing.

Thanks Tim! For the substitute fix, I assume you meant a fix in ovn-k8s,
right?

OVN already provides a configurable SNAT port range through the
external_port_range column in the NAT table. See:
https://man7.org/linux/man-pages/man5/ovn-nb.5.html

       external_port_range: string
              L4 source port range

              Range of ports, from which a port number will be picked
              that will replace the source port of to be NATed packet.
              This is basically PAT (port address translation).

              Value of the column is in the format, port_lo-port_hi. For
              example: external_port_range : "1-30000"

              Valid range of ports is 1-65535.

Does this satisfy what you asked for?
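
For example (illustrative only, picking an arbitrary existing snat rule):

    # Restrict the rule's source ports to a range below the k8s nodeport
    # range:
    nat=$(ovn-nbctl --bare --columns=_uuid find nat type=snat | head -1)
    ovn-nbctl set nat "$nat" external_port_range="10000-29999"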

If this is agreed, I think we can move forward with the steps below:
1. Update ovn-kubernetes to set the external_port_range, if specific ranges
are configured. (This is only needed when the k8s-configured nodeport range
conflicts with the kernel ephemeral port range, and it seems unnecessary
for OpenShift?)
2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
towards VIPs" in OVN. Also revert some of the changes for ct-commit-all:
e.g., there is no need to commit DNATted connections to the SNAT zone, and
vice versa.

What do you think?

Thanks,
Han

Tim Rozet May 20, 2025, 1:20 p.m. UTC | #11
On Tue, May 20, 2025 at 1:37 AM Han Zhou <zhouhan@gmail.com> wrote:

> Thanks Tim! For the substitute fix, I assume you meant a fix in ovn-k8s,
> right?
>

> OVN already provides a configurable SNAT port range through the
> external_port_range column in the NAT table. See:
> https://man7.org/linux/man-pages/man5/ovn-nb.5.html
>
>        external_port_range: string
>               L4 source port range
>
>               Range of ports, from which a port number will be picked
>               that will replace the source port of to be NATed packet.
>               This is basically PAT (port address translation).
>
>               Value of the column is in the format, port_lo-port_hi. For
>               example: external_port_range : "1-30000"
>
>               Valid range of ports is 1-65535.
>
> Does this satisfy the need you asked?
>

Oh, I didn't realize this is already supported.


>
> If this is agreed, I think we can move forward with the steps below:
> 1. Update ovn-kubernetes to set the external_port_range, if specific
> ranges are configured. (Only needed when the k8s-configured nodeport
> range conflicts with the kernel ephemeral port range, and it seems
> unnecessary for OpenShift?)
>
Yeah, we should provide a config knob to set the range though, in case
ovnk users do not use the default range. It looks like we already have a
test case to cover this scenario as well:
https://github.com/openshift/ovn-kubernetes/pull/2120/commits/34b1a4be8cf9d4bf61d65464015c1b3cdb30adfd

I'll post a PR to do this part.
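
For illustration, a rough sketch of what that configuration could
translate to on the OVN side (the router/NAT names and the chosen range
below are made up for the example):

    # SNAT pod egress traffic on gateway router GR_node1.
    ovn-nbctl lr-nat-add GR_node1 snat 172.18.0.10 10.244.0.0/24

    # Restrict the SNAT source-port range so it cannot collide with the
    # k8s nodeport range (30000-32767 by default); here it simply mirrors
    # the kernel's own ephemeral range.
    nat=$(ovn-nbctl --bare --columns=_uuid find NAT external_ip=172.18.0.10)
    ovn-nbctl set NAT "$nat" external_port_range="32768-60999"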

> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
> towards VIPs" in OVN. Also revert some of the changes for ct-commit-all:
> e.g. no need to commit DNATted connections to SNAT zone, and vice versa.
>
Sure, after 1 gets in, you can try to post a PR reverting the change and
let's see if it passes the above test.


>
> What do you think?
>
> Thanks,
> Han
Ales Musil May 20, 2025, 1:40 p.m. UTC | #12
On Tue, May 20, 2025 at 7:37 AM Han Zhou <zhouhan@gmail.com> wrote:

> [...]
>
> If this is agreed, I think we can move forward with the steps below:
> 1. Update ovn-kubernetes to set the external_port_range, if specific
> ranges are configured. (Only needed when the k8s-configured nodeport
> range conflicts with the kernel ephemeral port range, and it seems
> unnecessary for OpenShift?)
> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
> towards VIPs" in OVN. Also revert some of the changes for ct-commit-all:
> e.g. no need to commit DNATted connections to SNAT zone, and vice versa.
>

Hi Han,

I don't think we should revert (or partially revert) ct-commit-all; the
whole point of that knob is what you have described: commit to the SNAT
zone if there is any SNAT configured and commit to the DNAT zone if there
is any DNAT configured. If you still feel there are some changes we could
make to this approach, we can discuss it. There is one system test called
"Commit all - UDN" which needs to pass unchanged if there is any
revert/change to ct-commit-all.

Thanks,
Ales
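
For reference, a minimal sketch of exercising the knob together with that
test (assuming the Logical_Router option key is the ct-commit-all name
used above; adjust to the actual schema):

    # Enable commit-all behaviour on a logical router.
    ovn-nbctl set logical_router lr0 options:ct-commit-all=true

    # Run the UDN system test from the OVN tree.
    make check-kernel TESTSUITEFLAGS='-k "Commit all - UDN"'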

>
> What do you think?
>
> Thanks,
Tim Rozet May 28, 2025, 11:58 a.m. UTC | #13
Hi Han,
I tagged you on https://github.com/ovn-kubernetes/ovn-kubernetes/pull/5265.
Please review it when you have time.

Thanks,
Tim Rozet
Red Hat OpenShift Networking Team


Han Zhou June 2, 2025, 5:14 p.m. UTC | #14
On Wed, May 28, 2025 at 4:59 AM Tim Rozet <trozet@redhat.com> wrote:
>
> Hi Han,
> I tagged you on https://github.com/ovn-kubernetes/ovn-kubernetes/pull/5265.
> Please review it when you have time.
>
> Thanks,
> Tim Rozet
> Red Hat OpenShift Networking Team
>

Thanks Tim. I posted a comment last week. Please take a look.

I also sent the OVN revert patch today:
https://mail.openvswitch.org/pipermail/ovs-dev/2025-June/423753.html
Hope we can resolve the ovn-k PR soon and then this one can get merged.

>
> On Tue, May 20, 2025 at 9:41 AM Ales Musil <amusil@redhat.com> wrote:
>>
>>
>>
>> On Tue, May 20, 2025 at 7:37 AM Han Zhou <zhouhan@gmail.com> wrote:
>>>
>>>
>>>
>>> On Mon, May 19, 2025 at 8:41 AM Tim Rozet <trozet@redhat.com> wrote:
>>> >
>>> > Hi All,
>>> > Sorry Han that you are still facing issues with this problem. It
seems like the best option would be to configure the range of ephemeral
ports for OVN to use when doing an SNAT. Note, in OpenShift the nodeport
service range is configured 30000–32767, and the ephemeral port range in
the kernel is set to:
>>> > net.ipv4.ip_local_port_range = 32768 60999
>>> >
>>> > I think it would make sense to revert and add an option to make this
range configurable in OVN. Then OVN-Kubernetes should pass along this
config to OVN.
>>> >
>>> > As Dumitru mentioned, we can't just revert the previous patch without
a substitute fix in place, otherwise we would risk regressing.
>>>
>>> Thanks Tim! For the substitute fix, I assume you meant a fix in
ovn-k8s, right?
>>>
>>> OVN already provides configurable SNAT port range through the
external_port_range column in the NAT table. See:
https://man7.org/linux/man-pages/man5/ovn-nb.5.html
>>>
>>>        external_port_range: string
>>>               L4 source port range
>>>
>>>               Range of ports, from which a port number will be picked
>>>               that will replace the source port of to be NATed packet.
>>>               This is basically PAT (port address translation).
>>>
>>>               Value of the column is in the format, port_lo-port_hi. For
>>>               example: external_port_range : "1-30000"
>>>
>>>               Valid range of ports is 1-65535.
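>>>
>>> For example, a minimal sketch of setting it on an existing SNAT rule (the
>>> NAT row UUID is hypothetical, and the range is only an illustration):
>>>
>>>        ovn-nbctl set NAT <nat-uuid> external_port_range="10000-29999"
>>>
>>> which keeps the SNAT source ports clear of a 30000-32767 nodeport range.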
>>>
>>> Does this satisfy the need you asked?
>>>
>>> If this is agreed, I think we can move forward with the steps below:
>>> 1. Update ovn-kubernetes to set the external_port_range, if specific
ranges are configured. (only needed when the k8s-configured nodeport range
conflicts with the kernel ephemeral port range, and it seems unnecessary
for OpenShift?)
>>> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
towards VIPs" in OVN. Also revert some of the changes for ct-commit-all:
e.g. no need to commit DNATted connections to SNAT zone, and vice versa.
>>
>>
>> Hi Han,
>>
>> I don't think we should revert/partially revert ct-commit-all, the whole
point of that knob is what you have described, commit to SNAT zone if there
is any SNAT configured and commit to DNAT zone if there is any DNAT
configured. If you still feel like there are some changes that we could
make to this approach we can discuss it. There is one system test called
"Commit all - UDN" which needs to pass unchanged if there is any
revert/change to ct-commit-all.
>>

Hi Ales, sorry for the confusion. I didn't mean to revert the major
ct-commit-all change that was required to fix the UDN issue reported, which
was supposed to commit all traffic to DNAT zone if there are stateful DNAT
and all traffic to SNAT zone if there are stateful SNAT. In one of the
revisions of your patch, I commented that to solve the HW offload issue
discussed in this thread, we shouldn't distinguish between DNAT and SNAT
and always commit to both zones if there are stateful NATs. Now that
solving the HW offload issue is no longer a goal of your ct-commit-all
patch, we might bring back the separation between SNAT and DNAT, so that in
simpler scenarios (where only DNAT or SNAT exists, but not both) we could
avoid the extra CT commit, which might help reduce some performance
penalty. Shall I leave this to you if you think this makes sense and want
to follow up?
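
To make that concrete, a rough sketch in logical-flow terms (stage names from
the existing LR pipeline; the matches and actions are simplified and
illustrative, not the exact flows northd generates):

    table=lr_in_dnat,       match=(ct.new && ip), action=(/* commit to DNAT zone */)
    table=lr_out_post_snat, match=(ct.new && ip), action=(/* commit to SNAT zone */)

With the separation restored, a router with only stateful DNAT/LB rules would
install just the first commit, and a router with only stateful SNAT just the
second, instead of unconditionally doing both.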

Thanks,
Han

>> Thanks,
>> Ales
>>>
>>>
>>> What do you think?
>>>
>>> Thanks,
>>> Han
>>>
>>> >
>>> > Thanks,
>>> >
>>> > Tim Rozet
>>> > Red Hat OpenShift Networking Team
>>> >
>>> >
>>> > On Fri, May 16, 2025 at 7:59 AM Dumitru Ceara <dceara@redhat.com>
wrote:
>>> >>
>>> >> On 5/16/25 1:22 PM, Dumitru Ceara wrote:
>>> >> > On 5/15/25 6:24 PM, Han Zhou wrote:
>>> >> >> On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <trozet@redhat.com>
wrote:
>>> >> >>>
>>> >> >>> Hi Han,
>>> >> >>> My understanding is that the unSNAT stage was originally being
skipped
>>> >> >> for LB ports, then Dumitru changed it to attempt unSNAT, and this
breaks
>>> >> >> HWOL for ingress connections that never need to get unSNAT'ed and
therefore
>>> >> >> not committed. In the OVNK use case for pods simply connecting
egress,
>>> >> >> almost all packets will be SNAT'ed to the node IP. Therefore on
egress
>>> >> >> reply, we always enter the unSNAT stage for these packets. So
this skipping
>>> >> >> unSNAT stage seems to be specific to LB related traffic. While I
agree
>>> >> >> splitting the port range would work, I think it makes more sense
to just
>>> >> >> always commit in the unSNAT stage for all traffic. I get there is
a
>>> >> >> performance hit there, but the datapath pipeline seems more
consistent and
>>> >> >> I think it outweighs the cost of committing LB traffic.
>>> >> >>
>>> >> >> Hi Tim and folks,
>>> >> >>
>>> >> >
>>> >> > Hi Han,
>>> >> >
>>> >> >> + @Ales Musil <amusil@redhat.com>
>>> >> >>
>>> >> >> We've spent a lot of effort on the alternative solution: Always
commit to
>>> >> >> both SNAT and DNAT zones. Unfortunately, the HW offload issue
discussed in
>>> >> >> this thread is still unresolved. I'd like to discuss the current
situation
>>> >> >> and ask for opinions.
>>> >> >>
>>> >> >
>>> >> > Thanks again for working on this!
>>> >> >
>>> >> >> A while ago, Ales tried to fix it with this commit: 800fd0681579
("northd:
>>> >> >> Add LR option to commit all traffic."). When the ct-commit-all
option is
>>> >> >> enabled, it commits the connection to both SNAT and DNAT zones.
However, in
>>> >> >> the ovn-k8s nodeport scenario (a very typical use case) the HW
offload is
>>> >> >> still broken. In that scenario, a packet may come from external
network
>>> >> >> with dest IP being a node's IP, which is also used as an OVN
gateway router
>>> >> >> IP. The dest IP is firstly DNATed to a LB backend IP, and then
the src IP
>>> >> >> is SNATed to another LRP's IP on this router. The src/dst IPs in
different
>>> >> >> stages are:
>>> >> >>
>>> >> >> Original: EXT IP <-> GR IP1
>>> >> >> After DNAT: EXT IP <-> BE IP
>>> >> >> After SNAT: GR IP2 <-> BE IP
>>> >> >>
>>> >> >> With the ct-commit-all option, it commits after DNAT and after
SNAT:
>>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
>>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
>>> >> >>
>>> >> >> So, at the UNSNAT stage, the packet traverses the SNAT zone with
the
>>> >> >> original header, which will still always be "new" state, which
breaks HW
>>> >> >> offload.
>>> >> >>
>>> >> >
>>> >> > Ah, now I understand the problem you were facing.
>>> >> >
>>> >> >> I tried to fix it by also committing to the SNAT zone before DNAT
(at the
>>> >> >> POST_SNAT stage). Now there are 3 entries committed:
>>> >> >> SNAT zone: EXT IP <-> GR IP1
>>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
>>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
>>> >> >>
>>> >> >> HW offload worked, but it breaks at least the scenario of "DNAT
and SNAT on
>>> >> >> distributed router - E/W", which tests communication between two
workloads
>>> >> >> using "floating IPs". In this scenario, both workloads are under
different
>>> >> >> LSes behind the same router, and both have their own floating IPs
>>> >> >> (dnat_and_snat) on the public side of the LR. When one of them
sends
>>> >> >> packets to the other using the floating IP, what's expected is:
>>> >> >>
>>> >> >> Original: overlay IP1 <-> floating IP2
>>> >> >> After SNAT: floating IP1 <-> floating IP2
>>> >> >> After DNAT: floating IP1 <-> overlay IP2
>>> >> >>
>>> >> >> With my fix, the packet is firstly SNATed with CT entry in SNAT
zone:
>>> >> >> SNAT: overlay IP1/floating IP1 <-> floating IP2
>>> >> >>
>>> >> >> This happens in the egress pipeline of the LR, and then it enters
into the
>>> >> >> ingress pipeline again because this is like a hairpin (the dest
IP is a
>>> >> >> dnat_and_snat IP on the same router). It will hit the UNSNAT
stage first,
>>> >> >> traverse the CT zone, and for the first packet the reply is not
seen yet,
>>> >> >> so the CT state is still new, so in the POST_UNSNAT we will try
to commit
>>> >> >> it (because of the issue I am trying to fix above). Because the
connection
>>> >> >> floating IP1 <-> floating IP2 is conflicting with the above
committed entry
>>> >> >> "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT action
fails and
>>> >> >> the packet is dropped.
>>> >> >>
>>> >> >> There is still a way to solve this, probably by skipping the
unnecessary
>>> >> >> UNSNAT traversal and the conflict commit if the source IP is one
of the
>>> >> >> snat/dnat_and_snat IPs of this same router.
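>>> >> >>
>>> >> >> (Purely illustrative sketch of such a bypass, with hypothetical
>>> >> >> priority and address handling:
>>> >> >>     table=lr_in_unsnat,
>>> >> >>     match=(ip4.src == {this router's snat/dnat_and_snat IPs}),
>>> >> >>     action=(next;)
>>> >> >> i.e. skip the SNAT zone traversal and the conflicting commit for
>>> >> >> hairpinned floating-IP traffic.)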
>>> >> >>
>>> >> >> However, as we see, the NAT related pipelines have been very
complex today,
>>> >> >> and with the above changes it would become even much more
complex. I wonder
>>> >> >> if this is maintainable in the long run.
>>> >> >>
>>> >> >> In addition, I am not even sure if this is the last problem of my
patch. It
>>> >> >> is not easy to tell because today the ct-commit-all feature is an
option
>>> >> >> and by default disabled. So all our test cases by default don't
test for
>>> >> >> this option except the 4 test cases added by Ales's patch. I had
to
>>> >> >> manually set this option to true as default and of course many
test cases
>>> >> >> will fail. Most of them fail because the expected flows won't
match, but
>>> >> >> I can't tell for sure unless updating every case with new flows
and then
>>> >> >> retest them. Before doing that I just examined ones I think might
be
>>> >> >> impacted by the patch and the "DNAT and SNAT on distributed
router - E/W"
>>> >> >> is the first broken one I found. So there can be other issues not
>>> >> >> discovered yet.
>>> >> >>
>>> >> >> Moreover, we should also keep in mind the performance penalty of
these
>>> >> >> extra CT commits & traversals.
>>> >> >>
>>> >> >
>>> >> > But basically this happens because we use the same SNAT zone for
all
>>> >> > router ports.
>>> >> >
>>> >> > In an ideal world wouldn't using a unique CT SNAT and CT DNAT zone
per
>>> >> > router port fix it?  E.g.:
>>> >> >
>>> >> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE:12) [LR] LRP2 (SNAT
ZONE: 21,
>>> >> > DNAT ZONE: 22) ---- host2
>>> >> >
>>> >> > That'd also imply that we commit that session (from host1 to host2
-
>>> >> > potentially via LB and SNAT) to 4 different zones (SNAT and DNAT
for
>>> >> > LRP1 and SNAT and DNAT for LRP2).  And we'd also have to do 4
different
>>> >> > CT lookups for all packets traversing the router.
>>> >> >
>>> >> > I didn't try any of this but I have the impression that this would
solve
>>> >> > the HWOL issue: all subsequent packets are part of known and
committed
>>> >> > sessions in all 4 conntrack zones.
>>> >> >
>>> >> > So what if we change the implementation of the "commit-all"
feature to
>>> >> > add the missing ct commits?  Wouldn't that work?
>>> >> >
>>> >> > It feels like (lookup performance hit aside) this is the correct
way to
>>> >> > forward traffic, i.e., never traverse a CT zone without having a
>>> >> > committed entry in that zone.
>>> >> >
>>> >> >> I'd like to listen to your opinion before moving forward. Is this
worth the
>>> >> >> effort, or shall we just keep the pipeline simple (relatively)
and avoid
>>> >> >> the original problem (https://issues.redhat.com/browse/FDP-291) by
>>> >> >> separating the port range of SNAT and DNAT, the first proposal at
the
>>> >> >> beginning of this thread?
>>> >> >>
>>> >> >
>>> >> > Now that I understand the problem you're describing better it
might be
>>> >> > worth reverting the fix for FDP-291 like you said until we find a
proper
>>> >> > generic solution.
>>> >> >
>>> >> > However, we can't afford to do that without separating the port
range of
>>> >> > SNAT and DNAT like you said [0] because we'd be regressing (at
least
>>> >> > from OpenShift perspective).
>>> >> >
>>> >>
>>> >> For reference, Tim, this is the OpenShift bug that needed the fix for
>>> >> FDP-291:
>>> >>
>>> >> https://issues.redhat.com/browse/OCPBUGS-25889
>>> >>
>>> >> > Because selecting the nodeport (DNAT) range is done through a k8s
API
>>> >> > server cmdline argument, users can in theory choose
>>> >> > whatever range they want.  I guess we'd need a command line
argument in
>>> >> > ovn-kubernetes to allow users (operators, etc.) to inform
ovn-kubernetes
>>> >> > what SNAT port range to use so that it doesn't conflict with the
KAPI
>>> >> > nodeport range.
>>> >> >
>>> >> > Tim, you're the expert on this side, what are your thoughts on the
matter?
>>> >> >
>>> >> > Regards,
>>> >> > Dumitru
>>> >> >
>>> >> > [0]
https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html
>>> >> >
>>> >> >> Thanks,
>>> >> >> Han
>>> >> >>
>>> >> >>>
>>> >> >>> Thanks,
>>> >> >>> Tim Rozet
>>> >> >>> Red Hat OpenShift Networking Team
>>> >> >>>
>>> >> >>>
>>> >> >>> On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dceara@redhat.com>
wrote:
>>> >> >>>>
>>> >> >>>> On 8/29/24 18:14, Han Zhou wrote:
>>> >> >>>>> On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <
dceara@redhat.com>
>>> >> >> wrote:
>>> >> >>>>>>
>>> >> >>>>>> On 3/5/24 15:56, Numan Siddique wrote:
>>> >> >>>>>>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <
dceara@redhat.com>
>>> >> >> wrote:
>>> >> >>>>>>>>>>>>>> Otherwise, in case there's also a SNAT rule that uses
the
>>> >> >> VIP as
>>> >> >>>>>>>> external IP, we break sessions initiated from behind the
VIP.
>>> >> >>>>>>>>
>>> >> >>>>>>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip
unsnat flows
>>> >> >> for
>>> >> >>>>>>>> load balancer vips in router ingress pipeline").  That's OK
because
>>> >> >>>>>>>> commit 384a7c6237da ("northd: Refactor Logical Flows for
routers
>>> >> >> with
>>> >> >>>>>>>> DNAT/Load Balancers") addressed the original issue in a
better way:
>>> >> >>>>>>>>
>>> >> >>>>>>>>     In the reply direction, the order of traversal of the
tables
>>> >> >>>>>>>>     "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds
incorrect
>>> >> >>>>>>>>     datapath flows that check ct_state in the wrong
conntrack zone.
>>> >> >>>>>>>>     This is illustrated below where reply traffic enters the
>>> >> >> physical host
>>> >> >>>>>>>>     port (6) and traverses DNAT zone (14), SNAT zone
(default),
>>> >> >> back to the
>>> >> >>>>>>>>     DNAT zone and then on to Logical Switch Port zone (22).
The
>>> >> >> third
>>> >> >>>>>>>>     flow is incorrectly checking the state from the SNAT
zone
>>> >> >> instead
>>> >> >>>>>>>>     of the DNAT zone.
>>> >> >>>>>>>>
>>> >> >>>>>>>> We also add a system test to ensure traffic initiated from
behind a
>>> >> >> VIP
>>> >> >>>>>>>> + SNAT is not broken.
>>> >> >>>>>>>>
>>> >> >>>>>>>> Another nice side effect is that the northd I-P is slightly
>>> >> >> simplified
>>> >> >>>>>>>> because we don't need to track NAT external IPs anymore.
>>> >> >>>>>>>>
>>> >> >>>>>>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load
>>> >> >> balancer vips in router ingress pipeline")
>>> >> >>>>>>>> Reported-at: https://issues.redhat.com/browse/FDP-291
>>> >> >>>>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>> Thanks for the fix.  It also simplified the lr-nat-stateful
code.
>>> >> >>>>>>>
>>> >> >>>>>>> Acked-by: Numan Siddique <numans@ovn.org>
>>> >> >>>>>>>
>>> >> >>>>>>
>>> >> >>>>>> Thanks, Numan!
>>> >> >>>>>>
>>> >> >>>>>> Applied to main and backported to all branches down to 22.03.
>>> >> >>>>>>
>>> >> >>>>>> Regards,
>>> >> >>>>>> Dumitru
>>> >> >>>>>>
>>> >> >>>>>> _______________________________________________
>>> >> >>>>>> dev mailing list
>>> >> >>>>>> dev@openvswitch.org
>>> >> >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>> >> >>>>>
>>> >> >>>>> Hi Dumitru, Numan, Tim and folks,
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>> Hi Han,
>>> >> >>>>
>>> >> >>>>> I noticed that the HW offload of k8s nodePort traffic is
broken due to
>>> >> >>>>> this change. The reason is that for client to nodePort (LB
with VIP
>>> >> >>>>> being the node IP) traffic, when the packet is going through
the
>>> >> >>>>> unSNAT stage in the SNAT CT zone, since the entry is never
committed
>>> >> >>>>> to the SNAT zone, it will have CT state returned as "new",
which
>>> >> >>>>> prevents HW offload from working for such packets.
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>> Sorry about that, I forgot we don't always commit in the SNAT
zone.
>>> >> >>>>
>>> >> >>>>> At the moment I have to revert this change in our downstream.
For the
>>> >> >>>>> problem that was fixed by this change [0], I think we can
avoid it by
>>> >> >>>>> separating the port range of SNAT and DNAT. For DNAT, the
nodePort
>>> >> >>>>> range in k8s is configured by API-server option:
>>> >> >>>>>
>>> >> >>>>> --service-node-port-range <a string in the form 'N1-N2'>
Default:
>>> >> >> 30000-32767
>>> >> >>>>>
>>> >> >>>>> For SNAT, it can be configured in the OVN's NAT table's
>>> >> >>>>> external_port_range column, and we can choose something like
>>> >> >>>>> 10000-30000.
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>> Tim, does this look OK to you?  If it's acceptable to limit the
SNAT
>>> >> >>>> port range this workaround should be fine.
>>> >> >>>>
>>> >> >>>>> An extra benefit of this is that it reduces a CT recirc.
>>> >> >>>>>
>>> >> >>>>> Alternative solutions are:
>>> >> >>>>> Alternative1: Always commit to both SNAT and DNAT zones. This
would
>>> >> >>>>> introduce unnecessary cost of CT entries and extra CT recirc.
>>> >> >>>>
>>> >> >>>> Extra recirculation aside, I would actually love it if we could
use this
>>> >> >>>> alternative.  I think it's the "most correct" option.  I think
it would
>>> >> >>>> allow us to avoid other workarounds like:
>>> >> >>>>
>>> >> >>>> https://github.com/ovn-org/ovn/commit/40136a2f2c8
>>> >> >>>>
>>> >> >>>> or
>>> >> >>>>
>>> >> >>>>
>>> >> >>
https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/
>>> >> >>>>
>>> >> >>>> I do understand the worry about the extra recirculation
though.  In the
>>> >> >>>> HWOL context does that cause visible performance impact?
>>> >> >>>>
>>> >> >>>> We'd probably have to do more performance testing without HWOL
to figure
>>> >> >>>> out the impact in the software datapath.
>>> >> >>>>
>>> >> >>>>> Alternative2: Use a common zone for SNAT and DNAT. But there
are other
>>> >> >>>>> issues reported for using the common zone [1]
>>> >> >>>>>
>>> >> >>>>> Could you let me know if you have other thoughts on this?
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>> On a related note, I know it has been discussed in different
settings
>>> >> >>>> but I don't think this ever moved forward: would it be possible
for
>>> >> >>>> NVIDIA to help out with automatically testing HWOL impact for
incoming
>>> >> >>>> patches?
>>> >> >>>>
>>> >> >>>> Maybe we could add some "simple" system-like tests that ensure
traffic
>>> >> >>>> is correctly offloaded in common scenarios?  Alternatively, I
guess we
>>> >> >>>> could also tag a subset of the existing system tests and just
run those
>>> >> >>>> on actual hardware?
>>> >> >>>>
>>> >> >>>> It's quite simple (AFAIU) for external CIs to report status on
each OVN
>>> >> >>>> patch posted on patchwork.  That would at least allow us to
flag this
>>> >> >>>> kind of breakage early (even before they get merged).
>>> >> >>>>
>>> >> >>>> What do you think?
>>> >> >>>>
>>> >> >>>>> [0] https://issues.redhat.com/browse/FDP-291
>>> >> >>>>> [1] b8c40e7593
>>> >> >>
https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
>>> >> >>>>>
>>> >> >>>>> Thanks,
>>> >> >>>>> Han
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>> Thanks,
>>> >> >>>> Dumitru
>>> >> >> _______________________________________________
>>> >> >> dev mailing list
>>> >> >> dev@openvswitch.org
>>> >> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

>
Ales Musil June 3, 2025, 1:22 p.m. UTC | #15
On Mon, Jun 2, 2025 at 7:14 PM Han Zhou <zhouhan@gmail.com> wrote:

>
>
> On Wed, May 28, 2025 at 4:59 AM Tim Rozet <trozet@redhat.com> wrote:
> >
> > Hi Han,
> > I tagged you on
> https://github.com/ovn-kubernetes/ovn-kubernetes/pull/5265. Please review
> it when you have time.
> >
> > Thanks,
> > Tim Rozet
> > Red Hat OpenShift Networking Team
> >
>
> Thanks Tim. I posted a comment last week. Please take a look.
>
> I also sent the OVN revert patch today:
> https://mail.openvswitch.org/pipermail/ovs-dev/2025-June/423753.html
> Hope we can resolve the ovn-k PR soon and then this one can get merged.
>
> >
> > On Tue, May 20, 2025 at 9:41 AM Ales Musil <amusil@redhat.com> wrote:
> >>
> >>
> >>
> >> On Tue, May 20, 2025 at 7:37 AM Han Zhou <zhouhan@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>> On Mon, May 19, 2025 at 8:41 AM Tim Rozet <trozet@redhat.com> wrote:
> >>> >
> >>> > Hi All,
> >>> > Sorry Han that you are still facing issues with this problem. It
> seems like the best option would be to configure the range of ephemeral
> ports for OVN to use when doing an SNAT. Note, in OpenShift the nodeport
> service range is configured 30000–32767, and the ephemeral port range in
> the kernel is set to:
> >>> > net.ipv4.ip_local_port_range = 32768 60999
> >>> >
> >>> > I think it would make sense to revert and add an option to make this
> range configurable in OVN. Then OVN-Kubernetes should pass along this
> config to OVN.
> >>> >
> >>> > As Dumitru mentioned, we can't just revert the previous patch
> without a substitute fix in place, otherwise we would risk regressing.
> >>>
> >>> Thanks Tim! For the substitute fix, I assume you meant a fix in
> ovn-k8s, right?
> >>>
> >>> OVN already provides a configurable SNAT port range through the
> external_port_range column in the NAT table. See:
> https://man7.org/linux/man-pages/man5/ovn-nb.5.html
> >>>
> >>>        external_port_range: string
> >>>               L4 source port range
> >>>
> >>>               Range of ports, from which a port number will be picked
> >>>               that will replace the source port of to be NATed packet.
> >>>               This is basically PAT (port address translation).
> >>>
> >>>               Value of the column is in the format, port_lo-port_hi.
> For
> >>>               example: external_port_range : "1-30000"
> >>>
> >>>               Valid range of ports is 1-65535.
> >>>
> >>> Does this satisfy the need you asked?
> >>>
> >>> If this is agreed, I think we can move forward with the steps below:
> >>> 1. Update ovn-kubernetes to set the external_port_range, if specific
> ranges are configured. (only needed when the k8s-configured nodeport range
> conflicts with the kernel ephemeral port range, and it seems unnecessary
> for OpenShift?)
> >>> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
> towards VIPs" in OVN. Also revert some of the changes for ct-commit-all:
> e.g. no need to commit DNATted connections to SNAT zone, and vice versa.
> >>
> >>
> >> Hi Han,
> >>
> >> I don't think we should revert/partially revert ct-commit-all, the
> whole point of that knob is what you have described, commit to SNAT zone if
> there is any SNAT configured and commit to DNAT zone if there is any DNAT
> configured. If you still feel like there are some changes that we could
> make to this approach we can discuss it. There is one system test called
> "Commit all - UDN" which needs to pass unchanged if there is any
> revert/change to ct-commit-all.
> >>
>
> Hi Ales, sorry for the confusion.
>

Hi Han,

I didn't mean to revert the major ct-commit-all change that was required to
> fix the UDN issue reported, which was supposed to commit all traffic to
> DNAT zone if there are stateful DNAT and all traffic to SNAT zone if there
> are stateful SNAT. In one of the revisions of your patch, I commented that
> to solve the HW offload issue discussed in this thread, we shouldn't
> distinguish between DNAT and SNAT and always commit to both zones if there
> are stateful NATs. Now that solving the HW offload issue is no longer a
> goal of your ct-commit-all patch, we might bring back the separation
> between SNAT and DNAT, so that in simpler scenarios (where only DNAT or
> SNAT exists, but not both) we could avoid the extra CT commit, which might
> help reduce some performance penalty. Shall I leave this to you if you
> think this makes sense and want to follow up?
>

I was thinking about looking into the proposal from Dumitru with a zone per
port, because we might need that anyway for UDNs. If that doesn't work out,
we can then decide which bits could be simplified. Any objections?
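
Rough sketch of what the zone-per-port idea would mean for a single
host1 -> host2 connection through the LR (zone numbers borrowed from
Dumitru's example above; the ordering is just one possible reading of the
proposal, not a worked-out design):

    ct(zone=11)  LRP1 SNAT zone  - unSNAT lookup, commit
    ct(zone=12)  LRP1 DNAT zone  - DNAT lookup, commit
    ct(zone=22)  LRP2 DNAT zone  - unDNAT lookup, commit
    ct(zone=21)  LRP2 SNAT zone  - SNAT lookup, commit

i.e. every zone the packet traverses ends up with a committed entry, which is
what the HWOL case needs.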

Thanks,
Ales


> Thanks,
> Han
>
> >> Thanks,
> >> Ales
> >>>
> >>>
> >>> What do you think?
> >>>
> >>> Thanks,
> >>> Han
> >>>
> >>> >
> >>> > Thanks,
> >>> >
> >>> > Tim Rozet
> >>> > Red Hat OpenShift Networking Team
> >>> >
> >>> >
> >>> > On Fri, May 16, 2025 at 7:59 AM Dumitru Ceara <dceara@redhat.com>
> wrote:
> >>> >>
> >>> >> On 5/16/25 1:22 PM, Dumitru Ceara wrote:
> >>> >> > On 5/15/25 6:24 PM, Han Zhou wrote:
> >>> >> >> On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <trozet@redhat.com>
> wrote:
> >>> >> >>>
> >>> >> >>> Hi Han,
> >>> >> >>> My understanding is that the unSNAT stage was originally being
> skipped
> >>> >> >> for LB ports, then Dumitru changed it to attempt unSNAT, and
> this breaks
> >>> >> >> HWOL for ingress connections that never need to get unSNAT'ed
> and therefore
> >>> >> >> not committed. In the OVNK use case for pods simply connecting
> egress,
> >>> >> >> almost all packets will be SNAT'ed to the node IP. Therefore on
> egress
> >>> >> >> reply, we always enter the unSNAT stage for these packets. So
> this skipping
> >>> >> >> unSNAT stage seems to be specific to LB related traffic. While I
> agree
> >>> >> >> splitting the port range would work, I think it makes more sense
> to just
> >>> >> >> always commit in the unSNAT stage for all traffic. I get there
> is a
> >>> >> >> performance hit there, but the datapath pipeline seems more
> consistent and
> >>> >> >> I think it outweighs the cost of committing LB traffic.
> >>> >> >>
> >>> >> >> Hi Tim and folks,
> >>> >> >>
> >>> >> >
> >>> >> > Hi Han,
> >>> >> >
> >>> >> >> + @Ales Musil <amusil@redhat.com>
> >>> >> >>
> >>> >> >> We've spent a lot of effort on the alternative solution: Always
> commit to
> >>> >> >> both SNAT and DNAT zones. Unfortunately, the HW offload issue
> discussed in
> >>> >> >> this thread is still unresolved. I'd like to discuss the current
> situation
> >>> >> >> and ask for opinions.
> >>> >> >>
> >>> >> >
> >>> >> > Thanks again for working on this!
> >>> >> >
> >>> >> >> A while ago, Ales tried to fix it with this commit: 800fd0681579
> ("northd:
> >>> >> >> Add LR option to commit all traffic."). When the ct-commit-all
> option is
> >>> >> >> enabled, it commits the connection to both SNAT and DNAT zones.
> However, in
> >>> >> >> the ovn-k8s nodeport scenario (a very typical use case) the HW
> offload is
> >>> >> >> still broken. In that scenario, a packet may come from external
> network
> >>> >> >> with dest IP being a node's IP, which is also used as an OVN
> gateway router
> >>> >> >> IP. The dest IP is firstly DNATed to a LB backend IP, and then
> the src IP
> >>> >> >> is SNATed to another LRP's IP on this router. The src/dst IPs in
> different
> >>> >> >> stages are:
> >>> >> >>
> >>> >> >> Original: EXT IP <-> GR IP1
> >>> >> >> After DNAT: EXT IP <-> BE IP
> >>> >> >> After SNAT: GR IP2 <-> BE IP
> >>> >> >>
> >>> >> >> With the ct-commit-all option, it commits after DNAT and after
> SNAT:
> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
> >>> >> >>
> >>> >> >> So, at the UNSNAT stage, the packet traverses the SNAT zone with
> the
> >>> >> >> original header, which will still always be "new" state, which
> breaks HW
> >>> >> >> offload.
> >>> >> >>
> >>> >> >
> >>> >> > Ah, now I understand the problem you were facing.
> >>> >> >
> >>> >> >> I tried to fix it by also committing to the SNAT zone before
> DNAT (at the
> >>> >> >> POST_SNAT stage). Now there are 3 entries committed:
> >>> >> >> SNAT zone: EXT IP <-> GR IP1
> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
> >>> >> >>
> >>> >> >> HW offload worked, but it breaks at least the scenario of "DNAT
> and SNAT on
> >>> >> >> distributed router - E/W", which tests communication between two
> workloads
> >>> >> >> using "floating IPs". In this scenario, both workloads are under
> different
> >>> >> >> LSes behind the same router, and both have their own floating IPs
> >>> >> >> (dnat_and_snat) on the public side of the LR. When one of them
> sends
> >>> >> >> packets to the other using the floating IP, what's expected is:
> >>> >> >>
> >>> >> >> Original: overlay IP1 <-> floating IP2
> >>> >> >> After SNAT: floating IP1 <-> floating IP2
> >>> >> >> After DNAT: floating IP1 <-> overlay IP2
> >>> >> >>
> >>> >> >> With my fix, the packet is firstly SNATed with CT entry in SNAT
> zone:
> >>> >> >> SNAT: overlay IP1/floating IP1 <-> floating IP2
> >>> >> >>
> >>> >> >> This happens in the egress pipeline of the LR, and then it
> enters into the
> >>> >> >> ingress pipeline again because this is like a hairpin (the dest
> IP is a
> >>> >> >> dnat_and_snat IP on the same router). It will hit the UNSNAT
> stage first,
> >>> >> >> traverse the CT zone, and for the first packet the reply is not
> seen yet,
> >>> >> >> so the CT state is still new, so in the POST_UNSNAT we will try
> to commit
> >>> >> >> it (because of the issue I am trying to fix above). Because the
> connection
> >>> >> >> floating IP1 <-> floating IP2 is conflicting with the above
> committed entry
> >>> >> >> "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT action
> fails and
> >>> >> >> the packet is dropped.
> >>> >> >>
> >>> >> >> There is still a way to solve this, probably by skipping the
> unnecessary
> >>> >> >> UNSNAT traversal and the conflict commit if the source IP is one
> of the
> >>> >> >> snat/dnat_and_snat IPs of this same router.
> >>> >> >>
> >>> >> >> However, as we see, the NAT related pipelines have been very
> complex today,
> >>> >> >> and with the above changes it would become even much more
> complex. I wonder
> >>> >> >> if this is maintainable in the long run.
> >>> >> >>
> >>> >> >> In addition, I am not even sure if this is the last problem of
> my patch. It
> >>> >> >> is not easy to tell because today the ct-commit-all feature is
> an option
> >>> >> >> and by default disabled. So all our test cases by default don't
> test for
> >>> >> >> this option except the 4 test cases added by Ales's patch. I had
> to
> >>> >> >> manually set this option to true as default and of course many
> test cases
> >>> >> >> will fail. Most of them fail because the expected flows won't
> match, but
> >>> >> >> I can't tell for sure unless updating every case with new flows
> and then
> >>> >> >> retest them. Before doing that I just examined ones I think
> might be
> >>> >> >> impacted by the patch and the "DNAT and SNAT on distributed
> router - E/W"
> >>> >> >> is the first broken one I found. So there can be other issues not
> >>> >> >> discovered yet.
> >>> >> >>
> >>> >> >> Moreover, we should also keep in mind the performance penalty of
> these
> >>> >> >> extra CT commits & traversals.
> >>> >> >>
> >>> >> >
> >>> >> > But basically this happens because we use the same SNAT zone for
> all
> >>> >> > router ports.
> >>> >> >
> >>> >> > In an ideal world wouldn't using a unique CT SNAT and CT DNAT
> zone per
> >>> >> > router port fix it?  E.g.:
> >>> >> >
> >>> >> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE:12) [LR] LRP2 (SNAT
> ZONE: 21,
> >>> >> > DNAT ZONE: 22) ---- host2
> >>> >> >
> >>> >> > That'd also imply that we commit that session (from host1 to
> host2 -
> >>> >> > potentially via LB and SNAT) to 4 different zones (SNAT and DNAT
> for
> >>> >> > LRP1 and SNAT and DNAT for LRP2).  And we'd also have to do 4
> different
> >>> >> > CT lookups for all packets traversing the router.
> >>> >> >
> >>> >> > I didn't try any of this but I have the impression that this
> would solve
> >>> >> > the HWOL issue: all subsequent packets are part of known and
> committed
> >>> >> > sessions in all 4 conntrack zones.
> >>> >> >
> >>> >> > So what if we change the implementation of the "commit-all"
> feature to
> >>> >> > add the missing ct commits?  Wouldn't that work?
> >>> >> >
> >>> >> > It feels like (lookup performance hit aside) this is the correct
> way to
> >>> >> > forward traffic, i.e., never traverse a CT zone without having a
> >>> >> > committed entry in that zone.
> >>> >> >
> >>> >> >> I'd like to listen to your opinion before moving forward. Is
> this worth the
> >>> >> >> effort, or shall we just keep the pipeline simple (relatively)
> and avoid
> >>> >> >> the original problem (https://issues.redhat.com/browse/FDP-291)
> by
> >>> >> >> separating the port range of SNAT and DNAT, the first proposal
> at the
> >>> >> >> beginning of this thread?
> >>> >> >>
> >>> >> >
> >>> >> > Now that I understand the problem you're describing better it
> might be
> >>> >> > worth reverting the fix for FDP-291 like you said until we find a
> proper
> >>> >> > generic solution.
> >>> >> >
> >>> >> > However, we can't afford to do that without separating the port
> range of
> >>> >> > SNAT and DNAT like you said [0] because we'd be regressing (at
> least
> >>> >> > from OpenShift perspective).
> >>> >> >
> >>> >>
> >>> >> For reference, Tim, this is the OpenShift bug that needed the fix
> for
> >>> >> FDP-291:
> >>> >>
> >>> >> https://issues.redhat.com/browse/OCPBUGS-25889
> >>> >>
> >>> >> > Because selecting the nodeport (DNAT) range is done through a k8s
> API
> >>> >> > server cmdline argument that means that users can in theory choose
> >>> >> > whatever range they want.  I guess we'd need a command line
> argument in
> >>> >> > ovn-kubernetes to allow users (operators, etc.) to inform
> ovn-kubernetes
> >>> >> > what SNAT port range to use so that it doesn't conflict with the
> KAPI
> >>> >> > nodeport range.
> >>> >> >
> >>> >> > Tim, you're the expert on this side, what are your thoughts on
> the matter?
> >>> >> >
> >>> >> > Regards,
> >>> >> > Dumitru
> >>> >> >
> >>> >> > [0]
> https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html
> >>> >> >
> >>> >> >> Thanks,
> >>> >> >> Han
> >>> >> >>
> >>> >> >>>
> >>> >> >>> Thanks,
> >>> >> >>> Tim Rozet
> >>> >> >>> Red Hat OpenShift Networking Team
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <
> dceara@redhat.com> wrote:
> >>> >> >>>>
> >>> >> >>>> On 8/29/24 18:14, Han Zhou wrote:
> >>> >> >>>>> On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <
> dceara@redhat.com>
> >>> >> >> wrote:
> >>> >> >>>>>>
> >>> >> >>>>>> On 3/5/24 15:56, Numan Siddique wrote:
> >>> >> >>>>>>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <
> dceara@redhat.com>
> >>> >> >> wrote:
> >>> >> >>>>>>>>>>>>>> Otherwise, in case there's also a SNAT rule that
> uses the
> >>> >> >> VIP as
> >>> >> >>>>>>>> external IP, we break sessions initiated from behind the
> VIP.
> >>> >> >>>>>>>>
> >>> >> >>>>>>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip
> unsnat flows
> >>> >> >> for
> >>> >> >>>>>>>> load balancer vips in router ingress pipeline").  That's
> OK because
> >>> >> >>>>>>>> commit 384a7c6237da ("northd: Refactor Logical Flows for
> routers
> >>> >> >> with
> >>> >> >>>>>>>> DNAT/Load Balancers") addressed the original issue in a
> better way:
> >>> >> >>>>>>>>
> >>> >> >>>>>>>>     In the reply direction, the order of traversal of the
> tables
> >>> >> >>>>>>>>     "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds
> incorrect
> >>> >> >>>>>>>>     datapath flows that check ct_state in the wrong
> conntrack zone.
> >>> >> >>>>>>>>     This is illustrated below where reply traffic enters the
> >>> >> >> physical host
> >>> >> >>>>>>>>     port (6) and traverses DNAT zone (14), SNAT zone
> (default),
> >>> >> >> back to the
> >>> >> >>>>>>>>     DNAT zone and then on to Logical Switch Port zone
> (22). The
> >>> >> >> third
> >>> >> >>>>>>>>     flow is incorrectly checking the state from the SNAT
> zone
> >>> >> >> instead
> >>> >> >>>>>>>>     of the DNAT zone.
> >>> >> >>>>>>>>
> >>> >> >>>>>>>> We also add a system test to ensure traffic initiated from
> behind a
> >>> >> >> VIP
> >>> >> >>>>>>>> + SNAT is not broken.
> >>> >> >>>>>>>>
> >>> >> >>>>>>>> Another nice side effect is that the northd I-P is slightly
> >>> >> >> simplified
> >>> >> >>>>>>>> because we don't need to track NAT external IPs anymore.
> >>> >> >>>>>>>>
> >>> >> >>>>>>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for
> load
> >>> >> >> balancer vips in router ingress pipeline")
> >>> >> >>>>>>>> Reported-at: https://issues.redhat.com/browse/FDP-291
> >>> >> >>>>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >>> >> >>>>>>>
> >>> >> >>>>>>>
> >>> >> >>>>>>> Thanks for the fix.  It also simplified the lr-nat-stateful
> code.
> >>> >> >>>>>>>
> >>> >> >>>>>>> Acked-by: Numan Siddique <numans@ovn.org>
> >>> >> >>>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>> Thanks, Numan!
> >>> >> >>>>>>
> >>> >> >>>>>> Applied to main and backported to all branches down to 22.03.
> >>> >> >>>>>>
> >>> >> >>>>>> Regards,
> >>> >> >>>>>> Dumitru
> >>> >> >>>>>>
> >>> >> >>>>>> _______________________________________________
> >>> >> >>>>>> dev mailing list
> >>> >> >>>>>> dev@openvswitch.org
> >>> >> >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>> >> >>>>>
> >>> >> >>>>> Hi Dumitru, Numan, Tim and folks,
> >>> >> >>>>>
> >>> >> >>>>
> >>> >> >>>> Hi Han,
> >>> >> >>>>
> >>> >> >>>>> I noticed that the HW offload of k8s nodePort traffic is
> broken due to
> >>> >> >>>>> this change. The reason is that for client to nodePort (LB
> with VIP
> >>> >> >>>>> being the node IP) traffic, when the packet is going through
> the
> >>> >> >>>>> unSNAT stage in the SNAT CT zone, since the entry is never
> committed
> >>> >> >>>>> to the SNAT zone, it will have CT state returned as "new",
> which
> >>> >> >>>>> prevents HW offload from working for such packets.
> >>> >> >>>>>
> >>> >> >>>>
> >>> >> >>>> Sorry about that, I forgot we don't always commit in the SNAT
> zone.
> >>> >> >>>>
> >>> >> >>>>> At the moment I have to revert this change in our downstream.
> For the
> >>> >> >>>>> problem that was fixed by this change [0], I think we can
> avoid it by
> >>> >> >>>>> separating the port range of SNAT and DNAT. For DNAT, the
> nodePort
> >>> >> >>>>> range in k8s is configured by API-server option:
> >>> >> >>>>>
> >>> >> >>>>> --service-node-port-range <a string in the form 'N1-N2'>
> Default:
> >>> >> >> 30000-32767
> >>> >> >>>>>
> >>> >> >>>>> For SNAT, it can be configured in the OVN's NAT table's
> >>> >> >>>>> external_port_range column, and we can choose something like
> >>> >> >>>>> 10000-30000.
> >>> >> >>>>>
> >>> >> >>>>
> >>> >> >>>> Tim, does this look OK to you?  If it's acceptable to limit
> the SNAT
> >>> >> >>>> port range this workaround should be fine.
> >>> >> >>>>
> >>> >> >>>>> An extra benefit of this is that it reduces a CT recirc.
> >>> >> >>>>>
> >>> >> >>>>> Alternative solutions are:
> >>> >> >>>>> Alternative1: Always commit to both SNAT and DNAT zones. This
> would
> >>> >> >>>>> introduce unnecessary cost of CT entries and extra CT recirc.
> >>> >> >>>>
> >>> >> >>>> Extra recirculation aside, I would actually love it if we
> could use this
> >>> >> >>>> alternative.  I think it's the "most correct" option.  I think
> it would
> >>> >> >>>> allow us to avoid other workarounds like:
> >>> >> >>>>
> >>> >> >>>> https://github.com/ovn-org/ovn/commit/40136a2f2c8
> >>> >> >>>>
> >>> >> >>>> or
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>
> https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/
> >>> >> >>>>
> >>> >> >>>> I do understand the worry about the extra recirculation
> though.  In the
> >>> >> >>>> HWOL context does that cause visible performance impact?
> >>> >> >>>>
> >>> >> >>>> We'd probably have to do more performance testing without HWOL
> to figure
> >>> >> >>>> out the impact in the software datapath.
> >>> >> >>>>
> >>> >> >>>>> Alternative2: Use a common zone for SNAT and DNAT. But there
> are other
> >>> >> >>>>> issues reported for using the common zone [1]
> >>> >> >>>>>
> >>> >> >>>>> Could you let me know if you have other thoughts on this?
> >>> >> >>>>>
> >>> >> >>>>
> >>> >> >>>> On a related note, I know it has been discussed in different
> settings
> >>> >> >>>> but I don't think this ever moved forward: would it be
> possible for
> >>> >> >>>> NVIDIA to help out with automatically testing HWOL impact for
> incoming
> >>> >> >>>> patches?
> >>> >> >>>>
> >>> >> >>>> Maybe we could add some "simple" system-like tests that ensure
> that traffic
> >>> >> >>>> is correctly offloaded in common scenarios?  Alternatively, I
> guess we
> >>> >> >>>> could also tag a subset of the existing system tests and just
> run those
> >>> >> >>>> on actual hardware?
> >>> >> >>>>
> >>> >> >>>> It's quite simple (AFAIU) for external CIs to report status on
> each OVN
> >>> >> >>>> patch posted on patchwork.  That would at least allow us to
> flag this
> >>> >> >>>> kind of breakage early (even before they get merged).
> >>> >> >>>>
> >>> >> >>>> What do you think?
> >>> >> >>>>
> >>> >> >>>>> [0] https://issues.redhat.com/browse/FDP-291
> >>> >> >>>>> [1] b8c40e7593
> >>> >> >>
> https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
> >>> >> >>>>>
> >>> >> >>>>> Thanks,
> >>> >> >>>>> Han
> >>> >> >>>>>
> >>> >> >>>>
> >>> >> >>>> Thanks,
> >>> >> >>>> Dumitru
> >>> >> >> _______________________________________________
> >>> >> >> dev mailing list
> >>> >> >> dev@openvswitch.org
> >>> >> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
>>
Han Zhou June 4, 2025, 6:40 a.m. UTC | #16
On Tue, Jun 3, 2025 at 6:22 AM Ales Musil <amusil@redhat.com> wrote:
>
>
>
> On Mon, Jun 2, 2025 at 7:14 PM Han Zhou <zhouhan@gmail.com> wrote:
>>
>>
>>
>> On Wed, May 28, 2025 at 4:59 AM Tim Rozet <trozet@redhat.com> wrote:
>> >
>> > Hi Han,
>> > I tagged you on
https://github.com/ovn-kubernetes/ovn-kubernetes/pull/5265. Please review
it when you have time.
>> >
>> > Thanks,
>> > Tim Rozet
>> > Red Hat OpenShift Networking Team
>> >
>>
>> Thanks Tim. I posted a comment last week. Please take a look.
>>
>> I also sent the OVN revert patch today:
https://mail.openvswitch.org/pipermail/ovs-dev/2025-June/423753.html
>> Hope we can resolve the ovn-k PR soon and then this one can get merged.
>>
>> >
>> > On Tue, May 20, 2025 at 9:41 AM Ales Musil <amusil@redhat.com> wrote:
>> >>
>> >>
>> >>
>> >> On Tue, May 20, 2025 at 7:37 AM Han Zhou <zhouhan@gmail.com> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On Mon, May 19, 2025 at 8:41 AM Tim Rozet <trozet@redhat.com> wrote:
>> >>> >
>> >>> > Hi All,
>> >>> > Sorry Han that you are still facing issues with this problem. It
seems like the best option would be to configure the range of ephemeral
ports for OVN to use when doing an SNAT. Note, in OpenShift the nodeport
service range is configured 30000–32767, and the ephemeral port range in
the kernel is set to:
>> >>> > net.ipv4.ip_local_port_range = 32768 60999
>> >>> >
>> >>> > I think it would make sense to revert and add an option to make
this range configurable in OVN. Then OVN-Kubernetes should pass along this
config to OVN.
>> >>> >
>> >>> > As Dumitru mentioned, we can't just revert the previous patch
without a substitute fix in place, otherwise we would risk regressing.
>> >>>
>> >>> Thanks Tim! For the substitute fix, I assume you meant a fix in
ovn-k8s, right?
>> >>>
>> >>> OVN already provides a configurable SNAT port range through the
external_port_range column in the NAT table. See:
https://man7.org/linux/man-pages/man5/ovn-nb.5.html
>> >>>
>> >>>        external_port_range: string
>> >>>               L4 source port range
>> >>>
>> >>>               Range of ports, from which a port number will be picked
>> >>>               that will replace the source port of to be NATed
packet.
>> >>>               This is basically PAT (port address translation).
>> >>>
>> >>>               Value of the column is in the format, port_lo-port_hi.
For
>> >>>               example: external_port_range : "1-30000"
>> >>>
>> >>>               Valid range of ports is 1-65535.
>> >>>
>> >>> Does this satisfy the need you asked?
>> >>>
>> >>> If this is agreed, I think we can move forward with the steps below:
>> >>> 1. Update ovn-kubernetes to set the external_port_range, if specific
ranges are configured. (only needed when the k8s-configured nodeport range
conflicts with the kernel ephemeral port range, and it seems unnecessary
for OpenShift?)
>> >>> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
towards VIPs" in OVN. Also revert some of the changes for ct-commit-all:
e.g. no need to commit DNATted connections to SNAT zone, and vice versa.
>> >>
>> >>
>> >> Hi Han,
>> >>
>> >> I don't think we should revert/partially revert ct-commit-all, the
whole point of that knob is what you have described, commit to SNAT zone if
there is any SNAT configured and commit to DNAT zone if there is any DNAT
configured. If you still feel like there are some changes that we could
make to this approach we can discuss it. There is one system test called
"Commit all - UDN" which needs to pass unchanged if there is any
revert/change to ct-commit-all.
>> >>
>>
>> Hi Ales, sorry for the confusion.
>
>
> Hi Han,
>
>> I didn't mean to revert the major ct-commit-all change that was required to
fix the UDN issue reported, which was supposed to commit all traffic to
DNAT zone if there are stateful DNAT and all traffic to SNAT zone if there
are stateful SNAT. In one of the revisions of your patch, I commented that
to solve the HW offload issue discussed in this thread, we shouldn't
distinguish between DNAT and SNAT and always commit to both zones if there
are stateful NATs. Now that solving the HW offload issue is no longer a
goal of your ct-commit-all patch, we might bring back the separation
between SNAT and DNAT, so that in simpler scenarios (where only DNAT or
SNAT exists, but not both) we could avoid the extra CT commit, which might
help reduce some performance penalty. Shall I leave this to you if you
think this makes sense and want to follow up?
>
>
> I was thinking about looking into the proposal from Dumitru with a zone
per port, because we might need that anyway for UDNs. If that doesn't work
out, we can then decide which bits could be simplified. Any objections?

Hi Ales, thanks for looking into the zone-per-port proposal. Could you
provide more details on why that is needed for UDNs?

Han

>
> Thanks,
> Ales
>
>>
>> Thanks,
>> Han
>>
>> >> Thanks,
>> >> Ales
>> >>>
>> >>>
>> >>> What do you think?
>> >>>
>> >>> Thanks,
>> >>> Han
>> >>>
>> >>> >
>> >>> > Thanks,
>> >>> >
>> >>> > Tim Rozet
>> >>> > Red Hat OpenShift Networking Team
>> >>> >
>> >>> >
>> >>> > On Fri, May 16, 2025 at 7:59 AM Dumitru Ceara <dceara@redhat.com>
wrote:
>> >>> >>
>> >>> >> On 5/16/25 1:22 PM, Dumitru Ceara wrote:
>> >>> >> > On 5/15/25 6:24 PM, Han Zhou wrote:
>> >>> >> >> On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <trozet@redhat.com>
wrote:
>> >>> >> >>>
>> >>> >> >>> Hi Han,
>> >>> >> >>> My understanding is that the unSNAT stage was originally
being skipped
>> >>> >> >> for LB ports, then Dumitru changed it to attempt unSNAT, and
this breaks
>> >>> >> >> HWOL for ingress connections that never need to get unSNAT'ed
and therefore
>> >>> >> >> not committed. In the OVNK use case for pods simply connecting
egress,
>> >>> >> >> almost all packets will be SNAT'ed to the node IP. Therefore
on egress
>> >>> >> >> reply, we always enter the unSNAT stage for these packets. So
this skipping
>> >>> >> >> unSNAT stage seems to be specific to LB related traffic. While
I agree
>> >>> >> >> splitting the port range would work, I think it makes more
sense to just
>> >>> >> >> always commit in the unSNAT stage for all traffic. I get there
is a
>> >>> >> >> performance hit there, but the datapath pipeline seems more
consistent and
>> >>> >> >> I think it outweighs the cost of committing LB traffic.
>> >>> >> >>
>> >>> >> >> Hi Tim and folks,
>> >>> >> >>
>> >>> >> >
>> >>> >> > Hi Han,
>> >>> >> >
>> >>> >> >> + @Ales Musil <amusil@redhat.com>
>> >>> >> >>
>> >>> >> >> We've spent a lot of effort on the alternative solution:
Always commit to
>> >>> >> >> both SNAT and DNAT zones. Unfortunately, the HW offload issue
discussed in
>> >>> >> >> this thread is still unresolved. I'd like to discuss the
current situation
>> >>> >> >> and ask for opinions.
>> >>> >> >>
>> >>> >> >
>> >>> >> > Thanks again for working on this!
>> >>> >> >
>> >>> >> >> A while ago, Ales tried to fix it with this commit:
800fd0681579 ("northd:
>> >>> >> >> Add LR option to commit all traffic."). When the ct-commit-all
option is
>> >>> >> >> enabled, it commits the connection to both SNAT and DNAT
zones. However, in
>> >>> >> >> the ovn-k8s nodeport scenario (a very typical use case) the HW
offload is
>> >>> >> >> still broken. In that scenario, a packet may come from
external network
>> >>> >> >> with dest IP being a node's IP, which is also used as an OVN
gateway router
>> >>> >> >> IP. The dest IP is firstly DNATed to a LB backend IP, and then
the src IP
>> >>> >> >> is SNATed to another LRP's IP on this router. The src/dst IPs
in different
>> >>> >> >> stages are:
>> >>> >> >>
>> >>> >> >> Original: EXT IP <-> GR IP1
>> >>> >> >> After DNAT: EXT IP <-> BE IP
>> >>> >> >> After SNAT: GR IP2 <-> BE IP
>> >>> >> >>
>> >>> >> >> With the ct-commit-all option, it commits after DNAT and after
SNAT:
>> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
>> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
>> >>> >> >>
>> >>> >> >> So, at the UNSNAT stage, the packet traverses the SNAT zone
with the
>> >>> >> >> original header, which will still always be "new" state, which
breaks HW
>> >>> >> >> offload.
>> >>> >> >>
>> >>> >> >
>> >>> >> > Ah, now I understand the problem you were facing.
>> >>> >> >
>> >>> >> >> I tried to fix it by also committing to the SNAT zone before
DNAT (at the
>> >>> >> >> POST_SNAT stage). Now there are 3 entries committed:
>> >>> >> >> SNAT zone: EXT IP <-> GR IP1
>> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
>> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
>> >>> >> >>
>> >>> >> >> HW offload worked, but it breaks at least the scenario of
"DNAT and SNAT on
>> >>> >> >> distributed router - E/W", which tests communication between
two workloads
>> >>> >> >> using "floating IPs". In this scenario, both workloads are
under different
>> >>> >> >> LSes behind the same router, and both have their own floating
IPs
>> >>> >> >> (dnat_and_snat) on the public side of the LR. When one of them
sends
>> >>> >> >> packets to the other using the floating IP, what's expected is:
>> >>> >> >>
>> >>> >> >> Original: overlay IP1 <-> floating IP2
>> >>> >> >> After SNAT: floating IP1 <-> floating IP2
>> >>> >> >> After DNAT: floating IP1 <-> overlay IP2
>> >>> >> >>
>> >>> >> >> With my fix, the packet is firstly SNATed with CT entry in
SNAT zone:
>> >>> >> >> SNAT: overlay IP1/floating IP1 <-> floating IP2
>> >>> >> >>
>> >>> >> >> This happens in the egress pipeline of the LR, and then it
enters into the
>> >>> >> >> ingress pipeline again because this is like a hairpin (the
dest IP is a
>> >>> >> >> dnat_and_snat IP on the same router). It will hit the UNSNAT
stage first,
>> >>> >> >> traverse the CT zone, and for the first packet the reply is
not seen yet,
>> >>> >> >> so the CT state is still new, so in the POST_UNSNAT we will
try to commit
>> >>> >> >> it (because of the issue I am trying to fix above). Because
the connection
>> >>> >> >> floating IP1 <-> floating IP2 is conflicting with the above
committed entry
>> >>> >> >> "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT
action fails and
>> >>> >> >> the packet is dropped.
>> >>> >> >>
>> >>> >> >> There is still a way to solve this, probably by skipping the
unnecessary
>> >>> >> >> UNSNAT traversal and the conflict commit if the source IP is
one of the
>> >>> >> >> snat/dnat_and_snat IPs of this same router.
>> >>> >> >>
>> >>> >> >> However, as we see, the NAT related pipelines have been very
complex today,
>> >>> >> >> and with the above changes it would become even much more
complex. I wonder
>> >>> >> >> if this is maintainable in the long run.
>> >>> >> >>
>> >>> >> >> In addition, I am not even sure if this is the last problem of
my patch. It
>> >>> >> >> is not easy to tell because today the ct-commit-all feature is
an option
>> >>> >> >> and by default disabled. So all our test cases by default
don't test for
>> >>> >> >> this option except the 4 test cases added by Ales's patch. I
had to
>> >>> >> >> manually set this option to true as default and of course many
test cases
>> >>> >> >> will fail. Most of them fail because the expected flows
won't match, but
>> >>> >> >> I can't tell for sure unless updating every case with new
flows and then
>> >>> >> >> retest them. Before doing that I just examined ones I think
might be
>> >>> >> >> impacted by the patch and the "DNAT and SNAT on distributed
router - E/W"
>> >>> >> >> is the first broken one I found. So there can be other issues
not
>> >>> >> >> discovered yet.
>> >>> >> >>
>> >>> >> >> Moreover, we should also keep in mind the performance penalty
of these
>> >>> >> >> extra CT commits & traversals.
>> >>> >> >>
>> >>> >> >
>> >>> >> > But basically this happens because we use the same SNAT zone
for all
>> >>> >> > router ports.
>> >>> >> >
>> >>> >> > In an ideal world wouldn't using a unique CT SNAT and CT DNAT
zone per
>> >>> >> > router port fix it?  E.g.:
>> >>> >> >
>> >>> >> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE:12) [LR] LRP2 (SNAT
ZONE: 21,
>> >>> >> > DNAT ZONE: 22) ---- host2
>> >>> >> >
>> >>> >> > That'd also imply that we commit that session (from host1 to
host2 -
>> >>> >> > potentially via LB and SNAT) to 4 different zones (SNAT and
DNAT for
>> >>> >> > LRP1 and SNAT and DNAT for LRP2).  And we'd also have to do 4
different
>> >>> >> > CT lookups for all packets traversing the router.
>> >>> >> >
>> >>> >> > I didn't try any of this but I have the impression that this
would solve
>> >>> >> > the HWOL issue: all subsequent packets are part of known and
committed
>> >>> >> > sessions in all 4 conntrack zones.
>> >>> >> >
>> >>> >> > So what if we change the implementation of the "commit-all"
feature to
>> >>> >> > add the missing ct commits?  Wouldn't that work?
>> >>> >> >
>> >>> >> > It feels like (lookup performance hit aside) this is the
correct way to
>> >>> >> > forward traffic, i.e., never traverse a CT zone without having a
>> >>> >> > committed entry in that zone.
>> >>> >> >
>> >>> >> >> I'd like to listen to your opinion before moving forward. Is
this worth the
>> >>> >> >> effort, or shall we just keep the pipeline simple (relatively)
and avoid
>> >>> >> >> the original problem (https://issues.redhat.com/browse/FDP-291)
by
>> >>> >> >> separating the port range of SNAT and DNAT, the first proposal
at the
>> >>> >> >> beginning of this thread?
>> >>> >> >>
>> >>> >> >
>> >>> >> > Now that I understand the problem you're describing better it
might be
>> >>> >> > worth reverting the fix for FDP-291 like you said until we find
a proper
>> >>> >> > generic solution.
>> >>> >> >
>> >>> >> > However, we can't afford to do that without separating the port
range of
>> >>> >> > SNAT and DNAT like you said [0] because we'd be regressing (at
least
>> >>> >> > from OpenShift perspective).
>> >>> >> >
>> >>> >>
>> >>> >> For reference, Tim, this is the OpenShift bug that needed the fix
for
>> >>> >> FDP-291:
>> >>> >>
>> >>> >> https://issues.redhat.com/browse/OCPBUGS-25889
>> >>> >>
>> >>> >> > Because selecting the nodeport (DNAT) range is done through a
k8s API
>> >>> >> > server cmdline argument that means that users can in theory
choose
>> >>> >> > whatever range they want.  I guess we'd need a command line
argument in
>> >>> >> > ovn-kubernetes to allow users (operators, etc.) to inform
ovn-kubernetes
>> >>> >> > what SNAT port range to use so that it doesn't conflict with
the KAPI
>> >>> >> > nodeport range.
>> >>> >> >
>> >>> >> > Tim, you're the expert on this side, what are your thoughts on
the matter?
>> >>> >> >
>> >>> >> > Regards,
>> >>> >> > Dumitru
>> >>> >> >
>> >>> >> > [0]
https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html
>> >>> >> >
>> >>> >> >> Thanks,
>> >>> >> >> Han
>> >>> >> >>
>> >>> >> >>>
>> >>> >> >>> Thanks,
>> >>> >> >>> Tim Rozet
>> >>> >> >>> Red Hat OpenShift Networking Team
>> >>> >> >>>
>> >>> >> >>>
>> >>> >> >>> On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <
dceara@redhat.com> wrote:
>> >>> >> >>>>
>> >>> >> >>>> On 8/29/24 18:14, Han Zhou wrote:
>> >>> >> >>>>> On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <
dceara@redhat.com>
>> >>> >> >> wrote:
>> >>> >> >>>>>>
>> >>> >> >>>>>> On 3/5/24 15:56, Numan Siddique wrote:
>> >>> >> >>>>>>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <
dceara@redhat.com>
>> >>> >> >> wrote:
>> >>> >> >>>>>>>>>>>>>> Otherwise, in case there's also a SNAT rule that
uses the
>> >>> >> >> VIP as
>> >>> >> >>>>>>>> external IP, we break sessions initiated from behind the
VIP.
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip
unsnat flows
>> >>> >> >> for
>> >>> >> >>>>>>>> load balancer vips in router ingress pipeline").  That's
OK because
>> >>> >> >>>>>>>> commit 384a7c6237da ("northd: Refactor Logical Flows for
routers
>> >>> >> >> with
>> >>> >> >>>>>>>> DNAT/Load Balancers") addressed the original issue in a
better way:
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>>     In the reply direction, the order of traversal of
the tables
>> >>> >> >>>>>>>>     "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds
incorrect
>> >>> >> >>>>>>>>     datapath flows that check ct_state in the wrong
conntrack zone.
>> >>> >> >>>>>>>>     This is illustrated below where reply traffic enters
the
>> >>> >> >> physical host
>> >>> >> >>>>>>>>     port (6) and traverses DNAT zone (14), SNAT zone
(default),
>> >>> >> >> back to the
>> >>> >> >>>>>>>>     DNAT zone and then on to Logical Switch Port zone
(22). The
>> >>> >> >> third
>> >>> >> >>>>>>>>     flow is incorrectly checking the state from the SNAT
zone
>> >>> >> >> instead
>> >>> >> >>>>>>>>     of the DNAT zone.
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> We also add a system test to ensure traffic initiated from
>> >>> >> >>>>>>>> behind a VIP + SNAT is not broken.
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> Another nice side effect is that the northd I-P is slightly
>> >>> >> >>>>>>>> simplified because we don't need to track NAT external IPs
>> >>> >> >>>>>>>> anymore.
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load
>> >>> >> >>>>>>>> balancer vips in router ingress pipeline")
>> >>> >> >>>>>>>> Reported-at: https://issues.redhat.com/browse/FDP-291
>> >>> >> >>>>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>> >>> >> >>>>>>>
>> >>> >> >>>>>>>
>> >>> >> >>>>>>> Thanks for the fix.  It also simplified the lr-nat-stateful code.
>> >>> >> >>>>>>>
>> >>> >> >>>>>>> Acked-by: Numan Siddique <numans@ovn.org>
>> >>> >> >>>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>> Thanks, Numan!
>> >>> >> >>>>>>
>> >>> >> >>>>>> Applied to main and backported to all branches down to 22.03.
>> >>> >> >>>>>>
>> >>> >> >>>>>> Regards,
>> >>> >> >>>>>> Dumitru
>> >>> >> >>>>>>
>> >>> >> >>>>>> _______________________________________________
>> >>> >> >>>>>> dev mailing list
>> >>> >> >>>>>> dev@openvswitch.org
>> >>> >> >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>> >>> >> >>>>>
>> >>> >> >>>>> Hi Dumitru, Numan, Tim and folks,
>> >>> >> >>>>>
>> >>> >> >>>>
>> >>> >> >>>> Hi Han,
>> >>> >> >>>>
>> >>> >> >>>>> I noticed that the HW offload of k8s nodePort traffic is broken
>> >>> >> >>>>> due to this change.  The reason is that for client to nodePort
>> >>> >> >>>>> (LB with VIP being the node IP) traffic, when the packet is
>> >>> >> >>>>> going through the unSNAT stage in the SNAT CT zone, since the
>> >>> >> >>>>> entry is never committed to the SNAT zone, it will have CT state
>> >>> >> >>>>> returned as "new", which prevents HW offload from working for
>> >>> >> >>>>> such packets.
>> >>> >> >>>>>
>> >>> >> >>>>
>> >>> >> >>>> Sorry about that, I forgot we don't always commit in the SNAT zone.
>> >>> >> >>>>
>> >>> >> >>>>> At the moment I have to revert this change in our downstream.
>> >>> >> >>>>> For the problem that was fixed by this change [0], I think we
>> >>> >> >>>>> can avoid it by separating the port range of SNAT and DNAT.
>> >>> >> >>>>> For DNAT, the nodePort range in k8s is configured by the
>> >>> >> >>>>> API-server option:
>> >>> >> >>>>>
>> >>> >> >>>>> --service-node-port-range <a string in the form 'N1-N2'>
>> >>> >> >>>>> Default: 30000-32767
>> >>> >> >>>>>
>> >>> >> >>>>> For SNAT, it can be configured in the OVN NAT table's
>> >>> >> >>>>> external_port_range column, and we can choose something like
>> >>> >> >>>>> 10000-30000.
>> >>> >> >>>>>
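A sketch of what such an SNAT rule could look like with ovn-nbctl's
lr-nat-add (the router name, gateway IP and pod subnet below are made up for
illustration; the --portrange flag supplies the external_port_range value):

    # SNAT a hypothetical pod subnet behind gateway router GR0, restricting
    # source ports to 10000-30000 so the 30000-32767 nodeport range is free.
    ovn-nbctl --portrange lr-nat-add GR0 snat 172.16.0.10 10.244.0.0/16 \
        10000-30000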
>> >>> >> >>>>
>> >>> >> >>>> Tim, does this look OK to you?  If it's acceptable to limit the
>> >>> >> >>>> SNAT port range this workaround should be fine.
>> >>> >> >>>>
>> >>> >> >>>>> An extra benefit of this is that it reduces a CT recirc.
>> >>> >> >>>>>
>> >>> >> >>>>> Alternative solutions are:
>> >>> >> >>>>> Alternative1: Always commit to both SNAT and DNAT zones.  This
>> >>> >> >>>>> would introduce unnecessary cost of CT entries and extra CT
>> >>> >> >>>>> recirc.
>> >>> >> >>>>
>> >>> >> >>>> Extra recirculation aside, I would actually love it if we could
>> >>> >> >>>> use this alternative.  I think it's the "most correct" option.
>> >>> >> >>>> I think it would allow us to avoid other workarounds like:
>> >>> >> >>>>
>> >>> >> >>>> https://github.com/ovn-org/ovn/commit/40136a2f2c8
>> >>> >> >>>>
>> >>> >> >>>> or
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>
https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/
>> >>> >> >>>>
>> >>> >> >>>> I do understand the worry about the extra recirculation though.
>> >>> >> >>>> In the HWOL context does that cause visible performance impact?
>> >>> >> >>>>
>> >>> >> >>>> We'd probably have to do more performance testing without HWOL
>> >>> >> >>>> to figure out the impact in the software datapath.
>> >>> >> >>>>
>> >>> >> >>>>> Alternative2: Use a common zone for SNAT and DNAT.  But there
>> >>> >> >>>>> are other issues reported for using the common zone [1].
>> >>> >> >>>>>
>> >>> >> >>>>> Could you let me know if you have other thoughts on this?
>> >>> >> >>>>>
>> >>> >> >>>>
>> >>> >> >>>> On a related note, I know it has been discussed in different
>> >>> >> >>>> settings but I don't think this ever moved forward: would it be
>> >>> >> >>>> possible for NVIDIA to help out with automatically testing HWOL
>> >>> >> >>>> impact for incoming patches?
>> >>> >> >>>>
>> >>> >> >>>> Maybe we could add some "simple" system-like tests that ensure
>> >>> >> >>>> that traffic is correctly offloaded in common scenarios?
>> >>> >> >>>> Alternatively, I guess we could also tag a subset of the existing
>> >>> >> >>>> system tests and just run those on actual hardware?
>> >>> >> >>>>
>> >>> >> >>>> It's quite simple (AFAIU) for external CIs to report status on
>> >>> >> >>>> each OVN patch posted on patchwork.  That would at least allow us
>> >>> >> >>>> to flag this kind of breakage early (even before it gets merged).
>> >>> >> >>>>
>> >>> >> >>>> What do you think?
>> >>> >> >>>>
>> >>> >> >>>>> [0] https://issues.redhat.com/browse/FDP-291
>> >>> >> >>>>> [1] b8c40e7593
>> >>> >> >>
https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
>> >>> >> >>>>>
>> >>> >> >>>>> Thanks,
>> >>> >> >>>>> Han
>> >>> >> >>>>>
>> >>> >> >>>>
>> >>> >> >>>> Thanks,
>> >>> >> >>>> Dumitru
>> >>> >> >> _______________________________________________
>> >>> >> >> dev mailing list
>> >>> >> >> dev@openvswitch.org
>> >>> >> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

>
Ales Musil June 4, 2025, 6:50 a.m. UTC | #17
On Wed, Jun 4, 2025 at 8:40 AM Han Zhou <zhouhan@gmail.com> wrote:

>
>
> On Tue, Jun 3, 2025 at 6:22 AM Ales Musil <amusil@redhat.com> wrote:
> >
> >
> >
> > On Mon, Jun 2, 2025 at 7:14 PM Han Zhou <zhouhan@gmail.com> wrote:
> >>
> >>
> >>
> >> On Wed, May 28, 2025 at 4:59 AM Tim Rozet <trozet@redhat.com> wrote:
> >> >
> >> > Hi Han,
> >> > I tagged you on https://github.com/ovn-kubernetes/ovn-kubernetes/pull/5265.
> >> > Please review it when you have time.
> >> >
> >> > Thanks,
> >> > Tim Rozet
> >> > Red Hat OpenShift Networking Team
> >> >
> >>
> >> Thanks Tim. I posted a comment last week. Please take a look.
> >>
> >> I also sent the OVN revert patch today:
> >> https://mail.openvswitch.org/pipermail/ovs-dev/2025-June/423753.html
> >> Hope we can resolve the ovn-k PR soon and then this one can get merged.
> >>
> >> >
> >> > On Tue, May 20, 2025 at 9:41 AM Ales Musil <amusil@redhat.com> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Tue, May 20, 2025 at 7:37 AM Han Zhou <zhouhan@gmail.com> wrote:
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Mon, May 19, 2025 at 8:41 AM Tim Rozet <trozet@redhat.com>
> wrote:
> >> >>> >
> >> >>> > Hi All,
> >> >>> > Sorry Han that you are still facing issues with this problem.  It
> >> >>> > seems like the best option would be to configure the range of
> >> >>> > ephemeral ports for OVN to use when doing an SNAT.  Note, in
> >> >>> > OpenShift the nodeport service range is configured as 30000-32767,
> >> >>> > and the ephemeral port range in the kernel is set to:
> >> >>> > net.ipv4.ip_local_port_range = 32768 60999
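For illustration, the two ranges above map to these knobs (values as quoted
in the thread; a sketch, not a recommended configuration):

    # k8s nodeport (DNAT) range: a kube-apiserver command-line flag.
    #   kube-apiserver --service-node-port-range=30000-32767
    # Kernel ephemeral (local source port) range: a sysctl.
    sysctl -w net.ipv4.ip_local_port_range="32768 60999"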
> >> >>> >
> >> >>> > I think it would make sense to revert and add an option to make this
> >> >>> > range configurable in OVN.  Then OVN-Kubernetes should pass along
> >> >>> > this config to OVN.
> >> >>> >
> >> >>> > As Dumitru mentioned, we can't just revert the previous patch without
> >> >>> > a substitute fix in place, otherwise we would risk regressing.
> >> >>>
> >> >>> Thanks Tim!  For the substitute fix, I assume you meant a fix in
> >> >>> ovn-k8s, right?
> >> >>>
> >> >>> OVN already provides a configurable SNAT port range through the
> >> >>> external_port_range column in the NAT table.  See:
> >> >>> https://man7.org/linux/man-pages/man5/ovn-nb.5.html
> >> >>>
> >> >>>        external_port_range: string
> >> >>>               L4 source port range
> >> >>>
> >> >>>               Range of ports, from which a port number will be picked
> >> >>>               that will replace the source port of to be NATed packet.
> >> >>>               This is basically PAT (port address translation).
> >> >>>
> >> >>>               Value of the column is in the format, port_lo-port_hi.
> >> >>>               For example: external_port_range : "1-30000"
> >> >>>
> >> >>>               Valid range of ports is 1-65535.
> >> >>>
> >> >>> Does this satisfy what you asked for?
> >> >>>
> >> >>> If this is agreed, I think we can move forward with the steps below:
> >> >>> 1. Update ovn-kubernetes to set the external_port_range, if specific
> >> >>> ranges are configured.  (Only needed when the k8s-configured nodeport
> >> >>> range conflicts with the kernel ephemeral port range, and it seems
> >> >>> unnecessary for OpenShift?)
> >> >>> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
> >> >>> towards VIPs" in OVN.  Also revert some of the changes for
> >> >>> ct-commit-all: e.g. no need to commit DNATted connections to SNAT zone,
> >> >>> and vice versa.
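As a concrete sketch of step 1 (the gateway IP is made up; external_port_range
is the NB NAT column quoted above), an existing SNAT rule can be constrained
so its source ports cannot collide with the nodeport range:

    # Find the SNAT rule whose external IP is the (hypothetical) gateway IP
    # 172.16.0.10 and restrict its source ports to 10000-30000.
    nat_uuid=$(ovn-nbctl --bare --columns=_uuid find nat \
        type=snat external_ip=172.16.0.10)
    ovn-nbctl set nat "$nat_uuid" external_port_range="10000-30000"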
> >> >>
> >> >>
> >> >> Hi Han,
> >> >>
> >> >> I don't think we should revert/partially revert ct-commit-all; the
> >> >> whole point of that knob is what you have described: commit to the SNAT
> >> >> zone if there is any SNAT configured and commit to the DNAT zone if
> >> >> there is any DNAT configured.  If you still feel like there are some
> >> >> changes that we could make to this approach we can discuss it.  There
> >> >> is one system test called "Commit all - UDN" which needs to pass
> >> >> unchanged if there is any revert/change to ct-commit-all.
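For context, the knob under discussion is a logical router option; a minimal
sketch of enabling it (the router name is made up, and the NB option key is
assumed to match the name used in this thread):

    # Ask northd to commit connections to the router's SNAT/DNAT conntrack
    # zones even when no NAT actually rewrites the packet.
    ovn-nbctl set logical_router GR0 options:ct-commit-all=true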
> >> >>
> >>
> >> Hi Ales, sorry for the confusion.
> >
> >
> > Hi Han,
> >
> >> I didn't mean reverting the major ct-commit-all change that was required
> >> to fix the reported UDN issue, which was supposed to commit all traffic
> >> to the DNAT zone if there is stateful DNAT and all traffic to the SNAT
> >> zone if there is stateful SNAT.  In one of the revisions of your patch, I
> >> commented that to solve the HW offload issue discussed in this thread, we
> >> shouldn't distinguish between DNAT and SNAT and should always commit to
> >> both zones if there are stateful NATs.  Now that solving the HW offload
> >> issue is no longer a goal of your ct-commit-all patch, we might bring back
> >> the separation between SNAT and DNAT, so that in simpler scenarios (where
> >> only DNAT or SNAT exists but not both) we could avoid the extra CT commit,
> >> which might help reduce some performance penalty.  Shall I leave this to
> >> you, if you think this makes sense and want to follow up?
> >
> >
> > I was thinking about looking into the proposal from Dumitru with a zone
> > per port, because we might need that anyway for UDNs.  If that doesn't
> > work out we can figure out which bits could be simplified.  Any objections?
>
> Hi Ales, thanks for looking into the zone-per-port proposal.  Could you
> provide more details on why that is needed for UDNs?
>

The original bug report was for "k8s LoadBalancer services with
ExternalTrafficPolicy=Local" [0] with UDNs; AFAIK this still stands today.
Also, I should have made it clearer that it's not for all UDNs, but for a
specific configuration.  So if there is a plan to get this working with HWOL
we will need to fix ct-commit-all, possibly with the zone-per-port approach.

Thanks,
Ales

[0] https://issues.redhat.com/browse/FDP-787


> Han
>
> >
> > Thanks,
> > Ales
> >
> >>
> >> Thanks,
> >> Han
> >>
> >> >> Thanks,
> >> >> Ales
> >> >>>
> >> >>>
> >> >>> What do you think?
> >> >>>
> >> >>> Thanks,
> >> >>> Han
> >> >>>
> >> >>> >
> >> >>> > Thanks,
> >> >>> >
> >> >>> > Tim Rozet
> >> >>> > Red Hat OpenShift Networking Team
> >> >>> >
> >> >>> >
> >> >>> > On Fri, May 16, 2025 at 7:59 AM Dumitru Ceara <dceara@redhat.com>
> wrote:
> >> >>> >>
> >> >>> >> On 5/16/25 1:22 PM, Dumitru Ceara wrote:
> >> >>> >> > On 5/15/25 6:24 PM, Han Zhou wrote:
> >> >>> >> >> On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <trozet@redhat.com>
> wrote:
> >> >>> >> >>>
> >> >>> >> >>> Hi Han,
> >> >>> >> >>> My understanding is that the unSNAT stage was originally
> being skipped
> >> >>> >> >> for LB ports, then Dumitru changed it to attempt unSNAT, and
> this breaks
> >> >>> >> >> HWOL for ingress connections that never need to get unSNAT'ed
> and therefore
> >> >>> >> >> not committed. In the OVNK use case for pods simply
> connecting egress,
> >> >>> >> >> almost all packets will be SNAT'ed to the node IP. Therefore
> on egress
> >> >>> >> >> reply, we always enter the unSNAT stage for these packets. So
> this skipping
> >> >>> >> >> unSNAT stage seems to be specific to LB related traffic.
> While I agree
> >> >>> >> >> splitting the port range would work, I think it makes more
> sense to just
> >> >>> >> >> always commit in the unSNAT stage for all traffic. I get
> there is a
> >> >>> >> >> performance hit there, but the datapath pipeline seems more
> consistent and
> >> >>> >> >> I think outweighs the cost of committing LB traffic.
> >> >>> >> >>
> >> >>> >> >> Hi Tim and folks,
> >> >>> >> >>
> >> >>> >> >
> >> >>> >> > Hi Han,
> >> >>> >> >
> >> >>> >> >> + @Ales Musil <amusil@redhat.com>
> >> >>> >> >>
> >> >>> >> >> We've spent a lot of effort on the alternative solution:
> Always commit to
> >> >>> >> >> both SNAT and DNAT zones. Unfortunately, the HW offload issue
> discussed in
> >> >>> >> >> this thread is still unresolved. I'd like to discuss the
> current situation
> >> >>> >> >> and ask for opinions.
> >> >>> >> >>
> >> >>> >> >
> >> >>> >> > Thanks again for working on this!
> >> >>> >> >
> >> >>> >> >> A while ago, Ales tried to fix it with this commit:
> 800fd0681579 ("northd:
> >> >>> >> >> Add LR option to commit all traffic."). When the
> ct-commit-all option is
> >> >>> >> >> enabled, it commits the connection to both SNAT and DNAT
> zones. However, in
> >> >>> >> >> the ovn-k8s nodeport scenario (a very typical use case) the
> HW offload is
> >> >>> >> >> still broken. In that scenario, a packet may come from
> external network
> >> >>> >> >> with dest IP being a node's IP, which is also used as an OVN
> gateway router
> >> >>> >> >> IP. The dest IP is firstly DNATed to a LB backend IP, and
> then the src IP
> >> >>> >> >> is SNATed to another LRP's IP on this router. The src/dst IPs
> in different
> >> >>> >> >> stages are:
> >> >>> >> >>
> >> >>> >> >> Original: EXT IP <-> GR IP1
> >> >>> >> >> After DNAT: EXT IP <-> BE IP
> >> >>> >> >> After SNAT: GR IP2 <-> BE IP
> >> >>> >> >>
> >> >>> >> >> With the ct-commit-all option, it commits after DNAT and
> after SNAT:
> >> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
> >> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
> >> >>> >> >>
> >> >>> >> >> So, at the UNSNAT stage, the packet traverses the SNAT zone
> with the
> >> >>> >> >> original header, which will still always be "new" state,
> which breaks HW
> >> >>> >> >> offload.
> >> >>> >> >>
> >> >>> >> >
> >> >>> >> > Ah, now I understand the problem you were facing.
> >> >>> >> >
> >> >>> >> >> I tried to fix it by also committing to the SNAT zone before
> DNAT (at the
> >> >>> >> >> POST_SNAT stage). Now there are 3 entries committed:
> >> >>> >> >> SNAT zone: EXT IP <-> GR IP1
> >> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
> >> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
> >> >>> >> >>
> >> >>> >> >> HW offload worked, but it breaks at least the scenario of
> "DNAT and SNAT on
> >> >>> >> >> distributed router - E/W", which tests communication between
> two workloads
> >> >>> >> >> using "floating IPs". In this scenario, both workloads are
> under different
> >> >>> >> >> LSes behind the same router, and both have their own floating
> IPs
> >> >>> >> >> (dnat_and_snat) on the public side of the LR. When one of
> them sends
> >> >>> >> >> packets to the other using the floating IP, what's expected
> is:
> >> >>> >> >>
> >> >>> >> >> Original: overlay IP1 <-> floating IP2
> >> >>> >> >> After SNAT: floating IP1 <-> floating IP2
> >> >>> >> >> After DNAT: floating IP1 <-> overlay IP2
> >> >>> >> >>
> >> >>> >> >> With my fix, the packet is firstly SNATed with CT entry in
> SNAT zone:
> >> >>> >> >> SNAT: overlay IP1/floating IP1 <-> floating IP2
> >> >>> >> >>
> >> >>> >> >> This happens in the egress pipeline of the LR, and then it
> enters into the
> >> >>> >> >> ingress pipeline again because this is like a hairpin (the
> dest IP is a
> >> >>> >> >> dnat_and_snat IP on the same router). It will hit the UNSNAT
> stage first,
> >> >>> >> >> traverse the CT zone, and for the first packet the reply is
> not seen yet,
> >> >>> >> >> so the CT state is still new, so in the POST_UNSNAT we will
> try to commit
> >> >>> >> >> it (because of the issue I am trying to fix above). Because
> the connection
> >> >>> >> >> floating IP1 <-> floating IP2 is conflicting with the above
> committed entry
> >> >>> >> >> "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT
> action fails and
> >> >>> >> >> the packet is dropped.
> >> >>> >> >>
> >> >>> >> >> There is still a way to solve this, probably by skipping the
> unnecessary
> >> >>> >> >> UNSNAT traversal and the conflict commit if the source IP is
> one of the
> >> >>> >> >> snat/dnat_and_snat IPs of this same router.
> >> >>> >> >>
> >> >>> >> >> However, as we see, the NAT related pipelines have been very
> complex today,
> >> >>> >> >> and with the above changes it would become even much more
> complex. I wonder
> >> >>> >> >> if this is maintainable in the long run.
> >> >>> >> >>
> >> >>> >> >> In addition, I am not even sure if this is the last problem
> of my patch. It
> >> >>> >> >> is not easy to tell because today the ct-commit-all feature
> is an option
> >> >>> >> >> and by default disabled. So all our test cases by default
> don't test for
> >> >>> >> >> this option except the 4 test cases added by Ales's patch. I
> had to
> >> >>> >> >> manually set this option to true as default and of course
> many test cases
> >> >>> >> >> will fail. Most of them fail because of the expected flows
> won't match, but
> >> >>> >> >> I can't tell for sure unless updating every case with new
> flows and then
> >> >>> >> >> retest them. Before doing that I just examined ones I think
> might be
> >> >>> >> >> impacted by the patch and the "DNAT and SNAT on distributed
> router - E/W"
> >> >>> >> >> is the first broken one I found. So there can be other issues
> not
> >> >>> >> >> discovered yet.
> >> >>> >> >>
> >> >>> >> >> Moreover, we should also keep in mind the performance penalty
> of these
> >> >>> >> >> extra CT commit & traversal.
> >> >>> >> >>
> >> >>> >> >
> >> >>> >> > But basically this happens because we use the same SNAT zone
> for all
> >> >>> >> > router ports.
> >> >>> >> >
> >> >>> >> > In an ideal world wouldn't using a unique CT SNAT and CT DNAT
> zone per
> >> >>> >> > router port fix it?  E.g.:
> >> >>> >> >
> >> >>> >> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE:12) [LR] LRP2 (SNAT
> ZONE: 21,
> >> >>> >> > DNAT ZONE: 22) ---- host2
> >> >>> >> >
> >> >>> >> > That'd also imply that we commit that session (from host1 to
> host2 -
> >> >>> >> > potentially via LB and SNAT) to 4 different zones (SNAT and
> DNAT for
> >> >>> >> > LRP1 and SNAT and DNAT for LRP2).  And we'd also have to do 4
> different
> >> >>> >> > CT lookups for all packets traversing the router.
> >> >>> >> >
> >> >>> >> > I didn't try any of this but I have the impression that this
> would solve
> >> >>> >> > the HWOL issue: all subsequent packets are part of known and
> committed
> >> >>> >> > sessions in all 4 conntrack zones.
> >> >>> >> >
> >> >>> >> > [...]
>
>>
Han Zhou June 4, 2025, 2:33 p.m. UTC | #18
On Tue, Jun 3, 2025 at 11:50 PM Ales Musil <amusil@redhat.com> wrote:
> [...]
>>
>> Hi Ales, thanks for looking into the zone-per-port proposal.  Could you
>> provide more details on why that is needed for UDNs?
>
>
> The original bug report was for "k8s LoadBalancer services with
> ExternalTrafficPolicy=Local" [0] with UDNs; AFAIK this still stands today.
> Also, I should have made it clearer that it's not for all UDNs, but for a
> specific configuration.  So if there is a plan to get this working with HWOL
> we will need to fix ct-commit-all, possibly with the zone-per-port approach.
>

FDP-787 is already fixed by ct-commit-all, correct?  And the HWOL issue was
introduced by "northd: Don't skip the unSNAT stage for traffic towards VIPs",
which was supposed to fix FDP-291.  Now we are reverting that patch to fix the
HWOL issue, and using the port range configuration in ovn-k to avoid the
problem reported by FDP-291.  So I am still confused: is there another problem
that requires an additional fix (e.g. zone-per-port)?  Is ct-commit-all
breaking HWOL for the UDN case you mentioned?

Thanks,
Han


> Thanks,
> Ales
>
> [0] https://issues.redhat.com/browse/FDP-787
>
>>
>> Han
>>
>> [...]

>
Ales Musil June 5, 2025, 6:04 a.m. UTC | #19
On Wed, Jun 4, 2025 at 4:33 PM Han Zhou <zhouhan@gmail.com> wrote:

>
>
> On Tue, Jun 3, 2025 at 11:50 PM Ales Musil <amusil@redhat.com> wrote:
> >
> >
> >
> > On Wed, Jun 4, 2025 at 8:40 AM Han Zhou <zhouhan@gmail.com> wrote:
> >>
> >>
> >>
> >> On Tue, Jun 3, 2025 at 6:22 AM Ales Musil <amusil@redhat.com> wrote:
> >> >
> >> >
> >> >
> >> > On Mon, Jun 2, 2025 at 7:14 PM Han Zhou <zhouhan@gmail.com> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Wed, May 28, 2025 at 4:59 AM Tim Rozet <trozet@redhat.com> wrote:
> >> >> >
> >> >> > Hi Han,
> >> >> > I tagged you on
> https://github.com/ovn-kubernetes/ovn-kubernetes/pull/5265. Please review
> it when you have time.
> >> >> >
> >> >> > Thanks,
> >> >> > Tim Rozet
> >> >> > Red Hat OpenShift Networking Team
> >> >> >
> >> >>
> >> >> Thanks Tim. I posted a comment last week. Please take a look.
> >> >>
> >> >> I also sent the OVN revert patch today:
> https://mail.openvswitch.org/pipermail/ovs-dev/2025-June/423753.html
> >> >> Hope we can resolve the ovn-k PR soon and then this one can get
> merged.
> >> >>
> >> >> >
> >> >> > On Tue, May 20, 2025 at 9:41 AM Ales Musil <amusil@redhat.com>
> wrote:
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Tue, May 20, 2025 at 7:37 AM Han Zhou <zhouhan@gmail.com>
> wrote:
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> On Mon, May 19, 2025 at 8:41 AM Tim Rozet <trozet@redhat.com>
> wrote:
> >> >> >>> >
> >> >> >>> > Hi All,
> >> >> >>> > Sorry Han that you are still facing issues with this problem.
> It seems like the best option would be to configure the range of ephemeral
> ports for OVN to use when doing an SNAT. Note, in OpenShift the nodeport
> service range is configured 30000–32767, and the ephemeral port range in
> the kernel is set to:
> >> >> >>> > net.ipv4.ip_local_port_range = 32768 60999
> >> >> >>> >
> >> >> >>> > I think it would make sense to revert and add an option to
> make this range configurable in OVN. Then OVN-Kubernetes should pass along
> this config to OVN.
> >> >> >>> >
> >> >> >>> > As Dumitru mentioned, we can't just revert the previous patch
> without a substitute fix in place, otherwise we would risk regressing.
> >> >> >>>
> >> >> >>> Thanks Tim! For the substitute fix, I assume you meant a fix in
> ovn-k8s, right?
> >> >> >>>
> >> >> >>> OVN already provides configurable SNAT port range through the
> external_port_range column in the NAT table. See:
> https://man7.org/linux/man-pages/man5/ovn-nb.5.html
> >> >> >>>
> >> >> >>>        external_port_range: string
> >> >> >>>               L4 source port range
> >> >> >>>
> >> >> >>>               Range of ports, from which a port number will be
> picked
> >> >> >>>               that will replace the source port of to be NATed
> packet.
> >> >> >>>               This is basically PAT (port address translation).
> >> >> >>>
> >> >> >>>               Value of the column is in the format,
> port_lo-port_hi. For
> >> >> >>>               example: external_port_range : "1-30000"
> >> >> >>>
> >> >> >>>               Valid range of ports is 1-65535.
> >> >> >>>
> >> >> >>> Does this satisfy the need you asked?
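
As a concrete illustration, the range can be supplied when the NAT rule is
created, or set on an existing NAT row; a minimal sketch (router name,
addresses and range below are made up, assuming the --portrange flag of
lr-nat-add):

    # Create a SNAT rule with a restricted source port range:
    ovn-nbctl --portrange lr-nat-add lr0 snat 172.16.0.10 10.0.0.0/24 10000-30000

    # Or set it on an existing NAT row (<nat-uuid> is a placeholder):
    ovn-nbctl set NAT <nat-uuid> external_port_range="10000-30000"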
> >> >> >>>
> >> >> >>> If this is agreed, I think we can move forward with the steps below:
> >> >> >>> 1. Update ovn-kubernetes to set the external_port_range, if
> specific ranges are configured. (only needed when the k8s-configured nodeport
> range conflicts with the kernel ephemeral port range, and it seems
> unnecessary for OpenShift?)
> >> >> >>> 2. Revert the patch "northd: Don't skip the unSNAT stage for
> traffic towards VIPs" in OVN. Also revert some of the changes for
> ct-commit-all: e.g. no need to commit DNATted connections to SNAT zone, and
> vice versa.
> >> >> >>
> >> >> >>
> >> >> >> Hi Han,
> >> >> >>
> >> >> >> I don't think we should revert/partially revert ct-commit-all;
> the whole point of that knob is what you have described: commit to the SNAT
> zone if there is any SNAT configured and commit to the DNAT zone if there is
> any DNAT configured. If you still feel like there are some changes that we
> could make to this approach, we can discuss it. There is one system test
> called "Commit all - UDN" which needs to pass unchanged if there is any
> revert/change to ct-commit-all (see the run sketch below).
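
As an aside, that system test can be run on its own from an OVN tree; a
minimal sketch, assuming the usual kernel-datapath system testsuite target
and autotest keyword selection:

    make check-kernel TESTSUITEFLAGS='-k "Commit all - UDN"'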
> >> >> >>
> >> >>
> >> >> Hi Ales, sorry for the confusion.
> >> >
> >> >
> >> > Hi Han,
> >> >
> >> >> I didn't mean reverting the major ct-commit-all change that was
> required to fix the reported UDN issue, which was supposed to commit all
> traffic to the DNAT zone if there are stateful DNATs and all traffic to the
> SNAT zone if there are stateful SNATs. In one of the revisions of your patch, I
> commented that to solve the HW offload issue discussed in this thread, we
> shouldn't distinguish between DNAT and SNAT and always commit to both zones
> if there are stateful NATs. Now that solving the HW offload issue is no
> longer a goal of your ct-commit-all patch, we might bring back the
> separation between SNAT and DNAT, so that in simpler scenarios (where only
> DNAT or SNAT exists but not both) we could avoid the extra CT commit, which
> might help reduce some of the performance penalty. Shall I leave this to you,
> if you think this makes sense and want to follow up?
> >> >
> >> >
> >> > I was thinking about looking into the proposal from Dumitru with a
> zone per port, because we might need that anyway for UDNs. If that doesn't
> work out, we can figure out which bits could be simplified. Any objections?
> >>
> >> Hi Ales, thanks for looking into the zone-per-port proposal. Could you
> provide more details why is that needed for UDNs?
> >
> >
> > The original bug report was for "k8s LoadBalancer services with
> ExternalTrafficPolicy=Local"[0] with UDNs; AFAIK this still stands to this day.
> Also, I should have made it clearer that it's not for all UDNs, but for a
> specific configuration. So if there is a plan to get this working with HWOL,
> we will need to fix ct-commit-all, possibly with the zone-per-port approach.
> >
>
> FDP-787 is already fixed by ct-commit-all, correct? And the HWOL issue
> was introduced by "northd: Don't skip the unSNAT stage for traffic
> towards VIPs", which was supposed to fix FDP-291. Now we are reverting that
> patch to fix the HWOL issue, and using the port range configuration in ovn-k
> to avoid the problem reported by FDP-291. So I am still confused: is there
> another problem that requires an additional fix (e.g. zone-per-port)? Is
> ct-commit-all breaking HWOL for the UDN case you mentioned?
>

My understanding was that ct-commit-all is breaking HWOL, but I suppose
that was never actually said; the message was only that it's not fixing it.
So we might be fine, but we probably need some testing to confirm.

Thanks,
Ales


> Thanks,
> Han
>
>
> > Thanks,
> > Ales
> >
> > [0] https://issues.redhat.com/browse/FDP-787
> >
> >>
> >> Han
> >>
> >> >
> >> > Thanks,
> >> > Ales
> >> >
> >> >>
> >> >> Thanks,
> >> >> Han
> >> >>
> >> >> >> Thanks,
> >> >> >> Ales
> >> >> >>>
> >> >> >>>
> >> >> >>> What do you think?
> >> >> >>>
> >> >> >>> Thanks,
> >> >> >>> Han
> >> >> >>>
> >> >> >>> >
> >> >> >>> > Thanks,
> >> >> >>> >
> >> >> >>> > Tim Rozet
> >> >> >>> > Red Hat OpenShift Networking Team
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > On Fri, May 16, 2025 at 7:59 AM Dumitru Ceara <
> dceara@redhat.com> wrote:
> >> >> >>> >>
> >> >> >>> >> On 5/16/25 1:22 PM, Dumitru Ceara wrote:
> >> >> >>> >> > On 5/15/25 6:24 PM, Han Zhou wrote:
> >> >> >>> >> >> On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <
> trozet@redhat.com> wrote:
> >> >> >>> >> >>>
> >> >> >>> >> >>> Hi Han,
> >> >> >>> >> >>> My understanding is that the unSNAT stage was originally
> being skipped
> >> >> >>> >> >> for LB ports, then Dumitru changed it to attempt unSNAT,
> and this breaks
> >> >> >>> >> >> HWOL for ingress connections that never need to get
> unSNAT'ed and are therefore
> >> >> >>> >> >> not committed. In the OVNK use case for pods simply
> making egress connections,
> >> >> >>> >> >> almost all packets will be SNAT'ed to the node IP.
> Therefore on egress
> >> >> >>> >> >> reply, we always enter the unSNAT stage for these packets.
> So this skipping of the
> >> >>> >> >> unSNAT stage seems to be specific to LB-related traffic.
> While I agree
> >> >> >>> >> >> splitting the port range would work, I think it makes more
> sense to just
> >> >> >>> >> >> always commit in the unSNAT stage for all traffic. I get
> there is a
> >> >> >>> >> >> performance hit there, but the datapath pipeline seems
> more consistent and
> >> >>> >> >> I think that outweighs the cost of committing LB traffic.
> >> >> >>> >> >>
> >> >> >>> >> >> Hi Tim and folks,
> >> >> >>> >> >>
> >> >> >>> >> >
> >> >> >>> >> > Hi Han,
> >> >> >>> >> >
> >> >> >>> >> >> + @Ales Musil <amusil@redhat.com>
> >> >> >>> >> >>
> >> >> >>> >> >> We've spent a lot of effort on the alternative solution:
> Always commit to
> >> >> >>> >> >> both SNAT and DNAT zones. Unfortunately, the HW offload
> issue discussed in
> >> >> >>> >> >> this thread is still unresolved. I'd like to discuss the
> current situation
> >> >> >>> >> >> and ask for opinions.
> >> >> >>> >> >>
> >> >> >>> >> >
> >> >> >>> >> > Thanks again for working on this!
> >> >> >>> >> >
> >> >> >>> >> >> A while ago, Ales tried to fix it with this commit:
> 800fd0681579 ("northd:
> >> >> >>> >> >> Add LR option to commit all traffic."). When the
> ct-commit-all option is
> >> >> >>> >> >> enabled, it commits the connection to both SNAT and DNAT
> zones. However, in
> >> >> >>> >> >> the ovn-k8s nodeport scenario (a very typical use case)
> the HW offload is
> >> >> >>> >> >> still broken. In that scenario, a packet may come from
> the external network
> >> >> >>> >> >> with dest IP being a node's IP, which is also used as an
> OVN gateway router
> >> >> >>> >> >> IP. The dest IP is firstly DNATed to a LB backend IP, and
> then the src IP
> >> >> >>> >> >> is SNATed to another LRP's IP on this router. The src/dst
> IPs in different
> >> >> >>> >> >> stages are:
> >> >> >>> >> >>
> >> >> >>> >> >> Original: EXT IP <-> GR IP1
> >> >> >>> >> >> After DNAT: EXT IP <-> BE IP
> >> >> >>> >> >> After SNAT: GR IP2 <-> BE IP
> >> >> >>> >> >>
> >> >> >>> >> >> With the ct-commit-all option, it commits after DNAT and
> after SNAT:
> >> >> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
> >> >> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
> >> >> >>> >> >>
> >> >> >>> >> >> So, at the UNSNAT stage, the packet traverses the SNAT
> zone with the
> >> >> >>> >> >> original header, which will still always be "new" state,
> which breaks HW
> >> >> >>> >> >> offload.
> >> >> >>> >> >>
> >> >> >>> >> >
> >> >> >>> >> > Ah, now I understand the problem you were facing.
> >> >> >>> >> >
> >> >> >>> >> >> I tried to fix it by also committing to the SNAT zone
> before DNAT (at the
> >> >> >>> >> >> POST_SNAT stage). Now there are 3 entries committed:
> >> >> >>> >> >> SNAT zone: EXT IP <-> GR IP1
> >> >> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
> >> >> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
> >> >> >>> >> >>
> >> >> >>> >> >> HW offload worked, but it breaks at least the scenario of
> "DNAT and SNAT on
> >> >> >>> >> >> distributed router - E/W", which tests communication
> between two workloads
> >> >> >>> >> >> using "floating IPs". In this scenario, both workloads are
> under different
> >> >> >>> >> >> LSes behind the same router, and both have their own
> floating IPs
> >> >> >>> >> >> (dnat_and_snat) on the public side of the LR. When one of
> them sends
> >> >> >>> >> >> packets to the other using the floating IP, what's
> expected is:
> >> >> >>> >> >>
> >> >> >>> >> >> Original: overlay IP1 <-> floating IP2
> >> >> >>> >> >> After SNAT: floating IP1 <-> floating IP2
> >> >> >>> >> >> After DNAT: floating IP1 <-> overlay IP2
> >> >> >>> >> >>
> >> >> >>> >> >> With my fix, the packet is firstly SNATed with CT entry in
> SNAT zone:
> >> >> >>> >> >> SNAT: overlay IP1/floating IP1 <-> floating IP2
> >> >> >>> >> >>
> >> >> >>> >> >> This happens in the egress pipeline of the LR, and then it
> enters the
> >> >> >>> >> >> ingress pipeline again because this is like a hairpin (the
> dest IP is a
> >> >> >>> >> >> dnat_and_snat IP on the same router). It will hit the
> UNSNAT stage first,
> >> >> >>> >> >> traverse the CT zone, and for the first packet the reply
> is not seen yet,
> >> >>> >> >> so the CT state is still new, and in the POST_UNSNAT stage we
> will try to commit
> >> >> >>> >> >> it (because of the issue I am trying to fix above).
> Because the connection
> >> >> >>> >> >> floating IP1 <-> floating IP2 is conflicting with the
> above committed entry
> >> >> >>> >> >> "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT
> action fails and
> >> >> >>> >> >> the packet is dropped.
> >> >> >>> >> >>
> >> >> >>> >> >> There is still a way to solve this, probably by skipping
> the unnecessary
> >> >> >>> >> >> UNSNAT traversal and the conflict commit if the source IP
> is one of the
> >> >> >>> >> >> snat/dnat_and_snat IPs of this same router.
> >> >> >>> >> >>
> >> >> >>> >> >> However, as we see, the NAT related pipelines have been
> very complex today,
> >> >> >>> >> >> and with the above changes it would become even much more
> complex. I wonder
> >> >> >>> >> >> if this is maintainable in the long run.
> >> >> >>> >> >>
> >> >> >>> >> >> In addition, I am not even sure if this is the last
> problem of my patch. It
> >> >> >>> >> >> is not easy to tell because today the ct-commit-all
> feature is an option
> >> >> >>> >> >> and by default disabled. So all our test cases by default
> don't test for
> >> >> >>> >> >> this option except the 4 test cases added by Ales's patch.
> I had to
> >> >> >>> >> >> manually set this option to true as default and of course
> many test cases
> >> >>> >> >> will fail. Most of them fail because the expected flows
> won't match, but
> >> >> >>> >> >> I can't tell for sure unless updating every case with new
> flows and then
> >> >>> >> >> retest them. Before doing that I just examined the ones I
> think might be
> >> >> >>> >> >> impacted by the patch and the "DNAT and SNAT on
> distributed router - E/W"
> >> >>> >> >> is the first broken one I found. So there may be other
> issues not yet
> >> >>> >> >> discovered.
> >> >> >>> >> >>
> >> >> >>> >> >> Moreover, we should also keep in mind the performance
> penalty of these
> >> >> >>> >> >> extra CT commit & traversal.
> >> >> >>> >> >>
> >> >> >>> >> >
> >> >> >>> >> > But basically this happens because we use the same SNAT
> zone for all
> >> >> >>> >> > router ports.
> >> >> >>> >> >
> >> >> >>> >> > In an ideal world wouldn't using a unique CT SNAT and CT
> DNAT zone per
> >> >> >>> >> > router port fix it?  E.g.:
> >> >> >>> >> >
> >> >> >>> >> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE:12) [LR] LRP2
> (SNAT ZONE: 21,
> >> >> >>> >> > DNAT ZONE: 22) ---- host2
> >> >> >>> >> >
> >> >> >>> >> > That'd also imply that we commit that session (from host1
> to host2 -
> >> >> >>> >> > potentially via LB and SNAT) to 4 different zones (SNAT and
> DNAT for
> >> >> >>> >> > LRP1 and SNAT and DNAT for LRP2).  And we'd also have to do
> 4 different
> >> >> >>> >> > CT lookups for all packets traversing the router.
> >> >> >>> >> >
> >> >> >>> >> > I didn't try any of this but I have the impression that
> this would solve
> >> >> >>> >> > the HWOL issue: all subsequent packets are part of known
> and committed
> >> >> >>> >> > sessions in all 4 conntrack zones.
> >> >> >>> >> >
> >> >> >>> >> > So what if we change the implementation of the "commit-all"
> feature to
> >> >> >>> >> > add the missing ct commits?  Wouldn't that work?
> >> >> >>> >> >
> >> >> >>> >> > It feels like (lookup performance hit aside) this is the
> correct way to
> >> >> >>> >> > forward traffic, i.e., never traverse a CT zone without
> having a
> >> >> >>> >> > committed entry in that zone.
> >> >> >>> >> >
> >> >> >>> >> >> I'd like to listen to your opinion before moving forward.
> Is this worth the
> >> >> >>> >> >> effort, or shall we just keep the pipeline simple
> (relatively) and avoid
> >> >> >>> >> >> the original problem (
> https://issues.redhat.com/browse/FDP-291) by
> >> >> >>> >> >> separating the port range of SNAT and DNAT, the first
> proposal at the
> >> >> >>> >> >> beginning of this thread?
> >> >> >>> >> >>
> >> >> >>> >> >
> >> >> >>> >> > Now that I understand the problem you're describing better
> it might be
> >> >> >>> >> > worth reverting the fix for FDP-291 like you said until we
> find a proper
> >> >> >>> >> > generic solution.
> >> >> >>> >> >
> >> >> >>> >> > However, we can't afford to do that without separating the
> port range of
> >> >> >>> >> > SNAT and DNAT like you said [0] because we'd be regressing
> (at least
> >> >> >>> >> > from OpenShift perspective).
> >> >> >>> >> >
> >> >> >>> >>
> >> >> >>> >> For reference, Tim, this is the OpenShift bug that needed the
> fix for
> >> >> >>> >> FDP-291:
> >> >> >>> >>
> >> >> >>> >> https://issues.redhat.com/browse/OCPBUGS-25889
> >> >> >>> >>
> >> >> >>> >> > Because selecting the nodeport (DNAT) range is done through
> a k8s API
> >> >> >>> >> > server cmdline argument that means that users can in theory
> choose
> >> >> >>> >> > whatever range they want.  I guess we'd need a command line
> argument in
> >> >> >>> >> > ovn-kubernetes to allow users (operators, etc.) to inform
> ovn-kubernetes
> >> >> >>> >> > what SNAT port range to use so that it doesn't conflict
> with the KAPI
> >> >> >>> >> > nodeport range.
> >> >> >>> >> >
> >> >> >>> >> > Tim, you're the expert on this side, what are your thoughts
> on the matter?
> >> >> >>> >> >
> >> >> >>> >> > Regards,
> >> >> >>> >> > Dumitru
> >> >> >>> >> >
> >> >> >>> >> > [0]
> https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html
> >> >> >>> >> >
> >> >> >>> >> >> Thanks,
> >> >> >>> >> >> Han
> >> >> >>> >> >>
> >> >> >>> >> >>>
> >> >> >>> >> >>> Thanks,
> >> >> >>> >> >>> Tim Rozet
> >> >> >>> >> >>> Red Hat OpenShift Networking Team
> >> >> >>> >> >>>
> >> >> >>> >> >>>
> >> >> >>> >> >>> On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <
> dceara@redhat.com> wrote:
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> On 8/29/24 18:14, Han Zhou wrote:
> >> >> >>> >> >>>>> On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <
> dceara@redhat.com>
> >> >> >>> >> >> wrote:
> >> >> >>> >> >>>>>>
> >> >> >>> >> >>>>>> On 3/5/24 15:56, Numan Siddique wrote:
> >> >> >>> >> >>>>>>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <
> dceara@redhat.com>
> >> >> >>> >> >> wrote:
> >> >> >>> >> >>>>>>>>>>>>>> Otherwise, in case there's also a SNAT rule
> that uses the
> >> >> >>> >> >> VIP as
> >> >> >>> >> >>>>>>>> external IP, we break sessions initiated from behind
> the VIP.
> >> >> >>> >> >>>>>>>>
> >> >> >>> >> >>>>>>>> This partially reverts 832893bdbb42 ("ovn-northd:
> Skip unsnat flows
> >> >> >>> >> >> for
> >> >> >>> >> >>>>>>>> load balancer vips in router ingress pipeline").
> That's OK because
> >> >> >>> >> >>>>>>>> commit 384a7c6237da ("northd: Refactor Logical Flows
> for routers
> >> >> >>> >> >> with
> >> >> >>> >> >>>>>>>> DNAT/Load Balancers") addressed the original issue
> in a better way:
> >> >> >>> >> >>>>>>>>
> >> >> >>> >> >>>>>>>>     In the reply direction, the order of traversal
> of the tables
> >> >> >>> >> >>>>>>>>     "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat"
> adds incorrect
> >> >> >>> >> >>>>>>>>     datapath flows that check ct_state in the wrong
> conntrack zone.
> >> >> >>> >> >>>>>>>>     This is illustrated below where reply trafic
> enters the
> >> >> >>> >> >> physical host
> >> >> >>> >> >>>>>>>>     port (6) and traverses DNAT zone (14), SNAT zone
> (default),
> >> >> >>> >> >> back to the
> >> >> >>> >> >>>>>>>>     DNAT zone and then on to Logical Switch Port
> zone (22). The
> >> >> >>> >> >> third
> >> >> >>> >> >>>>>>>>     flow is incorrectly checking the state from the
> SNAT zone
> >> >> >>> >> >> instead
> >> >> >>> >> >>>>>>>>     of the DNAT zone.
> >> >> >>> >> >>>>>>>>
> >> >> >>> >> >>>>>>>> We also add a system test to ensure traffic
> initiated from behind a
> >> >> >>> >> >> VIP
> >> >> >>> >> >>>>>>>> + SNAT is not broken.
> >> >> >>> >> >>>>>>>>
> >> >> >>> >> >>>>>>>> Another nice side effect is that the northd I-P is
> slightly
> >> >> >>> >> >> simplified
> >> >> >>> >> >>>>>>>> because we don't need to track NAT external IPs
> anymore.
> >> >> >>> >> >>>>>>>>
> >> >> >>> >> >>>>>>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows
> for load
> >> >> >>> >> >> balancer vips in router ingress pipeline")
> >> >> >>> >> >>>>>>>> Reported-at:
> https://issues.redhat.com/browse/FDP-291
> >> >> >>> >> >>>>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >> >> >>> >> >>>>>>>
> >> >> >>> >> >>>>>>>
> >> >> >>> >> >>>>>>> Thanks for the fix.  It also simplified the
> lr-nat-stateful code.
> >> >> >>> >> >>>>>>>
> >> >> >>> >> >>>>>>> Acked-by: Numan Siddique <numans@ovn.org>
> >> >> >>> >> >>>>>>>
> >> >> >>> >> >>>>>>
> >> >> >>> >> >>>>>> Thanks, Numan!
> >> >> >>> >> >>>>>>
> >> >> >>> >> >>>>>> Applied to main and backported to all branches down to
> 22.03.
> >> >> >>> >> >>>>>>
> >> >> >>> >> >>>>>> Regards,
> >> >> >>> >> >>>>>> Dumitru
> >> >> >>> >> >>>>>>
> >> >> >>> >> >>>>>> _______________________________________________
> >> >> >>> >> >>>>>> dev mailing list
> >> >> >>> >> >>>>>> dev@openvswitch.org
> >> >> >>> >> >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >> >> >>> >> >>>>>
> >> >> >>> >> >>>>> Hi Dumitru, Numan, Tim and folks,
> >> >> >>> >> >>>>>
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> Hi Han,
> >> >> >>> >> >>>>
> >> >> >>> >> >>>>> I noticed that the HW offload of k8s nodePort traffic
> is broken due to
> >> >> >>> >> >>>>> this change. The reason is that for client to nodePort
> (LB with VIP
> >> >> >>> >> >>>>> being the node IP) traffic, when the packet is going
> through the
> >> >> >>> >> >>>>> unSNAT stage in the SNAT CT zone, since the entry is
> never committed
> >> >> >>> >> >>>>> to the SNAT zone, it will have CT state returned as
> "new", which
> >> >> >>> >> >>>>> prevents the HW offload to work for such packets.
> >> >> >>> >> >>>>>
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> Sorry about that, I forgot we don't always commit in the
> SNAT zone.
> >> >> >>> >> >>>>
> >> >> >>> >> >>>>> At the moment I have to revert this change in our
> downstream. For the
> >> >> >>> >> >>>>> problem that was fixed by this change [0], I think we
> can avoid it by
> >> >> >>> >> >>>>> separating the port range of SNAT and DNAT. For DNAT,
> the nodePort
> >> >> >>> >> >>>>> range in k8s is configured by API-server option:
> >> >> >>> >> >>>>>
> >> >> >>> >> >>>>> --service-node-port-range <a string in the form
> 'N1-N2'> Default:
> >> >> >>> >> >> 30000-32767
> >> >> >>> >> >>>>>
> >> >> >>> >> >>>>> For SNAT, it can be configured in the OVN's NAT table's
> >> >> >>> >> >>>>> external_port_range column, and we can choose something
> like
> >> >> >>> >> >>>>> 10000-30000.
> >> >> >>> >> >>>>>
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> Tim, does this look OK to you?  If it's acceptable to
> limit the SNAT
> >> >> >>> >> >>>> port range this workaround should be fine.
> >> >> >>> >> >>>>
> >> >> >>> >> >>>>> An extra benefit of this is that it reduces a CT recirc.
> >> >> >>> >> >>>>>
> >> >>> >> >>>>> Alternative solutions are:
> >> >> >>> >> >>>>> Alternative1: Always commit to both SNAT and DNAT
> zones. This would
> >> >> >>> >> >>>>> introduce unnecessary cost of CT entries and extra CT
> recirc.
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> Extra recirculation aside, I would actually love it if
> we could use this
> >> >> >>> >> >>>> alternative.  I think it's the "most correct" option.  I
> think it would
> >> >> >>> >> >>>> allow us to avoid other workarounds like:
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> https://github.com/ovn-org/ovn/commit/40136a2f2c8
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> or
> >> >> >>> >> >>>>
> >> >> >>> >> >>>>
> >> >> >>> >> >>
> https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> I do understand the worry about the extra recirculation
> though.  In the
> >> >> >>> >> >>>> HWOL context does that cause visible performance impact?
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> We'd probably have to do more performance testing
> without HWOL to figure
> >> >> >>> >> >>>> out the impact in the software datapath.
> >> >> >>> >> >>>>
> >> >> >>> >> >>>>> Alternative2: Use a common zone for SNAT and DNAT. But
> there are other
> >> >> >>> >> >>>>> issues reported for using the common zone [1]
> >> >> >>> >> >>>>>
> >> >> >>> >> >>>>> Could you let me know if other thoughts on this?
> >> >> >>> >> >>>>>
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> On a related note, I know it has been discussed in
> different settings
> >> >> >>> >> >>>> but I don't think this ever moved forward: would it be
> possible for
> >> >> >>> >> >>>> NVIDIA to help out with automatically testing HWOL
> impact for incoming
> >> >> >>> >> >>>> patches?
> >> >> >>> >> >>>>
> >> >>> >> >>>> Maybe we could add some "simple" system-like tests that
> ensure that traffic
> >> >> >>> >> >>>> is correctly offloaded in common scenarios?
> Alternatively, I guess we
> >> >> >>> >> >>>> could also tag a subset of the existing system tests and
> just run those
> >> >> >>> >> >>>> on actual hardware?
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> It's quite simple (AFAIU) for external CIs to report
> status on each OVN
> >> >> >>> >> >>>> patch posted on patchwork.  That would at least allow us
> to flag this
> kind of breakage early (even before patches get merged).
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> What do you think?
> >> >> >>> >> >>>>
> >> >> >>> >> >>>>> [0] https://issues.redhat.com/browse/FDP-291
> >> >> >>> >> >>>>> [1] b8c40e7593
> >> >> >>> >> >>
> https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
> >> >> >>> >> >>>>>
> >> >> >>> >> >>>>> Thanks,
> >> >> >>> >> >>>>> Han
> >> >> >>> >> >>>>>
> >> >> >>> >> >>>>
> >> >> >>> >> >>>> Thanks,
> >> >> >>> >> >>>> Dumitru
> >> >> >>> >> >> _______________________________________________
> >> >> >>> >> >> dev mailing list
> >> >> >>> >> >> dev@openvswitch.org
> >> >> >>> >> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
diff mbox series

Patch

diff --git a/northd/en-lflow.c b/northd/en-lflow.c
index f1a83839df..c4b927fb8c 100644
--- a/northd/en-lflow.c
+++ b/northd/en-lflow.c
@@ -174,8 +174,7 @@  lflow_lr_stateful_handler(struct engine_node *node, void *data)
     struct ed_type_lr_stateful *lr_sful_data =
         engine_get_input_data("lr_stateful", node);
 
-    if (!lr_stateful_has_tracked_data(&lr_sful_data->trk_data)
-        || lr_sful_data->trk_data.vip_nats_changed) {
+    if (!lr_stateful_has_tracked_data(&lr_sful_data->trk_data)) {
         return false;
     }
 
diff --git a/northd/en-lr-nat.c b/northd/en-lr-nat.c
index ad11025c69..215d924e42 100644
--- a/northd/en-lr-nat.c
+++ b/northd/en-lr-nat.c
@@ -219,10 +219,6 @@  lr_nat_record_init(struct lr_nat_record *lrnat_rec,
     lrnat_rec->nbr_uuid = od->nbr->header_.uuid;
 
     shash_init(&lrnat_rec->snat_ips);
-    sset_init(&lrnat_rec->external_ips);
-    for (size_t i = 0; i < od->nbr->n_nat; i++) {
-        sset_add(&lrnat_rec->external_ips, od->nbr->nat[i]->external_ip);
-    }
 
     sset_init(&lrnat_rec->external_macs);
     lrnat_rec->has_distributed_nat = false;
@@ -343,7 +339,6 @@  lr_nat_record_clear(struct lr_nat_record *lrnat_rec)
     }
 
     free(lrnat_rec->nat_entries);
-    sset_destroy(&lrnat_rec->external_ips);
     sset_destroy(&lrnat_rec->external_macs);
 }
 
diff --git a/northd/en-lr-nat.h b/northd/en-lr-nat.h
index 71d3a1782d..81a7b0abd7 100644
--- a/northd/en-lr-nat.h
+++ b/northd/en-lr-nat.h
@@ -64,9 +64,6 @@  struct lr_nat_record {
 
     bool has_distributed_nat;
 
-    /* Set of nat external ips on the router. */
-    struct sset external_ips;
-
     /* Set of nat external macs on the router. */
     struct sset external_macs;
 
diff --git a/northd/en-lr-stateful.c b/northd/en-lr-stateful.c
index 6d0192487c..baf1bd2f89 100644
--- a/northd/en-lr-stateful.c
+++ b/northd/en-lr-stateful.c
@@ -82,7 +82,6 @@  static void remove_lrouter_lb_reachable_ips(struct lr_stateful_record *,
                                             enum lb_neighbor_responder_mode,
                                             const struct sset *lb_ips_v4,
                                             const struct sset *lb_ips_v6);
-static bool lr_stateful_rebuild_vip_nats(struct lr_stateful_record *);
 
 /* 'lr_stateful' engine node manages the NB logical router LB data.
  */
@@ -110,7 +109,6 @@  en_lr_stateful_clear_tracked_data(void *data_)
     struct ed_type_lr_stateful *data = data_;
 
     hmapx_clear(&data->trk_data.crupdated);
-    data->trk_data.vip_nats_changed = false;
 }
 
 void
@@ -198,10 +196,6 @@  lr_stateful_lb_data_handler(struct engine_node *node, void *data_)
 
             /* Add the lr_stateful_rec rec to the tracking data. */
             hmapx_add(&data->trk_data.crupdated, lr_stateful_rec);
-
-            if (!sset_is_empty(&lr_stateful_rec->vip_nats)) {
-                data->trk_data.vip_nats_changed = true;
-            }
             continue;
         }
 
@@ -319,9 +313,6 @@  lr_stateful_lb_data_handler(struct engine_node *node, void *data_)
 
         HMAPX_FOR_EACH (hmapx_node, &data->trk_data.crupdated) {
             struct lr_stateful_record *lr_stateful_rec = hmapx_node->data;
-            if (lr_stateful_rebuild_vip_nats(lr_stateful_rec)) {
-                data->trk_data.vip_nats_changed = true;
-            }
             const struct ovn_datapath *od =
                 ovn_datapaths_find_by_index(input_data.lr_datapaths,
                                             lr_stateful_rec->lr_index);
@@ -360,13 +351,6 @@  lr_stateful_lr_nat_handler(struct engine_node *node, void *data_)
                 lr_stateful_record_create(&data->table, lrnat_rec, od,
                                           input_data.lb_datapaths_map,
                                           input_data.lbgrp_datapaths_map);
-            if (!sset_is_empty(&lr_stateful_rec->vip_nats)) {
-                data->trk_data.vip_nats_changed = true;
-            }
-        } else {
-            if (lr_stateful_rebuild_vip_nats(lr_stateful_rec)) {
-                data->trk_data.vip_nats_changed = true;
-            }
         }
 
         /* Add the lr_stateful_rec rec to the tracking data. */
@@ -525,11 +509,6 @@  lr_stateful_record_create(struct lr_stateful_table *table,
         build_lrouter_lb_reachable_ips(lr_stateful_rec, od, lb_dps->lb);
     }
 
-    sset_init(&lr_stateful_rec->vip_nats);
-
-    if (nbr->n_nat) {
-        lr_stateful_rebuild_vip_nats(lr_stateful_rec);
-    }
     lr_stateful_rec->has_lb_vip = od_has_lb_vip(od);
 
     hmap_insert(&table->entries, &lr_stateful_rec->key_node,
@@ -556,7 +535,6 @@  static void
 lr_stateful_record_destroy(struct lr_stateful_record *lr_stateful_rec)
 {
     ovn_lb_ip_set_destroy(lr_stateful_rec->lb_ips);
-    sset_destroy(&lr_stateful_rec->vip_nats);
     lflow_ref_destroy(lr_stateful_rec->lflow_ref);
     free(lr_stateful_rec);
 }
@@ -671,32 +649,3 @@  remove_lrouter_lb_reachable_ips(struct lr_stateful_record *lr_stateful_rec,
                              ip_address);
     }
 }
-
-static bool
-lr_stateful_rebuild_vip_nats(struct lr_stateful_record *lr_stateful_rec)
-{
-    struct sset old_vip_nats = SSET_INITIALIZER(&old_vip_nats);
-    sset_swap(&lr_stateful_rec->vip_nats, &old_vip_nats);
-
-    const char *external_ip;
-    SSET_FOR_EACH (external_ip, &lr_stateful_rec->lrnat_rec->external_ips) {
-        bool is_vip_nat = false;
-        if (addr_is_ipv6(external_ip)) {
-            is_vip_nat = sset_contains(&lr_stateful_rec->lb_ips->ips_v6,
-                                       external_ip);
-        } else {
-            is_vip_nat = sset_contains(&lr_stateful_rec->lb_ips->ips_v4,
-                                       external_ip);
-        }
-
-        if (is_vip_nat) {
-            sset_add(&lr_stateful_rec->vip_nats, external_ip);
-        }
-    }
-
-    bool vip_nats_changed = !sset_equals(&lr_stateful_rec->vip_nats,
-                                         &old_vip_nats);
-    sset_destroy(&old_vip_nats);
-
-    return vip_nats_changed;
-}
diff --git a/northd/en-lr-stateful.h b/northd/en-lr-stateful.h
index 4a07174c31..e15c61e575 100644
--- a/northd/en-lr-stateful.h
+++ b/northd/en-lr-stateful.h
@@ -62,9 +62,6 @@  struct lr_stateful_record {
     /* Load Balancer vIPs relevant for this datapath. */
     struct ovn_lb_ip_set *lb_ips;
 
-    /* sset of vips which are also part of lr nats. */
-    struct sset vip_nats;
-
     /* 'lflow_ref' is used to reference logical flows generated for
      * this lr_stateful_record.
      *
@@ -107,10 +104,6 @@  struct lr_stateful_table {
 struct lr_stateful_tracked_data {
     /* Created or updated logical router with LB and/or NAT data. */
     struct hmapx crupdated; /* Stores 'struct lr_stateful_record'. */
-
-    /* Indicates if any router's NATs changed which were also LB vips
-     * or vice versa. */
-    bool vip_nats_changed;
 };
 
 struct ed_type_lr_stateful {
@@ -142,7 +135,7 @@  const struct lr_stateful_record *lr_stateful_table_find_by_index(
 static inline bool
 lr_stateful_has_tracked_data(struct lr_stateful_tracked_data *trk_data)
 {
-    return !hmapx_is_empty(&trk_data->crupdated) || trk_data->vip_nats_changed;
+    return !hmapx_is_empty(&trk_data->crupdated);
 }
 
 static inline bool
diff --git a/northd/northd.c b/northd/northd.c
index 2c3560ce2d..4b39137e7c 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -11061,7 +11061,6 @@  build_lrouter_nat_flows_for_lb(
     struct ds skip_snat_act = DS_EMPTY_INITIALIZER;
     struct ds force_snat_act = DS_EMPTY_INITIALIZER;
     struct ds undnat_match = DS_EMPTY_INITIALIZER;
-    struct ds unsnat_match = DS_EMPTY_INITIALIZER;
     struct ds gw_redir_action = DS_EMPTY_INITIALIZER;
 
     ds_clear(match);
@@ -11107,13 +11106,6 @@  build_lrouter_nat_flows_for_lb(
     /* Remove the trailing " || ". */
     ds_truncate(&undnat_match, undnat_match.length - 4);
 
-    ds_put_format(&unsnat_match, "%s && %s.dst == %s && %s",
-                  ip_match, ip_match, lb_vip->vip_str, lb->proto);
-    if (lb_vip->port_str) {
-        ds_put_format(&unsnat_match, " && %s.dst == %s", lb->proto,
-                      lb_vip->port_str);
-    }
-
     struct lrouter_nat_lb_flows_ctx ctx = {
         .lb_vip = lb_vip,
         .lb = lb,
@@ -11171,23 +11163,6 @@  build_lrouter_nat_flows_for_lb(
         if (lb->affinity_timeout) {
             bitmap_set1(aff_dp_bitmap[type], index);
         }
-
-        if (sset_contains(&lrnat_rec->external_ips, lb_vip->vip_str)) {
-            /* The load balancer vip is also present in the NAT entries.
-             * So add a high priority lflow to advance the the packet
-             * destined to the vip (and the vip port if defined)
-             * in the S_ROUTER_IN_UNSNAT stage.
-             * There seems to be an issue with ovs-vswitchd. When the new
-             * connection packet destined for the lb vip is received,
-             * it is dnat'ed in the S_ROUTER_IN_DNAT stage in the dnat
-             * conntrack zone. For the next packet, if it goes through
-             * unsnat stage, the conntrack flags are not set properly, and
-             * it doesn't hit the established state flows in
-             * S_ROUTER_IN_DNAT stage. */
-            ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_UNSNAT, 120,
-                                    ds_cstr(&unsnat_match), "next;",
-                                    &lb->nlb->header_, lb_dps->lflow_ref);
-        }
     }
 
     for (size_t type = 0; type < LROUTER_NAT_LB_FLOW_MAX; type++) {
@@ -11199,7 +11174,6 @@  build_lrouter_nat_flows_for_lb(
                                    lr_datapaths, lb_dps->lflow_ref);
     }
 
-    ds_destroy(&unsnat_match);
     ds_destroy(&undnat_match);
     ds_destroy(&skip_snat_act);
     ds_destroy(&force_snat_act);
@@ -15070,12 +15044,7 @@  build_lrouter_nat_defrag_and_lb(
                                    l3dgw_port, stateless, lflow_ref);
 
         /* ARP resolve for NAT IPs. */
-        if (od->is_gw_router) {
-            /* Add the NAT external_ip to the nat_entries for
-             * gateway routers. This is required for adding load balancer
-             * flows.*/
-            sset_add(&nat_entries, nat->external_ip);
-        } else {
+        if (!od->is_gw_router) {
             if (!sset_contains(&nat_entries, nat->external_ip)) {
                 /* Drop packets coming in from external that still has
                  * destination IP equals to the NAT external IP, to avoid loop.
diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
index 6fdd761dac..c7f65b9666 100644
--- a/tests/ovn-northd.at
+++ b/tests/ovn-northd.at
@@ -1610,9 +1610,6 @@  AT_CAPTURE_FILE([sbflows])
 # dnat_and_snat or snat entry.
 AT_CHECK([grep "lr_in_unsnat" sbflows | ovn_strip_lflows], [0], [dnl
   table=??(lr_in_unsnat       ), priority=0    , match=(1), action=(next;)
-  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.1 && tcp && tcp.dst == 8080), action=(next;)
-  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.4 && udp && udp.dst == 8080), action=(next;)
-  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.5 && tcp && tcp.dst == 8080), action=(next;)
   table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 192.168.2.1), action=(ct_snat;)
   table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 192.168.2.4), action=(ct_snat;)
 ])
@@ -1643,9 +1640,6 @@  AT_CAPTURE_FILE([sbflows])
 # dnat_and_snat or snat entry.
 AT_CHECK([grep "lr_in_unsnat" sbflows | ovn_strip_lflows], [0], [dnl
   table=??(lr_in_unsnat       ), priority=0    , match=(1), action=(next;)
-  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.1 && tcp && tcp.dst == 8080), action=(next;)
-  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.4 && udp && udp.dst == 8080), action=(next;)
-  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 192.168.2.5 && tcp && tcp.dst == 8080), action=(next;)
   table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 192.168.2.1), action=(ct_snat;)
   table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 192.168.2.4), action=(ct_snat;)
 ])
@@ -5868,7 +5862,6 @@  AT_CHECK([grep "lr_in_unsnat" lr0flows | ovn_strip_lflows], [0], [dnl
   table=??(lr_in_unsnat       ), priority=0    , match=(1), action=(next;)
   table=??(lr_in_unsnat       ), priority=110  , match=(inport == "lr0-public" && ip4.dst == 172.168.0.10), action=(ct_snat;)
   table=??(lr_in_unsnat       ), priority=110  , match=(inport == "lr0-sw0" && ip4.dst == 10.0.0.1), action=(ct_snat;)
-  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 172.168.0.10 && tcp && tcp.dst == 9082), action=(next;)
   table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.10), action=(ct_snat;)
   table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.20), action=(ct_snat;)
   table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.30), action=(ct_snat;)
@@ -5945,7 +5938,6 @@  AT_CHECK([grep "lr_in_unsnat" lr0flows | ovn_strip_lflows], [0], [dnl
   table=??(lr_in_unsnat       ), priority=110  , match=(inport == "lr0-public" && ip6.dst == def0::10), action=(ct_snat;)
   table=??(lr_in_unsnat       ), priority=110  , match=(inport == "lr0-sw0" && ip4.dst == 10.0.0.1), action=(ct_snat;)
   table=??(lr_in_unsnat       ), priority=110  , match=(inport == "lr0-sw0" && ip6.dst == aef0::1), action=(ct_snat;)
-  table=??(lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 172.168.0.10 && tcp && tcp.dst == 9082), action=(next;)
   table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.10), action=(ct_snat;)
   table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.20), action=(ct_snat;)
   table=??(lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.30), action=(ct_snat;)
@@ -12010,7 +12002,7 @@  check ovn-nbctl lb-add lb2 172.168.0.150:80 10.0.0.40:8080
 check ovn-nbctl lr-lb-add lr0 lb1
 check ovn-nbctl lr-lb-add lr0 lb2
 
-# lflow engine should recompute since the nat ip 172.168.0.140
+# lflow engine should not recompute even if the nat ip 172.168.0.140
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-add lr0 dnat_and_snat 172.168.0.140 10.0.0.20
@@ -12019,10 +12011,10 @@  check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
-# lflow engine should recompute since the nat ip 172.168.0.150
+# lflow engine should not recompute even if the nat ip 172.168.0.150
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-add lr0 dnat_and_snat 172.168.0.150 10.0.0.41
@@ -12031,10 +12023,10 @@  check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
-# lflow engine should recompute since the deleted nat ip 172.168.0.150
+# lflow engine should not recompute even if the deleted nat ip 172.168.0.150
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-del lr0 dnat_and_snat 172.168.0.150
@@ -12043,10 +12035,10 @@  check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
-# lflow engine should recompute since the deleted nat ip 172.168.0.140
+# lflow engine should not recompute even if the deleted nat ip 172.168.0.140
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-del lr0 dnat_and_snat 172.168.0.140
@@ -12055,7 +12047,7 @@  check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
 # Delete the NAT
diff --git a/tests/system-ovn.at b/tests/system-ovn.at
index c22c7882fc..2411b0267e 100644
--- a/tests/system-ovn.at
+++ b/tests/system-ovn.at
@@ -12096,6 +12096,86 @@  OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
 /connection dropped.*/d"])
 AT_CLEANUP
 
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([load balancing in gateway router - client behind LB with SNAT])
+AT_SKIP_IF([test $HAVE_NC = no])
+AT_KEYWORDS([lb])
+
+ovn_start
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-int])
+
+check ovs-vsctl \
+        -- set Open_vSwitch . external-ids:system-id=hv1 \
+        -- set Open_vSwitch . external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
+        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
+        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
+        -- set bridge br-int fail-mode=secure other-config:disable-in-band=true
+
+start_daemon ovn-controller
+
+check ovn-nbctl lr-add lr \
+    -- set logical_router lr options:chassis=hv1
+check ovn-nbctl lrp-add lr lr-ls1 00:00:00:00:01:00 41.41.41.2/24
+check ovn-nbctl lrp-add lr lr-ls2 00:00:00:00:02:00 42.42.42.2/24
+check ovn-nbctl ls-add ls1
+check ovn-nbctl ls-add ls2
+
+check ovn-nbctl lsp-add ls1 ls1-lr
+check ovn-nbctl lsp-set-addresses ls1-lr 00:00:00:00:01:00
+check ovn-nbctl lsp-set-type ls1-lr router
+check ovn-nbctl lsp-set-options ls1-lr router-port=lr-ls1
+check ovn-nbctl lsp-add ls1 vm1
+check ovn-nbctl lsp-set-addresses vm1 00:00:00:00:00:01
+
+check ovn-nbctl lsp-add ls2 ls2-lr
+check ovn-nbctl lsp-set-addresses ls2-lr 00:00:00:00:02:00
+check ovn-nbctl lsp-set-type ls2-lr router
+check ovn-nbctl lsp-set-options ls2-lr router-port=lr-ls2
+check ovn-nbctl lsp-add ls2 vm2
+check ovn-nbctl lsp-set-addresses vm2 00:00:00:00:00:02
+
+dnl LB using the router IP connected to vm2 as VIP.
+check ovn-nbctl lb-add lb-test 42.42.42.2:8080 41.41.41.1:8080 tcp \
+    -- lr-lb-add lr lb-test
+
+dnl SNAT everything coming from vm1 to the router IP (towards vm2).
+check ovn-nbctl lr-nat-add lr snat 42.42.42.2 41.41.41.1
+
+ADD_NAMESPACES(vm1)
+ADD_VETH(vm1, vm1, br-int, "41.41.41.1/24", "00:00:00:00:00:01", "41.41.41.2")
+
+ADD_NAMESPACES(vm2)
+ADD_VETH(vm2, vm2, br-int, "42.42.42.1/24", "00:00:00:00:00:02", "42.42.42.2")
+
+dnl Start a server on vm2.
+NETNS_DAEMONIZE([vm2], [nc -l -k 42.42.42.1 80], [vm2.pid])
+
+dnl Wait for ovn-controller to catch up.
+wait_for_ports_up
+check ovn-nbctl --wait=hv sync
+
+dnl Test the connection originating something that uses the same source port
+dnl as the LB VIP.
+NS_CHECK_EXEC([vm1], [nc -z -p 8080 42.42.42.1 80], 0, [ignore], [ignore])
+
+OVS_APP_EXIT_AND_WAIT([ovn-controller])
+
+as ovn-sb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as ovn-nb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as northd
+OVS_APP_EXIT_AND_WAIT([ovn-northd])
+
+as
+OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
+/connection dropped.*/d"])
+AT_CLEANUP
+])
+
 OVN_FOR_EACH_NORTHD([
 AT_SETUP([load balancing affinity sessions - auto clear learnt flows])
 AT_SKIP_IF([test $HAVE_NC = no])