| Message ID | 20240226125830.2173125-1-dceara@redhat.com |
|---|---|
| State | Accepted |
| Series | [ovs-dev] northd: Don't skip the unSNAT stage for traffic towards VIPs. |
| Context | Check | Description |
|---|---|---|
| ovsrobot/apply-robot | success | apply and check: success |
| ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
| ovsrobot/github-robot-_ovn-kubernetes | success | github build: passed |
On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote: > > Otherwise, in case there's also a SNAT rule that uses the VIP as > external IP, we break sessions initiated from behind the VIP. > > This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for > load balancer vips in router ingress pipeline"). That's OK because > commit 384a7c6237da ("northd: Refactor Logical Flows for routers with > DNAT/Load Balancers") addressed the original issue in a better way: > > In the reply direction, the order of traversal of the tables > "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect > datapath flows that check ct_state in the wrong conntrack zone. > This is illustrated below where reply trafic enters the physical host > port (6) and traverses DNAT zone (14), SNAT zone (default), back to the > DNAT zone and then on to Logical Switch Port zone (22). The third > flow is incorrectly checking the state from the SNAT zone instead > of the DNAT zone. > > We also add a system test to ensure traffic initiated from behind a VIP > + SNAT is not broken. > > Another nice side effect is that the northd I-P is slightly simplified > because we don't need to track NAT external IPs anymore. > > Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline") > Reported-at: https://issues.redhat.com/browse/FDP-291 > Signed-off-by: Dumitru Ceara <dceara@redhat.com> Thanks for the fix. It also simplified the lr-nat-stateful code. Acked-by: Numan Siddique <numans@ovn.org> Numan > --- > northd/en-lflow.c | 3 +- > northd/en-lr-nat.c | 5 --- > northd/en-lr-nat.h | 3 -- > northd/en-lr-stateful.c | 51 -------------------------- > northd/en-lr-stateful.h | 9 +---- > northd/northd.c | 33 +---------------- > tests/ovn-northd.at | 24 +++++-------- > tests/system-ovn.at | 80 +++++++++++++++++++++++++++++++++++++++++ > 8 files changed, 91 insertions(+), 117 deletions(-) > > diff --git a/northd/en-lflow.c b/northd/en-lflow.c > index f1a83839df..c4b927fb8c 100644 > --- a/northd/en-lflow.c > +++ b/northd/en-lflow.c > @@ -174,8 +174,7 @@ lflow_lr_stateful_handler(struct engine_node *node, void *data) > struct ed_type_lr_stateful *lr_sful_data = > engine_get_input_data("lr_stateful", node); > > - if (!lr_stateful_has_tracked_data(&lr_sful_data->trk_data) > - || lr_sful_data->trk_data.vip_nats_changed) { > + if (!lr_stateful_has_tracked_data(&lr_sful_data->trk_data)) { > return false; > } > > diff --git a/northd/en-lr-nat.c b/northd/en-lr-nat.c > index ad11025c69..215d924e42 100644 > --- a/northd/en-lr-nat.c > +++ b/northd/en-lr-nat.c > @@ -219,10 +219,6 @@ lr_nat_record_init(struct lr_nat_record *lrnat_rec, > lrnat_rec->nbr_uuid = od->nbr->header_.uuid; > > shash_init(&lrnat_rec->snat_ips); > - sset_init(&lrnat_rec->external_ips); > - for (size_t i = 0; i < od->nbr->n_nat; i++) { > - sset_add(&lrnat_rec->external_ips, od->nbr->nat[i]->external_ip); > - } > > sset_init(&lrnat_rec->external_macs); > lrnat_rec->has_distributed_nat = false; > @@ -343,7 +339,6 @@ lr_nat_record_clear(struct lr_nat_record *lrnat_rec) > } > > free(lrnat_rec->nat_entries); > - sset_destroy(&lrnat_rec->external_ips); > sset_destroy(&lrnat_rec->external_macs); > } > > diff --git a/northd/en-lr-nat.h b/northd/en-lr-nat.h > index 71d3a1782d..81a7b0abd7 100644 > --- a/northd/en-lr-nat.h > +++ b/northd/en-lr-nat.h > @@ -64,9 +64,6 @@ struct lr_nat_record { > > bool has_distributed_nat; > > - /* Set of nat external ips on the router. 
*/ > - struct sset external_ips; > - > /* Set of nat external macs on the router. */ > struct sset external_macs; > > diff --git a/northd/en-lr-stateful.c b/northd/en-lr-stateful.c > index 6d0192487c..baf1bd2f89 100644 > --- a/northd/en-lr-stateful.c > +++ b/northd/en-lr-stateful.c > @@ -82,7 +82,6 @@ static void remove_lrouter_lb_reachable_ips(struct lr_stateful_record *, > enum lb_neighbor_responder_mode, > const struct sset *lb_ips_v4, > const struct sset *lb_ips_v6); > -static bool lr_stateful_rebuild_vip_nats(struct lr_stateful_record *); > > /* 'lr_stateful' engine node manages the NB logical router LB data. > */ > @@ -110,7 +109,6 @@ en_lr_stateful_clear_tracked_data(void *data_) > struct ed_type_lr_stateful *data = data_; > > hmapx_clear(&data->trk_data.crupdated); > - data->trk_data.vip_nats_changed = false; > } > > void > @@ -198,10 +196,6 @@ lr_stateful_lb_data_handler(struct engine_node *node, void *data_) > > /* Add the lr_stateful_rec rec to the tracking data. */ > hmapx_add(&data->trk_data.crupdated, lr_stateful_rec); > - > - if (!sset_is_empty(&lr_stateful_rec->vip_nats)) { > - data->trk_data.vip_nats_changed = true; > - } > continue; > } > > @@ -319,9 +313,6 @@ lr_stateful_lb_data_handler(struct engine_node *node, void *data_) > > HMAPX_FOR_EACH (hmapx_node, &data->trk_data.crupdated) { > struct lr_stateful_record *lr_stateful_rec = hmapx_node->data; > - if (lr_stateful_rebuild_vip_nats(lr_stateful_rec)) { > - data->trk_data.vip_nats_changed = true; > - } > const struct ovn_datapath *od = > ovn_datapaths_find_by_index(input_data.lr_datapaths, > lr_stateful_rec->lr_index); > @@ -360,13 +351,6 @@ lr_stateful_lr_nat_handler(struct engine_node *node, void *data_) > lr_stateful_record_create(&data->table, lrnat_rec, od, > input_data.lb_datapaths_map, > input_data.lbgrp_datapaths_map); > - if (!sset_is_empty(&lr_stateful_rec->vip_nats)) { > - data->trk_data.vip_nats_changed = true; > - } > - } else { > - if (lr_stateful_rebuild_vip_nats(lr_stateful_rec)) { > - data->trk_data.vip_nats_changed = true; > - } > } > > /* Add the lr_stateful_rec rec to the tracking data. 
*/ > @@ -525,11 +509,6 @@ lr_stateful_record_create(struct lr_stateful_table *table, > build_lrouter_lb_reachable_ips(lr_stateful_rec, od, lb_dps->lb); > } > > - sset_init(&lr_stateful_rec->vip_nats); > - > - if (nbr->n_nat) { > - lr_stateful_rebuild_vip_nats(lr_stateful_rec); > - } > lr_stateful_rec->has_lb_vip = od_has_lb_vip(od); > > hmap_insert(&table->entries, &lr_stateful_rec->key_node, > @@ -556,7 +535,6 @@ static void > lr_stateful_record_destroy(struct lr_stateful_record *lr_stateful_rec) > { > ovn_lb_ip_set_destroy(lr_stateful_rec->lb_ips); > - sset_destroy(&lr_stateful_rec->vip_nats); > lflow_ref_destroy(lr_stateful_rec->lflow_ref); > free(lr_stateful_rec); > } > @@ -671,32 +649,3 @@ remove_lrouter_lb_reachable_ips(struct lr_stateful_record *lr_stateful_rec, > ip_address); > } > } > - > -static bool > -lr_stateful_rebuild_vip_nats(struct lr_stateful_record *lr_stateful_rec) > -{ > - struct sset old_vip_nats = SSET_INITIALIZER(&old_vip_nats); > - sset_swap(&lr_stateful_rec->vip_nats, &old_vip_nats); > - > - const char *external_ip; > - SSET_FOR_EACH (external_ip, &lr_stateful_rec->lrnat_rec->external_ips) { > - bool is_vip_nat = false; > - if (addr_is_ipv6(external_ip)) { > - is_vip_nat = sset_contains(&lr_stateful_rec->lb_ips->ips_v6, > - external_ip); > - } else { > - is_vip_nat = sset_contains(&lr_stateful_rec->lb_ips->ips_v4, > - external_ip); > - } > - > - if (is_vip_nat) { > - sset_add(&lr_stateful_rec->vip_nats, external_ip); > - } > - } > - > - bool vip_nats_changed = !sset_equals(&lr_stateful_rec->vip_nats, > - &old_vip_nats); > - sset_destroy(&old_vip_nats); > - > - return vip_nats_changed; > -} > diff --git a/northd/en-lr-stateful.h b/northd/en-lr-stateful.h > index 4a07174c31..e15c61e575 100644 > --- a/northd/en-lr-stateful.h > +++ b/northd/en-lr-stateful.h > @@ -62,9 +62,6 @@ struct lr_stateful_record { > /* Load Balancer vIPs relevant for this datapath. */ > struct ovn_lb_ip_set *lb_ips; > > - /* sset of vips which are also part of lr nats. */ > - struct sset vip_nats; > - > /* 'lflow_ref' is used to reference logical flows generated for > * this lr_stateful_record. > * > @@ -107,10 +104,6 @@ struct lr_stateful_table { > struct lr_stateful_tracked_data { > /* Created or updated logical router with LB and/or NAT data. */ > struct hmapx crupdated; /* Stores 'struct lr_stateful_record'. */ > - > - /* Indicates if any router's NATs changed which were also LB vips > - * or vice versa. */ > - bool vip_nats_changed; > }; > > struct ed_type_lr_stateful { > @@ -142,7 +135,7 @@ const struct lr_stateful_record *lr_stateful_table_find_by_index( > static inline bool > lr_stateful_has_tracked_data(struct lr_stateful_tracked_data *trk_data) > { > - return !hmapx_is_empty(&trk_data->crupdated) || trk_data->vip_nats_changed; > + return !hmapx_is_empty(&trk_data->crupdated); > } > > static inline bool > diff --git a/northd/northd.c b/northd/northd.c > index 2c3560ce2d..4b39137e7c 100644 > --- a/northd/northd.c > +++ b/northd/northd.c > @@ -11061,7 +11061,6 @@ build_lrouter_nat_flows_for_lb( > struct ds skip_snat_act = DS_EMPTY_INITIALIZER; > struct ds force_snat_act = DS_EMPTY_INITIALIZER; > struct ds undnat_match = DS_EMPTY_INITIALIZER; > - struct ds unsnat_match = DS_EMPTY_INITIALIZER; > struct ds gw_redir_action = DS_EMPTY_INITIALIZER; > > ds_clear(match); > @@ -11107,13 +11106,6 @@ build_lrouter_nat_flows_for_lb( > /* Remove the trailing " || ". 
*/ > ds_truncate(&undnat_match, undnat_match.length - 4); > > - ds_put_format(&unsnat_match, "%s && %s.dst == %s && %s", > - ip_match, ip_match, lb_vip->vip_str, lb->proto); > - if (lb_vip->port_str) { > - ds_put_format(&unsnat_match, " && %s.dst == %s", lb->proto, > - lb_vip->port_str); > - } > - > struct lrouter_nat_lb_flows_ctx ctx = { > .lb_vip = lb_vip, > .lb = lb, > @@ -11171,23 +11163,6 @@ build_lrouter_nat_flows_for_lb( > if (lb->affinity_timeout) { > bitmap_set1(aff_dp_bitmap[type], index); > } > - > - if (sset_contains(&lrnat_rec->external_ips, lb_vip->vip_str)) { > - /* The load balancer vip is also present in the NAT entries. > - * So add a high priority lflow to advance the the packet > - * destined to the vip (and the vip port if defined) > - * in the S_ROUTER_IN_UNSNAT stage. > - * There seems to be an issue with ovs-vswitchd. When the new > - * connection packet destined for the lb vip is received, > - * it is dnat'ed in the S_ROUTER_IN_DNAT stage in the dnat > - * conntrack zone. For the next packet, if it goes through > - * unsnat stage, the conntrack flags are not set properly, and > - * it doesn't hit the established state flows in > - * S_ROUTER_IN_DNAT stage. */ > - ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_UNSNAT, 120, > - ds_cstr(&unsnat_match), "next;", > - &lb->nlb->header_, lb_dps->lflow_ref); > - } > } > > for (size_t type = 0; type < LROUTER_NAT_LB_FLOW_MAX; type++) { > @@ -11199,7 +11174,6 @@ build_lrouter_nat_flows_for_lb( > lr_datapaths, lb_dps->lflow_ref); > } > > - ds_destroy(&unsnat_match); > ds_destroy(&undnat_match); > ds_destroy(&skip_snat_act); > ds_destroy(&force_snat_act); > @@ -15070,12 +15044,7 @@ build_lrouter_nat_defrag_and_lb( > l3dgw_port, stateless, lflow_ref); > > /* ARP resolve for NAT IPs. */ > - if (od->is_gw_router) { > - /* Add the NAT external_ip to the nat_entries for > - * gateway routers. This is required for adding load balancer > - * flows.*/ > - sset_add(&nat_entries, nat->external_ip); > - } else { > + if (!od->is_gw_router) { > if (!sset_contains(&nat_entries, nat->external_ip)) { > /* Drop packets coming in from external that still has > * destination IP equals to the NAT external IP, to avoid loop. > diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at > index 6fdd761dac..c7f65b9666 100644 > --- a/tests/ovn-northd.at > +++ b/tests/ovn-northd.at > @@ -1610,9 +1610,6 @@ AT_CAPTURE_FILE([sbflows]) > # dnat_and_snat or snat entry. > AT_CHECK([grep "lr_in_unsnat" sbflows | ovn_strip_lflows], [0], [dnl > table=??(lr_in_unsnat ), priority=0 , match=(1), action=(next;) > - table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.1 && tcp && tcp.dst == 8080), action=(next;) > - table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.4 && udp && udp.dst == 8080), action=(next;) > - table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.5 && tcp && tcp.dst == 8080), action=(next;) > table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.1), action=(ct_snat;) > table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.4), action=(ct_snat;) > ]) > @@ -1643,9 +1640,6 @@ AT_CAPTURE_FILE([sbflows]) > # dnat_and_snat or snat entry. 
> AT_CHECK([grep "lr_in_unsnat" sbflows | ovn_strip_lflows], [0], [dnl > table=??(lr_in_unsnat ), priority=0 , match=(1), action=(next;) > - table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.1 && tcp && tcp.dst == 8080), action=(next;) > - table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.4 && udp && udp.dst == 8080), action=(next;) > - table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.5 && tcp && tcp.dst == 8080), action=(next;) > table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.1), action=(ct_snat;) > table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.4), action=(ct_snat;) > ]) > @@ -5868,7 +5862,6 @@ AT_CHECK([grep "lr_in_unsnat" lr0flows | ovn_strip_lflows], [0], [dnl > table=??(lr_in_unsnat ), priority=0 , match=(1), action=(next;) > table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-public" && ip4.dst == 172.168.0.10), action=(ct_snat;) > table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-sw0" && ip4.dst == 10.0.0.1), action=(ct_snat;) > - table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 172.168.0.10 && tcp && tcp.dst == 9082), action=(next;) > table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.10), action=(ct_snat;) > table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.20), action=(ct_snat;) > table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.30), action=(ct_snat;) > @@ -5945,7 +5938,6 @@ AT_CHECK([grep "lr_in_unsnat" lr0flows | ovn_strip_lflows], [0], [dnl > table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-public" && ip6.dst == def0::10), action=(ct_snat;) > table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-sw0" && ip4.dst == 10.0.0.1), action=(ct_snat;) > table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-sw0" && ip6.dst == aef0::1), action=(ct_snat;) > - table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 172.168.0.10 && tcp && tcp.dst == 9082), action=(next;) > table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.10), action=(ct_snat;) > table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.20), action=(ct_snat;) > table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.30), action=(ct_snat;) > @@ -12010,7 +12002,7 @@ check ovn-nbctl lb-add lb2 172.168.0.150:80 10.0.0.40:8080 > check ovn-nbctl lr-lb-add lr0 lb1 > check ovn-nbctl lr-lb-add lr0 lb2 > > -# lflow engine should recompute since the nat ip 172.168.0.140 > +# lflow engine should not recompute even if the nat ip 172.168.0.140 > # is a lb vip. > check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats > check ovn-nbctl --wait=sb lr-nat-add lr0 dnat_and_snat 172.168.0.140 10.0.0.20 > @@ -12019,10 +12011,10 @@ check_engine_stats lr_nat norecompute compute > check_engine_stats lr_stateful norecompute compute > check_engine_stats sync_to_sb_pb recompute nocompute > check_engine_stats sync_to_sb_lb norecompute compute > -check_engine_stats lflow recompute nocompute > +check_engine_stats lflow norecompute compute > CHECK_NO_CHANGE_AFTER_RECOMPUTE > > -# lflow engine should recompute since the nat ip 172.168.0.150 > +# lflow engine should not recompute even if the nat ip 172.168.0.150 > # is a lb vip. 
> check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats > check ovn-nbctl --wait=sb lr-nat-add lr0 dnat_and_snat 172.168.0.150 10.0.0.41 > @@ -12031,10 +12023,10 @@ check_engine_stats lr_nat norecompute compute > check_engine_stats lr_stateful norecompute compute > check_engine_stats sync_to_sb_pb recompute nocompute > check_engine_stats sync_to_sb_lb norecompute compute > -check_engine_stats lflow recompute nocompute > +check_engine_stats lflow norecompute compute > CHECK_NO_CHANGE_AFTER_RECOMPUTE > > -# lflow engine should recompute since the deleted nat ip 172.168.0.150 > +# lflow engine should not recompute even if the deleted nat ip 172.168.0.150 > # is a lb vip. > check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats > check ovn-nbctl --wait=sb lr-nat-del lr0 dnat_and_snat 172.168.0.150 > @@ -12043,10 +12035,10 @@ check_engine_stats lr_nat norecompute compute > check_engine_stats lr_stateful norecompute compute > check_engine_stats sync_to_sb_pb recompute nocompute > check_engine_stats sync_to_sb_lb norecompute compute > -check_engine_stats lflow recompute nocompute > +check_engine_stats lflow norecompute compute > CHECK_NO_CHANGE_AFTER_RECOMPUTE > > -# lflow engine should recompute since the deleted nat ip 172.168.0.140 > +# lflow engine should not recompute even if the deleted nat ip 172.168.0.140 > # is a lb vip. > check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats > check ovn-nbctl --wait=sb lr-nat-del lr0 dnat_and_snat 172.168.0.140 > @@ -12055,7 +12047,7 @@ check_engine_stats lr_nat norecompute compute > check_engine_stats lr_stateful norecompute compute > check_engine_stats sync_to_sb_pb recompute nocompute > check_engine_stats sync_to_sb_lb norecompute compute > -check_engine_stats lflow recompute nocompute > +check_engine_stats lflow norecompute compute > CHECK_NO_CHANGE_AFTER_RECOMPUTE > > # Delete the NAT > diff --git a/tests/system-ovn.at b/tests/system-ovn.at > index c22c7882fc..2411b0267e 100644 > --- a/tests/system-ovn.at > +++ b/tests/system-ovn.at > @@ -12096,6 +12096,86 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d > /connection dropped.*/d"]) > AT_CLEANUP > > +OVN_FOR_EACH_NORTHD([ > +AT_SETUP([load balancing in gateway router - client behind LB with SNAT]) > +AT_SKIP_IF([test $HAVE_NC = no]) > +AT_KEYWORDS([lb]) > + > +ovn_start > +OVS_TRAFFIC_VSWITCHD_START() > +ADD_BR([br-int]) > + > +check ovs-vsctl \ > + -- set Open_vSwitch . external-ids:system-id=hv1 \ > + -- set Open_vSwitch . external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \ > + -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \ > + -- set Open_vSwitch . 
external-ids:ovn-encap-ip=169.0.0.1 \ > + -- set bridge br-int fail-mode=secure other-config:disable-in-band=true > + > +start_daemon ovn-controller > + > +check ovn-nbctl lr-add lr \ > + -- set logical_router lr options:chassis=hv1 > +check ovn-nbctl lrp-add lr lr-ls1 00:00:00:00:01:00 41.41.41.2/24 > +check ovn-nbctl lrp-add lr lr-ls2 00:00:00:00:02:00 42.42.42.2/24 > +check ovn-nbctl ls-add ls1 > +check ovn-nbctl ls-add ls2 > + > +check ovn-nbctl lsp-add ls1 ls1-lr > +check ovn-nbctl lsp-set-addresses ls1-lr 00:00:00:00:01:00 > +check ovn-nbctl lsp-set-type ls1-lr router > +check ovn-nbctl lsp-set-options ls1-lr router-port=lr-ls1 > +check ovn-nbctl lsp-add ls1 vm1 > +check ovn-nbctl lsp-set-addresses vm1 00:00:00:00:00:01 > + > +check ovn-nbctl lsp-add ls2 ls2-lr > +check ovn-nbctl lsp-set-addresses ls2-lr 00:00:00:00:02:00 > +check ovn-nbctl lsp-set-type ls2-lr router > +check ovn-nbctl lsp-set-options ls2-lr router-port=lr-ls2 > +check ovn-nbctl lsp-add ls2 vm2 > +check ovn-nbctl lsp-set-addresses vm2 00:00:00:00:00:02 > + > +dnl LB using the router IP connected to vm2 as VIP. > +check ovn-nbctl lb-add lb-test 42.42.42.2:8080 41.41.41.1:8080 tcp \ > + -- lr-lb-add lr lb-test > + > +dnl SNAT everything coming from vm1 to the router IP (towards vm2). > +check ovn-nbctl lr-nat-add lr snat 42.42.42.2 41.41.41.1 > + > +ADD_NAMESPACES(vm1) > +ADD_VETH(vm1, vm1, br-int, "41.41.41.1/24", "00:00:00:00:00:01", "41.41.41.2") > + > +ADD_NAMESPACES(vm2) > +ADD_VETH(vm2, vm2, br-int, "42.42.42.1/24", "00:00:00:00:00:02", "42.42.42.2") > + > +dnl Start a server on vm2. > +NETNS_DAEMONIZE([vm2], [nc -l -k 42.42.42.1 80], [vm2.pid]) > + > +dnl Wait for ovn-controller to catch up. > +wait_for_ports_up > +check ovn-nbctl --wait=hv sync > + > +dnl Test the connection originating something that uses the same source port > +dnl as the LB VIP. > +NS_CHECK_EXEC([vm1], [nc -z -p 8080 42.42.42.1 80], 0, [ignore], [ignore]) > + > +OVS_APP_EXIT_AND_WAIT([ovn-controller]) > + > +as ovn-sb > +OVS_APP_EXIT_AND_WAIT([ovsdb-server]) > + > +as ovn-nb > +OVS_APP_EXIT_AND_WAIT([ovsdb-server]) > + > +as northd > +OVS_APP_EXIT_AND_WAIT([ovn-northd]) > + > +as > +OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d > +/connection dropped.*/d"]) > +AT_CLEANUP > +]) > + > OVN_FOR_EACH_NORTHD([ > AT_SETUP([load balancing affinity sessions - auto clear learnt flows]) > AT_SKIP_IF([test $HAVE_NC = no]) > -- > 2.39.3 > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev >
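For anyone verifying the change locally, the flows this patch removes can be inspected directly; a minimal sketch, assuming a running deployment and a logical router named lr0 (the name is hypothetical):

    # Before this patch, each VIP got a priority-120 "next;" flow in
    # lr_in_unsnat; after it, only the priority-110/90 ct_snat flows remain.
    ovn-sbctl dump-flows lr0 | grep lr_in_unsnat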
On 3/5/24 15:56, Numan Siddique wrote: > On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote: >> >> Otherwise, in case there's also a SNAT rule that uses the VIP as >> external IP, we break sessions initiated from behind the VIP. >> >> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for >> load balancer vips in router ingress pipeline"). That's OK because >> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with >> DNAT/Load Balancers") addressed the original issue in a better way: >> >> In the reply direction, the order of traversal of the tables >> "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect >> datapath flows that check ct_state in the wrong conntrack zone. >> This is illustrated below where reply trafic enters the physical host >> port (6) and traverses DNAT zone (14), SNAT zone (default), back to the >> DNAT zone and then on to Logical Switch Port zone (22). The third >> flow is incorrectly checking the state from the SNAT zone instead >> of the DNAT zone. >> >> We also add a system test to ensure traffic initiated from behind a VIP >> + SNAT is not broken. >> >> Another nice side effect is that the northd I-P is slightly simplified >> because we don't need to track NAT external IPs anymore. >> >> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline") >> Reported-at: https://issues.redhat.com/browse/FDP-291 >> Signed-off-by: Dumitru Ceara <dceara@redhat.com> > > > Thanks for the fix. It also simplified the lr-nat-stateful code. > > Acked-by: Numan Siddique <numans@ovn.org> > Thanks, Numan! Applied to main and backported to all branches down to 22.03. Regards, Dumitru
On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote: > > On 3/5/24 15:56, Numan Siddique wrote: > > On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote: > >> > >> Otherwise, in case there's also a SNAT rule that uses the VIP as > >> external IP, we break sessions initiated from behind the VIP. > >> > >> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for > >> load balancer vips in router ingress pipeline"). That's OK because > >> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with > >> DNAT/Load Balancers") addressed the original issue in a better way: > >> > >> In the reply direction, the order of traversal of the tables > >> "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect > >> datapath flows that check ct_state in the wrong conntrack zone. > >> This is illustrated below where reply trafic enters the physical host > >> port (6) and traverses DNAT zone (14), SNAT zone (default), back to the > >> DNAT zone and then on to Logical Switch Port zone (22). The third > >> flow is incorrectly checking the state from the SNAT zone instead > >> of the DNAT zone. > >> > >> We also add a system test to ensure traffic initiated from behind a VIP > >> + SNAT is not broken. > >> > >> Another nice side effect is that the northd I-P is slightly simplified > >> because we don't need to track NAT external IPs anymore. > >> > >> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline") > >> Reported-at: https://issues.redhat.com/browse/FDP-291 > >> Signed-off-by: Dumitru Ceara <dceara@redhat.com> > > > > > > Thanks for the fix. It also simplified the lr-nat-stateful code. > > > > Acked-by: Numan Siddique <numans@ovn.org> > > > > Thanks, Numan! > > Applied to main and backported to all branches down to 22.03. > > Regards, > Dumitru > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev Hi Dumitru, Numan, Tim and folks, I noticed that the HW offload of k8s nodePort traffic is broken due to this change. The reason is that for client to nodePort (LB with VIP being the node IP) traffic, when the packet is going through the unSNAT stage in the SNAT CT zone, since the entry is never committed to the SNAT zone, it will have CT state returned as "new", which prevents HW offload from working for such packets. At the moment I have to revert this change in our downstream. For the problem that was fixed by this change [0], I think we can avoid it by separating the port range of SNAT and DNAT. For DNAT, the nodePort range in k8s is configured by the API-server option: --service-node-port-range <a string in the form 'N1-N2'> Default: 30000-32767 For SNAT, it can be configured in the OVN's NAT table's external_port_range column, and we can choose something like 10000-30000. An extra benefit of this is that it reduces a CT recirc. Alternative solutions are: Alternative1: Always commit to both SNAT and DNAT zones. This would introduce unnecessary cost of CT entries and extra CT recirc. Alternative2: Use a common zone for SNAT and DNAT. But there are other issues reported for using the common zone [1] Could you let me know if you have other thoughts on this? [0] https://issues.redhat.com/browse/FDP-291 [1] b8c40e7593 https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4 Thanks, Han
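To make the proposed port-range split concrete, here is a minimal sketch, assuming ovn-nbctl's --portrange support for lr-nat-add; the router name, addresses, and exact ranges are hypothetical:

    # k8s API server keeps its default DNAT (nodePort) range:
    #   kube-apiserver --service-node-port-range=30000-32767
    # Constrain SNAT source ports to a disjoint range:
    ovn-nbctl --portrange lr-nat-add GR_node1 snat 172.18.0.2 10.244.0.0/16 10000-29999
    # Or set the column directly on an existing NAT row:
    ovn-nbctl set NAT <nat-uuid> external_port_range=10000-29999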
On 8/29/24 18:14, Han Zhou wrote: > On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote: >> >> On 3/5/24 15:56, Numan Siddique wrote: >>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote: >>>> >>>> Otherwise, in case there's also a SNAT rule that uses the VIP as >>>> external IP, we break sessions initiated from behind the VIP. >>>> >>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for >>>> load balancer vips in router ingress pipeline"). That's OK because >>>> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with >>>> DNAT/Load Balancers") addressed the original issue in a better way: >>>> >>>> In the reply direction, the order of traversal of the tables >>>> "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect >>>> datapath flows that check ct_state in the wrong conntrack zone. >>>> This is illustrated below where reply trafic enters the physical host >>>> port (6) and traverses DNAT zone (14), SNAT zone (default), back to the >>>> DNAT zone and then on to Logical Switch Port zone (22). The third >>>> flow is incorrectly checking the state from the SNAT zone instead >>>> of the DNAT zone. >>>> >>>> We also add a system test to ensure traffic initiated from behind a VIP >>>> + SNAT is not broken. >>>> >>>> Another nice side effect is that the northd I-P is slightly simplified >>>> because we don't need to track NAT external IPs anymore. >>>> >>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline") >>>> Reported-at: https://issues.redhat.com/browse/FDP-291 >>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com> >>> >>> >>> Thanks for the fix. It also simplified the lr-nat-stateful code. >>> >>> Acked-by: Numan Siddique <numans@ovn.org> >>> >> >> Thanks, Numan! >> >> Applied to main and backported to all branches down to 22.03. >> >> Regards, >> Dumitru >> >> _______________________________________________ >> dev mailing list >> dev@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > Hi Dumitru, Numan, Tim and folks, > Hi Han, > I noticed that the HW offload of k8s nodePort traffic is broken due to > this change. The reason is that for client to nodePort (LB with VIP > being the node IP) traffic, when the packet is going through the > unSNAT stage in the SNAT CT zone, since the entry is never committed > to the SNAT zone, it will have CT state returned as "new", which > prevents the HW offload to work for such packets. > Sorry about that, I forgot we don't always commit in the SNAT zone. > At the moment I have to revert this change in our downstream. For the > problem that was fixed by this change [0], I think we can avoid it by > separating the port range of SNAT and DNAT. For DNAT, the nodePort > range in k8s is configured by API-server option: > > --service-node-port-range <a string in the form 'N1-N2'> Default: 30000-32767 > > For SNAT, it can be configured in the OVN's NAT table's > external_port_range column, and we can choose something like > 10000-30000. > Tim, does this look OK to you? If it's acceptable to limit the SNAT port range this workaround should be fine. > An extra benefit of this is that it reduces a CT recirc. > > Alternatively solutions are: > Alternative1: Always commit to both SNAT and DNAT zones. This would > introduce unnecessary cost of CT entries and extra CT recirc. Extra recirculation aside, I would actually love it if we could use this alternative. I think it's the "most correct" option. 
I think it would allow us to avoid other workarounds like: https://github.com/ovn-org/ovn/commit/40136a2f2c8 or https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/ I do understand the worry about the extra recirculation though. In the HWOL context does that cause visible performance impact? We'd probably have to do more performance testing without HWOL to figure out the impact in the software datapath. > Alternative2: Use a common zone for SNAT and DNAT. But there are other > issues reported for using the common zone [1] > > Could you let me know if other thoughts on this? > On a related note, I know it has been discussed in different settings but I don't think this ever moved forward: would it be possible for NVIDIA to help out with automatically testing HWOL impact for incoming patches? Maybe we could add some "simple" system-like tests that ensure that traffic is correctly offloaded in common scenarios? Alternatively, I guess we could also tag a subset of the existing system tests and just run those on actual hardware? It's quite simple (AFAIU) for external CIs to report status on each OVN patch posted on patchwork. That would at least allow us to flag this kind of breakage early (even before they get merged). What do you think? > [0] https://issues.redhat.com/browse/FDP-291 > [1] b8c40e7593 https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4 > > Thanks, > Han > Thanks, Dumitru
Hi Han, My understanding is that the unSNAT stage was originally being skipped for LB ports, then Dumitru changed it to attempt unSNAT, and this breaks HWOL for ingress connections that never need to get unSNAT'ed and are therefore never committed. In the OVNK use case for pods simply connecting egress, almost all packets will be SNAT'ed to the node IP. Therefore on egress reply, we always enter the unSNAT stage for these packets. So skipping the unSNAT stage seems to be specific to LB related traffic. While I agree splitting the port range would work, I think it makes more sense to just always commit in the unSNAT stage for all traffic. I get there is a performance hit there, but the datapath pipeline seems more consistent and I think that outweighs the cost of committing LB traffic. Thanks, Tim Rozet Red Hat OpenShift Networking Team On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dceara@redhat.com> wrote: > On 8/29/24 18:14, Han Zhou wrote: > > On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote: > >> > >> On 3/5/24 15:56, Numan Siddique wrote: > >>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> > wrote: > >>>> > >>>> Otherwise, in case there's also a SNAT rule that uses the VIP as > >>>> external IP, we break sessions initiated from behind the VIP. > >>>> > >>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows > for > >>>> load balancer vips in router ingress pipeline"). That's OK because > >>>> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with > >>>> DNAT/Load Balancers") addressed the original issue in a better way: > >>>> > >>>> In the reply direction, the order of traversal of the tables > >>>> "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect > >>>> datapath flows that check ct_state in the wrong conntrack zone. > >>>> This is illustrated below where reply trafic enters the physical > host > >>>> port (6) and traverses DNAT zone (14), SNAT zone (default), back > to the > >>>> DNAT zone and then on to Logical Switch Port zone (22). The third > >>>> flow is incorrectly checking the state from the SNAT zone instead > >>>> of the DNAT zone. > >>>> > >>>> We also add a system test to ensure traffic initiated from behind a > VIP > >>>> + SNAT is not broken. > >>>> > >>>> Another nice side effect is that the northd I-P is slightly simplified > >>>> because we don't need to track NAT external IPs anymore. > >>>> > >>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer > vips in router ingress pipeline") > >>>> Reported-at: https://issues.redhat.com/browse/FDP-291 > >>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com> > >>> > >>> > >>> Thanks for the fix. It also simplified the lr-nat-stateful code. > >>> > >>> Acked-by: Numan Siddique <numans@ovn.org> > >>> > >> > >> Thanks, Numan! > >> > >> Applied to main and backported to all branches down to 22.03. > >> > >> Regards, > >> Dumitru > >> > >> _______________________________________________ > >> dev mailing list > >> dev@openvswitch.org > >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > > > Hi Dumitru, Numan, Tim and folks, > > > > Hi Han, > > > I noticed that the HW offload of k8s nodePort traffic is broken due to > > this change. 
The reason is that for client to nodePort (LB with VIP > > being the node IP) traffic, when the packet is going through the > > unSNAT stage in the SNAT CT zone, since the entry is never committed > > to the SNAT zone, it will have CT state returned as "new", which > > prevents the HW offload to work for such packets. > > > > Sorry about that, I forgot we don't always commit in the SNAT zone. > > > At the moment I have to revert this change in our downstream. For the > > problem that was fixed by this change [0], I think we can avoid it by > > separating the port range of SNAT and DNAT. For DNAT, the nodePort > > range in k8s is configured by API-server option: > > > > --service-node-port-range <a string in the form 'N1-N2'> Default: > 30000-32767 > > > > For SNAT, it can be configured in the OVN's NAT table's > > external_port_range column, and we can choose something like > > 10000-30000. > > > > Tim, does this look OK to you? If it's acceptable to limit the SNAT > port range this workaround should be fine. > > > An extra benefit of this is that it reduces a CT recirc. > > > > Alternatively solutions are: > > Alternative1: Always commit to both SNAT and DNAT zones. This would > > introduce unnecessary cost of CT entries and extra CT recirc. > > Extra recirculation aside, I would actually love it if we could use this > alternative. I think it's the "most correct" option. I think it would > allow us to avoid other workarounds like: > > https://github.com/ovn-org/ovn/commit/40136a2f2c8 > > or > > > https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/ > > I do understand the worry about the extra recirculation though. In the > HWOL context does that cause visible performance impact? > > We'd probably have to do more performance testing without HWOL to figure > out the impact in the software datapath. > > > Alternative2: Use a common zone for SNAT and DNAT. But there are other > > issues reported for using the common zone [1] > > > > Could you let me know if other thoughts on this? > > > > On a related note, I know it has been discussed in different settings > but I don't think this ever moved forward: would it be possible for > NVIDIA to help out with automatically testing HWOL impact for incoming > patches? > > Maybe we could some "simple" system-like tests that ensure that traffic > is correctly offloaded in common scenarios? Alternatively, I guess we > could also tag a subset of the existing system tests and just run those > on actual hardware? > > It's quite simple (AFAIU) for external CIs to report status on each OVN > patch posted on patchwork. That would at least allow us to flag this > kind of breakages early (even before they get merged). > > What do you think? > > > [0] https://issues.redhat.com/browse/FDP-291 > > [1] b8c40e7593 > https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4 > > > > Thanks, > > Han > > > > Thanks, > Dumitru > >
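Tim's suggestion eventually took shape as a router-level knob (commit 800fd0681579, "northd: Add LR option to commit all traffic.", discussed in the next message). A minimal sketch of enabling it, assuming the option key is ct-commit-all as the commit title suggests (check the commit itself for the exact name):

    ovn-nbctl set logical_router lr options:ct-commit-all=true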
On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <trozet@redhat.com> wrote: > > Hi Han, > My understanding is that the unSNAT stage was originally being skipped for LB ports, then Dumitru changed it to attempt unSNAT, and this breaks HWOL for ingress connections that never need to get unSNAT'ed and therefore not committed. In the OVNK use case for pods simply connecting egress, almost all packets will be SNAT'ed to the node IP. Therefore on egress reply, we always enter the unSNAT stage for these packets. So this skipping unSNAT stage seems to be specific to LB related traffic. While I agree splitting the port range would work, I think it makes more sense to just always commit in the unSNAT stage for all traffic. I get there is a performance hit there, but the datapath pipeline seems more consistent and I think outweighs the cost of committing LB traffic. Hi Tim and folks, + @Ales Musil <amusil@redhat.com> We've spent a lot of effort on the alternative solution: Always commit to both SNAT and DNAT zones. Unfortunately, the HW offload issue discussed in this thread is still unresolved. I'd like to discuss the current situation and ask for opinions. A while ago, Ales tried to fix it with this commit: 800fd0681579 ("northd: Add LR option to commit all traffic."). When the ct-commit-all option is enabled, it commits the connection to both SNAT and DNAT zones. However, in the ovn-k8s nodeport scenario (a very typical use case) the HW offload is still broken. In that scenario, a packet may come from the external network with the dest IP being a node's IP, which is also used as an OVN gateway router IP. The dest IP is first DNATed to an LB backend IP, and then the src IP is SNATed to another LRP's IP on this router. The src/dst IPs in different stages are: Original: EXT IP <-> GR IP1 After DNAT: EXT IP <-> BE IP After SNAT: GR IP2 <-> BE IP With the ct-commit-all option, it commits after DNAT and after SNAT: DNAT zone: EXT IP <-> GR IP1/BE IP SNAT zone: EXT IP/GR IP2 <-> BE IP So, at the UNSNAT stage, the packet traverses the SNAT zone with the original header, which will still always be "new" state, which breaks HW offload. I tried to fix it by also committing to the SNAT zone before DNAT (at the POST_SNAT stage). Now there are 3 entries committed: SNAT zone: EXT IP <-> GR IP1 DNAT zone: EXT IP <-> GR IP1/BE IP SNAT zone: EXT IP/GR IP2 <-> BE IP HW offload worked, but it breaks at least the scenario of "DNAT and SNAT on distributed router - E/W", which tests communication between two workloads using "floating IPs". In this scenario, both workloads are under different LSes behind the same router, and both have their own floating IPs (dnat_and_snat) on the public side of the LR. When one of them sends packets to the other using the floating IP, what's expected is: Original: overlay IP1 <-> floating IP2 After SNAT: floating IP1 <-> floating IP2 After DNAT: floating IP1 <-> overlay IP2 With my fix, the packet is first SNATed, with a CT entry in the SNAT zone: SNAT: overlay IP1/floating IP1 <-> floating IP2 This happens in the egress pipeline of the LR, and then it enters the ingress pipeline again because this is like a hairpin (the dest IP is a dnat_and_snat IP on the same router). 
Because the connection floating IP1 <-> floating IP2 is conflicting with the above committed entry "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT action fails and the packet is dropped. There is still a way to solve this, probably by skipping the unnecessary UNSNAT traversal and the conflicting commit if the source IP is one of the snat/dnat_and_snat IPs of this same router. However, as we can see, the NAT-related pipelines are already very complex today, and with the above changes they would become even more complex. I wonder if this is maintainable in the long run. In addition, I am not even sure if this is the last problem of my patch. It is not easy to tell because today the ct-commit-all feature is an option and disabled by default. So all our test cases by default don't test for this option except the 4 test cases added by Ales's patch. I had to manually set this option to true as the default, and of course many test cases will fail. Most of them fail because the expected flows won't match, but I can't tell for sure unless I update every case with new flows and then retest them. Before doing that I just examined the ones I think might be impacted by the patch and the "DNAT and SNAT on distributed router - E/W" is the first broken one I found. So there can be other issues not discovered yet. Moreover, we should also keep in mind the performance penalty of these extra CT commits & traversals. I'd like to hear your opinions before moving forward. Is this worth the effort, or shall we just keep the pipeline simple (relatively) and avoid the original problem (https://issues.redhat.com/browse/FDP-291) by separating the port range of SNAT and DNAT, the first proposal at the beginning of this thread? Thanks, Han > > Thanks, > Tim Rozet > Red Hat OpenShift Networking Team > > > On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dceara@redhat.com> wrote: >> >> On 8/29/24 18:14, Han Zhou wrote: >> > On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote: >> >> >> >> On 3/5/24 15:56, Numan Siddique wrote: >> >>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote: >> >>>>>> >>>> Otherwise, in case there's also a SNAT rule that uses the VIP as >> >>>> external IP, we break sessions initiated from behind the VIP. >> >>>> >> >>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for >> >>>> load balancer vips in router ingress pipeline"). That's OK because >> >>>> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with >> >>>> DNAT/Load Balancers") addressed the original issue in a better way: >> >>>> >> >>>> In the reply direction, the order of traversal of the tables >> >>>> "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect >> >>>> datapath flows that check ct_state in the wrong conntrack zone. >> >>>> This is illustrated below where reply trafic enters the physical host >> >>>> port (6) and traverses DNAT zone (14), SNAT zone (default), back to the >> >>>> DNAT zone and then on to Logical Switch Port zone (22). The third >> >>>> flow is incorrectly checking the state from the SNAT zone instead >> >>>> of the DNAT zone. >> >>>> >> >>>> We also add a system test to ensure traffic initiated from behind a VIP >> >>>> + SNAT is not broken. >> >>>> >> >>>> Another nice side effect is that the northd I-P is slightly simplified >> >>>> because we don't need to track NAT external IPs anymore. 
>> >>>> >> >>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline") >> >>>> Reported-at: https://issues.redhat.com/browse/FDP-291 >> >>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com> >> >>> >> >>> >> >>> Thanks for the fix. It also simplified the lr-nat-stateful code. >> >>> >> >>> Acked-by: Numan Siddique <numans@ovn.org> >> >>> >> >> >> >> Thanks, Numan! >> >> >> >> Applied to main and backported to all branches down to 22.03. >> >> >> >> Regards, >> >> Dumitru >> >> >> >> _______________________________________________ >> >> dev mailing list >> >> dev@openvswitch.org >> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> > >> > Hi Dumitru, Numan, Tim and folks, >> > >> >> Hi Han, >> >> > I noticed that the HW offload of k8s nodePort traffic is broken due to >> > this change. The reason is that for client to nodePort (LB with VIP >> > being the node IP) traffic, when the packet is going through the >> > unSNAT stage in the SNAT CT zone, since the entry is never committed >> > to the SNAT zone, it will have CT state returned as "new", which >> > prevents the HW offload to work for such packets. >> > >> >> Sorry about that, I forgot we don't always commit in the SNAT zone. >> >> > At the moment I have to revert this change in our downstream. For the >> > problem that was fixed by this change [0], I think we can avoid it by >> > separating the port range of SNAT and DNAT. For DNAT, the nodePort >> > range in k8s is configured by API-server option: >> > >> > --service-node-port-range <a string in the form 'N1-N2'> Default: 30000-32767 >> > >> > For SNAT, it can be configured in the OVN's NAT table's >> > external_port_range column, and we can choose something like >> > 10000-30000. >> > >> >> Tim, does this look OK to you? If it's acceptable to limit the SNAT >> port range this workaround should be fine. >> >> > An extra benefit of this is that it reduces a CT recirc. >> > >> > Alternatively solutions are: >> > Alternative1: Always commit to both SNAT and DNAT zones. This would >> > introduce unnecessary cost of CT entries and extra CT recirc. >> >> Extra recirculation aside, I would actually love it if we could use this >> alternative. I think it's the "most correct" option. I think it would >> allow us to avoid other workarounds like: >> >> https://github.com/ovn-org/ovn/commit/40136a2f2c8 >> >> or >> >> https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/ >> >> I do understand the worry about the extra recirculation though. In the >> HWOL context does that cause visible performance impact? >> >> We'd probably have to do more performance testing without HWOL to figure >> out the impact in the software datapath. >> >> > Alternative2: Use a common zone for SNAT and DNAT. But there are other >> > issues reported for using the common zone [1] >> > >> > Could you let me know if other thoughts on this? >> > >> >> On a related note, I know it has been discussed in different settings >> but I don't think this ever moved forward: would it be possible for >> NVIDIA to help out with automatically testing HWOL impact for incoming >> patches? >> >> Maybe we could some "simple" system-like tests that ensure that traffic >> is correctly offloaded in common scenarios? Alternatively, I guess we >> could also tag a subset of the existing system tests and just run those >> on actual hardware? >> >> It's quite simple (AFAIU) for external CIs to report status on each OVN >> patch posted on patchwork. 
That would at least allow us to flag this >> kind of breakages early (even before they get merged). >> >> What do you think? >> >> > [0] https://issues.redhat.com/browse/FDP-291 >> > [1] b8c40e7593 https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4 >> > >> > Thanks, >> > Han >> > >> >> Thanks, >> Dumitru
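For concreteness, the guard Han describes (skipping the unSNAT traversal and the conflicting commit for hairpinned packets sourced from the router's own NAT IPs) might look roughly like the following hypothetical logical flow; the priority and the address list are illustrative only, not an actual implementation:

    # Hypothetical: bypass unSNAT (and the conflicting POST_UNSNAT commit)
    # when the source is one of this router's snat/dnat_and_snat IPs.
    table=lr_in_unsnat, priority=125, match=(ip4.src == {FLOATING_IP1, FLOATING_IP2}), action=(next;)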
On 5/15/25 6:24 PM, Han Zhou wrote: > On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <trozet@redhat.com> wrote: >> >> Hi Han, >> My understanding is that the unSNAT stage was originally being skipped > for LB ports, then Dumitru changed it to attempt unSNAT, and this breaks > HWOL for ingress connections that never need to get unSNAT'ed and therefore > not committed. In the OVNK use case for pods simply connecting egress, > almost all packets will be SNAT'ed to the node IP. Therefore on egress > reply, we always enter the unSNAT stage for these packets. So this skipping > unSNAT stage seems to be specific to LB related traffic. While I agree > splitting the port range would work, I think it makes more sense to just > always commit in the unSNAT stage for all traffic. I get there is a > performance hit there, but the datapath pipeline seems more consistent and > I think outweighs the cost of committing LB traffic. > > Hi Tim and folks, > Hi Han, > + @Ales Musil <amusil@redhat.com> > > We've spent a lot of effort on the alternative solution: Always commit to > both SNAT and DNAT zones. Unfortunately, the HW offload issue discussed in > this thread is still unresolved. I'd like to discuss the current situation > and ask for opinions. > Thanks again for working on this! > A while ago, Ales tried to fix it with this commit: 800fd0681579 ("northd: > Add LR option to commit all traffic."). When the ct-commit-all option is > enabled, it commits the connection to both SNAT and DNAT zones. However, in > the ovn-k8s nodeport scenario (a very typical use case) the HW offload is > still broken. In that scenario, a packet may come from external network > with dest IP being a node's IP, which is also used as an OVN gateway router > IP. The dest IP is firstly DNATed to a LB backend IP, and then the src IP > is SNATed to another LRP's IP on this router. The src/dst IPs in different > stages are: > > Original: EXT IP <-> GR IP1 > After DNAT: EXT IP <-> BE IP > After SNAT: GR IP2 <-> BE IP > > With the ct-commit-all option, it commits after DNAT and after SNAT: > DNAT zone: EXT IP <-> GR IP1/BE IP > SNAT zone: EXT IP/GR IP2 <-> BE IP > > So, at the UNSNAT stage, the packet traverses the SNAT zone with the > original header, which will still always be "new" state, which breaks HW > offload. > Ah, now I understand the problem you were facing. > I tried to fix it by also committing to the SNAT zone before DNAT (at the > POST_SNAT stage). Now there are 3 entries committed: > SNAT zone: EXT IP <-> GR IP1 > DNAT zone: EXT IP <-> GR IP1/BE IP > SNAT zone: EXT IP/GR IP2 <-> BE IP > > HW offload worked, but it breaks at least the scenario of "DNAT and SNAT on > distributed router - E/W", which tests communication between two workloads > using "floating IPs". In this scenario, both workloads are under different > LSes behind the same router, and both have their own floating IPs > (dnat_and_snat) on the public side of the LR. When one of them sends > packets to the other using the floating IP, what's expected is: > > Original: overlay IP1 <-> floating IP2 > After SNAT: floating IP1 <-> floating IP2 > After DNAT: floating IP1 <-> overlay IP2 > > With my fix, the packet is firstly SNATed with CT entry in SNAT zone: > SNAT: overlay IP1/floating IP1 <-> floating IP2 > > This happens in the egress pipeline of the LR, and then it enters into the > ingress pipeline again because this is like a hairpin (the dest IP is a > dnat_and_snat IP on the same router). 
It will hit the UNSNAT stage first, > traverse the CT zone, and for the first packet the reply is not seen yet, > so the CT state is still new, so in the POST_UNSNAT we will try to commit > it (because of the issue I am trying to fix above). Because the connection > floating IP1 <-> floating IP2 is conflicting with the above committed entry > "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT action fails and > the packet is dropped. > > There is still a way to solve this, probably by skipping the unnecessary > UNSNAT traversal and the conflict commit if the source IP is one of the > snat/dnat_and_snat IPs of this same router. > > However, as we see, the NAT related pipelines have been very complex today, > and with the above changes it would become even much more complex. I wonder > if this is maintainable in the long run. > > In addition, I am not even sure if this is the last problem of my patch. It > is not easy to tell because today the ct-commit-all feature is an option > and by default disabled. So all our test cases by default don't test for > this option except the 4 test cases added by Ales's patch. I had to > manually set this option to true as default and of course many test cases > will fail. Most of them fail because of the expected flows won't match, but > I can't tell for sure unless updating every case with new flows and then > retest them. Before doing that I just examined ones I think might be > impacted by the patch and the "DNAT and SNAT on distributed router - E/W" > is the first broken one I found. So there can be other issues not > discovered yet. > > Moreover, we should also keep in mind the performance penalty of these > extra CT commit & traversal. > But basically this happens because we use the same SNAT zone for all router ports. In an ideal world, wouldn't using a unique CT SNAT and CT DNAT zone per router port fix it? E.g.: host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE: 12) [LR] LRP2 (SNAT ZONE: 21, DNAT ZONE: 22) ---- host2 That'd also imply that we commit that session (from host1 to host2 - potentially via LB and SNAT) to 4 different zones (SNAT and DNAT for LRP1 and SNAT and DNAT for LRP2). And we'd also have to do 4 different CT lookups for all packets traversing the router. I didn't try any of this but I have the impression that this would solve the HWOL issue: all subsequent packets are part of known and committed sessions in all 4 conntrack zones. So what if we change the implementation of the "commit-all" feature to add the missing ct commits? Wouldn't that work? It feels like (lookup performance hit aside) this is the correct way to forward traffic, i.e., never traverse a CT zone without having a committed entry in that zone. > I'd like to listen to your opinion before moving forward. Is this worth the > effort, or shall we just keep the pipeline simple (relatively) and avoid > the original problem (https://issues.redhat.com/browse/FDP-291) by > separating the port range of SNAT and DNAT, the first proposal at the > beginning of this thread? > Now that I understand the problem you're describing better, it might be worth reverting the fix for FDP-291 like you said until we find a proper generic solution. However, we can't afford to do that without separating the port range of SNAT and DNAT like you said [0] because we'd be regressing (at least from an OpenShift perspective). Because selecting the nodeport (DNAT) range is done through a k8s API server cmdline argument, users can in theory choose whatever range they want. 
I guess we'd need a command line argument in ovn-kubernetes to allow
users (operators, etc.) to inform ovn-kubernetes what SNAT port range to
use so that it doesn't conflict with the KAPI nodeport range.

Tim, you're the expert on this side, what are your thoughts on the
matter?

Regards,
Dumitru

[0] https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html

> Thanks,
> Han
>
> [...]
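A note on the zone idea above: OVN's NB schema already has a
router-level options:snat-ct-zone knob (see ovn-nb(5), Logical_Router
options) that pins the SNAT conntrack zone for a whole logical router;
the per-LRP SNAT/DNAT zones in the diagram would be a hypothetical
extension of it, roughly:

    # Existing router-wide knob (Logical_Router options, ovn-nb(5)):
    ovn-nbctl set Logical_Router lr0 options:snat-ct-zone=11

    # Hypothetical per-port variant sketched in the diagram above; NOT
    # implemented in OVN today, shown only to illustrate the proposal:
    ovn-nbctl set Logical_Router_Port lrp1 options:snat-ct-zone=11 \
        options:dnat-ct-zone=12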
On 5/16/25 1:22 PM, Dumitru Ceara wrote:
> On 5/15/25 6:24 PM, Han Zhou wrote:
>> [...]
>
> Now that I understand the problem you're describing better, it might
> be worth reverting the fix for FDP-291 like you said, until we find a
> proper generic solution.
>
> However, we can't afford to do that without separating the port ranges
> of SNAT and DNAT like you said [0], because we'd be regressing (at
> least from OpenShift's perspective).

For reference, Tim, this is the OpenShift bug that needed the fix for
FDP-291:

https://issues.redhat.com/browse/OCPBUGS-25889

> Because the nodeport (DNAT) range is selected through a k8s API server
> cmdline argument, users can in theory choose whatever range they want.
> I guess we'd need a command line argument in ovn-kubernetes to allow
> users (operators, etc.) to inform ovn-kubernetes what SNAT port range
> to use so that it doesn't conflict with the KAPI nodeport range.
>
> Tim, you're the expert on this side, what are your thoughts on the
> matter?
>
> Regards,
> Dumitru
>
> [0] https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html
>
> [...]
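To make the range conflict concrete, a sketch of the two knobs being
discussed (the nodeport value is the k8s default quoted in this thread;
the NAT row UUID and the SNAT range value are placeholders):

    # DNAT side: the k8s nodeport range, selected on the API server:
    kube-apiserver --service-node-port-range=30000-32767 ...

    # SNAT side: per-NAT-row source port range in OVN's NB database,
    # chosen here (hypothetically) so it cannot overlap the nodeports:
    ovn-nbctl set NAT <nat-uuid> external_port_range="10000-29999"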
Hi All,

Sorry, Han, that you are still facing issues with this problem. It
seems like the best option would be to configure the range of ephemeral
ports for OVN to use when doing an SNAT. Note, in OpenShift the
nodeport service range is configured as 30000-32767, and the ephemeral
port range in the kernel is set to:

net.ipv4.ip_local_port_range = 32768 60999

I think it would make sense to revert and add an option to make this
range configurable in OVN. Then OVN-Kubernetes should pass along this
config to OVN.

As Dumitru mentioned, we can't just revert the previous patch without a
substitute fix in place, otherwise we would risk regressing.

Thanks,

Tim Rozet
Red Hat OpenShift Networking Team


On Fri, May 16, 2025 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote:
> On 5/16/25 1:22 PM, Dumitru Ceara wrote:
> [...]
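For reference, the check below shows how those two ranges line up on a
node (OpenShift defaults as quoted above; other deployments may differ):

    $ sysctl net.ipv4.ip_local_port_range
    net.ipv4.ip_local_port_range = 32768    60999
    # Nodeport services use 30000-32767, immediately below the kernel's
    # ephemeral range, so the two cannot collide on the host side.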
On Mon, May 19, 2025 at 8:41 AM Tim Rozet <trozet@redhat.com> wrote:
>
> Hi All,
> Sorry, Han, that you are still facing issues with this problem. It
> seems like the best option would be to configure the range of
> ephemeral ports for OVN to use when doing an SNAT. Note, in OpenShift
> the nodeport service range is configured as 30000-32767, and the
> ephemeral port range in the kernel is set to:
> net.ipv4.ip_local_port_range = 32768 60999
>
> I think it would make sense to revert and add an option to make this
> range configurable in OVN. Then OVN-Kubernetes should pass along this
> config to OVN.
>
> As Dumitru mentioned, we can't just revert the previous patch without
> a substitute fix in place, otherwise we would risk regressing.

Thanks, Tim! For the substitute fix, I assume you meant a fix in
ovn-k8s, right?

OVN already provides a configurable SNAT port range through the
external_port_range column in the NAT table. See:
https://man7.org/linux/man-pages/man5/ovn-nb.5.html

    external_port_range: string
        L4 source port range

        Range of ports, from which a port number will be picked that
        will replace the source port of the to-be-NATed packet. This is
        basically PAT (port address translation).

        Value of the column is in the format port_lo-port_hi. For
        example: external_port_range : "1-30000"

        Valid range of ports is 1-65535.

Does this satisfy what you asked for?

If this is agreed, I think we can move forward with the steps below:
1. Update ovn-kubernetes to set the external_port_range, if specific
   ranges are configured. (Only needed when the k8s-configured nodeport
   range conflicts with the kernel ephemeral port range, and it seems
   unnecessary for OpenShift?)
2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
   towards VIPs" in OVN. Also revert some of the changes for
   ct-commit-all: e.g. no need to commit DNATted connections to the
   SNAT zone, and vice versa.

What do you think?

Thanks,
Han

> [...]
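For illustration, a minimal sketch of what step 1 could boil down to on
the ovn-kubernetes side (router name, IPs, and the chosen range are
hypothetical; external_port_range is the real NB column documented
above):

    # Create the SNAT for the pod subnet on a gateway router ...
    ovn-nbctl lr-nat-add GR_node1 snat 172.16.0.10 10.244.0.0/16
    # ... then constrain its source ports to stay below the nodeport
    # range (30000-32767).
    uuid=$(ovn-nbctl --bare --columns=_uuid find NAT \
          external_ip=172.16.0.10)
    ovn-nbctl set NAT "$uuid" external_port_range="1024-29999"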
On Tue, May 20, 2025 at 1:37 AM Han Zhou <zhouhan@gmail.com> wrote:
>
> Thanks, Tim! For the substitute fix, I assume you meant a fix in
> ovn-k8s, right?
>
> OVN already provides a configurable SNAT port range through the
> external_port_range column in the NAT table. See:
> https://man7.org/linux/man-pages/man5/ovn-nb.5.html
>
> [...]
>
> Does this satisfy what you asked for?

Oh, I didn't realize this is already supported.

> If this is agreed, I think we can move forward with the steps below:
> 1. Update ovn-kubernetes to set the external_port_range, if specific
>    ranges are configured. (Only needed when the k8s-configured
>    nodeport range conflicts with the kernel ephemeral port range, and
>    it seems unnecessary for OpenShift?)

Yeah, we should provide a config knob to set the range though, in case
ovnk users do not use the default range. It looks like we already have
a test case to cover this scenario as well:
https://github.com/openshift/ovn-kubernetes/pull/2120/commits/34b1a4be8cf9d4bf61d65464015c1b3cdb30adfd

I'll post a PR to do this part.

> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
>    towards VIPs" in OVN. Also revert some of the changes for
>    ct-commit-all: e.g. no need to commit DNATted connections to the
>    SNAT zone, and vice versa.

Sure, after 1 gets in, you can try to post a PR reverting the change
and let's see if it passes the above test.

> What do you think?
>
> Thanks,
> Han
>
> [...]
> >> >>>>> > >> >>>> > >> >>>> On a related note, I know it has been discussed in different > settings > >> >>>> but I don't think this ever moved forward: would it be possible for > >> >>>> NVIDIA to help out with automatically testing HWOL impact for > incoming > >> >>>> patches? > >> >>>> > >> >>>> Maybe we could some "simple" system-like tests that ensure that > traffic > >> >>>> is correctly offloaded in common scenarios? Alternatively, I > guess we > >> >>>> could also tag a subset of the existing system tests and just run > those > >> >>>> on actual hardware? > >> >>>> > >> >>>> It's quite simple (AFAIU) for external CIs to report status on > each OVN > >> >>>> patch posted on patchwork. That would at least allow us to flag > this > >> >>>> kind of breakages early (even before they get merged). > >> >>>> > >> >>>> What do you think? > >> >>>> > >> >>>>> [0] https://issues.redhat.com/browse/FDP-291 > >> >>>>> [1] b8c40e7593 > >> >> > https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4 > >> >>>>> > >> >>>>> Thanks, > >> >>>>> Han > >> >>>>> > >> >>>> > >> >>>> Thanks, > >> >>>> Dumitru > >> >> _______________________________________________ > >> >> dev mailing list > >> >> dev@openvswitch.org > >> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >
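To make the nodeport walkthrough quoted above concrete, a minimal OVN setup for that scenario might look like the sketch below; the router name, load balancer name, and all addresses are invented for illustration and do not come from the thread:

    # Gateway router "GR-node1": a load balancer whose VIP is the node IP
    # ("GR IP1" in the walkthrough) plus an SNAT rule whose external IP is
    # a second router IP ("GR IP2").
    ovn-nbctl lb-add lb-nodeport 172.18.0.10:30080 10.244.1.5:8080 tcp
    ovn-nbctl lr-lb-add GR-node1 lb-nodeport
    ovn-nbctl lr-nat-add GR-node1 snat 172.18.0.11 10.244.0.0/16

With such a configuration, a packet from an external client to 172.18.0.10:30080 goes through exactly the Original / After DNAT / After SNAT progression described in the quoted message.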
On Tue, May 20, 2025 at 7:37 AM Han Zhou <zhouhan@gmail.com> wrote:
>
> On Mon, May 19, 2025 at 8:41 AM Tim Rozet <trozet@redhat.com> wrote:
> >
> > Hi All,
> > Sorry Han that you are still facing issues with this problem. It seems
> > like the best option would be to configure the range of ephemeral ports
> > for OVN to use when doing an SNAT. Note, in OpenShift the nodeport
> > service range is configured as 30000-32767, and the ephemeral port range
> > in the kernel is set to:
> > net.ipv4.ip_local_port_range = 32768 60999
> >
> > I think it would make sense to revert and add an option to make this
> > range configurable in OVN. Then OVN-Kubernetes should pass along this
> > config to OVN.
> >
> > As Dumitru mentioned, we can't just revert the previous patch without a
> > substitute fix in place, otherwise we would risk regressing.
>
> Thanks Tim! For the substitute fix, I assume you meant a fix in ovn-k8s,
> right?
>
> OVN already provides a configurable SNAT port range through the
> external_port_range column in the NAT table. See:
> https://man7.org/linux/man-pages/man5/ovn-nb.5.html
>
>     external_port_range: string
>         L4 source port range
>
>         Range of ports, from which a port number will be picked
>         that will replace the source port of to be NATed packet.
>         This is basically PAT (port address translation).
>
>         Value of the column is in the format port_lo-port_hi. For
>         example: external_port_range : "1-30000"
>
>         Valid range of ports is 1-65535.
>
> Does this satisfy the need you asked about?
>
> If this is agreed, I think we can move forward with the steps below:
> 1. Update ovn-kubernetes to set the external_port_range, if specific
> ranges are configured. (This is only needed when the k8s-configured
> nodeport range conflicts with the kernel ephemeral port range, and it
> seems unnecessary for OpenShift?)
> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
> towards VIPs" in OVN. Also revert some of the changes for ct-commit-all:
> e.g. no need to commit DNATted connections to the SNAT zone, and vice
> versa.

Hi Han,

I don't think we should revert/partially revert ct-commit-all; the whole
point of that knob is what you have described: commit to the SNAT zone if
there is any SNAT configured and commit to the DNAT zone if there is any
DNAT configured. If you still feel like there are some changes that we
could make to this approach we can discuss it. There is one system test
called "Commit all - UDN" which needs to pass unchanged if there is any
revert/change to ct-commit-all.

Thanks,
Ales

> What do you think?
>
> Thanks,
> Han
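A minimal sketch of step 1 above, using the external_port_range column quoted from the man page; the router name, addresses, and the chosen range are hypothetical:

    # Create the SNAT rule, then restrict its L4 source port range so it
    # cannot collide with the default k8s nodeport range (30000-32767).
    ovn-nbctl lr-nat-add GR-node1 snat 172.18.0.10 10.244.0.0/16
    nat=$(ovn-nbctl --bare --columns=_uuid find NAT type=snat external_ip=172.18.0.10)
    ovn-nbctl set NAT "$nat" external_port_range="10000-29999"

In an ovn-kubernetes deployment the last two commands would be driven by the CNI rather than run by hand, per step 1.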
Hi Han,
I tagged you on https://github.com/ovn-kubernetes/ovn-kubernetes/pull/5265.
Please review it when you have time.

Thanks,
Tim Rozet
Red Hat OpenShift Networking Team

On Tue, May 20, 2025 at 9:41 AM Ales Musil <amusil@redhat.com> wrote:
>
> I don't think we should revert/partially revert ct-commit-all; the whole
> point of that knob is what you have described: commit to the SNAT zone if
> there is any SNAT configured and commit to the DNAT zone if there is any
> DNAT configured. If you still feel like there are some changes that we
> could make to this approach we can discuss it. There is one system test
> called "Commit all - UDN" which needs to pass unchanged if there is any
> revert/change to ct-commit-all.
>
> Thanks,
> Ales
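For context, the knob Ales refers to is a per-logical-router option added by commit 800fd0681579 ("northd: Add LR option to commit all traffic."); a sketch of enabling it, assuming the option key is ct-commit-all as the thread names it, on a hypothetical router:

    # Commit connections to both the SNAT and the DNAT zone on this router.
    ovn-nbctl set Logical_Router GR-node1 options:ct-commit-all=true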
On Wed, May 28, 2025 at 4:59 AM Tim Rozet <trozet@redhat.com> wrote:
>
> Hi Han,
> I tagged you on https://github.com/ovn-kubernetes/ovn-kubernetes/pull/5265.
> Please review it when you have time.
>
> Thanks,
> Tim Rozet
> Red Hat OpenShift Networking Team
>

Thanks Tim. I posted a comment last week. Please take a look.

I also sent the OVN revert patch today:
https://mail.openvswitch.org/pipermail/ovs-dev/2025-June/423753.html

Hope we can resolve the ovn-k PR soon and then this one can get merged.

> On Tue, May 20, 2025 at 9:41 AM Ales Musil <amusil@redhat.com> wrote:
>>
>> Hi Han,
>>
>> I don't think we should revert/partially revert ct-commit-all, the whole
>> point of that knob is what you have described, commit to the SNAT zone
>> if there is any SNAT configured and commit to the DNAT zone if there is
>> any DNAT configured. If you still feel like there are some changes that
>> we could make to this approach we can discuss it. There is one system
>> test called "Commit all - UDN" which needs to pass unchanged if there is
>> any revert/change to ct-commit-all.
>>
>> Thanks,
>> Ales

Hi Ales, sorry for the confusion. I didn't mean to revert the major
ct-commit-all change that was required to fix the reported UDN issue,
which was supposed to commit all traffic to the DNAT zone if there is
stateful DNAT and all traffic to the SNAT zone if there is stateful SNAT.
In one of the revisions of your patch, I commented that to solve the HW
offload issue discussed in this thread, we shouldn't distinguish between
DNAT and SNAT and should always commit to both zones if there are stateful
NATs. Now that solving the HW offload issue is no longer a goal of your
ct-commit-all patch, we might bring back the separation between SNAT and
DNAT, so that in simpler scenarios (where only DNAT or only SNAT exists,
but not both) we could avoid the extra CT commit, which might help reduce
some of the performance penalty. Shall I leave this to you, if you think
this makes sense and want to follow up?

Thanks,
Han
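If the port-range separation discussed in this thread is adopted, the ranges involved can be inspected side by side; an illustrative check, reusing the hypothetical names from the sketches above:

    # Kernel ephemeral range (source ports of locally originated connections):
    sysctl net.ipv4.ip_local_port_range     # e.g. 32768 60999
    # The k8s nodeport (DNAT) range is set by the kube-apiserver flag
    # --service-node-port-range (default 30000-32767).
    # SNAT source port ranges currently configured in the OVN NB database:
    ovn-nbctl --bare --columns=external_port_range list NAT

The SNAT ranges reported by the last command should not overlap the nodeport range if SNAT and DNAT port allocations are to stay disjoint.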
The src/dst IPs in different >>> >> >> stages are: >>> >> >> >>> >> >> Original: EXT IP <-> GR IP1 >>> >> >> After DNAT: EXT IP <-> BE IP >>> >> >> After SNAT: GR IP2 <-> BE IP >>> >> >> >>> >> >> With the ct-commit-all option, it commits after DNAT and after SNAT: >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP >>> >> >> >>> >> >> So, at the UNSNAT stage, the packet traverses the SNAT zone with the >>> >> >> original header, which will still always be "new" state, which breaks HW >>> >> >> offload. >>> >> >> >>> >> > >>> >> > Ah, now I understand the problem you were facing. >>> >> > >>> >> >> I tried to fix it by also committing to the SNAT zone before DNAT (at the >>> >> >> POST_SNAT stage). Now there are 3 entries committed: >>> >> >> SNAT zone: EXT IP <-> GR IP1 >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP >>> >> >> >>> >> >> HW offload worked, but it breaks at least the scenario of "DNAT and SNAT on >>> >> >> distributed router - E/W", which tests communication between two workloads >>> >> >> using "floating IPs". In this scenario, both workloads are under different >>> >> >> LSes behind the same router, and both have their own floating IPs >>> >> >> (dnat_and_snat) on the public side of the LR. When one of them sends >>> >> >> packets to the other using the floating IP, what's expected is: >>> >> >> >>> >> >> Original: overlay IP1 <-> floating IP2 >>> >> >> After SNAT: floating IP1 <-> floating IP2 >>> >> >> After DNAT: floating IP1 <-> overlay IP2 >>> >> >> >>> >> >> With my fix, the packet is firstly SNATed with CT entry in SNAT zone: >>> >> >> SNAT: overlay IP1/floating IP1 <-> floating IP2 >>> >> >> >>> >> >> This happens in the egress pipeline of the LR, and then it enters into the >>> >> >> ingress pipeline again because this is like a hairpin (the dest IP is a >>> >> >> dnat_and_snat IP on the same router). It will hit the UNSNAT stage first, >>> >> >> traverse the CT zone, and for the first packet the reply is not seen yet, >>> >> >> so the CT state is still new, so in the POST_UNSNAT we will try to commit >>> >> >> it (because of the issue I am trying to fix above). Because the connection >>> >> >> floating IP1 <-> floating IP2 is conflicting with the above committed entry >>> >> >> "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT action fails and >>> >> >> the packet is dropped. >>> >> >> >>> >> >> There is still a way to solve this, probably by skipping the unnecessary >>> >> >> UNSNAT traversal and the conflict commit if the source IP is one of the >>> >> >> snat/dnat_and_snat IPs of this same router. >>> >> >> >>> >> >> However, as we see, the NAT related pipelines have been very complex today, >>> >> >> and with the above changes it would become even much more complex. I wonder >>> >> >> if this is maintainable in the long run. >>> >> >> >>> >> >> In addition, I am not even sure if this is the last problem of my patch. It >>> >> >> is not easy to tell because today the ct-commit-all feature is an option >>> >> >> and by default disabled. So all our test cases by default don't test for >>> >> >> this option except the 4 test cases added by Ales's patch. I had to >>> >> >> manually set this option to true as default and of course many test cases >>> >> >> will fail. Most of them fail because of the expected flows won't match, but >>> >> >> I can't tell for sure unless updating every case with new flows and then >>> >> >> retest them. 
Before doing that I just examined ones I think might be >>> >> >> impacted by the patch and the "DNAT and SNAT on distributed router - E/W" >>> >> >> is the first broken one I found. So there can be other issues not >>> >> >> discovered yet. >>> >> >> >>> >> >> Moreover, we should also keep in mind the performance penalty of these >>> >> >> extra CT commit & traversal. >>> >> >> >>> >> > >>> >> > But basically this happens because we use the same SNAT zone for all >>> >> > router ports. >>> >> > >>> >> > In an ideal world wouldn't using a unique CT SNAT and CT DNAT zone per >>> >> > router port fix it? E.g.: >>> >> > >>> >> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE:12) [LR] LRP2 (SNAT ZONE: 21, >>> >> > DNAT ZONE: 22) ---- host2 >>> >> > >>> >> > That'd also imply that we commit that session (from host1 to host2 - >>> >> > potentially via LB and SNAT) to 4 different zones (SNAT and DNAT for >>> >> > LRP1 and SNAT and DNAT for LRP2). And we'd also have to do 4 different >>> >> > CT lookups for all packets traversing the router. >>> >> > >>> >> > I didn't try any of this but I have the impression that this would solve >>> >> > the HWOL issue: all subsequent packets are part of known and committed >>> >> > sessions in all 4 conntrack zones. >>> >> > >>> >> > So what if we change the implementation of the "commit-all" feature to >>> >> > add the missing ct commits? Wouldn't that work? >>> >> > >>> >> > It feels like (lookup performance hit aside) this is the correct way to >>> >> > forward traffic, i.e., never traverse a CT zone without having a >>> >> > committed entry in that zone. >>> >> > >>> >> >> I'd like to listen to your opinion before moving forward. Is this worth the >>> >> >> effort, or shall we just keep the pipeline simple (relatively) and avoid >>> >> >> the original problem (https://issues.redhat.com/browse/FDP-291) by >>> >> >> separating the port range of SNAT and DNAT, the first proposal at the >>> >> >> beginning of this thread? >>> >> >> >>> >> > >>> >> > Now that I understand the problem you're describing better it might be >>> >> > worth reverting the fix for FDP-291 like you said until we find a proper >>> >> > generic solution. >>> >> > >>> >> > However, we can't afford to do that without separating the port range of >>> >> > SNAT and DNAT like you said [0] because we'd be regressing (at least >>> >> > from OpenShift perspective). >>> >> > >>> >> >>> >> For reference, Tim, this is the OpenShift bug that needed the fix for >>> >> FDP-291: >>> >> >>> >> https://issues.redhat.com/browse/OCPBUGS-25889 >>> >> >>> >> > Because selecting the nodeport (DNAT) range is done through a k8s API >>> >> > server cmdline argument that means that users can in theory choose >>> >> > whatever range they want. I guess we'd need a command line argument in >>> >> > ovn-kubernetes to allow users (operators, etc.) to inform ovn-kubernetes >>> >> > what SNAT port range to use so that it doesn't conflict with the KAPI >>> >> > nodeport range. >>> >> > >>> >> > Tim, you're the expert on this side, what are your thoughts on the matter? 
>>> >> > Regards,
>>> >> > Dumitru
>>> >> >
>>> >> > [0] https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html
>>> >> >
>>> >> >> Thanks,
>>> >> >> Han
>>> >> >>
>>> >> >>> Thanks,
>>> >> >>> Tim Rozet
>>> >> >>> Red Hat OpenShift Networking Team
>>> >> >>>
>>> >> >>> On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>> >> >>>> On 8/29/24 18:14, Han Zhou wrote:
>>> >> >>>>> Hi Dumitru, Numan, Tim and folks,
>>> >> >>>>
>>> >> >>>> Hi Han,
>>> >> >>>>
>>> >> >>>>> I noticed that the HW offload of k8s nodePort traffic is broken due to this change. The reason is that for client-to-nodePort (LB with the VIP being the node IP) traffic, when the packet goes through the unSNAT stage in the SNAT CT zone, since the entry is never committed to the SNAT zone, it will have a CT state of "new", which prevents HW offload from working for such packets.
>>> >> >>>>
>>> >> >>>> Sorry about that, I forgot we don't always commit in the SNAT zone.
>>> >> >>>>
>>> >> >>>>> At the moment I have to revert this change in our downstream. For the problem that was fixed by this change [0], I think we can avoid it by separating the port ranges of SNAT and DNAT (see the example below). For DNAT, the nodePort range in k8s is configured by the API-server option:
>>> >> >>>>>
>>> >> >>>>>     --service-node-port-range <a string in the form 'N1-N2'>    Default: 30000-32767
>>> >> >>>>>
>>> >> >>>>> For SNAT, it can be configured in the external_port_range column of OVN's NAT table, and we can choose something like 10000-30000.
>>> >> >>>>
>>> >> >>>> Tim, does this look OK to you? If it's acceptable to limit the SNAT port range, this workaround should be fine.
>>> >> >>>>
>>> >> >>>>> An extra benefit of this is that it reduces a CT recirc.
>>> >> >>>>>
>>> >> >>>>> Alternative solutions are:
>>> >> >>>>> Alternative 1: Always commit to both SNAT and DNAT zones. This would introduce the unnecessary cost of CT entries and an extra CT recirc.
>>> >> >>>>
>>> >> >>>> Extra recirculation aside, I would actually love it if we could use this alternative. I think it's the "most correct" option, and I think it would allow us to avoid other workarounds like:
>>> >> >>>>
>>> >> >>>> https://github.com/ovn-org/ovn/commit/40136a2f2c8
>>> >> >>>> or
>>> >> >>>> https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/
>>> >> >>>>
>>> >> >>>> I do understand the worry about the extra recirculation though. In the HWOL context, does it cause a visible performance impact?
>>> >> >>>>
>>> >> >>>> We'd probably have to do more performance testing without HWOL to figure out the impact on the software datapath.
>>> >> >>>>
>>> >> >>>>> Alternative 2: Use a common zone for SNAT and DNAT. But there are other issues reported with using the common zone [1].
>>> >> >>>>>
>>> >> >>>>> Could you let me know your thoughts on this?
>>> >> >>>>
>>> >> >>>> On a related note, I know it has been discussed in different settings but I don't think this ever moved forward: would it be possible for NVIDIA to help out with automatically testing the HWOL impact of incoming patches?
>>> >> >>>>
>>> >> >>>> Maybe we could add some "simple" system-like tests that ensure traffic is correctly offloaded in common scenarios? Alternatively, I guess we could also tag a subset of the existing system tests and just run those on actual hardware.
>>> >> >>>>
>>> >> >>>> It's quite simple (AFAIU) for external CIs to report status on each OVN patch posted on patchwork. That would at least allow us to flag this kind of breakage early (even before it gets merged).
>>> >> >>>>
>>> >> >>>> What do you think?
>>> >> >>>>
>>> >> >>>>> [0] https://issues.redhat.com/browse/FDP-291
>>> >> >>>>> [1] https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
>>> >> >>>>>
>>> >> >>>>> Thanks,
>>> >> >>>>> Han
>>> >> >>>>
>>> >> >>>> Thanks,
>>> >> >>>> Dumitru
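For concreteness, the port-range split discussed above would amount to configuration along these lines (all values are examples; the NAT record UUID is a placeholder, and external_port_range is set per NAT record via the generic ovn-nbctl set command):

    # k8s API server: keep the nodeport (DNAT) range at its default
    kube-apiserver --service-node-port-range=30000-32767 ...

    # OVN NB: restrict the SNAT source-port range to a non-overlapping one
    ovn-nbctl set NAT <nat-record-uuid> external_port_range="10000-30000"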
On Mon, Jun 2, 2025 at 7:14 PM Han Zhou <zhouhan@gmail.com> wrote:
> On Wed, May 28, 2025 at 4:59 AM Tim Rozet <trozet@redhat.com> wrote:
> > Hi Han,
> > I tagged you on https://github.com/ovn-kubernetes/ovn-kubernetes/pull/5265. Please review it when you have time.
> >
> > Thanks,
> > Tim Rozet
> > Red Hat OpenShift Networking Team
>
> Thanks Tim. I posted a comment last week. Please take a look.
>
> I also sent the OVN revert patch today: https://mail.openvswitch.org/pipermail/ovs-dev/2025-June/423753.html
> Hope we can resolve the ovn-k PR soon so that this one can get merged.
>
> > On Tue, May 20, 2025 at 9:41 AM Ales Musil <amusil@redhat.com> wrote:
> >> On Tue, May 20, 2025 at 7:37 AM Han Zhou <zhouhan@gmail.com> wrote:
> >>> On Mon, May 19, 2025 at 8:41 AM Tim Rozet <trozet@redhat.com> wrote:
> >>> > Hi All,
> >>> > Sorry Han that you are still facing issues with this problem. It seems like the best option would be to configure the range of ephemeral ports for OVN to use when doing an SNAT. Note, in OpenShift the nodeport service range is configured as 30000-32767, and the ephemeral port range in the kernel is set to:
> >>> >
> >>> >     net.ipv4.ip_local_port_range = 32768 60999
> >>> >
> >>> > I think it would make sense to revert and add an option to make this range configurable in OVN. Then OVN-Kubernetes should pass this config along to OVN.
> >>> >
> >>> > As Dumitru mentioned, we can't just revert the previous patch without a substitute fix in place, otherwise we would risk regressing.
> >>>
> >>> Thanks Tim! For the substitute fix, I assume you meant a fix in ovn-k8s, right?
> >>>
> >>> OVN already provides a configurable SNAT port range through the external_port_range column in the NAT table. See: https://man7.org/linux/man-pages/man5/ovn-nb.5.html
> >>>
> >>>     external_port_range: string
> >>>         L4 source port range
> >>>         Range of ports, from which a port number will be picked that
> >>>         will replace the source port of the to-be-NATed packet. This
> >>>         is basically PAT (port address translation).
> >>>         Value of the column is in the format port_lo-port_hi. For
> >>>         example: external_port_range : "1-30000"
> >>>         Valid range of ports is 1-65535.
> >>>
> >>> Does this satisfy the need you raised?
> >>>
> >>> If this is agreed, I think we can move forward with the steps below:
> >>> 1. Update ovn-kubernetes to set external_port_range, if specific ranges are configured. (Only needed when the k8s-configured nodeport range conflicts with the kernel ephemeral port range, and it seems unnecessary for OpenShift?)
> >>> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic towards VIPs" in OVN. Also revert some of the changes for ct-commit-all: e.g. no need to commit DNATted connections to the SNAT zone, and vice versa.
> >>
> >> Hi Han,
> >>
> >> I don't think we should revert/partially revert ct-commit-all; the whole point of that knob is what you have described: commit to the SNAT zone if there is any SNAT configured, and commit to the DNAT zone if there is any DNAT configured. If you still feel there are some changes that we could make to this approach, we can discuss it. There is one system test called "Commit all - UDN" which needs to pass unchanged if there is any revert/change to ct-commit-all.
>
> Hi Ales, sorry for the confusion.
Hi Han,

> I didn't mean to revert the major ct-commit-all change that was required to fix the reported UDN issue, which was supposed to commit all traffic to the DNAT zone if there is stateful DNAT, and all traffic to the SNAT zone if there is stateful SNAT. In one of the revisions of your patch, I commented that, to solve the HW offload issue discussed in this thread, we shouldn't distinguish between DNAT and SNAT and should always commit to both zones if there are stateful NATs. Now that solving the HW offload issue is no longer a goal of your ct-commit-all patch, we might bring back the separation between SNAT and DNAT, so that in simpler scenarios (where only DNAT or SNAT exists, but not both) we could avoid the extra CT commit, which might help reduce some of the performance penalty (see the sketch below). Shall I leave this to you, if you think this makes sense and want to follow up?

I was thinking about looking into the proposal from Dumitru with a zone per port, because we might need that anyway for UDNs. If that doesn't work out we can work out which bits could be simplified. Any objections?

Thanks,
Ales

> Thanks,
> Han
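A rough sketch of the separation Han describes (hypothetical helper, not actual northd code): the extra commit is generated only for the zones that the configured stateful NATs actually need, instead of unconditionally committing to both:

    def zones_to_commit(has_stateful_dnat, has_stateful_snat):
        # With the separation restored, a router that only has stateful
        # DNAT (or only SNAT) pays for a single commit instead of two.
        zones = []
        if has_stateful_dnat:
            zones.append("dnat")
        if has_stateful_snat:
            zones.append("snat")
        return zones

    assert zones_to_commit(True, False) == ["dnat"]          # DNAT-only LR
    assert zones_to_commit(True, True) == ["dnat", "snat"]   # both kinds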
On Tue, Jun 3, 2025 at 6:22 AM Ales Musil <amusil@redhat.com> wrote:
> Hi Han,
>
> I was thinking about looking into the proposal from Dumitru with a zone per port, because we might need that anyway for UDNs. If that doesn't work out we can work out which bits could be simplified. Any objections?

Hi Ales, thanks for looking into the zone-per-port proposal. Could you provide more details on why that is needed for UDNs?

Han

> Thanks,
> Ales
On Wed, Jun 4, 2025 at 8:40 AM Han Zhou <zhouhan@gmail.com> wrote:
> Hi Ales, thanks for looking into the zone-per-port proposal. Could you provide more details on why that is needed for UDNs?

The original bug report was for "k8s LoadBalancer services with ExternalTrafficPolicy=Local" [0] with UDNs; AFAIK this still stands today. Also, I should have made it clearer that it's not for all UDNs, but for a specific configuration. So if there is a plan to get this working with HWOL, we will need to fix ct-commit-all, possibly with the zone-per-port approach.

Thanks,
Ales

[0] https://issues.redhat.com/browse/FDP-787

> Han
I get > there is a > >> >>> >> >> performance hit there, but the datapath pipeline seems more > consistent and > >> >>> >> >> I think outweighs the cost of committing LB traffic. > >> >>> >> >> > >> >>> >> >> Hi Tim and folks, > >> >>> >> >> > >> >>> >> > > >> >>> >> > Hi Han, > >> >>> >> > > >> >>> >> >> + @Ales Musil <amusil@redhat.com> > >> >>> >> >> > >> >>> >> >> We've spent a lot of effort on the alternative solution: > Always commit to > >> >>> >> >> both SNAT and DNAT zones. Unfortunately, the HW offload issue > discussed in > >> >>> >> >> this thread is still unresolved. I'd like to discuss the > current situation > >> >>> >> >> and ask for opinions. > >> >>> >> >> > >> >>> >> > > >> >>> >> > Thanks again for working on this! > >> >>> >> > > >> >>> >> >> A while ago, Ales tried to fix it with this commit: > 800fd0681579 ("northd: > >> >>> >> >> Add LR option to commit all traffic."). When the > ct-commit-all option is > >> >>> >> >> enabled, it commits the connection to both SNAT and DNAT > zones. However, in > >> >>> >> >> the ovn-k8s nodeport scenario (a very typical use case) the > HW offload is > >> >>> >> >> still broken. In that scenario, a packet may come from > external network > >> >>> >> >> with dest IP being a node's IP, which is also used as an OVN > gateway router > >> >>> >> >> IP. The dest IP is firstly DNATed to a LB backend IP, and > then the src IP > >> >>> >> >> is SNATed to another LRP's IP on this router. The src/dst IPs > in different > >> >>> >> >> stages are: > >> >>> >> >> > >> >>> >> >> Original: EXT IP <-> GR IP1 > >> >>> >> >> After DNAT: EXT IP <-> BE IP > >> >>> >> >> After SNAT: GR IP2 <-> BE IP > >> >>> >> >> > >> >>> >> >> With the ct-commit-all option, it commits after DNAT and > after SNAT: > >> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP > >> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP > >> >>> >> >> > >> >>> >> >> So, at the UNSNAT stage, the packet traverses the SNAT zone > with the > >> >>> >> >> original header, which will still always be "new" state, > which breaks HW > >> >>> >> >> offload. > >> >>> >> >> > >> >>> >> > > >> >>> >> > Ah, now I understand the problem you were facing. > >> >>> >> > > >> >>> >> >> I tried to fix it by also committing to the SNAT zone before > DNAT (at the > >> >>> >> >> POST_SNAT stage). Now there are 3 entries committed: > >> >>> >> >> SNAT zone: EXT IP <-> GR IP1 > >> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP > >> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP > >> >>> >> >> > >> >>> >> >> HW offload worked, but it breaks at least the scenario of > "DNAT and SNAT on > >> >>> >> >> distributed router - E/W", which tests communication between > two workloads > >> >>> >> >> using "floating IPs". In this scenario, both workloads are > under different > >> >>> >> >> LSes behind the same router, and both have their own floating > IPs > >> >>> >> >> (dnat_and_snat) on the public side of the LR. 
When one of > them sends > >> >>> >> >> packets to the other using the floating IP, what's expected > is: > >> >>> >> >> > >> >>> >> >> Original: overlay IP1 <-> floating IP2 > >> >>> >> >> After SNAT: floating IP1 <-> floating IP2 > >> >>> >> >> After DNAT: floating IP1 <-> overlay IP2 > >> >>> >> >> > >> >>> >> >> With my fix, the packet is firstly SNATed with CT entry in > SNAT zone: > >> >>> >> >> SNAT: overlay IP1/floating IP1 <-> floating IP2 > >> >>> >> >> > >> >>> >> >> This happens in the egress pipeline of the LR, and then it > enters into the > >> >>> >> >> ingress pipeline again because this is like a hairpin (the > dest IP is a > >> >>> >> >> dnat_and_snat IP on the same router). It will hit the UNSNAT > stage first, > >> >>> >> >> traverse the CT zone, and for the first packet the reply is > not seen yet, > >> >>> >> >> so the CT state is still new, so in the POST_UNSNAT we will > try to commit > >> >>> >> >> it (because of the issue I am trying to fix above). Because > the connection > >> >>> >> >> floating IP1 <-> floating IP2 is conflicting with the above > committed entry > >> >>> >> >> "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT > action fails and > >> >>> >> >> the packet is dropped. > >> >>> >> >> > >> >>> >> >> There is still a way to solve this, probably by skipping the > unnecessary > >> >>> >> >> UNSNAT traversal and the conflict commit if the source IP is > one of the > >> >>> >> >> snat/dnat_and_snat IPs of this same router. > >> >>> >> >> > >> >>> >> >> However, as we see, the NAT related pipelines have been very > complex today, > >> >>> >> >> and with the above changes it would become even much more > complex. I wonder > >> >>> >> >> if this is maintainable in the long run. > >> >>> >> >> > >> >>> >> >> In addition, I am not even sure if this is the last problem > of my patch. It > >> >>> >> >> is not easy to tell because today the ct-commit-all feature > is an option > >> >>> >> >> and by default disabled. So all our test cases by default > don't test for > >> >>> >> >> this option except the 4 test cases added by Ales's patch. I > had to > >> >>> >> >> manually set this option to true as default and of course > many test cases > >> >>> >> >> will fail. Most of them fail because of the expected flows > won't match, but > >> >>> >> >> I can't tell for sure unless updating every case with new > flows and then > >> >>> >> >> retest them. Before doing that I just examined ones I think > might be > >> >>> >> >> impacted by the patch and the "DNAT and SNAT on distributed > router - E/W" > >> >>> >> >> is the first broken one I found. So there can be other issues > not > >> >>> >> >> discovered yet. > >> >>> >> >> > >> >>> >> >> Moreover, we should also keep in mind the performance penalty > of these > >> >>> >> >> extra CT commit & traversal. > >> >>> >> >> > >> >>> >> > > >> >>> >> > But basically this happens because we use the same SNAT zone > for all > >> >>> >> > router ports. > >> >>> >> > > >> >>> >> > In an ideal world wouldn't using a unique CT SNAT and CT DNAT > zone per > >> >>> >> > router port fix it? E.g.: > >> >>> >> > > >> >>> >> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE:12) [LR] LRP2 (SNAT > ZONE: 21, > >> >>> >> > DNAT ZONE: 22) ---- host2 > >> >>> >> > > >> >>> >> > That'd also imply that we commit that session (from host1 to > host2 - > >> >>> >> > potentially via LB and SNAT) to 4 different zones (SNAT and > DNAT for > >> >>> >> > LRP1 and SNAT and DNAT for LRP2). 
And we'd also have to do 4 > different > >> >>> >> > CT lookups for all packets traversing the router. > >> >>> >> > > >> >>> >> > I didn't try any of this but I have the impression that this > would solve > >> >>> >> > the HWOL issue: all subsequent packets are part of known and > committed > >> >>> >> > sessions in all 4 conntrack zones. > >> >>> >> > > >> >>> >> > So what if we change the implementation of the "commit-all" > feature to > >> >>> >> > add the missing ct commits? Wouldn't that work? > >> >>> >> > > >> >>> >> > It feels like (lookup performance hit aside) this is the > correct way to > >> >>> >> > forward traffic, i.e., never traverse a CT zone without having > a > >> >>> >> > committed entry in that zone. > >> >>> >> > > >> >>> >> >> I'd like to listen to your opinion before moving forward. Is > this worth the > >> >>> >> >> effort, or shall we just keep the pipeline simple > (relatively) and avoid > >> >>> >> >> the original problem ( > https://issues.redhat.com/browse/FDP-291) by > >> >>> >> >> separating the port range of SNAT and DNAT, the first > proposal at the > >> >>> >> >> beginning of this thread? > >> >>> >> >> > >> >>> >> > > >> >>> >> > Now that I understand the problem you're describing better it > might be > >> >>> >> > worth reverting the fix for FDP-291 like you said until we > find a proper > >> >>> >> > generic solution. > >> >>> >> > > >> >>> >> > However, we can't afford to do that without separating the > port range of > >> >>> >> > SNAT and DNAT like you said [0] because we'd be regressing (at > least > >> >>> >> > from OpenShift perspective). > >> >>> >> > > >> >>> >> > >> >>> >> For reference, Tim, this is the OpenShift bug that needed the > fix for > >> >>> >> FDP-291: > >> >>> >> > >> >>> >> https://issues.redhat.com/browse/OCPBUGS-25889 > >> >>> >> > >> >>> >> > Because selecting the nodeport (DNAT) range is done through a > k8s API > >> >>> >> > server cmdline argument that means that users can in theory > choose > >> >>> >> > whatever range they want. I guess we'd need a command line > argument in > >> >>> >> > ovn-kubernetes to allow users (operators, etc.) to inform > ovn-kubernetes > >> >>> >> > what SNAT port range to use so that it doesn't conflict with > the KAPI > >> >>> >> > nodeport range. > >> >>> >> > > >> >>> >> > Tim, you're the expert on this side, what are your thoughts on > the matter? > >> >>> >> > > >> >>> >> > Regards, > >> >>> >> > Dumitru > >> >>> >> > > >> >>> >> > [0] > https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html > >> >>> >> > > >> >>> >> >> Thanks, > >> >>> >> >> Han > >> >>> >> >> > >> >>> >> >>> > >> >>> >> >>> Thanks, > >> >>> >> >>> Tim Rozet > >> >>> >> >>> Red Hat OpenShift Networking Team > >> >>> >> >>> > >> >>> >> >>> > >> >>> >> >>> On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara < > dceara@redhat.com> wrote: > >> >>> >> >>>> > >> >>> >> >>>> On 8/29/24 18:14, Han Zhou wrote: > >> >>> >> >>>>> On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara < > dceara@redhat.com> > >> >>> >> >> wrote: > >> >>> >> >>>>>> > >> >>> >> >>>>>> On 3/5/24 15:56, Numan Siddique wrote: > >> >>> >> >>>>>>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara < > dceara@redhat.com> > >> >>> >> >> wrote: > >> >>> >> >>>>>>>>>>>>>> Otherwise, in case there's also a SNAT rule that > uses the > >> >>> >> >> VIP as > >> >>> >> >>>>>>>> external IP, we break sessions initiated from behind > the VIP. 
> >> >>> >> >>>>>>>> > >> >>> >> >>>>>>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip > unsnat flows > >> >>> >> >> for > >> >>> >> >>>>>>>> load balancer vips in router ingress pipeline"). > That's OK because > >> >>> >> >>>>>>>> commit 384a7c6237da ("northd: Refactor Logical Flows > for routers > >> >>> >> >> with > >> >>> >> >>>>>>>> DNAT/Load Balancers") addressed the original issue in a > better way: > >> >>> >> >>>>>>>> > >> >>> >> >>>>>>>> In the reply direction, the order of traversal of > the tables > >> >>> >> >>>>>>>> "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" > adds incorrect > >> >>> >> >>>>>>>> datapath flows that check ct_state in the wrong > conntrack zone. > >> >>> >> >>>>>>>> This is illustrated below where reply trafic enters > the > >> >>> >> >> physical host > >> >>> >> >>>>>>>> port (6) and traverses DNAT zone (14), SNAT zone > (default), > >> >>> >> >> back to the > >> >>> >> >>>>>>>> DNAT zone and then on to Logical Switch Port zone > (22). The > >> >>> >> >> third > >> >>> >> >>>>>>>> flow is incorrectly checking the state from the > SNAT zone > >> >>> >> >> instead > >> >>> >> >>>>>>>> of the DNAT zone. > >> >>> >> >>>>>>>> > >> >>> >> >>>>>>>> We also add a system test to ensure traffic initiated > from behind a > >> >>> >> >> VIP > >> >>> >> >>>>>>>> + SNAT is not broken. > >> >>> >> >>>>>>>> > >> >>> >> >>>>>>>> Another nice side effect is that the northd I-P is > slightly > >> >>> >> >> simplified > >> >>> >> >>>>>>>> because we don't need to track NAT external IPs anymore. > >> >>> >> >>>>>>>> > >> >>> >> >>>>>>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for > load > >> >>> >> >> balancer vips in router ingress pipeline") > >> >>> >> >>>>>>>> Reported-at: https://issues.redhat.com/browse/FDP-291 > >> >>> >> >>>>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com> > >> >>> >> >>>>>>> > >> >>> >> >>>>>>> > >> >>> >> >>>>>>> Thanks for the fix. It also simplified the > lr-nat-stateful code. > >> >>> >> >>>>>>> > >> >>> >> >>>>>>> Acked-by: Numan Siddique <numans@ovn.org> > >> >>> >> >>>>>>> > >> >>> >> >>>>>> > >> >>> >> >>>>>> Thanks, Numan! > >> >>> >> >>>>>> > >> >>> >> >>>>>> Applied to main and backported to all branches down to > 22.03. > >> >>> >> >>>>>> > >> >>> >> >>>>>> Regards, > >> >>> >> >>>>>> Dumitru > >> >>> >> >>>>>> > >> >>> >> >>>>>> _______________________________________________ > >> >>> >> >>>>>> dev mailing list > >> >>> >> >>>>>> dev@openvswitch.org > >> >>> >> >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >> >>> >> >>>>> > >> >>> >> >>>>> Hi Dumitru, Numan, Tim and folks, > >> >>> >> >>>>> > >> >>> >> >>>> > >> >>> >> >>>> Hi Han, > >> >>> >> >>>> > >> >>> >> >>>>> I noticed that the HW offload of k8s nodePort traffic is > broken due to > >> >>> >> >>>>> this change. The reason is that for client to nodePort (LB > with VIP > >> >>> >> >>>>> being the node IP) traffic, when the packet is going > through the > >> >>> >> >>>>> unSNAT stage in the SNAT CT zone, since the entry is never > committed > >> >>> >> >>>>> to the SNAT zone, it will have CT state returned as "new", > which > >> >>> >> >>>>> prevents the HW offload to work for such packets. > >> >>> >> >>>>> > >> >>> >> >>>> > >> >>> >> >>>> Sorry about that, I forgot we don't always commit in the > SNAT zone. > >> >>> >> >>>> > >> >>> >> >>>>> At the moment I have to revert this change in our > downstream. 
For the > >> >>> >> >>>>> problem that was fixed by this change [0], I think we can > avoid it by > >> >>> >> >>>>> separating the port range of SNAT and DNAT. For DNAT, the > nodePort > >> >>> >> >>>>> range in k8s is configured by API-server option: > >> >>> >> >>>>> > >> >>> >> >>>>> --service-node-port-range <a string in the form 'N1-N2'> > Default: > >> >>> >> >> 30000-32767 > >> >>> >> >>>>> > >> >>> >> >>>>> For SNAT, it can be configured in the OVN's NAT table's > >> >>> >> >>>>> external_port_range column, and we can choose something > like > >> >>> >> >>>>> 10000-30000. > >> >>> >> >>>>> > >> >>> >> >>>> > >> >>> >> >>>> Tim, does this look OK to you? If it's acceptable to limit > the SNAT > >> >>> >> >>>> port range this workaround should be fine. > >> >>> >> >>>> > >> >>> >> >>>>> An extra benefit of this is that it reduces a CT recirc. > >> >>> >> >>>>> > >> >>> >> >>>>> Alternatively solutions are: > >> >>> >> >>>>> Alternative1: Always commit to both SNAT and DNAT zones. > This would > >> >>> >> >>>>> introduce unnecessary cost of CT entries and extra CT > recirc. > >> >>> >> >>>> > >> >>> >> >>>> Extra recirculation aside, I would actually love it if we > could use this > >> >>> >> >>>> alternative. I think it's the "most correct" option. I > think it would > >> >>> >> >>>> allow us to avoid other workarounds like: > >> >>> >> >>>> > >> >>> >> >>>> https://github.com/ovn-org/ovn/commit/40136a2f2c8 > >> >>> >> >>>> > >> >>> >> >>>> or > >> >>> >> >>>> > >> >>> >> >>>> > >> >>> >> >> > https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/ > >> >>> >> >>>> > >> >>> >> >>>> I do understand the worry about the extra recirculation > though. In the > >> >>> >> >>>> HWOL context does that cause visible performance impact? > >> >>> >> >>>> > >> >>> >> >>>> We'd probably have to do more performance testing without > HWOL to figure > >> >>> >> >>>> out the impact in the software datapath. > >> >>> >> >>>> > >> >>> >> >>>>> Alternative2: Use a common zone for SNAT and DNAT. But > there are other > >> >>> >> >>>>> issues reported for using the common zone [1] > >> >>> >> >>>>> > >> >>> >> >>>>> Could you let me know if other thoughts on this? > >> >>> >> >>>>> > >> >>> >> >>>> > >> >>> >> >>>> On a related note, I know it has been discussed in > different settings > >> >>> >> >>>> but I don't think this ever moved forward: would it be > possible for > >> >>> >> >>>> NVIDIA to help out with automatically testing HWOL impact > for incoming > >> >>> >> >>>> patches? > >> >>> >> >>>> > >> >>> >> >>>> Maybe we could some "simple" system-like tests that ensure > that traffic > >> >>> >> >>>> is correctly offloaded in common scenarios? Alternatively, > I guess we > >> >>> >> >>>> could also tag a subset of the existing system tests and > just run those > >> >>> >> >>>> on actual hardware? > >> >>> >> >>>> > >> >>> >> >>>> It's quite simple (AFAIU) for external CIs to report status > on each OVN > >> >>> >> >>>> patch posted on patchwork. That would at least allow us to > flag this > >> >>> >> >>>> kind of breakages early (even before they get merged). > >> >>> >> >>>> > >> >>> >> >>>> What do you think? 
> >> >>> >> >>>> > >> >>> >> >>>>> [0] https://issues.redhat.com/browse/FDP-291 > >> >>> >> >>>>> [1] b8c40e7593 > >> >>> >> >> > https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4 > >> >>> >> >>>>> > >> >>> >> >>>>> Thanks, > >> >>> >> >>>>> Han > >> >>> >> >>>>> > >> >>> >> >>>> > >> >>> >> >>>> Thanks, > >> >>> >> >>>> Dumitru > >> >>> >> >> _______________________________________________ > >> >>> >> >> dev mailing list > >> >>> >> >> dev@openvswitch.org > >> >>> >> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>
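The port-range separation proposed in this thread comes down to three independent knobs. Below is a minimal sketch using the values quoted above, not a definitive recipe: the NAT record UUID is a placeholder, and kube-apiserver flags are normally set through the cluster's deployment tooling rather than by hand.

```
# Kernel ephemeral (source) port range, as quoted for OpenShift nodes:
sysctl net.ipv4.ip_local_port_range
# net.ipv4.ip_local_port_range = 32768 60999

# k8s nodeport (DNAT) range, an API server option (default 30000-32767):
kube-apiserver --service-node-port-range=30000-32767 ...

# OVN SNAT source-port (PAT) range, set per NAT record in the northbound
# DB and kept below the nodeport range so the two can never collide:
ovn-nbctl set NAT <nat-uuid> external_port_range='"10000-30000"'
```

Keeping the SNAT range strictly below the nodeport range is how the thread proposes to avoid the FDP-291 collision between SNATed source ports and load-balancer VIP ports once the unSNAT-stage patch is reverted.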
On Tue, Jun 3, 2025 at 11:50 PM Ales Musil <amusil@redhat.com> wrote:
> [...]
> The original bug report was for "k8s LoadBalancer services with ExternalTrafficPolicy=Local"[0] with UDNs, AFAIK this stands to this day. Also I should have made it more clear that it's not for all UDNs, but for a specific configuration. So if there is a plan to get this working with HWOL we will need to fix ct-commit-all, possibly with the zone-per-port approach.
>
> [0] https://issues.redhat.com/browse/FDP-787

FDP-787 is already fixed by ct-commit-all, correct? And the HWOL issue was introduced by "northd: Don't skip the unSNAT stage for traffic towards VIPs", which was supposed to fix FDP-291. Now we are reverting that patch to fix the HWOL issue and using the port range configuration in ovn-k to avoid the problem reported by FDP-291. So I am still confused: is there another problem that requires an additional fix (e.g. zone-per-port)? Is ct-commit-all breaking HWOL for the UDN case you mentioned?

Thanks,
Han
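One way to answer that last question empirically is to check, on a node, whether the connection under test ever gets committed in the relevant CT zones. A rough sketch with stock tooling; the zone number is a placeholder that first has to be read from ovn-controller's assignment, and the ct-zone-list command assumes a reasonably recent ovn-controller:

```
# List the CT zones ovn-controller assigned to each datapath/port:
ovn-appctl -t ovn-controller ct-zone-list

# Dump conntrack entries in the zone of interest; a connection that never
# appears here traverses that zone uncommitted, i.e. always in "new"
# state, which is exactly the condition that defeats HW offload:
ovs-appctl dpctl/dump-conntrack zone=8
```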
On Wed, Jun 4, 2025 at 4:33 PM Han Zhou <zhouhan@gmail.com> wrote:
> [...]
> So I am still confused: is there another problem that requires an additional fix (e.g. zone-per-port)? Is ct-commit-all breaking HWOL for the UDN case you mentioned?

My understanding was that ct-commit-all is breaking HWOL, but I suppose that was never said explicitly; the message was only that it is not fixing it. So we might be fine; we probably need some testing to confirm, though.

Thanks,
Ales
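As a starting point for that testing, offload status can be read straight from the datapath flow table. A sketch, assuming an OVS build and NIC that support hardware offload:

```
# Hardware offload must be enabled in OVS (ovs-vswitchd restart required):
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

# With test traffic running, list only the flows that were actually
# offloaded; the connection under test should show up here:
ovs-appctl dpctl/dump-flows type=offloaded

# For comparison, the flows still handled by the software datapath:
ovs-appctl dpctl/dump-flows type=ovs
```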
> >> >> >>> > >> >> >>> Thanks, > >> >> >>> Han > >> >> >>> > >> >> >>> > > >> >> >>> > Thanks, > >> >> >>> > > >> >> >>> > Tim Rozet > >> >> >>> > Red Hat OpenShift Networking Team > >> >> >>> > > >> >> >>> > > >> >> >>> > On Fri, May 16, 2025 at 7:59 AM Dumitru Ceara < > dceara@redhat.com> wrote: > >> >> >>> >> > >> >> >>> >> On 5/16/25 1:22 PM, Dumitru Ceara wrote: > >> >> >>> >> > On 5/15/25 6:24 PM, Han Zhou wrote: > >> >> >>> >> >> On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet < > trozet@redhat.com> wrote: > >> >> >>> >> >>> > >> >> >>> >> >>> Hi Han, > >> >> >>> >> >>> My understanding is that the unSNAT stage was originally > being skipped > >> >> >>> >> >> for LB ports, then Dumitru changed it to attempt unSNAT, > and this breaks > >> >> >>> >> >> HWOL for ingress connections that never need to get > unSNAT'ed and therefore > >> >> >>> >> >> not committed. In the OVNK use case for pods simply > connecting egress, > >> >> >>> >> >> almost all packets will be SNAT'ed to the node IP. > Therefore on egress > >> >> >>> >> >> reply, we always enter the unSNAT stage for these packets. > So this skipping > >> >> >>> >> >> unSNAT stage seems to be specific to LB related traffic. > While I agree > >> >> >>> >> >> splitting the port range would work, I think it makes more > sense to just > >> >> >>> >> >> always commit in the unSNAT stage for all traffic. I get > there is a > >> >> >>> >> >> performance hit there, but the datapath pipeline seems > more consistent and > >> >> >>> >> >> I think outweighs the cost of committing LB traffic. > >> >> >>> >> >> > >> >> >>> >> >> Hi Tim and folks, > >> >> >>> >> >> > >> >> >>> >> > > >> >> >>> >> > Hi Han, > >> >> >>> >> > > >> >> >>> >> >> + @Ales Musil <amusil@redhat.com> > >> >> >>> >> >> > >> >> >>> >> >> We've spent a lot of effort on the alternative solution: > Always commit to > >> >> >>> >> >> both SNAT and DNAT zones. Unfortunately, the HW offload > issue discussed in > >> >> >>> >> >> this thread is still unresolved. I'd like to discuss the > current situation > >> >> >>> >> >> and ask for opinions. > >> >> >>> >> >> > >> >> >>> >> > > >> >> >>> >> > Thanks again for working on this! > >> >> >>> >> > > >> >> >>> >> >> A while ago, Ales tried to fix it with this commit: > 800fd0681579 ("northd: > >> >> >>> >> >> Add LR option to commit all traffic."). When the > ct-commit-all option is > >> >> >>> >> >> enabled, it commits the connection to both SNAT and DNAT > zones. However, in > >> >> >>> >> >> the ovn-k8s nodeport scenario (a very typical use case) > the HW offload is > >> >> >>> >> >> still broken. In that scenario, a packet may come from > external network > >> >> >>> >> >> with dest IP being a node's IP, which is also used as an > OVN gateway router > >> >> >>> >> >> IP. The dest IP is firstly DNATed to a LB backend IP, and > then the src IP > >> >> >>> >> >> is SNATed to another LRP's IP on this router. 
The src/dst > IPs in different > >> >> >>> >> >> stages are: > >> >> >>> >> >> > >> >> >>> >> >> Original: EXT IP <-> GR IP1 > >> >> >>> >> >> After DNAT: EXT IP <-> BE IP > >> >> >>> >> >> After SNAT: GR IP2 <-> BE IP > >> >> >>> >> >> > >> >> >>> >> >> With the ct-commit-all option, it commits after DNAT and > after SNAT: > >> >> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP > >> >> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP > >> >> >>> >> >> > >> >> >>> >> >> So, at the UNSNAT stage, the packet traverses the SNAT > zone with the > >> >> >>> >> >> original header, which will still always be "new" state, > which breaks HW > >> >> >>> >> >> offload. > >> >> >>> >> >> > >> >> >>> >> > > >> >> >>> >> > Ah, now I understand the problem you were facing. > >> >> >>> >> > > >> >> >>> >> >> I tried to fix it by also committing to the SNAT zone > before DNAT (at the > >> >> >>> >> >> POST_SNAT stage). Now there are 3 entries committed: > >> >> >>> >> >> SNAT zone: EXT IP <-> GR IP1 > >> >> >>> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP > >> >> >>> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP > >> >> >>> >> >> > >> >> >>> >> >> HW offload worked, but it breaks at least the scenario of > "DNAT and SNAT on > >> >> >>> >> >> distributed router - E/W", which tests communication > between two workloads > >> >> >>> >> >> using "floating IPs". In this scenario, both workloads are > under different > >> >> >>> >> >> LSes behind the same router, and both have their own > floating IPs > >> >> >>> >> >> (dnat_and_snat) on the public side of the LR. When one of > them sends > >> >> >>> >> >> packets to the other using the floating IP, what's > expected is: > >> >> >>> >> >> > >> >> >>> >> >> Original: overlay IP1 <-> floating IP2 > >> >> >>> >> >> After SNAT: floating IP1 <-> floating IP2 > >> >> >>> >> >> After DNAT: floating IP1 <-> overlay IP2 > >> >> >>> >> >> > >> >> >>> >> >> With my fix, the packet is firstly SNATed with CT entry in > SNAT zone: > >> >> >>> >> >> SNAT: overlay IP1/floating IP1 <-> floating IP2 > >> >> >>> >> >> > >> >> >>> >> >> This happens in the egress pipeline of the LR, and then it > enters into the > >> >> >>> >> >> ingress pipeline again because this is like a hairpin (the > dest IP is a > >> >> >>> >> >> dnat_and_snat IP on the same router). It will hit the > UNSNAT stage first, > >> >> >>> >> >> traverse the CT zone, and for the first packet the reply > is not seen yet, > >> >> >>> >> >> so the CT state is still new, so in the POST_UNSNAT we > will try to commit > >> >> >>> >> >> it (because of the issue I am trying to fix above). > Because the connection > >> >> >>> >> >> floating IP1 <-> floating IP2 is conflicting with the > above committed entry > >> >> >>> >> >> "SNAT: overlay IP1/floating IP1 <-> floating IP2", the CT > action fails and > >> >> >>> >> >> the packet is dropped. > >> >> >>> >> >> > >> >> >>> >> >> There is still a way to solve this, probably by skipping > the unnecessary > >> >> >>> >> >> UNSNAT traversal and the conflict commit if the source IP > is one of the > >> >> >>> >> >> snat/dnat_and_snat IPs of this same router. > >> >> >>> >> >> > >> >> >>> >> >> However, as we see, the NAT related pipelines have been > very complex today, > >> >> >>> >> >> and with the above changes it would become even much more > complex. I wonder > >> >> >>> >> >> if this is maintainable in the long run. > >> >> >>> >> >> > >> >> >>> >> >> In addition, I am not even sure if this is the last > problem of my patch. 
It > >> >> >>> >> >> is not easy to tell because today the ct-commit-all > feature is an option > >> >> >>> >> >> and by default disabled. So all our test cases by default > don't test for > >> >> >>> >> >> this option except the 4 test cases added by Ales's patch. > I had to > >> >> >>> >> >> manually set this option to true as default and of course > many test cases > >> >> >>> >> >> will fail. Most of them fail because of the expected flows > won't match, but > >> >> >>> >> >> I can't tell for sure unless updating every case with new > flows and then > >> >> >>> >> >> retest them. Before doing that I just examined ones I > think might be > >> >> >>> >> >> impacted by the patch and the "DNAT and SNAT on > distributed router - E/W" > >> >> >>> >> >> is the first broken one I found. So there can be other > issues not > >> >> >>> >> >> discovered yet. > >> >> >>> >> >> > >> >> >>> >> >> Moreover, we should also keep in mind the performance > penalty of these > >> >> >>> >> >> extra CT commit & traversal. > >> >> >>> >> >> > >> >> >>> >> > > >> >> >>> >> > But basically this happens because we use the same SNAT > zone for all > >> >> >>> >> > router ports. > >> >> >>> >> > > >> >> >>> >> > In an ideal world wouldn't using a unique CT SNAT and CT > DNAT zone per > >> >> >>> >> > router port fix it? E.g.: > >> >> >>> >> > > >> >> >>> >> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE:12) [LR] LRP2 > (SNAT ZONE: 21, > >> >> >>> >> > DNAT ZONE: 22) ---- host2 > >> >> >>> >> > > >> >> >>> >> > That'd also imply that we commit that session (from host1 > to host2 - > >> >> >>> >> > potentially via LB and SNAT) to 4 different zones (SNAT and > DNAT for > >> >> >>> >> > LRP1 and SNAT and DNAT for LRP2). And we'd also have to do > 4 different > >> >> >>> >> > CT lookups for all packets traversing the router. > >> >> >>> >> > > >> >> >>> >> > I didn't try any of this but I have the impression that > this would solve > >> >> >>> >> > the HWOL issue: all subsequent packets are part of known > and committed > >> >> >>> >> > sessions in all 4 conntrack zones. > >> >> >>> >> > > >> >> >>> >> > So what if we change the implementation of the "commit-all" > feature to > >> >> >>> >> > add the missing ct commits? Wouldn't that work? > >> >> >>> >> > > >> >> >>> >> > It feels like (lookup performance hit aside) this is the > correct way to > >> >> >>> >> > forward traffic, i.e., never traverse a CT zone without > having a > >> >> >>> >> > committed entry in that zone. > >> >> >>> >> > > >> >> >>> >> >> I'd like to listen to your opinion before moving forward. > Is this worth the > >> >> >>> >> >> effort, or shall we just keep the pipeline simple > (relatively) and avoid > >> >> >>> >> >> the original problem ( > https://issues.redhat.com/browse/FDP-291) by > >> >> >>> >> >> separating the port range of SNAT and DNAT, the first > proposal at the > >> >> >>> >> >> beginning of this thread? > >> >> >>> >> >> > >> >> >>> >> > > >> >> >>> >> > Now that I understand the problem you're describing better > it might be > >> >> >>> >> > worth reverting the fix for FDP-291 like you said until we > find a proper > >> >> >>> >> > generic solution. > >> >> >>> >> > > >> >> >>> >> > However, we can't afford to do that without separating the > port range of > >> >> >>> >> > SNAT and DNAT like you said [0] because we'd be regressing > (at least > >> >> >>> >> > from OpenShift perspective). 
>
> For reference, Tim, this is the OpenShift bug that needed the fix for
> FDP-291:
>
> https://issues.redhat.com/browse/OCPBUGS-25889
>
> > Because selecting the nodeport (DNAT) range is done through a k8s API
> > server cmdline argument, users can in theory choose whatever range they
> > want. I guess we'd need a command line argument in ovn-kubernetes to
> > allow users (operators, etc.) to tell ovn-kubernetes what SNAT port
> > range to use so that it doesn't conflict with the KAPI nodeport range.
> >
> > Tim, you're the expert on this side, what are your thoughts on the
> > matter?
> >
> > Regards,
> > Dumitru
> >
> > [0] https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html
> >
> > > Thanks,
> > > Han
> > >
> > > > Thanks,
> > > > Tim Rozet
> > > > Red Hat OpenShift Networking Team
> > > >
> > > > On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > > > >
> > > > > On 8/29/24 18:14, Han Zhou wrote:
> > > > > > On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > > > > > >
> > > > > > > On 3/5/24 15:56, Numan Siddique wrote:
> > > > > > > > On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > [... commit message and diffstat trimmed; the patch is
> > > > > > > > > shown in full below ...]
> > > > > > > > Thanks for the fix. It also simplified the lr-nat-stateful
> > > > > > > > code.
> > > > > > > >
> > > > > > > > Acked-by: Numan Siddique <numans@ovn.org>
> > > > > > >
> > > > > > > Thanks, Numan!
> > > > > > >
> > > > > > > Applied to main and backported to all branches down to 22.03.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Dumitru
> > > > > >
> > > > > > Hi Dumitru, Numan, Tim and folks,
> > > > >
> > > > > Hi Han,
> > > > >
> > > > > > I noticed that the HW offload of k8s nodePort traffic is broken
> > > > > > due to this change. The reason is that for client-to-nodePort
> > > > > > traffic (LB with the VIP being the node IP), when the packet
> > > > > > goes through the unSNAT stage in the SNAT CT zone, since the
> > > > > > entry is never committed to the SNAT zone, it will have a CT
> > > > > > state of "new", which prevents HW offload from working for
> > > > > > such packets.
> > > > >
> > > > > Sorry about that, I forgot we don't always commit in the SNAT
> > > > > zone.
> > > > >
> > > > > > At the moment I have to revert this change in our downstream.
> > > > > > For the problem that was fixed by this change [0], I think we
> > > > > > can avoid it by separating the port ranges of SNAT and DNAT.
> > > > > > For DNAT, the nodePort range in k8s is configured by the
> > > > > > API-server option:
> > > > > >
> > > > > > --service-node-port-range <a string in the form 'N1-N2'>
> > > > > >     Default: 30000-32767
> > > > > >
> > > > > > For SNAT, it can be configured in the OVN NAT table's
> > > > > > external_port_range column, and we can choose something like
> > > > > > 10000-30000.
> > > > >
> > > > > Tim, does this look OK to you? If it's acceptable to limit the
> > > > > SNAT port range, this workaround should be fine.
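(To make the proposed split concrete, here is a minimal sketch. The SNAT
entry's external IP and the 10000-29999 range are illustrative assumptions,
chosen only to stay clear of the default 30000-32767 nodePort range.)

    # Look up a hypothetical SNAT entry in the NB database and constrain
    # the source ports it may allocate, so they cannot collide with
    # nodePort DNAT allocations:
    nat=$(ovn-nbctl --bare --columns=_uuid find nat type=snat external_ip=172.168.0.10)
    ovn-nbctl set nat "$nat" external_port_range=10000-29999
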
> > > > > > An extra benefit of this is that it reduces a CT
> > > > > > recirculation.
> > > > > >
> > > > > > Alternative solutions are:
> > > > > >
> > > > > > Alternative 1: Always commit to both the SNAT and DNAT zones.
> > > > > > This would introduce the unnecessary cost of extra CT entries
> > > > > > and an extra CT recirculation.
> > > > >
> > > > > Extra recirculation aside, I would actually love it if we could
> > > > > use this alternative. I think it's the "most correct" option. I
> > > > > think it would allow us to avoid other workarounds like:
> > > > >
> > > > > https://github.com/ovn-org/ovn/commit/40136a2f2c8
> > > > >
> > > > > or
> > > > >
> > > > > https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/
> > > > >
> > > > > I do understand the worry about the extra recirculation though.
> > > > > In the HWOL context, does it cause a visible performance impact?
> > > > >
> > > > > We'd probably have to do more performance testing without HWOL
> > > > > to figure out the impact on the software datapath.
> > > > >
> > > > > > Alternative 2: Use a common zone for SNAT and DNAT. But there
> > > > > > are other issues reported with using a common zone [1].
> > > > > >
> > > > > > Could you let me know if you have other thoughts on this?
> > > > >
> > > > > On a related note, I know it has been discussed in different
> > > > > settings but I don't think this ever moved forward: would it be
> > > > > possible for NVIDIA to help out with automatically testing HWOL
> > > > > impact for incoming patches?
> > > > >
> > > > > Maybe we could add some "simple" system-like tests that ensure
> > > > > that traffic is correctly offloaded in common scenarios?
> > > > > Alternatively, I guess we could also tag a subset of the
> > > > > existing system tests and just run those on actual hardware?
> > > > >
> > > > > It's quite simple (AFAIU) for external CIs to report status on
> > > > > each OVN patch posted on patchwork. That would at least allow us
> > > > > to flag this kind of breakage early (even before it gets
> > > > > merged).
> > > > >
> > > > > What do you think?
> > > > >
> > > > > > [0] https://issues.redhat.com/browse/FDP-291
> > > > > > [1] b8c40e7593
> > > > > >     https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
> > > > > >
> > > > > > Thanks,
> > > > > > Han
> > > > >
> > > > > Thanks,
> > > > > Dumitru

_______________________________________________
dev mailing list
dev@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
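(On the tagging idea discussed above: selecting a hardware-relevant subset
could be done with autotest keywords. The "hwol" keyword below is
hypothetical, used only to illustrate the mechanism.)

    # Tag candidate tests in the testsuite with AT_KEYWORDS([hwol]), then
    # have the hardware CI run only that subset of the system tests:
    make check-kernel TESTSUITEFLAGS="-k hwol"
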
diff --git a/northd/en-lflow.c b/northd/en-lflow.c
index f1a83839df..c4b927fb8c 100644
--- a/northd/en-lflow.c
+++ b/northd/en-lflow.c
@@ -174,8 +174,7 @@ lflow_lr_stateful_handler(struct engine_node *node, void *data)
     struct ed_type_lr_stateful *lr_sful_data =
         engine_get_input_data("lr_stateful", node);
 
-    if (!lr_stateful_has_tracked_data(&lr_sful_data->trk_data)
-        || lr_sful_data->trk_data.vip_nats_changed) {
+    if (!lr_stateful_has_tracked_data(&lr_sful_data->trk_data)) {
         return false;
     }
 
diff --git a/northd/en-lr-nat.c b/northd/en-lr-nat.c
index ad11025c69..215d924e42 100644
--- a/northd/en-lr-nat.c
+++ b/northd/en-lr-nat.c
@@ -219,10 +219,6 @@ lr_nat_record_init(struct lr_nat_record *lrnat_rec,
     lrnat_rec->nbr_uuid = od->nbr->header_.uuid;
 
     shash_init(&lrnat_rec->snat_ips);
-    sset_init(&lrnat_rec->external_ips);
-    for (size_t i = 0; i < od->nbr->n_nat; i++) {
-        sset_add(&lrnat_rec->external_ips, od->nbr->nat[i]->external_ip);
-    }
 
     sset_init(&lrnat_rec->external_macs);
     lrnat_rec->has_distributed_nat = false;
@@ -343,7 +339,6 @@ lr_nat_record_clear(struct lr_nat_record *lrnat_rec)
     }
 
     free(lrnat_rec->nat_entries);
-    sset_destroy(&lrnat_rec->external_ips);
     sset_destroy(&lrnat_rec->external_macs);
 }
 
diff --git a/northd/en-lr-nat.h b/northd/en-lr-nat.h
index 71d3a1782d..81a7b0abd7 100644
--- a/northd/en-lr-nat.h
+++ b/northd/en-lr-nat.h
@@ -64,9 +64,6 @@ struct lr_nat_record {
 
     bool has_distributed_nat;
 
-    /* Set of nat external ips on the router. */
-    struct sset external_ips;
-
     /* Set of nat external macs on the router. */
     struct sset external_macs;
 
diff --git a/northd/en-lr-stateful.c b/northd/en-lr-stateful.c
index 6d0192487c..baf1bd2f89 100644
--- a/northd/en-lr-stateful.c
+++ b/northd/en-lr-stateful.c
@@ -82,7 +82,6 @@ static void remove_lrouter_lb_reachable_ips(struct lr_stateful_record *,
                                             enum lb_neighbor_responder_mode,
                                             const struct sset *lb_ips_v4,
                                             const struct sset *lb_ips_v6);
-static bool lr_stateful_rebuild_vip_nats(struct lr_stateful_record *);
 
 /* 'lr_stateful' engine node manages the NB logical router LB data.
  */
@@ -110,7 +109,6 @@ en_lr_stateful_clear_tracked_data(void *data_)
     struct ed_type_lr_stateful *data = data_;
 
     hmapx_clear(&data->trk_data.crupdated);
-    data->trk_data.vip_nats_changed = false;
 }
 
 void
@@ -198,10 +196,6 @@ lr_stateful_lb_data_handler(struct engine_node *node, void *data_)
 
             /* Add the lr_stateful_rec rec to the tracking data. */
             hmapx_add(&data->trk_data.crupdated, lr_stateful_rec);
-
-            if (!sset_is_empty(&lr_stateful_rec->vip_nats)) {
-                data->trk_data.vip_nats_changed = true;
-            }
             continue;
         }
 
@@ -319,9 +313,6 @@ lr_stateful_lb_data_handler(struct engine_node *node, void *data_)
     HMAPX_FOR_EACH (hmapx_node, &data->trk_data.crupdated) {
         struct lr_stateful_record *lr_stateful_rec = hmapx_node->data;
 
-        if (lr_stateful_rebuild_vip_nats(lr_stateful_rec)) {
-            data->trk_data.vip_nats_changed = true;
-        }
         const struct ovn_datapath *od =
             ovn_datapaths_find_by_index(input_data.lr_datapaths,
                                         lr_stateful_rec->lr_index);
@@ -360,13 +351,6 @@ lr_stateful_lr_nat_handler(struct engine_node *node, void *data_)
             lr_stateful_record_create(&data->table, lrnat_rec, od,
                                       input_data.lb_datapaths_map,
                                       input_data.lbgrp_datapaths_map);
-        if (!sset_is_empty(&lr_stateful_rec->vip_nats)) {
-            data->trk_data.vip_nats_changed = true;
-        }
-    } else {
-        if (lr_stateful_rebuild_vip_nats(lr_stateful_rec)) {
-            data->trk_data.vip_nats_changed = true;
-        }
     }
 
     /* Add the lr_stateful_rec rec to the tracking data. */
@@ -525,11 +509,6 @@ lr_stateful_record_create(struct lr_stateful_table *table,
         build_lrouter_lb_reachable_ips(lr_stateful_rec, od, lb_dps->lb);
     }
 
-    sset_init(&lr_stateful_rec->vip_nats);
-
-    if (nbr->n_nat) {
-        lr_stateful_rebuild_vip_nats(lr_stateful_rec);
-    }
     lr_stateful_rec->has_lb_vip = od_has_lb_vip(od);
 
     hmap_insert(&table->entries, &lr_stateful_rec->key_node,
@@ -556,7 +535,6 @@ static void
 lr_stateful_record_destroy(struct lr_stateful_record *lr_stateful_rec)
 {
     ovn_lb_ip_set_destroy(lr_stateful_rec->lb_ips);
-    sset_destroy(&lr_stateful_rec->vip_nats);
     lflow_ref_destroy(lr_stateful_rec->lflow_ref);
     free(lr_stateful_rec);
 }
@@ -671,32 +649,3 @@ remove_lrouter_lb_reachable_ips(struct lr_stateful_record *lr_stateful_rec,
                                                   ip_address);
         }
     }
-
-static bool
-lr_stateful_rebuild_vip_nats(struct lr_stateful_record *lr_stateful_rec)
-{
-    struct sset old_vip_nats = SSET_INITIALIZER(&old_vip_nats);
-    sset_swap(&lr_stateful_rec->vip_nats, &old_vip_nats);
-
-    const char *external_ip;
-    SSET_FOR_EACH (external_ip, &lr_stateful_rec->lrnat_rec->external_ips) {
-        bool is_vip_nat = false;
-        if (addr_is_ipv6(external_ip)) {
-            is_vip_nat = sset_contains(&lr_stateful_rec->lb_ips->ips_v6,
-                                       external_ip);
-        } else {
-            is_vip_nat = sset_contains(&lr_stateful_rec->lb_ips->ips_v4,
-                                       external_ip);
-        }
-
-        if (is_vip_nat) {
-            sset_add(&lr_stateful_rec->vip_nats, external_ip);
-        }
-    }
-
-    bool vip_nats_changed = !sset_equals(&lr_stateful_rec->vip_nats,
-                                         &old_vip_nats);
-    sset_destroy(&old_vip_nats);
-
-    return vip_nats_changed;
-}
diff --git a/northd/en-lr-stateful.h b/northd/en-lr-stateful.h
index 4a07174c31..e15c61e575 100644
--- a/northd/en-lr-stateful.h
+++ b/northd/en-lr-stateful.h
@@ -62,9 +62,6 @@ struct lr_stateful_record {
     /* Load Balancer vIPs relevant for this datapath. */
     struct ovn_lb_ip_set *lb_ips;
 
-    /* sset of vips which are also part of lr nats. */
-    struct sset vip_nats;
-
     /* 'lflow_ref' is used to reference logical flows generated for
      * this lr_stateful_record.
      *
@@ -107,10 +104,6 @@ struct lr_stateful_table {
 struct lr_stateful_tracked_data {
     /* Created or updated logical router with LB and/or NAT data. */
     struct hmapx crupdated; /* Stores 'struct lr_stateful_record'. */
-
-    /* Indicates if any router's NATs changed which were also LB vips
-     * or vice versa. */
-    bool vip_nats_changed;
 };
 
 struct ed_type_lr_stateful {
@@ -142,7 +135,7 @@ const struct lr_stateful_record *lr_stateful_table_find_by_index(
 static inline bool
 lr_stateful_has_tracked_data(struct lr_stateful_tracked_data *trk_data)
 {
-    return !hmapx_is_empty(&trk_data->crupdated) || trk_data->vip_nats_changed;
+    return !hmapx_is_empty(&trk_data->crupdated);
 }
 
 static inline bool
diff --git a/northd/northd.c b/northd/northd.c
index 2c3560ce2d..4b39137e7c 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -11061,7 +11061,6 @@ build_lrouter_nat_flows_for_lb(
     struct ds skip_snat_act = DS_EMPTY_INITIALIZER;
     struct ds force_snat_act = DS_EMPTY_INITIALIZER;
     struct ds undnat_match = DS_EMPTY_INITIALIZER;
-    struct ds unsnat_match = DS_EMPTY_INITIALIZER;
     struct ds gw_redir_action = DS_EMPTY_INITIALIZER;
 
     ds_clear(match);
@@ -11107,13 +11106,6 @@ build_lrouter_nat_flows_for_lb(
     /* Remove the trailing " || ". */
     ds_truncate(&undnat_match, undnat_match.length - 4);
 
-    ds_put_format(&unsnat_match, "%s && %s.dst == %s && %s",
-                  ip_match, ip_match, lb_vip->vip_str, lb->proto);
-    if (lb_vip->port_str) {
-        ds_put_format(&unsnat_match, " && %s.dst == %s", lb->proto,
-                      lb_vip->port_str);
-    }
-
     struct lrouter_nat_lb_flows_ctx ctx = {
         .lb_vip = lb_vip,
         .lb = lb,
@@ -11171,23 +11163,6 @@ build_lrouter_nat_flows_for_lb(
         if (lb->affinity_timeout) {
             bitmap_set1(aff_dp_bitmap[type], index);
         }
-
-        if (sset_contains(&lrnat_rec->external_ips, lb_vip->vip_str)) {
-            /* The load balancer vip is also present in the NAT entries.
-             * So add a high priority lflow to advance the the packet
-             * destined to the vip (and the vip port if defined)
-             * in the S_ROUTER_IN_UNSNAT stage.
-             * There seems to be an issue with ovs-vswitchd. When the new
-             * connection packet destined for the lb vip is received,
-             * it is dnat'ed in the S_ROUTER_IN_DNAT stage in the dnat
-             * conntrack zone. For the next packet, if it goes through
-             * unsnat stage, the conntrack flags are not set properly, and
-             * it doesn't hit the established state flows in
-             * S_ROUTER_IN_DNAT stage. */
-            ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_UNSNAT, 120,
-                                    ds_cstr(&unsnat_match), "next;",
-                                    &lb->nlb->header_, lb_dps->lflow_ref);
-        }
     }
 
     for (size_t type = 0; type < LROUTER_NAT_LB_FLOW_MAX; type++) {
@@ -11199,7 +11174,6 @@ build_lrouter_nat_flows_for_lb(
                                      lr_datapaths, lb_dps->lflow_ref);
     }
 
-    ds_destroy(&unsnat_match);
     ds_destroy(&undnat_match);
     ds_destroy(&skip_snat_act);
     ds_destroy(&force_snat_act);
@@ -15070,12 +15044,7 @@ build_lrouter_nat_defrag_and_lb(
                                        l3dgw_port, stateless, lflow_ref);
 
         /* ARP resolve for NAT IPs. */
-        if (od->is_gw_router) {
-            /* Add the NAT external_ip to the nat_entries for
-             * gateway routers. This is required for adding load balancer
-             * flows.*/
-            sset_add(&nat_entries, nat->external_ip);
-        } else {
+        if (!od->is_gw_router) {
             if (!sset_contains(&nat_entries, nat->external_ip)) {
                 /* Drop packets coming in from external that still has
                  * destination IP equals to the NAT external IP, to avoid loop.
diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
index 6fdd761dac..c7f65b9666 100644
--- a/tests/ovn-northd.at
+++ b/tests/ovn-northd.at
@@ -1610,9 +1610,6 @@ AT_CAPTURE_FILE([sbflows])
 # dnat_and_snat or snat entry.
 AT_CHECK([grep "lr_in_unsnat" sbflows | ovn_strip_lflows], [0], [dnl
   table=??(lr_in_unsnat ), priority=0 , match=(1), action=(next;)
-  table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.1 && tcp && tcp.dst == 8080), action=(next;)
-  table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.4 && udp && udp.dst == 8080), action=(next;)
-  table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.5 && tcp && tcp.dst == 8080), action=(next;)
   table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.1), action=(ct_snat;)
   table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.4), action=(ct_snat;)
 ])
@@ -1643,9 +1640,6 @@ AT_CAPTURE_FILE([sbflows])
 # dnat_and_snat or snat entry.
 AT_CHECK([grep "lr_in_unsnat" sbflows | ovn_strip_lflows], [0], [dnl
   table=??(lr_in_unsnat ), priority=0 , match=(1), action=(next;)
-  table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.1 && tcp && tcp.dst == 8080), action=(next;)
-  table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.4 && udp && udp.dst == 8080), action=(next;)
-  table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.5 && tcp && tcp.dst == 8080), action=(next;)
   table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.1), action=(ct_snat;)
   table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.4), action=(ct_snat;)
 ])
@@ -5868,7 +5862,6 @@ AT_CHECK([grep "lr_in_unsnat" lr0flows | ovn_strip_lflows], [0], [dnl
   table=??(lr_in_unsnat ), priority=0 , match=(1), action=(next;)
   table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-public" && ip4.dst == 172.168.0.10), action=(ct_snat;)
   table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-sw0" && ip4.dst == 10.0.0.1), action=(ct_snat;)
-  table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 172.168.0.10 && tcp && tcp.dst == 9082), action=(next;)
   table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.10), action=(ct_snat;)
   table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.20), action=(ct_snat;)
   table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.30), action=(ct_snat;)
@@ -5945,7 +5938,6 @@ AT_CHECK([grep "lr_in_unsnat" lr0flows | ovn_strip_lflows], [0], [dnl
   table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-public" && ip6.dst == def0::10), action=(ct_snat;)
   table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-sw0" && ip4.dst == 10.0.0.1), action=(ct_snat;)
   table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-sw0" && ip6.dst == aef0::1), action=(ct_snat;)
-  table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 172.168.0.10 && tcp && tcp.dst == 9082), action=(next;)
   table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.10), action=(ct_snat;)
   table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.20), action=(ct_snat;)
   table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.30), action=(ct_snat;)
@@ -12010,7 +12002,7 @@ check ovn-nbctl lb-add lb2 172.168.0.150:80 10.0.0.40:8080
 check ovn-nbctl lr-lb-add lr0 lb1
 check ovn-nbctl lr-lb-add lr0 lb2
 
-# lflow engine should recompute since the nat ip 172.168.0.140
+# lflow engine should not recompute even if the nat ip 172.168.0.140
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-add lr0 dnat_and_snat 172.168.0.140 10.0.0.20
@@ -12019,10 +12011,10 @@ check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
-# lflow engine should recompute since the nat ip 172.168.0.150
+# lflow engine should not recompute even if the nat ip 172.168.0.150
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-add lr0 dnat_and_snat 172.168.0.150 10.0.0.41
@@ -12031,10 +12023,10 @@ check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
-# lflow engine should recompute since the deleted nat ip 172.168.0.150
+# lflow engine should not recompute even if the deleted nat ip 172.168.0.150
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-del lr0 dnat_and_snat 172.168.0.150
@@ -12043,10 +12035,10 @@ check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
-# lflow engine should recompute since the deleted nat ip 172.168.0.140
+# lflow engine should not recompute even if the deleted nat ip 172.168.0.140
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-del lr0 dnat_and_snat 172.168.0.140
@@ -12055,7 +12047,7 @@ check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
 # Delete the NAT
diff --git a/tests/system-ovn.at b/tests/system-ovn.at
index c22c7882fc..2411b0267e 100644
--- a/tests/system-ovn.at
+++ b/tests/system-ovn.at
@@ -12096,6 +12096,86 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
 /connection dropped.*/d"])
 AT_CLEANUP
 
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([load balancing in gateway router - client behind LB with SNAT])
+AT_SKIP_IF([test $HAVE_NC = no])
+AT_KEYWORDS([lb])
+
+ovn_start
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-int])
+
+check ovs-vsctl \
+    -- set Open_vSwitch . external-ids:system-id=hv1 \
+    -- set Open_vSwitch . external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
+    -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
+    -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
+    -- set bridge br-int fail-mode=secure other-config:disable-in-band=true
+
+start_daemon ovn-controller
+
+check ovn-nbctl lr-add lr \
+    -- set logical_router lr options:chassis=hv1
+check ovn-nbctl lrp-add lr lr-ls1 00:00:00:00:01:00 41.41.41.2/24
+check ovn-nbctl lrp-add lr lr-ls2 00:00:00:00:02:00 42.42.42.2/24
+check ovn-nbctl ls-add ls1
+check ovn-nbctl ls-add ls2
+
+check ovn-nbctl lsp-add ls1 ls1-lr
+check ovn-nbctl lsp-set-addresses ls1-lr 00:00:00:00:01:00
+check ovn-nbctl lsp-set-type ls1-lr router
+check ovn-nbctl lsp-set-options ls1-lr router-port=lr-ls1
+check ovn-nbctl lsp-add ls1 vm1
+check ovn-nbctl lsp-set-addresses vm1 00:00:00:00:00:01
+
+check ovn-nbctl lsp-add ls2 ls2-lr
+check ovn-nbctl lsp-set-addresses ls2-lr 00:00:00:00:02:00
+check ovn-nbctl lsp-set-type ls2-lr router
+check ovn-nbctl lsp-set-options ls2-lr router-port=lr-ls2
+check ovn-nbctl lsp-add ls2 vm2
+check ovn-nbctl lsp-set-addresses vm2 00:00:00:00:00:02
+
+dnl LB using the router IP connected to vm2 as VIP.
+check ovn-nbctl lb-add lb-test 42.42.42.2:8080 41.41.41.1:8080 tcp \
+    -- lr-lb-add lr lb-test
+
+dnl SNAT everything coming from vm1 to the router IP (towards vm2).
+check ovn-nbctl lr-nat-add lr snat 42.42.42.2 41.41.41.1
+
+ADD_NAMESPACES(vm1)
+ADD_VETH(vm1, vm1, br-int, "41.41.41.1/24", "00:00:00:00:00:01", "41.41.41.2")
+
+ADD_NAMESPACES(vm2)
+ADD_VETH(vm2, vm2, br-int, "42.42.42.1/24", "00:00:00:00:00:02", "42.42.42.2")
+
+dnl Start a server on vm2.
+NETNS_DAEMONIZE([vm2], [nc -l -k 42.42.42.1 80], [vm2.pid])
+
+dnl Wait for ovn-controller to catch up.
+wait_for_ports_up
+check ovn-nbctl --wait=hv sync
+
+dnl Test the connection originating something that uses the same source port
+dnl as the LB VIP.
+NS_CHECK_EXEC([vm1], [nc -z -p 8080 42.42.42.1 80], 0, [ignore], [ignore])
+
+OVS_APP_EXIT_AND_WAIT([ovn-controller])
+
+as ovn-sb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as ovn-nb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as northd
+OVS_APP_EXIT_AND_WAIT([ovn-northd])
+
+as
+OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
+/connection dropped.*/d"])
+AT_CLEANUP
+])
+
 OVN_FOR_EACH_NORTHD([
 AT_SETUP([load balancing affinity sessions - auto clear learnt flows])
 AT_SKIP_IF([test $HAVE_NC = no])
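(With the patch applied, the priority-120 lr_in_unsnat "skip" flows shown in
the test updates above should no longer be generated. Assuming a gateway
router named lr0, this can be sanity-checked on a live deployment along
these lines.)

    # Expect only the priority 0/90/110 lr_in_unsnat entries; the
    # priority-120 "next;" flows for LB VIPs should be gone.
    ovn-sbctl lflow-list lr0 | grep lr_in_unsnat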