Message ID | 20240226125830.2173125-1-dceara@redhat.com |
---|---|
State | Accepted |
Series | [ovs-dev] northd: Don't skip the unSNAT stage for traffic towards VIPs. |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/github-robot-_ovn-kubernetes | success | github build: passed |
On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> Otherwise, in case there's also a SNAT rule that uses the VIP as
> external IP, we break sessions initiated from behind the VIP.
>
> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for
> load balancer vips in router ingress pipeline").  That's OK because
> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with
> DNAT/Load Balancers") addressed the original issue in a better way:
>
> In the reply direction, the order of traversal of the tables
> "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
> datapath flows that check ct_state in the wrong conntrack zone.
> This is illustrated below where reply traffic enters the physical host
> port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
> DNAT zone and then on to Logical Switch Port zone (22).  The third
> flow is incorrectly checking the state from the SNAT zone instead
> of the DNAT zone.
>
> We also add a system test to ensure traffic initiated from behind a VIP
> + SNAT is not broken.
>
> Another nice side effect is that the northd I-P is slightly simplified
> because we don't need to track NAT external IPs anymore.
>
> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
> Reported-at: https://issues.redhat.com/browse/FDP-291
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>

Thanks for the fix.  It also simplified the lr-nat-stateful code.

Acked-by: Numan Siddique <numans@ovn.org>

Numan

> ---
>  northd/en-lflow.c       |  3 +-
>  northd/en-lr-nat.c      |  5 ---
>  northd/en-lr-nat.h      |  3 --
>  northd/en-lr-stateful.c | 51 --------------------------
>  northd/en-lr-stateful.h |  9 +----
>  northd/northd.c         | 33 +----------------
>  tests/ovn-northd.at     | 24 +++++--------
>  tests/system-ovn.at     | 80 +++++++++++++++++++++++++++++++++++++++++
>  8 files changed, 91 insertions(+), 117 deletions(-)
>
> [...]
On 3/5/24 15:56, Numan Siddique wrote:
> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> Otherwise, in case there's also a SNAT rule that uses the VIP as
>> external IP, we break sessions initiated from behind the VIP.
>>
>> [...]
>
> Thanks for the fix.  It also simplified the lr-nat-stateful code.
>
> Acked-by: Numan Siddique <numans@ovn.org>
>

Thanks, Numan!

Applied to main and backported to all branches down to 22.03.

Regards,
Dumitru
On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 3/5/24 15:56, Numan Siddique wrote:
> > [...]
> >
> > Thanks for the fix.  It also simplified the lr-nat-stateful code.
> >
> > Acked-by: Numan Siddique <numans@ovn.org>
>
> Thanks, Numan!
>
> Applied to main and backported to all branches down to 22.03.
>
> Regards,
> Dumitru

Hi Dumitru, Numan, Tim and folks,

I noticed that the HW offload of k8s nodePort traffic is broken due to
this change.  The reason is that for client to nodePort (LB with VIP
being the node IP) traffic, when the packet is going through the
unSNAT stage in the SNAT CT zone, since the entry is never committed
to the SNAT zone, it will have CT state returned as "new", which
prevents HW offload from working for such packets.

At the moment I have to revert this change in our downstream.  For the
problem that was fixed by this change [0], I think we can avoid it by
separating the port ranges of SNAT and DNAT.  For DNAT, the nodePort
range in k8s is configured by the API-server option:

  --service-node-port-range <a string in the form 'N1-N2'>    Default: 30000-32767

For SNAT, it can be configured in the OVN NAT table's
external_port_range column, and we can choose something like
10000-30000.  An extra benefit of this is that it reduces a CT recirc.

Alternative solutions are:

Alternative 1: Always commit to both SNAT and DNAT zones.  This would
introduce the unnecessary cost of CT entries and an extra CT recirc.

Alternative 2: Use a common zone for SNAT and DNAT.  But there are other
issues reported for using the common zone [1].

Could you let me know if there are other thoughts on this?

[0] https://issues.redhat.com/browse/FDP-291
[1] b8c40e7593 https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4

Thanks,
Han
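To make the port-range separation concrete, here is a minimal sketch; the
router name, addresses, and the exact ranges are illustrative assumptions,
not values agreed in this thread:

  # DNAT side: k8s nodePorts stay in the API server's default range,
  # e.g. kube-apiserver --service-node-port-range=30000-32767.
  #
  # SNAT side: constrain the NB NAT row's external_port_range column so
  # SNAT'ed source ports can never collide with a nodePort VIP port.
  ovn-nbctl lr-nat-add lr0 snat 172.16.0.100 10.0.0.0/24
  nat=$(ovn-nbctl --bare --columns=_uuid find NAT external_ip=172.16.0.100)
  ovn-nbctl set NAT "$nat" external_port_range="10000-30000"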
On 8/29/24 18:14, Han Zhou wrote:
> On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> [...]
>>
>> Applied to main and backported to all branches down to 22.03.
>>
>> Regards,
>> Dumitru
>
> Hi Dumitru, Numan, Tim and folks,
>

Hi Han,

> I noticed that the HW offload of k8s nodePort traffic is broken due to
> this change.  The reason is that for client to nodePort (LB with VIP
> being the node IP) traffic, when the packet is going through the
> unSNAT stage in the SNAT CT zone, since the entry is never committed
> to the SNAT zone, it will have CT state returned as "new", which
> prevents HW offload from working for such packets.
>

Sorry about that, I forgot we don't always commit in the SNAT zone.

> At the moment I have to revert this change in our downstream.  For the
> problem that was fixed by this change [0], I think we can avoid it by
> separating the port ranges of SNAT and DNAT.  For DNAT, the nodePort
> range in k8s is configured by the API-server option:
>
>   --service-node-port-range <a string in the form 'N1-N2'>    Default: 30000-32767
>
> For SNAT, it can be configured in the OVN NAT table's
> external_port_range column, and we can choose something like
> 10000-30000.
>

Tim, does this look OK to you?  If it's acceptable to limit the SNAT
port range this workaround should be fine.

> An extra benefit of this is that it reduces a CT recirc.
>
> Alternative solutions are:
>
> Alternative 1: Always commit to both SNAT and DNAT zones.  This would
> introduce the unnecessary cost of CT entries and an extra CT recirc.

Extra recirculation aside, I would actually love it if we could use this
alternative.  I think it's the "most correct" option.  I think it would
allow us to avoid other workarounds like:

https://github.com/ovn-org/ovn/commit/40136a2f2c8

or

https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amusil@redhat.com/

I do understand the worry about the extra recirculation though.  In the
HWOL context does that cause visible performance impact?

We'd probably have to do more performance testing without HWOL to figure
out the impact in the software datapath.

> Alternative 2: Use a common zone for SNAT and DNAT.  But there are other
> issues reported for using the common zone [1].
>
> Could you let me know if there are other thoughts on this?
>

On a related note, I know it has been discussed in different settings
but I don't think this ever moved forward: would it be possible for
NVIDIA to help out with automatically testing HWOL impact for incoming
patches?

Maybe we could add some "simple" system-like tests that ensure that
traffic is correctly offloaded in common scenarios?  Alternatively, I
guess we could also tag a subset of the existing system tests and just
run those on actual hardware?

It's quite simple (AFAIU) for external CIs to report status on each OVN
patch posted on patchwork.  That would at least allow us to flag this
kind of breakage early (even before it gets merged).

What do you think?

> [0] https://issues.redhat.com/browse/FDP-291
> [1] b8c40e7593 https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
>
> Thanks,
> Han

Thanks,
Dumitru
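A rough sketch of the offload assertion such a system test could make; the
VIP address is taken from the test added by this patch, and treating a
missing entry in the offloaded-flow dump as failure is an assumption about
one possible check, not an agreed CI design:

  # Enable offload before starting traffic (TC/HW offload capable setup).
  ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
  # ... drive a connection through the LB VIP here ...
  # Connections still handled in software won't show up in this dump:
  ovs-appctl dpctl/dump-flows type=offloaded | grep 42.42.42.2 \
      || echo "LB connection was not offloaded"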
Hi Han,

My understanding is that the unSNAT stage was originally being skipped
for LB ports, then Dumitru changed it to attempt unSNAT, and this breaks
HWOL for ingress connections that never need to get unSNAT'ed and are
therefore not committed.

In the OVNK use case, for pods simply connecting egress, almost all
packets will be SNAT'ed to the node IP.  Therefore on the egress reply,
we always enter the unSNAT stage for these packets.  So skipping the
unSNAT stage seems to be specific to LB-related traffic.

While I agree splitting the port range would work, I think it makes more
sense to just always commit in the unSNAT stage for all traffic.  I get
that there is a performance hit there, but the datapath pipeline seems
more consistent, and I think that outweighs the cost of committing LB
traffic.

Thanks,

Tim Rozet
Red Hat OpenShift Networking Team

On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dceara@redhat.com> wrote:

> On 8/29/24 18:14, Han Zhou wrote:
> > I noticed that the HW offload of k8s nodePort traffic is broken due to
> > this change. [...]
>
> Sorry about that, I forgot we don't always commit in the SNAT zone.
>
> [...]
>
> Thanks,
> Dumitru
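As a side note for anyone reproducing the ct_state behavior discussed
above, conntrack can be dumped per zone; the zone number below is a
placeholder, since real zone IDs are allocated per datapath at runtime:

  # With no commit in the unSNAT stage, the SNAT-zone dump has no entry
  # for the LB connection, so its packets keep showing up as ct.new.
  ovs-appctl dpctl/dump-conntrack zone=14 | grep 42.42.42.2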
diff --git a/northd/en-lflow.c b/northd/en-lflow.c
index f1a83839df..c4b927fb8c 100644
--- a/northd/en-lflow.c
+++ b/northd/en-lflow.c
@@ -174,8 +174,7 @@ lflow_lr_stateful_handler(struct engine_node *node, void *data)
     struct ed_type_lr_stateful *lr_sful_data =
         engine_get_input_data("lr_stateful", node);
 
-    if (!lr_stateful_has_tracked_data(&lr_sful_data->trk_data)
-        || lr_sful_data->trk_data.vip_nats_changed) {
+    if (!lr_stateful_has_tracked_data(&lr_sful_data->trk_data)) {
         return false;
     }
 
diff --git a/northd/en-lr-nat.c b/northd/en-lr-nat.c
index ad11025c69..215d924e42 100644
--- a/northd/en-lr-nat.c
+++ b/northd/en-lr-nat.c
@@ -219,10 +219,6 @@ lr_nat_record_init(struct lr_nat_record *lrnat_rec,
     lrnat_rec->nbr_uuid = od->nbr->header_.uuid;
 
     shash_init(&lrnat_rec->snat_ips);
-    sset_init(&lrnat_rec->external_ips);
-    for (size_t i = 0; i < od->nbr->n_nat; i++) {
-        sset_add(&lrnat_rec->external_ips, od->nbr->nat[i]->external_ip);
-    }
 
     sset_init(&lrnat_rec->external_macs);
     lrnat_rec->has_distributed_nat = false;
@@ -343,7 +339,6 @@ lr_nat_record_clear(struct lr_nat_record *lrnat_rec)
     }
 
     free(lrnat_rec->nat_entries);
-    sset_destroy(&lrnat_rec->external_ips);
     sset_destroy(&lrnat_rec->external_macs);
 }
 
diff --git a/northd/en-lr-nat.h b/northd/en-lr-nat.h
index 71d3a1782d..81a7b0abd7 100644
--- a/northd/en-lr-nat.h
+++ b/northd/en-lr-nat.h
@@ -64,9 +64,6 @@ struct lr_nat_record {
 
     bool has_distributed_nat;
 
-    /* Set of nat external ips on the router. */
-    struct sset external_ips;
-
     /* Set of nat external macs on the router. */
     struct sset external_macs;
 
diff --git a/northd/en-lr-stateful.c b/northd/en-lr-stateful.c
index 6d0192487c..baf1bd2f89 100644
--- a/northd/en-lr-stateful.c
+++ b/northd/en-lr-stateful.c
@@ -82,7 +82,6 @@ static void remove_lrouter_lb_reachable_ips(struct lr_stateful_record *,
                                             enum lb_neighbor_responder_mode,
                                             const struct sset *lb_ips_v4,
                                             const struct sset *lb_ips_v6);
-static bool lr_stateful_rebuild_vip_nats(struct lr_stateful_record *);
 
 /* 'lr_stateful' engine node manages the NB logical router LB data.
  */
@@ -110,7 +109,6 @@ en_lr_stateful_clear_tracked_data(void *data_)
     struct ed_type_lr_stateful *data = data_;
 
     hmapx_clear(&data->trk_data.crupdated);
-    data->trk_data.vip_nats_changed = false;
 }
 
 void
@@ -198,10 +196,6 @@ lr_stateful_lb_data_handler(struct engine_node *node, void *data_)
 
             /* Add the lr_stateful_rec rec to the tracking data. */
             hmapx_add(&data->trk_data.crupdated, lr_stateful_rec);
-
-            if (!sset_is_empty(&lr_stateful_rec->vip_nats)) {
-                data->trk_data.vip_nats_changed = true;
-            }
             continue;
         }
 
@@ -319,9 +313,6 @@ lr_stateful_lb_data_handler(struct engine_node *node, void *data_)
 
     HMAPX_FOR_EACH (hmapx_node, &data->trk_data.crupdated) {
         struct lr_stateful_record *lr_stateful_rec = hmapx_node->data;
-        if (lr_stateful_rebuild_vip_nats(lr_stateful_rec)) {
-            data->trk_data.vip_nats_changed = true;
-        }
         const struct ovn_datapath *od =
             ovn_datapaths_find_by_index(input_data.lr_datapaths,
                                         lr_stateful_rec->lr_index);
@@ -360,13 +351,6 @@ lr_stateful_lr_nat_handler(struct engine_node *node, void *data_)
                 lr_stateful_record_create(&data->table, lrnat_rec, od,
                                           input_data.lb_datapaths_map,
                                           input_data.lbgrp_datapaths_map);
-            if (!sset_is_empty(&lr_stateful_rec->vip_nats)) {
-                data->trk_data.vip_nats_changed = true;
-            }
-        } else {
-            if (lr_stateful_rebuild_vip_nats(lr_stateful_rec)) {
-                data->trk_data.vip_nats_changed = true;
-            }
         }
 
         /* Add the lr_stateful_rec rec to the tracking data. */
@@ -525,11 +509,6 @@ lr_stateful_record_create(struct lr_stateful_table *table,
         build_lrouter_lb_reachable_ips(lr_stateful_rec, od, lb_dps->lb);
     }
 
-    sset_init(&lr_stateful_rec->vip_nats);
-
-    if (nbr->n_nat) {
-        lr_stateful_rebuild_vip_nats(lr_stateful_rec);
-    }
     lr_stateful_rec->has_lb_vip = od_has_lb_vip(od);
 
     hmap_insert(&table->entries, &lr_stateful_rec->key_node,
@@ -556,7 +535,6 @@ static void
 lr_stateful_record_destroy(struct lr_stateful_record *lr_stateful_rec)
 {
     ovn_lb_ip_set_destroy(lr_stateful_rec->lb_ips);
-    sset_destroy(&lr_stateful_rec->vip_nats);
     lflow_ref_destroy(lr_stateful_rec->lflow_ref);
     free(lr_stateful_rec);
 }
@@ -671,32 +649,3 @@ remove_lrouter_lb_reachable_ips(struct lr_stateful_record *lr_stateful_rec,
                                       ip_address);
     }
 }
-
-static bool
-lr_stateful_rebuild_vip_nats(struct lr_stateful_record *lr_stateful_rec)
-{
-    struct sset old_vip_nats = SSET_INITIALIZER(&old_vip_nats);
-    sset_swap(&lr_stateful_rec->vip_nats, &old_vip_nats);
-
-    const char *external_ip;
-    SSET_FOR_EACH (external_ip, &lr_stateful_rec->lrnat_rec->external_ips) {
-        bool is_vip_nat = false;
-        if (addr_is_ipv6(external_ip)) {
-            is_vip_nat = sset_contains(&lr_stateful_rec->lb_ips->ips_v6,
-                                       external_ip);
-        } else {
-            is_vip_nat = sset_contains(&lr_stateful_rec->lb_ips->ips_v4,
-                                       external_ip);
-        }
-
-        if (is_vip_nat) {
-            sset_add(&lr_stateful_rec->vip_nats, external_ip);
-        }
-    }
-
-    bool vip_nats_changed = !sset_equals(&lr_stateful_rec->vip_nats,
-                                         &old_vip_nats);
-    sset_destroy(&old_vip_nats);
-
-    return vip_nats_changed;
-}
diff --git a/northd/en-lr-stateful.h b/northd/en-lr-stateful.h
index 4a07174c31..e15c61e575 100644
--- a/northd/en-lr-stateful.h
+++ b/northd/en-lr-stateful.h
@@ -62,9 +62,6 @@ struct lr_stateful_record {
     /* Load Balancer vIPs relevant for this datapath. */
     struct ovn_lb_ip_set *lb_ips;
 
-    /* sset of vips which are also part of lr nats. */
-    struct sset vip_nats;
-
     /* 'lflow_ref' is used to reference logical flows generated for
      * this lr_stateful_record.
      *
@@ -107,10 +104,6 @@ struct lr_stateful_table {
 struct lr_stateful_tracked_data {
     /* Created or updated logical router with LB and/or NAT data. */
     struct hmapx crupdated; /* Stores 'struct lr_stateful_record'. */
-
-    /* Indicates if any router's NATs changed which were also LB vips
-     * or vice versa. */
-    bool vip_nats_changed;
 };
 
 struct ed_type_lr_stateful {
@@ -142,7 +135,7 @@ const struct lr_stateful_record *lr_stateful_table_find_by_index(
 static inline bool
 lr_stateful_has_tracked_data(struct lr_stateful_tracked_data *trk_data)
 {
-    return !hmapx_is_empty(&trk_data->crupdated) || trk_data->vip_nats_changed;
+    return !hmapx_is_empty(&trk_data->crupdated);
 }
 
 static inline bool
diff --git a/northd/northd.c b/northd/northd.c
index 2c3560ce2d..4b39137e7c 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -11061,7 +11061,6 @@ build_lrouter_nat_flows_for_lb(
     struct ds skip_snat_act = DS_EMPTY_INITIALIZER;
     struct ds force_snat_act = DS_EMPTY_INITIALIZER;
     struct ds undnat_match = DS_EMPTY_INITIALIZER;
-    struct ds unsnat_match = DS_EMPTY_INITIALIZER;
     struct ds gw_redir_action = DS_EMPTY_INITIALIZER;
 
     ds_clear(match);
@@ -11107,13 +11106,6 @@ build_lrouter_nat_flows_for_lb(
     /* Remove the trailing " || ". */
     ds_truncate(&undnat_match, undnat_match.length - 4);
 
-    ds_put_format(&unsnat_match, "%s && %s.dst == %s && %s",
-                  ip_match, ip_match, lb_vip->vip_str, lb->proto);
-    if (lb_vip->port_str) {
-        ds_put_format(&unsnat_match, " && %s.dst == %s", lb->proto,
-                      lb_vip->port_str);
-    }
-
     struct lrouter_nat_lb_flows_ctx ctx = {
         .lb_vip = lb_vip,
         .lb = lb,
@@ -11171,23 +11163,6 @@ build_lrouter_nat_flows_for_lb(
         if (lb->affinity_timeout) {
             bitmap_set1(aff_dp_bitmap[type], index);
         }
-
-        if (sset_contains(&lrnat_rec->external_ips, lb_vip->vip_str)) {
-            /* The load balancer vip is also present in the NAT entries.
-             * So add a high priority lflow to advance the the packet
-             * destined to the vip (and the vip port if defined)
-             * in the S_ROUTER_IN_UNSNAT stage.
-             * There seems to be an issue with ovs-vswitchd. When the new
-             * connection packet destined for the lb vip is received,
-             * it is dnat'ed in the S_ROUTER_IN_DNAT stage in the dnat
-             * conntrack zone. For the next packet, if it goes through
-             * unsnat stage, the conntrack flags are not set properly, and
-             * it doesn't hit the established state flows in
-             * S_ROUTER_IN_DNAT stage. */
-            ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_UNSNAT, 120,
-                                    ds_cstr(&unsnat_match), "next;",
-                                    &lb->nlb->header_, lb_dps->lflow_ref);
-        }
     }
 
     for (size_t type = 0; type < LROUTER_NAT_LB_FLOW_MAX; type++) {
@@ -11199,7 +11174,6 @@ build_lrouter_nat_flows_for_lb(
                                   lr_datapaths, lb_dps->lflow_ref);
     }
 
-    ds_destroy(&unsnat_match);
     ds_destroy(&undnat_match);
     ds_destroy(&skip_snat_act);
     ds_destroy(&force_snat_act);
@@ -15070,12 +15044,7 @@ build_lrouter_nat_defrag_and_lb(
                                   l3dgw_port, stateless, lflow_ref);
 
         /* ARP resolve for NAT IPs. */
-        if (od->is_gw_router) {
-            /* Add the NAT external_ip to the nat_entries for
-             * gateway routers. This is required for adding load balancer
-             * flows.*/
-            sset_add(&nat_entries, nat->external_ip);
-        } else {
+        if (!od->is_gw_router) {
             if (!sset_contains(&nat_entries, nat->external_ip)) {
                 /* Drop packets coming in from external that still has
                  * destination IP equals to the NAT external IP, to avoid loop.
diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
index 6fdd761dac..c7f65b9666 100644
--- a/tests/ovn-northd.at
+++ b/tests/ovn-northd.at
@@ -1610,9 +1610,6 @@ AT_CAPTURE_FILE([sbflows])
 # dnat_and_snat or snat entry.
 AT_CHECK([grep "lr_in_unsnat" sbflows | ovn_strip_lflows], [0], [dnl
 table=??(lr_in_unsnat ), priority=0 , match=(1), action=(next;)
-table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.1 && tcp && tcp.dst == 8080), action=(next;)
-table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.4 && udp && udp.dst == 8080), action=(next;)
-table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.5 && tcp && tcp.dst == 8080), action=(next;)
 table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.1), action=(ct_snat;)
 table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.4), action=(ct_snat;)
 ])
@@ -1643,9 +1640,6 @@ AT_CAPTURE_FILE([sbflows])
 # dnat_and_snat or snat entry.
 AT_CHECK([grep "lr_in_unsnat" sbflows | ovn_strip_lflows], [0], [dnl
 table=??(lr_in_unsnat ), priority=0 , match=(1), action=(next;)
-table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.1 && tcp && tcp.dst == 8080), action=(next;)
-table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.4 && udp && udp.dst == 8080), action=(next;)
-table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 192.168.2.5 && tcp && tcp.dst == 8080), action=(next;)
 table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.1), action=(ct_snat;)
 table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 192.168.2.4), action=(ct_snat;)
 ])
@@ -5868,7 +5862,6 @@ AT_CHECK([grep "lr_in_unsnat" lr0flows | ovn_strip_lflows], [0], [dnl
 table=??(lr_in_unsnat ), priority=0 , match=(1), action=(next;)
 table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-public" && ip4.dst == 172.168.0.10), action=(ct_snat;)
 table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-sw0" && ip4.dst == 10.0.0.1), action=(ct_snat;)
-table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 172.168.0.10 && tcp && tcp.dst == 9082), action=(next;)
 table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.10), action=(ct_snat;)
 table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.20), action=(ct_snat;)
 table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.30), action=(ct_snat;)
@@ -5945,7 +5938,6 @@ AT_CHECK([grep "lr_in_unsnat" lr0flows | ovn_strip_lflows], [0], [dnl
 table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-public" && ip6.dst == def0::10), action=(ct_snat;)
 table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-sw0" && ip4.dst == 10.0.0.1), action=(ct_snat;)
 table=??(lr_in_unsnat ), priority=110 , match=(inport == "lr0-sw0" && ip6.dst == aef0::1), action=(ct_snat;)
-table=??(lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 172.168.0.10 && tcp && tcp.dst == 9082), action=(next;)
 table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.10), action=(ct_snat;)
 table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.20), action=(ct_snat;)
 table=??(lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.30), action=(ct_snat;)
@@ -12010,7 +12002,7 @@ check ovn-nbctl lb-add lb2 172.168.0.150:80 10.0.0.40:8080
 check ovn-nbctl lr-lb-add lr0 lb1
 check ovn-nbctl lr-lb-add lr0 lb2
 
-# lflow engine should recompute since the nat ip 172.168.0.140
+# lflow engine should not recompute even if the nat ip 172.168.0.140
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-add lr0 dnat_and_snat 172.168.0.140 10.0.0.20
@@ -12019,10 +12011,10 @@ check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
-# lflow engine should recompute since the nat ip 172.168.0.150
+# lflow engine should not recompute even if the nat ip 172.168.0.150
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-add lr0 dnat_and_snat 172.168.0.150 10.0.0.41
@@ -12031,10 +12023,10 @@ check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
-# lflow engine should recompute since the deleted nat ip 172.168.0.150
+# lflow engine should not recompute even if the deleted nat ip 172.168.0.150
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-del lr0 dnat_and_snat 172.168.0.150
@@ -12043,10 +12035,10 @@ check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
-# lflow engine should recompute since the deleted nat ip 172.168.0.140
+# lflow engine should not recompute even if the deleted nat ip 172.168.0.140
 # is a lb vip.
 check as northd ovn-appctl -t ovn-northd inc-engine/clear-stats
 check ovn-nbctl --wait=sb lr-nat-del lr0 dnat_and_snat 172.168.0.140
@@ -12055,7 +12047,7 @@ check_engine_stats lr_nat norecompute compute
 check_engine_stats lr_stateful norecompute compute
 check_engine_stats sync_to_sb_pb recompute nocompute
 check_engine_stats sync_to_sb_lb norecompute compute
-check_engine_stats lflow recompute nocompute
+check_engine_stats lflow norecompute compute
 CHECK_NO_CHANGE_AFTER_RECOMPUTE
 
 # Delete the NAT
diff --git a/tests/system-ovn.at b/tests/system-ovn.at
index c22c7882fc..2411b0267e 100644
--- a/tests/system-ovn.at
+++ b/tests/system-ovn.at
@@ -12096,6 +12096,86 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
 /connection dropped.*/d"])
 AT_CLEANUP
 
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([load balancing in gateway router - client behind LB with SNAT])
+AT_SKIP_IF([test $HAVE_NC = no])
+AT_KEYWORDS([lb])
+
+ovn_start
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-int])
+
+check ovs-vsctl \
+    -- set Open_vSwitch . external-ids:system-id=hv1 \
+    -- set Open_vSwitch . external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
+    -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
+    -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
+    -- set bridge br-int fail-mode=secure other-config:disable-in-band=true
+
+start_daemon ovn-controller
+
+check ovn-nbctl lr-add lr \
+    -- set logical_router lr options:chassis=hv1
+check ovn-nbctl lrp-add lr lr-ls1 00:00:00:00:01:00 41.41.41.2/24
+check ovn-nbctl lrp-add lr lr-ls2 00:00:00:00:02:00 42.42.42.2/24
+check ovn-nbctl ls-add ls1
+check ovn-nbctl ls-add ls2
+
+check ovn-nbctl lsp-add ls1 ls1-lr
+check ovn-nbctl lsp-set-addresses ls1-lr 00:00:00:00:01:00
+check ovn-nbctl lsp-set-type ls1-lr router
+check ovn-nbctl lsp-set-options ls1-lr router-port=lr-ls1
+check ovn-nbctl lsp-add ls1 vm1
+check ovn-nbctl lsp-set-addresses vm1 00:00:00:00:00:01
+
+check ovn-nbctl lsp-add ls2 ls2-lr
+check ovn-nbctl lsp-set-addresses ls2-lr 00:00:00:00:02:00
+check ovn-nbctl lsp-set-type ls2-lr router
+check ovn-nbctl lsp-set-options ls2-lr router-port=lr-ls2
+check ovn-nbctl lsp-add ls2 vm2
+check ovn-nbctl lsp-set-addresses vm2 00:00:00:00:00:02
+
+dnl LB using the router IP connected to vm2 as VIP.
+check ovn-nbctl lb-add lb-test 42.42.42.2:8080 41.41.41.1:8080 tcp \
+    -- lr-lb-add lr lb-test
+
+dnl SNAT everything coming from vm1 to the router IP (towards vm2).
+check ovn-nbctl lr-nat-add lr snat 42.42.42.2 41.41.41.1
+
+ADD_NAMESPACES(vm1)
+ADD_VETH(vm1, vm1, br-int, "41.41.41.1/24", "00:00:00:00:00:01", "41.41.41.2")
+
+ADD_NAMESPACES(vm2)
+ADD_VETH(vm2, vm2, br-int, "42.42.42.1/24", "00:00:00:00:00:02", "42.42.42.2")
+
+dnl Start a server on vm2.
+NETNS_DAEMONIZE([vm2], [nc -l -k 42.42.42.1 80], [vm2.pid])
+
+dnl Wait for ovn-controller to catch up.
+wait_for_ports_up
+check ovn-nbctl --wait=hv sync
+
+dnl Test the connection originating something that uses the same source port
+dnl as the LB VIP.
+NS_CHECK_EXEC([vm1], [nc -z -p 8080 42.42.42.1 80], 0, [ignore], [ignore])
+
+OVS_APP_EXIT_AND_WAIT([ovn-controller])
+
+as ovn-sb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as ovn-nb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as northd
+OVS_APP_EXIT_AND_WAIT([ovn-northd])
+
+as
+OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
+/connection dropped.*/d"])
+AT_CLEANUP
+])
+
 OVN_FOR_EACH_NORTHD([
 AT_SETUP([load balancing affinity sessions - auto clear learnt flows])
 AT_SKIP_IF([test $HAVE_NC = no])
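To run just the system test added above from a built OVN tree, the usual
testsuite mechanics apply (the `check-kernel` target and `-k` keyword
filtering are the standard OVS/OVN test entry points; use
`check-system-userspace` instead for the userspace datapath):

  make check-kernel TESTSUITEFLAGS="-k 'client behind LB with SNAT'"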
Otherwise, in case there's also a SNAT rule that uses the VIP as
external IP, we break sessions initiated from behind the VIP.

This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows for
load balancer vips in router ingress pipeline").  That's OK because
commit 384a7c6237da ("northd: Refactor Logical Flows for routers with
DNAT/Load Balancers") addressed the original issue in a better way:

In the reply direction, the order of traversal of the tables
"lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
datapath flows that check ct_state in the wrong conntrack zone.
This is illustrated below where reply traffic enters the physical host
port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
DNAT zone and then on to Logical Switch Port zone (22).  The third
flow is incorrectly checking the state from the SNAT zone instead
of the DNAT zone.

We also add a system test to ensure traffic initiated from behind a VIP
+ SNAT is not broken.

Another nice side effect is that the northd I-P is slightly simplified
because we don't need to track NAT external IPs anymore.

Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
Reported-at: https://issues.redhat.com/browse/FDP-291
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
 northd/en-lflow.c       |  3 +-
 northd/en-lr-nat.c      |  5 ---
 northd/en-lr-nat.h      |  3 --
 northd/en-lr-stateful.c | 51 --------------------------
 northd/en-lr-stateful.h |  9 +----
 northd/northd.c         | 33 +----------------
 tests/ovn-northd.at     | 24 +++++--------
 tests/system-ovn.at     | 80 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 91 insertions(+), 117 deletions(-)
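Condensed from the system test added by this patch, the broken scenario
needs only a VIP that doubles as an SNAT external IP (addresses, router,
and namespace names exactly as in the test):

  # LB VIP 42.42.42.2:8080 is also the SNAT external IP for vm1.
  ovn-nbctl lb-add lb-test 42.42.42.2:8080 41.41.41.1:8080 tcp
  ovn-nbctl lr-lb-add lr lb-test
  ovn-nbctl lr-nat-add lr snat 42.42.42.2 41.41.41.1
  # Before the fix, a session from behind the VIP that reuses the VIP's
  # port as its source port skipped the unSNAT stage and broke:
  ip netns exec vm1 nc -z -p 8080 42.42.42.1 80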