
[ovs-dev,RFC] northd.c: Fix direct access to SNATed networks.

Message ID 20240207155627.806188-1-martin.kalcok@canonical.com
State RFC
Headers show
Series [ovs-dev,RFC] northd.c: Fix direct access to SNATed networks.

Checks

Context Check Description
ovsrobot/apply-robot warning apply and check: warning
ovsrobot/github-robot-_Build_and_Test fail github build: failed
ovsrobot/github-robot-_ovn-kubernetes fail github build: failed

Commit Message

Martin Kalcok Feb. 7, 2024, 3:56 p.m. UTC
When a machine from an external network tries to access a machine in
OVN's internal network with SNAT enabled, using a protocol like TCP, it
receives a connection reset after the first packets are successfully exchanged.

This setup may be a bit unusual and requires that whatever is responsible
for routing in the external network routes packets for OVN's internal
network via the LR's external port IP. However, it is something that used to
work in ML2/OVS OpenStack deployments and got broken after an upgrade to
ML2/OVN.

To demonstrate what the traffic looks like from the point of view of the
external machine, let's say we have:
  * E  (IP of machine in the external network)
  * I  (IP of machine in OVN's internal network)
  * LR (IP of OVN's Logical Router Port connected to external network)

The traffic looks like this:
  * [SYN]      E  -> I
  * [SYN,ACK]  I  -> E
  * [ACK]      E  -> I
  * [PSH,ACK]  E  -> I
  * [ACK]      LR -> E

The connection gets reset after the external machine receives an ACK from an unexpected IP (LR).
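
For contrast, the intended trace (once replies from the SNATed network are
no longer translated) would keep the internal IP on the reply path:
  * [SYN]      E  -> I
  * [SYN,ACK]  I  -> E
  * [ACK]      E  -> I
  * [PSH,ACK]  E  -> I
  * [ACK]      I  -> E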

Although I'm not completely sure why the first [SYN,ACK] reply got back
without SNAT, the root cause of this issue is that traffic initiated from
the external network is not tracked in CT. This causes even traffic that's
in the "reply" direction (i.e. it was initiated from outside the SNAT
network) to be translated by the Logical Router.

My proposal is to initiate CT and commit "new" connections destined
to the internal network (with SNAT enabled) on the Logical Router. This
will enable SNAT rules to translate only traffic that originates from
internal networks.

Signed-off-by: Martin Kalcok <martin.kalcok@canonical.com>
---
 northd/northd.c | 70 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 61 insertions(+), 9 deletions(-)

As this would be my first "real" contribution to OVN, I'd love to hear
some feedback on my approach while I work on performance measurements and
tests, which I plan on including in the final proposal.
I also left some inline XXX comments marking parts that I'm not sure about
and that need resolving before the final proposal.
I tested this change in my local OpenStack deployment with distributed gw
chassis and it seems to solve the issue regardless of the VM's placement,
whether or not it has a floating IP. It also doesn't seem to
break outbound connectivity, nor connectivity to VMs' floating IPs.
Thank you for any reviews/suggestions.
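
For reference, the flows this RFC adds (taken from the diff below) amount to
the following logical-flow sketch; the stage and priority choices are the
parts still marked XXX, and "lrp-internal" is a placeholder for the actual
LRP json_key:

    /* Commit connections arriving "new" at the SNATed network's LRP. */
    table=S_ROUTER_IN_ARP_REQUEST, priority=5,
        match=(outport == "lrp-internal" && ct.new),
        action=(ct_commit; output;)

    /* Send traffic from the SNATed network through CT so the
     * "(!ct.trk || !ct.rpl)" check in S_ROUTER_OUT_SNAT sees replies. */
    table=S_ROUTER_OUT_POST_UNDNAT, priority=5,
        match=(inport == "lrp-internal"),
        action=(ct_next;)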

Comments

0-day Robot Feb. 7, 2024, 4:20 p.m. UTC | #1
Bleep bloop.  Greetings Martin Kalcok, I am a robot and I have tried out your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 88 characters long (recommended limit is 79)
WARNING: Comment with 'xxx' marker
#97 FILE: northd/northd.c:10839:
                /* XXX: Need to figure out what is appropraite priority for these rules.

WARNING: Line is 83 characters long (recommended limit is 79)
#98 FILE: northd/northd.c:10840:
                        All they require is to be after any existing specific rules

Lines checked: 174, Warnings: 3, Errors: 0


Please check this out.  If you feel there has been an error, please email aconole@redhat.com

Thanks,
0-day Robot
Mark Michelson Feb. 8, 2024, 12:22 a.m. UTC | #2
Hi Martin,

Thanks a bunch for your first OVN contribution. I have comments below.

On 2/7/24 10:56, Martin Kalcok wrote:
> When a machine from an external network tries to access a machine in
> OVN's internal network with SNAT enabled, using a protocol like TCP, it
> receives a connection reset after the first packets are successfully exchanged.
> 
> This setup may be a bit unusual and requires that whatever is responsible
> for routing in the external network, routes packets for OVN's internal
> network via LR's external port IP. However it is something that used to
> work in ML2/OVS Openstack deployments and got broken after upgrade to
> ML2/OVN.
> 
> To demonstrate what the traffic looked like from the point of view of the
> external machine, let's say we have:
>    * E  (IP of machine in the external network)
>    * I  (IP of machine in OVN's internal network)
>    * LR (IP of OVN's Logical Router Port connected to external network)
> 
> The traffic looks like this:
>    * [SYN]      E  -> I
>    * [SYN,ACK]  I  -> E
>    * [ACK]      E  -> I
>    * [PSH,ACK]  E  -> I
>    * [ACK]      LR -> E
> 
> The connection gets reset after the external machine receives an ACK from an unexpected IP (LR).
> 
> Although I'm not completely sure why the first [SYN,ACK] reply got back
> without SNAT, root cause of this issue is that traffic initiated from
> the external network is not tracked in CT. This causes even traffic that's
> in the "reply" direction (i.e. it was initiated from outside the SNAT
> network), to be translated by Logical Router.
> 
> My proposal is to initiate CT and commit "new" connections destined
> to the internal network (with enabled SNAT) on Logical Router. This
> will enable SNAT rules to translate only traffic that originates from
> internal networks.
> 
> Signed-off-by: Martin Kalcok <martin.kalcok@canonical.com>
> ---
>   northd/northd.c | 70 ++++++++++++++++++++++++++++++++++++++++++-------
>   1 file changed, 61 insertions(+), 9 deletions(-)
> 
> As this would be my first "real" contribution to OVN, I'd love to hear
> some feedback on my approach while I work on performance measurements and
> tests, which I plan on including in the final proposal.
> I also left some inline XXX comments marking parts that I'm not sure about
> and need resolving before final proposal.
> I tested this change in my local Openstack deployment with distributed gw
> chassis and it seems to solve the issue regardless of the VM's placement,
> whether it does or does not have a floating IP. It also doesn't seem to
> break outbound connectivity, nor connectivity to VM's floating IPs.
> Thank you for any reviews/suggestions.

Without getting into the details of this particular patch, I have some 
general suggestions for contributing:

1) Before sending a patch, run utilities/checkpatch.py against all 
patches in the series. This can help 0-day Robot not to complain about 
coding guidelines violations.

2) Run the testsuite locally before posting. If you see test errors, 
double check if your change is what's causing them.

3) As a sanity check, consider pushing the code to your GitHub fork, and 
ensure that GitHub actions complete without error. If there are any 
errors, double check that they're not the fault of GitHub (e.g. it's not 
your fault if the action fails because the machine runs out of disk 
space while allocating pods). In this case, the OVS robot links to 
https://github.com/ovsrobot/ovn/actions/runs/7817861729/job/21326755181, 
where some compilation errors are reported. I've pointed out the 
offending lines in the diff below.

4) Consider adding test cases. In this case, I suggest adding a couple:
   * First, you can add one to ovn-northd.at that strictly tests whether 
the generated logical flows for SNAT are as expected. It would also be 
good to ensure that the flows are updated properly when the SNAT 
configuration changes on the logical router.
   * Second, I would add a test to system-ovn.at that simulates the 
broken scenario. It could ensure that packets originated internally are 
SNATted, and that replies to packets originated externally are not SNATted.

> 
> diff --git a/northd/northd.c b/northd/northd.c
> index 01eec64ca..b695932fd 100644
> --- a/northd/northd.c
> +++ b/northd/northd.c
> @@ -10772,6 +10772,12 @@ build_ecmp_route_flow(struct lflow_table *lflows, struct ovn_datapath *od,
>       ds_destroy(&actions);
>   }
>   
> +static inline bool
> +lrouter_use_common_zone(const struct ovn_datapath *od)
> +{
> +    return !od->is_gw_router && use_common_zone;
> +}
> +
>   static void
>   add_route(struct lflow_table *lflows, struct ovn_datapath *od,
>             const struct ovn_port *op, const char *lrp_addr_s,
> @@ -10796,6 +10802,9 @@ add_route(struct lflow_table *lflows, struct ovn_datapath *od,

I'm not sure that add_route() is a good fit for this new code. 
add_route() is called for every static route, logical router port IP, 
load balancer VIP, and NAT external address. It doesn't really make 
sense to be iterating through the router's NATs when adding a static 
route, for instance. Further, this has the potential to attempt to add 
duplicate flows since a router can (and likely will) have add_route() 
called multiple times for its addresses.

If you need to add new flows related to the configured NAT rules on a 
router, I'd suggest looking for an appropriate place in 
build_lrouter_nat_defrag_and_lb() or the functions it calls. We iterate 
over the configured NAT rules there, so it's a natural fit for these 
types of changes.

>       build_route_match(op_inport, rtb_id, network_s, plen, is_src_route,
>                         is_ipv4, &match, &priority, ofs);
>   
> +    bool use_ct = false;
> +    struct ds target_net = DS_EMPTY_INITIALIZER;
> +    struct ds snat_port_match = DS_EMPTY_INITIALIZER;
>       struct ds common_actions = DS_EMPTY_INITIALIZER;
>       struct ds actions = DS_EMPTY_INITIALIZER;
>       if (is_discard_route) {
> @@ -10808,16 +10817,61 @@ add_route(struct lflow_table *lflows, struct ovn_datapath *od,
>           } else {
>               ds_put_format(&common_actions, "ip%s.dst", is_ipv4 ? "4" : "6");
>           }
> +
> +        /* To enable direct connectivity from external networks to Logical IPs
> +         * of VMs/Containers in the SNATed networks, we need to track
> +         * connections initiated from external networks.  This allows SNAT
> +         * rules to avoid SNATing packets in "reply" direction. */
> +        if (od->nbr && od->nbr->n_nat) {
> +            ds_put_format(&target_net, "%s/%d", network_s, plen);
> +            for (int i = 0; i < od->nbr->n_nat; i++) {
> +                const struct nbrec_nat *nat = od->nbr->nat[i];
> +                if (strcmp(nat->type, "snat") ||
> +                    strcmp(nat->logical_ip, ds_cstr(&target_net)) ||

We allow IPv6 addresses for NAT. strcmp isn't a good way to compare IPv6 
addresses since the same address can have many string representations.
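
To illustrate the point (this is not OVN's internal helper; OVS/OVN have
their own IPv6 parsing utilities), a minimal sketch of why the comparison
needs to go through parsed binary form:

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: compare two textual IPv6 addresses by parsing
 * them into binary form first, so that equivalent spellings of the
 * same address compare equal. */
static bool
ipv6_str_equals(const char *a, const char *b)
{
    struct in6_addr bin_a, bin_b;

    if (inet_pton(AF_INET6, a, &bin_a) != 1 ||
        inet_pton(AF_INET6, b, &bin_b) != 1) {
        return false;   /* One of the strings is not a valid IPv6 literal. */
    }
    return memcmp(&bin_a, &bin_b, sizeof bin_a) == 0;
}
```

strcmp("fd00::1", "fd00:0:0:0:0:0:0:1") is nonzero even though both strings
name the same address; parsing first avoids that pitfall (a logical_ip with
a prefix length would need the same treatment for its mask).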

> +                    lrouter_use_common_zone(od)) {
> +                    continue;
> +                }
> +
> +                /* Initiate connection tracking for traffic destined to
> +                 * SNATed network. */
> +                use_ct = true;
> +
> +                /* XXX: Need to figure out what is appropraite priority for these rules.
> +                        All they require is to be after any existing specific rules
> +                        and before the default rule.*/

There aren't any hard and fast rules for picking a priority number, 
other than what you stated in this comment. As long as it gets the job 
done, you can pick any appropriate priority number. It's best to try to 
leave some room for new priorities to be inserted in case new rules 
arise, but other than that, pick whatever you want.

> +
> +                /* Commit new connections initiated from the outside of
> +                 * SNATed network to CT. */
> +                ds_put_format(&snat_port_match,
> +                              "outport == %s && ct.new",
> +                              op->json_key);
> +                ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ARP_REQUEST, 5,
> +                                        ds_cstr(&snat_port_match),
> +                                        "ct_commit; output;", stage_hint);

I'm not sure I understand the choice of S_ROUTER_IN_ARP_REQUEST for 
performing the ct_commit; This doesn't have anything to do with ARP 
requests, so it seems like a different table would fit better here.

> +                ds_clear(&snat_port_match);
> +
> +                /* Initiate connection tracking for traffic from SNATed
> +                 * network to enable rules in S_ROUTER_OUT_SNAT disregard
> +                 * traffic in "reply" direction. */
> +                ds_put_format(&snat_port_match, "inport == %s", op->json_key);
> +                ovn_lflow_add_with_hint(lflows, od, S_ROUTER_OUT_POST_UNDNAT,
> +                                        5, ds_cstr(&snat_port_match),
> +                                        "ct_next;", stage_hint);

Your two calls to ovn_lflow_add_with_hint() are missing the final 
lflow_ref parameter, which is why github actions failed when testing.

> +                break;
> +            }
> +        }
> +
>           ds_put_format(&common_actions, "; "
>                         "%s = %s; "
>                         "eth.src = %s; "
>                         "outport = %s; "
>                         "flags.loopback = 1; "
> -                      "next;",
> +                      "%s",
>                         is_ipv4 ? REG_SRC_IPV4 : REG_SRC_IPV6,
>                         lrp_addr_s,
>                         op->lrp_networks.ea_s,
> -                      op->json_key);
> +                      op->json_key,
> +                      use_ct ? "ct_next;" : "next;");

I'm a bit confused by this addition. In your scenario, E sends traffic 
to I, and I sends traffic to E. Traffic is never destined for a LRP 
address, load balancer VIP, or NAT external address. So I don't know why 
this flow would be matched in your scenario. Did I miss a detail somewhere?

Even if this is necessary for your scenario, I think this is adding 
conntrack to situations where it shouldn't be present, such as traffic 
destined for static routes.

>           ds_put_format(&actions, "ip.ttl--; %s", ds_cstr(&common_actions));
>       }
>   
> @@ -10833,6 +10887,8 @@ add_route(struct lflow_table *lflows, struct ovn_datapath *od,
>                                   ds_cstr(&common_actions),\
>                                   stage_hint, lflow_ref);
>       }
> +    ds_destroy(&snat_port_match);
> +    ds_destroy(&target_net);
>       ds_destroy(&match);
>       ds_destroy(&common_actions);
>       ds_destroy(&actions);
> @@ -10933,12 +10989,6 @@ struct lrouter_nat_lb_flows_ctx {
>       const struct shash *meter_groups;
>   };
>   
> -static inline bool
> -lrouter_use_common_zone(const struct ovn_datapath *od)
> -{
> -    return !od->is_gw_router && use_common_zone;
> -}
> -
>   static void
>   build_distr_lrouter_nat_flows_for_lb(struct lrouter_nat_lb_flows_ctx *ctx,
>                                        enum lrouter_nat_lb_flow_type type,
> @@ -14627,7 +14677,9 @@ build_lrouter_out_snat_flow(struct lflow_table *lflows,
>       }
>       ds_put_cstr(match, " && (!ct.trk || !ct.rpl)");
>   
> -    ds_put_format(actions, "ct_snat(%s", nat->external_ip);
> +    /* ct_commit is needed here otherwise replies to the SNATed traffic
> +     * look like "new" traffic in S_ROUTER_IN_ARP_REQUEST*/
> +    ds_put_format(actions, "ct_commit; ct_snat(%s", nat->external_ip);
>       if (nat->external_port_range[0]) {
>           ds_put_format(actions, ",%s", nat->external_port_range);
>       }
I think the reason this patch works is that it essentially ends 
up sending everything to conntrack. The problem is that it's likely 
sending more than is necessary.

If you look in build_lrouter_out_snat_flow(), you can see that we have " 
&& (!ct.trk || !ct.rpl)" as part of the match before we attempt to SNAT. 
Therefore, we at least are attempting not to SNAT reply traffic.

As you noted, the [SYN,ACK] sent from "I" in your scenario is not 
SNATted. This is because the [SYN] from "E" was committed to conntrack 
in the S_ROUTER_OUT_POST_UNDNAT stage. Therefore, the [SYN,ACK] was 
properly seen as a reply packet, and it was not SNATted.

However, for some reason the final [ACK] from "I" was not seen as a 
reply packet, and therefore it was SNATted. I'm curious if anything 
about the TCP connection changed between the initial handshake and the 
subsequent packets that would make conntrack see the traffic as 
"related" instead of part of the same connection. If so, then I think 
the trick here is to ensure that ct.rel traffic is committed to 
conntrack as well.

Ales Musil added a new action type in OVN 22.12 called "ct_commit_nat". 
Its intended purpose was to ensure that in addition to the direct 
connection, related traffic would also be committed to conntrack. 
Currently, it's only used for DNAT so that ICMP replies from load 
balancer backends will be unDNATted.

It seems we need something similar here, but it needs to be used when 
traffic hits the UNSNAT stage of the router pipeline. Maybe it would be 
enough to extend the ct_commit_nat command to take an optional "snat" or 
"dnat" parameter to choose the appropriate conntrack zone to commit to. 
Then ct.rel traffic in the S_ROUTER_IN_UNSNAT stage could call 
ct_commit_nat(snat) instead of ct_snat to ensure that the related 
traffic is committed to conntrack. I'm not 100% sure if that will do the 
trick, but if it does, it will ensure a more focused amount of traffic 
will get committed to conntrack. It would also be a much simpler and 
smaller patch.
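
Sketched as a logical flow, that suggestion might look like the following.
Note that the "snat" parameter to ct_commit_nat is hypothetical (it does not
exist yet), and the priority and ip4.dst match are illustrative only:

    table=S_ROUTER_IN_UNSNAT, priority=100,
        match=(ip && ip4.dst == <nat_external_ip> && ct.rel),
        action=(ct_commit_nat(snat);)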

@Ales Musil, did my suggestion above make sense? Or am I completely 
approaching this incorrectly?
Mark Michelson Feb. 8, 2024, 12:26 a.m. UTC | #3
I mentioned Ales in my notes but forgot to CC him in my reply. Whoops. 
I've added him now :)

On 2/7/24 19:22, Mark Michelson wrote:
> Hi Martin,
> 
> Thanks a bunch for your first OVN contribution. I have comments below.
> 
> On 2/7/24 10:56, Martin Kalcok wrote:
>> When a machine from an external network tries to access machine in the
>> OVN's internal network with SNAT enabled, using protocol like TCP, it
>> receives connection reset after first packets are successfully exchanged.
>>
>> This setup may be a bit unusual and requires that whatever is responsible
>> for routing in the external network, routes packets for OVN's internal
>> network via LR's external port IP. However it is something that used to
>> work in ML2/OVS Openstack deployments and got broken after upgrade to
>> ML2/OVN.
>>
>> To demonstrate what the traffic looked like from the point of view of the
>> external machine, let's say we have:
>>    * E  (IP of machine in the external network)
>>    * I  (IP of machine in OVN's internal network)
>>    * LR (IP of OVN's Logical Router Port connected to external network)
>>
>> The traffic looks like this:
>>    * [SYN]      E  -> I
>>    * [SYN,ACK]  I  -> E
>>    * [ACK]      E  -> I
>>    * [PSH,ACK]  E  -> I
>>    * [ACK]      LR -> E
>>
>> The connection gets reset after the external machine receives an ACK from an unexpected IP (LR).
>>
>> Although I'm not completely sure why the first [SYN,ACK] reply got back
>> without SNAT, root cause of this issue is that traffic initiated from
>> the external network is not tracked in CT. This causes even traffic 
>> that's
>> in the "reply" direction (i.e. it was initiated from outside the SNAT
>> network), to be translated by Logical Router.
>>
>> My proposal is to initiate CT and commit "new" connections destined
>> to the internal network (with enabled SNAT) on Logical Router. This
>> will enable SNAT rules to translate only traffic that originates from
>> internal networks.
>>
>> Signed-off-by: Martin Kalcok <martin.kalcok@canonical.com>
>> ---
>>   northd/northd.c | 70 ++++++++++++++++++++++++++++++++++++++++++-------
>>   1 file changed, 61 insertions(+), 9 deletions(-)
>>
>> As this would be my first "real" contribution to the OVN, I'd love to 
>> hear
>> some feedback on my approach while I work on performance measurements and
>> tests, which I plan on including in final proposal.
>> I also left some inline XXX comments marking parts that I'm not sure 
>> about
>> and need resolving before final proposal.
>> I tested this change in my local Openstack deployment with distributed gw
>> chassis and it seems to solve the issue regardless of the VM's placement,
>> whether it does or does not have a floating IP. It also doesn't seem to
>> break outbound connectivity, nor connectivity to VM's floating IPs.
>> Thank you for any reviews/suggestions.
> 
> Without getting into the details of this particular patch, I have some 
> general suggestions for contributing:
> 
> 1) Before sending a patch, run utilities/checkpatch.py against all 
> patches in the series. This can help 0-day Robot not to complain about 
> coding guidelines violations.
> 
> 2) Run the testsuite locally before posting. If you see test errors, 
> double check if your change is what's causing them.
> 
> 3) As a sanity check, consider pushing the code to your GitHub fork, and 
> ensure that GitHub actions complete without error. If there are any 
> errors, double check that they're not the fault of GitHub (e.g. it's not 
> your fault if the action fails because the machine runs out of disk 
> space while allocating pods). In this case, the OVS robot links to 
> https://github.com/ovsrobot/ovn/actions/runs/7817861729/job/21326755181, 
> where some compilation errors are reported. I've pointed out the 
> offending lines in the diff below.
> 
> 4) Consider adding test cases. In this case, I suggest adding a couple:
>    * First, you can add one to ovn-northd.at that strictly tests whether 
> the generated logical flows for SNAT are as expected. It would also be 
> good to ensure that the flows are updated properly when the SNAT 
> configuration changes on the logical router.
>    * Second, I would add a test to system-ovn.at that simulates the 
> broken scenario. It could ensure that packets originated internally are 
> SNATted, and that replies to packets originated externally are not SNATted.
> 
>>
>> diff --git a/northd/northd.c b/northd/northd.c
>> index 01eec64ca..b695932fd 100644
>> --- a/northd/northd.c
>> +++ b/northd/northd.c
>> @@ -10772,6 +10772,12 @@ build_ecmp_route_flow(struct lflow_table 
>> *lflows, struct ovn_datapath *od,
>>       ds_destroy(&actions);
>>   }
>> +static inline bool
>> +lrouter_use_common_zone(const struct ovn_datapath *od)
>> +{
>> +    return !od->is_gw_router && use_common_zone;
>> +}
>> +
>>   static void
>>   add_route(struct lflow_table *lflows, struct ovn_datapath *od,
>>             const struct ovn_port *op, const char *lrp_addr_s,
>> @@ -10796,6 +10802,9 @@ add_route(struct lflow_table *lflows, struct 
>> ovn_datapath *od,
> 
> I'm not sure that add_route() is a good fit for this new code. 
> add_route() is called for every static route, logical router port IP, 
> load balancer VIP, and NAT external address. It doesn't really make 
> sense to be iterating through the router's NATs when adding a static 
> route, for instance. Further, this has the potential to attempt to add 
> duplicate flows since a router can (and likely will) have add_route() 
> called multiple times for its addresses.
> 
> If you need to add new flows related to the configured NAT rules on a 
> router, I'd suggest looking for an appropriate place in 
> build_lrouter_nat_defrag_and_lb() or the functions it calls. We iterate 
> over the configured NAT rules there, so it's a natural fit for these 
> types of changes.
> 
>>       build_route_match(op_inport, rtb_id, network_s, plen, is_src_route,
>>                         is_ipv4, &match, &priority, ofs);
>> +    bool use_ct = false;
>> +    struct ds target_net = DS_EMPTY_INITIALIZER;
>> +    struct ds snat_port_match = DS_EMPTY_INITIALIZER;
>>       struct ds common_actions = DS_EMPTY_INITIALIZER;
>>       struct ds actions = DS_EMPTY_INITIALIZER;
>>       if (is_discard_route) {
>> @@ -10808,16 +10817,61 @@ add_route(struct lflow_table *lflows, struct 
>> ovn_datapath *od,
>>           } else {
>>               ds_put_format(&common_actions, "ip%s.dst", is_ipv4 ? "4" 
>> : "6");
>>           }
>> +
>> +        /* To enable direct connectivity from external networks to 
>> Logical IPs
>> +         * of VMs/Containers in the SNATed networks, we need to track
>> +         * connections initiated from external networks.  This allows 
>> SNAT
>> +         * rules to avoid SNATing packets in "reply" direction. */
>> +        if (od->nbr && od->nbr->n_nat) {
>> +            ds_put_format(&target_net, "%s/%d", network_s, plen);
>> +            for (int i = 0; i < od->nbr->n_nat; i++) {
>> +                const struct nbrec_nat *nat = od->nbr->nat[i];
>> +                if (strcmp(nat->type, "snat") ||
>> +                    strcmp(nat->logical_ip, ds_cstr(&target_net)) ||
> 
> We allow IPv6 addresses for NAT. strcmp isn't a good way to compare IPv6 
> addresses since the same address can have many string representations.
> 
>> +                    lrouter_use_common_zone(od)) {
>> +                    continue;
>> +                }
>> +
>> +                /* Initiate connection tracking for traffic destined to
>> +                 * SNATed network. */
>> +                use_ct = true;
>> +
>> +                /* XXX: Need to figure out what is appropraite 
>> priority for these rules.
>> +                        All they require is to be after any existing 
>> specific rules
>> +                        and before the default rule.*/
> 
> There aren't any hard and fast rules for picking a priority number, 
> other than what you stated in this comment. As long as it gets the job 
> done, you can pick any appropriate priority number. It's best to try to 
> leave some room for new priorities to be inserted in case new rules 
> arise, but other than that, pick whatever you want.
> 
>> +
>> +                /* Commit new connections initiated from the outside of
>> +                 * SNATed network to CT. */
>> +                ds_put_format(&snat_port_match,
>> +                              "outport == %s && ct.new",
>> +                              op->json_key);
>> +                ovn_lflow_add_with_hint(lflows, od, 
>> S_ROUTER_IN_ARP_REQUEST, 5,
>> +                                        ds_cstr(&snat_port_match),
>> +                                        "ct_commit; output;", 
>> stage_hint);
> 
> I'm not sure I understand the choice of S_ROUTER_IN_ARP_REQUEST for 
> performing the ct_commit; This doesn't have anything to do with ARP 
> requests, so it seems like a different table would fit better here.
> 
>> +                ds_clear(&snat_port_match);
>> +
>> +                /* Initiate connection tracking for traffic from SNATed
>> +                 * network to enable rules in S_ROUTER_OUT_SNAT 
>> disregard
>> +                 * traffic in "reply" direction. */
>> +                ds_put_format(&snat_port_match, "inport == %s", 
>> op->json_key);
>> +                ovn_lflow_add_with_hint(lflows, od, 
>> S_ROUTER_OUT_POST_UNDNAT,
>> +                                        5, ds_cstr(&snat_port_match),
>> +                                        "ct_next;", stage_hint);
> 
> Your two calls to ovn_lflow_add_with_hint() are missing the final 
> lflow_ref parameter, which is why github actions failed when testing.
> 
>> +                break;
>> +            }
>> +        }
>> +
>>           ds_put_format(&common_actions, "; "
>>                         "%s = %s; "
>>                         "eth.src = %s; "
>>                         "outport = %s; "
>>                         "flags.loopback = 1; "
>> -                      "next;",
>> +                      "%s",
>>                         is_ipv4 ? REG_SRC_IPV4 : REG_SRC_IPV6,
>>                         lrp_addr_s,
>>                         op->lrp_networks.ea_s,
>> -                      op->json_key);
>> +                      op->json_key,
>> +                      use_ct ? "ct_next;" : "next;");
> 
> I'm a bit confused by this addition. In your scenario, E sends traffic 
> to I, and I sends traffic to E. Traffic is never destined for a LRP 
> address, load balancer VIP, or NAT external address. So I don't know why 
> this flow would be matched in your scenario. Did I miss a detail somewhere?
> 
> Even if this is necessary for your scenario, I think this is adding 
> conntrack to situations where it shouldn't be present, such as traffic 
> destined for static routes.
> 
>>           ds_put_format(&actions, "ip.ttl--; %s", 
>> ds_cstr(&common_actions));
>>       }
>> @@ -10833,6 +10887,8 @@ add_route(struct lflow_table *lflows, struct 
>> ovn_datapath *od,
>>                                   ds_cstr(&common_actions),\
>>                                   stage_hint, lflow_ref);
>>       }
>> +    ds_destroy(&snat_port_match);
>> +    ds_destroy(&target_net);
>>       ds_destroy(&match);
>>       ds_destroy(&common_actions);
>>       ds_destroy(&actions);
>> @@ -10933,12 +10989,6 @@ struct lrouter_nat_lb_flows_ctx {
>>       const struct shash *meter_groups;
>>   };
>> -static inline bool
>> -lrouter_use_common_zone(const struct ovn_datapath *od)
>> -{
>> -    return !od->is_gw_router && use_common_zone;
>> -}
>> -
>>   static void
>>   build_distr_lrouter_nat_flows_for_lb(struct lrouter_nat_lb_flows_ctx 
>> *ctx,
>>                                        enum lrouter_nat_lb_flow_type 
>> type,
>> @@ -14627,7 +14677,9 @@ build_lrouter_out_snat_flow(struct lflow_table 
>> *lflows,
>>       }
>>       ds_put_cstr(match, " && (!ct.trk || !ct.rpl)");
>> -    ds_put_format(actions, "ct_snat(%s", nat->external_ip);
>> +    /* ct_commit is needed here otherwise replies to the SNATed traffic
>> +     * look like "new" traffic in S_ROUTER_IN_ARP_REQUEST*/
>> +    ds_put_format(actions, "ct_commit; ct_snat(%s", nat->external_ip);
>>       if (nat->external_port_range[0]) {
>>           ds_put_format(actions, ",%s", nat->external_port_range);
>>       }
> I think the reason why this patch works is because it essentially ends 
> up sending everything to conntrack. The problem is that it's likely 
> sending more than is necessary.
> 
> If you look in build_lrouter_out_snat_flow(), you can see that we have " 
> && (!ct.trk || !ct.rpl)" as part of the match before we attempt to SNAT. 
> Therefore, we at least are attempting not to SNAT reply traffic.
> 
> As you noted, the [SYN,ACK] sent from "I" in your scenario is not 
> SNATted. This is because the [SYN] from "E" was committed to conntrack 
> in the S_ROUTER_OUT_POST_UNDNAT stage. Therefore, the [SYN,ACK] was 
> properly seen as a reply packet, and it was not SNATted.
> 
> However, for some reason the final [ACK] from "I" was not seen as a 
> reply packet, and therefore it was SNATted. I'm curious if anything 
> about the TCP connection changed between the initial handshake and the 
> subsequent packets that would make conntrack see the traffic as 
> "related" instead of part of the same connection. If so, then I think 
> the trick here is to ensure that ct.rel traffic is committed to 
> conntrack as well.
> 
> Ales Musil added a new action type in OVN 22.12 called "ct_commit_nat". 
> Its intended purpose was to ensure that in addition to the direct 
> connection, related traffic would also be committed to conntrack. 
> Currently, it's only used for DNAT so that ICMP replies from load 
> balancer backends will be unDNATted.
> 
> It seems we need something similar here, but it needs to be used when 
> traffic hits the UNSNAT stage of the router pipeline. Maybe it would be 
> enough to extend the ct_commit_nat command to take an optional "snat" or 
> "dnat" parameter to choose the appropriate conntrack zone to commit to. 
> Then ct.rel traffic in the S_ROUTER_IN_UNSNAT stage could call 
> ct_commit_nat(snat) instead of ct_snat to ensure that the related 
> traffic is committed to conntrack. I'm not 100% sure if that will do the 
> trick, but if it does, it will ensure a more focused amount of traffic 
> will get committed to conntrack. It would also be a much simpler and 
> smaller patch.
> 
> @Ales Musil, did my suggestion above make sense? Or am I completely 
> approaching this incorrectly?
Ales Musil Feb. 8, 2024, 8:30 a.m. UTC | #4
On Thu, Feb 8, 2024 at 1:22 AM Mark Michelson <mmichels@redhat.com> wrote:

> Hi Martin
>
> Thanks a bunch for your first OVN contribution. I have comments below.
>
> On 2/7/24 10:56, Martin Kalcok wrote:
> > When a machine from an external network tries to access machine in the
> > OVN's internal network with SNAT enabled, using protocol like TCP, it
> > receives connection reset after first packets are successfully exchanged.
> >
> > This setup may be a bit unusual and requires that whatever is responsible
> > for routing in the external network, routes packets for OVN's internal
> > network via LR's external port IP. However it is something that used to
> > work in ML2/OVS Openstack deployments and got broken after upgrade to
> > ML2/OVN.
> >
> > To demonstrate what the traffic looked like from the point of view of the
> > external machine, let's say we have:
> >    * E  (IP of machine in the external network)
> >    * I  (IP of machine in OVN's internal network)
> >    * LR (IP of OVN's Logical Router Port connected to external network)
> >
> > The traffic looks like this:
> >    * [SYN]      E  -> I
> >    * [SYN,ACK]  I  -> E
> >    * [ACK]      E  -> I
> >    * [PSH,ACK]  E  -> I
> >    * [ACK]      LR -> E
> >
> > Connection gets reset after it receives an ACK from an unexpected IP (LR).
> >
> > Although I'm not completely sure why the first [SYN,ACK] reply got back
> > without SNAT, root cause of this issue is that traffic initiated from
> > the external network is not tracked in CT. This causes even traffic
> that's
> > in the "reply" direction (i.e. it was initiated from outside the SNAT
> > network), to be translated by Logical Router.
> >
> > My proposal is to initiate CT and commit "new" connections destined
> > to the internal network (with enabled SNAT) on Logical Router. This
> > will enable SNAT rules to translate only traffic that originates from
> > internal networks.
> >
> > Signed-off-by: Martin Kalcok <martin.kalcok@canonical.com>
> > ---
> >   northd/northd.c | 70 ++++++++++++++++++++++++++++++++++++++++++-------
> >   1 file changed, 61 insertions(+), 9 deletions(-)
> >
> > As this would be my first "real" contribution to the OVN, I'd love to
> hear
> some feedback on my approach while I work on performance measurements and
> > tests, which I plan on including in final proposal.
> > I also left some inline XXX comments marking parts that I'm not sure
> about
> > and need resolving before final proposal.
> > I tested this change in my local Openstack deployment with distributed gw
> > chassis and it seems to solve the issue regardless of the VM's placement,
> > whether it does or does not have a floating IP. It also doesn't seem to
> > break outbound connectivity, nor connectivity to VM's floating IPs.
> > Thank you for any reviews/suggestions.
>
> Without getting into the details of this particular patch, I have some
> general suggestions for contributing:
>
> 1) Before sending a patch, run utilities/checkpatch.py against all
> patches in the series. This can help 0-day Robot not to complain about
> coding guidelines violations.
>
> 2) Run the testsuite locally before posting. If you see test errors,
> double check if your change is what's causing them.
>
> 3) As a sanity check, consider pushing the code to your github fork, and
> ensure that github actions complete without error. If there are any
> errors, double check that they're not the fault of github (e.g. it's not
> your fault if the action fails because the machine runs out of disk
> space while allocating pods). In this case, The OVS robot links to
> https://github.com/ovsrobot/ovn/actions/runs/7817861729/job/21326755181
> , where there are some compilation errors reported. I've pointed out the
> offending lines in the diff below.
>
> 4) Consider adding test cases. In this case, I suggest adding a couple:
>    * First, you can add one to ovn-northd.at that strictly tests whether
> the generated logical flows for SNAT are as expected. It would also be
> good to ensure that the flows are updated properly when the SNAT
> configuration changes on the logical router.
>    * Second, I would add a test to system-ovn.at that simulates the
> broken scenario. It could ensure that packets originated internally are
> SNATted, and that replies to packets originated externally are not SNATted.
>
> >
> > diff --git a/northd/northd.c b/northd/northd.c
> > index 01eec64ca..b695932fd 100644
> > --- a/northd/northd.c
> > +++ b/northd/northd.c
> > @@ -10772,6 +10772,12 @@ build_ecmp_route_flow(struct lflow_table
> *lflows, struct ovn_datapath *od,
> >       ds_destroy(&actions);
> >   }
> >
> > +static inline bool
> > +lrouter_use_common_zone(const struct ovn_datapath *od)
> > +{
> > +    return !od->is_gw_router && use_common_zone;
> > +}
> > +
> >   static void
> >   add_route(struct lflow_table *lflows, struct ovn_datapath *od,
> >             const struct ovn_port *op, const char *lrp_addr_s,
> > @@ -10796,6 +10802,9 @@ add_route(struct lflow_table *lflows, struct
> ovn_datapath *od,
>
> I'm not sure that add_route() is a good fit for this new code.
> add_route() is called for every static route, logical router port IP,
> load balancer VIP, and NAT external address. It doesn't really make
> sense to be iterating through the router's NATs when adding a static
> route, for instance. Further, this has the potential to attempt to add
> duplicate flows since a router can (and likely will) have add_route()
> called multiple times for its addresses.
>
> If you need to add new flows related to the configured NAT rules on a
> router, I'd suggest looking for an appropriate place in
> build_lrouter_nat_defrag_and_lb() or the functions it calls. We iterate
> over the configured NAT rules there, so it's a natural fit for these
> types of changes.
>
> >       build_route_match(op_inport, rtb_id, network_s, plen, is_src_route,
> >                         is_ipv4, &match, &priority, ofs);
> >
> > +    bool use_ct = false;
> > +    struct ds target_net = DS_EMPTY_INITIALIZER;
> > +    struct ds snat_port_match = DS_EMPTY_INITIALIZER;
> >       struct ds common_actions = DS_EMPTY_INITIALIZER;
> >       struct ds actions = DS_EMPTY_INITIALIZER;
> >       if (is_discard_route) {
> > @@ -10808,16 +10817,61 @@ add_route(struct lflow_table *lflows, struct
> ovn_datapath *od,
> >           } else {
> >               ds_put_format(&common_actions, "ip%s.dst", is_ipv4 ? "4" :
> "6");
> >           }
> > +
> > +        /* To enable direct connectivity from external networks to
> Logical IPs
> > +         * of VMs/Containers in the SNATed networks, we need to track
> > +         * connections initiated from external networks.  This allows
> SNAT
> > +         * rules to avoid SNATing packets in "reply" direction. */
> > +        if (od->nbr && od->nbr->n_nat) {
> > +            ds_put_format(&target_net, "%s/%d", network_s, plen);
> > +            for (int i = 0; i < od->nbr->n_nat; i++) {
> > +                const struct nbrec_nat *nat = od->nbr->nat[i];
> > +                if (strcmp(nat->type, "snat") ||
> > +                    strcmp(nat->logical_ip, ds_cstr(&target_net)) ||
>
> We allow IPv6 addresses for NAT. strcmp isn't a good way to compare IPv6
> addresses since the same address can have many string representations.
>
> > +                    lrouter_use_common_zone(od)) {
> > +                    continue;
> > +                }
> > +
> > +                /* Initiate connection tracking for traffic destined to
> > +                 * SNATed network. */
> > +                use_ct = true;
> > +
> > +                /* XXX: Need to figure out what is appropriate
> for these rules.
> > +                        All they require is to be after any existing
> specific rules
> > +                        and before the default rule.*/
>
> There aren't any hard and fast rules for picking a priority number,
> other than what you stated in this comment. As long as it gets the job
> done, you can pick any appropriate priority number. It's best to try to
> leave some room for new priorities to be inserted in case new rules
> arise, but other than that, pick whatever you want.
>
> > +
> > +                /* Commit new connections initiated from the outside of
> > +                 * SNATed network to CT. */
> > +                ds_put_format(&snat_port_match,
> > +                              "outport == %s && ct.new",
> > +                              op->json_key);
> > +                ovn_lflow_add_with_hint(lflows, od,
> S_ROUTER_IN_ARP_REQUEST, 5,
> > +                                        ds_cstr(&snat_port_match),
> > +                                        "ct_commit; output;",
> stage_hint);
>
> I'm not sure I understand the choice of S_ROUTER_IN_ARP_REQUEST for
> performing the ct_commit; This doesn't have anything to do with ARP
> requests, so it seems like a different table would fit better here.
>
> > +                ds_clear(&snat_port_match);
> > +
> > +                /* Initiate connection tracking for traffic from SNATed
> > +                 * network to enable rules in S_ROUTER_OUT_SNAT
> disregard
> > +                 * traffic in "reply" direction. */
> > +                ds_put_format(&snat_port_match, "inport == %s",
> op->json_key);
> > +                ovn_lflow_add_with_hint(lflows, od,
> S_ROUTER_OUT_POST_UNDNAT,
> > +                                        5, ds_cstr(&snat_port_match),
> > +                                        "ct_next;", stage_hint);
>
> Your two calls to ovn_lflow_add_with_hint() are missing the final
> lflow_ref parameter, which is why github actions failed when testing.
>
> > +                break;
> > +            }
> > +        }
> > +
> >           ds_put_format(&common_actions, "; "
> >                         "%s = %s; "
> >                         "eth.src = %s; "
> >                         "outport = %s; "
> >                         "flags.loopback = 1; "
> > -                      "next;",
> > +                      "%s",
> >                         is_ipv4 ? REG_SRC_IPV4 : REG_SRC_IPV6,
> >                         lrp_addr_s,
> >                         op->lrp_networks.ea_s,
> > -                      op->json_key);
> > +                      op->json_key,
> > +                      use_ct ? "ct_next;" : "next;");
>
> I'm a bit confused by this addition. In your scenario, E sends traffic
> to I, and I sends traffic to E. Traffic is never destined for a LRP
> address, load balancer VIP, or NAT external address. So I don't know why
> this flow would be matched in your scenario. Did I miss a detail somewhere?
>
> Even if this is necessary for your scenario, I think this is adding
> conntrack to situations where it shouldn't be present, such as traffic
> destined for static routes.
>
> >           ds_put_format(&actions, "ip.ttl--; %s",
> ds_cstr(&common_actions));
> >       }
> >
> > @@ -10833,6 +10887,8 @@ add_route(struct lflow_table *lflows, struct
> ovn_datapath *od,
> >                                   ds_cstr(&common_actions),\
> >                                   stage_hint, lflow_ref);
> >       }
> > +    ds_destroy(&snat_port_match);
> > +    ds_destroy(&target_net);
> >       ds_destroy(&match);
> >       ds_destroy(&common_actions);
> >       ds_destroy(&actions);
> > @@ -10933,12 +10989,6 @@ struct lrouter_nat_lb_flows_ctx {
> >       const struct shash *meter_groups;
> >   };
> >
> > -static inline bool
> > -lrouter_use_common_zone(const struct ovn_datapath *od)
> > -{
> > -    return !od->is_gw_router && use_common_zone;
> > -}
> > -
> >   static void
> >   build_distr_lrouter_nat_flows_for_lb(struct lrouter_nat_lb_flows_ctx
> *ctx,
> >                                        enum lrouter_nat_lb_flow_type
> type,
> > @@ -14627,7 +14677,9 @@ build_lrouter_out_snat_flow(struct lflow_table
> *lflows,
> >       }
> >       ds_put_cstr(match, " && (!ct.trk || !ct.rpl)");
> >
> > -    ds_put_format(actions, "ct_snat(%s", nat->external_ip);
> > +    /* ct_commit is needed here otherwise replies to the SNATed traffic
> > +     * look like "new" traffic in S_ROUTER_IN_ARP_REQUEST*/
> > +    ds_put_format(actions, "ct_commit; ct_snat(%s", nat->external_ip);
> >       if (nat->external_port_range[0]) {
> >           ds_put_format(actions, ",%s", nat->external_port_range);
> >       }
> I think the reason why this patch works is because it essentially ends
> up sending everything to conntrack. The problem is that it's likely
> sending more than is necessary.
>
> If you look in build_lrouter_out_snat_flow(), you can see that we have "
> && (!ct.trk || !ct.rpl)" as part of the match before we attempt to SNAT.
> Therefore, we at least are attempting not to SNAT reply traffic.
>
> As you noted, the [SYN,ACK] sent from "I" in your scenario is not
> SNATted. This is because the [SYN] from "E" was committed to conntrack
> in the S_ROUTER_OUT_POST_UNDNAT stage. Therefore, the [SYN,ACK] was
> properly seen as a reply packet, and it was not SNATted.
>
> However, for some reason the final [ACK] from "I" was not seen as a
> reply packet, and therefore it was SNATted. I'm curious if anything
> about the TCP connection changed between the initial handshake and the
> subsequent packets that would make conntrack see the traffic as
> "related" instead of part of the same connection. If so, then I think
> the trick here is to ensure that ct.rel traffic is committed to
> conntrack as well.
>
> Ales Musil added a new action type in OVN 22.12 called "ct_commit_nat" .
> Its intended purpose was to ensure that in addition to the direct
> connection, related traffic would also be committed to conntrack.
> Currently, it's only used for DNAT so that ICMP replies from load
> balancer backends will be unDNATted.
>
> It seems we need something similar here, but it needs to be used when
> traffic hits the UNSNAT stage of the router pipeline. Maybe it would be
> enough to extend the ct_commit_nat command to take an optional "snat" or
> "dnat" parameter to choose the appropriate conntrack zone to commit to.
> Then ct.rel traffic in the S_ROUTER_IN_UNSNAT stage could call
> ct_commit_nat(snat) instead of ct_snat to ensure that the related
> traffic is committed to conntrack. I'm not 100% sure if that will do the
> trick, but if it does, it will ensure a more focused amount of traffic
> will get committed to conntrack. It would also be a much simpler and
> smaller patch.
>
> @Ales Musil, did my suggestion above make sense? Or am I completely
> approaching this incorrectly?
>

> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
>

Hi Martin and Mark,

I'm not quite sure either why the first part of the handshake doesn't
have the SNAT IP, because it should behave the same. Regardless of that,
I did spend some time trying to figure out what exactly is going on.

My understanding is that traffic from E to I is not going through CT at
all, so any reply (I -> E) isn't seen as a reply; it only gets committed
into CT there via the SNAT. That's why I don't understand how the
handshake part can be correct when it shouldn't be. This seems to be in
line with the description from Martin.

Now, for the solution: as Mark mentioned, the pipeline stages used in
this RFC are very wrong for this kind of task. What we need to achieve
is to commit traffic that is heading towards the logical IP (CIDR) of
the SNAT.

As Mark mentioned, this should be done in S_ROUTER_IN_UNSNAT, with the
same match as constructed in build_lrouter_in_unsnat_match, except that
ip.dst should be the logical IP instead of the external one, and with
ct_commit as the action. Unfortunately, ct_commit uses the DNAT zone in
the LR pipeline, meaning we would need to extend that action to allow
something like ct_commit(dnat) and ct_commit(snat).
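
To illustrate (purely a sketch: the ct_commit(snat) action does not
exist yet, and the priority and the address are placeholders), the
extra flow could look something like:

```
table=S_ROUTER_IN_UNSNAT (lr_in_unsnat)
  priority=100, match=(ip && ip4.dst == 10.0.0.0/24),
  action=(ct_commit(snat); next;)
```

where 10.0.0.0/24 stands for the SNAT rule's logical_ip.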

However, before we get to all of that (and I'm not entirely sure
whether it might break anything existing), I would like to know what
the use case for this is. It seems slightly odd to deliberately "leak"
the internal IP and basically leave it unprotected, when setting up
FIPs or an LB is a more viable option IMO.

Let me know if you have any further questions about the proposed solution.

Thanks,
Ales
Martin Kalcok Feb. 8, 2024, 10:12 a.m. UTC | #5
On Thu, 2024-02-08 at 09:30 +0100, Ales Musil wrote:
> 
> 
> On Thu, Feb 8, 2024 at 1:22 AM Mark Michelson <mmichels@redhat.com>
> wrote:
> > Hi Martin
> > 
> > Thanks a bunch for your first OVN contribution. I have comments
> > below.
> > 
> > On 2/7/24 10:56, Martin Kalcok wrote:
> > > When a machine from an external network tries to access machine
> > > in the
> > > OVN's internal network with SNAT enabled, using protocol like
> > > TCP, it
> > > receives connection reset after first packets are successfully
> > > exchanged.
> > > 
> > > This setup may be a bit unusual and requires that whatever is
> > > responsible
> > > for routing in the external network, routes packets for OVN's
> > > internal
> > > network via LR's external port IP. However it is something that
> > > used to
> > > work in ML2/OVS Openstack deployments and got broken after
> > > upgrade to
> > > ML2/OVN.
> > > 
> > > To demonstrate what the traffic looked like from the point of
> > > view of the
> > > external machine, let's say we have:
> > >     * E  (IP of machine in the external network)
> > >     * I  (IP of machine in OVN's internal network)
> > >     * LR (IP of OVN's Logical Router Port connected to external
> > > network)
> > > 
> > > The traffic looks like this:
> > >     * [SYN]      E  -> I
> > >     * [SYN,ACK]  I  -> E
> > >     * [ACK]      E  -> I
> > >     * [PSH,ACK]  E  -> I
> > >     * [ACK]      LR -> E
> > > 
> > > Connection gets reset after it receives an ACK from an unexpected
> > > IP (LR).
> > > 
> > > Although I'm not completely sure why the first [SYN,ACK] reply
> > > got back
> > > without SNAT, root cause of this issue is that traffic initiated
> > > from
> > > the external network is not tracked in CT. This causes even
> > > traffic that's
> > > in the "reply" direction (i.e. it was initiated from outside the
> > > SNAT
> > > network), to be translated by Logical Router.
> > > 
> > > My proposal is to initiate CT and commit "new" connections
> > > destined
> > > to the internal network (with enabled SNAT) on Logical Router.
> > > This
> > > will enable SNAT rules to translate only traffic that originates
> > > from
> > > internal networks.
> > > 
> > > Signed-off-by: Martin Kalcok <martin.kalcok@canonical.com>
> > > ---
> > >    northd/northd.c | 70
> > > ++++++++++++++++++++++++++++++++++++++++++-------
> > >    1 file changed, 61 insertions(+), 9 deletions(-)
> > > 
> > > As this would be my first "real" contribution to the OVN, I'd
> > > love to hear
> > > some feedback on my approach while I work on performance
> > > measurements and
> > > tests, which I plan on including in final proposal.
> > > I also left some inline XXX comments marking parts that I'm not
> > > sure about
> > > and need resolving before final proposal.
> > > I tested this change in my local Openstack deployment with
> > > distributed gw
> > > chassis and it seems to solve the issue regardless of the VM's
> > > placement,
> > > whether it does or does not have a floating IP. It also doesn't
> > > seem to
> > > break outbound connectivity, nor connectivity to VM's floating
> > > IPs.
> > > Thank you for any reviews/suggestions.
> > 
> > Without getting into the details of this particular patch, I have
> > some 
> > general suggestions for contributing:
> > 
> > 1) Before sending a patch, run utilities/checkpatch.py against all 
> > patches in the series. This can help 0-day Robot not to complain
> > about 
> > coding guidelines violations.
> > 
> > 2) Run the testsuite locally before posting. If you see test
> > errors, 
> > double check if your change is what's causing them.

Sorry about the failing tests. I ran the unit tests and checked the
errors (those that were caused by me), but given that I intended this
as an RFC and wasn't completely sure about my approach, I didn't want
to spend too much effort on fixing/adding unit tests that might get
scrapped. In hindsight, I should've put more effort in and posted
clean work for the review.

> > 
> > 3) As a sanity check, consider pushing the code to your github
> > fork, and 
> > ensure that github actions complete without error. If there are any
> > errors, double check that they're not the fault of github (e.g.
> > it's not 
> > your fault if the action fails because the machine runs out of disk
> > space while allocating pods). In this case, The OVS robot links to 
> > https://github.com/ovsrobot/ovn/actions/runs/7817861729/job/21326755181
> > , where there are some compilation errors reported. I've pointed
> > out the 
> > offending lines in the diff below.
> > 
> > 4) Consider adding test cases. In this case, I suggest adding a
> > couple:
> >    * First, you can add one to ovn-northd.at that strictly tests
> > whether 
> > the generated logical flows for SNAT are as expected. It would also
> > be 
> > good to ensure that the flows are updated properly when the SNAT 
> > configuration changes on the logical router.
> >    * Second, I would add a test to system-ovn.at that simulates the
> > broken scenario. It could ensure that packets originated internally
> > are 
> > SNATted, and that replies to packets originated externally are not
> > SNATted.

Thank you, these are great starting points.

> > 
> > > 
> > > diff --git a/northd/northd.c b/northd/northd.c
> > > index 01eec64ca..b695932fd 100644
> > > --- a/northd/northd.c
> > > +++ b/northd/northd.c
> > > @@ -10772,6 +10772,12 @@ build_ecmp_route_flow(struct lflow_table
> > > *lflows, struct ovn_datapath *od,
> > >        ds_destroy(&actions);
> > >    }
> > >    
> > > +static inline bool
> > > +lrouter_use_common_zone(const struct ovn_datapath *od)
> > > +{
> > > +    return !od->is_gw_router && use_common_zone;
> > > +}
> > > +
> > >    static void
> > >    add_route(struct lflow_table *lflows, struct ovn_datapath *od,
> > >              const struct ovn_port *op, const char *lrp_addr_s,
> > > @@ -10796,6 +10802,9 @@ add_route(struct lflow_table *lflows,
> > > struct ovn_datapath *od,
> > 
> > I'm not sure that add_route() is a good fit for this new code. 
> > add_route() is called for every static route, logical router port
> > IP, 
> > load balancer VIP, and NAT external address. It doesn't really make
> > sense to be iterating through the router's NATs when adding a
> > static 
> > route, for instance. Further, this has the potential to attempt to
> > add 
> > duplicate flows since a router can (and likely will) have
> > add_route() 
> > called multiple times for its addresses.
> > 
> > If you need to add new flows related to the configured NAT rules on
> > a 
> > router, I'd suggest looking for an appropriate place in 
> > build_lrouter_nat_defrag_and_lb() or the functions it calls. We
> > iterate 
> > over the configured NAT rules there, so it's a natural fit for
> > these 
> > types of changes.

Thanks, I'll move the logic around and insert rules into more
appropriate tables. My original proposal was based mostly on injecting
rules into tables that I saw traffic hitting on the way in/out in the
ovn-trace. I strongly suspected that they might not be correct.

> > 
> > >        build_route_match(op_inport, rtb_id, network_s, plen,
> > > is_src_route,
> > >                          is_ipv4, &match, &priority, ofs);
> > >    
> > > +    bool use_ct = false;
> > > +    struct ds target_net = DS_EMPTY_INITIALIZER;
> > > +    struct ds snat_port_match = DS_EMPTY_INITIALIZER;
> > >        struct ds common_actions = DS_EMPTY_INITIALIZER;
> > >        struct ds actions = DS_EMPTY_INITIALIZER;
> > >        if (is_discard_route) {
> > > @@ -10808,16 +10817,61 @@ add_route(struct lflow_table *lflows,
> > > struct ovn_datapath *od,
> > >            } else {
> > >                ds_put_format(&common_actions, "ip%s.dst", is_ipv4
> > > ? "4" : "6");
> > >            }
> > > +
> > > +        /* To enable direct connectivity from external networks
> > > to Logical IPs
> > > +         * of VMs/Containers in the SNATed networks, we need to
> > > track
> > > +         * connections initiated from external networks.  This
> > > allows SNAT
> > > +         * rules to avoid SNATing packets in "reply" direction.
> > > */
> > > +        if (od->nbr && od->nbr->n_nat) {
> > > +            ds_put_format(&target_net, "%s/%d", network_s,
> > > plen);
> > > +            for (int i = 0; i < od->nbr->n_nat; i++) {
> > > +                const struct nbrec_nat *nat = od->nbr->nat[i];
> > > +                if (strcmp(nat->type, "snat") ||
> > > +                    strcmp(nat->logical_ip,
> > > ds_cstr(&target_net)) ||
> > 
> > We allow IPv6 addresses for NAT. strcmp isn't a good way to compare
> > IPv6 
> > addresses since the same address can have many string
> > representations.

Ack, thanks.

> > 
> > > +                    lrouter_use_common_zone(od)) {
> > > +                    continue;
> > > +                }
> > > +
> > > +                /* Initiate connection tracking for traffic
> > > destined to
> > > +                 * SNATed network. */
> > > +                use_ct = true;
> > > +
> > > +                /* XXX: Need to figure out what is appropriate
> > > priority for these rules.
> > > +                        All they require is to be after any
> > > existing specific rules
> > > +                        and before the default rule.*/
> > 
> > There aren't any hard and fast rules for picking a priority number,
> > other than what you stated in this comment. As long as it gets the
> > job 
> > done, you can pick any appropriate priority number. It's best to
> > try to 
> > leave some room for new priorities to be inserted in case new rules
> > arise, but other than that, pick whatever you want.

Ack.

> > 
> > > +
> > > +                /* Commit new connections initiated from the
> > > outside of
> > > +                 * SNATed network to CT. */
> > > +                ds_put_format(&snat_port_match,
> > > +                              "outport == %s && ct.new",
> > > +                              op->json_key);
> > > +                ovn_lflow_add_with_hint(lflows, od,
> > > S_ROUTER_IN_ARP_REQUEST, 5,
> > > +                                       
> > > ds_cstr(&snat_port_match),
> > > +                                        "ct_commit; output;",
> > > stage_hint);
> > 
> > I'm not sure I understand the choice of S_ROUTER_IN_ARP_REQUEST for
> > performing the ct_commit; This doesn't have anything to do with ARP
> > requests, so it seems like a different table would fit better here.

So this was one of my biggest struggles. If I understand connection
tracking in OVN correctly (from docs/experimenting), in order to have
access to the connection state (`ct.new`/`ct.rpl`/...) or to use
`ct_commit`, tracking must first be initiated with `ct_next`. This
initiates the connection tracking information for the flow and moves
on to the next table (stage?). For me this meant that I had to alter
at least two tables in each direction: one that calls `ct_next`, and
another that uses the initiated context to either perform `ct_commit`
or match on fields like `ct.rpl`.
The reason I put `ct_commit` here was that this is the last stage in
the ingress pipeline, so the packet is guaranteed to hit it before
being processed by the egress pipeline.
Maybe I just missed it, but I didn't find an action that would initiate
CT information (like `ct_next`) and resubmit the packet to the same
table instead of going to the next one. If there were an option like
this, I could keep the changes to a single stage.
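
For reference, the pair of flows the RFC currently generates looks
roughly like this (simplified, with a placeholder port name):

```
ingress: table=S_ROUTER_IN_ARP_REQUEST
  priority=5, match=(outport == "lr0-public" && ct.new),
  action=(ct_commit; output;)

egress: table=S_ROUTER_OUT_POST_UNDNAT
  priority=5, match=(inport == "lr0-public"),
  action=(ct_next;)
```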

> > 
> > > +                ds_clear(&snat_port_match);
> > > +
> > > +                /* Initiate connection tracking for traffic from
> > > SNATed
> > > +                 * network to enable rules in S_ROUTER_OUT_SNAT
> > > disregard
> > > +                 * traffic in "reply" direction. */
> > > +                ds_put_format(&snat_port_match, "inport == %s",
> > > op->json_key);
> > > +                ovn_lflow_add_with_hint(lflows, od,
> > > S_ROUTER_OUT_POST_UNDNAT,
> > > +                                        5,
> > > ds_cstr(&snat_port_match),
> > > +                                        "ct_next;", stage_hint);
> > 
> > Your two calls to ovn_lflow_add_with_hint() are missing the final 
> > lflow_ref parameter, which is why github actions failed when
> > testing.

Ack, thanks.

> > 
> > > +                break;
> > > +            }
> > > +        }
> > > +
> > >            ds_put_format(&common_actions, "; "
> > >                          "%s = %s; "
> > >                          "eth.src = %s; "
> > >                          "outport = %s; "
> > >                          "flags.loopback = 1; "
> > > -                      "next;",
> > > +                      "%s",
> > >                          is_ipv4 ? REG_SRC_IPV4 : REG_SRC_IPV6,
> > >                          lrp_addr_s,
> > >                          op->lrp_networks.ea_s,
> > > -                      op->json_key);
> > > +                      op->json_key,
> > > +                      use_ct ? "ct_next;" : "next;");
> > 
> > I'm a bit confused by this addition. In your scenario, E sends
> > traffic 
> > to I, and I sends traffic to E. Traffic is never destined for a LRP
> > address, load balancer VIP, or NAT external address. So I don't
> > know why 
> > this flow would be matched in your scenario. Did I miss a detail
> > somewhere?

This flow would match because "I" is part of the destination network
for which this `add_route` is called.

> > 
> > Even if this is necessary for your scenario, I think this is adding
> > conntrack to situations where it shouldn't be present, such as
> > traffic 
> > destined for static routes.

I tried to limit conntrack initiation to only the traffic destined to
networks with SNAT enabled. Anyway, as you pointed out earlier, these
rules should probably go elsewhere.

> > 
> > >            ds_put_format(&actions, "ip.ttl--; %s",
> > > ds_cstr(&common_actions));
> > >        }
> > >    
> > > @@ -10833,6 +10887,8 @@ add_route(struct lflow_table *lflows,
> > > struct ovn_datapath *od,
> > >                                    ds_cstr(&common_actions),\
> > >                                    stage_hint, lflow_ref);
> > >        }
> > > +    ds_destroy(&snat_port_match);
> > > +    ds_destroy(&target_net);
> > >        ds_destroy(&match);
> > >        ds_destroy(&common_actions);
> > >        ds_destroy(&actions);
> > > @@ -10933,12 +10989,6 @@ struct lrouter_nat_lb_flows_ctx {
> > >        const struct shash *meter_groups;
> > >    };
> > >    
> > > -static inline bool
> > > -lrouter_use_common_zone(const struct ovn_datapath *od)
> > > -{
> > > -    return !od->is_gw_router && use_common_zone;
> > > -}
> > > -
> > >    static void
> > >    build_distr_lrouter_nat_flows_for_lb(struct
> > > lrouter_nat_lb_flows_ctx *ctx,
> > >                                         enum
> > > lrouter_nat_lb_flow_type type,
> > > @@ -14627,7 +14677,9 @@ build_lrouter_out_snat_flow(struct
> > > lflow_table *lflows,
> > >        }
> > >        ds_put_cstr(match, " && (!ct.trk || !ct.rpl)");
> > >    
> > > -    ds_put_format(actions, "ct_snat(%s", nat->external_ip);
> > > +    /* ct_commit is needed here otherwise replies to the SNATed
> > > traffic
> > > +     * look like "new" traffic in S_ROUTER_IN_ARP_REQUEST*/
> > > +    ds_put_format(actions, "ct_commit; ct_snat(%s", nat-
> > > >external_ip);
> > >        if (nat->external_port_range[0]) {
> > >            ds_put_format(actions, ",%s", nat-
> > > >external_port_range);
> > >        }
> > I think the reason why this patch works is because it essentially
> > ends 
> > up sending everything to conntrack. The problem is that it's likely
> > sending more than is necessary.
> > 
> > If you look in build_lrouter_out_snat_flow(), you can see that we
> > have " 
> > && (!ct.trk || !ct.rpl)" as part of the match before we attempt to
> > SNAT. 
> > Therefore, we at least are attempting not to SNAT reply traffic.

This may again be my incomplete understanding of the role of `ct_next`,
but in my experiments, these rules with "&& (!ct.trk || !ct.rpl)"
would not actually match the expected reply traffic unless I injected
`ct_next` into the pipeline before this stage.
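
To illustrate the ordering problem (a hypothetical sketch, not the actual
flow table; table names, addresses and priorities are invented):

```
/* Packet is untracked when it reaches the SNAT stage: ct.trk == 0, so
 * "(!ct.trk || !ct.rpl)" is true even for reply traffic, and the reply
 * gets SNATed anyway. */
table=lr_out_snat, match=(ip4.src == 10.0.0.0/24 && (!ct.trk || !ct.rpl)),
    action=(ct_snat(172.16.0.100);)

/* With a ct_next injected in an earlier stage, CT state is recirculated
 * into the pipeline; genuine replies then carry ct.rpl == 1 and the
 * guard correctly skips them. */
table=lr_out_post_undnat, match=(inport == "lr0-internal"), action=(ct_next;)
```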

> > 
> > As you noted, the [SYN,ACK] sent from "I" in your scenario is not 
> > SNATted. This is because the [SYN] from "E" was committed to
> > conntrack 
> > in the S_ROUTER_OUT_POST_UNDNAT stage. Therefore, the [SYN,ACK] was
> > properly seen as a reply packet, and it was not SNATted.

Looking at the S_ROUTER_OUT_POST_UNDNAT stage, the commits happen only
for gateway routers. However, I observed this behavior in a deployment
with a distributed router. I'll keep trying to figure out why the first
reply is correct; it's still a mystery to me.
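
For reference, the gateway-router-only commit Mark refers to looks
roughly like this (paraphrased from the lr_out_post_undnat stage; the
exact match and priority may differ between OVN versions):

```
/* Installed only when od->is_gw_router; distributed routers do not get
 * this flow, which is why the committed-SYN explanation does not
 * obviously apply to the distributed case observed here. */
table=lr_out_post_undnat, priority=50, match=(ip && ct.new),
    action=(ct_commit { } ; next;)
```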

> > 
> > However, for some reason the final [ACK] from "I" was not seen as a
> > reply packet, and therefore it was SNATted. I'm curious if anything
> > about the TCP connection changed between the initial handshake and
> > the 
> > subsequent packets that would make conntrack see the traffic as 
> > "related" instead of part of the same connection. If so, then I
> > think 
> > the trick here is to ensure that ct.rel traffic is committed to 
> > conntrack as well.

I tested this with a very simple combination of `nc -l` on one end and
`telnet` on the other. I captured a pcap [0] and it doesn't show port
changes or anything funky like that.

> > 
> > Ales Musil added a new action type in OVN 22.12 called
> > "ct_commit_nat" . 
> > Its intended purpose was to ensure that in addition to the direct 
> > connection, related traffic would also be committed to conntrack. 
> > Currently, it's only used for DNAT so that ICMP replies from load 
> > balancer backends will be unDNATted.
> > 
> > It seems we need something similar here, but it needs to be used
> > when 
> > traffic hits the UNSNAT stage of the router pipeline. Maybe it
> > would be 
> > enough to extend the ct_commit_nat command to take an optional
> > "snat" or 
> > "dnat" parameter to choose the appropriate conntrack zone to commit
> > to. 
> > Then ct.rel traffic in the S_ROUTER_IN_UNSNAT stage could call 
> > ct_commit_nat(snat) instead of ct_snat to ensure that the related 
> > traffic is committed to conntrack. I'm not 100% sure if that will
> > do the 
> > trick, but if it does, it will ensure a more focused amount of
> > traffic 
> > will get committed to conntrack. It would also be a much simpler
> > and 
> > smaller patch.
> > 
> > @Ales Musil, did my suggestion above make sense? Or am I completely
> > approaching this incorrectly?
> > 
> > 
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > 
> 
> 
> Hi Martin and Mark,
> 
> I'm not quite sure either why the first handshake part doesn't have
> the
> SNAT IP because it should behave the same. Regardless of that
> I did spend some time trying to figure out what is exactly going on.
> 
> My understanding is that traffic from E to I is not going to CT at
> all,
> so any reply (I -> E) isn't seen as a reply because it's committed
> into
> CT there via the SNAT. Which is why I don't get the handshake part
> > being correct when it shouldn't. Which seems to be in line with the
> > description from Martin.
> 
> Now for the solution as Mark mentioned the pipeline stages in this
> RFC
> are very wrong for this kind of task. What we need to achieve is to
> commit traffic that is heading towards the logical ip (cidr) of SNAT.
> 
> As Mark mentioned this should be done in S_ROUTER_IN_UNSNAT
> with the match being the same as constructed in
> build_lrouter_in_unsnat_match with exception that the ip.dst should
> be
> the logical ip instead of the external one with action being
> ct_commit,
> unfortunately ct_commit is using the DNAT zone for LR pipeline.
> Meaning we would need to extend that action to allow something like
> ct_commit(dnat), ct_commit(snat). 
> 
> However, before we get to all of that, which I'm not entirely sure if
> that
> might break anything existing, I would like to know what is the use
> case for
> this. It seems slightly odd that it is desired to "leak" the internal
> IP
> and basically leaving it unprotected when setting up FIPS or LB is a
> more
> viable option IMO.
> 
> Let me know if you have any further questions about the proposed
> solution.
> 
> Thanks,
> Ales
> -- 
>  Ales Musil 
>  
> Senior Software Engineer - OVN Core 
>  
>  Red Hat EMEA 
>   
>  amusil@redhat.com  
>  
> 

Thank you all for an insightful review. There's a lot for me to think
about here.
As for the actual use case, I'll double-check to make sure I have all
the details and get back to you. Right now I mostly know that this is a
use case that exists; it used to work in an ML2/OVS Openstack
deployment and broke after upgrading to ML2/OVN.

[0] https://pastebin.com/gmbNP1Bg
Martin Kalcok Feb. 8, 2024, 8:39 p.m. UTC | #6
> On 8 Feb 2024, at 11:12, martin.kalcok@canonical.com wrote:
> 
> On Thu, 2024-02-08 at 09:30 +0100, Ales Musil wrote:
>> 
>> 
>> On Thu, Feb 8, 2024 at 1:22 AM Mark Michelson <mmichels@redhat.com>
>> wrote:
>>> Hi Martin
>>> 
>>> Thanks a bunch for your first OVN contribution. I have comments
>>> below.
>>> 
>>> On 2/7/24 10:56, Martin Kalcok wrote:
>>>> When a machine from an external network tries to access machine
>>>> in the
>>>> OVN's internal network with SNAT enabled, using protocol like
>>>> TCP, it
>>>> receives connection reset after first packets are successfully
>>>> exchanged.
>>>> 
>>>> This setup may be a bit unusual and requires that whatever is
>>>> responsible
>>>> for routing in the external network, routes packets for OVN's
>>>> internal
>>>> network via LR's external port IP. However it is something that
>>>> used to
>>>> work in ML2/OVS Openstack deployments and got broken after
>>>> upgrade to
>>>> ML2/OVN.
>>>> 
>>>> To demonstrate what the traffic looked like from the point of
>>>> view of the
>>>> external machine, lets say we have:
>>>>     * E  (IP of machine in the external network)
>>>>     * I  (IP of machine in OVN's internal network)
>>>>     * LR (IP of OVN's Logical Router Port connected to external
>>>> network)
>>>> 
>>>> The traffic looks like this:
>>>>     * [SYN]      E  -> I
>>>>     * [SYN,ACK]  I  -> E
>>>>     * [ACK]      E  -> I
>>>>     * [PSH,ACK]  E  -> I
>>>>     * [ACK]      LR -> E
>>>> 
>>>> Connection gets reseted after it receives ACK from an unexpected
>>>> IP (LR).
>>>> 
>>>> Although I'm not completely sure why the first [SYN,ACK] reply
>>>> got back
>>>> without SNAT, root cause of this issue is that traffic initiated
>>>> from
>>>> the external network is not tracked in CT. This causes even
>>>> traffic that's
>>>> in the "reply" direction (i.e. it was initiated from outside the
>>>> SNAT
>>>> network), to be translated by Logical Router.
>>>> 
>>>> My proposal is to initiate CT and commit "new" connections
>>>> destined
>>>> to the internal network (with enabled SNAT) on Logical Router.
>>>> This
>>>> will enable SNAT rules to translate only traffic that originates
>>>> from
>>>> internal networks.
>>>> 
>>>> Signed-off-by: Martin Kalcok <martin.kalcok@canonical.com>
>>>> ---
>>>>    northd/northd.c | 70
>>>> ++++++++++++++++++++++++++++++++++++++++++-------
>>>>    1 file changed, 61 insertions(+), 9 deletions(-)
>>>> 
>>>> As this would be my first "real" contribution to the OVN, I'd
>>>> love to hear
>>>> some feeback on my approach while I work on performance
>>>> measurments and
>>>> tests, which I plan on including in final proposal.
>>>> I also left some inline XXX comments marking parts that I'm not
>>>> sure about
>>>> and need resolving before final proposal.
>>>> I tested this change in my local Openstack deployment with
>>>> distributed gw
>>>> chassis and it seems to solve the issue regardless of the VM's
>>>> placement,
>>>> whether it does or does not have a floating IP. It also doesn't
>>>> seem to
>>>> break outbound connectivity, nor connectivity to VM's floating
>>>> IPs.
>>>> Thank you, for any reviews/sugestions.
>>> 
>>> Without getting into the details of this particular patch, I have
>>> some 
>>> general suggestions for contributing:
>>> 
>>> 1) Before sending a patch, run utilities/checkpatch.py against all 
>>> patches in the series. This can help 0-day Robot not to complain
>>> about 
>>> coding guidelines violations.
>>> 
>>> 2) Run the testsuite locally before posting. If you see test
>>> errors, 
>>> double check if your change is what's causing them.
> 
> Sorry about the failing tests. I ran the unit tests and checked the
> errors (those that were caused by me), but given that I intended this as
> an RFC and I wasn't completely sure about my approach, I didn't want to
> spend too much effort on fixing/adding unit tests that might get
> scrapped. In hindsight, I should've put more effort in and posted clean
> work for the review. 
> 
>>> 
>>> 3) As a sanity check, consider pushing the code to your github
>>> fork, and 
>>> ensure that github actions complete without error. If there are any
>>> errors, double check that they're not the fault of github (e.g.
>>> it's not 
>>> your fault if the action fails because the machine runs out of disk
>>> space while allocating pods). In this case, The OVS robot links to 
>>> https://github.com/ovsrobot/ovn/actions/runs/7817861729/job/21326755181
>>> , where there are some compilation errors reported. I've pointed
>>> out the 
>>> offending lines in the diff below.
>>> 
>>> 4) Consider adding test cases. In this case, I suggest adding a
>>> couple:
>>>    * First, you can add one to ovn-northd.at that strictly tests
>>> whether 
>>> the generated logical flows for SNAT are as expected. It would also
>>> be 
>>> good to ensure that the flows are updated properly when the SNAT 
>>> configuration changes on the logical router.
>>>    * Second, I would add a test to system-ovn.at that simulates the
>>> broken scenario. It could ensure that packets originated internally
>>> are 
>>> SNATted, and that replies to packets originated externally are not
>>> SNATted.
> 
> Thank you, these are great starting points.
> 
>>> 
>>>> 
>>>> diff --git a/northd/northd.c b/northd/northd.c
>>>> index 01eec64ca..b695932fd 100644
>>>> --- a/northd/northd.c
>>>> +++ b/northd/northd.c
>>>> @@ -10772,6 +10772,12 @@ build_ecmp_route_flow(struct lflow_table
>>>> *lflows, struct ovn_datapath *od,
>>>>        ds_destroy(&actions);
>>>>    }
>>>>    
>>>> +static inline bool
>>>> +lrouter_use_common_zone(const struct ovn_datapath *od)
>>>> +{
>>>> +    return !od->is_gw_router && use_common_zone;
>>>> +}
>>>> +
>>>>    static void
>>>>    add_route(struct lflow_table *lflows, struct ovn_datapath *od,
>>>>              const struct ovn_port *op, const char *lrp_addr_s,
>>>> @@ -10796,6 +10802,9 @@ add_route(struct lflow_table *lflows,
>>>> struct ovn_datapath *od,
>>> 
>>> I'm not sure that add_route() is a good fit for this new code. 
>>> add_route() is called for every static route, logical router port
>>> IP, 
>>> load balancer VIP, and NAT external address. It doesn't really make
>>> sense to be iterating through the router's NATs when adding a
>>> static 
>>> route, for instance. Further, this has the potential to attempt to
>>> add 
>>> duplicate flows since a router can (and likely will) have
>>> add_route() 
>>> called multiple times for its addresses.
>>> 
>>> If you need to add new flows related to the configured NAT rules on
>>> a 
>>> router, I'd suggest looking for an appropriate place in 
>>> build_lrouter_nat_defrag_and_lb() or the functions it calls. We
>>> iterate 
>>> over the configured NAT rules there, so it's a natural fit for
>>> these 
>>> types of changes.
> 
> Thanks, I'll move the logic around and insert rules into more
> appropriate tables. My original proposal was based mostly on injecting
> rules into tables that I saw traffic hitting on the way in/out in the
> ovn-trace. I strongly suspected that they might not be correct.
> 
>>> 
>>>>        build_route_match(op_inport, rtb_id, network_s, plen,
>>>> is_src_route,
>>>>                          is_ipv4, &match, &priority, ofs);
>>>>    
>>>> +    bool use_ct = false;
>>>> +    struct ds target_net = DS_EMPTY_INITIALIZER;
>>>> +    struct ds snat_port_match = DS_EMPTY_INITIALIZER;
>>>>        struct ds common_actions = DS_EMPTY_INITIALIZER;
>>>>        struct ds actions = DS_EMPTY_INITIALIZER;
>>>>        if (is_discard_route) {
>>>> @@ -10808,16 +10817,61 @@ add_route(struct lflow_table *lflows,
>>>> struct ovn_datapath *od,
>>>>            } else {
>>>>                ds_put_format(&common_actions, "ip%s.dst", is_ipv4
>>>> ? "4" : "6");
>>>>            }
>>>> +
>>>> +        /* To enable direct connectivity from external networks
>>>> to Logical IPs
>>>> +         * of VMs/Containers in the SNATed networks, we need to
>>>> track
>>>> +         * connections initiated from external networks.  This
>>>> allows SNAT
>>>> +         * rules to avoid SNATing packets in "reply" direction.
>>>> */
>>>> +        if (od->nbr && od->nbr->n_nat) {
>>>> +            ds_put_format(&target_net, "%s/%d", network_s,
>>>> plen);
>>>> +            for (int i = 0; i < od->nbr->n_nat; i++) {
>>>> +                const struct nbrec_nat *nat = od->nbr->nat[i];
>>>> +                if (strcmp(nat->type, "snat") ||
>>>> +                    strcmp(nat->logical_ip,
>>>> ds_cstr(&target_net)) ||
>>> 
>>> We allow IPv6 addresses for NAT. strcmp isn't a good way to compare
>>> IPv6 
>>> addresses since the same address can have many string
>>> representations.
> 
> Ack, thanks.
> 
>>> 
>>>> +                    lrouter_use_common_zone(od)) {
>>>> +                    continue;
>>>> +                }
>>>> +
>>>> +                /* Initiate connection tracking for traffic
>>>> destined to
>>>> +                 * SNATed network. */
>>>> +                use_ct = true;
>>>> +
>>>> +                /* XXX: Need to figure out what is appropraite
>>>> priority for these rules.
>>>> +                        All they require is to be after any
>>>> existing specific rules
>>>> +                        and before the default rule.*/
>>> 
>>> There aren't any hard and fast rules for picking a priority number,
>>> other than what you stated in this comment. As long as it gets the
>>> job 
>>> done, you can pick any appropriate priority number. It's best to
>>> try to 
>>> leave some room for new priorities to be inserted in case new rules
>>> arise, but other than that, pick whatever you want.
> 
> Ack.
> 
>>> 
>>>> +
>>>> +                /* Commit new connections initiated from the
>>>> outside of
>>>> +                 * SNATed network to CT. */
>>>> +                ds_put_format(&snat_port_match,
>>>> +                              "outport == %s && ct.new",
>>>> +                              op->json_key);
>>>> +                ovn_lflow_add_with_hint(lflows, od,
>>>> S_ROUTER_IN_ARP_REQUEST, 5,
>>>> +                                       
>>>> ds_cstr(&snat_port_match),
>>>> +                                        "ct_commit; output;",
>>>> stage_hint);
>>> 
>>> I'm not sure I understand the choice of S_ROUTER_IN_ARP_REQUEST for
>>> performing the ct_commit; This doesn't have anything to do with ARP
>>> requests, so it seems like a different table would fit better here.
> 
> So this was one of my biggest struggles. If I understand connection
> tracking in OVN correctly (from docs/experimenting), in order to have
> access to connection state (`ct.new`/`ct.rpl`/...) or to use
> `ct_commit`, it must first be initiated with `ct_next`. This initiates
> connection tracking information for the flow and moves on to the next
> table (stage?). This meant for me that I had to alter at least two
> tables in each direction. One that would call `ct_next` and another
> that would use this initiated context to perform either `ct_commit` or
> match on fields like `ct.rpl`.
> The reason why I put `ct_commit` here was that it was the last stage in
> ingress pipeline and packet was guaranteed to hit it before being
> processed by egress pipeline.
> Maybe I just missed it, but I didn't find an action that would initiate
> CT information (like `ct_next`) and resubmit the packet to the same
> table instead of going to the next. If there was an option like this, I
> could keep changes to a single stage.
> 
>>> 
>>>> +                ds_clear(&snat_port_match);
>>>> +
>>>> +                /* Initiate connection tracking for traffic from
>>>> SNATed
>>>> +                 * network to enable rules in S_ROUTER_OUT_SNAT
>>>> disregard
>>>> +                 * traffic in "reply" direction. */
>>>> +                ds_put_format(&snat_port_match, "inport == %s",
>>>> op->json_key);
>>>> +                ovn_lflow_add_with_hint(lflows, od,
>>>> S_ROUTER_OUT_POST_UNDNAT,
>>>> +                                        5,
>>>> ds_cstr(&snat_port_match),
>>>> +                                        "ct_next;", stage_hint);
>>> 
>>> Your two calls to ovn_lflow_add_with_hint() are missing the final 
>>> lflow_ref parameter, which is why github actions failed when
>>> testing.
> 
> Ack, thanks.
> 
>>> 
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +
>>>>            ds_put_format(&common_actions, "; "
>>>>                          "%s = %s; "
>>>>                          "eth.src = %s; "
>>>>                          "outport = %s; "
>>>>                          "flags.loopback = 1; "
>>>> -                      "next;",
>>>> +                      "%s",
>>>>                          is_ipv4 ? REG_SRC_IPV4 : REG_SRC_IPV6,
>>>>                          lrp_addr_s,
>>>>                          op->lrp_networks.ea_s,
>>>> -                      op->json_key);
>>>> +                      op->json_key,
>>>> +                      use_ct ? "ct_next;" : "next;");
>>> 
>>> I'm a bit confused by this addition. In your scenario, E sends
>>> traffic 
>>> to I, and I sends traffic to E. Traffic is never destined for a LRP
>>> address, load balancer VIP, or NAT external address. So I don't
>>> know why 
>>> this flow would be matched in your scenario. Did I miss a detail
>>> somewhere?
> 
> This flow would match because "I" is part of destination network for
> which this `add_route` is called.
> 
>>> 
>>> Even if this is necessary for your scenario, I think this is adding
>>> conntrack to situations where it shouldn't be present, such as
>>> traffic 
>>> destined for static routes.
> 
> I tried to limit conntrack initiation only to traffic destined for
> networks with SNAT enabled. Anyway, as you pointed out earlier, these
> rules should probably go elsewhere.
> 
>>> 
>>>>            ds_put_format(&actions, "ip.ttl--; %s",
>>>> ds_cstr(&common_actions));
>>>>        }
>>>>    
>>>> @@ -10833,6 +10887,8 @@ add_route(struct lflow_table *lflows,
>>>> struct ovn_datapath *od,
>>>>                                    ds_cstr(&common_actions),\
>>>>                                    stage_hint, lflow_ref);
>>>>        }
>>>> +    ds_destroy(&snat_port_match);
>>>> +    ds_destroy(&target_net);
>>>>        ds_destroy(&match);
>>>>        ds_destroy(&common_actions);
>>>>        ds_destroy(&actions);
>>>> @@ -10933,12 +10989,6 @@ struct lrouter_nat_lb_flows_ctx {
>>>>        const struct shash *meter_groups;
>>>>    };
>>>>    
>>>> -static inline bool
>>>> -lrouter_use_common_zone(const struct ovn_datapath *od)
>>>> -{
>>>> -    return !od->is_gw_router && use_common_zone;
>>>> -}
>>>> -
>>>>    static void
>>>>    build_distr_lrouter_nat_flows_for_lb(struct
>>>> lrouter_nat_lb_flows_ctx *ctx,
>>>>                                         enum
>>>> lrouter_nat_lb_flow_type type,
>>>> @@ -14627,7 +14677,9 @@ build_lrouter_out_snat_flow(struct
>>>> lflow_table *lflows,
>>>>        }
>>>>        ds_put_cstr(match, " && (!ct.trk || !ct.rpl)");
>>>>    
>>>> -    ds_put_format(actions, "ct_snat(%s", nat->external_ip);
>>>> +    /* ct_commit is needed here otherwise replies to the SNATed
>>>> traffic
>>>> +     * look like "new" traffic in S_ROUTER_IN_ARP_REQUEST*/
>>>> +    ds_put_format(actions, "ct_commit; ct_snat(%s", nat-
>>>>> external_ip);
>>>>        if (nat->external_port_range[0]) {
>>>>            ds_put_format(actions, ",%s", nat-
>>>>> external_port_range);
>>>>        }
>>> I think the reason why this patch works is because it essentially
>>> ends 
>>> up sending everything to conntrack. The problem is that it's likely
>>> sending more than is necessary.
>>> 
>>> If you look in build_lrouter_out_snat_flow(), you can see that we
>>> have " 
>>> && (!ct.trk || !ct.rpl)" as part of the match before we attempt to
>>> SNAT. 
>>> Therefore, we at least are attempting not to SNAT reply traffic.
> 
> This may again be my incomplete understanding of the role of `ct_next`,
> but in my experiments, these rules with "&& (!ct.trk || !ct.rpl)"
> would not actually match the expected reply traffic unless I injected
> `ct_next` into the pipeline before this stage.
> 
>>> 
>>> As you noted, the [SYN,ACK] sent from "I" in your scenario is not 
>>> SNATted. This is because the [SYN] from "E" was committed to
>>> conntrack 
>>> in the S_ROUTER_OUT_POST_UNDNAT stage. Therefore, the [SYN,ACK] was
>>> properly seen as a reply packet, and it was not SNATted.
> 
> Looking at the S_ROUTER_OUT_POST_UNDNAT stage, the commits happen only
> for gateway routers. However, I observed this behavior in a deployment
> with a distributed router. I'll keep trying to figure out why the first
> reply is correct; it's still a mystery to me.
> 
>>> 
>>> However, for some reason the final [ACK] from "I" was not seen as a
>>> reply packet, and therefore it was SNATted. I'm curious if anything
>>> about the TCP connection changed between the initial handshake and
>>> the 
>>> subsequent packets that would make conntrack see the traffic as 
>>> "related" instead of part of the same connection. If so, then I
>>> think 
>>> the trick here is to ensure that ct.rel traffic is committed to 
>>> conntrack as well.
> 
> I tested this with a very simple combination of `nc -l` on one end and
> `telnet` on the other. I captured a pcap [0] and it doesn't show port
> changes or anything funky like that.
> 
>>> 
>>> Ales Musil added a new action type in OVN 22.12 called
>>> "ct_commit_nat" . 
>>> Its intended purpose was to ensure that in addition to the direct 
>>> connection, related traffic would also be committed to conntrack. 
>>> Currently, it's only used for DNAT so that ICMP replies from load 
>>> balancer backends will be unDNATted.
>>> 
>>> It seems we need something similar here, but it needs to be used
>>> when 
>>> traffic hits the UNSNAT stage of the router pipeline. Maybe it
>>> would be 
>>> enough to extend the ct_commit_nat command to take an optional
>>> "snat" or 
>>> "dnat" parameter to choose the appropriate conntrack zone to commit
>>> to. 
>>> Then ct.rel traffic in the S_ROUTER_IN_UNSNAT stage could call 
>>> ct_commit_nat(snat) instead of ct_snat to ensure that the related 
>>> traffic is committed to conntrack. I'm not 100% sure if that will
>>> do the 
>>> trick, but if it does, it will ensure a more focused amount of
>>> traffic 
>>> will get committed to conntrack. It would also be a much simpler
>>> and 
>>> smaller patch.
>>> 
>>> @Ales Musil, did my suggestion above make sense? Or am I completely
>>> approaching this incorrectly?
>>> 
>>> 
>>> _______________________________________________
>>> dev mailing list
>>> dev@openvswitch.org
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>> 
>> 
>> 
>> Hi Martin and Mark,
>> 
>> I'm not quite sure either why the first handshake part doesn't have
>> the
>> SNAT IP because it should behave the same. Regardless of that
>> I did spend some time trying to figure out what is exactly going on.
>> 
>> My understanding is that traffic from E to I is not going to CT at
>> all,
>> so any reply (I -> E) isn't seen as a reply because it's committed
>> into
>> CT there via the SNAT. Which is why I don't get the handshake part
>> being correct when it shouldn't. Which seems to be in line with the
>> description from Martin.
>> 
>> Now for the solution as Mark mentioned the pipeline stages in this
>> RFC
>> are very wrong for this kind of task. What we need to achieve is to
>> commit traffic that is heading towards the logical ip (cidr) of SNAT.
>> 
>> As Mark mentioned this should be done in S_ROUTER_IN_UNSNAT
>> with the match being the same as constructed in
>> build_lrouter_in_unsnat_match with exception that the ip.dst should
>> be
>> the logical ip instead of the external one with action being
>> ct_commit,
>> unfortunately ct_commit is using the DNAT zone for LR pipeline.
>> Meaning we would need to extend that action to allow something like
>> ct_commit(dnat), ct_commit(snat). 
>> 
>> However, before we get to all of that, which I'm not entirely sure if
>> that
>> might break anything existing, I would like to know what is the use
>> case for
>> this. It seems slightly odd that it is desired to "leak" the internal
>> IP
>> and basically leaving it unprotected when setting up FIPS or LB is a
>> more
>> viable option IMO.

So, an example of a use case is an Openstack cloud with hundreds of
subnets. Each subnet has its own router. VMs on these subnets should use
SNAT when talking to the internet (via the ext. net), but at the same
time these VMs should be reachable directly by other machines on the
ext. net (outside of the Openstack cloud).
As I mentioned above, this is a use case that worked previously with
ML2/OVS Openstack.
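
As a concrete sketch of this topology (router name and addresses are
invented for illustration), the OVN side would carry a plain SNAT rule,
while the external network's router adds a static route pointing at the
LRP's external IP:

```shell
# SNAT for the internal subnet behind logical router lr0
# (LRP external IP 172.16.0.100, internal network 10.0.0.0/24).
ovn-nbctl lr-nat-add lr0 snat 172.16.0.100 10.0.0.0/24

# On the external router/host: reach the internal VMs directly by
# routing their subnet via the LRP's external IP.
ip route add 10.0.0.0/24 via 172.16.0.100
```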

>> 
>> Let me know if you have any further questions about the proposed
>> solution.
>> 
>> Thanks,
>> Ales
>> -- 
>>  Ales Musil 
>>  
>> Senior Software Engineer - OVN Core 
>>  
>>  Red Hat EMEA 
>>   
>>  amusil@redhat.com  
>>  
>> 
> 
> Thank you all for an insightful review. There's a lot for me to think
> about here.
> As for the actual use case, I'll double-check to make sure I have all
> the details and get back to you. Right now I mostly know that this is a
> use case that exists; it used to work in an ML2/OVS Openstack
> deployment and broke after upgrading to ML2/OVN.
> 
> [0] https://pastebin.com/gmbNP1Bg
diff mbox series

Patch

diff --git a/northd/northd.c b/northd/northd.c
index 01eec64ca..b695932fd 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -10772,6 +10772,12 @@  build_ecmp_route_flow(struct lflow_table *lflows, struct ovn_datapath *od,
     ds_destroy(&actions);
 }
 
+static inline bool
+lrouter_use_common_zone(const struct ovn_datapath *od)
+{
+    return !od->is_gw_router && use_common_zone;
+}
+
 static void
 add_route(struct lflow_table *lflows, struct ovn_datapath *od,
           const struct ovn_port *op, const char *lrp_addr_s,
@@ -10796,6 +10802,9 @@  add_route(struct lflow_table *lflows, struct ovn_datapath *od,
     build_route_match(op_inport, rtb_id, network_s, plen, is_src_route,
                       is_ipv4, &match, &priority, ofs);
 
+    bool use_ct = false;
+    struct ds target_net = DS_EMPTY_INITIALIZER;
+    struct ds snat_port_match = DS_EMPTY_INITIALIZER;
     struct ds common_actions = DS_EMPTY_INITIALIZER;
     struct ds actions = DS_EMPTY_INITIALIZER;
     if (is_discard_route) {
@@ -10808,16 +10817,61 @@  add_route(struct lflow_table *lflows, struct ovn_datapath *od,
         } else {
             ds_put_format(&common_actions, "ip%s.dst", is_ipv4 ? "4" : "6");
         }
+
+        /* To enable direct connectivity from external networks to Logical IPs
+         * of VMs/Containers in the SNATed networks, we need to track
+         * connections initiated from external networks.  This allows SNAT
+         * rules to avoid SNATing packets in "reply" direction. */
+        if (od->nbr && od->nbr->n_nat) {
+            ds_put_format(&target_net, "%s/%d", network_s, plen);
+            for (int i = 0; i < od->nbr->n_nat; i++) {
+                const struct nbrec_nat *nat = od->nbr->nat[i];
+                if (strcmp(nat->type, "snat") ||
+                    strcmp(nat->logical_ip, ds_cstr(&target_net)) ||
+                    lrouter_use_common_zone(od)) {
+                    continue;
+                }
+
+                /* Initiate connection tracking for traffic destined to
+                 * SNATed network. */
+                use_ct = true;
+
+                /* XXX: Need to figure out what is appropraite priority for these rules.
+                        All they require is to be after any existing specific rules
+                        and before the default rule.*/
+
+                /* Commit new connections initiated from the outside of
+                 * SNATed network to CT. */
+                ds_put_format(&snat_port_match,
+                              "outport == %s && ct.new",
+                              op->json_key);
+                ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ARP_REQUEST, 5,
+                                        ds_cstr(&snat_port_match),
+                                        "ct_commit; output;", stage_hint);
+                ds_clear(&snat_port_match);
+
+                /* Initiate connection tracking for traffic from SNATed
+                 * network to enable rules in S_ROUTER_OUT_SNAT disregard
+                 * traffic in "reply" direction. */
+                ds_put_format(&snat_port_match, "inport == %s", op->json_key);
+                ovn_lflow_add_with_hint(lflows, od, S_ROUTER_OUT_POST_UNDNAT,
+                                        5, ds_cstr(&snat_port_match),
+                                        "ct_next;", stage_hint);
+                break;
+            }
+        }
+
         ds_put_format(&common_actions, "; "
                       "%s = %s; "
                       "eth.src = %s; "
                       "outport = %s; "
                       "flags.loopback = 1; "
-                      "next;",
+                      "%s",
                       is_ipv4 ? REG_SRC_IPV4 : REG_SRC_IPV6,
                       lrp_addr_s,
                       op->lrp_networks.ea_s,
-                      op->json_key);
+                      op->json_key,
+                      use_ct ? "ct_next;" : "next;");
         ds_put_format(&actions, "ip.ttl--; %s", ds_cstr(&common_actions));
     }
 
@@ -10833,6 +10887,8 @@  add_route(struct lflow_table *lflows, struct ovn_datapath *od,
                                 ds_cstr(&common_actions),\
                                 stage_hint, lflow_ref);
     }
+    ds_destroy(&snat_port_match);
+    ds_destroy(&target_net);
     ds_destroy(&match);
     ds_destroy(&common_actions);
     ds_destroy(&actions);
@@ -10933,12 +10989,6 @@  struct lrouter_nat_lb_flows_ctx {
     const struct shash *meter_groups;
 };
 
-static inline bool
-lrouter_use_common_zone(const struct ovn_datapath *od)
-{
-    return !od->is_gw_router && use_common_zone;
-}
-
 static void
 build_distr_lrouter_nat_flows_for_lb(struct lrouter_nat_lb_flows_ctx *ctx,
                                      enum lrouter_nat_lb_flow_type type,
@@ -14627,7 +14677,9 @@  build_lrouter_out_snat_flow(struct lflow_table *lflows,
     }
     ds_put_cstr(match, " && (!ct.trk || !ct.rpl)");
 
-    ds_put_format(actions, "ct_snat(%s", nat->external_ip);
+    /* ct_commit is needed here otherwise replies to the SNATed traffic
+     * look like "new" traffic in S_ROUTER_IN_ARP_REQUEST*/
+    ds_put_format(actions, "ct_commit; ct_snat(%s", nat->external_ip);
     if (nat->external_port_range[0]) {
         ds_put_format(actions, ",%s", nat->external_port_range);
     }