diff mbox series

[ovs-dev,ovn] ovn: introduce IP_SRC_POLICY stage in ingress router pipeline

Message ID 1e51c076f452530084f6a31974858abe7346240d.1589906317.git.lorenzo.bianconi@redhat.com
State New
Headers show
Series [ovs-dev,ovn] ovn: introduce IP_SRC_POLICY stage in ingress router pipeline | expand

Commit Message

Lorenzo Bianconi May 19, 2020, 4:42 p.m. UTC
In order to fix the issues introduced by commit
c0bf32d72f8b ("Manage ARP process locally in a DVR scenario "), restore
previous configuration of table 9 in ingress router pipeline and
introduce a new stage called 'ip_src_policy' used to set the src address
info in order to not distribute FIP traffic if DVR is enabled

Fixes: c0bf32d72f8b ("Manage ARP process locally in a DVR scenario ")
Tested-by: Jakub Libosvar <jlibosva@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
---
Changes since RFC:
- added unit-tests
---
 northd/ovn-northd.8.xml | 65 ++++++++++++++++++++---------------------
 northd/ovn-northd.c     | 38 ++++++++++--------------
 tests/ovn.at            | 28 +++++-------------
 tests/system-ovn.at     | 28 ++++++++++++++++++
 4 files changed, 82 insertions(+), 77 deletions(-)

Comments

Dumitru Ceara May 20, 2020, 2:37 p.m. UTC | #1
On 5/19/20 6:42 PM, Lorenzo Bianconi wrote:
> In order to fix the issues introduced by commit
> c0bf32d72f8b ("Manage ARP process locally in a DVR scenario "), restore
> previous configuration of table 9 in ingress router pipeline and
> introduce a new stage called 'ip_src_policy' used to set the src address
> info in order to not distribute FIP traffic if DVR is enabled
> 
> Fixes: c0bf32d72f8b ("Manage ARP process locally in a DVR scenario ")
> Tested-by: Jakub Libosvar <jlibosva@redhat.com>
> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>

Hi Lorenzo,

As I mentioned on the RFC patch, if there is a way that we can fix this
issue without the need for a new pipeline stage I think that would be
preferable. I don't see an easy way to do that but maybe Han and others
have suggestions.

> ---
> Changes since RFC:
> - added unit-tests
> ---
>  northd/ovn-northd.8.xml | 65 ++++++++++++++++++++---------------------
>  northd/ovn-northd.c     | 38 ++++++++++--------------
>  tests/ovn.at            | 28 +++++-------------
>  tests/system-ovn.at     | 28 ++++++++++++++++++
>  4 files changed, 82 insertions(+), 77 deletions(-)
> 
> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> index 8f224b07f..09dbb52b4 100644
> --- a/northd/ovn-northd.8.xml
> +++ b/northd/ovn-northd.8.xml
> @@ -2484,37 +2484,6 @@ output;
>          </p>
>        </li>
>  
> -      <li>
> -        <p>
> -          For distributed logical routers where one of the logical router ports
> -          specifies a <code>redirect-chassis</code>, a priority-400 logical
> -          flow for each <code>dnat_and_snat</code> NAT rules configured.
> -          These flows will allow to properly forward traffic to the external
> -          connections if available and avoid sending it through the tunnel.
> -          Assuming the following NAT rule has been configured:
> -        </p>
> -
> -        <pre>
> -external_ip = <var>A</var>;
> -external_mac = <var>B</var>;
> -logical_ip = <var>C</var>;
> -        </pre>
> -
> -        <p>
> -          the following action will be applied:
> -        </p>
> -
> -        <pre>
> -ip.ttl--;
> -reg0 = <var>ip.dst</var>;
> -reg1 = <var>A</var>;
> -eth.src = <var>B</var>;
> -outport = <var>router-port</var>;
> -next;
> -        </pre>
> -
> -      </li>
> -
>        <li>
>          <p>
>            IPv4 routing table.  For each route to IPv4 network <var>N</var> with
> @@ -2660,7 +2629,35 @@ outport = <var>P</var>;
>        </li>
>      </ul>
>  
> -    <h3>Ingress Table 12: ARP/ND Resolution</h3>
> +    <h3>Ingress Table 12: IP Source Policy</h3>
> +
> +    <p>
> +      This table contains for distributed logical routers where one of
> +      the logical router ports specifies a <code>redirect-chassis</code>,
> +      a priority-100 logical flow for each <code>dnat_and_snat</code>
> +      NAT rules configured.
> +      These flows will allow to properly forward traffic to the external
> +      connections if available and avoid sending it through the tunnel.
> +      Assuming the following NAT rule has been configured:
> +    </p>
> +
> +    <pre>
> +external_ip = <var>A</var>;
> +external_mac = <var>B</var>;
> +logical_ip = <var>C</var>;
> +    </pre>
> +
> +    <p>
> +      the following action will be applied:
> +    </p>
> +
> +    <pre>
> +reg1 = <var>A</var>;
> +eth.src = <var>B</var>;
> +next;
> +    </pre>
> +

This section of the man page will show what action we perform on
traffic, i.e., move external_ip in reg1 (for SNAT) and move external_mac
in eth.src but there's no mention about what kind of traffic will match
these flows, that is: "ip.src == logical_ip &&
is_chassis_resident(logical_port)".

> +    <h3>Ingress Table 13: ARP/ND Resolution</h3>
>  
>      <p>
>        Any packet that reaches this table is an IP packet whose next-hop
> @@ -2819,7 +2816,7 @@ outport = <var>P</var>;
>  
>      </ul>
>  
> -    <h3>Ingress Table 13: Check packet length</h3>
> +    <h3>Ingress Table 14: Check packet length</h3>
>  
>      <p>
>        For distributed logical routers with distributed gateway port configured
> @@ -2849,7 +2846,7 @@ REGBIT_PKT_LARGER = check_pkt_larger(<var>L</var>); next;
>        and advances to the next table.
>      </p>
>  
> -    <h3>Ingress Table 14: Handle larger packets</h3>
> +    <h3>Ingress Table 15: Handle larger packets</h3>
>  

Indices for tables "Gateway Redirect" and "ARP Request" should be
updated too.

>      <p>
>        For distributed logical routers with distributed gateway port configured
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index 3c0070ea7..d5f3997a9 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -175,11 +175,12 @@ enum ovn_stage {
>      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9, "lr_in_ip_routing")   \
>      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10, "lr_in_ip_routing_ecmp") \
>      PIPELINE_STAGE(ROUTER, IN,  POLICY,          11, "lr_in_policy")       \
> -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12, "lr_in_arp_resolve")  \
> -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13, "lr_in_chk_pkt_len")   \
> -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     14,"lr_in_larger_pkts")   \
> -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15, "lr_in_gw_redirect")  \
> -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16, "lr_in_arp_request")  \
> +    PIPELINE_STAGE(ROUTER, IN,  IP_SRC_POLICY,   12, "lr_in_ip_src_policy") \
> +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13, "lr_in_arp_resolve")  \
> +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14, "lr_in_chk_pkt_len")   \
> +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     15,"lr_in_larger_pkts")   \
> +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16, "lr_in_gw_redirect")  \
> +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17, "lr_in_arp_request")  \
>                                                                        \
>      /* Logical router egress stages. */                               \
>      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
> @@ -7103,8 +7104,6 @@ build_routing_policy_flow(struct hmap *lflows, struct ovn_datapath *od,
>      ds_destroy(&actions);
>  }
>  
> -/* default logical flow prioriry for distributed routes */
> -#define DROUTE_PRIO 400
>  struct parsed_route {
>      struct ovs_list list_node;
>      struct v46_ip prefix;
> @@ -7493,7 +7492,7 @@ build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>  }
>  
>  static void
> -add_distributed_routes(struct hmap *lflows, struct ovn_datapath *od)
> +add_ip_src_policy_flows(struct hmap *lflows, struct ovn_datapath *od)
>  {
>      struct ds actions = DS_EMPTY_INITIALIZER;
>      struct ds match = DS_EMPTY_INITIALIZER;
> @@ -7511,12 +7510,9 @@ add_distributed_routes(struct hmap *lflows, struct ovn_datapath *od)
>                        is_ipv4 ? "4" : "6", nat->logical_ip,
>                        nat->logical_port);
>          char *prefix = is_ipv4 ? "" : "xx";
> -        ds_put_format(&actions, "outport = %s; eth.src = %s; "
> -                      "%sreg0 = ip%s.dst; %sreg1 = %s; next;",
> -                      od->l3dgw_port->json_key, nat->external_mac,
> -                      prefix, is_ipv4 ? "4" : "6",
> -                      prefix, nat->external_ip);
> -        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, DROUTE_PRIO,
> +        ds_put_format(&actions, "eth.src = %s; %sreg1 = %s; next;",
> +                      nat->external_mac, prefix, nat->external_ip);
> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 100,
>                        ds_cstr(&match), ds_cstr(&actions));
>          ds_clear(&match);
>          ds_clear(&actions);
> @@ -7547,12 +7543,6 @@ add_route(struct hmap *lflows, const struct ovn_port *op,
>      }
>      build_route_match(op_inport, network_s, plen, is_src_route, is_ipv4,
>                        &match, &priority);
> -    /* traffic for internal IPs of logical switch ports must be sent to
> -     * the gw controller through the overlay tunnels
> -     */
> -    if (op->nbrp && !op->nbrp->n_gateway_chassis) {
> -        priority += DROUTE_PRIO;
> -    }
>  
>      struct ds actions = DS_EMPTY_INITIALIZER;
>      ds_put_format(&actions, "ip.ttl--; "REG_ECMP_GROUP_ID" = 0; %sreg0 = ",
> @@ -9519,9 +9509,13 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
>       * logical router
>       */
>      HMAP_FOR_EACH (od, key_node, datapaths) {
> -        if (od->nbr && od->l3dgw_port) {
> -            add_distributed_routes(lflows, od);
> +        if (!od->nbr) {
> +            continue;
> +        }
> +        if (od->l3dgw_port) {
> +            add_ip_src_policy_flows(lflows, od);
>          }
> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 0, "1", "next;");
>      }
>  
>      /* Logical router ingress table IP_ROUTING & IP_ROUTING_ECMP: IP Routing.
> diff --git a/tests/ovn.at b/tests/ovn.at
> index f39fda2e4..fcc34fd5d 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -9637,20 +9637,6 @@ AT_CHECK([as hv3 ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=p
>  OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-vsctl show | \
>  grep "Port patch-br-int-to-ln_port" | wc -l`])
>  
> -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep lr_in_ip_routing | \
> -grep "ip4.src == 10.0.0.3 && is_chassis_resident(\"foo1\")" -c`])
> -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep lr_in_ip_routing | \
> -grep "ip4.src == 10.0.0.4 && is_chassis_resident(\"foo2\")" -c`])
> -
> -key=`ovn-sbctl --bare --columns tunnel_key list datapath_Binding lr0`
> -# Check that the OVS flows appear for the dnat_and_snat entries in
> -# lr_in_ip_routing table.
> -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int table=17 | \
> -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.3" -c`])
> -
> -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int table=17 | \
> -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.4" -c`])
> -
>  # Re-add nat-addresses option
>  ovn-nbctl lsp-set-options lrp0-rp router-port=lrp0 nat-addresses="router"
>  
> @@ -15141,7 +15127,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>  # Since the sw0-vir is not claimed by any chassis, eth.dst should be set to
>  # zero if the ip4.dst is the virtual ip in the router pipeline.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
>  ])
>  
>  ip_to_hex() {
> @@ -15192,7 +15178,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>  # There should be an arp resolve flow to resolve the virtual_ip with the
>  # sw0-p1's MAC.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
>  ])
>  
>  # Forcibly clear virtual_parent. ovn-controller should release the binding
> @@ -15233,7 +15219,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>  # There should be an arp resolve flow to resolve the virtual_ip with the
>  # sw0-p2's MAC.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
>  ])
>  
>  # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
> @@ -15256,7 +15242,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>  # There should be an arp resolve flow to resolve the virtual_ip with the
>  # sw0-p3's MAC.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
>  ])
>  
>  # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
> @@ -15277,7 +15263,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>  > lflows.txt
>  
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
>  ])
>  
>  # Delete hv1-vif1 port. hv1 should release sw0-vir
> @@ -15295,7 +15281,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>  > lflows.txt
>  
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
>  ])
>  
>  # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
> @@ -15316,7 +15302,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>  > lflows.txt
>  
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
>  ])
>  
>  # Delete sw0-p2 logical port
> diff --git a/tests/system-ovn.at b/tests/system-ovn.at

I think it's preferable to have the unit tests in ovn.at whenever
possible so that they get run regularly.

> index 9ae6c6b1f..1e4f147b4 100644
> --- a/tests/system-ovn.at
> +++ b/tests/system-ovn.at
> @@ -2747,6 +2747,17 @@ ADD_VETH(alice1, alice1, br-int, "172.16.1.2/24", "f0:00:00:01:02:05", \
>  ovn-nbctl lsp-add alice alice1 \
>  -- lsp-set-addresses alice1 "f0:00:00:01:02:05 172.16.1.2"
>  
> +# Add external network
> +ADD_NAMESPACES(ext-net)
> +ip link add alice-ext netns alice1 type veth peer name ext-veth netns ext-net
> +ip -n ext-net link set dev ext-veth up
> +ip -n ext-net addr add 10.0.0.1/24 dev ext-veth
> +ip -n ext-net route add default via 10.0.0.2
> +
> +ip -n alice1 link set dev alice-ext up
> +ip -n alice1 addr add 10.0.0.2/24 dev alice-ext
> +ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1
> +
>  # Add DNAT rules
>  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.3 192.168.1.2 foo1 00:00:02:02:03:04])
>  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.4 192.168.1.3 foo2 00:00:02:02:03:05])
> @@ -2754,6 +2765,9 @@ AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.4 192.168.1.3 foo2 00:0
>  # Add a SNAT rule
>  AT_CHECK([ovn-nbctl lr-nat-add R1 snat 172.16.1.1 192.168.0.0/16])
>  
> +# Add default route to ext-net
> +AT_CHECK([ovn-nbctl lr-route-add R1 10.0.0.0/24 172.16.1.2])
> +
>  ovn-nbctl --wait=hv sync
>  OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep 'nat(src=172.16.1.1)'])
>  
> @@ -2776,6 +2790,20 @@ NS_CHECK_EXEC([foo2], [ping -q -c 3 -i 0.3 -w 2 172.16.1.2 | FORMAT_PING], \
>  3 packets transmitted, 3 received, 0% packet loss, time 0ms
>  ])
>  
> +# Try to ping external network
> +NS_CHECK_EXEC([ext-net], [tcpdump -n -c 3 -i ext-veth dst 172.16.1.3 and icmp > ext-net.pcap &])
> +sleep 1
> +AT_CHECK([ovn-nbctl lr-nat-del R1 snat])
> +NS_CHECK_EXEC([foo1], [ping -q -c 3 -i 0.3 -w 2 10.0.0.1 | FORMAT_PING], \
> +[0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_WAIT_UNTIL([
> +    total_pkts=$(cat ext-net.pcap | wc -l)
> +    test "${total_pkts}" = "3"
> +])
> +
>  # We verify that SNAT indeed happened via 'dump-conntrack' command.
>  AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.1.1) | \
>  sed -e 's/zone=[[0-9]]*/zone=<cleared>/'], [0], [dnl
> 

The system test fails on my test machine (Fedora 32 with OVS master and
OVN master + your patch applied):

make check-system-userspace TESTSUITEFLAGS="17"
[...]
 17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
(system-ovn.at:2823)

[...]
system-ovn.at:2802: wait succeeded after 1 seconds
./system-ovn.at:2808: ovs-appctl dpctl/dump-conntrack | grep
"dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
's/id=[0-9]*/id=<cleared>/g' -e 's/state=[0-9_A-Z]*/state=<cleared>/g' |
sort | uniq | \
sed -e 's/zone=[0-9]*/zone=<cleared>/'
./system-ovn.at:2813: ovs-appctl dpctl/flush-conntrack
./system-ovn.at:2817: ip netns exec bar1 sh << NS_EXEC_HEREDOC ping -q
-c 3 -i 0.3 -w 2 172.16.1.2 | grep "transmitted" | sed 's/time.*ms$/time
0ms/'
NS_EXEC_HEREDOC
./system-ovn.at:2823: ovs-appctl dpctl/dump-conntrack | grep
"dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
's/id=[0-9]*/id=<cleared>/g' -e 's/state=[0-9_A-Z]*/state=<cleared>/g' |
sort | uniq | \
sed -e 's/zone=[0-9]*/zone=<cleared>/'
--- -   2020-05-20 10:28:50.053113201 -0400
+++ /root/ovn/tests/system-userspace-testsuite.dir/at-groups/17/stdout
2020-05-20 10:28:50.050668409 -0400
@@ -1,2 +1 @@
-icmp,orig=(src=192.168.2.2,dst=172.16.1.2,id=<cleared>,type=8,code=0),reply=(src=172.16.1.2,dst=172.16.1.1,id=<cleared>,type=0,code=0),zone=<cleared>
[...]

Same happens when running the system test with the kernel datapath:
make check-kernel TESTSUITEFLAGS="17"
[...]
17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
(system-ovn.at:2823)

Regards,
Dumitru
Han Zhou May 20, 2020, 5:20 p.m. UTC | #2
On Wed, May 20, 2020 at 7:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 5/19/20 6:42 PM, Lorenzo Bianconi wrote:
> > In order to fix the issues introduced by commit
> > c0bf32d72f8b ("Manage ARP process locally in a DVR scenario "), restore
> > previous configuration of table 9 in ingress router pipeline and
> > introduce a new stage called 'ip_src_policy' used to set the src address
> > info in order to not distribute FIP traffic if DVR is enabled
> >
> > Fixes: c0bf32d72f8b ("Manage ARP process locally in a DVR scenario ")
> > Tested-by: Jakub Libosvar <jlibosva@redhat.com>
> > Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>
> Hi Lorenzo,
>
> As I mentioned on the RFC patch, if there is a way that we can fix this
> issue without the need for a new pipeline stage I think that would be
> preferable. I don't see an easy way to do that but maybe Han and others
> have suggestions.
>

Hi, I'm reviewing this but could you educate me the detailed scenario that
the original patch c0bf32d72f8b ("Manage ARP process locally in a DVR
scenario ") was trying to solve? From the description added in this patch I
still can't understand: why do we need to avoid sending traffic through
tunnel, and why setting the src-IP and src-Mac would avoid that?

> > ---
> > Changes since RFC:
> > - added unit-tests
> > ---
> >  northd/ovn-northd.8.xml | 65 ++++++++++++++++++++---------------------
> >  northd/ovn-northd.c     | 38 ++++++++++--------------
> >  tests/ovn.at            | 28 +++++-------------
> >  tests/system-ovn.at     | 28 ++++++++++++++++++
> >  4 files changed, 82 insertions(+), 77 deletions(-)
> >
> > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > index 8f224b07f..09dbb52b4 100644
> > --- a/northd/ovn-northd.8.xml
> > +++ b/northd/ovn-northd.8.xml
> > @@ -2484,37 +2484,6 @@ output;
> >          </p>
> >        </li>
> >
> > -      <li>
> > -        <p>
> > -          For distributed logical routers where one of the logical
router ports
> > -          specifies a <code>redirect-chassis</code>, a priority-400
logical
> > -          flow for each <code>dnat_and_snat</code> NAT rules
configured.
> > -          These flows will allow to properly forward traffic to the
external
> > -          connections if available and avoid sending it through the
tunnel.
> > -          Assuming the following NAT rule has been configured:
> > -        </p>
> > -
> > -        <pre>
> > -external_ip = <var>A</var>;
> > -external_mac = <var>B</var>;
> > -logical_ip = <var>C</var>;
> > -        </pre>
> > -
> > -        <p>
> > -          the following action will be applied:
> > -        </p>
> > -
> > -        <pre>
> > -ip.ttl--;
> > -reg0 = <var>ip.dst</var>;
> > -reg1 = <var>A</var>;
> > -eth.src = <var>B</var>;
> > -outport = <var>router-port</var>;
> > -next;
> > -        </pre>
> > -
> > -      </li>
> > -
> >        <li>
> >          <p>
> >            IPv4 routing table.  For each route to IPv4 network
<var>N</var> with
> > @@ -2660,7 +2629,35 @@ outport = <var>P</var>;
> >        </li>
> >      </ul>
> >
> > -    <h3>Ingress Table 12: ARP/ND Resolution</h3>
> > +    <h3>Ingress Table 12: IP Source Policy</h3>
> > +
> > +    <p>
> > +      This table contains for distributed logical routers where one of
> > +      the logical router ports specifies a
<code>redirect-chassis</code>,
> > +      a priority-100 logical flow for each <code>dnat_and_snat</code>
> > +      NAT rules configured.
> > +      These flows will allow to properly forward traffic to the
external
> > +      connections if available and avoid sending it through the tunnel.
> > +      Assuming the following NAT rule has been configured:
> > +    </p>
> > +
> > +    <pre>
> > +external_ip = <var>A</var>;
> > +external_mac = <var>B</var>;
> > +logical_ip = <var>C</var>;
> > +    </pre>
> > +
> > +    <p>
> > +      the following action will be applied:
> > +    </p>
> > +
> > +    <pre>
> > +reg1 = <var>A</var>;
> > +eth.src = <var>B</var>;
> > +next;
> > +    </pre>
> > +
>
> This section of the man page will show what action we perform on
> traffic, i.e., move external_ip in reg1 (for SNAT) and move external_mac
> in eth.src but there's no mention about what kind of traffic will match
> these flows, that is: "ip.src == logical_ip &&
> is_chassis_resident(logical_port)".
>
> > +    <h3>Ingress Table 13: ARP/ND Resolution</h3>
> >
> >      <p>
> >        Any packet that reaches this table is an IP packet whose next-hop
> > @@ -2819,7 +2816,7 @@ outport = <var>P</var>;
> >
> >      </ul>
> >
> > -    <h3>Ingress Table 13: Check packet length</h3>
> > +    <h3>Ingress Table 14: Check packet length</h3>
> >
> >      <p>
> >        For distributed logical routers with distributed gateway port
configured
> > @@ -2849,7 +2846,7 @@ REGBIT_PKT_LARGER =
check_pkt_larger(<var>L</var>); next;
> >        and advances to the next table.
> >      </p>
> >
> > -    <h3>Ingress Table 14: Handle larger packets</h3>
> > +    <h3>Ingress Table 15: Handle larger packets</h3>
> >
>
> Indices for tables "Gateway Redirect" and "ARP Request" should be
> updated too.
>
> >      <p>
> >        For distributed logical routers with distributed gateway port
configured
> > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > index 3c0070ea7..d5f3997a9 100644
> > --- a/northd/ovn-northd.c
> > +++ b/northd/ovn-northd.c
> > @@ -175,11 +175,12 @@ enum ovn_stage {
> >      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9,
"lr_in_ip_routing")   \
> >      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10,
"lr_in_ip_routing_ecmp") \
> >      PIPELINE_STAGE(ROUTER, IN,  POLICY,          11, "lr_in_policy")
    \
> > -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12,
"lr_in_arp_resolve")  \
> > -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13,
"lr_in_chk_pkt_len")   \
> > -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
14,"lr_in_larger_pkts")   \
> > -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15,
"lr_in_gw_redirect")  \
> > -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16,
"lr_in_arp_request")  \
> > +    PIPELINE_STAGE(ROUTER, IN,  IP_SRC_POLICY,   12,
"lr_in_ip_src_policy") \
> > +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13,
"lr_in_arp_resolve")  \
> > +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14,
"lr_in_chk_pkt_len")   \
> > +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
15,"lr_in_larger_pkts")   \
> > +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16,
"lr_in_gw_redirect")  \
> > +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17,
"lr_in_arp_request")  \
> >                                                                        \
> >      /* Logical router egress stages. */                               \
> >      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
> > @@ -7103,8 +7104,6 @@ build_routing_policy_flow(struct hmap *lflows,
struct ovn_datapath *od,
> >      ds_destroy(&actions);
> >  }
> >
> > -/* default logical flow prioriry for distributed routes */
> > -#define DROUTE_PRIO 400
> >  struct parsed_route {
> >      struct ovs_list list_node;
> >      struct v46_ip prefix;
> > @@ -7493,7 +7492,7 @@ build_ecmp_route_flow(struct hmap *lflows, struct
ovn_datapath *od,
> >  }
> >
> >  static void
> > -add_distributed_routes(struct hmap *lflows, struct ovn_datapath *od)
> > +add_ip_src_policy_flows(struct hmap *lflows, struct ovn_datapath *od)
> >  {
> >      struct ds actions = DS_EMPTY_INITIALIZER;
> >      struct ds match = DS_EMPTY_INITIALIZER;
> > @@ -7511,12 +7510,9 @@ add_distributed_routes(struct hmap *lflows,
struct ovn_datapath *od)
> >                        is_ipv4 ? "4" : "6", nat->logical_ip,
> >                        nat->logical_port);
> >          char *prefix = is_ipv4 ? "" : "xx";
> > -        ds_put_format(&actions, "outport = %s; eth.src = %s; "
> > -                      "%sreg0 = ip%s.dst; %sreg1 = %s; next;",
> > -                      od->l3dgw_port->json_key, nat->external_mac,
> > -                      prefix, is_ipv4 ? "4" : "6",
> > -                      prefix, nat->external_ip);
> > -        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, DROUTE_PRIO,
> > +        ds_put_format(&actions, "eth.src = %s; %sreg1 = %s; next;",
> > +                      nat->external_mac, prefix, nat->external_ip);
> > +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 100,
> >                        ds_cstr(&match), ds_cstr(&actions));
> >          ds_clear(&match);
> >          ds_clear(&actions);
> > @@ -7547,12 +7543,6 @@ add_route(struct hmap *lflows, const struct
ovn_port *op,
> >      }
> >      build_route_match(op_inport, network_s, plen, is_src_route,
is_ipv4,
> >                        &match, &priority);
> > -    /* traffic for internal IPs of logical switch ports must be sent to
> > -     * the gw controller through the overlay tunnels
> > -     */
> > -    if (op->nbrp && !op->nbrp->n_gateway_chassis) {
> > -        priority += DROUTE_PRIO;
> > -    }
> >
> >      struct ds actions = DS_EMPTY_INITIALIZER;
> >      ds_put_format(&actions, "ip.ttl--; "REG_ECMP_GROUP_ID" = 0; %sreg0
= ",
> > @@ -9519,9 +9509,13 @@ build_lrouter_flows(struct hmap *datapaths,
struct hmap *ports,
> >       * logical router
> >       */
> >      HMAP_FOR_EACH (od, key_node, datapaths) {
> > -        if (od->nbr && od->l3dgw_port) {
> > -            add_distributed_routes(lflows, od);
> > +        if (!od->nbr) {
> > +            continue;
> > +        }
> > +        if (od->l3dgw_port) {
> > +            add_ip_src_policy_flows(lflows, od);
> >          }
> > +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 0, "1",
"next;");
> >      }
> >
> >      /* Logical router ingress table IP_ROUTING & IP_ROUTING_ECMP: IP
Routing.
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index f39fda2e4..fcc34fd5d 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -9637,20 +9637,6 @@ AT_CHECK([as hv3 ovs-vsctl set Open_vSwitch .
external-ids:ovn-bridge-mappings=p
> >  OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-vsctl show | \
> >  grep "Port patch-br-int-to-ln_port" | wc -l`])
> >
> > -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep lr_in_ip_routing |
\
> > -grep "ip4.src == 10.0.0.3 && is_chassis_resident(\"foo1\")" -c`])
> > -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep lr_in_ip_routing |
\
> > -grep "ip4.src == 10.0.0.4 && is_chassis_resident(\"foo2\")" -c`])
> > -
> > -key=`ovn-sbctl --bare --columns tunnel_key list datapath_Binding lr0`
> > -# Check that the OVS flows appear for the dnat_and_snat entries in
> > -# lr_in_ip_routing table.
> > -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int table=17
| \
> > -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.3" -c`])
> > -
> > -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int table=17
| \
> > -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.4" -c`])
> > -
> >  # Re-add nat-addresses option
> >  ovn-nbctl lsp-set-options lrp0-rp router-port=lrp0
nat-addresses="router"
> >
> > @@ -15141,7 +15127,7 @@ ovn-sbctl dump-flows lr0 | grep
lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  # Since the sw0-vir is not claimed by any chassis, eth.dst should be
set to
> >  # zero if the ip4.dst is the virtual ip in the router pipeline.
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> >  ])
> >
> >  ip_to_hex() {
> > @@ -15192,7 +15178,7 @@ ovn-sbctl dump-flows lr0 | grep
lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  # There should be an arp resolve flow to resolve the virtual_ip with
the
> >  # sw0-p1's MAC.
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> >  ])
> >
> >  # Forcibly clear virtual_parent. ovn-controller should release the
binding
> > @@ -15233,7 +15219,7 @@ ovn-sbctl dump-flows lr0 | grep
lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  # There should be an arp resolve flow to resolve the virtual_ip with
the
> >  # sw0-p2's MAC.
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
> >  ])
> >
> >  # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
> > @@ -15256,7 +15242,7 @@ ovn-sbctl dump-flows lr0 | grep
lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  # There should be an arp resolve flow to resolve the virtual_ip with
the
> >  # sw0-p3's MAC.
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> >  ])
> >
> >  # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
> > @@ -15277,7 +15263,7 @@ ovn-sbctl dump-flows lr0 | grep
lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  > lflows.txt
> >
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> >  ])
> >
> >  # Delete hv1-vif1 port. hv1 should release sw0-vir
> > @@ -15295,7 +15281,7 @@ ovn-sbctl dump-flows lr0 | grep
lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  > lflows.txt
> >
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> >  ])
> >
> >  # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
> > @@ -15316,7 +15302,7 @@ ovn-sbctl dump-flows lr0 | grep
lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  > lflows.txt
> >
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
"lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> >  ])
> >
> >  # Delete sw0-p2 logical port
> > diff --git a/tests/system-ovn.at b/tests/system-ovn.at
>
> I think it's preferable to have the unit tests in ovn.at whenever
> possible so that they get run regularly.
>
> > index 9ae6c6b1f..1e4f147b4 100644
> > --- a/tests/system-ovn.at
> > +++ b/tests/system-ovn.at
> > @@ -2747,6 +2747,17 @@ ADD_VETH(alice1, alice1, br-int, "172.16.1.2/24",
"f0:00:00:01:02:05", \
> >  ovn-nbctl lsp-add alice alice1 \
> >  -- lsp-set-addresses alice1 "f0:00:00:01:02:05 172.16.1.2"
> >
> > +# Add external network
> > +ADD_NAMESPACES(ext-net)
> > +ip link add alice-ext netns alice1 type veth peer name ext-veth netns
ext-net
> > +ip -n ext-net link set dev ext-veth up
> > +ip -n ext-net addr add 10.0.0.1/24 dev ext-veth
> > +ip -n ext-net route add default via 10.0.0.2
> > +
> > +ip -n alice1 link set dev alice-ext up
> > +ip -n alice1 addr add 10.0.0.2/24 dev alice-ext
> > +ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1
> > +
> >  # Add DNAT rules
> >  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.3 192.168.1.2
foo1 00:00:02:02:03:04])
> >  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.4 192.168.1.3
foo2 00:00:02:02:03:05])
> > @@ -2754,6 +2765,9 @@ AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat
172.16.1.4 192.168.1.3 foo2 00:0
> >  # Add a SNAT rule
> >  AT_CHECK([ovn-nbctl lr-nat-add R1 snat 172.16.1.1 192.168.0.0/16])
> >
> > +# Add default route to ext-net
> > +AT_CHECK([ovn-nbctl lr-route-add R1 10.0.0.0/24 172.16.1.2])
> > +
> >  ovn-nbctl --wait=hv sync
> >  OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep
'nat(src=172.16.1.1)'])
> >
> > @@ -2776,6 +2790,20 @@ NS_CHECK_EXEC([foo2], [ping -q -c 3 -i 0.3 -w 2
172.16.1.2 | FORMAT_PING], \
> >  3 packets transmitted, 3 received, 0% packet loss, time 0ms
> >  ])
> >
> > +# Try to ping external network
> > +NS_CHECK_EXEC([ext-net], [tcpdump -n -c 3 -i ext-veth dst 172.16.1.3
and icmp > ext-net.pcap &])
> > +sleep 1
> > +AT_CHECK([ovn-nbctl lr-nat-del R1 snat])
> > +NS_CHECK_EXEC([foo1], [ping -q -c 3 -i 0.3 -w 2 10.0.0.1 |
FORMAT_PING], \
> > +[0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +OVS_WAIT_UNTIL([
> > +    total_pkts=$(cat ext-net.pcap | wc -l)
> > +    test "${total_pkts}" = "3"
> > +])
> > +
> >  # We verify that SNAT indeed happened via 'dump-conntrack' command.
> >  AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.1.1) | \
> >  sed -e 's/zone=[[0-9]]*/zone=<cleared>/'], [0], [dnl
> >
>
> The system test fails on my test machine (Fedora 32 with OVS master and
> OVN master + your patch applied):
>
> make check-system-userspace TESTSUITEFLAGS="17"
> [...]
>  17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
> (system-ovn.at:2823)
>
> [...]
> system-ovn.at:2802: wait succeeded after 1 seconds
> ./system-ovn.at:2808: ovs-appctl dpctl/dump-conntrack | grep
> "dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
> 's/id=[0-9]*/id=<cleared>/g' -e 's/state=[0-9_A-Z]*/state=<cleared>/g' |
> sort | uniq | \
> sed -e 's/zone=[0-9]*/zone=<cleared>/'
> ./system-ovn.at:2813: ovs-appctl dpctl/flush-conntrack
> ./system-ovn.at:2817: ip netns exec bar1 sh << NS_EXEC_HEREDOC ping -q
> -c 3 -i 0.3 -w 2 172.16.1.2 | grep "transmitted" | sed 's/time.*ms$/time
> 0ms/'
> NS_EXEC_HEREDOC
> ./system-ovn.at:2823: ovs-appctl dpctl/dump-conntrack | grep
> "dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
> 's/id=[0-9]*/id=<cleared>/g' -e 's/state=[0-9_A-Z]*/state=<cleared>/g' |
> sort | uniq | \
> sed -e 's/zone=[0-9]*/zone=<cleared>/'
> --- -   2020-05-20 10:28:50.053113201 -0400
> +++ /root/ovn/tests/system-userspace-testsuite.dir/at-groups/17/stdout
> 2020-05-20 10:28:50.050668409 -0400
> @@ -1,2 +1 @@
>
-icmp,orig=(src=192.168.2.2,dst=172.16.1.2,id=<cleared>,type=8,code=0),reply=(src=172.16.1.2,dst=172.16.1.1,id=<cleared>,type=0,code=0),zone=<cleared>
> [...]
>
> Same happens when running the system test with the kernel datapath:
> make check-kernel TESTSUITEFLAGS="17"
> [...]
> 17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
> (system-ovn.at:2823)
>
> Regards,
> Dumitru
>
Lorenzo Bianconi May 20, 2020, 7:31 p.m. UTC | #3
> On 5/19/20 6:42 PM, Lorenzo Bianconi wrote:
> > In order to fix the issues introduced by commit
> > c0bf32d72f8b ("Manage ARP process locally in a DVR scenario "), restore
> > previous configuration of table 9 in ingress router pipeline and
> > introduce a new stage called 'ip_src_policy' used to set the src address
> > info in order to not distribute FIP traffic if DVR is enabled
> > 
> > Fixes: c0bf32d72f8b ("Manage ARP process locally in a DVR scenario ")
> > Tested-by: Jakub Libosvar <jlibosva@redhat.com>
> > Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> 
> Hi Lorenzo,
> 
> As I mentioned on the RFC patch, if there is a way that we can fix this
> issue without the need for a new pipeline stage I think that would be
> preferable. I don't see an easy way to do that but maybe Han and others
> have suggestions.

Hi Dumitru,

thx for the review.

> 
> > ---
> > Changes since RFC:
> > - added unit-tests
> > ---
> >  northd/ovn-northd.8.xml | 65 ++++++++++++++++++++---------------------
> >  northd/ovn-northd.c     | 38 ++++++++++--------------
> >  tests/ovn.at            | 28 +++++-------------
> >  tests/system-ovn.at     | 28 ++++++++++++++++++
> >  4 files changed, 82 insertions(+), 77 deletions(-)
> > 
> > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > index 8f224b07f..09dbb52b4 100644
> > --- a/northd/ovn-northd.8.xml
> > +++ b/northd/ovn-northd.8.xml
> > @@ -2484,37 +2484,6 @@ output;
> >          </p>
> >        </li>
> >  
> > -      <li>
> > -        <p>
> > -          For distributed logical routers where one of the logical router ports
> > -          specifies a <code>redirect-chassis</code>, a priority-400 logical
> > -          flow for each <code>dnat_and_snat</code> NAT rules configured.
> > -          These flows will allow to properly forward traffic to the external
> > -          connections if available and avoid sending it through the tunnel.
> > -          Assuming the following NAT rule has been configured:
> > -        </p>
> > -
> > -        <pre>
> > -external_ip = <var>A</var>;
> > -external_mac = <var>B</var>;
> > -logical_ip = <var>C</var>;
> > -        </pre>
> > -
> > -        <p>
> > -          the following action will be applied:
> > -        </p>
> > -
> > -        <pre>
> > -ip.ttl--;
> > -reg0 = <var>ip.dst</var>;
> > -reg1 = <var>A</var>;
> > -eth.src = <var>B</var>;
> > -outport = <var>router-port</var>;
> > -next;
> > -        </pre>
> > -
> > -      </li>
> > -
> >        <li>
> >          <p>
> >            IPv4 routing table.  For each route to IPv4 network <var>N</var> with
> > @@ -2660,7 +2629,35 @@ outport = <var>P</var>;
> >        </li>
> >      </ul>
> >  
> > -    <h3>Ingress Table 12: ARP/ND Resolution</h3>
> > +    <h3>Ingress Table 12: IP Source Policy</h3>
> > +
> > +    <p>
> > +      This table contains for distributed logical routers where one of
> > +      the logical router ports specifies a <code>redirect-chassis</code>,
> > +      a priority-100 logical flow for each <code>dnat_and_snat</code>
> > +      NAT rules configured.
> > +      These flows will allow to properly forward traffic to the external
> > +      connections if available and avoid sending it through the tunnel.
> > +      Assuming the following NAT rule has been configured:
> > +    </p>
> > +
> > +    <pre>
> > +external_ip = <var>A</var>;
> > +external_mac = <var>B</var>;
> > +logical_ip = <var>C</var>;
> > +    </pre>
> > +
> > +    <p>
> > +      the following action will be applied:
> > +    </p>
> > +
> > +    <pre>
> > +reg1 = <var>A</var>;
> > +eth.src = <var>B</var>;
> > +next;
> > +    </pre>
> > +
> 
> This section of the man page will show what action we perform on
> traffic, i.e., move external_ip in reg1 (for SNAT) and move external_mac
> in eth.src but there's no mention about what kind of traffic will match
> these flows, that is: "ip.src == logical_ip &&
> is_chassis_resident(logical_port)".

ack, I will update it in v2

> 
> > +    <h3>Ingress Table 13: ARP/ND Resolution</h3>
> >  
> >      <p>
> >        Any packet that reaches this table is an IP packet whose next-hop
> > @@ -2819,7 +2816,7 @@ outport = <var>P</var>;
> >  
> >      </ul>
> >  
> > -    <h3>Ingress Table 13: Check packet length</h3>
> > +    <h3>Ingress Table 14: Check packet length</h3>
> >  
> >      <p>
> >        For distributed logical routers with distributed gateway port configured
> > @@ -2849,7 +2846,7 @@ REGBIT_PKT_LARGER = check_pkt_larger(<var>L</var>); next;
> >        and advances to the next table.
> >      </p>
> >  
> > -    <h3>Ingress Table 14: Handle larger packets</h3>
> > +    <h3>Ingress Table 15: Handle larger packets</h3>
> >  
> 
> Indices for tables "Gateway Redirect" and "ARP Request" should be
> updated too.

ack, I will update it in v2

> 
> >      <p>
> >        For distributed logical routers with distributed gateway port configured
> > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > index 3c0070ea7..d5f3997a9 100644
> > --- a/northd/ovn-northd.c
> > +++ b/northd/ovn-northd.c
> > @@ -175,11 +175,12 @@ enum ovn_stage {
> >      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9, "lr_in_ip_routing")   \
> >      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10, "lr_in_ip_routing_ecmp") \
> >      PIPELINE_STAGE(ROUTER, IN,  POLICY,          11, "lr_in_policy")       \
> > -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12, "lr_in_arp_resolve")  \
> > -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13, "lr_in_chk_pkt_len")   \
> > -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     14,"lr_in_larger_pkts")   \
> > -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15, "lr_in_gw_redirect")  \
> > -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16, "lr_in_arp_request")  \
> > +    PIPELINE_STAGE(ROUTER, IN,  IP_SRC_POLICY,   12, "lr_in_ip_src_policy") \
> > +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13, "lr_in_arp_resolve")  \
> > +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14, "lr_in_chk_pkt_len")   \
> > +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     15,"lr_in_larger_pkts")   \
> > +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16, "lr_in_gw_redirect")  \
> > +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17, "lr_in_arp_request")  \
> >                                                                        \
> >      /* Logical router egress stages. */                               \
> >      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
> > @@ -7103,8 +7104,6 @@ build_routing_policy_flow(struct hmap *lflows, struct ovn_datapath *od,
> >      ds_destroy(&actions);
> >  }
> >  
> > -/* default logical flow prioriry for distributed routes */
> > -#define DROUTE_PRIO 400
> >  struct parsed_route {
> >      struct ovs_list list_node;
> >      struct v46_ip prefix;
> > @@ -7493,7 +7492,7 @@ build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
> >  }
> >  
> >  static void
> > -add_distributed_routes(struct hmap *lflows, struct ovn_datapath *od)
> > +add_ip_src_policy_flows(struct hmap *lflows, struct ovn_datapath *od)
> >  {
> >      struct ds actions = DS_EMPTY_INITIALIZER;
> >      struct ds match = DS_EMPTY_INITIALIZER;
> > @@ -7511,12 +7510,9 @@ add_distributed_routes(struct hmap *lflows, struct ovn_datapath *od)
> >                        is_ipv4 ? "4" : "6", nat->logical_ip,
> >                        nat->logical_port);
> >          char *prefix = is_ipv4 ? "" : "xx";
> > -        ds_put_format(&actions, "outport = %s; eth.src = %s; "
> > -                      "%sreg0 = ip%s.dst; %sreg1 = %s; next;",
> > -                      od->l3dgw_port->json_key, nat->external_mac,
> > -                      prefix, is_ipv4 ? "4" : "6",
> > -                      prefix, nat->external_ip);
> > -        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, DROUTE_PRIO,
> > +        ds_put_format(&actions, "eth.src = %s; %sreg1 = %s; next;",
> > +                      nat->external_mac, prefix, nat->external_ip);
> > +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 100,
> >                        ds_cstr(&match), ds_cstr(&actions));
> >          ds_clear(&match);
> >          ds_clear(&actions);
> > @@ -7547,12 +7543,6 @@ add_route(struct hmap *lflows, const struct ovn_port *op,
> >      }
> >      build_route_match(op_inport, network_s, plen, is_src_route, is_ipv4,
> >                        &match, &priority);
> > -    /* traffic for internal IPs of logical switch ports must be sent to
> > -     * the gw controller through the overlay tunnels
> > -     */
> > -    if (op->nbrp && !op->nbrp->n_gateway_chassis) {
> > -        priority += DROUTE_PRIO;
> > -    }
> >  
> >      struct ds actions = DS_EMPTY_INITIALIZER;
> >      ds_put_format(&actions, "ip.ttl--; "REG_ECMP_GROUP_ID" = 0; %sreg0 = ",
> > @@ -9519,9 +9509,13 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> >       * logical router
> >       */
> >      HMAP_FOR_EACH (od, key_node, datapaths) {
> > -        if (od->nbr && od->l3dgw_port) {
> > -            add_distributed_routes(lflows, od);
> > +        if (!od->nbr) {
> > +            continue;
> > +        }
> > +        if (od->l3dgw_port) {
> > +            add_ip_src_policy_flows(lflows, od);
> >          }
> > +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 0, "1", "next;");
> >      }
> >  
> >      /* Logical router ingress table IP_ROUTING & IP_ROUTING_ECMP: IP Routing.
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index f39fda2e4..fcc34fd5d 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -9637,20 +9637,6 @@ AT_CHECK([as hv3 ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=p
> >  OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-vsctl show | \
> >  grep "Port patch-br-int-to-ln_port" | wc -l`])
> >  
> > -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep lr_in_ip_routing | \
> > -grep "ip4.src == 10.0.0.3 && is_chassis_resident(\"foo1\")" -c`])
> > -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep lr_in_ip_routing | \
> > -grep "ip4.src == 10.0.0.4 && is_chassis_resident(\"foo2\")" -c`])
> > -
> > -key=`ovn-sbctl --bare --columns tunnel_key list datapath_Binding lr0`
> > -# Check that the OVS flows appear for the dnat_and_snat entries in
> > -# lr_in_ip_routing table.
> > -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int table=17 | \
> > -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.3" -c`])
> > -
> > -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int table=17 | \
> > -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.4" -c`])
> > -
> >  # Re-add nat-addresses option
> >  ovn-nbctl lsp-set-options lrp0-rp router-port=lrp0 nat-addresses="router"
> >  
> > @@ -15141,7 +15127,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  # Since the sw0-vir is not claimed by any chassis, eth.dst should be set to
> >  # zero if the ip4.dst is the virtual ip in the router pipeline.
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> >  ])
> >  
> >  ip_to_hex() {
> > @@ -15192,7 +15178,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  # There should be an arp resolve flow to resolve the virtual_ip with the
> >  # sw0-p1's MAC.
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> >  ])
> >  
> >  # Forcibly clear virtual_parent. ovn-controller should release the binding
> > @@ -15233,7 +15219,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  # There should be an arp resolve flow to resolve the virtual_ip with the
> >  # sw0-p2's MAC.
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
> >  ])
> >  
> >  # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
> > @@ -15256,7 +15242,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  # There should be an arp resolve flow to resolve the virtual_ip with the
> >  # sw0-p3's MAC.
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> >  ])
> >  
> >  # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
> > @@ -15277,7 +15263,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  > lflows.txt
> >  
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> >  ])
> >  
> >  # Delete hv1-vif1 port. hv1 should release sw0-vir
> > @@ -15295,7 +15281,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  > lflows.txt
> >  
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> >  ])
> >  
> >  # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
> > @@ -15316,7 +15302,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >  > lflows.txt
> >  
> >  AT_CHECK([cat lflows.txt], [0], [dnl
> > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> >  ])
> >  
> >  # Delete sw0-p2 logical port
> > diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> 
> I think it's preferable to have the unit tests in ovn.at whenever
> possible so that they get run regularly.

I will add some arp checkes in v2

> 
> > index 9ae6c6b1f..1e4f147b4 100644
> > --- a/tests/system-ovn.at
> > +++ b/tests/system-ovn.at
> > @@ -2747,6 +2747,17 @@ ADD_VETH(alice1, alice1, br-int, "172.16.1.2/24", "f0:00:00:01:02:05", \
> >  ovn-nbctl lsp-add alice alice1 \
> >  -- lsp-set-addresses alice1 "f0:00:00:01:02:05 172.16.1.2"
> >  
> > +# Add external network
> > +ADD_NAMESPACES(ext-net)
> > +ip link add alice-ext netns alice1 type veth peer name ext-veth netns ext-net
> > +ip -n ext-net link set dev ext-veth up
> > +ip -n ext-net addr add 10.0.0.1/24 dev ext-veth
> > +ip -n ext-net route add default via 10.0.0.2
> > +
> > +ip -n alice1 link set dev alice-ext up
> > +ip -n alice1 addr add 10.0.0.2/24 dev alice-ext
> > +ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1
> > +
> >  # Add DNAT rules
> >  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.3 192.168.1.2 foo1 00:00:02:02:03:04])
> >  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.4 192.168.1.3 foo2 00:00:02:02:03:05])
> > @@ -2754,6 +2765,9 @@ AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.4 192.168.1.3 foo2 00:0
> >  # Add a SNAT rule
> >  AT_CHECK([ovn-nbctl lr-nat-add R1 snat 172.16.1.1 192.168.0.0/16])
> >  
> > +# Add default route to ext-net
> > +AT_CHECK([ovn-nbctl lr-route-add R1 10.0.0.0/24 172.16.1.2])
> > +
> >  ovn-nbctl --wait=hv sync
> >  OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep 'nat(src=172.16.1.1)'])
> >  
> > @@ -2776,6 +2790,20 @@ NS_CHECK_EXEC([foo2], [ping -q -c 3 -i 0.3 -w 2 172.16.1.2 | FORMAT_PING], \
> >  3 packets transmitted, 3 received, 0% packet loss, time 0ms
> >  ])
> >  
> > +# Try to ping external network
> > +NS_CHECK_EXEC([ext-net], [tcpdump -n -c 3 -i ext-veth dst 172.16.1.3 and icmp > ext-net.pcap &])
> > +sleep 1
> > +AT_CHECK([ovn-nbctl lr-nat-del R1 snat])
> > +NS_CHECK_EXEC([foo1], [ping -q -c 3 -i 0.3 -w 2 10.0.0.1 | FORMAT_PING], \
> > +[0], [dnl
> > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > +])
> > +
> > +OVS_WAIT_UNTIL([
> > +    total_pkts=$(cat ext-net.pcap | wc -l)
> > +    test "${total_pkts}" = "3"
> > +])
> > +
> >  # We verify that SNAT indeed happened via 'dump-conntrack' command.
> >  AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.1.1) | \
> >  sed -e 's/zone=[[0-9]]*/zone=<cleared>/'], [0], [dnl
> > 
> 
> The system test fails on my test machine (Fedora 32 with OVS master and
> OVN master + your patch applied):
> 

shame on me, I sent an out-of-date version, thx for spotting this. I will fix
it in v2

Regards,
Lorenzo

> make check-system-userspace TESTSUITEFLAGS="17"
> [...]
>  17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
> (system-ovn.at:2823)
> 
> [...]
> system-ovn.at:2802: wait succeeded after 1 seconds
> ./system-ovn.at:2808: ovs-appctl dpctl/dump-conntrack | grep
> "dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
> 's/id=[0-9]*/id=<cleared>/g' -e 's/state=[0-9_A-Z]*/state=<cleared>/g' |
> sort | uniq | \
> sed -e 's/zone=[0-9]*/zone=<cleared>/'
> ./system-ovn.at:2813: ovs-appctl dpctl/flush-conntrack
> ./system-ovn.at:2817: ip netns exec bar1 sh << NS_EXEC_HEREDOC ping -q
> -c 3 -i 0.3 -w 2 172.16.1.2 | grep "transmitted" | sed 's/time.*ms$/time
> 0ms/'
> NS_EXEC_HEREDOC
> ./system-ovn.at:2823: ovs-appctl dpctl/dump-conntrack | grep
> "dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
> 's/id=[0-9]*/id=<cleared>/g' -e 's/state=[0-9_A-Z]*/state=<cleared>/g' |
> sort | uniq | \
> sed -e 's/zone=[0-9]*/zone=<cleared>/'
> --- -   2020-05-20 10:28:50.053113201 -0400
> +++ /root/ovn/tests/system-userspace-testsuite.dir/at-groups/17/stdout
> 2020-05-20 10:28:50.050668409 -0400
> @@ -1,2 +1 @@
> -icmp,orig=(src=192.168.2.2,dst=172.16.1.2,id=<cleared>,type=8,code=0),reply=(src=172.16.1.2,dst=172.16.1.1,id=<cleared>,type=0,code=0),zone=<cleared>
> [...]
> 
> Same happens when running the system test with the kernel datapath:
> make check-kernel TESTSUITEFLAGS="17"
> [...]
> 17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
> (system-ovn.at:2823)
> 
> Regards,
> Dumitru
>
Lorenzo Bianconi May 20, 2020, 7:43 p.m. UTC | #4
> On Wed, May 20, 2020 at 7:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > On 5/19/20 6:42 PM, Lorenzo Bianconi wrote:
> > > In order to fix the issues introduced by commit
> > > c0bf32d72f8b ("Manage ARP process locally in a DVR scenario "), restore
> > > previous configuration of table 9 in ingress router pipeline and
> > > introduce a new stage called 'ip_src_policy' used to set the src address
> > > info in order to not distribute FIP traffic if DVR is enabled
> > >
> > > Fixes: c0bf32d72f8b ("Manage ARP process locally in a DVR scenario ")
> > > Tested-by: Jakub Libosvar <jlibosva@redhat.com>
> > > Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> >
> > Hi Lorenzo,
> >
> > As I mentioned on the RFC patch, if there is a way that we can fix this
> > issue without the need for a new pipeline stage I think that would be
> > preferable. I don't see an easy way to do that but maybe Han and others
> > have suggestions.
> >
> 
> Hi, I'm reviewing this but could you educate me the detailed scenario that
> the original patch c0bf32d72f8b ("Manage ARP process locally in a DVR
> scenario ") was trying to solve? From the description added in this patch I
> still can't understand: why do we need to avoid sending traffic through
> tunnel, and why setting the src-IP and src-Mac would avoid that?

Hi Han,

the main DVR/FIP issue before commit c0bf32d72f8b was ARP handling. In
particular, ARP traffic was always sent to the gw router using a geneve tunnel
while the IP traffic was sent to the localnet port available on the chassis.
Our goal is manage ARP locally when we have installed a DVR/FIP rule on the
node. In particular, we need to overwrite reg1 and eth.src that are set in
table 9 of the ingress router pipeline in order to send the ARP request using
the FIP ip/mac address. In order to not break any static routes added to the
logical router routing table but at the same time properly set reg1/eth.src
I defined a new stable in the ingress router pipeline (stage=12).
Another possible approach would be reset reg1 and eth.src in pinctrl code but
we have no access to sb db in pinctrl thread.

Regards,
Lorenzo

> 
> > > ---
> > > Changes since RFC:
> > > - added unit-tests
> > > ---
> > >  northd/ovn-northd.8.xml | 65 ++++++++++++++++++++---------------------
> > >  northd/ovn-northd.c     | 38 ++++++++++--------------
> > >  tests/ovn.at            | 28 +++++-------------
> > >  tests/system-ovn.at     | 28 ++++++++++++++++++
> > >  4 files changed, 82 insertions(+), 77 deletions(-)
> > >
> > > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > > index 8f224b07f..09dbb52b4 100644
> > > --- a/northd/ovn-northd.8.xml
> > > +++ b/northd/ovn-northd.8.xml
> > > @@ -2484,37 +2484,6 @@ output;
> > >          </p>
> > >        </li>
> > >
> > > -      <li>
> > > -        <p>
> > > -          For distributed logical routers where one of the logical
> router ports
> > > -          specifies a <code>redirect-chassis</code>, a priority-400
> logical
> > > -          flow for each <code>dnat_and_snat</code> NAT rules
> configured.
> > > -          These flows will allow to properly forward traffic to the
> external
> > > -          connections if available and avoid sending it through the
> tunnel.
> > > -          Assuming the following NAT rule has been configured:
> > > -        </p>
> > > -
> > > -        <pre>
> > > -external_ip = <var>A</var>;
> > > -external_mac = <var>B</var>;
> > > -logical_ip = <var>C</var>;
> > > -        </pre>
> > > -
> > > -        <p>
> > > -          the following action will be applied:
> > > -        </p>
> > > -
> > > -        <pre>
> > > -ip.ttl--;
> > > -reg0 = <var>ip.dst</var>;
> > > -reg1 = <var>A</var>;
> > > -eth.src = <var>B</var>;
> > > -outport = <var>router-port</var>;
> > > -next;
> > > -        </pre>
> > > -
> > > -      </li>
> > > -
> > >        <li>
> > >          <p>
> > >            IPv4 routing table.  For each route to IPv4 network
> <var>N</var> with
> > > @@ -2660,7 +2629,35 @@ outport = <var>P</var>;
> > >        </li>
> > >      </ul>
> > >
> > > -    <h3>Ingress Table 12: ARP/ND Resolution</h3>
> > > +    <h3>Ingress Table 12: IP Source Policy</h3>
> > > +
> > > +    <p>
> > > +      This table contains for distributed logical routers where one of
> > > +      the logical router ports specifies a
> <code>redirect-chassis</code>,
> > > +      a priority-100 logical flow for each <code>dnat_and_snat</code>
> > > +      NAT rules configured.
> > > +      These flows will allow to properly forward traffic to the
> external
> > > +      connections if available and avoid sending it through the tunnel.
> > > +      Assuming the following NAT rule has been configured:
> > > +    </p>
> > > +
> > > +    <pre>
> > > +external_ip = <var>A</var>;
> > > +external_mac = <var>B</var>;
> > > +logical_ip = <var>C</var>;
> > > +    </pre>
> > > +
> > > +    <p>
> > > +      the following action will be applied:
> > > +    </p>
> > > +
> > > +    <pre>
> > > +reg1 = <var>A</var>;
> > > +eth.src = <var>B</var>;
> > > +next;
> > > +    </pre>
> > > +
> >
> > This section of the man page will show what action we perform on
> > traffic, i.e., move external_ip in reg1 (for SNAT) and move external_mac
> > in eth.src but there's no mention about what kind of traffic will match
> > these flows, that is: "ip.src == logical_ip &&
> > is_chassis_resident(logical_port)".
> >
> > > +    <h3>Ingress Table 13: ARP/ND Resolution</h3>
> > >
> > >      <p>
> > >        Any packet that reaches this table is an IP packet whose next-hop
> > > @@ -2819,7 +2816,7 @@ outport = <var>P</var>;
> > >
> > >      </ul>
> > >
> > > -    <h3>Ingress Table 13: Check packet length</h3>
> > > +    <h3>Ingress Table 14: Check packet length</h3>
> > >
> > >      <p>
> > >        For distributed logical routers with distributed gateway port
> configured
> > > @@ -2849,7 +2846,7 @@ REGBIT_PKT_LARGER =
> check_pkt_larger(<var>L</var>); next;
> > >        and advances to the next table.
> > >      </p>
> > >
> > > -    <h3>Ingress Table 14: Handle larger packets</h3>
> > > +    <h3>Ingress Table 15: Handle larger packets</h3>
> > >
> >
> > Indices for tables "Gateway Redirect" and "ARP Request" should be
> > updated too.
> >
> > >      <p>
> > >        For distributed logical routers with distributed gateway port
> configured
> > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > > index 3c0070ea7..d5f3997a9 100644
> > > --- a/northd/ovn-northd.c
> > > +++ b/northd/ovn-northd.c
> > > @@ -175,11 +175,12 @@ enum ovn_stage {
> > >      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9,
> "lr_in_ip_routing")   \
> > >      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10,
> "lr_in_ip_routing_ecmp") \
> > >      PIPELINE_STAGE(ROUTER, IN,  POLICY,          11, "lr_in_policy")
>     \
> > > -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12,
> "lr_in_arp_resolve")  \
> > > -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13,
> "lr_in_chk_pkt_len")   \
> > > -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
> 14,"lr_in_larger_pkts")   \
> > > -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15,
> "lr_in_gw_redirect")  \
> > > -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16,
> "lr_in_arp_request")  \
> > > +    PIPELINE_STAGE(ROUTER, IN,  IP_SRC_POLICY,   12,
> "lr_in_ip_src_policy") \
> > > +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13,
> "lr_in_arp_resolve")  \
> > > +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14,
> "lr_in_chk_pkt_len")   \
> > > +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
> 15,"lr_in_larger_pkts")   \
> > > +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16,
> "lr_in_gw_redirect")  \
> > > +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17,
> "lr_in_arp_request")  \
> > >                                                                        \
> > >      /* Logical router egress stages. */                               \
> > >      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
> > > @@ -7103,8 +7104,6 @@ build_routing_policy_flow(struct hmap *lflows,
> struct ovn_datapath *od,
> > >      ds_destroy(&actions);
> > >  }
> > >
> > > -/* default logical flow prioriry for distributed routes */
> > > -#define DROUTE_PRIO 400
> > >  struct parsed_route {
> > >      struct ovs_list list_node;
> > >      struct v46_ip prefix;
> > > @@ -7493,7 +7492,7 @@ build_ecmp_route_flow(struct hmap *lflows, struct
> ovn_datapath *od,
> > >  }
> > >
> > >  static void
> > > -add_distributed_routes(struct hmap *lflows, struct ovn_datapath *od)
> > > +add_ip_src_policy_flows(struct hmap *lflows, struct ovn_datapath *od)
> > >  {
> > >      struct ds actions = DS_EMPTY_INITIALIZER;
> > >      struct ds match = DS_EMPTY_INITIALIZER;
> > > @@ -7511,12 +7510,9 @@ add_distributed_routes(struct hmap *lflows,
> struct ovn_datapath *od)
> > >                        is_ipv4 ? "4" : "6", nat->logical_ip,
> > >                        nat->logical_port);
> > >          char *prefix = is_ipv4 ? "" : "xx";
> > > -        ds_put_format(&actions, "outport = %s; eth.src = %s; "
> > > -                      "%sreg0 = ip%s.dst; %sreg1 = %s; next;",
> > > -                      od->l3dgw_port->json_key, nat->external_mac,
> > > -                      prefix, is_ipv4 ? "4" : "6",
> > > -                      prefix, nat->external_ip);
> > > -        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, DROUTE_PRIO,
> > > +        ds_put_format(&actions, "eth.src = %s; %sreg1 = %s; next;",
> > > +                      nat->external_mac, prefix, nat->external_ip);
> > > +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 100,
> > >                        ds_cstr(&match), ds_cstr(&actions));
> > >          ds_clear(&match);
> > >          ds_clear(&actions);
> > > @@ -7547,12 +7543,6 @@ add_route(struct hmap *lflows, const struct
> ovn_port *op,
> > >      }
> > >      build_route_match(op_inport, network_s, plen, is_src_route,
> is_ipv4,
> > >                        &match, &priority);
> > > -    /* traffic for internal IPs of logical switch ports must be sent to
> > > -     * the gw controller through the overlay tunnels
> > > -     */
> > > -    if (op->nbrp && !op->nbrp->n_gateway_chassis) {
> > > -        priority += DROUTE_PRIO;
> > > -    }
> > >
> > >      struct ds actions = DS_EMPTY_INITIALIZER;
> > >      ds_put_format(&actions, "ip.ttl--; "REG_ECMP_GROUP_ID" = 0; %sreg0
> = ",
> > > @@ -9519,9 +9509,13 @@ build_lrouter_flows(struct hmap *datapaths,
> struct hmap *ports,
> > >       * logical router
> > >       */
> > >      HMAP_FOR_EACH (od, key_node, datapaths) {
> > > -        if (od->nbr && od->l3dgw_port) {
> > > -            add_distributed_routes(lflows, od);
> > > +        if (!od->nbr) {
> > > +            continue;
> > > +        }
> > > +        if (od->l3dgw_port) {
> > > +            add_ip_src_policy_flows(lflows, od);
> > >          }
> > > +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 0, "1",
> "next;");
> > >      }
> > >
> > >      /* Logical router ingress table IP_ROUTING & IP_ROUTING_ECMP: IP
> Routing.
> > > diff --git a/tests/ovn.at b/tests/ovn.at
> > > index f39fda2e4..fcc34fd5d 100644
> > > --- a/tests/ovn.at
> > > +++ b/tests/ovn.at
> > > @@ -9637,20 +9637,6 @@ AT_CHECK([as hv3 ovs-vsctl set Open_vSwitch .
> external-ids:ovn-bridge-mappings=p
> > >  OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-vsctl show | \
> > >  grep "Port patch-br-int-to-ln_port" | wc -l`])
> > >
> > > -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep lr_in_ip_routing |
> \
> > > -grep "ip4.src == 10.0.0.3 && is_chassis_resident(\"foo1\")" -c`])
> > > -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep lr_in_ip_routing |
> \
> > > -grep "ip4.src == 10.0.0.4 && is_chassis_resident(\"foo2\")" -c`])
> > > -
> > > -key=`ovn-sbctl --bare --columns tunnel_key list datapath_Binding lr0`
> > > -# Check that the OVS flows appear for the dnat_and_snat entries in
> > > -# lr_in_ip_routing table.
> > > -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int table=17
> | \
> > > -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.3" -c`])
> > > -
> > > -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int table=17
> | \
> > > -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.4" -c`])
> > > -
> > >  # Re-add nat-addresses option
> > >  ovn-nbctl lsp-set-options lrp0-rp router-port=lrp0
> nat-addresses="router"
> > >
> > > @@ -15141,7 +15127,7 @@ ovn-sbctl dump-flows lr0 | grep
> lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > >  # Since the sw0-vir is not claimed by any chassis, eth.dst should be
> set to
> > >  # zero if the ip4.dst is the virtual ip in the router pipeline.
> > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> > >  ])
> > >
> > >  ip_to_hex() {
> > > @@ -15192,7 +15178,7 @@ ovn-sbctl dump-flows lr0 | grep
> lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > >  # There should be an arp resolve flow to resolve the virtual_ip with
> the
> > >  # sw0-p1's MAC.
> > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> > >  ])
> > >
> > >  # Forcibly clear virtual_parent. ovn-controller should release the
> binding
> > > @@ -15233,7 +15219,7 @@ ovn-sbctl dump-flows lr0 | grep
> lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > >  # There should be an arp resolve flow to resolve the virtual_ip with
> the
> > >  # sw0-p2's MAC.
> > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
> > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
> > >  ])
> > >
> > >  # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
> > > @@ -15256,7 +15242,7 @@ ovn-sbctl dump-flows lr0 | grep
> lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > >  # There should be an arp resolve flow to resolve the virtual_ip with
> the
> > >  # sw0-p3's MAC.
> > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> > >  ])
> > >
> > >  # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
> > > @@ -15277,7 +15263,7 @@ ovn-sbctl dump-flows lr0 | grep
> lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > >  > lflows.txt
> > >
> > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> > >  ])
> > >
> > >  # Delete hv1-vif1 port. hv1 should release sw0-vir
> > > @@ -15295,7 +15281,7 @@ ovn-sbctl dump-flows lr0 | grep
> lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > >  > lflows.txt
> > >
> > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> > >  ])
> > >
> > >  # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
> > > @@ -15316,7 +15302,7 @@ ovn-sbctl dump-flows lr0 | grep
> lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > >  > lflows.txt
> > >
> > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> > >  ])
> > >
> > >  # Delete sw0-p2 logical port
> > > diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> >
> > I think it's preferable to have the unit tests in ovn.at whenever
> > possible so that they get run regularly.
> >
> > > index 9ae6c6b1f..1e4f147b4 100644
> > > --- a/tests/system-ovn.at
> > > +++ b/tests/system-ovn.at
> > > @@ -2747,6 +2747,17 @@ ADD_VETH(alice1, alice1, br-int, "172.16.1.2/24",
> "f0:00:00:01:02:05", \
> > >  ovn-nbctl lsp-add alice alice1 \
> > >  -- lsp-set-addresses alice1 "f0:00:00:01:02:05 172.16.1.2"
> > >
> > > +# Add external network
> > > +ADD_NAMESPACES(ext-net)
> > > +ip link add alice-ext netns alice1 type veth peer name ext-veth netns
> ext-net
> > > +ip -n ext-net link set dev ext-veth up
> > > +ip -n ext-net addr add 10.0.0.1/24 dev ext-veth
> > > +ip -n ext-net route add default via 10.0.0.2
> > > +
> > > +ip -n alice1 link set dev alice-ext up
> > > +ip -n alice1 addr add 10.0.0.2/24 dev alice-ext
> > > +ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1
> > > +
> > >  # Add DNAT rules
> > >  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.3 192.168.1.2
> foo1 00:00:02:02:03:04])
> > >  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.4 192.168.1.3
> foo2 00:00:02:02:03:05])
> > > @@ -2754,6 +2765,9 @@ AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat
> 172.16.1.4 192.168.1.3 foo2 00:0
> > >  # Add a SNAT rule
> > >  AT_CHECK([ovn-nbctl lr-nat-add R1 snat 172.16.1.1 192.168.0.0/16])
> > >
> > > +# Add default route to ext-net
> > > +AT_CHECK([ovn-nbctl lr-route-add R1 10.0.0.0/24 172.16.1.2])
> > > +
> > >  ovn-nbctl --wait=hv sync
> > >  OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep
> 'nat(src=172.16.1.1)'])
> > >
> > > @@ -2776,6 +2790,20 @@ NS_CHECK_EXEC([foo2], [ping -q -c 3 -i 0.3 -w 2
> 172.16.1.2 | FORMAT_PING], \
> > >  3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > >  ])
> > >
> > > +# Try to ping external network
> > > +NS_CHECK_EXEC([ext-net], [tcpdump -n -c 3 -i ext-veth dst 172.16.1.3
> and icmp > ext-net.pcap &])
> > > +sleep 1
> > > +AT_CHECK([ovn-nbctl lr-nat-del R1 snat])
> > > +NS_CHECK_EXEC([foo1], [ping -q -c 3 -i 0.3 -w 2 10.0.0.1 |
> FORMAT_PING], \
> > > +[0], [dnl
> > > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > > +])
> > > +
> > > +OVS_WAIT_UNTIL([
> > > +    total_pkts=$(cat ext-net.pcap | wc -l)
> > > +    test "${total_pkts}" = "3"
> > > +])
> > > +
> > >  # We verify that SNAT indeed happened via 'dump-conntrack' command.
> > >  AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.1.1) | \
> > >  sed -e 's/zone=[[0-9]]*/zone=<cleared>/'], [0], [dnl
> > >
> >
> > The system test fails on my test machine (Fedora 32 with OVS master and
> > OVN master + your patch applied):
> >
> > make check-system-userspace TESTSUITEFLAGS="17"
> > [...]
> >  17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
> > (system-ovn.at:2823)
> >
> > [...]
> > system-ovn.at:2802: wait succeeded after 1 seconds
> > ./system-ovn.at:2808: ovs-appctl dpctl/dump-conntrack | grep
> > "dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
> > 's/id=[0-9]*/id=<cleared>/g' -e 's/state=[0-9_A-Z]*/state=<cleared>/g' |
> > sort | uniq | \
> > sed -e 's/zone=[0-9]*/zone=<cleared>/'
> > ./system-ovn.at:2813: ovs-appctl dpctl/flush-conntrack
> > ./system-ovn.at:2817: ip netns exec bar1 sh << NS_EXEC_HEREDOC ping -q
> > -c 3 -i 0.3 -w 2 172.16.1.2 | grep "transmitted" | sed 's/time.*ms$/time
> > 0ms/'
> > NS_EXEC_HEREDOC
> > ./system-ovn.at:2823: ovs-appctl dpctl/dump-conntrack | grep
> > "dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
> > 's/id=[0-9]*/id=<cleared>/g' -e 's/state=[0-9_A-Z]*/state=<cleared>/g' |
> > sort | uniq | \
> > sed -e 's/zone=[0-9]*/zone=<cleared>/'
> > --- -   2020-05-20 10:28:50.053113201 -0400
> > +++ /root/ovn/tests/system-userspace-testsuite.dir/at-groups/17/stdout
> > 2020-05-20 10:28:50.050668409 -0400
> > @@ -1,2 +1 @@
> >
> -icmp,orig=(src=192.168.2.2,dst=172.16.1.2,id=<cleared>,type=8,code=0),reply=(src=172.16.1.2,dst=172.16.1.1,id=<cleared>,type=0,code=0),zone=<cleared>
> > [...]
> >
> > Same happens when running the system test with the kernel datapath:
> > make check-kernel TESTSUITEFLAGS="17"
> > [...]
> > 17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
> > (system-ovn.at:2823)
> >
> > Regards,
> > Dumitru
> >
Han Zhou May 22, 2020, 8:33 p.m. UTC | #5
On Wed, May 20, 2020 at 12:43 PM Lorenzo Bianconi <
lorenzo.bianconi@redhat.com> wrote:
>
> > On Wed, May 20, 2020 at 7:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > >
> > > On 5/19/20 6:42 PM, Lorenzo Bianconi wrote:
> > > > In order to fix the issues introduced by commit
> > > > c0bf32d72f8b ("Manage ARP process locally in a DVR scenario "),
restore
> > > > previous configuration of table 9 in ingress router pipeline and
> > > > introduce a new stage called 'ip_src_policy' used to set the src
address
> > > > info in order to not distribute FIP traffic if DVR is enabled
> > > >
> > > > Fixes: c0bf32d72f8b ("Manage ARP process locally in a DVR scenario
")
> > > > Tested-by: Jakub Libosvar <jlibosva@redhat.com>
> > > > Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> > >
> > > Hi Lorenzo,
> > >
> > > As I mentioned on the RFC patch, if there is a way that we can fix
this
> > > issue without the need for a new pipeline stage I think that would be
> > > preferable. I don't see an easy way to do that but maybe Han and
others
> > > have suggestions.
> > >
> >
> > Hi, I'm reviewing this but could you educate me the detailed scenario
that
> > the original patch c0bf32d72f8b ("Manage ARP process locally in a DVR
> > scenario ") was trying to solve? From the description added in this
patch I
> > still can't understand: why do we need to avoid sending traffic through
> > tunnel, and why setting the src-IP and src-Mac would avoid that?
>
> Hi Han,
>
> the main DVR/FIP issue before commit c0bf32d72f8b was ARP handling. In
> particular, ARP traffic was always sent to the gw router using a geneve
tunnel
> while the IP traffic was sent to the localnet port available on the
chassis.
> Our goal is manage ARP locally when we have installed a DVR/FIP rule on
the
> node. In particular, we need to overwrite reg1 and eth.src that are set in
> table 9 of the ingress router pipeline in order to send the ARP request
using
> the FIP ip/mac address. In order to not break any static routes added to
the
> logical router routing table but at the same time properly set
reg1/eth.src
> I defined a new stable in the ingress router pipeline (stage=12).
> Another possible approach would be reset reg1 and eth.src in pinctrl code
but
> we have no access to sb db in pinctrl thread.
>
> Regards,
> Lorenzo
>

Hi Lorenzo,

Thanks for explaining the scenario (here and also today's OVN meeting).
Going through the patch again, I think there are still something unclear to
me. Let me describe my understanding to the problem first:
The logical topology is:
VM -> LS -> LR -> External LS -> TOR

And physically, VM is on a Chassis, the LR has a distributed gateway port
on a GW chassis.
The VM has dnat_and_snat configured in OVN with a floating IP and external
mac, and we expect the VM -> TOR traffic to directly go out through the
VM's chassis without going through the GW node.

Now IP traffic goes as expected, but for the first packet before the ARP is
resolved from LR for the TOR's IP, the ARP request is triggered and sent
from the VM chassis to the GW node over the tunnel, and broadcasted from GW
node to the TOR. The first IP packet that triggers the ARP is buffered on
the VM chassis during the ARP process, and when ARP response is received
and MAC-binding is updated, the buffered IP packet is also sent to GW node
over the tunnel and sent to the TOR from there, instead of directly from
the VM chassis (according to what you mentioned in the ovn meeting). The
second and later IP packets go from VM chassis to the TOR directly.
However, since the first IP packet (with the nat external MAC) is sent out
from gateway node, the TOR's MAC table can get confused.

Please let know if the above is accurate about the behavior before commit
c0bf32d72f8b, and correct me if any step is wrong.

Now to the root cause. It seems the problem is in the ARP resolving. In
ingress stage "ARP Request", since the dst MAC is still unresolved, the
arp() action will be executed by pinctrl (now still on the VM chassis).
When the arp is reinjected to OVS, since it is going through the pipelines
from the begining (this is my understanding about PACKET_OUT, but I am not
sure), when it hits the GW_REDIRECT stage, since it is not IP packet, it
won't get matched by the flow that bypasses the redirect, so the output
port is replaced by the CR-P, and redirected to the GW chassis through
tunnel.

As to the buffered packet, when it is reinjected to OVS, if it go through
the pipeline again, it should bypass the redirection since in GW_REDIRECT
stage it should hit the flow that matches: ip.src == nat->logical_ip. So I
don't understand why would it get tunneled to GW node.

Maybe there is something I misunderstood about the packet reinjection
behavior. Please correct me. I should have spent time myself to debug it
but I guess you might have already debugged so I'd like to save some effort
here ;)

Now to the solution. Since I am not completely clear about the root cause,
I can't say for sure about the solution. However, here are some suggestions
based on your patch, if it is proved to solve the problem already.
1) This patch is completely different solution from the original commit
c0bf32d72f8b, and the old solution was wrong and abandoned, so I'd suggest
to submit as two separate patches: i) revert the old commit, and ii) the
new solution. This way, it would be more clear for the review and also for
others to understand the patch later.

2) It seems all this patch is doing is matching the "ip.src ==
nat->logical_ip and is_chassis_residency(nat->logical_port)" and then
update the eth.src and reg1. I am not sure why it solves the problem, but I
think this doesn't need a new stage. I think it can be added in the
GW_REDIRECT stage (before ARP_REQUEST), with a higher priority flow with
the same match and actions.

3) In the new function add_ip_src_policy_flows(), the condition checks
seems not sufficient to determine if it the nat can be performed in a
"distributed" manner. Since the logic is related to NAT, it is better to
put it together under the section /* NAT, Defrag and load balancing. */ of
the function build_lrouter_flows(), where the related checks are already
performed to determine if it is "distributed".

            /* For distributed router NAT, determine whether this NAT rule
             * satisfies the conditions for distributed NAT processing. */
            bool distributed = false;
            struct eth_addr mac;
            if (od->l3dgw_port && !strcmp(nat->type, "dnat_and_snat") &&
                nat->logical_port && nat->external_mac) {
                if (eth_addr_from_string(nat->external_mac, &mac)) {
                    distributed = true;
                } else {
                    static struct vlog_rate_limit rl =
                        VLOG_RATE_LIMIT_INIT(5, 1);
                    VLOG_WARN_RL(&rl, "bad mac %s for dnat in router "
                        ""UUID_FMT"", nat->external_mac,
UUID_ARGS(&od->key));
                    continue;
                }
            }

Just put the new flow addition under this section it would be easier to
follow and avoid repeating the logic of determing if it is "distributed".

Thanks,
Han

> >
> > > > ---
> > > > Changes since RFC:
> > > > - added unit-tests
> > > > ---
> > > >  northd/ovn-northd.8.xml | 65
++++++++++++++++++++---------------------
> > > >  northd/ovn-northd.c     | 38 ++++++++++--------------
> > > >  tests/ovn.at            | 28 +++++-------------
> > > >  tests/system-ovn.at     | 28 ++++++++++++++++++
> > > >  4 files changed, 82 insertions(+), 77 deletions(-)
> > > >
> > > > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > > > index 8f224b07f..09dbb52b4 100644
> > > > --- a/northd/ovn-northd.8.xml
> > > > +++ b/northd/ovn-northd.8.xml
> > > > @@ -2484,37 +2484,6 @@ output;
> > > >          </p>
> > > >        </li>
> > > >
> > > > -      <li>
> > > > -        <p>
> > > > -          For distributed logical routers where one of the logical
> > router ports
> > > > -          specifies a <code>redirect-chassis</code>, a priority-400
> > logical
> > > > -          flow for each <code>dnat_and_snat</code> NAT rules
> > configured.
> > > > -          These flows will allow to properly forward traffic to the
> > external
> > > > -          connections if available and avoid sending it through the
> > tunnel.
> > > > -          Assuming the following NAT rule has been configured:
> > > > -        </p>
> > > > -
> > > > -        <pre>
> > > > -external_ip = <var>A</var>;
> > > > -external_mac = <var>B</var>;
> > > > -logical_ip = <var>C</var>;
> > > > -        </pre>
> > > > -
> > > > -        <p>
> > > > -          the following action will be applied:
> > > > -        </p>
> > > > -
> > > > -        <pre>
> > > > -ip.ttl--;
> > > > -reg0 = <var>ip.dst</var>;
> > > > -reg1 = <var>A</var>;
> > > > -eth.src = <var>B</var>;
> > > > -outport = <var>router-port</var>;
> > > > -next;
> > > > -        </pre>
> > > > -
> > > > -      </li>
> > > > -
> > > >        <li>
> > > >          <p>
> > > >            IPv4 routing table.  For each route to IPv4 network
> > <var>N</var> with
> > > > @@ -2660,7 +2629,35 @@ outport = <var>P</var>;
> > > >        </li>
> > > >      </ul>
> > > >
> > > > -    <h3>Ingress Table 12: ARP/ND Resolution</h3>
> > > > +    <h3>Ingress Table 12: IP Source Policy</h3>
> > > > +
> > > > +    <p>
> > > > +      This table contains for distributed logical routers where
one of
> > > > +      the logical router ports specifies a
> > <code>redirect-chassis</code>,
> > > > +      a priority-100 logical flow for each
<code>dnat_and_snat</code>
> > > > +      NAT rules configured.
> > > > +      These flows will allow to properly forward traffic to the
> > external
> > > > +      connections if available and avoid sending it through the
tunnel.
> > > > +      Assuming the following NAT rule has been configured:
> > > > +    </p>
> > > > +
> > > > +    <pre>
> > > > +external_ip = <var>A</var>;
> > > > +external_mac = <var>B</var>;
> > > > +logical_ip = <var>C</var>;
> > > > +    </pre>
> > > > +
> > > > +    <p>
> > > > +      the following action will be applied:
> > > > +    </p>
> > > > +
> > > > +    <pre>
> > > > +reg1 = <var>A</var>;
> > > > +eth.src = <var>B</var>;
> > > > +next;
> > > > +    </pre>
> > > > +
> > >
> > > This section of the man page will show what action we perform on
> > > traffic, i.e., move external_ip in reg1 (for SNAT) and move
external_mac
> > > in eth.src but there's no mention about what kind of traffic will
match
> > > these flows, that is: "ip.src == logical_ip &&
> > > is_chassis_resident(logical_port)".
> > >
> > > > +    <h3>Ingress Table 13: ARP/ND Resolution</h3>
> > > >
> > > >      <p>
> > > >        Any packet that reaches this table is an IP packet whose
next-hop
> > > > @@ -2819,7 +2816,7 @@ outport = <var>P</var>;
> > > >
> > > >      </ul>
> > > >
> > > > -    <h3>Ingress Table 13: Check packet length</h3>
> > > > +    <h3>Ingress Table 14: Check packet length</h3>
> > > >
> > > >      <p>
> > > >        For distributed logical routers with distributed gateway port
> > configured
> > > > @@ -2849,7 +2846,7 @@ REGBIT_PKT_LARGER =
> > check_pkt_larger(<var>L</var>); next;
> > > >        and advances to the next table.
> > > >      </p>
> > > >
> > > > -    <h3>Ingress Table 14: Handle larger packets</h3>
> > > > +    <h3>Ingress Table 15: Handle larger packets</h3>
> > > >
> > >
> > > Indices for tables "Gateway Redirect" and "ARP Request" should be
> > > updated too.
> > >
> > > >      <p>
> > > >        For distributed logical routers with distributed gateway port
> > configured
> > > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > > > index 3c0070ea7..d5f3997a9 100644
> > > > --- a/northd/ovn-northd.c
> > > > +++ b/northd/ovn-northd.c
> > > > @@ -175,11 +175,12 @@ enum ovn_stage {
> > > >      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9,
> > "lr_in_ip_routing")   \
> > > >      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10,
> > "lr_in_ip_routing_ecmp") \
> > > >      PIPELINE_STAGE(ROUTER, IN,  POLICY,          11,
"lr_in_policy")
> >     \
> > > > -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12,
> > "lr_in_arp_resolve")  \
> > > > -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13,
> > "lr_in_chk_pkt_len")   \
> > > > -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
> > 14,"lr_in_larger_pkts")   \
> > > > -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15,
> > "lr_in_gw_redirect")  \
> > > > -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16,
> > "lr_in_arp_request")  \
> > > > +    PIPELINE_STAGE(ROUTER, IN,  IP_SRC_POLICY,   12,
> > "lr_in_ip_src_policy") \
> > > > +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13,
> > "lr_in_arp_resolve")  \
> > > > +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14,
> > "lr_in_chk_pkt_len")   \
> > > > +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
> > 15,"lr_in_larger_pkts")   \
> > > > +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16,
> > "lr_in_gw_redirect")  \
> > > > +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17,
> > "lr_in_arp_request")  \
> > > >
   \
> > > >      /* Logical router egress stages. */
    \
> > > >      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")
   \
> > > > @@ -7103,8 +7104,6 @@ build_routing_policy_flow(struct hmap *lflows,
> > struct ovn_datapath *od,
> > > >      ds_destroy(&actions);
> > > >  }
> > > >
> > > > -/* default logical flow prioriry for distributed routes */
> > > > -#define DROUTE_PRIO 400
> > > >  struct parsed_route {
> > > >      struct ovs_list list_node;
> > > >      struct v46_ip prefix;
> > > > @@ -7493,7 +7492,7 @@ build_ecmp_route_flow(struct hmap *lflows,
struct
> > ovn_datapath *od,
> > > >  }
> > > >
> > > >  static void
> > > > -add_distributed_routes(struct hmap *lflows, struct ovn_datapath
*od)
> > > > +add_ip_src_policy_flows(struct hmap *lflows, struct ovn_datapath
*od)
> > > >  {
> > > >      struct ds actions = DS_EMPTY_INITIALIZER;
> > > >      struct ds match = DS_EMPTY_INITIALIZER;
> > > > @@ -7511,12 +7510,9 @@ add_distributed_routes(struct hmap *lflows,
> > struct ovn_datapath *od)
> > > >                        is_ipv4 ? "4" : "6", nat->logical_ip,
> > > >                        nat->logical_port);
> > > >          char *prefix = is_ipv4 ? "" : "xx";
> > > > -        ds_put_format(&actions, "outport = %s; eth.src = %s; "
> > > > -                      "%sreg0 = ip%s.dst; %sreg1 = %s; next;",
> > > > -                      od->l3dgw_port->json_key, nat->external_mac,
> > > > -                      prefix, is_ipv4 ? "4" : "6",
> > > > -                      prefix, nat->external_ip);
> > > > -        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING,
DROUTE_PRIO,
> > > > +        ds_put_format(&actions, "eth.src = %s; %sreg1 = %s; next;",
> > > > +                      nat->external_mac, prefix, nat->external_ip);
> > > > +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 100,
> > > >                        ds_cstr(&match), ds_cstr(&actions));
> > > >          ds_clear(&match);
> > > >          ds_clear(&actions);
> > > > @@ -7547,12 +7543,6 @@ add_route(struct hmap *lflows, const struct
> > ovn_port *op,
> > > >      }
> > > >      build_route_match(op_inport, network_s, plen, is_src_route,
> > is_ipv4,
> > > >                        &match, &priority);
> > > > -    /* traffic for internal IPs of logical switch ports must be
sent to
> > > > -     * the gw controller through the overlay tunnels
> > > > -     */
> > > > -    if (op->nbrp && !op->nbrp->n_gateway_chassis) {
> > > > -        priority += DROUTE_PRIO;
> > > > -    }
> > > >
> > > >      struct ds actions = DS_EMPTY_INITIALIZER;
> > > >      ds_put_format(&actions, "ip.ttl--; "REG_ECMP_GROUP_ID" = 0;
%sreg0
> > = ",
> > > > @@ -9519,9 +9509,13 @@ build_lrouter_flows(struct hmap *datapaths,
> > struct hmap *ports,
> > > >       * logical router
> > > >       */
> > > >      HMAP_FOR_EACH (od, key_node, datapaths) {
> > > > -        if (od->nbr && od->l3dgw_port) {
> > > > -            add_distributed_routes(lflows, od);
> > > > +        if (!od->nbr) {
> > > > +            continue;
> > > > +        }
> > > > +        if (od->l3dgw_port) {
> > > > +            add_ip_src_policy_flows(lflows, od);
> > > >          }
> > > > +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 0,
"1",
> > "next;");
> > > >      }
> > > >
> > > >      /* Logical router ingress table IP_ROUTING & IP_ROUTING_ECMP:
IP
> > Routing.
> > > > diff --git a/tests/ovn.at b/tests/ovn.at
> > > > index f39fda2e4..fcc34fd5d 100644
> > > > --- a/tests/ovn.at
> > > > +++ b/tests/ovn.at
> > > > @@ -9637,20 +9637,6 @@ AT_CHECK([as hv3 ovs-vsctl set Open_vSwitch .
> > external-ids:ovn-bridge-mappings=p
> > > >  OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-vsctl show | \
> > > >  grep "Port patch-br-int-to-ln_port" | wc -l`])
> > > >
> > > > -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep
lr_in_ip_routing |
> > \
> > > > -grep "ip4.src == 10.0.0.3 && is_chassis_resident(\"foo1\")" -c`])
> > > > -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep
lr_in_ip_routing |
> > \
> > > > -grep "ip4.src == 10.0.0.4 && is_chassis_resident(\"foo2\")" -c`])
> > > > -
> > > > -key=`ovn-sbctl --bare --columns tunnel_key list datapath_Binding
lr0`
> > > > -# Check that the OVS flows appear for the dnat_and_snat entries in
> > > > -# lr_in_ip_routing table.
> > > > -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int
table=17
> > | \
> > > > -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.3" -c`])
> > > > -
> > > > -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int
table=17
> > | \
> > > > -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.4" -c`])
> > > > -
> > > >  # Re-add nat-addresses option
> > > >  ovn-nbctl lsp-set-options lrp0-rp router-port=lrp0
> > nat-addresses="router"
> > > >
> > > > @@ -15141,7 +15127,7 @@ ovn-sbctl dump-flows lr0 | grep
> > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > >  # Since the sw0-vir is not claimed by any chassis, eth.dst should
be
> > set to
> > > >  # zero if the ip4.dst is the virtual ip in the router pipeline.
> > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00;
next;)
> > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00;
next;)
> > > >  ])
> > > >
> > > >  ip_to_hex() {
> > > > @@ -15192,7 +15178,7 @@ ovn-sbctl dump-flows lr0 | grep
> > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > >  # There should be an arp resolve flow to resolve the virtual_ip
with
> > the
> > > >  # sw0-p1's MAC.
> > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03;
next;)
> > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03;
next;)
> > > >  ])
> > > >
> > > >  # Forcibly clear virtual_parent. ovn-controller should release the
> > binding
> > > > @@ -15233,7 +15219,7 @@ ovn-sbctl dump-flows lr0 | grep
> > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > >  # There should be an arp resolve flow to resolve the virtual_ip
with
> > the
> > > >  # sw0-p2's MAC.
> > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05;
next;)
> > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05;
next;)
> > > >  ])
> > > >
> > > >  # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
> > > > @@ -15256,7 +15242,7 @@ ovn-sbctl dump-flows lr0 | grep
> > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > >  # There should be an arp resolve flow to resolve the virtual_ip
with
> > the
> > > >  # sw0-p3's MAC.
> > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04;
next;)
> > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04;
next;)
> > > >  ])
> > > >
> > > >  # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
> > > > @@ -15277,7 +15263,7 @@ ovn-sbctl dump-flows lr0 | grep
> > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > >  > lflows.txt
> > > >
> > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03;
next;)
> > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03;
next;)
> > > >  ])
> > > >
> > > >  # Delete hv1-vif1 port. hv1 should release sw0-vir
> > > > @@ -15295,7 +15281,7 @@ ovn-sbctl dump-flows lr0 | grep
> > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > >  > lflows.txt
> > > >
> > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00;
next;)
> > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00;
next;)
> > > >  ])
> > > >
> > > >  # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
> > > > @@ -15316,7 +15302,7 @@ ovn-sbctl dump-flows lr0 | grep
> > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > >  > lflows.txt
> > > >
> > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04;
next;)
> > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04;
next;)
> > > >  ])
> > > >
> > > >  # Delete sw0-p2 logical port
> > > > diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> > >
> > > I think it's preferable to have the unit tests in ovn.at whenever
> > > possible so that they get run regularly.
> > >
> > > > index 9ae6c6b1f..1e4f147b4 100644
> > > > --- a/tests/system-ovn.at
> > > > +++ b/tests/system-ovn.at
> > > > @@ -2747,6 +2747,17 @@ ADD_VETH(alice1, alice1, br-int, "
172.16.1.2/24",
> > "f0:00:00:01:02:05", \
> > > >  ovn-nbctl lsp-add alice alice1 \
> > > >  -- lsp-set-addresses alice1 "f0:00:00:01:02:05 172.16.1.2"
> > > >
> > > > +# Add external network
> > > > +ADD_NAMESPACES(ext-net)
> > > > +ip link add alice-ext netns alice1 type veth peer name ext-veth
netns
> > ext-net
> > > > +ip -n ext-net link set dev ext-veth up
> > > > +ip -n ext-net addr add 10.0.0.1/24 dev ext-veth
> > > > +ip -n ext-net route add default via 10.0.0.2
> > > > +
> > > > +ip -n alice1 link set dev alice-ext up
> > > > +ip -n alice1 addr add 10.0.0.2/24 dev alice-ext
> > > > +ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1
> > > > +
> > > >  # Add DNAT rules
> > > >  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.3
192.168.1.2
> > foo1 00:00:02:02:03:04])
> > > >  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.4
192.168.1.3
> > foo2 00:00:02:02:03:05])
> > > > @@ -2754,6 +2765,9 @@ AT_CHECK([ovn-nbctl lr-nat-add R1
dnat_and_snat
> > 172.16.1.4 192.168.1.3 foo2 00:0
> > > >  # Add a SNAT rule
> > > >  AT_CHECK([ovn-nbctl lr-nat-add R1 snat 172.16.1.1 192.168.0.0/16])
> > > >
> > > > +# Add default route to ext-net
> > > > +AT_CHECK([ovn-nbctl lr-route-add R1 10.0.0.0/24 172.16.1.2])
> > > > +
> > > >  ovn-nbctl --wait=hv sync
> > > >  OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep
> > 'nat(src=172.16.1.1)'])
> > > >
> > > > @@ -2776,6 +2790,20 @@ NS_CHECK_EXEC([foo2], [ping -q -c 3 -i 0.3
-w 2
> > 172.16.1.2 | FORMAT_PING], \
> > > >  3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > > >  ])
> > > >
> > > > +# Try to ping external network
> > > > +NS_CHECK_EXEC([ext-net], [tcpdump -n -c 3 -i ext-veth dst
172.16.1.3
> > and icmp > ext-net.pcap &])
> > > > +sleep 1
> > > > +AT_CHECK([ovn-nbctl lr-nat-del R1 snat])
> > > > +NS_CHECK_EXEC([foo1], [ping -q -c 3 -i 0.3 -w 2 10.0.0.1 |
> > FORMAT_PING], \
> > > > +[0], [dnl
> > > > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > > > +])
> > > > +
> > > > +OVS_WAIT_UNTIL([
> > > > +    total_pkts=$(cat ext-net.pcap | wc -l)
> > > > +    test "${total_pkts}" = "3"
> > > > +])
> > > > +
> > > >  # We verify that SNAT indeed happened via 'dump-conntrack' command.
> > > >  AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.1.1)
| \
> > > >  sed -e 's/zone=[[0-9]]*/zone=<cleared>/'], [0], [dnl
> > > >
> > >
> > > The system test fails on my test machine (Fedora 32 with OVS master
and
> > > OVN master + your patch applied):
> > >
> > > make check-system-userspace TESTSUITEFLAGS="17"
> > > [...]
> > >  17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
> > > (system-ovn.at:2823)
> > >
> > > [...]
> > > system-ovn.at:2802: wait succeeded after 1 seconds
> > > ./system-ovn.at:2808: ovs-appctl dpctl/dump-conntrack | grep
> > > "dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
> > > 's/id=[0-9]*/id=<cleared>/g' -e
's/state=[0-9_A-Z]*/state=<cleared>/g' |
> > > sort | uniq | \
> > > sed -e 's/zone=[0-9]*/zone=<cleared>/'
> > > ./system-ovn.at:2813: ovs-appctl dpctl/flush-conntrack
> > > ./system-ovn.at:2817: ip netns exec bar1 sh << NS_EXEC_HEREDOC ping -q
> > > -c 3 -i 0.3 -w 2 172.16.1.2 | grep "transmitted" | sed
's/time.*ms$/time
> > > 0ms/'
> > > NS_EXEC_HEREDOC
> > > ./system-ovn.at:2823: ovs-appctl dpctl/dump-conntrack | grep
> > > "dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
> > > 's/id=[0-9]*/id=<cleared>/g' -e
's/state=[0-9_A-Z]*/state=<cleared>/g' |
> > > sort | uniq | \
> > > sed -e 's/zone=[0-9]*/zone=<cleared>/'
> > > --- -   2020-05-20 10:28:50.053113201 -0400
> > > +++ /root/ovn/tests/system-userspace-testsuite.dir/at-groups/17/stdout
> > > 2020-05-20 10:28:50.050668409 -0400
> > > @@ -1,2 +1 @@
> > >
> >
-icmp,orig=(src=192.168.2.2,dst=172.16.1.2,id=<cleared>,type=8,code=0),reply=(src=172.16.1.2,dst=172.16.1.1,id=<cleared>,type=0,code=0),zone=<cleared>
> > > [...]
> > >
> > > Same happens when running the system test with the kernel datapath:
> > > make check-kernel TESTSUITEFLAGS="17"
> > > [...]
> > > 17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
> > > (system-ovn.at:2823)
> > >
> > > Regards,
> > > Dumitru
> > >
Lorenzo Bianconi May 22, 2020, 9:23 p.m. UTC | #6
> On Wed, May 20, 2020 at 12:43 PM Lorenzo Bianconi <
> lorenzo.bianconi@redhat.com> wrote:
> >

[...]

> >
> > Regards,
> > Lorenzo
> >
> 
> Hi Lorenzo,
> 

Hi Han,

thx for the detailed explanation

> Thanks for explaining the scenario (here and also today's OVN meeting).
> Going through the patch again, I think there are still something unclear to
> me. Let me describe my understanding to the problem first:
> The logical topology is:
> VM -> LS -> LR -> External LS -> TOR
> 
> And physically, VM is on a Chassis, the LR has a distributed gateway port
> on a GW chassis.
> The VM has dnat_and_snat configured in OVN with a floating IP and external
> mac, and we expect the VM -> TOR traffic to directly go out through the
> VM's chassis without going through the GW node.

correct

> 
> Now IP traffic goes as expected, but for the first packet before the ARP is
> resolved from LR for the TOR's IP, the ARP request is triggered and sent
> from the VM chassis to the GW node over the tunnel, and broadcasted from GW
> node to the TOR. The first IP packet that triggers the ARP is buffered on
> the VM chassis during the ARP process, and when ARP response is received
> and MAC-binding is updated, the buffered IP packet is also sent to GW node
> over the tunnel and sent to the TOR from there, instead of directly from
> the VM chassis (according to what you mentioned in the ovn meeting). The
> second and later IP packets go from VM chassis to the TOR directly.
> However, since the first IP packet (with the nat external MAC) is sent out
> from gateway node, the TOR's MAC table can get confused.

correct, this was the original issue opened by OSP folks

> 
> Please let know if the above is accurate about the behavior before commit
> c0bf32d72f8b, and correct me if any step is wrong.
> 
> Now to the root cause. It seems the problem is in the ARP resolving. In
> ingress stage "ARP Request", since the dst MAC is still unresolved, the
> arp() action will be executed by pinctrl (now still on the VM chassis).
> When the arp is reinjected to OVS, since it is going through the pipelines
> from the begining (this is my understanding about PACKET_OUT, but I am not
> sure), when it hits the GW_REDIRECT stage, since it is not IP packet, it
> won't get matched by the flow that bypasses the redirect, so the output
> port is replaced by the CR-P, and redirected to the GW chassis through
> tunnel.
> 
> As to the buffered packet, when it is reinjected to OVS, if it go through
> the pipeline again, it should bypass the redirection since in GW_REDIRECT
> stage it should hit the flow that matches: ip.src == nat->logical_ip. So I
> don't understand why would it get tunneled to GW node.
> 
> Maybe there is something I misunderstood about the packet reinjection
> behavior. Please correct me. I should have spent time myself to debug it
> but I guess you might have already debugged so I'd like to save some effort
> here ;)
> 
> Now to the solution. Since I am not completely clear about the root cause,
> I can't say for sure about the solution. However, here are some suggestions
> based on your patch, if it is proved to solve the problem already.
> 1) This patch is completely different solution from the original commit
> c0bf32d72f8b, and the old solution was wrong and abandoned, so I'd suggest
> to submit as two separate patches: i) revert the old commit, and ii) the
> new solution. This way, it would be more clear for the review and also for
> others to understand the patch later.

ack, fine

> 
> 2) It seems all this patch is doing is matching the "ip.src ==
> nat->logical_ip and is_chassis_residency(nat->logical_port)" and then
> update the eth.src and reg1. I am not sure why it solves the problem, but I
> think this doesn't need a new stage. I think it can be added in the
> GW_REDIRECT stage (before ARP_REQUEST), with a higher priority flow with
> the same match and actions.

ack, I am fine with it. It seemed to me more clear having a dedicated stage for
it but I am 100% fine with your suggestion since it is cheaper

> 
> 3) In the new function add_ip_src_policy_flows(), the condition checks
> seems not sufficient to determine if it the nat can be performed in a
> "distributed" manner. Since the logic is related to NAT, it is better to
> put it together under the section /* NAT, Defrag and load balancing. */ of
> the function build_lrouter_flows(), where the related checks are already
> performed to determine if it is "distributed".
> 
>             /* For distributed router NAT, determine whether this NAT rule
>              * satisfies the conditions for distributed NAT processing. */
>             bool distributed = false;
>             struct eth_addr mac;
>             if (od->l3dgw_port && !strcmp(nat->type, "dnat_and_snat") &&
>                 nat->logical_port && nat->external_mac) {
>                 if (eth_addr_from_string(nat->external_mac, &mac)) {
>                     distributed = true;
>                 } else {
>                     static struct vlog_rate_limit rl =
>                         VLOG_RATE_LIMIT_INIT(5, 1);
>                     VLOG_WARN_RL(&rl, "bad mac %s for dnat in router "
>                         ""UUID_FMT"", nat->external_mac,
> UUID_ARGS(&od->key));
>                     continue;
>                 }
>             }
> 
> Just put the new flow addition under this section it would be easier to
> follow and avoid repeating the logic of determing if it is "distributed".

ack, thx a lot for the hint, I will implement it tomorrow and I will send a v3
if OSP folks confirm it works fine (but I guess so)

Regards,
Lorenzo

> 
> Thanks,
> Han
> 
> > >
> > > > > ---
> > > > > Changes since RFC:
> > > > > - added unit-tests
> > > > > ---
> > > > >  northd/ovn-northd.8.xml | 65
> ++++++++++++++++++++---------------------
> > > > >  northd/ovn-northd.c     | 38 ++++++++++--------------
> > > > >  tests/ovn.at            | 28 +++++-------------
> > > > >  tests/system-ovn.at     | 28 ++++++++++++++++++
> > > > >  4 files changed, 82 insertions(+), 77 deletions(-)
> > > > >
> > > > > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > > > > index 8f224b07f..09dbb52b4 100644
> > > > > --- a/northd/ovn-northd.8.xml
> > > > > +++ b/northd/ovn-northd.8.xml
> > > > > @@ -2484,37 +2484,6 @@ output;
> > > > >          </p>
> > > > >        </li>
> > > > >
> > > > > -      <li>
> > > > > -        <p>
> > > > > -          For distributed logical routers where one of the logical
> > > router ports
> > > > > -          specifies a <code>redirect-chassis</code>, a priority-400
> > > logical
> > > > > -          flow for each <code>dnat_and_snat</code> NAT rules
> > > configured.
> > > > > -          These flows will allow to properly forward traffic to the
> > > external
> > > > > -          connections if available and avoid sending it through the
> > > tunnel.
> > > > > -          Assuming the following NAT rule has been configured:
> > > > > -        </p>
> > > > > -
> > > > > -        <pre>
> > > > > -external_ip = <var>A</var>;
> > > > > -external_mac = <var>B</var>;
> > > > > -logical_ip = <var>C</var>;
> > > > > -        </pre>
> > > > > -
> > > > > -        <p>
> > > > > -          the following action will be applied:
> > > > > -        </p>
> > > > > -
> > > > > -        <pre>
> > > > > -ip.ttl--;
> > > > > -reg0 = <var>ip.dst</var>;
> > > > > -reg1 = <var>A</var>;
> > > > > -eth.src = <var>B</var>;
> > > > > -outport = <var>router-port</var>;
> > > > > -next;
> > > > > -        </pre>
> > > > > -
> > > > > -      </li>
> > > > > -
> > > > >        <li>
> > > > >          <p>
> > > > >            IPv4 routing table.  For each route to IPv4 network
> > > <var>N</var> with
> > > > > @@ -2660,7 +2629,35 @@ outport = <var>P</var>;
> > > > >        </li>
> > > > >      </ul>
> > > > >
> > > > > -    <h3>Ingress Table 12: ARP/ND Resolution</h3>
> > > > > +    <h3>Ingress Table 12: IP Source Policy</h3>
> > > > > +
> > > > > +    <p>
> > > > > +      This table contains for distributed logical routers where
> one of
> > > > > +      the logical router ports specifies a
> > > <code>redirect-chassis</code>,
> > > > > +      a priority-100 logical flow for each
> <code>dnat_and_snat</code>
> > > > > +      NAT rules configured.
> > > > > +      These flows will allow to properly forward traffic to the
> > > external
> > > > > +      connections if available and avoid sending it through the
> tunnel.
> > > > > +      Assuming the following NAT rule has been configured:
> > > > > +    </p>
> > > > > +
> > > > > +    <pre>
> > > > > +external_ip = <var>A</var>;
> > > > > +external_mac = <var>B</var>;
> > > > > +logical_ip = <var>C</var>;
> > > > > +    </pre>
> > > > > +
> > > > > +    <p>
> > > > > +      the following action will be applied:
> > > > > +    </p>
> > > > > +
> > > > > +    <pre>
> > > > > +reg1 = <var>A</var>;
> > > > > +eth.src = <var>B</var>;
> > > > > +next;
> > > > > +    </pre>
> > > > > +
> > > >
> > > > This section of the man page will show what action we perform on
> > > > traffic, i.e., move external_ip in reg1 (for SNAT) and move
> external_mac
> > > > in eth.src but there's no mention about what kind of traffic will
> match
> > > > these flows, that is: "ip.src == logical_ip &&
> > > > is_chassis_resident(logical_port)".
> > > >
> > > > > +    <h3>Ingress Table 13: ARP/ND Resolution</h3>
> > > > >
> > > > >      <p>
> > > > >        Any packet that reaches this table is an IP packet whose
> next-hop
> > > > > @@ -2819,7 +2816,7 @@ outport = <var>P</var>;
> > > > >
> > > > >      </ul>
> > > > >
> > > > > -    <h3>Ingress Table 13: Check packet length</h3>
> > > > > +    <h3>Ingress Table 14: Check packet length</h3>
> > > > >
> > > > >      <p>
> > > > >        For distributed logical routers with distributed gateway port
> > > configured
> > > > > @@ -2849,7 +2846,7 @@ REGBIT_PKT_LARGER =
> > > check_pkt_larger(<var>L</var>); next;
> > > > >        and advances to the next table.
> > > > >      </p>
> > > > >
> > > > > -    <h3>Ingress Table 14: Handle larger packets</h3>
> > > > > +    <h3>Ingress Table 15: Handle larger packets</h3>
> > > > >
> > > >
> > > > Indices for tables "Gateway Redirect" and "ARP Request" should be
> > > > updated too.
> > > >
> > > > >      <p>
> > > > >        For distributed logical routers with distributed gateway port
> > > configured
> > > > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > > > > index 3c0070ea7..d5f3997a9 100644
> > > > > --- a/northd/ovn-northd.c
> > > > > +++ b/northd/ovn-northd.c
> > > > > @@ -175,11 +175,12 @@ enum ovn_stage {
> > > > >      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9,
> > > "lr_in_ip_routing")   \
> > > > >      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10,
> > > "lr_in_ip_routing_ecmp") \
> > > > >      PIPELINE_STAGE(ROUTER, IN,  POLICY,          11,
> "lr_in_policy")
> > >     \
> > > > > -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12,
> > > "lr_in_arp_resolve")  \
> > > > > -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13,
> > > "lr_in_chk_pkt_len")   \
> > > > > -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
> > > 14,"lr_in_larger_pkts")   \
> > > > > -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15,
> > > "lr_in_gw_redirect")  \
> > > > > -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16,
> > > "lr_in_arp_request")  \
> > > > > +    PIPELINE_STAGE(ROUTER, IN,  IP_SRC_POLICY,   12,
> > > "lr_in_ip_src_policy") \
> > > > > +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13,
> > > "lr_in_arp_resolve")  \
> > > > > +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14,
> > > "lr_in_chk_pkt_len")   \
> > > > > +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
> > > 15,"lr_in_larger_pkts")   \
> > > > > +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16,
> > > "lr_in_gw_redirect")  \
> > > > > +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17,
> > > "lr_in_arp_request")  \
> > > > >
>    \
> > > > >      /* Logical router egress stages. */
>     \
> > > > >      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")
>    \
> > > > > @@ -7103,8 +7104,6 @@ build_routing_policy_flow(struct hmap *lflows,
> > > struct ovn_datapath *od,
> > > > >      ds_destroy(&actions);
> > > > >  }
> > > > >
> > > > > -/* default logical flow prioriry for distributed routes */
> > > > > -#define DROUTE_PRIO 400
> > > > >  struct parsed_route {
> > > > >      struct ovs_list list_node;
> > > > >      struct v46_ip prefix;
> > > > > @@ -7493,7 +7492,7 @@ build_ecmp_route_flow(struct hmap *lflows,
> struct
> > > ovn_datapath *od,
> > > > >  }
> > > > >
> > > > >  static void
> > > > > -add_distributed_routes(struct hmap *lflows, struct ovn_datapath
> *od)
> > > > > +add_ip_src_policy_flows(struct hmap *lflows, struct ovn_datapath
> *od)
> > > > >  {
> > > > >      struct ds actions = DS_EMPTY_INITIALIZER;
> > > > >      struct ds match = DS_EMPTY_INITIALIZER;
> > > > > @@ -7511,12 +7510,9 @@ add_distributed_routes(struct hmap *lflows,
> > > struct ovn_datapath *od)
> > > > >                        is_ipv4 ? "4" : "6", nat->logical_ip,
> > > > >                        nat->logical_port);
> > > > >          char *prefix = is_ipv4 ? "" : "xx";
> > > > > -        ds_put_format(&actions, "outport = %s; eth.src = %s; "
> > > > > -                      "%sreg0 = ip%s.dst; %sreg1 = %s; next;",
> > > > > -                      od->l3dgw_port->json_key, nat->external_mac,
> > > > > -                      prefix, is_ipv4 ? "4" : "6",
> > > > > -                      prefix, nat->external_ip);
> > > > > -        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING,
> DROUTE_PRIO,
> > > > > +        ds_put_format(&actions, "eth.src = %s; %sreg1 = %s; next;",
> > > > > +                      nat->external_mac, prefix, nat->external_ip);
> > > > > +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 100,
> > > > >                        ds_cstr(&match), ds_cstr(&actions));
> > > > >          ds_clear(&match);
> > > > >          ds_clear(&actions);
> > > > > @@ -7547,12 +7543,6 @@ add_route(struct hmap *lflows, const struct
> > > ovn_port *op,
> > > > >      }
> > > > >      build_route_match(op_inport, network_s, plen, is_src_route,
> > > is_ipv4,
> > > > >                        &match, &priority);
> > > > > -    /* traffic for internal IPs of logical switch ports must be
> sent to
> > > > > -     * the gw controller through the overlay tunnels
> > > > > -     */
> > > > > -    if (op->nbrp && !op->nbrp->n_gateway_chassis) {
> > > > > -        priority += DROUTE_PRIO;
> > > > > -    }
> > > > >
> > > > >      struct ds actions = DS_EMPTY_INITIALIZER;
> > > > >      ds_put_format(&actions, "ip.ttl--; "REG_ECMP_GROUP_ID" = 0;
> %sreg0
> > > = ",
> > > > > @@ -9519,9 +9509,13 @@ build_lrouter_flows(struct hmap *datapaths,
> > > struct hmap *ports,
> > > > >       * logical router
> > > > >       */
> > > > >      HMAP_FOR_EACH (od, key_node, datapaths) {
> > > > > -        if (od->nbr && od->l3dgw_port) {
> > > > > -            add_distributed_routes(lflows, od);
> > > > > +        if (!od->nbr) {
> > > > > +            continue;
> > > > > +        }
> > > > > +        if (od->l3dgw_port) {
> > > > > +            add_ip_src_policy_flows(lflows, od);
> > > > >          }
> > > > > +        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 0,
> "1",
> > > "next;");
> > > > >      }
> > > > >
> > > > >      /* Logical router ingress table IP_ROUTING & IP_ROUTING_ECMP:
> IP
> > > Routing.
> > > > > diff --git a/tests/ovn.at b/tests/ovn.at
> > > > > index f39fda2e4..fcc34fd5d 100644
> > > > > --- a/tests/ovn.at
> > > > > +++ b/tests/ovn.at
> > > > > @@ -9637,20 +9637,6 @@ AT_CHECK([as hv3 ovs-vsctl set Open_vSwitch .
> > > external-ids:ovn-bridge-mappings=p
> > > > >  OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-vsctl show | \
> > > > >  grep "Port patch-br-int-to-ln_port" | wc -l`])
> > > > >
> > > > > -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep
> lr_in_ip_routing |
> > > \
> > > > > -grep "ip4.src == 10.0.0.3 && is_chassis_resident(\"foo1\")" -c`])
> > > > > -AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep
> lr_in_ip_routing |
> > > \
> > > > > -grep "ip4.src == 10.0.0.4 && is_chassis_resident(\"foo2\")" -c`])
> > > > > -
> > > > > -key=`ovn-sbctl --bare --columns tunnel_key list datapath_Binding
> lr0`
> > > > > -# Check that the OVS flows appear for the dnat_and_snat entries in
> > > > > -# lr_in_ip_routing table.
> > > > > -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int
> table=17
> > > | \
> > > > > -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.3" -c`])
> > > > > -
> > > > > -OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int
> table=17
> > > | \
> > > > > -grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.4" -c`])
> > > > > -
> > > > >  # Re-add nat-addresses option
> > > > >  ovn-nbctl lsp-set-options lrp0-rp router-port=lrp0
> > > nat-addresses="router"
> > > > >
> > > > > @@ -15141,7 +15127,7 @@ ovn-sbctl dump-flows lr0 | grep
> > > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > > >  # Since the sw0-vir is not claimed by any chassis, eth.dst should
> be
> > > set to
> > > > >  # zero if the ip4.dst is the virtual ip in the router pipeline.
> > > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00;
> next;)
> > > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00;
> next;)
> > > > >  ])
> > > > >
> > > > >  ip_to_hex() {
> > > > > @@ -15192,7 +15178,7 @@ ovn-sbctl dump-flows lr0 | grep
> > > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > > >  # There should be an arp resolve flow to resolve the virtual_ip
> with
> > > the
> > > > >  # sw0-p1's MAC.
> > > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03;
> next;)
> > > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03;
> next;)
> > > > >  ])
> > > > >
> > > > >  # Forcibly clear virtual_parent. ovn-controller should release the
> > > binding
> > > > > @@ -15233,7 +15219,7 @@ ovn-sbctl dump-flows lr0 | grep
> > > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > > >  # There should be an arp resolve flow to resolve the virtual_ip
> with
> > > the
> > > > >  # sw0-p2's MAC.
> > > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05;
> next;)
> > > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05;
> next;)
> > > > >  ])
> > > > >
> > > > >  # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
> > > > > @@ -15256,7 +15242,7 @@ ovn-sbctl dump-flows lr0 | grep
> > > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > > >  # There should be an arp resolve flow to resolve the virtual_ip
> with
> > > the
> > > > >  # sw0-p3's MAC.
> > > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04;
> next;)
> > > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04;
> next;)
> > > > >  ])
> > > > >
> > > > >  # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
> > > > > @@ -15277,7 +15263,7 @@ ovn-sbctl dump-flows lr0 | grep
> > > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > > >  > lflows.txt
> > > > >
> > > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03;
> next;)
> > > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03;
> next;)
> > > > >  ])
> > > > >
> > > > >  # Delete hv1-vif1 port. hv1 should release sw0-vir
> > > > > @@ -15295,7 +15281,7 @@ ovn-sbctl dump-flows lr0 | grep
> > > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > > >  > lflows.txt
> > > > >
> > > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00;
> next;)
> > > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00;
> next;)
> > > > >  ])
> > > > >
> > > > >  # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
> > > > > @@ -15316,7 +15302,7 @@ ovn-sbctl dump-flows lr0 | grep
> > > lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> > > > >  > lflows.txt
> > > > >
> > > > >  AT_CHECK([cat lflows.txt], [0], [dnl
> > > > > -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04;
> next;)
> > > > > +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> > > "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04;
> next;)
> > > > >  ])
> > > > >
> > > > >  # Delete sw0-p2 logical port
> > > > > diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> > > >
> > > > I think it's preferable to have the unit tests in ovn.at whenever
> > > > possible so that they get run regularly.
> > > >
> > > > > index 9ae6c6b1f..1e4f147b4 100644
> > > > > --- a/tests/system-ovn.at
> > > > > +++ b/tests/system-ovn.at
> > > > > @@ -2747,6 +2747,17 @@ ADD_VETH(alice1, alice1, br-int, "
> 172.16.1.2/24",
> > > "f0:00:00:01:02:05", \
> > > > >  ovn-nbctl lsp-add alice alice1 \
> > > > >  -- lsp-set-addresses alice1 "f0:00:00:01:02:05 172.16.1.2"
> > > > >
> > > > > +# Add external network
> > > > > +ADD_NAMESPACES(ext-net)
> > > > > +ip link add alice-ext netns alice1 type veth peer name ext-veth
> netns
> > > ext-net
> > > > > +ip -n ext-net link set dev ext-veth up
> > > > > +ip -n ext-net addr add 10.0.0.1/24 dev ext-veth
> > > > > +ip -n ext-net route add default via 10.0.0.2
> > > > > +
> > > > > +ip -n alice1 link set dev alice-ext up
> > > > > +ip -n alice1 addr add 10.0.0.2/24 dev alice-ext
> > > > > +ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1
> > > > > +
> > > > >  # Add DNAT rules
> > > > >  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.3
> 192.168.1.2
> > > foo1 00:00:02:02:03:04])
> > > > >  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.4
> 192.168.1.3
> > > foo2 00:00:02:02:03:05])
> > > > > @@ -2754,6 +2765,9 @@ AT_CHECK([ovn-nbctl lr-nat-add R1
> dnat_and_snat
> > > 172.16.1.4 192.168.1.3 foo2 00:0
> > > > >  # Add a SNAT rule
> > > > >  AT_CHECK([ovn-nbctl lr-nat-add R1 snat 172.16.1.1 192.168.0.0/16])
> > > > >
> > > > > +# Add default route to ext-net
> > > > > +AT_CHECK([ovn-nbctl lr-route-add R1 10.0.0.0/24 172.16.1.2])
> > > > > +
> > > > >  ovn-nbctl --wait=hv sync
> > > > >  OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep
> > > 'nat(src=172.16.1.1)'])
> > > > >
> > > > > @@ -2776,6 +2790,20 @@ NS_CHECK_EXEC([foo2], [ping -q -c 3 -i 0.3
> -w 2
> > > 172.16.1.2 | FORMAT_PING], \
> > > > >  3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > > > >  ])
> > > > >
> > > > > +# Try to ping external network
> > > > > +NS_CHECK_EXEC([ext-net], [tcpdump -n -c 3 -i ext-veth dst
> 172.16.1.3
> > > and icmp > ext-net.pcap &])
> > > > > +sleep 1
> > > > > +AT_CHECK([ovn-nbctl lr-nat-del R1 snat])
> > > > > +NS_CHECK_EXEC([foo1], [ping -q -c 3 -i 0.3 -w 2 10.0.0.1 |
> > > FORMAT_PING], \
> > > > > +[0], [dnl
> > > > > +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> > > > > +])
> > > > > +
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    total_pkts=$(cat ext-net.pcap | wc -l)
> > > > > +    test "${total_pkts}" = "3"
> > > > > +])
> > > > > +
> > > > >  # We verify that SNAT indeed happened via 'dump-conntrack' command.
> > > > >  AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.1.1)
> | \
> > > > >  sed -e 's/zone=[[0-9]]*/zone=<cleared>/'], [0], [dnl
> > > > >
> > > >
> > > > The system test fails on my test machine (Fedora 32 with OVS master
> and
> > > > OVN master + your patch applied):
> > > >
> > > > make check-system-userspace TESTSUITEFLAGS="17"
> > > > [...]
> > > >  17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
> > > > (system-ovn.at:2823)
> > > >
> > > > [...]
> > > > system-ovn.at:2802: wait succeeded after 1 seconds
> > > > ./system-ovn.at:2808: ovs-appctl dpctl/dump-conntrack | grep
> > > > "dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
> > > > 's/id=[0-9]*/id=<cleared>/g' -e
> 's/state=[0-9_A-Z]*/state=<cleared>/g' |
> > > > sort | uniq | \
> > > > sed -e 's/zone=[0-9]*/zone=<cleared>/'
> > > > ./system-ovn.at:2813: ovs-appctl dpctl/flush-conntrack
> > > > ./system-ovn.at:2817: ip netns exec bar1 sh << NS_EXEC_HEREDOC ping -q
> > > > -c 3 -i 0.3 -w 2 172.16.1.2 | grep "transmitted" | sed
> 's/time.*ms$/time
> > > > 0ms/'
> > > > NS_EXEC_HEREDOC
> > > > ./system-ovn.at:2823: ovs-appctl dpctl/dump-conntrack | grep
> > > > "dst=172.16.1.1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e
> > > > 's/id=[0-9]*/id=<cleared>/g' -e
> 's/state=[0-9_A-Z]*/state=<cleared>/g' |
> > > > sort | uniq | \
> > > > sed -e 's/zone=[0-9]*/zone=<cleared>/'
> > > > --- -   2020-05-20 10:28:50.053113201 -0400
> > > > +++ /root/ovn/tests/system-userspace-testsuite.dir/at-groups/17/stdout
> > > > 2020-05-20 10:28:50.050668409 -0400
> > > > @@ -1,2 +1 @@
> > > >
> > >
> -icmp,orig=(src=192.168.2.2,dst=172.16.1.2,id=<cleared>,type=8,code=0),reply=(src=172.16.1.2,dst=172.16.1.1,id=<cleared>,type=0,code=0),zone=<cleared>
> > > > [...]
> > > >
> > > > Same happens when running the system test with the kernel datapath:
> > > > make check-kernel TESTSUITEFLAGS="17"
> > > > [...]
> > > > 17: ovn -- DNAT and SNAT on distributed router - N/S FAILED
> > > > (system-ovn.at:2823)
> > > >
> > > > Regards,
> > > > Dumitru
> > > >
diff mbox series

Patch

diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index 8f224b07f..09dbb52b4 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -2484,37 +2484,6 @@  output;
         </p>
       </li>
 
-      <li>
-        <p>
-          For distributed logical routers where one of the logical router ports
-          specifies a <code>redirect-chassis</code>, a priority-400 logical
-          flow for each <code>dnat_and_snat</code> NAT rules configured.
-          These flows will allow to properly forward traffic to the external
-          connections if available and avoid sending it through the tunnel.
-          Assuming the following NAT rule has been configured:
-        </p>
-
-        <pre>
-external_ip = <var>A</var>;
-external_mac = <var>B</var>;
-logical_ip = <var>C</var>;
-        </pre>
-
-        <p>
-          the following action will be applied:
-        </p>
-
-        <pre>
-ip.ttl--;
-reg0 = <var>ip.dst</var>;
-reg1 = <var>A</var>;
-eth.src = <var>B</var>;
-outport = <var>router-port</var>;
-next;
-        </pre>
-
-      </li>
-
       <li>
         <p>
           IPv4 routing table.  For each route to IPv4 network <var>N</var> with
@@ -2660,7 +2629,35 @@  outport = <var>P</var>;
       </li>
     </ul>
 
-    <h3>Ingress Table 12: ARP/ND Resolution</h3>
+    <h3>Ingress Table 12: IP Source Policy</h3>
+
+    <p>
+      This table contains for distributed logical routers where one of
+      the logical router ports specifies a <code>redirect-chassis</code>,
+      a priority-100 logical flow for each <code>dnat_and_snat</code>
+      NAT rules configured.
+      These flows will allow to properly forward traffic to the external
+      connections if available and avoid sending it through the tunnel.
+      Assuming the following NAT rule has been configured:
+    </p>
+
+    <pre>
+external_ip = <var>A</var>;
+external_mac = <var>B</var>;
+logical_ip = <var>C</var>;
+    </pre>
+
+    <p>
+      the following action will be applied:
+    </p>
+
+    <pre>
+reg1 = <var>A</var>;
+eth.src = <var>B</var>;
+next;
+    </pre>
+
+    <h3>Ingress Table 13: ARP/ND Resolution</h3>
 
     <p>
       Any packet that reaches this table is an IP packet whose next-hop
@@ -2819,7 +2816,7 @@  outport = <var>P</var>;
 
     </ul>
 
-    <h3>Ingress Table 13: Check packet length</h3>
+    <h3>Ingress Table 14: Check packet length</h3>
 
     <p>
       For distributed logical routers with distributed gateway port configured
@@ -2849,7 +2846,7 @@  REGBIT_PKT_LARGER = check_pkt_larger(<var>L</var>); next;
       and advances to the next table.
     </p>
 
-    <h3>Ingress Table 14: Handle larger packets</h3>
+    <h3>Ingress Table 15: Handle larger packets</h3>
 
     <p>
       For distributed logical routers with distributed gateway port configured
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 3c0070ea7..d5f3997a9 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -175,11 +175,12 @@  enum ovn_stage {
     PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9, "lr_in_ip_routing")   \
     PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10, "lr_in_ip_routing_ecmp") \
     PIPELINE_STAGE(ROUTER, IN,  POLICY,          11, "lr_in_policy")       \
-    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12, "lr_in_arp_resolve")  \
-    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13, "lr_in_chk_pkt_len")   \
-    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     14,"lr_in_larger_pkts")   \
-    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15, "lr_in_gw_redirect")  \
-    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16, "lr_in_arp_request")  \
+    PIPELINE_STAGE(ROUTER, IN,  IP_SRC_POLICY,   12, "lr_in_ip_src_policy") \
+    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13, "lr_in_arp_resolve")  \
+    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14, "lr_in_chk_pkt_len")   \
+    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     15,"lr_in_larger_pkts")   \
+    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16, "lr_in_gw_redirect")  \
+    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17, "lr_in_arp_request")  \
                                                                       \
     /* Logical router egress stages. */                               \
     PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
@@ -7103,8 +7104,6 @@  build_routing_policy_flow(struct hmap *lflows, struct ovn_datapath *od,
     ds_destroy(&actions);
 }
 
-/* default logical flow prioriry for distributed routes */
-#define DROUTE_PRIO 400
 struct parsed_route {
     struct ovs_list list_node;
     struct v46_ip prefix;
@@ -7493,7 +7492,7 @@  build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
 }
 
 static void
-add_distributed_routes(struct hmap *lflows, struct ovn_datapath *od)
+add_ip_src_policy_flows(struct hmap *lflows, struct ovn_datapath *od)
 {
     struct ds actions = DS_EMPTY_INITIALIZER;
     struct ds match = DS_EMPTY_INITIALIZER;
@@ -7511,12 +7510,9 @@  add_distributed_routes(struct hmap *lflows, struct ovn_datapath *od)
                       is_ipv4 ? "4" : "6", nat->logical_ip,
                       nat->logical_port);
         char *prefix = is_ipv4 ? "" : "xx";
-        ds_put_format(&actions, "outport = %s; eth.src = %s; "
-                      "%sreg0 = ip%s.dst; %sreg1 = %s; next;",
-                      od->l3dgw_port->json_key, nat->external_mac,
-                      prefix, is_ipv4 ? "4" : "6",
-                      prefix, nat->external_ip);
-        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, DROUTE_PRIO,
+        ds_put_format(&actions, "eth.src = %s; %sreg1 = %s; next;",
+                      nat->external_mac, prefix, nat->external_ip);
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 100,
                       ds_cstr(&match), ds_cstr(&actions));
         ds_clear(&match);
         ds_clear(&actions);
@@ -7547,12 +7543,6 @@  add_route(struct hmap *lflows, const struct ovn_port *op,
     }
     build_route_match(op_inport, network_s, plen, is_src_route, is_ipv4,
                       &match, &priority);
-    /* traffic for internal IPs of logical switch ports must be sent to
-     * the gw controller through the overlay tunnels
-     */
-    if (op->nbrp && !op->nbrp->n_gateway_chassis) {
-        priority += DROUTE_PRIO;
-    }
 
     struct ds actions = DS_EMPTY_INITIALIZER;
     ds_put_format(&actions, "ip.ttl--; "REG_ECMP_GROUP_ID" = 0; %sreg0 = ",
@@ -9519,9 +9509,13 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
      * logical router
      */
     HMAP_FOR_EACH (od, key_node, datapaths) {
-        if (od->nbr && od->l3dgw_port) {
-            add_distributed_routes(lflows, od);
+        if (!od->nbr) {
+            continue;
+        }
+        if (od->l3dgw_port) {
+            add_ip_src_policy_flows(lflows, od);
         }
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_SRC_POLICY, 0, "1", "next;");
     }
 
     /* Logical router ingress table IP_ROUTING & IP_ROUTING_ECMP: IP Routing.
diff --git a/tests/ovn.at b/tests/ovn.at
index f39fda2e4..fcc34fd5d 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -9637,20 +9637,6 @@  AT_CHECK([as hv3 ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=p
 OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-vsctl show | \
 grep "Port patch-br-int-to-ln_port" | wc -l`])
 
-AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep lr_in_ip_routing | \
-grep "ip4.src == 10.0.0.3 && is_chassis_resident(\"foo1\")" -c`])
-AT_CHECK([test 1 = `ovn-sbctl dump-flows lr0 | grep lr_in_ip_routing | \
-grep "ip4.src == 10.0.0.4 && is_chassis_resident(\"foo2\")" -c`])
-
-key=`ovn-sbctl --bare --columns tunnel_key list datapath_Binding lr0`
-# Check that the OVS flows appear for the dnat_and_snat entries in
-# lr_in_ip_routing table.
-OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int table=17 | \
-grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.3" -c`])
-
-OVS_WAIT_UNTIL([test 1 = `as hv3 ovs-ofctl dump-flows br-int table=17 | \
-grep "priority=400,ip,metadata=0x$key,nw_src=10.0.0.4" -c`])
-
 # Re-add nat-addresses option
 ovn-nbctl lsp-set-options lrp0-rp router-port=lrp0 nat-addresses="router"
 
@@ -15141,7 +15127,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 # Since the sw0-vir is not claimed by any chassis, eth.dst should be set to
 # zero if the ip4.dst is the virtual ip in the router pipeline.
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
 ])
 
 ip_to_hex() {
@@ -15192,7 +15178,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 # There should be an arp resolve flow to resolve the virtual_ip with the
 # sw0-p1's MAC.
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
 ])
 
 # Forcibly clear virtual_parent. ovn-controller should release the binding
@@ -15233,7 +15219,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 # There should be an arp resolve flow to resolve the virtual_ip with the
 # sw0-p2's MAC.
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
 ])
 
 # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
@@ -15256,7 +15242,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 # There should be an arp resolve flow to resolve the virtual_ip with the
 # sw0-p3's MAC.
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
 ])
 
 # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
@@ -15277,7 +15263,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 > lflows.txt
 
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
 ])
 
 # Delete hv1-vif1 port. hv1 should release sw0-vir
@@ -15295,7 +15281,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 > lflows.txt
 
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
 ])
 
 # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
@@ -15316,7 +15302,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 > lflows.txt
 
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
 ])
 
 # Delete sw0-p2 logical port
diff --git a/tests/system-ovn.at b/tests/system-ovn.at
index 9ae6c6b1f..1e4f147b4 100644
--- a/tests/system-ovn.at
+++ b/tests/system-ovn.at
@@ -2747,6 +2747,17 @@  ADD_VETH(alice1, alice1, br-int, "172.16.1.2/24", "f0:00:00:01:02:05", \
 ovn-nbctl lsp-add alice alice1 \
 -- lsp-set-addresses alice1 "f0:00:00:01:02:05 172.16.1.2"
 
+# Add external network
+ADD_NAMESPACES(ext-net)
+ip link add alice-ext netns alice1 type veth peer name ext-veth netns ext-net
+ip -n ext-net link set dev ext-veth up
+ip -n ext-net addr add 10.0.0.1/24 dev ext-veth
+ip -n ext-net route add default via 10.0.0.2
+
+ip -n alice1 link set dev alice-ext up
+ip -n alice1 addr add 10.0.0.2/24 dev alice-ext
+ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1
+
 # Add DNAT rules
 AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.3 192.168.1.2 foo1 00:00:02:02:03:04])
 AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.4 192.168.1.3 foo2 00:00:02:02:03:05])
@@ -2754,6 +2765,9 @@  AT_CHECK([ovn-nbctl lr-nat-add R1 dnat_and_snat 172.16.1.4 192.168.1.3 foo2 00:0
 # Add a SNAT rule
 AT_CHECK([ovn-nbctl lr-nat-add R1 snat 172.16.1.1 192.168.0.0/16])
 
+# Add default route to ext-net
+AT_CHECK([ovn-nbctl lr-route-add R1 10.0.0.0/24 172.16.1.2])
+
 ovn-nbctl --wait=hv sync
 OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep 'nat(src=172.16.1.1)'])
 
@@ -2776,6 +2790,20 @@  NS_CHECK_EXEC([foo2], [ping -q -c 3 -i 0.3 -w 2 172.16.1.2 | FORMAT_PING], \
 3 packets transmitted, 3 received, 0% packet loss, time 0ms
 ])
 
+# Try to ping external network
+NS_CHECK_EXEC([ext-net], [tcpdump -n -c 3 -i ext-veth dst 172.16.1.3 and icmp > ext-net.pcap &])
+sleep 1
+AT_CHECK([ovn-nbctl lr-nat-del R1 snat])
+NS_CHECK_EXEC([foo1], [ping -q -c 3 -i 0.3 -w 2 10.0.0.1 | FORMAT_PING], \
+[0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_WAIT_UNTIL([
+    total_pkts=$(cat ext-net.pcap | wc -l)
+    test "${total_pkts}" = "3"
+])
+
 # We verify that SNAT indeed happened via 'dump-conntrack' command.
 AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.1.1) | \
 sed -e 's/zone=[[0-9]]*/zone=<cleared>/'], [0], [dnl