[ovs-dev,1/3] northd: Fix broadcast of all traffic within a spine switch.

Message ID 20250212124149.3007842-2-i.maximets@ovn.org
State Accepted
Series Fixes for broadcast behavior in Spine-Leaf topology.

Checks

Context                                  Check    Description
ovsrobot/apply-robot                     success  apply and check: success
ovsrobot/github-robot-_ovn-kubernetes    success  github build: passed
ovsrobot/github-robot-_Build_and_Test    fail     github build: failed
ovsrobot/github-robot-_Build_and_Test    success  github build: passed
ovsrobot/github-robot-_ovn-kubernetes    success  github build: passed

Commit Message

Ilya Maximets Feb. 12, 2025, 12:41 p.m. UTC
Currently, FDB learning is not enabled for the switch-switch ports
connecting switches in the Spine-Leaf topology.  This is causing a
traffic broadcast in the spine switch for every packet.  Even in cases
where it doesn't end up creating extra work in the datapath (since
ovn-controller knows the whole topology), this still creates a lot
of extra work for OpenFlow processing, since we need to evaluate
those rules for every connected switch during upcall processing.
And in cases where leaf switches have ports with unknown addresses,
we may end up unnecessarily broadcasting the actual traffic within
the datapath to those ports.

Fix that by enabling FDB learning for switch ports as it is already
done for other ports with unknown addresses.
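
As an illustration only, the eligibility rule described above can be sketched as a standalone predicate.  The struct and its field names are hypothetical stand-ins, not the real northd data structures, and the sketch assumes that a 'switch' port is identified by its type string and that localnet learning is gated by the localnet_learn_fdb option:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of a logical switch port.  The real
 * northd structures carry far more state than this. */
struct port_sketch {
    const char *type;         /* "" for a VIF, "switch", "localnet", ... */
    size_t n_ps_addrs;        /* Number of port security addresses. */
    bool has_unknown;         /* 'unknown' address (implicit for "switch"). */
    bool localnet_learn_fdb;  /* Only meaningful for "localnet" ports. */
};

/* Returns true if MAC learning flows would be generated for the port. */
bool
port_can_learn_fdb(const struct port_sketch *p)
{
    assert(p && p->type);

    /* Ports with port security, or without 'unknown', never learn. */
    if (p->n_ps_addrs || !p->has_unknown) {
        return false;
    }

    return !strcmp(p->type, "")
           || !strcmp(p->type, "switch")
           || (!strcmp(p->type, "localnet") && p->localnet_learn_fdb);
}
```

With this shape, a 'switch' port without port security is eligible in the same way as a VIF port with an 'unknown' address, which is the behavior change this patch introduces.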

Tests are enhanced to check that FDB is actually working and that
we're not unnecessarily broadcasting traffic.

For the case with interconnect this only partially solves the problem,
since we can't learn from remote ports, and so the packets are still
broadcast to all the zones on the transit spine switch.  At least,
now the traffic will be dropped on the unrelated leaf switches, once
they learn that the actual destination is behind the spine switch from
which the packet just arrived.  Learning from remote ports to stop
the broadcasting will be addressed in the next commits.
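
The leaf-side behavior described above can be pictured with the following sketch.  It is a toy model under stated assumptions: the names are hypothetical, ports are plain integers, and only the FDB part of the destination lookup is modeled (statically configured addresses are ignored):

```c
#include <assert.h>

#define PORT_UNKNOWN (-1)

enum leaf_verdict {
    LEAF_FLOOD_UNKNOWN,  /* Destination not learned: send to 'unknown' ports. */
    LEAF_FORWARD,        /* Destination learned behind a different port. */
    LEAF_DROP            /* Destination is behind the port we received on. */
};

/* Decides what a leaf switch does with a packet that arrived on 'inport',
 * given the port on which the destination MAC was learned, if any. */
enum leaf_verdict
leaf_decide(int inport, int learned_dst_port)
{
    assert(inport >= 0);

    if (learned_dst_port == PORT_UNKNOWN) {
        return LEAF_FLOOD_UNKNOWN;
    }
    if (learned_dst_port == inport) {
        /* The packet came from the spine and its destination is also
         * behind the spine, so there is nothing to deliver locally. */
        return LEAF_DROP;
    }
    return LEAF_FORWARD;
}
```

In the three-zone test below, this corresponds to ls3 delivering the first packet to vif5 while the destination is still unknown, and dropping later packets once both MACs have been learned on ls3-to-spine.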

Having an upcall per switch seems a little excessive, but it should
only happen once per MAC address and should not be a problem after
all the addresses are learned.  Also, with the main use case being
a transit switch, learning will only be triggered for switches local
to the availability zone, which should be a relatively small number.
However, this learning per switch behavior might still be a good
candidate for a future improvement.
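
To make the "once per MAC address" argument concrete, here is a toy per-switch FDB, written as a sketch rather than as a model of the real implementation.  All names are hypothetical, MACs are stored as 64-bit integers, and the learn_events counter merely stands in for the upcalls that learning would trigger:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FDB_MAX 16
#define PORT_NONE (-1)

struct fdb_entry {
    uint64_t mac;
    int port;
};

struct toy_fdb {
    struct fdb_entry e[FDB_MAX];
    size_t n;
    unsigned learn_events;  /* Stand-in for learning upcalls. */
};

/* Returns the port 'mac' was learned on, or PORT_NONE. */
int
toy_fdb_get(const struct toy_fdb *fdb, uint64_t mac)
{
    for (size_t i = 0; i < fdb->n; i++) {
        if (fdb->e[i].mac == mac) {
            return fdb->e[i].port;
        }
    }
    return PORT_NONE;
}

/* Adds 'mac' on 'port', or moves an existing entry to 'port'.
 * Silently does nothing if the toy table is full. */
void
toy_fdb_put(struct toy_fdb *fdb, uint64_t mac, int port)
{
    for (size_t i = 0; i < fdb->n; i++) {
        if (fdb->e[i].mac == mac) {
            fdb->e[i].port = port;
            return;
        }
    }
    if (fdb->n < FDB_MAX) {
        fdb->e[fdb->n].mac = mac;
        fdb->e[fdb->n].port = port;
        fdb->n++;
    }
}

/* Handles one packet: learns the source if the (inport, src) pair is not
 * yet known, then looks up the destination.  Returns true and sets
 * '*outport' for a unicast hit, false if the packet has to be flooded. */
bool
toy_switch_rx(struct toy_fdb *fdb, int inport, uint64_t src, uint64_t dst,
              int *outport)
{
    assert(fdb && outport);

    if (toy_fdb_get(fdb, src) != inport) {
        toy_fdb_put(fdb, src, inport);
        fdb->learn_events++;
    }

    *outport = toy_fdb_get(fdb, dst);
    return *outport != PORT_NONE;
}
```

In this model the first packet in each direction costs one learning event per switch it traverses, and repeated traffic between the same two addresses costs none, which matches the intent described above.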

Fixes: a2db2b2f263a ("northd: Add support for spine-leaf logical switch topology.")
Suggested-by: Numan Siddique <numans@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 northd/northd.c         |   8 +-
 northd/ovn-northd.8.xml |   7 +-
 tests/ovn-ic.at         | 141 +++++++++++++++++++++--
 tests/ovn-northd.at     |  18 +++
 tests/ovn.at            | 244 +++++++++++++++++++++++++++++++++++++---
 5 files changed, 385 insertions(+), 33 deletions(-)

Comments

Ilya Maximets Feb. 12, 2025, 3:30 p.m. UTC | #1

Docker Hub rate-limiting failed the container build.

Recheck-request: github-robot-_Build_and_Test

Patch

diff --git a/northd/northd.c b/northd/northd.c
index 1097bb159..cad974929 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -5759,8 +5759,12 @@  build_lswitch_learn_fdb_op(
 {
     ovs_assert(op->nbsp);
 
-    if (!op->n_ps_addrs && op->has_unknown && (!strcmp(op->nbsp->type, "") ||
-        (lsp_is_localnet(op->nbsp) && localnet_can_learn_mac(op->nbsp)))) {
+    if (op->n_ps_addrs || !op->has_unknown) {
+        return;
+    }
+
+    if (!strcmp(op->nbsp->type, "") || lsp_is_switch(op->nbsp)
+        || (lsp_is_localnet(op->nbsp) && localnet_can_learn_mac(op->nbsp))) {
         ds_clear(match);
         ds_clear(actions);
         ds_put_format(match, "inport == %s", op->json_key);
diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index 93b1a9135..0dc577393 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -407,6 +407,8 @@ 
       port security is disabled and 'unknown' address setn as well as
       for localnet ports with option localnet_learn_fdb. A localnet
       port entry does not overwrite a VIF port entry.
+      Logical switch ports with type <code>switch</code> have implicit
+      'unknown' addresses and so they are also eligible for MAC learning.
     </p>
 
     <ul>
@@ -451,8 +453,9 @@ 
     <h3>Ingress Table 3: Learn MAC of 'unknown' ports.</h3>
 
     <p>
-      This table learns the MAC addresses seen on the VIF logical ports
-      whose port security is disabled and 'unknown' address set as well
+      This table learns the MAC addresses seen on the VIF or 'switch' logical
+      ports whose port security is disabled and 'unknown' address set (note:
+      'switch' ports have implicit 'unknown' addresses) as well
       as on localnet ports with localnet_learn_fdb option set
       if the <code>lookup_fdb</code> action returned false in the
       previous table. For localnet ports (with flags.localnet = 1),
diff --git a/tests/ovn-ic.at b/tests/ovn-ic.at
index 02191f13a..7eecbfc61 100644
--- a/tests/ovn-ic.at
+++ b/tests/ovn-ic.at
@@ -2652,7 +2652,7 @@  AT_CLEANUP
 ])
 
 OVN_FOR_EACH_NORTHD([
-AT_SETUP([spine-leaf: 2 AZs, 2 HVs, 2 LSs, connected via transit spine switch])
+AT_SETUP([spine-leaf: 3 AZs, 3 HVs, 3 LSs, connected via transit spine switch])
 AT_KEYWORDS([spine leaf])
 AT_SKIP_IF([test $HAVE_SCAPY = no])
 
@@ -2660,14 +2660,18 @@  ovn_init_ic_db
 
 ovn_start az1
 ovn_start az2
+ovn_start az3
 
 # Logical network:
-# Single network 172.16.1.0/24.  Two switches with VIF ports on HVs in two
+# Single network 172.16.1.0/24.  Three switches with VIF ports on HVs in three
 # separate AZs, connected to a spine transit switch via 'switch' ports.
+# Third switch has a port with unknown address, but we're only sending packets
+# between ports on other two switches.
 
 ovn-ic-nbctl ts-add spine
 ovn_as az1 check ovn-nbctl ls-add ls1
 ovn_as az2 check ovn-nbctl ls-add ls2
+ovn_as az3 check ovn-nbctl ls-add ls3
 
 # Connect ls1 to spine.
 ovn_as az1
@@ -2683,6 +2687,13 @@  check ovn-nbctl lsp-add ls2 ls2-to-spine
 check ovn-nbctl lsp-set-type spine-to-ls2 switch peer=ls2-to-spine
 check ovn-nbctl lsp-set-type ls2-to-spine switch peer=spine-to-ls2
 
+# Connect ls3 to spine.
+ovn_as az3
+check ovn-nbctl lsp-add spine spine-to-ls3
+check ovn-nbctl lsp-add ls3 ls3-to-spine
+check ovn-nbctl lsp-set-type spine-to-ls3 switch peer=ls3-to-spine
+check ovn-nbctl lsp-set-type ls3-to-spine switch peer=spine-to-ls3
+
 # Create logical port ls1-lp1 in ls1
 ovn_as az1 check ovn-nbctl lsp-add ls1 ls1-lp1 \
 -- lsp-set-addresses ls1-lp1 "f0:00:00:01:02:01 172.16.1.1"
@@ -2697,6 +2708,9 @@  ovn_as az2 check ovn-nbctl lsp-add ls2 ls2-lp1 \
 ovn_as az2 check ovn-nbctl lsp-add ls2 ls2-lp2 \
 -- lsp-set-addresses ls2-lp2 "f0:00:00:01:02:04 172.16.1.4"
 
+# Create logical port ls3-lp1 in ls3 with unknown address.
+check ovn-nbctl lsp-add ls3 ls3-lp1 -- lsp-set-addresses ls3-lp1 unknown
+
 # Create hypervisors and OVS ports corresponding to logical ports.
 net_add n1
 
@@ -2732,9 +2746,21 @@  ovs-vsctl -- add-port br-int vif4 -- \
     options:rxq_pcap=hv2/vif4-rx.pcap \
     ofport-request=4
 
+sim_add hv3
+as hv3
+check ovs-vsctl add-br br-phys
+ovn_az_attach az3 n1 br-phys 192.168.3.1 16
+check ovs-vsctl set open . external-ids:ovn-is-interconn=true
+ovs-vsctl -- add-port br-int vif5 -- \
+    set interface vif5 external-ids:iface-id=ls3-lp1 \
+    options:tx_pcap=hv3/vif5-tx.pcap \
+    options:rxq_pcap=hv3/vif5-rx.pcap \
+    ofport-request=5
+
 # Bind transit switch ports to their chassis.
 check ovn_as az1 ovn-nbctl lsp-set-options spine-to-ls1 requested-chassis=hv1
 check ovn_as az2 ovn-nbctl lsp-set-options spine-to-ls2 requested-chassis=hv2
+check ovn_as az3 ovn-nbctl lsp-set-options spine-to-ls3 requested-chassis=hv3
 
 # Pre-populate the hypervisors' ARP tables so that we don't lose any
 # packets for ARP resolution (native tunneling doesn't queue packets
@@ -2745,11 +2771,14 @@  ovn_as az1
 check ovn-nbctl --wait=hv sync
 ovn-sbctl dump-flows > az1/sbflows
 
-#wait_for_ports_up
 ovn_as az2
 check ovn-nbctl --wait=hv sync
 ovn-sbctl dump-flows > az2/sbflows
 
+ovn_as az3
+check ovn-nbctl --wait=hv sync
+ovn-sbctl dump-flows > az3/sbflows
+
 check ovn-ic-nbctl --wait=sb sync
 
 ovn-ic-nbctl show > ic-nbctl.dump
@@ -2776,27 +2805,71 @@  dst_ip=172.16.1.3
 packet=$(fmt_pkt "Ether(dst='${dst_mac}', src='${src_mac}')/ \
                   IP(src='${src_ip}', dst='${dst_ip}')/ \
                   UDP(sport=1538, dport=4369)")
-check as hv1 ovs-appctl netdev-dummy/receive vif1 $packet
 
 # Check that datapath is not doing any extra work and sends the packet out
-# through the tunnel.
+# through the tunnels.  We expect the packet to enter the spine switch, be
+# sent to userspace for FDB learning, then get broadcasted to other two
+# zones via remote ports.
 AT_CHECK([ovn_as az1 as hv1 ovs-appctl ofproto/trace --names \
                 br-int in_port=vif1 $packet > ofproto-trace-1])
 AT_CAPTURE_FILE([ofproto-trace-1])
 AT_CHECK([grep 'Megaflow:' ofproto-trace-1], [0], [dnl
 Megaflow: recirc_id=0,eth,ip,in_port=vif1,dl_src=f0:00:00:01:02:01,dl_dst=f0:00:00:01:02:03,nw_ecn=0,nw_frag=no
 ])
-AT_CHECK([grep -q \
-  'Datapath actions: tnl_push(tnl_port(genev_sys_6081).*out_port(br-phys))' \
-  ofproto-trace-1])
+AT_CHECK([cat ofproto-trace-1 | tail -1 \
+            | grep -oE 'tnl_push|userspace|clone|br-phys_n1|vif[[0-9]]'], [0], [dnl
+userspace
+clone
+tnl_push
+br-phys_n1
+tnl_push
+br-phys_n1
+])
+
+# Actually send the packet.
+check as hv1 ovs-appctl netdev-dummy/receive vif1 $packet
+
+# Wait for the FDB entries to be created and propagated to OpenFlow.  Only
+# one entry will be created - the entry for spine-to-ls1 port in the spine
+# switch.
+ovn_as az1
+wait_row_count FDB 1
+check ovn-nbctl --wait=hv sync
 
-# It's a little problematic to trace the other side, but we can check
-# datapath actions.
+# Only one entry is expected in the other zones as well - the entry for
+# the ls[23]-to-spine port in ls[23] switches.  Technically, we also need
+# an entry for a remote spine-to-ls1 port, but learning from remote ports
+# is not implemented yet.
+ovn_as az2
+wait_row_count FDB 1
+check ovn-nbctl --wait=hv sync
+ovn_as az3
+wait_row_count FDB 1
+check ovn-nbctl --wait=hv sync
+
+# FDB entry was created from the userspace() action in the datapath, but
+# those actions will be updated to not have it shortly, so just wait for
+# that to happen and don't try to catch the action while it's still in the
+# datapath.
+as hv2 ovs-appctl revalidator/wait
+as hv3 ovs-appctl revalidator/wait
+
+# It's a little problematic to trace the other side, but we can check datapath
+# actions.  Note: 'actions:br-phys' is a stray tunnel packet destined for the
+# other zone, but OVS from the 'main' namespace didn't learn addresses yet,
+# so it broadcasts.
 AT_CHECK([as hv2 ovs-appctl dpctl/dump-flows --names \
             | grep actions | sed 's/.*\(actions:.*\)/\1/' | sort], [0], [dnl
+actions:br-phys
 actions:tnl_pop(genev_sys_6081)
 actions:vif3
 ])
+AT_CHECK([as hv3 ovs-appctl dpctl/dump-flows --names \
+            | grep actions | sed 's/.*\(actions:.*\)/\1/' | sort], [0], [dnl
+actions:br-phys
+actions:tnl_pop(genev_sys_6081)
+actions:vif5
+])
 
 # No modifications expected.
 AT_CHECK([echo $packet > expected])
@@ -2804,10 +2877,56 @@  AT_CHECK([echo $packet > expected])
 AT_CHECK([touch empty])
 
 # Check that it is delivered where needed and not delivered where not.
+OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv1/vif2-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv2/vif3-tx.pcap], [expected])
+OVN_CHECK_PACKETS([hv2/vif4-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv3/vif5-tx.pcap], [expected])
+
+# Trace a reply packet.
+reply=$(fmt_pkt "Ether(dst='${src_mac}', src='${dst_mac}')/ \
+                 IP(src='${dst_ip}', dst='${src_ip}')/ \
+                 UDP(sport=4369, dport=1538)")
+# Reply packet is still learned and broadcasted in the spine switch, because
+# learning from remote ports is not implemented, so we don't know where the
+# vif1 is located, even though we received some traffic from it.
+AT_CHECK([ovn_as az2 as hv2 ovs-appctl ofproto/trace --names \
+                br-int in_port=vif3 $reply > ofproto-trace-2])
+AT_CAPTURE_FILE([ofproto-trace-2])
+AT_CHECK([grep 'Megaflow:' ofproto-trace-2], [0], [dnl
+Megaflow: recirc_id=0,eth,ip,in_port=vif3,dl_src=f0:00:00:01:02:03,dl_dst=f0:00:00:01:02:01,nw_ecn=0,nw_frag=no
+])
+AT_CHECK([cat ofproto-trace-2 | tail -1 \
+            | grep -oE 'tnl_push|userspace|clone|br-phys_n1|vif[[0-9]]'], [0], [dnl
+userspace
+clone
+tnl_push
+br-phys_n1
+tnl_push
+br-phys_n1
+])
+
+# Now actually send it.
+check as hv2 ovs-appctl netdev-dummy/receive vif3 $reply
+
+# Zones 1 and 2 should have 2 FDB entries now each.  One per side of a
+# switch-switch port connecting ls[12] with the spine.  Zone 3 only has two
+# entries on ls3 for traffic broadcasted in the spine from both vif1 and vif3.
+ovn_as az1 wait_row_count FDB 2
+ovn_as az2 wait_row_count FDB 2
+ovn_as az3 wait_row_count FDB 2
+
+AT_CHECK([echo $reply > reply])
+# Check that it is delivered where needed and not delivered where not.
+# While the traffic is broadcasted within the spine and arrives in zone 3, the
+# packets must be dropped, because ls3 learned that their destination addresses
+# are behind the spine switch, so no new packets should be seen on vif5.
+OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [reply])
 OVN_CHECK_PACKETS([hv1/vif2-tx.pcap], [empty])
 OVN_CHECK_PACKETS([hv2/vif3-tx.pcap], [expected])
 OVN_CHECK_PACKETS([hv2/vif4-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv3/vif5-tx.pcap], [expected])
 
-OVN_CLEANUP_IC([az1], [az2])
+OVN_CLEANUP_IC([az1], [az2], [az3])
 AT_CLEANUP
 ])
diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
index 64991ff75..2d3d21d91 100644
--- a/tests/ovn-northd.at
+++ b/tests/ovn-northd.at
@@ -7448,6 +7448,12 @@  AT_CHECK([grep -E "ls_in_l2_lkup.*S1-|unknown" S1flows | ovn_strip_lflows], [0],
   table=??(ls_in_l2_unknown   ), priority=50   , match=(outport == "none"), action=(outport = "_MC_unknown"; output;)
 ])
 
+dnl Check that FDB learning is enabled for the switch port.
+AT_CHECK([grep -E "ls_.*fdb.*S1-" S1flows | ovn_strip_lflows], [0], [dnl
+  table=??(ls_in_lookup_fdb   ), priority=100  , match=(inport == "S1-S2"), action=(reg0[[11]] = lookup_fdb(inport, eth.src); next;)
+  table=??(ls_in_put_fdb      ), priority=100  , match=(inport == "S1-S2" && reg0[[11]] == 0), action=(put_fdb(inport, eth.src); next;)
+])
+
 ovn-sbctl dump-flows S2 > S2flows
 AT_CAPTURE_FILE([S2flows])
 
@@ -7459,6 +7465,12 @@  AT_CHECK([grep -E "ls_in_l2_lkup.*S2-|unknown" S2flows | ovn_strip_lflows], [0],
   table=??(ls_in_l2_unknown   ), priority=50   , match=(outport == "none"), action=(outport = "_MC_unknown"; output;)
 ])
 
+dnl Check that FDB learning is enabled for the switch port.
+AT_CHECK([grep -E "ls_.*fdb.*S2-" S2flows | ovn_strip_lflows], [0], [dnl
+  table=??(ls_in_lookup_fdb   ), priority=100  , match=(inport == "S2-S1"), action=(reg0[[11]] = lookup_fdb(inport, eth.src); next;)
+  table=??(ls_in_put_fdb      ), priority=100  , match=(inport == "S2-S1" && reg0[[11]] == 0), action=(put_fdb(inport, eth.src); next;)
+])
+
 dnl Add an explicit address to S1-S2 indicating that the port with
 dnl address of S2-vm is behind it.
 check ovn-nbctl --wait=sb lsp-set-addresses S1-S2 "50:54:00:00:00:02 192.168.0.2"
@@ -7474,6 +7486,12 @@  AT_CHECK([grep -E "ls_in_l2_lkup.*S1-|unknown" S1flows2 | ovn_strip_lflows], [0]
   table=??(ls_in_l2_unknown   ), priority=50   , match=(outport == "none"), action=(outport = "_MC_unknown"; output;)
 ])
 
+dnl Check that FDB learning is still enabled for the switch port.
+AT_CHECK([grep -E "ls_.*fdb.*S1-" S1flows | ovn_strip_lflows], [0], [dnl
+  table=??(ls_in_lookup_fdb   ), priority=100  , match=(inport == "S1-S2"), action=(reg0[[11]] = lookup_fdb(inport, eth.src); next;)
+  table=??(ls_in_put_fdb      ), priority=100  , match=(inport == "S1-S2" && reg0[[11]] == 0), action=(put_fdb(inport, eth.src); next;)
+])
+
 AT_CLEANUP
 ])
 
diff --git a/tests/ovn.at b/tests/ovn.at
index d105ed253..dd4bd0e4a 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -7915,19 +7915,22 @@  AT_CLEANUP
 ])
 
 OVN_FOR_EACH_NORTHD([
-AT_SETUP([spine-leaf: 1 HV, 2 LSs, connected via spine switch])
+AT_SETUP([spine-leaf: 1 HV, 3 LSs, connected via spine switch])
 AT_KEYWORDS([spine leaf])
 AT_SKIP_IF([test $HAVE_SCAPY = no])
 ovn_start
 
 # Logical network:
-# Single network 191.168.1.0/24.  Two switches with VIF ports, connected
-# to a spine logical switch via 'switch' ports.
+# Single network 172.16.1.0/24.  Three switches with VIF ports, connected
+# to a spine logical switch via 'switch' ports.  Third switch has a port
+# with unknown address, but we're only sending packets between ports on other
+# two switches.
 
 check ovn-nbctl ls-add spine
 
 check ovn-nbctl ls-add ls1
 check ovn-nbctl ls-add ls2
+check ovn-nbctl ls-add ls3
 
 # Connect ls1 to spine.
 check ovn-nbctl lsp-add spine spine-to-ls1
@@ -7941,6 +7944,12 @@  check ovn-nbctl lsp-add ls2 ls2-to-spine
 check ovn-nbctl lsp-set-type spine-to-ls2 switch peer=ls2-to-spine
 check ovn-nbctl lsp-set-type ls2-to-spine switch peer=spine-to-ls2
 
+# Connect ls3 to spine.
+check ovn-nbctl lsp-add spine spine-to-ls3
+check ovn-nbctl lsp-add ls3 ls3-to-spine
+check ovn-nbctl lsp-set-type spine-to-ls3 switch peer=ls3-to-spine
+check ovn-nbctl lsp-set-type ls3-to-spine switch peer=spine-to-ls3
+
 # Create logical port ls1-lp1 in ls1
 check ovn-nbctl lsp-add ls1 ls1-lp1 \
 -- lsp-set-addresses ls1-lp1 "f0:00:00:01:02:01 172.16.1.1"
@@ -7955,6 +7964,9 @@  check ovn-nbctl lsp-add ls2 ls2-lp1 \
 check ovn-nbctl lsp-add ls2 ls2-lp2 \
 -- lsp-set-addresses ls2-lp2 "f0:00:00:01:02:04 172.16.1.4"
 
+# Create logical port ls3-lp1 in ls3 with unknown address.
+check ovn-nbctl lsp-add ls3 ls3-lp1 -- lsp-set-addresses ls3-lp1 unknown
+
 # Create one hypervisor and create OVS ports corresponding to logical ports.
 net_add n1
 
@@ -7984,6 +7996,12 @@  ovs-vsctl -- add-port br-int vif4 -- \
     options:rxq_pcap=hv1/vif4-rx.pcap \
     ofport-request=4
 
+ovs-vsctl -- add-port br-int vif5 -- \
+    set interface vif5 external-ids:iface-id=ls3-lp1 \
+    options:tx_pcap=hv1/vif5-tx.pcap \
+    options:rxq_pcap=hv1/vif5-rx.pcap \
+    ofport-request=5
+
 wait_for_ports_up
 check ovn-nbctl --wait=hv sync
 
@@ -7998,43 +8016,118 @@  dst_ip=172.16.1.3
 packet=$(fmt_pkt "Ether(dst='${dst_mac}', src='${src_mac}')/ \
                   IP(src='${src_ip}', dst='${dst_ip}')/ \
                   UDP(sport=1538, dport=4369)")
-check as hv1 ovs-appctl netdev-dummy/receive vif1 $packet
 
-# Check that datapath is not doing any extra work.
-AT_CHECK([as hv1 ovs-appctl ofproto/trace --names \
-                br-int in_port=vif1 $packet | tail -2], [0], [dnl
+# Check that datapath is not doing any extra work.  Since FDB is not updated
+# yet, we expect the packet to be sent to userspace three times: one while
+# entering the spine switch, then it gets broadcasted to two other leaf
+# switches and sent to userspace from there.  We also expect a clone of the
+# packet to end up in vif5 due to aforementioned broadcast.  Need to sort the
+# output, since the order of ports may change.
+AT_CHECK([as hv1 ovs-appctl ofproto/trace --names br-int in_port=vif1 $packet \
+            | tail -2 | grep -oE 'Megaflow.*|userspace|vif[[0-9]]' | sort], [0], [dnl
 Megaflow: recirc_id=0,eth,ip,in_port=vif1,dl_src=f0:00:00:01:02:01,dl_dst=f0:00:00:01:02:03,nw_frag=no
-Datapath actions: vif3
+userspace
+userspace
+userspace
+vif3
+vif5
 ])
 
+# Actually send the packet.
+check as hv1 ovs-appctl netdev-dummy/receive vif1 $packet
+
+# Wait for the FDB entries to be created and propagated to OpenFlow.
+wait_row_count FDB 3
+check ovn-nbctl --wait=hv sync
+
 # No modifications expected.
 AT_CHECK([echo $packet > expected])
 
 AT_CHECK([touch empty])
 
 # Check that it is delivered where needed and not delivered where not.
+OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [empty])
 OVN_CHECK_PACKETS([hv1/vif2-tx.pcap], [empty])
 OVN_CHECK_PACKETS([hv1/vif3-tx.pcap], [expected])
 OVN_CHECK_PACKETS([hv1/vif4-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv1/vif5-tx.pcap], [expected])
+
+# Trace a reply packet.
+reply=$(fmt_pkt "Ether(dst='${src_mac}', src='${dst_mac}')/ \
+                 IP(src='${dst_ip}', dst='${src_ip}')/ \
+                 UDP(sport=4369, dport=1538)")
+# For the reply packet we expect only two userspace actions for FDB update,
+# because we already learned that MAC of vif1 is behind spine-ls1 and no
+# longer need to broadcast to ls3/vif5.
+AT_CHECK([as hv1 ovs-appctl ofproto/trace --names \
+            br-int in_port=vif3 $reply | tail -2 \
+            | grep -oE 'Megaflow.*|userspace|vif[[0-9]]'], [0], [dnl
+Megaflow: recirc_id=0,eth,ip,in_port=vif3,dl_src=f0:00:00:01:02:03,dl_dst=f0:00:00:01:02:01,nw_frag=no
+userspace
+userspace
+vif1
+])
+# Now actually send it.
+check as hv1 ovs-appctl netdev-dummy/receive vif3 $reply
+
+AT_CHECK([echo $reply > reply])
+# Check that it is delivered where needed and not delivered where not.
+OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [reply])
+OVN_CHECK_PACKETS([hv1/vif2-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv1/vif3-tx.pcap], [expected])
+OVN_CHECK_PACKETS([hv1/vif4-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv1/vif5-tx.pcap], [expected])
+
+# Wait for the FDB entries to be created and propagated to OpenFlow.
+wait_row_count FDB 5
+check ovn-nbctl --wait=hv sync
+
+# Packets should flow directly to the destination now in both directions.
+AT_CHECK([as hv1 ovs-appctl ofproto/trace --names \
+            br-int in_port=vif1 $packet | tail -2], [0], [dnl
+Megaflow: recirc_id=0,eth,ip,in_port=vif1,dl_src=f0:00:00:01:02:01,dl_dst=f0:00:00:01:02:03,nw_frag=no
+Datapath actions: vif3
+])
+AT_CHECK([as hv1 ovs-appctl ofproto/trace --names \
+            br-int in_port=vif3 $reply | tail -2], [0], [dnl
+Megaflow: recirc_id=0,eth,ip,in_port=vif3,dl_src=f0:00:00:01:02:03,dl_dst=f0:00:00:01:02:01,nw_frag=no
+Datapath actions: vif1
+])
+
+# Send and check one more time.
+check as hv1 ovs-appctl netdev-dummy/receive vif1 $packet
+check as hv1 ovs-appctl netdev-dummy/receive vif3 $reply
+
+AT_CHECK([cp expected expected-vif5])
+AT_CHECK([echo $packet >> expected])
+AT_CHECK([echo $reply >> reply])
+OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [reply])
+OVN_CHECK_PACKETS([hv1/vif2-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv1/vif3-tx.pcap], [expected])
+OVN_CHECK_PACKETS([hv1/vif4-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv1/vif5-tx.pcap], [expected-vif5])
 
 OVN_CLEANUP([hv1])
 AT_CLEANUP
 ])
 
 OVN_FOR_EACH_NORTHD([
-AT_SETUP([spine-leaf: 2 HVs, 2 LSs, connected via distributed spine switch])
+AT_SETUP([spine-leaf: 3 HVs, 3 LSs, connected via distributed spine switch])
 AT_KEYWORDS([spine leaf])
 AT_SKIP_IF([test $HAVE_SCAPY = no])
 ovn_start
 
 # Logical network:
-# Single network 172.16.1.0/24.  Two switches with VIF ports on two HVs,
+# Single network 172.16.1.0/24.  Three switches with VIF ports on three HVs,
 # connected to a spine distributed logical switch via 'switch' ports.
+# Third switch has a port with unknown address, but we're only sending packets
+# between ports on other two switches.
 
 check ovn-nbctl ls-add spine
 
 check ovn-nbctl ls-add ls1
 check ovn-nbctl ls-add ls2
+check ovn-nbctl ls-add ls3
 
 # Connect ls1 to spine.
 check ovn-nbctl lsp-add spine spine-to-ls1
@@ -8048,6 +8141,12 @@  check ovn-nbctl lsp-add ls2 ls2-to-spine
 check ovn-nbctl lsp-set-type spine-to-ls2 switch peer=ls2-to-spine
 check ovn-nbctl lsp-set-type ls2-to-spine switch peer=spine-to-ls2
 
+# Connect ls3 to spine.
+check ovn-nbctl lsp-add spine spine-to-ls3
+check ovn-nbctl lsp-add ls3 ls3-to-spine
+check ovn-nbctl lsp-set-type spine-to-ls3 switch peer=ls3-to-spine
+check ovn-nbctl lsp-set-type ls3-to-spine switch peer=spine-to-ls3
+
 # Create logical port ls1-lp1 in ls1
 check ovn-nbctl lsp-add ls1 ls1-lp1 \
 -- lsp-set-addresses ls1-lp1 "f0:00:00:01:02:01 172.16.1.1"
@@ -8062,6 +8161,9 @@  check ovn-nbctl lsp-add ls2 ls2-lp1 \
 check ovn-nbctl lsp-add ls2 ls2-lp2 \
 -- lsp-set-addresses ls2-lp2 "f0:00:00:01:02:04 172.16.1.4"
 
+# Create logical port ls3-lp1 in ls3 with unknown address.
+check ovn-nbctl lsp-add ls3 ls3-lp1 -- lsp-set-addresses ls3-lp1 unknown
+
 # Create hypervisors and OVS ports corresponding to logical ports.
 net_add n1
 
@@ -8095,6 +8197,16 @@  ovs-vsctl -- add-port br-int vif4 -- \
     options:rxq_pcap=hv2/vif4-rx.pcap \
     ofport-request=4
 
+sim_add hv3
+as hv3
+ovs-vsctl add-br br-phys
+ovn_attach n1 br-phys 192.168.0.3
+ovs-vsctl -- add-port br-int vif5 -- \
+    set interface vif5 external-ids:iface-id=ls3-lp1 \
+    options:tx_pcap=hv3/vif5-tx.pcap \
+    options:rxq_pcap=hv3/vif5-rx.pcap \
+    ofport-request=5
+
 OVN_POPULATE_ARP
 
 wait_for_ports_up
@@ -8111,27 +8223,55 @@  dst_ip=172.16.1.3
 packet=$(fmt_pkt "Ether(dst='${dst_mac}', src='${src_mac}')/ \
                   IP(src='${src_ip}', dst='${dst_ip}')/ \
                   UDP(sport=1538, dport=4369)")
-check as hv1 ovs-appctl netdev-dummy/receive vif1 $packet
 
 # Check that datapath is not doing any extra work and sends the packet out
-# through the tunnel.
+# through the tunnels.  We expect the packet to enter the spine switch, be
+# sent to userspace for FDB learning, then get broadcasted to other two
+# switches.  In which of those it will be sent to userspace for FDB learning
+# as well and then sent out through the appropriate tunnel to the destination.
+# It's sent to vif3, because it is a destination and it's also sent to vif5
+# due to aforementioned broadcast and the unknown address on vif5.
 AT_CHECK([as hv1 ovs-appctl ofproto/trace --names \
                 br-int in_port=vif1 $packet > ofproto-trace-1])
 AT_CAPTURE_FILE([ofproto-trace-1])
 AT_CHECK([grep 'Megaflow:' ofproto-trace-1], [0], [dnl
 Megaflow: recirc_id=0,eth,ip,in_port=vif1,dl_src=f0:00:00:01:02:01,dl_dst=f0:00:00:01:02:03,nw_ecn=0,nw_frag=no
 ])
-AT_CHECK([grep -q \
-  'Datapath actions: tnl_push(tnl_port(genev_sys_6081).*out_port(br-phys))' \
-  ofproto-trace-1])
+AT_CHECK([cat ofproto-trace-1 | tail -1 \
+            | grep -oE 'tnl_push|userspace|clone|br-phys_n1|vif[[0-9]]'], [0], [dnl
+userspace
+userspace
+clone
+tnl_push
+br-phys_n1
+userspace
+tnl_push
+br-phys_n1
+])
+
+# Actually send the packet.
+check as hv1 ovs-appctl netdev-dummy/receive vif1 $packet
+
+# Wait for the FDB entries to be created and propagated to OpenFlow.
+wait_row_count FDB 3
+check ovn-nbctl --wait=hv sync
 
-# It's a little problematic to trace the other side, but we can check
-# datapath actions.
+# It's a little problematic to trace the other side, but we can check datapath
+# actions.  Note: 'actions:br-phys' is a stray tunnel packet destined for the
+# other node, but OVS from the 'main' namespace didn't learn addresses yet,
+# so it broadcasts.
 AT_CHECK([as hv2 ovs-appctl dpctl/dump-flows --names \
             | grep actions | sed 's/.*\(actions:.*\)/\1/' | sort], [0], [dnl
+actions:br-phys
 actions:tnl_pop(genev_sys_6081)
 actions:vif3
 ])
+AT_CHECK([as hv3 ovs-appctl dpctl/dump-flows --names \
+            | grep actions | sed 's/.*\(actions:.*\)/\1/' | sort], [0], [dnl
+actions:br-phys
+actions:tnl_pop(genev_sys_6081)
+actions:vif5
+])
 
 # No modifications expected.
 AT_CHECK([echo $packet > expected])
@@ -8139,11 +8279,79 @@  AT_CHECK([echo $packet > expected])
 AT_CHECK([touch empty])
 
 # Check that it is delivered where needed and not delivered where not.
+OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv1/vif2-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv2/vif3-tx.pcap], [expected])
+OVN_CHECK_PACKETS([hv2/vif4-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv3/vif5-tx.pcap], [expected])
+
+# Trace a reply packet.
+reply=$(fmt_pkt "Ether(dst='${src_mac}', src='${dst_mac}')/ \
+                 IP(src='${dst_ip}', dst='${src_ip}')/ \
+                 UDP(sport=4369, dport=1538)")
+# For the reply packet we expect only two userspace actions for FDB update
+# and only one tunnel push and send, because we already learned that MAC of
+# vif1 is behind spine-ls1 and no longer need to broadcast to ls3/vif5.
+AT_CHECK([as hv2 ovs-appctl ofproto/trace --names \
+                br-int in_port=vif3 $reply > ofproto-trace-2])
+AT_CAPTURE_FILE([ofproto-trace-2])
+AT_CHECK([grep 'Megaflow:' ofproto-trace-2], [0], [dnl
+Megaflow: recirc_id=0,eth,ip,in_port=vif3,dl_src=f0:00:00:01:02:03,dl_dst=f0:00:00:01:02:01,nw_ecn=0,nw_frag=no
+])
+AT_CHECK([cat ofproto-trace-2 | tail -1 \
+            | grep -oE 'tnl_push|userspace|clone|br-phys_n1|vif[[0-9]]'], [0], [dnl
+userspace
+userspace
+tnl_push
+br-phys_n1
+])
+
+# Now actually send it.
+check as hv2 ovs-appctl netdev-dummy/receive vif3 $reply
+
+AT_CHECK([echo $reply > reply])
+# Check that it is delivered where needed and not delivered where not.
+OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [reply])
 OVN_CHECK_PACKETS([hv1/vif2-tx.pcap], [empty])
 OVN_CHECK_PACKETS([hv2/vif3-tx.pcap], [expected])
 OVN_CHECK_PACKETS([hv2/vif4-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv3/vif5-tx.pcap], [expected])
 
-OVN_CLEANUP([hv1], [hv2])
+# Wait for the FDB entries to be created and propagated to OpenFlow.
+wait_row_count FDB 5
+check ovn-nbctl --wait=hv sync
+
+# Packets should flow directly to the destination (via tunnels) in both
+# directions now.
+AT_CHECK([as hv1 ovs-appctl ofproto/trace --names \
+            br-int in_port=vif1 $packet | tail -2 \
+            | grep -oE 'Megaflow.*|tnl_push|userspace|clone|br-phys_n1|vif[[0-9]]'], [0], [dnl
+Megaflow: recirc_id=0,eth,ip,in_port=vif1,dl_src=f0:00:00:01:02:01,dl_dst=f0:00:00:01:02:03,nw_ecn=0,nw_frag=no
+tnl_push
+br-phys_n1
+])
+AT_CHECK([as hv2 ovs-appctl ofproto/trace --names \
+            br-int in_port=vif3 $reply | tail -2 \
+            | grep -oE 'Megaflow.*|tnl_push|userspace|clone|br-phys_n1|vif[[0-9]]'], [0], [dnl
+Megaflow: recirc_id=0,eth,ip,in_port=vif3,dl_src=f0:00:00:01:02:03,dl_dst=f0:00:00:01:02:01,nw_ecn=0,nw_frag=no
+tnl_push
+br-phys_n1
+])
+
+# Send and check one more time.
+check as hv1 ovs-appctl netdev-dummy/receive vif1 $packet
+check as hv2 ovs-appctl netdev-dummy/receive vif3 $reply
+
+AT_CHECK([cp expected expected-vif5])
+AT_CHECK([echo $packet >> expected])
+AT_CHECK([echo $reply >> reply])
+OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [reply])
+OVN_CHECK_PACKETS([hv1/vif2-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv2/vif3-tx.pcap], [expected])
+OVN_CHECK_PACKETS([hv2/vif4-tx.pcap], [empty])
+OVN_CHECK_PACKETS([hv3/vif5-tx.pcap], [expected-vif5])
+
+OVN_CLEANUP([hv1], [hv2], [hv3])
 AT_CLEANUP
 ])