[ovs-dev,v7,ovn,2/2] ovn-northd: Limit ARP/ND broadcast domain whenever possible.

Message ID 20191112102842.32061.37256.stgit@dceara.remote.csb
State Accepted
Series [ovs-dev,v7,ovn,1/2] ovn-northd: Fix get_router_load_balancer_ips() for mixed address families.

Commit Message

Dumitru Ceara Nov. 12, 2019, 10:28 a.m. UTC
ARP request and ND NS packets for router owned IPs were being
flooded in the complete L2 domain (using the MC_FLOOD multicast group).
However, this creates a scaling issue in scenarios where aggregation
logical switches are connected to many logical routers (~350). The
logical pipelines of all routers would have to be executed before the
packet is finally replied to by a single router, the owner of the IP
address.

This commit limits the broadcast domain by bypassing the L2 Lookup stage
for ARP requests that will be answered by a single router. The packets
are forwarded only to the router port that owns the target IP address.

IPs that are owned by the routers and for which this fix applies are:
- IP addresses configured on the router ports.
- VIPs.
- NAT IPs.

Reported-at: https://bugzilla.redhat.com/1756945
Reported-by: Anil Venkata <vkommadi@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>

---
v7:
- Address Han's comments:
    - Remove flooding for all ARPs received on VLAN networks. To avoid
      that we now identify self originated (G)ARPs by matching on source
      MAC address too.
    - Rename REGBIT_NOT_VXLAN to FLAGBIT_NOT_VXLAN.
- Fix ovn-sb manpage.
- Split patch in a series of 2:
    - patch1: fixes the get_router_load_balancer_ips() function.
    - patch2: limits the ARP/ND broadcast domain.
v6:
- Address Han's comments:
    - remove flooding of ARPs targeting OVN owned IP addresses.
    - update ovn-architecture documentation.
    - rename ARP handling functions.
    - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
    account the new way of forwarding ARPs.
- Also, properly deal with ARP packets on VLAN-backed networks.
v5: Address Numan's comments: update comments & make autotest more
    robust.
v4: Rebase.
v3: Properly deal with VXLAN traffic. Address review comments from
    Numan (add autotests). Fix function get_router_load_balancer_ips.
    Rebase -> deal with IPv6 NAT too.
v2: Move ARP broadcast domain limiting to table S_SWITCH_IN_L2_LKUP to
address localnet ports too.
---
 northd/ovn-northd.8.xml |   14 ++
 northd/ovn-northd.c     |  230 +++++++++++++++++++++++++++++++----
 ovn-architecture.7.xml  |   19 +++
 tests/ovn.at            |  307 +++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 530 insertions(+), 40 deletions(-)
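
To make the flow layout described in the commit message concrete, here is a
minimal standalone sketch of how the priority-75 match could be assembled. It
is not the patch's code: it uses plain snprintf() instead of the OVS
"struct ds" helpers, build_arp_nd_match() is a hypothetical name, and unlike
the patch it refuses an empty address list rather than emitting an empty set.

```c
#include <stdio.h>
#include <string.h>

/* Same text as the patch's FLAGBIT_NOT_VXLAN macro, repeated here so the
 * sketch is self-contained. */
#define FLAGBIT_NOT_VXLAN "flags[1] == 0"

/* Hypothetical helper: writes into 'buf' a logical flow match for ARP
 * requests (is_v4 != 0) or ND NS packets (is_v4 == 0) that target any of
 * the 'n' addresses in 'ips'.
 * Returns 0 on success, -1 if 'n' is 0 or 'buf' is too small. */
static int
build_arp_nd_match(char *buf, size_t size, int is_v4,
                   const char *const *ips, size_t n)
{
    if (!n) {
        return -1;              /* Nothing to match on. */
    }

    int len = snprintf(buf, size, "%s && %s == { ", FLAGBIT_NOT_VXLAN,
                       is_v4 ? "arp.op == 1 && arp.tpa"
                             : "nd_ns && nd.target");
    if (len < 0 || (size_t) len >= size) {
        return -1;
    }
    size_t used = (size_t) len;

    for (size_t i = 0; i < n; i++) {
        /* Separate entries with ", " so no trailing comma must be chomped. */
        len = snprintf(buf + used, size - used, "%s%s",
                       i ? ", " : "", ips[i]);
        if (len < 0 || (size_t) len >= size - used) {
            return -1;
        }
        used += (size_t) len;
    }

    len = snprintf(buf + used, size - used, " }");
    if (len < 0 || (size_t) len >= size - used) {
        return -1;
    }
    return 0;
}
```

Called with is_v4 = 1 and the addresses 10.0.0.1 and 10.0.0.11, the helper
fills the buffer with
"flags[1] == 0 && arp.op == 1 && arp.tpa == { 10.0.0.1, 10.0.0.11 }". A flow
with such a match at priority 75 sits above the priority-70 broadcast flow, so
matching requests never reach MC_FLOOD.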

Comments

Han Zhou Nov. 12, 2019, 5:16 p.m. UTC | #1
On Tue, Nov 12, 2019 at 2:29 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> index 0a33dcd..344cc0d 100644
> --- a/northd/ovn-northd.8.xml
> +++ b/northd/ovn-northd.8.xml
> @@ -1005,6 +1005,20 @@ output;
>        </li>
>
>        <li>
> +        Priority-80 flows for each port connected to a logical router
> +        matching self originated GARP/ARP request/ND packets. These packets
> +        are flooded to the <code>MC_FLOOD</code> which contains all logical
> +        ports.
> +      </li>
> +
> +      <li>
> +        Priority-75 flows for each IP address/VIP/NAT address owned by a
> +        router port connected to the switch. These flows match ARP requests
> +        and ND packets for the specific IP addresses.  Matched packets are
> +        forwarded only to the router that owns the IP address.
> +      </li>
> +
> +      <li>
>          A priority-70 flow that outputs all packets with an Ethernet broadcast
>          or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
>          multicast group.
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index 32f3200..d6beb97 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -210,6 +210,8 @@ enum ovn_stage {
>  #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
>  #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
>
> +#define FLAGBIT_NOT_VXLAN "flags[1] == 0"
> +
>  /* Returns an "enum ovn_stage" built from the arguments. */
>  static enum ovn_stage
>  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
> @@ -1202,6 +1204,34 @@ ovn_port_allocate_key(struct ovn_datapath *od)
>                            1, (1u << 15) - 1, &od->port_key_hint);
>  }
>
> +/* Returns true if the logical switch port 'enabled' column is empty or
> + * set to true.  Otherwise, returns false. */
> +static bool
> +lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> +{
> +    return !lsp->n_enabled || *lsp->enabled;
> +}
> +
> +/* Returns true only if the logical switch port 'up' column is set to true.
> + * Otherwise, if the column is not set or set to false, returns false. */
> +static bool
> +lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> +{
> +    return lsp->n_up && *lsp->up;
> +}
> +
> +static bool
> +lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> +{
> +    return !strcmp(nbsp->type, "external");
> +}
> +
> +static bool
> +lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> +{
> +    return !lrport->enabled || *lrport->enabled;
> +}
> +
>  static char *
>  chassis_redirect_name(const char *port_name)
>  {
> @@ -3750,28 +3780,6 @@ build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
>
>  }
>
> -/* Returns true if the logical switch port 'enabled' column is empty or
> - * set to true.  Otherwise, returns false. */
> -static bool
> -lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> -{
> -    return !lsp->n_enabled || *lsp->enabled;
> -}
> -
> -/* Returns true only if the logical switch port 'up' column is set to true.
> - * Otherwise, if the column is not set or set to false, returns false. */
> -static bool
> -lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> -{
> -    return lsp->n_up && *lsp->up;
> -}
> -
> -static bool
> -lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> -{
> -    return !strcmp(nbsp->type, "external");
> -}
> -
>  static bool
>  build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
>                      struct ds *options_action, struct ds *response_action,
> @@ -5174,6 +5182,170 @@ build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
>      }
>  }
>
> +/*
> + * Ingress table 17: Flows that flood self originated ARP/ND packets in the
> + * switching domain.
> + */
> +static void
> +build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,
> +                                           uint32_t priority,
> +                                           struct ovn_datapath *od,
> +                                           struct hmap *lflows)
> +{
> +    struct ds match = DS_EMPTY_INITIALIZER;
> +    struct ds eth_src = DS_EMPTY_INITIALIZER;
> +
> +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> +     * Determine that packets are self originated by also matching on
> +     * source MAC. Matching on ingress port is not reliable in case this
> +     * is a VLAN-backed network.
> +     * Priority: 80.
> +     */
> +    ds_put_format(&eth_src, "{ %s, ", op->lrp_networks.ea_s);
> +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> +
> +        if (!nat->external_mac) {
> +            continue;
> +        }
> +
> +        ds_put_format(&eth_src, "%s, ", nat->external_mac);
> +    }

As discussed, we need to add the chassis-unique MACs that are configured in
external-ids:ovn-chassis-mac-mappings of the Chassis records, but I didn't find
this in the patch. VLAN-backed logical routers may not work without this.
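
For illustration only, a rough sketch of what appending those MACs to the
eth.src set could look like. It assumes the mapping string is a comma-separated
list of "physnet:MAC" entries (e.g.
"physnet1:aa:bb:cc:dd:ee:ff,physnet2:a1:b2:c3:d4:e5:f6"); append_chassis_macs()
is a hypothetical name, and the sketch uses plain C strings rather than the
"struct ds" helpers used in the patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: for every "physnet:MAC" entry in 'mappings', appends
 * the MAC to the comma-separated list already stored in 'set' (a
 * NUL-terminated string in a buffer of 'size' bytes).
 * Returns the number of MACs appended, or -1 on a malformed entry or if
 * 'set' is too small. */
static int
append_chassis_macs(char *set, size_t size, const char *mappings)
{
    int count = 0;
    const char *p = mappings;

    while (*p) {
        size_t entry_len = strcspn(p, ",");
        const char *colon = memchr(p, ':', entry_len);
        if (!colon) {
            return -1;          /* No "physnet:" prefix. */
        }

        const char *mac = colon + 1;
        size_t mac_len = entry_len - (size_t) (mac - p);
        if (mac_len != 17) {
            return -1;          /* Expect "xx:xx:xx:xx:xx:xx". */
        }

        size_t used = strlen(set);
        int len = snprintf(set + used, size - used, "%s%.*s",
                           used ? ", " : "", (int) mac_len, mac);
        if (len < 0 || (size_t) len >= size - used) {
            set[used] = '\0';   /* Undo the partial write. */
            return -1;
        }
        count++;

        p += entry_len;
        if (*p == ',') {
            p++;                /* Skip the entry separator. */
        }
    }
    return count;
}
```

Starting from a set that holds "00:00:00:00:01:00", the example mapping above
would yield "00:00:00:00:01:00, aa:bb:cc:dd:ee:ff, a1:b2:c3:d4:e5:f6", which
could then be wrapped in "{ ... }" for the eth.src match.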

> +    ds_chomp(&eth_src, ' ');
> +    ds_chomp(&eth_src, ',');
> +    ds_put_cstr(&eth_src, "}");
> +
> +    ds_put_format(&match, "eth.src == %s && (arp.op == 1 || nd_ns)",
> +                  ds_cstr(&eth_src));
> +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> +                  ds_cstr(&match),
> +                  "outport = \""MC_FLOOD"\"; output;");
> +
> +    ds_destroy(&match);
> +    ds_destroy(&eth_src);
> +}
> +
> +/*
> + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
> + * that own the addresses. Other ARP/ND packets are still flooded in the
> + * switching domain as regular broadcast.
> + */
> +static void
> +build_lswitch_rport_arp_req_flow_for_ip(struct sset *ips,
> +                                        int addr_family,
> +                                        struct ovn_port *patch_op,
> +                                        struct ovn_datapath *od,
> +                                        uint32_t priority,
> +                                        struct hmap *lflows)
> +{
> +    struct ds match   = DS_EMPTY_INITIALIZER;
> +    struct ds actions = DS_EMPTY_INITIALIZER;
> +
> +    /* Packets received from VXLAN tunnels have already been through the
> +     * router pipeline so we should skip them. Normally this is done by the
> +     * multicast_group implementation (VXLAN packets skip table 32 which
> +     * delivers to patch ports) but we're bypassing multicast_groups.
> +     */
> +    ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
> +
> +    if (addr_family == AF_INET) {
> +        ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
> +    } else {
> +        ds_put_cstr(&match, "nd_ns && nd.target == { ");
> +    }
> +
> +    const char *ip_address;
> +    SSET_FOR_EACH (ip_address, ips) {
> +        ds_put_format(&match, "%s, ", ip_address);
> +    }
> +
> +    ds_chomp(&match, ' ');
> +    ds_chomp(&match, ',');
> +    ds_put_cstr(&match, "}");
> +
> +    /* Send the packet only to the router pipeline and skip flooding it
> +     * in the broadcast domain.
> +     */
> +    ds_put_format(&actions, "outport = %s; output;", patch_op->json_key);
> +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> +                  ds_cstr(&match), ds_cstr(&actions));
> +
> +    ds_destroy(&match);
> +    ds_destroy(&actions);
> +}
> +
> +/*
> + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
> + * that own the addresses.
> + * Priorities:
> + * - 80: self originated GARPs that need to follow regular processing.
> + * - 75: ARP requests to router owned IPs (interface IP/LB/NAT).
> + */
> +static void
> +build_lswitch_rport_arp_req_flows(struct ovn_port *op,
> +                                  struct ovn_datapath *sw_od,
> +                                  struct ovn_port *sw_op,
> +                                  struct hmap *lflows)
> +{
> +    if (!op || !op->nbrp) {
> +        return;
> +    }
> +
> +    if (!lrport_is_enabled(op->nbrp)) {
> +        return;
> +    }
> +
> +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> +     * Priority: 80.
> +     */
> +    build_lswitch_rport_arp_req_self_orig_flow(op, 80, sw_od, lflows);
> +
> +    /* Forward ARP requests for owned IP addresses (L3, VIP, NAT) only to this
> +     * router port.
> +     * Priority: 75.
> +     */
> +    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
> +    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
> +
> +    for (size_t i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
> +        sset_add(&all_ips_v4, op->lrp_networks.ipv4_addrs[i].addr_s);
> +    }
> +    for (size_t i = 0; i < op->lrp_networks.n_ipv6_addrs; i++) {
> +        sset_add(&all_ips_v6, op->lrp_networks.ipv6_addrs[i].addr_s);
> +    }
> +
> +    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
> +
> +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> +
> +        if (!strcmp(nat->type, "snat")) {
> +            continue;
> +        }
> +
> +        ovs_be32 ip;
> +        ovs_be32 mask;
> +        struct in6_addr ipv6;
> +        struct in6_addr mask_v6;
> +
> +        if (ip_parse_masked(nat->external_ip, &ip, &mask)) {
> +            if (!ipv6_parse_masked(nat->external_ip, &ipv6, &mask_v6)) {
> +                sset_add(&all_ips_v6, nat->external_ip);
> +            }
> +        } else {
> +            sset_add(&all_ips_v4, nat->external_ip);
> +        }
> +    }
> +
> +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v4, AF_INET, sw_op,
> +                                            sw_od, 75, lflows);
> +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v6, AF_INET6, sw_op,
> +                                            sw_od, 75, lflows);
> +
> +    sset_destroy(&all_ips_v4);
> +    sset_destroy(&all_ips_v6);
> +}
> +
>  static void
>  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
>                      struct hmap *port_groups, struct hmap *lflows,
> @@ -5761,6 +5933,14 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
>              continue;
>          }
>
> +        /* For ports connected to logical routers add flows to bypass the
> +         * broadcast flooding of ARP/ND requests in table 17. We direct the
> +         * requests only to the router port that owns the IP address.
> +         */
> +        if (!strcmp(op->nbsp->type, "router")) {
> +            build_lswitch_rport_arp_req_flows(op->peer, op->od, op, lflows);
> +        }
> +
>          for (size_t i = 0; i < op->nbsp->n_addresses; i++) {
>              /* Addresses are owned by the logical port.
>               * Ethernet address followed by zero or more IPv4
> @@ -5892,12 +6072,6 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
>      ds_destroy(&actions);
>  }
>
> -static bool
> -lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> -{
> -    return !lrport->enabled || *lrport->enabled;
> -}
> -
>  /* Returns a string of the IP address of the router port 'op' that
>   * overlaps with 'ip_s".  If one is not found, returns NULL.
>   *
> diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
> index 7966b65..c43f16d 100644
> --- a/ovn-architecture.7.xml
> +++ b/ovn-architecture.7.xml
> @@ -1390,6 +1390,25 @@
>      http://docs.openvswitch.org/en/latest/topics/high-availability.
>    </p>
>
> +  <h3>ARP request and ND NS packet processing</h3>
> +
> +  <p>
> +    Due to the fact that ARP requests and ND NS packets are usually broadcast
> +    packets, for performance reasons, OVN deals with requests that target OVN
> +    owned IP addresses (i.e., IP addresses configured on the router ports,
> +    VIPs, NAT IPs) in a specific way and only forwards them to the logical
> +    router that owns the target IP address. This behavior is different than
> +    that of traditional switches and implies that other routers/hosts
> +    connected to the logical switch will not learn the MAC/IP binding from
> +    the request packet.
> +  </p>
> +
> +  <p>
> +    All other ARP and ND packets are flooded in the L2 broadcast domain and
> +    to all attached logical patch ports.
> +  </p>
> +
> +
>    <h2>Multiple localnet logical switches connected to a Logical Router</h2>
>
>    <p>
> diff --git a/tests/ovn.at b/tests/ovn.at
> index 3e429e3..26e33d2 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -2877,7 +2877,7 @@ test_ip() {
>      done
>  }
>
> -# test_arp INPORT SHA SPA TPA [REPLY_HA]
> +# test_arp INPORT SHA SPA TPA FLOOD [REPLY_HA]
>  #
>  # Causes a packet to be received on INPORT.  The packet is an ARP
>  # request with SHA, SPA, and TPA as specified.  If REPLY_HA is provided, then
> @@ -2888,21 +2888,25 @@ test_ip() {
>  # SHA and REPLY_HA are each 12 hex digits.
>  # SPA and TPA are each 8 hex digits.
>  test_arp() {
> -    local inport=$1 sha=$2 spa=$3 tpa=$4 reply_ha=$5
> +    local inport=$1 sha=$2 spa=$3 tpa=$4 flood=$5 reply_ha=$6
>      local request=ffffffffffff${sha}08060001080006040001${sha}${spa}ffffffffffff${tpa}
>      hv=hv`vif_to_hv $inport`
>      as $hv ovs-appctl netdev-dummy/receive vif$inport $request
>      as $hv ovs-appctl ofproto/trace br-int in_port=$inport $request
>
>      # Expect to receive the broadcast ARP on the other logical switch ports if
> -    # IP address is not configured to the switch patch port.
> +    # IP address is not configured on the switch patch port or on the router
> +    # port (i.e., $flood == 1).
>      local i=`vif_to_ls $inport`
>      local j k
>      for j in 1 2 3; do
>          for k in 1 2 3; do
> -            # 192.168.33.254 is configured to the switch patch port for lrp33,
> -            # so no ARP flooding expected for it.
> -            if test $i$j$k != $inport && test $tpa != `ip_to_hex 192 168 33 254`; then
> +            # Skip ingress port.
> +            if test $i$j$k == $inport; then
> +                continue
> +            fi
> +
> +            if test X$flood == X1; then
>                  echo $request >> $i$j$k.expected
>              fi
>          done
> @@ -3039,9 +3043,9 @@ for i in 1 2 3; do
>        otherip=`ip_to_hex 192 168 $i$j 55` # Some other IP in subnet
>        externalip=`ip_to_hex 1 2 3 4`      # Some other IP not in subnet
>
> -      test_arp $i$j$k $smac $sip        $rip        $rmac      #4
> -      test_arp $i$j$k $smac $otherip    $rip        $rmac      #5
> -      test_arp $i$j$k $smac $sip        $otherip               #6
> +      test_arp $i$j$k $smac $sip        $rip       0     $rmac       #4
> +      test_arp $i$j$k $smac $otherip    $rip       0     $rmac       #5
> +      test_arp $i$j$k $smac $sip        $otherip   1                 #6
>
>        # When rip is 192.168.33.254, ARP request from externalip won't be
>        # filtered, because 192.168.33.254 is configured to switch peer port
> @@ -3050,7 +3054,7 @@ for i in 1 2 3; do
>        if test $i = 3 && test $j = 3; then
>          lrp33_rsp=$rmac
>        fi
> -      test_arp $i$j$k $smac $externalip $rip        $lrp33_rsp #7
> +      test_arp $i$j$k $smac $externalip $rip       0      $lrp33_rsp #7
>
>        # MAC binding should be learned from ARP request.
>        host_mac_pretty=f0:00:00:00:0$i:$j$k
> @@ -9595,7 +9599,7 @@ ovn-nbctl --wait=hv --timeout=3 sync
>  # Check that there is a logical flow in logical switch foo's pipeline
>  # to set the outport to rp-foo (which is expected).
>  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
> -grep rp-foo | grep -v is_chassis_resident | wc -l`])
> +grep rp-foo | grep -v is_chassis_resident | grep priority=50 -c`])
>
>  # Set the option 'reside-on-redirect-chassis' for foo
>  ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
> @@ -9603,7 +9607,7 @@ ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
>  # to set the outport to rp-foo with the condition is_chassis_redirect.
>  ovn-sbctl dump-flows foo
>  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
> -grep rp-foo | grep is_chassis_resident | wc -l`])
> +grep rp-foo | grep is_chassis_resident | grep priority=50 -c`])
>
>  echo "---------NB dump-----"
>  ovn-nbctl show
> @@ -16694,3 +16698,282 @@ as hv4 ovs-appctl fdb/show br-phys
>  OVN_CLEANUP([hv1],[hv2],[hv3],[hv4])
>
>  AT_CLEANUP
> +
> +AT_SETUP([ovn -- ARP/ND request broadcast limiting])
> +AT_SKIP_IF([test $HAVE_PYTHON = no])
> +ovn_start
> +
> +ip_to_hex() {
> +    printf "%02x%02x%02x%02x" "$@"
> +}
> +
> +send_arp_request() {
> +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5
> +    local eth_dst=ffffffffffff
> +    local eth_type=0806
> +    local eth=${eth_dst}${eth_src}${eth_type}
> +
> +    local arp=0001080006040001${eth_src}${spa}${eth_dst}${tpa}
> +
> +    local request=${eth}${arp}
> +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
> +}
> +
> +send_nd_ns() {
> +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5 cksum=$6
> +
> +    local eth_dst=ffffffffffff
> +    local eth_type=86dd
> +    local eth=${eth_dst}${eth_src}${eth_type}
> +
> +    local ip_vhlen=60000000
> +    local ip_plen=0020
> +    local ip_next=3a
> +    local ip_ttl=ff
> +    local ip=${ip_vhlen}${ip_plen}${ip_next}${ip_ttl}${spa}${tpa}
> +
> +    # Neighbor Solicitation
> +    local icmp6_type=87
> +    local icmp6_code=00
> +    local icmp6_rsvd=00000000
> +    # ICMPv6 source lla option
> +    local icmp6_opt=01
> +    local icmp6_optlen=01
> +    local icmp6=${icmp6_type}${icmp6_code}${cksum}${icmp6_rsvd}${tpa}${icmp6_opt}${icmp6_optlen}${eth_src}
> +
> +    local request=${eth}${ip}${icmp6}
> +
> +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
> +}
> +
> +src_mac=000000000001
> +
> +net_add n1
> +sim_add hv1
> +as hv1
> +ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.1
> +
> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> +    set interface hv1-vif1 external-ids:iface-id=sw-agg-ext \
> +    options:tx_pcap=hv1/vif1-tx.pcap \
> +    options:rxq_pcap=hv1/vif1-rx.pcap \
> +    ofport-request=1
> +
> +# One Aggregation Switch connected to two Logical networks (routers).
> +ovn-nbctl ls-add sw-agg
> +ovn-nbctl lsp-add sw-agg sw-agg-ext \
> +    -- lsp-set-addresses sw-agg-ext 00:00:00:00:00:01
> +
> +ovn-nbctl lsp-add sw-agg sw-rtr1                   \
> +    -- lsp-set-type sw-rtr1 router                 \
> +    -- lsp-set-addresses sw-rtr1 00:00:00:00:01:00 \
> +    -- lsp-set-options sw-rtr1 router-port=rtr1-sw
> +ovn-nbctl lsp-add sw-agg sw-rtr2                   \
> +    -- lsp-set-type sw-rtr2 router                 \
> +    -- lsp-set-addresses sw-rtr2 00:00:00:00:02:00 \
> +    -- lsp-set-options sw-rtr2 router-port=rtr2-sw
> +
> +# Configure L3 interface IPv4 & IPv6 on both routers
> +ovn-nbctl lr-add rtr1
> +ovn-nbctl lrp-add rtr1 rtr1-sw 00:00:00:00:01:00 10.0.0.1/24 10::1/64
> +
> +ovn-nbctl lr-add rtr2
> +ovn-nbctl lrp-add rtr2 rtr2-sw 00:00:00:00:02:00 10.0.0.2/24 10::2/64
> +
> +OVN_POPULATE_ARP
> +ovn-nbctl --wait=hv sync
> +
> +sw_dp_uuid=$(ovn-sbctl --bare --columns _uuid list datapath_binding sw-agg)
> +sw_dp_key=$(ovn-sbctl --bare --columns tunnel_key list datapath_binding sw-agg)
> +
> +r1_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr1)
> +r2_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr2)
> +
> +mc_key=$(ovn-sbctl --bare --columns tunnel_key find multicast_group datapath=${sw_dp_uuid} name="_MC_flood")
> +mc_key=$(printf "%04x" $mc_key)
> +
> +match_sw_metadata="metadata=0x${sw_dp_key}"
> +
> +# Inject ARP request for first router owned IP address.
> +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 1)
> +
> +# Verify that the ARP request is sent only to rtr1.
> +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.1,arp_op=1"
> +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +# Inject ND_NS for first router owned IP address.
> +src_ipv6=00100000000000000000000000000254
> +dst_ipv6=00100000000000000000000000000001
> +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> +
> +# Verify that the ND_NS is sent only to rtr1.
> +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::1"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +# Configure load balancing on both routers.
> +ovn-nbctl lb-add lb1-v4 10.0.0.11 42.42.42.1
> +ovn-nbctl lb-add lb1-v6 10::11 42::1
> +ovn-nbctl lr-lb-add rtr1 lb1-v4
> +ovn-nbctl lr-lb-add rtr1 lb1-v6
> +
> +ovn-nbctl lb-add lb2-v4 10.0.0.22 42.42.42.2
> +ovn-nbctl lb-add lb2-v6 10::22 42::2
> +ovn-nbctl lr-lb-add rtr2 lb2-v4
> +ovn-nbctl lr-lb-add rtr2 lb2-v6
> +ovn-nbctl --wait=hv sync
> +
> +# Inject ARP request for first router owned VIP address.
> +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 11)
> +
> +# Verify that the ARP request is sent only to rtr1.
> +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.11,arp_op=1"
> +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +# Inject ND_NS for first router owned VIP address.
> +src_ipv6=00100000000000000000000000000254
> +dst_ipv6=00100000000000000000000000000011
> +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> +
> +# Verify that the ND_NS is sent only to rtr1.
> +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::11"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +# Configure NAT on both routers
> +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10.0.0.111 42.42.42.1
> +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10::111 42::1
> +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10.0.0.222 42.42.42.2
> +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10::222 42::2
> +
> +# Inject ARP request for first router owned NAT address.
> +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 111)
> +
> +# Verify that the ARP request is sent only to rtr1.
> +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.111,arp_op=1"
> +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +# Inject ND_NS for first router owned NAT address.
> +src_ipv6=00100000000000000000000000000254
> +dst_ipv6=00100000000000000000000000000111
> +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> +
> +# Verify that the ND_NS is sent only to rtr1.
> +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::111"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +OVN_CLEANUP([hv1])
> +AT_CLEANUP
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Dumitru Ceara Nov. 12, 2019, 6:09 p.m. UTC | #2
On Tue, Nov 12, 2019 at 6:17 PM Han Zhou <zhouhan@gmail.com> wrote:
>
>
>
> On Tue, Nov 12, 2019 at 2:29 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > ARP request and ND NS packets for router owned IPs were being
> > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > However this creates a scaling issue in scenarios where aggregation
> > logical switches are connected to more logical routers (~350). The
> > logical pipelines of all routers would have to be executed before the
> > packet is finally replied to by a single router, the owner of the IP
> > address.
> >
> > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > for ARP requests that will be replied by a single router. The packets
> > are forwarded only to the router port that owns the target IP address.
> >
> > IPs that are owned by the routers and for which this fix applies are:
> > - IP addresses configured on the router ports.
> > - VIPs.
> > - NAT IPs.
> >
> > Reported-at: https://bugzilla.redhat.com/1756945
> > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >
> > ---
> > v7:
> > - Address Han's comments:
> >     - Remove flooding for all ARPs received on VLAN networks. To avoid
> >       that we now identify self originated (G)ARPs by matching on source
> >       MAC address too.
> >     - Rename REGBIT_NOT_VXLAN to FLAGBIT_NOT_VXLAN.
> > - Fix ovn-sb manpage.
> > - Split patch in a series of 2:
> >     - patch1: fixes the get_router_load_balancer_ips() function.
> >     - patch2: limits the ARP/ND broadcast domain.
> > v6:
> > - Address Han's comments:
> >     - remove flooding of ARPs targeting OVN owned IP addresses.
> >     - update ovn-architecture documentation.
> >     - rename ARP handling functions.
> >     - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
> >     account the new way of forwarding ARPs.
> > - Also, properly deal with ARP packets on VLAN-backed networks.
> > v5: Address Numan's comments: update comments & make autotest more
> >     robust.
> > v4: Rebase.
> > v3: Properly deal with VXLAN traffic. Address review comments from
> >     Numan (add autotests). Fix function get_router_load_balancer_ips.
> >     Rebase -> deal with IPv6 NAT too.
> > v2: Move ARP broadcast domain limiting to table S_SWITCH_IN_L2_LKUP to
> > address localnet ports too.
> > ---
> >  northd/ovn-northd.8.xml |   14 ++
> >  northd/ovn-northd.c     |  230 +++++++++++++++++++++++++++++++----
> >  ovn-architecture.7.xml  |   19 +++
> >  tests/ovn.at            |  307 +++++++++++++++++++++++++++++++++++++++++++++--
> >  4 files changed, 530 insertions(+), 40 deletions(-)
> >
> > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > index 0a33dcd..344cc0d 100644
> > --- a/northd/ovn-northd.8.xml
> > +++ b/northd/ovn-northd.8.xml
> > @@ -1005,6 +1005,20 @@ output;
> >        </li>
> >
> >        <li>
> > +        Priority-80 flows for each port connected to a logical router
> > +        matching self originated GARP/ARP request/ND packets. These packets
> > +        are flooded to the <code>MC_FLOOD</code> multicast group, which
> > +        contains all logical ports.
> > +      </li>
> > +
> > +      <li>
> > +        Priority-75 flows for each IP address/VIP/NAT address owned by a
> > +        router port connected to the switch. These flows match ARP requests
> > +        and ND packets for the specific IP addresses.  Matched packets are
> > +        forwarded only to the router that owns the IP address.
> > +      </li>
> > +
> > +      <li>
> >          A priority-70 flow that outputs all packets with an Ethernet broadcast
> >          or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
> >          multicast group.
> > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > index 32f3200..d6beb97 100644
> > --- a/northd/ovn-northd.c
> > +++ b/northd/ovn-northd.c
> > @@ -210,6 +210,8 @@ enum ovn_stage {
> >  #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
> >  #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
> >
> > +#define FLAGBIT_NOT_VXLAN "flags[1] == 0"
> > +
> >  /* Returns an "enum ovn_stage" built from the arguments. */
> >  static enum ovn_stage
> >  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
> > @@ -1202,6 +1204,34 @@ ovn_port_allocate_key(struct ovn_datapath *od)
> >                            1, (1u << 15) - 1, &od->port_key_hint);
> >  }
> >
> > +/* Returns true if the logical switch port 'enabled' column is empty or
> > + * set to true.  Otherwise, returns false. */
> > +static bool
> > +lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> > +{
> > +    return !lsp->n_enabled || *lsp->enabled;
> > +}
> > +
> > +/* Returns true only if the logical switch port 'up' column is set to true.
> > + * Otherwise, if the column is not set or set to false, returns false. */
> > +static bool
> > +lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> > +{
> > +    return lsp->n_up && *lsp->up;
> > +}
> > +
> > +static bool
> > +lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> > +{
> > +    return !strcmp(nbsp->type, "external");
> > +}
> > +
> > +static bool
> > +lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> > +{
> > +    return !lrport->enabled || *lrport->enabled;
> > +}
> > +
> >  static char *
> >  chassis_redirect_name(const char *port_name)
> >  {
> > @@ -3750,28 +3780,6 @@ build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
> >
> >  }
> >
> > -/* Returns true if the logical switch port 'enabled' column is empty or
> > - * set to true.  Otherwise, returns false. */
> > -static bool
> > -lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> > -{
> > -    return !lsp->n_enabled || *lsp->enabled;
> > -}
> > -
> > -/* Returns true only if the logical switch port 'up' column is set to true.
> > - * Otherwise, if the column is not set or set to false, returns false. */
> > -static bool
> > -lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> > -{
> > -    return lsp->n_up && *lsp->up;
> > -}
> > -
> > -static bool
> > -lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> > -{
> > -    return !strcmp(nbsp->type, "external");
> > -}
> > -
> >  static bool
> >  build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
> >                      struct ds *options_action, struct ds *response_action,
> > @@ -5174,6 +5182,170 @@ build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
> >      }
> >  }
> >
> > +/*
> > + * Ingress table 17: Flows that flood self originated ARP/ND packets in the
> > + * switching domain.
> > + */
> > +static void
> > +build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,
> > +                                           uint32_t priority,
> > +                                           struct ovn_datapath *od,
> > +                                           struct hmap *lflows)
> > +{
> > +    struct ds match = DS_EMPTY_INITIALIZER;
> > +    struct ds eth_src = DS_EMPTY_INITIALIZER;
> > +
> > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> > +     * Determine that packets are self originated by also matching on
> > +     * source MAC. Matching on ingress port is not reliable in case this
> > +     * is a VLAN-backed network.
> > +     * Priority: 80.
> > +     */
> > +    ds_put_format(&eth_src, "{ %s, ", op->lrp_networks.ea_s);
> > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> > +
> > +        if (!nat->external_mac) {
> > +            continue;
> > +        }
> > +
> > +        ds_put_format(&eth_src, "%s, ", nat->external_mac);
> > +    }
>
> As discussed, we need to add the chassis-unique MACs that are configured in external-ids:ovn-chassis-mac-mappings of the Chassis records, but I didn't find this in the patch. A VLAN-backed logical router may not work without this.

Hi Han,

Maybe I misunderstood but in the discussion on v6 I mentioned that I
don't think we need to add the MACs from
external-ids:ovn-chassis-mac-mappings.

Whenever chassis MACs are configured, in ovn-controller we create a
conjunctive flow matching on any of the remote chassis MAC addresses:
https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L501

And for all incoming traffic that matches this conjunction and VLAN-id
we change the MAC back to that of the logical router port:
https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L558

Isn't this enough to cover the self originated ARP packets?
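
To make the argument concrete, here is a minimal sketch in C of the two
steps, under stated assumptions. It is not the ovn-controller code: the
struct vlan_rx_model and both function names are hypothetical, MACs are
modeled as plain strings, and the VLAN-id and ARP/ND parts of the real
matches are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model for illustration only; none of these names exist in
 * OVN.  MAC addresses are kept as strings to keep the sketch short. */
struct vlan_rx_model {
    const char *const *chassis_macs; /* Remote chassis MACs (assumed). */
    size_t n_chassis_macs;
    const char *lrp_mac;             /* Logical router port MAC. */
};

/* Models the physical-table step: if the source MAC is one of the remote
 * chassis MACs, it is changed back to the logical router port MAC.
 * Any other source MAC is returned unchanged. */
static const char *
restore_router_mac(const struct vlan_rx_model *m, const char *eth_src)
{
    for (size_t i = 0; i < m->n_chassis_macs; i++) {
        if (!strcmp(m->chassis_macs[i], eth_src)) {
            return m->lrp_mac;
        }
    }
    return eth_src;
}

/* Models the eth.src part of the priority-80 logical flow match, evaluated
 * after the rewrite above has been applied. */
static bool
matches_self_originated(const struct vlan_rx_model *m, const char *eth_src)
{
    return !strcmp(restore_router_mac(m, eth_src), m->lrp_mac);
}
```

With a router port MAC of 00:00:00:00:01:00 and a made-up chassis MAC of
aa:bb:cc:00:00:02, a frame arriving with the chassis MAC as source is
restored to the router port MAC and would therefore hit the priority-80
flow, while a frame from a regular host such as 00:00:00:00:00:01 would
not.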

Thanks,
Dumitru

>
> > +    ds_chomp(&eth_src, ' ');
> > +    ds_chomp(&eth_src, ',');
> > +    ds_put_cstr(&eth_src, "}");
> > +
> > +    ds_put_format(&match, "eth.src == %s && (arp.op == 1 || nd_ns)",
> > +                  ds_cstr(&eth_src));
> > +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> > +                  ds_cstr(&match),
> > +                  "outport = \""MC_FLOOD"\"; output;");
> > +
> > +    ds_destroy(&match);
> > +    ds_destroy(&eth_src);
> > +}
> > +
> > +/*
> > + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
> > + * that own the addresses. Other ARP/ND packets are still flooded in the
> > + * switching domain as regular broadcast.
> > + */
> > +static void
> > +build_lswitch_rport_arp_req_flow_for_ip(struct sset *ips,
> > +                                        int addr_family,
> > +                                        struct ovn_port *patch_op,
> > +                                        struct ovn_datapath *od,
> > +                                        uint32_t priority,
> > +                                        struct hmap *lflows)
> > +{
> > +    struct ds match   = DS_EMPTY_INITIALIZER;
> > +    struct ds actions = DS_EMPTY_INITIALIZER;
> > +
> > +    /* Packets received from VXLAN tunnels have already been through the
> > +     * router pipeline so we should skip them. Normally this is done by the
> > +     * multicast_group implementation (VXLAN packets skip table 32 which
> > +     * delivers to patch ports) but we're bypassing multicast_groups.
> > +     */
> > +    ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
> > +
> > +    if (addr_family == AF_INET) {
> > +        ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
> > +    } else {
> > +        ds_put_cstr(&match, "nd_ns && nd.target == { ");
> > +    }
> > +
> > +    const char *ip_address;
> > +    SSET_FOR_EACH (ip_address, ips) {
> > +        ds_put_format(&match, "%s, ", ip_address);
> > +    }
> > +
> > +    ds_chomp(&match, ' ');
> > +    ds_chomp(&match, ',');
> > +    ds_put_cstr(&match, "}");
> > +
> > +    /* Send the packet only to the router pipeline and skip flooding it
> > +     * in the broadcast domain.
> > +     */
> > +    ds_put_format(&actions, "outport = %s; output;", patch_op->json_key);
> > +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> > +                  ds_cstr(&match), ds_cstr(&actions));
> > +
> > +    ds_destroy(&match);
> > +    ds_destroy(&actions);
> > +}
> > +
> > +/*
> > + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
> > + * that own the addresses.
> > + * Priorities:
> > + * - 80: self originated GARPs that need to follow regular processing.
> > + * - 75: ARP requests to router owned IPs (interface IP/LB/NAT).
> > + */
> > +static void
> > +build_lswitch_rport_arp_req_flows(struct ovn_port *op,
> > +                                  struct ovn_datapath *sw_od,
> > +                                  struct ovn_port *sw_op,
> > +                                  struct hmap *lflows)
> > +{
> > +    if (!op || !op->nbrp) {
> > +        return;
> > +    }
> > +
> > +    if (!lrport_is_enabled(op->nbrp)) {
> > +        return;
> > +    }
> > +
> > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> > +     * Priority: 80.
> > +     */
> > +    build_lswitch_rport_arp_req_self_orig_flow(op, 80, sw_od, lflows);
> > +
> > +    /* Forward ARP requests for owned IP addresses (L3, VIP, NAT) only to this
> > +     * router port.
> > +     * Priority: 75.
> > +     */
> > +    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
> > +    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
> > +
> > +    for (size_t i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
> > +        sset_add(&all_ips_v4, op->lrp_networks.ipv4_addrs[i].addr_s);
> > +    }
> > +    for (size_t i = 0; i < op->lrp_networks.n_ipv6_addrs; i++) {
> > +        sset_add(&all_ips_v6, op->lrp_networks.ipv6_addrs[i].addr_s);
> > +    }
> > +
> > +    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
> > +
> > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> > +
> > +        if (!strcmp(nat->type, "snat")) {
> > +            continue;
> > +        }
> > +
> > +        ovs_be32 ip;
> > +        ovs_be32 mask;
> > +        struct in6_addr ipv6;
> > +        struct in6_addr mask_v6;
> > +
> > +        if (ip_parse_masked(nat->external_ip, &ip, &mask)) {
> > +            if (!ipv6_parse_masked(nat->external_ip, &ipv6, &mask_v6)) {
> > +                sset_add(&all_ips_v6, nat->external_ip);
> > +            }
> > +        } else {
> > +            sset_add(&all_ips_v4, nat->external_ip);
> > +        }
> > +    }
> > +
> > +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v4, AF_INET, sw_op,
> > +                                            sw_od, 75, lflows);
> > +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v6, AF_INET6, sw_op,
> > +                                            sw_od, 75, lflows);
> > +
> > +    sset_destroy(&all_ips_v4);
> > +    sset_destroy(&all_ips_v6);
> > +}
> > +
> >  static void
> >  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
> >                      struct hmap *port_groups, struct hmap *lflows,
> > @@ -5761,6 +5933,14 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
> >              continue;
> >          }
> >
> > +        /* For ports connected to logical routers add flows to bypass the
> > +         * broadcast flooding of ARP/ND requests in table 17. We direct the
> > +         * requests only to the router port that owns the IP address.
> > +         */
> > +        if (!strcmp(op->nbsp->type, "router")) {
> > +            build_lswitch_rport_arp_req_flows(op->peer, op->od, op, lflows);
> > +        }
> > +
> >          for (size_t i = 0; i < op->nbsp->n_addresses; i++) {
> >              /* Addresses are owned by the logical port.
> >               * Ethernet address followed by zero or more IPv4
> > @@ -5892,12 +6072,6 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
> >      ds_destroy(&actions);
> >  }
> >
> > -static bool
> > -lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> > -{
> > -    return !lrport->enabled || *lrport->enabled;
> > -}
> > -
> >  /* Returns a string of the IP address of the router port 'op' that
> >   * overlaps with 'ip_s".  If one is not found, returns NULL.
> >   *
> > diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
> > index 7966b65..c43f16d 100644
> > --- a/ovn-architecture.7.xml
> > +++ b/ovn-architecture.7.xml
> > @@ -1390,6 +1390,25 @@
> >      http://docs.openvswitch.org/en/latest/topics/high-availability.
> >    </p>
> >
> > +  <h3>ARP request and ND NS packet processing</h3>
> > +
> > +  <p>
> > +    Due to the fact that ARP requests and ND NS packets are usually broadcast
> > +    packets, for performance reasons, OVN deals with requests that target OVN
> > +    owned IP addresses (i.e., IP addresses configured on the router ports,
> > +    VIPs, NAT IPs) in a specific way and only forwards them to the logical
> > +    router that owns the target IP address. This behavior is different than
> > +    that of traditional switches and implies that other routers/hosts
> > +    connected to the logical switch will not learn the MAC/IP binding from
> > +    the request packet.
> > +  </p>
> > +
> > +  <p>
> > +    All other ARP and ND packets are flooded in the L2 broadcast domain and
> > +    to all attached logical patch ports.
> > +  </p>
> > +
> > +
> >    <h2>Multiple localnet logical switches connected to a Logical Router</h2>
> >
> >    <p>
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index 3e429e3..26e33d2 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -2877,7 +2877,7 @@ test_ip() {
> >      done
> >  }
> >
> > -# test_arp INPORT SHA SPA TPA [REPLY_HA]
> > +# test_arp INPORT SHA SPA TPA FLOOD [REPLY_HA]
> >  #
> >  # Causes a packet to be received on INPORT.  The packet is an ARP
> >  # request with SHA, SPA, and TPA as specified.  If REPLY_HA is provided, then
> > @@ -2888,21 +2888,25 @@ test_ip() {
> >  # SHA and REPLY_HA are each 12 hex digits.
> >  # SPA and TPA are each 8 hex digits.
> >  test_arp() {
> > -    local inport=$1 sha=$2 spa=$3 tpa=$4 reply_ha=$5
> > +    local inport=$1 sha=$2 spa=$3 tpa=$4 flood=$5 reply_ha=$6
> >      local request=ffffffffffff${sha}08060001080006040001${sha}${spa}ffffffffffff${tpa}
> >      hv=hv`vif_to_hv $inport`
> >      as $hv ovs-appctl netdev-dummy/receive vif$inport $request
> >      as $hv ovs-appctl ofproto/trace br-int in_port=$inport $request
> >
> >      # Expect to receive the broadcast ARP on the other logical switch ports if
> > -    # IP address is not configured to the switch patch port.
> > +    # IP address is not configured on the switch patch port or on the router
> > +    # port (i.e., $flood == 1).
> >      local i=`vif_to_ls $inport`
> >      local j k
> >      for j in 1 2 3; do
> >          for k in 1 2 3; do
> > -            # 192.168.33.254 is configured to the switch patch port for lrp33,
> > -            # so no ARP flooding expected for it.
> > -            if test $i$j$k != $inport && test $tpa != `ip_to_hex 192 168 33 254`; then
> > +            # Skip ingress port.
> > +            if test $i$j$k == $inport; then
> > +                continue
> > +            fi
> > +
> > +            if test X$flood == X1; then
> >                  echo $request >> $i$j$k.expected
> >              fi
> >          done
> > @@ -3039,9 +3043,9 @@ for i in 1 2 3; do
> >        otherip=`ip_to_hex 192 168 $i$j 55` # Some other IP in subnet
> >        externalip=`ip_to_hex 1 2 3 4`      # Some other IP not in subnet
> >
> > -      test_arp $i$j$k $smac $sip        $rip        $rmac      #4
> > -      test_arp $i$j$k $smac $otherip    $rip        $rmac      #5
> > -      test_arp $i$j$k $smac $sip        $otherip               #6
> > +      test_arp $i$j$k $smac $sip        $rip       0     $rmac       #4
> > +      test_arp $i$j$k $smac $otherip    $rip       0     $rmac       #5
> > +      test_arp $i$j$k $smac $sip        $otherip   1                 #6
> >
> >        # When rip is 192.168.33.254, ARP request from externalip won't be
> >        # filtered, because 192.168.33.254 is configured to switch peer port
> > @@ -3050,7 +3054,7 @@ for i in 1 2 3; do
> >        if test $i = 3 && test $j = 3; then
> >          lrp33_rsp=$rmac
> >        fi
> > -      test_arp $i$j$k $smac $externalip $rip        $lrp33_rsp #7
> > +      test_arp $i$j$k $smac $externalip $rip       0      $lrp33_rsp #7
> >
> >        # MAC binding should be learned from ARP request.
> >        host_mac_pretty=f0:00:00:00:0$i:$j$k
> > @@ -9595,7 +9599,7 @@ ovn-nbctl --wait=hv --timeout=3 sync
> >  # Check that there is a logical flow in logical switch foo's pipeline
> >  # to set the outport to rp-foo (which is expected).
> >  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
> > -grep rp-foo | grep -v is_chassis_resident | wc -l`])
> > +grep rp-foo | grep -v is_chassis_resident | grep priority=50 -c`])
> >
> >  # Set the option 'reside-on-redirect-chassis' for foo
> >  ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
> > @@ -9603,7 +9607,7 @@ ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
> >  # to set the outport to rp-foo with the condition is_chassis_redirect.
> >  ovn-sbctl dump-flows foo
> >  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
> > -grep rp-foo | grep is_chassis_resident | wc -l`])
> > +grep rp-foo | grep is_chassis_resident | grep priority=50 -c`])
> >
> >  echo "---------NB dump-----"
> >  ovn-nbctl show
> > @@ -16694,3 +16698,282 @@ as hv4 ovs-appctl fdb/show br-phys
> >  OVN_CLEANUP([hv1],[hv2],[hv3],[hv4])
> >
> >  AT_CLEANUP
> > +
> > +AT_SETUP([ovn -- ARP/ND request broadcast limiting])
> > +AT_SKIP_IF([test $HAVE_PYTHON = no])
> > +ovn_start
> > +
> > +ip_to_hex() {
> > +    printf "%02x%02x%02x%02x" "$@"
> > +}
> > +
> > +send_arp_request() {
> > +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5
> > +    local eth_dst=ffffffffffff
> > +    local eth_type=0806
> > +    local eth=${eth_dst}${eth_src}${eth_type}
> > +
> > +    local arp=0001080006040001${eth_src}${spa}${eth_dst}${tpa}
> > +
> > +    local request=${eth}${arp}
> > +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
> > +}
> > +
> > +send_nd_ns() {
> > +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5 cksum=$6
> > +
> > +    local eth_dst=ffffffffffff
> > +    local eth_type=86dd
> > +    local eth=${eth_dst}${eth_src}${eth_type}
> > +
> > +    local ip_vhlen=60000000
> > +    local ip_plen=0020
> > +    local ip_next=3a
> > +    local ip_ttl=ff
> > +    local ip=${ip_vhlen}${ip_plen}${ip_next}${ip_ttl}${spa}${tpa}
> > +
> > +    # Neighbor Solicitation
> > +    local icmp6_type=87
> > +    local icmp6_code=00
> > +    local icmp6_rsvd=00000000
> > +    # ICMPv6 source lla option
> > +    local icmp6_opt=01
> > +    local icmp6_optlen=01
> > +    local icmp6=${icmp6_type}${icmp6_code}${cksum}${icmp6_rsvd}${tpa}${icmp6_opt}${icmp6_optlen}${eth_src}
> > +
> > +    local request=${eth}${ip}${icmp6}
> > +
> > +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
> > +}
> > +
> > +src_mac=000000000001
> > +
> > +net_add n1
> > +sim_add hv1
> > +as hv1
> > +ovs-vsctl add-br br-phys
> > +ovn_attach n1 br-phys 192.168.0.1
> > +
> > +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> > +    set interface hv1-vif1 external-ids:iface-id=sw-agg-ext \
> > +    options:tx_pcap=hv1/vif1-tx.pcap \
> > +    options:rxq_pcap=hv1/vif1-rx.pcap \
> > +    ofport-request=1
> > +
> > +# One Aggregation Switch connected to two Logical networks (routers).
> > +ovn-nbctl ls-add sw-agg
> > +ovn-nbctl lsp-add sw-agg sw-agg-ext \
> > +    -- lsp-set-addresses sw-agg-ext 00:00:00:00:00:01
> > +
> > +ovn-nbctl lsp-add sw-agg sw-rtr1                   \
> > +    -- lsp-set-type sw-rtr1 router                 \
> > +    -- lsp-set-addresses sw-rtr1 00:00:00:00:01:00 \
> > +    -- lsp-set-options sw-rtr1 router-port=rtr1-sw
> > +ovn-nbctl lsp-add sw-agg sw-rtr2                   \
> > +    -- lsp-set-type sw-rtr2 router                 \
> > +    -- lsp-set-addresses sw-rtr2 00:00:00:00:02:00 \
> > +    -- lsp-set-options sw-rtr2 router-port=rtr2-sw
> > +
> > +# Configure L3 interface IPv4 & IPv6 on both routers
> > +ovn-nbctl lr-add rtr1
> > +ovn-nbctl lrp-add rtr1 rtr1-sw 00:00:00:00:01:00 10.0.0.1/24 10::1/64
> > +
> > +ovn-nbctl lr-add rtr2
> > +ovn-nbctl lrp-add rtr2 rtr2-sw 00:00:00:00:02:00 10.0.0.2/24 10::2/64
> > +
> > +OVN_POPULATE_ARP
> > +ovn-nbctl --wait=hv sync
> > +
> > +sw_dp_uuid=$(ovn-sbctl --bare --columns _uuid list datapath_binding sw-agg)
> > +sw_dp_key=$(ovn-sbctl --bare --columns tunnel_key list datapath_binding sw-agg)
> > +
> > +r1_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr1)
> > +r2_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr2)
> > +
> > +mc_key=$(ovn-sbctl --bare --columns tunnel_key find multicast_group datapath=${sw_dp_uuid} name="_MC_flood")
> > +mc_key=$(printf "%04x" $mc_key)
> > +
> > +match_sw_metadata="metadata=0x${sw_dp_key}"
> > +
> > +# Inject ARP request for first router owned IP address.
> > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 1)
> > +
> > +# Verify that the ARP request is sent only to rtr1.
> > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.1,arp_op=1"
> > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > +
> > +as hv1
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > +    grep n_packets=1 -c)
> > +    test "1" = "${pkts_to_rtr1}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > +    grep n_packets=1 -c)
> > +    test "0" = "${pkts_to_rtr2}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > +    test "0" = "${pkts_flooded}"
> > +])
> > +
> > +# Inject ND_NS for first router owned IP address.
> > +src_ipv6=00100000000000000000000000000254
> > +dst_ipv6=00100000000000000000000000000001
> > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > +
> > +# Verify that the ND_NS is sent only to rtr1.
> > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::1"
> > +
> > +as hv1
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > +    grep n_packets=1 -c)
> > +    test "1" = "${pkts_to_rtr1}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > +    grep n_packets=1 -c)
> > +    test "0" = "${pkts_to_rtr2}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > +    test "0" = "${pkts_flooded}"
> > +])
> > +
> > +# Configure load balancing on both routers.
> > +ovn-nbctl lb-add lb1-v4 10.0.0.11 42.42.42.1
> > +ovn-nbctl lb-add lb1-v6 10::11 42::1
> > +ovn-nbctl lr-lb-add rtr1 lb1-v4
> > +ovn-nbctl lr-lb-add rtr1 lb1-v6
> > +
> > +ovn-nbctl lb-add lb2-v4 10.0.0.22 42.42.42.2
> > +ovn-nbctl lb-add lb2-v6 10::22 42::2
> > +ovn-nbctl lr-lb-add rtr2 lb2-v4
> > +ovn-nbctl lr-lb-add rtr2 lb2-v6
> > +ovn-nbctl --wait=hv sync
> > +
> > +# Inject ARP request for first router owned VIP address.
> > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 11)
> > +
> > +# Verify that the ARP request is sent only to rtr1.
> > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.11,arp_op=1"
> > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > +
> > +as hv1
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > +    grep n_packets=1 -c)
> > +    test "1" = "${pkts_to_rtr1}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > +    grep n_packets=1 -c)
> > +    test "0" = "${pkts_to_rtr2}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > +    test "0" = "${pkts_flooded}"
> > +])
> > +
> > +# Inject ND_NS for first router owned VIP address.
> > +src_ipv6=00100000000000000000000000000254
> > +dst_ipv6=00100000000000000000000000000011
> > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > +
> > +# Verify that the ND_NS is sent only to rtr1.
> > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::11"
> > +
> > +as hv1
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > +    grep n_packets=1 -c)
> > +    test "1" = "${pkts_to_rtr1}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > +    grep n_packets=1 -c)
> > +    test "0" = "${pkts_to_rtr2}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > +    test "0" = "${pkts_flooded}"
> > +])
> > +
> > +# Configure NAT on both routers
> > +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10.0.0.111 42.42.42.1
> > +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10::111 42::1
> > +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10.0.0.222 42.42.42.2
> > +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10::222 42::2
> > +
> > +# Inject ARP request for first router owned NAT address.
> > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 111)
> > +
> > +# Verify that the ARP request is sent only to rtr1.
> > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.111,arp_op=1"
> > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > +
> > +as hv1
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > +    grep n_packets=1 -c)
> > +    test "1" = "${pkts_to_rtr1}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > +    grep n_packets=1 -c)
> > +    test "0" = "${pkts_to_rtr2}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > +    test "0" = "${pkts_flooded}"
> > +])
> > +
> > +# Inject ND_NS for first router owned NAT address.
> > +src_ipv6=00100000000000000000000000000254
> > +dst_ipv6=00100000000000000000000000000111
> > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > +
> > +# Verify that the ND_NS is sent only to rtr1.
> > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::111"
> > +
> > +as hv1
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > +    grep n_packets=1 -c)
> > +    test "1" = "${pkts_to_rtr1}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > +    grep n_packets=1 -c)
> > +    test "0" = "${pkts_to_rtr2}"
> > +])
> > +OVS_WAIT_UNTIL([
> > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > +    test "0" = "${pkts_flooded}"
> > +])
> > +
> > +OVN_CLEANUP([hv1])
> > +AT_CLEANUP
> >
Han Zhou Nov. 12, 2019, 7:49 p.m. UTC | #3
On Tue, Nov 12, 2019 at 10:10 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On Tue, Nov 12, 2019 at 6:17 PM Han Zhou <zhouhan@gmail.com> wrote:
> >
> >
> >
> > On Tue, Nov 12, 2019 at 2:29 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > >
> > > ARP request and ND NS packets for router owned IPs were being
> > > > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > > However this creates a scaling issue in scenarios where aggregation
> > > logical switches are connected to more logical routers (~350). The
> > > logical pipelines of all routers would have to be executed before the
> > > packet is finally replied to by a single router, the owner of the IP
> > > address.
> > >
> > > > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > > for ARP requests that will be replied by a single router. The packets
> > > are forwarded only to the router port that owns the target IP address.
> > >
> > > IPs that are owned by the routers and for which this fix applies are:
> > > - IP addresses configured on the router ports.
> > > - VIPs.
> > > - NAT IPs.
> > >
> > > Reported-at: https://bugzilla.redhat.com/1756945
> > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > >
> > > ---
> > > v7:
> > > - Address Han's comments:
> > >     - Remove flooding for all ARPs received on VLAN networks. To avoid
> > >       that we now identify self originated (G)ARPs by matching on source
> > >       MAC address too.
> > >     - Rename REGBIT_NOT_VXLAN to FLAGBIT_NOT_VXLAN.
> > > - Fix ovn-sb manpage.
> > > - Split patch in a series of 2:
> > >     - patch1: fixes the get_router_load_balancer_ips() function.
> > >     - patch2: limits the ARP/ND broadcast domain.
> > > v6:
> > > - Address Han's comments:
> > >     - remove flooding of ARPs targeting OVN owned IP addresses.
> > >     - update ovn-architecture documentation.
> > >     - rename ARP handling functions.
> > >     - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
> > >     account the new way of forwarding ARPs.
> > > - Also, properly deal with ARP packets on VLAN-backed networks.
> > > v5: Address Numan's comments: update comments & make autotest more
> > >     robust.
> > > v4: Rebase.
> > > v3: Properly deal with VXLAN traffic. Address review comments from
> > >     Numan (add autotests). Fix function get_router_load_balancer_ips.
> > >     Rebase -> deal with IPv6 NAT too.
> > > v2: Move ARP broadcast domain limiting to table S_SWITCH_IN_L2_LKUP to
> > > address localnet ports too.
> > > ---
> > >  northd/ovn-northd.8.xml |   14 ++
> > >  northd/ovn-northd.c     |  230 +++++++++++++++++++++++++++++++----
> > >  ovn-architecture.7.xml  |   19 +++
> > >  tests/ovn.at            |  307 +++++++++++++++++++++++++++++++++++++++++++++--
> > >  4 files changed, 530 insertions(+), 40 deletions(-)
> > >
> > > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > > index 0a33dcd..344cc0d 100644
> > > --- a/northd/ovn-northd.8.xml
> > > +++ b/northd/ovn-northd.8.xml
> > > @@ -1005,6 +1005,20 @@ output;
> > >        </li>
> > >
> > >        <li>
> > > +        Priority-80 flows for each port connected to a logical router
> > > +        matching self originated GARP/ARP request/ND packets. These packets
> > > +        are flooded to the <code>MC_FLOOD</code> which contains all logical
> > > +        ports.
> > > +      </li>
> > > +
> > > +      <li>
> > > +        Priority-75 flows for each IP address/VIP/NAT address owned by a
> > > +        router port connected to the switch. These flows match ARP requests
> > > +        and ND packets for the specific IP addresses.  Matched packets are
> > > +        forwarded only to the router that owns the IP address.
> > > +      </li>
> > > +
> > > +      <li>
> > >          A priority-70 flow that outputs all packets with an Ethernet broadcast
> > >          or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
> > >          multicast group.
> > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > > index 32f3200..d6beb97 100644
> > > --- a/northd/ovn-northd.c
> > > +++ b/northd/ovn-northd.c
> > > @@ -210,6 +210,8 @@ enum ovn_stage {
> > >  #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
> > >  #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
> > >
> > > +#define FLAGBIT_NOT_VXLAN "flags[1] == 0"
> > > +
> > >  /* Returns an "enum ovn_stage" built from the arguments. */
> > >  static enum ovn_stage
> > >  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
> > > @@ -1202,6 +1204,34 @@ ovn_port_allocate_key(struct ovn_datapath *od)
> > >                            1, (1u << 15) - 1, &od->port_key_hint);
> > >  }
> > >
> > > +/* Returns true if the logical switch port 'enabled' column is empty or
> > > + * set to true.  Otherwise, returns false. */
> > > +static bool
> > > +lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> > > +{
> > > +    return !lsp->n_enabled || *lsp->enabled;
> > > +}
> > > +
> > > +/* Returns true only if the logical switch port 'up' column is set to true.
> > > + * Otherwise, if the column is not set or set to false, returns false. */
> > > +static bool
> > > +lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> > > +{
> > > +    return lsp->n_up && *lsp->up;
> > > +}
> > > +
> > > +static bool
> > > +lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> > > +{
> > > +    return !strcmp(nbsp->type, "external");
> > > +}
> > > +
> > > +static bool
> > > +lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> > > +{
> > > +    return !lrport->enabled || *lrport->enabled;
> > > +}
> > > +
> > >  static char *
> > >  chassis_redirect_name(const char *port_name)
> > >  {
> > > @@ -3750,28 +3780,6 @@ build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
> > >
> > >  }
> > >
> > > -/* Returns true if the logical switch port 'enabled' column is empty or
> > > - * set to true.  Otherwise, returns false. */
> > > -static bool
> > > -lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> > > -{
> > > -    return !lsp->n_enabled || *lsp->enabled;
> > > -}
> > > -
> > > -/* Returns true only if the logical switch port 'up' column is set to true.
> > > - * Otherwise, if the column is not set or set to false, returns false. */
> > > -static bool
> > > -lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> > > -{
> > > -    return lsp->n_up && *lsp->up;
> > > -}
> > > -
> > > -static bool
> > > -lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> > > -{
> > > -    return !strcmp(nbsp->type, "external");
> > > -}
> > > -
> > >  static bool
> > >  build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
> > >                      struct ds *options_action, struct ds *response_action,
> > > @@ -5174,6 +5182,170 @@ build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
> > >      }
> > >  }
> > >
> > > +/*
> > > + * Ingress table 17: Flows that flood self originated ARP/ND packets in the
> > > + * switching domain.
> > > + */
> > > +static void
> > > +build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,
> > > +                                           uint32_t priority,
> > > +                                           struct ovn_datapath *od,
> > > +                                           struct hmap *lflows)
> > > +{
> > > +    struct ds match = DS_EMPTY_INITIALIZER;
> > > +    struct ds eth_src = DS_EMPTY_INITIALIZER;
> > > +
> > > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> > > +     * Determine that packets are self originated by also matching on
> > > +     * source MAC. Matching on ingress port is not reliable in case this
> > > +     * is a VLAN-backed network.
> > > +     * Priority: 80.
> > > +     */
> > > +    ds_put_format(&eth_src, "{ %s, ", op->lrp_networks.ea_s);
> > > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> > > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> > > +
> > > +        if (!nat->external_mac) {
> > > +            continue;
> > > +        }
> > > +
> > > +        ds_put_format(&eth_src, "%s, ", nat->external_mac);
> > > +    }
> >
> > As discussed, we need to add the chassis-unique MACs that are configured in
> > external-ids:ovn-chassis-mac-mappings of the Chassis records, but I didn't
> > find this in the patch. VLAN-backed logical routers may not work without this.
>
> Hi Han,
>
> Maybe I misunderstood but in the discussion on v6 I mentioned that I
> don't think we need to add the MACs from
> external-ids:ovn-chassis-mac-mappings.
>
> Whenever chassis MACs are configured, in ovn-controller we create a
> conjunctive flow matching on any of the remote chassis MAC addresses:
> https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L501
>
> And for all incoming traffic that matches this conjunction and VLAN-id
> we change the MAC back to that of the logical router port:
> https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L558
>
> Isn't this enough to cover the self originated ARP packets?
>
> Thanks,
> Dumitru
>

Dumitru, sorry, I misunderstood: you actually meant that it is OK not to
add the chassis-unique MACs. I also didn't realize that there are already
flows that change the chassis-unique MACs back to the logical router
port's MACs.
With this precondition I think your patch should be good enough.
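
For reference, the priority-80 match that the patch builds can be sketched in
shell roughly as follows. This only mirrors the string construction in
build_lswitch_rport_arp_req_self_orig_flow(); the helper name and the second
MAC (standing in for a NAT external_mac) are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch: rebuilds the match string the
# way the C code does (router port MAC first, then any NAT external MACs).
build_self_orig_match() {
    # $1 = router port MAC; remaining args = NAT external MACs (may be none)
    set_body="$1"
    shift
    for mac in "$@"; do
        set_body="$set_body, $mac"
    done
    # ds_chomp() strips the trailing ", ", so "}" follows the last MAC directly.
    printf 'eth.src == { %s} && (arp.op == 1 || nd_ns)\n' "$set_body"
}

# 00:00:00:00:01:11 is an assumed NAT external_mac, for illustration only.
build_self_orig_match 00:00:00:00:01:00 00:00:00:00:01:11
```

Only the router port MAC and NAT external MACs end up in the set, so the
approach relies on chassis MACs having been rewritten before this table is
reached.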

However, I revisited the function put_replace_chassis_mac_flows() and had
some difficulty understanding how it would work. For these flows, the match
conditions are:
- in_port of the localnet port
- conjunction id: CHASSIS_MAC_TO_ROUTER_MAC_CONJID (value 100)
- VLAN tag associated with the localnet port
The flow is added in a loop over the peer ports, to replace the MAC for each
router port on that logical switch. Since the match conditions are all the
same, wouldn't that result in only one flow taking effect and the others
getting dropped? I wonder whether some other port-specific match condition
should be added so that the MAC can be replaced back with that of its
original router port.
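
The concern can be illustrated with a toy model of a flow table keyed on
(priority, match). This is only a sketch and not ovn-controller code: the
priority, in_port and VLAN values are made up, and keeping the last entry
rather than the first is an assumption of the model.

```shell
#!/bin/sh
# Toy flow table: one "priority|match|action" entry per line.
flows=""

add_flow() {
    # $1 = priority, $2 = match, $3 = action
    key="$1|$2"
    # Drop any existing entry with the same priority and match.
    kept=$(printf '%s\n' "$flows" | grep -v -F -- "$key|" | grep . || true)
    if [ -n "$kept" ]; then
        flows="$kept
$key|$3"
    else
        flows="$key|$3"
    fi
}

count_flows() {
    printf '%s\n' "$flows" | grep -c . || true
}

# Two router ports on the same switch: same match, different rewrite action.
match="in_port=5,conj_id=100,dl_vlan=10"
add_flow 180 "$match" "mod_dl_src:00:00:00:00:01:00"
add_flow 180 "$match" "mod_dl_src:00:00:00:00:02:00"

count_flows              # prints 1: only one of the two flows survives
printf '%s\n' "$flows"
```

If the real table behaves like this, only one router port's MAC would ever be
restored, which is why a port-specific match condition seems necessary.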

Cc'ing Ankur, the author of the VLAN-backed router support, to help clarify.

This question is not directly related to the current patch. So for the
patch:
Acked-by: Han Zhou <hzhou@ovn.org>

I think it is better to wait until the above question is confirmed before
merging it.

> >
> > > +    ds_chomp(&eth_src, ' ');
> > > +    ds_chomp(&eth_src, ',');
> > > +    ds_put_cstr(&eth_src, "}");
> > > +
> > > +    ds_put_format(&match, "eth.src == %s && (arp.op == 1 || nd_ns)",
> > > +                  ds_cstr(&eth_src));
> > > +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> > > +                  ds_cstr(&match),
> > > +                  "outport = \""MC_FLOOD"\"; output;");
> > > +
> > > +    ds_destroy(&match);
> > > +    ds_destroy(&eth_src);
> > > +}
> > > +
> > > +/*
> > > + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
> > > + * that own the addresses. Other ARP/ND packets are still flooded in the
> > > + * switching domain as regular broadcast.
> > > + */
> > > +static void
> > > +build_lswitch_rport_arp_req_flow_for_ip(struct sset *ips,
> > > +                                        int addr_family,
> > > +                                        struct ovn_port *patch_op,
> > > +                                        struct ovn_datapath *od,
> > > +                                        uint32_t priority,
> > > +                                        struct hmap *lflows)
> > > +{
> > > +    struct ds match   = DS_EMPTY_INITIALIZER;
> > > +    struct ds actions = DS_EMPTY_INITIALIZER;
> > > +
> > > +    /* Packets received from VXLAN tunnels have already been through the
> > > +     * router pipeline so we should skip them. Normally this is done by the
> > > +     * multicast_group implementation (VXLAN packets skip table 32 which
> > > +     * delivers to patch ports) but we're bypassing multicast_groups.
> > > +     */
> > > +    ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
> > > +
> > > +    if (addr_family == AF_INET) {
> > > +        ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
> > > +    } else {
> > > +        ds_put_cstr(&match, "nd_ns && nd.target == { ");
> > > +    }
> > > +
> > > +    const char *ip_address;
> > > +    SSET_FOR_EACH (ip_address, ips) {
> > > +        ds_put_format(&match, "%s, ", ip_address);
> > > +    }
> > > +
> > > +    ds_chomp(&match, ' ');
> > > +    ds_chomp(&match, ',');
> > > +    ds_put_cstr(&match, "}");
> > > +
> > > +    /* Send the packet only to the router pipeline and skip flooding it
> > > +     * in the broadcast domain.
> > > +     */
> > > +    ds_put_format(&actions, "outport = %s; output;", patch_op->json_key);
> > > +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> > > +                  ds_cstr(&match), ds_cstr(&actions));
> > > +
> > > +    ds_destroy(&match);
> > > +    ds_destroy(&actions);
> > > +}
> > > +
> > > +/*
> > > + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
> > > + * that own the addresses.
> > > + * Priorities:
> > > + * - 80: self originated GARPs that need to follow regular processing.
> > > + * - 75: ARP requests to router owned IPs (interface IP/LB/NAT).
> > > + */
> > > +static void
> > > +build_lswitch_rport_arp_req_flows(struct ovn_port *op,
> > > +                                  struct ovn_datapath *sw_od,
> > > +                                  struct ovn_port *sw_op,
> > > +                                  struct hmap *lflows)
> > > +{
> > > +    if (!op || !op->nbrp) {
> > > +        return;
> > > +    }
> > > +
> > > +    if (!lrport_is_enabled(op->nbrp)) {
> > > +        return;
> > > +    }
> > > +
> > > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> > > +     * Priority: 80.
> > > +     */
> > > +    build_lswitch_rport_arp_req_self_orig_flow(op, 80, sw_od, lflows);
> > > +
> > > +    /* Forward ARP requests for owned IP addresses (L3, VIP, NAT) only to this
> > > +     * router port.
> > > +     * Priority: 75.
> > > +     */
> > > +    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
> > > +    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
> > > +
> > > +    for (size_t i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
> > > +        sset_add(&all_ips_v4, op->lrp_networks.ipv4_addrs[i].addr_s);
> > > +    }
> > > +    for (size_t i = 0; i < op->lrp_networks.n_ipv6_addrs; i++) {
> > > +        sset_add(&all_ips_v6, op->lrp_networks.ipv6_addrs[i].addr_s);
> > > +    }
> > > +
> > > +    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
> > > +
> > > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> > > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> > > +
> > > +        if (!strcmp(nat->type, "snat")) {
> > > +            continue;
> > > +        }
> > > +
> > > +        ovs_be32 ip;
> > > +        ovs_be32 mask;
> > > +        struct in6_addr ipv6;
> > > +        struct in6_addr mask_v6;
> > > +
> > > +        if (ip_parse_masked(nat->external_ip, &ip, &mask)) {
> > > +            if (!ipv6_parse_masked(nat->external_ip, &ipv6, &mask_v6)) {
> > > +                sset_add(&all_ips_v6, nat->external_ip);
> > > +            }
> > > +        } else {
> > > +            sset_add(&all_ips_v4, nat->external_ip);
> > > +        }
> > > +    }
> > > +
> > > +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v4, AF_INET, sw_op,
> > > +                                            sw_od, 75, lflows);
> > > +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v6, AF_INET6, sw_op,
> > > +                                            sw_od, 75, lflows);
> > > +
> > > +    sset_destroy(&all_ips_v4);
> > > +    sset_destroy(&all_ips_v6);
> > > +}
> > > +
> > >  static void
> > >  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
> > >                      struct hmap *port_groups, struct hmap *lflows,
> > > @@ -5761,6 +5933,14 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
> > >              continue;
> > >          }
> > >
> > > +        /* For ports connected to logical routers add flows to bypass the
> > > +         * broadcast flooding of ARP/ND requests in table 17. We direct the
> > > +         * requests only to the router port that owns the IP address.
> > > +         */
> > > +        if (!strcmp(op->nbsp->type, "router")) {
> > > +            build_lswitch_rport_arp_req_flows(op->peer, op->od, op, lflows);
> > > +        }
> > > +
> > >          for (size_t i = 0; i < op->nbsp->n_addresses; i++) {
> > >              /* Addresses are owned by the logical port.
> > >               * Ethernet address followed by zero or more IPv4
> > > @@ -5892,12 +6072,6 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
> > >      ds_destroy(&actions);
> > >  }
> > >
> > > -static bool
> > > -lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> > > -{
> > > -    return !lrport->enabled || *lrport->enabled;
> > > -}
> > > -
> > >  /* Returns a string of the IP address of the router port 'op' that
> > >   * overlaps with 'ip_s".  If one is not found, returns NULL.
> > >   *
> > > diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
> > > index 7966b65..c43f16d 100644
> > > --- a/ovn-architecture.7.xml
> > > +++ b/ovn-architecture.7.xml
> > > @@ -1390,6 +1390,25 @@
> > >      http://docs.openvswitch.org/en/latest/topics/high-availability.
> > >    </p>
> > >
> > > +  <h3>ARP request and ND NS packet processing</h3>
> > > +
> > > +  <p>
> > > +    Due to the fact that ARP requests and ND NS packets are usually broadcast
> > > +    packets, for performance reasons, OVN deals with requests that target OVN
> > > +    owned IP addresses (i.e., IP addresses configured on the router ports,
> > > +    VIPs, NAT IPs) in a specific way and only forwards them to the logical
> > > +    router that owns the target IP address. This behavior is different than
> > > +    that of traditional switches and implies that other routers/hosts
> > > +    connected to the logical switch will not learn the MAC/IP binding from
> > > +    the request packet.
> > > +  </p>
> > > +
> > > +  <p>
> > > +    All other ARP and ND packets are flooded in the L2 broadcast domain and
> > > +    to all attached logical patch ports.
> > > +  </p>
> > > +
> > > +
> > >    <h2>Multiple localnet logical switches connected to a Logical Router</h2>
> > >
> > >    <p>
> > > diff --git a/tests/ovn.at b/tests/ovn.at
> > > index 3e429e3..26e33d2 100644
> > > --- a/tests/ovn.at
> > > +++ b/tests/ovn.at
> > > @@ -2877,7 +2877,7 @@ test_ip() {
> > >      done
> > >  }
> > >
> > > -# test_arp INPORT SHA SPA TPA [REPLY_HA]
> > > +# test_arp INPORT SHA SPA TPA FLOOD [REPLY_HA]
> > >  #
> > >  # Causes a packet to be received on INPORT.  The packet is an ARP
> > >  # request with SHA, SPA, and TPA as specified.  If REPLY_HA is provided, then
> > > @@ -2888,21 +2888,25 @@ test_ip() {
> > >  # SHA and REPLY_HA are each 12 hex digits.
> > >  # SPA and TPA are each 8 hex digits.
> > >  test_arp() {
> > > -    local inport=$1 sha=$2 spa=$3 tpa=$4 reply_ha=$5
> > > +    local inport=$1 sha=$2 spa=$3 tpa=$4 flood=$5 reply_ha=$6
> > >      local request=ffffffffffff${sha}08060001080006040001${sha}${spa}ffffffffffff${tpa}
> > >      hv=hv`vif_to_hv $inport`
> > >      as $hv ovs-appctl netdev-dummy/receive vif$inport $request
> > >      as $hv ovs-appctl ofproto/trace br-int in_port=$inport $request
> > >
> > >      # Expect to receive the broadcast ARP on the other logical switch ports if
> > > -    # IP address is not configured to the switch patch port.
> > > +    # IP address is not configured on the switch patch port or on the router
> > > +    # port (i.e., $flood == 1).
> > >      local i=`vif_to_ls $inport`
> > >      local j k
> > >      for j in 1 2 3; do
> > >          for k in 1 2 3; do
> > > -            # 192.168.33.254 is configured to the switch patch port for lrp33,
> > > -            # so no ARP flooding expected for it.
> > > -            if test $i$j$k != $inport && test $tpa != `ip_to_hex 192 168 33 254`; then
> > > +            # Skip ingress port.
> > > +            if test $i$j$k == $inport; then
> > > +                continue
> > > +            fi
> > > +
> > > +            if test X$flood == X1; then
> > >                  echo $request >> $i$j$k.expected
> > >              fi
> > >          done
> > > @@ -3039,9 +3043,9 @@ for i in 1 2 3; do
> > >        otherip=`ip_to_hex 192 168 $i$j 55` # Some other IP in subnet
> > >        externalip=`ip_to_hex 1 2 3 4`      # Some other IP not in subnet
> > >
> > > -      test_arp $i$j$k $smac $sip        $rip        $rmac      #4
> > > -      test_arp $i$j$k $smac $otherip    $rip        $rmac      #5
> > > -      test_arp $i$j$k $smac $sip        $otherip               #6
> > > +      test_arp $i$j$k $smac $sip        $rip       0     $rmac       #4
> > > +      test_arp $i$j$k $smac $otherip    $rip       0     $rmac       #5
> > > +      test_arp $i$j$k $smac $sip        $otherip   1                 #6
> > >
> > >        # When rip is 192.168.33.254, ARP request from externalip won't be
> > >        # filtered, because 192.168.33.254 is configured to switch peer port
peer port
> > > @@ -3050,7 +3054,7 @@ for i in 1 2 3; do
> > >        if test $i = 3 && test $j = 3; then
> > >          lrp33_rsp=$rmac
> > >        fi
> > > -      test_arp $i$j$k $smac $externalip $rip        $lrp33_rsp #7
> > > +      test_arp $i$j$k $smac $externalip $rip       0      $lrp33_rsp #7
> > >
> > >        # MAC binding should be learned from ARP request.
> > >        host_mac_pretty=f0:00:00:00:0$i:$j$k
> > > @@ -9595,7 +9599,7 @@ ovn-nbctl --wait=hv --timeout=3 sync
> > >  # Check that there is a logical flow in logical switch foo's pipeline
> > >  # to set the outport to rp-foo (which is expected).
> > >  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
> > > -grep rp-foo | grep -v is_chassis_resident | wc -l`])
> > > +grep rp-foo | grep -v is_chassis_resident | grep priority=50 -c`])
> > >
> > >  # Set the option 'reside-on-redirect-chassis' for foo
> > >  ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
> > > @@ -9603,7 +9607,7 @@ ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
> > >  # to set the outport to rp-foo with the condition is_chassis_redirect.
> > >  ovn-sbctl dump-flows foo
> > >  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
> > > -grep rp-foo | grep is_chassis_resident | wc -l`])
> > > +grep rp-foo | grep is_chassis_resident | grep priority=50 -c`])
> > >
> > >  echo "---------NB dump-----"
> > >  ovn-nbctl show
> > > @@ -16694,3 +16698,282 @@ as hv4 ovs-appctl fdb/show br-phys
> > >  OVN_CLEANUP([hv1],[hv2],[hv3],[hv4])
> > >
> > >  AT_CLEANUP
> > > +
> > > +AT_SETUP([ovn -- ARP/ND request broadcast limiting])
> > > +AT_SKIP_IF([test $HAVE_PYTHON = no])
> > > +ovn_start
> > > +
> > > +ip_to_hex() {
> > > +    printf "%02x%02x%02x%02x" "$@"
> > > +}
> > > +
> > > +send_arp_request() {
> > > +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5
> > > +    local eth_dst=ffffffffffff
> > > +    local eth_type=0806
> > > +    local eth=${eth_dst}${eth_src}${eth_type}
> > > +
> > > +    local arp=0001080006040001${eth_src}${spa}${eth_dst}${tpa}
> > > +
> > > +    local request=${eth}${arp}
> > > +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
> > > +}
> > > +
> > > +send_nd_ns() {
> > > +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5 cksum=$6
> > > +
> > > +    local eth_dst=ffffffffffff
> > > +    local eth_type=86dd
> > > +    local eth=${eth_dst}${eth_src}${eth_type}
> > > +
> > > +    local ip_vhlen=60000000
> > > +    local ip_plen=0020
> > > +    local ip_next=3a
> > > +    local ip_ttl=ff
> > > +    local ip=${ip_vhlen}${ip_plen}${ip_next}${ip_ttl}${spa}${tpa}
> > > +
> > > +    # Neighbor Solicitation
> > > +    local icmp6_type=87
> > > +    local icmp6_code=00
> > > +    local icmp6_rsvd=00000000
> > > +    # ICMPv6 source lla option
> > > +    local icmp6_opt=01
> > > +    local icmp6_optlen=01
> > > +    local icmp6=${icmp6_type}${icmp6_code}${cksum}${icmp6_rsvd}${tpa}${icmp6_opt}${icmp6_optlen}${eth_src}
> > > +
> > > +    local request=${eth}${ip}${icmp6}
> > > +
> > > +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
> > > +}
> > > +
> > > +src_mac=000000000001
> > > +
> > > +net_add n1
> > > +sim_add hv1
> > > +as hv1
> > > +ovs-vsctl add-br br-phys
> > > +ovn_attach n1 br-phys 192.168.0.1
> > > +
> > > +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> > > +    set interface hv1-vif1 external-ids:iface-id=sw-agg-ext \
> > > +    options:tx_pcap=hv1/vif1-tx.pcap \
> > > +    options:rxq_pcap=hv1/vif1-rx.pcap \
> > > +    ofport-request=1
> > > +
> > > +# One Aggregation Switch connected to two Logical networks (routers).
> > > +ovn-nbctl ls-add sw-agg
> > > +ovn-nbctl lsp-add sw-agg sw-agg-ext \
> > > +    -- lsp-set-addresses sw-agg-ext 00:00:00:00:00:01
> > > +
> > > +ovn-nbctl lsp-add sw-agg sw-rtr1                   \
> > > +    -- lsp-set-type sw-rtr1 router                 \
> > > +    -- lsp-set-addresses sw-rtr1 00:00:00:00:01:00 \
> > > +    -- lsp-set-options sw-rtr1 router-port=rtr1-sw
> > > +ovn-nbctl lsp-add sw-agg sw-rtr2                   \
> > > +    -- lsp-set-type sw-rtr2 router                 \
> > > +    -- lsp-set-addresses sw-rtr2 00:00:00:00:02:00 \
> > > +    -- lsp-set-options sw-rtr2 router-port=rtr2-sw
> > > +
> > > +# Configure L3 interface IPv4 & IPv6 on both routers
> > > +ovn-nbctl lr-add rtr1
> > > +ovn-nbctl lrp-add rtr1 rtr1-sw 00:00:00:00:01:00 10.0.0.1/24 10::1/64
> > > +
> > > +ovn-nbctl lr-add rtr2
> > > +ovn-nbctl lrp-add rtr2 rtr2-sw 00:00:00:00:02:00 10.0.0.2/24 10::2/64
> > > +
> > > +OVN_POPULATE_ARP
> > > +ovn-nbctl --wait=hv sync
> > > +
> > > +sw_dp_uuid=$(ovn-sbctl --bare --columns _uuid list datapath_binding sw-agg)
> > > +sw_dp_key=$(ovn-sbctl --bare --columns tunnel_key list datapath_binding sw-agg)
> > > +
> > > +r1_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr1)
> > > +r2_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr2)
> > > +
> > > +mc_key=$(ovn-sbctl --bare --columns tunnel_key find multicast_group datapath=${sw_dp_uuid} name="_MC_flood")
> > > +mc_key=$(printf "%04x" $mc_key)
> > > +
> > > +match_sw_metadata="metadata=0x${sw_dp_key}"
> > > +
> > > +# Inject ARP request for first router owned IP address.
> > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 1)
> > > +
> > > +# Verify that the ARP request is sent only to rtr1.
> > > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.1,arp_op=1"
> > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > > +
> > > +as hv1
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "1" = "${pkts_to_rtr1}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "0" = "${pkts_to_rtr2}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > +    test "0" = "${pkts_flooded}"
> > > +])
> > > +
> > > +# Inject ND_NS for first router owned IP address.
> > > +src_ipv6=00100000000000000000000000000254
> > > +dst_ipv6=00100000000000000000000000000001
> > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > > +
> > > +# Verify that the ND_NS is sent only to rtr1.
> > > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::1"
> > > +
> > > +as hv1
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "1" = "${pkts_to_rtr1}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "0" = "${pkts_to_rtr2}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > +    test "0" = "${pkts_flooded}"
> > > +])
> > > +
> > > +# Configure load balancing on both routers.
> > > +ovn-nbctl lb-add lb1-v4 10.0.0.11 42.42.42.1
> > > +ovn-nbctl lb-add lb1-v6 10::11 42::1
> > > +ovn-nbctl lr-lb-add rtr1 lb1-v4
> > > +ovn-nbctl lr-lb-add rtr1 lb1-v6
> > > +
> > > +ovn-nbctl lb-add lb2-v4 10.0.0.22 42.42.42.2
> > > +ovn-nbctl lb-add lb2-v6 10::22 42::2
> > > +ovn-nbctl lr-lb-add rtr2 lb2-v4
> > > +ovn-nbctl lr-lb-add rtr2 lb2-v6
> > > +ovn-nbctl --wait=hv sync
> > > +
> > > +# Inject ARP request for first router owned VIP address.
> > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 11)
> > > +
> > > +# Verify that the ARP request is sent only to rtr1.
> > > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.11,arp_op=1"
> > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > > +
> > > +as hv1
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "1" = "${pkts_to_rtr1}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "0" = "${pkts_to_rtr2}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > +    test "0" = "${pkts_flooded}"
> > > +])
> > > +
> > > +# Inject ND_NS for first router owned VIP address.
> > > +src_ipv6=00100000000000000000000000000254
> > > +dst_ipv6=00100000000000000000000000000011
> > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > > +
> > > +# Verify that the ND_NS is sent only to rtr1.
> > > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::11"
> > > +
> > > +as hv1
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "1" = "${pkts_to_rtr1}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "0" = "${pkts_to_rtr2}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > +    test "0" = "${pkts_flooded}"
> > > +])
> > > +
> > > +# Configure NAT on both routers
> > > +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10.0.0.111 42.42.42.1
> > > +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10::111 42::1
> > > +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10.0.0.222 42.42.42.2
> > > +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10::222 42::2
> > > +
> > > +# Inject ARP request for first router owned NAT address.
> > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 111)
> > > +
> > > +# Verify that the ARP request is sent only to rtr1.
> > > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.111,arp_op=1"
> > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > > +
> > > +as hv1
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "1" = "${pkts_to_rtr1}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "0" = "${pkts_to_rtr2}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > +    test "0" = "${pkts_flooded}"
> > > +])
> > > +
> > > +# Inject ND_NS for first router owned IP address.
> > > +src_ipv6=00100000000000000000000000000254
> > > +dst_ipv6=00100000000000000000000000000111
> > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > > +
> > > +# Verify that the ND_NS is sent only to rtr1.
> > > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::111"
> > > +
> > > +as hv1
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "1" = "${pkts_to_rtr1}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > > +    grep n_packets=1 -c)
> > > +    test "0" = "${pkts_to_rtr2}"
> > > +])
> > > +OVS_WAIT_UNTIL([
> > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > +    test "0" = "${pkts_flooded}"
> > > +])
> > > +
> > > +OVN_CLEANUP([hv1])
> > > +AT_CLEANUP
> > >
> > > _______________________________________________
> > > dev mailing list
> > > dev@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Dumitru Ceara Nov. 13, 2019, 10:42 a.m. UTC | #4
On Tue, Nov 12, 2019 at 8:50 PM Han Zhou <hzhou@ovn.org> wrote:
>
>
>
>
> On Tue, Nov 12, 2019 at 10:10 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > On Tue, Nov 12, 2019 at 6:17 PM Han Zhou <zhouhan@gmail.com> wrote:
> > >
> > >
> > >
> > > On Tue, Nov 12, 2019 at 2:29 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > > >
> > > > ARP request and ND NS packets for router owned IPs were being
> > > > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > > > However this creates a scaling issue in scenarios where aggregation
> > > > logical switches are connected to more logical routers (~350). The
> > > > logical pipelines of all routers would have to be executed before the
> > > > packet is finally replied to by a single router, the owner of the IP
> > > > address.
> > > >
> > > > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > > > for ARP requests that will be replied by a single router. The packets
> > > > are forwarded only to the router port that owns the target IP address.
> > > >
> > > > IPs that are owned by the routers and for which this fix applies are:
> > > > - IP addresses configured on the router ports.
> > > > - VIPs.
> > > > - NAT IPs.
> > > >
> > > > Reported-at: https://bugzilla.redhat.com/1756945
> > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > > >
> > > > ---
> > > > v7:
> > > > - Address Han's comments:
> > > >     - Remove flooding for all ARPs received on VLAN networks. To avoid
> > > >       that we now identify self originated (G)ARPs by matching on source
> > > >       MAC address too.
> > > >     - Rename REGBIT_NOT_VXLAN to FLAGBIT_NOT_VXLAN.
> > > > - Fix ovn-sb manpage.
> > > > - Split patch in a series of 2:
> > > >     - patch1: fixes the get_router_load_balancer_ips() function.
> > > >     - patch2: limits the ARP/ND broadcast domain.
> > > > v6:
> > > > - Address Han's comments:
> > > >     - remove flooding of ARPs targeting OVN owned IP addresses.
> > > >     - update ovn-architecture documentation.
> > > >     - rename ARP handling functions.
> > > >     - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
> > > >     account the new way of forwarding ARPs.
> > > > - Also, properly deal with ARP packets on VLAN-backed networks.
> > > > v5: Address Numan's comments: update comments & make autotest more
> > > >     robust.
> > > > v4: Rebase.
> > > > v3: Properly deal with VXLAN traffic. Address review comments from
> > > >     Numan (add autotests). Fix function get_router_load_balancer_ips.
> > > >     Rebase -> deal with IPv6 NAT too.
> > > > v2: Move ARP broadcast domain limiting to table S_SWITCH_IN_L2_LKUP to
> > > > address localnet ports too.
> > > > ---
> > > >  northd/ovn-northd.8.xml |   14 ++
> > > >  northd/ovn-northd.c     |  230 +++++++++++++++++++++++++++++++----
> > > >  ovn-architecture.7.xml  |   19 +++
> > > >  tests/ovn.at            |  307 +++++++++++++++++++++++++++++++++++++++++++++--
> > > >  4 files changed, 530 insertions(+), 40 deletions(-)
> > > >
> > > > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > > > index 0a33dcd..344cc0d 100644
> > > > --- a/northd/ovn-northd.8.xml
> > > > +++ b/northd/ovn-northd.8.xml
> > > > @@ -1005,6 +1005,20 @@ output;
> > > >        </li>
> > > >
> > > >        <li>
> > > > +        Priority-80 flows for each port connected to a logical router
> > > > +        matching self originated GARP/ARP request/ND packets. These packets
> > > > +        are flooded to the <code>MC_FLOOD</code> which contains all logical
> > > > +        ports.
> > > > +      </li>
> > > > +
> > > > +      <li>
> > > > +        Priority-75 flows for each IP address/VIP/NAT address owned by a
> > > > +        router port connected to the switch. These flows match ARP requests
> > > > +        and ND packets for the specific IP addresses.  Matched packets are
> > > > +        forwarded only to the router that owns the IP address.
> > > > +      </li>
> > > > +
> > > > +      <li>
> > > >          A priority-70 flow that outputs all packets with an Ethernet broadcast
> > > >          or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
> > > >          multicast group.
> > > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > > > index 32f3200..d6beb97 100644
> > > > --- a/northd/ovn-northd.c
> > > > +++ b/northd/ovn-northd.c
> > > > @@ -210,6 +210,8 @@ enum ovn_stage {
> > > >  #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
> > > >  #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
> > > >
> > > > +#define FLAGBIT_NOT_VXLAN "flags[1] == 0"
> > > > +
> > > >  /* Returns an "enum ovn_stage" built from the arguments. */
> > > >  static enum ovn_stage
> > > >  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
> > > > @@ -1202,6 +1204,34 @@ ovn_port_allocate_key(struct ovn_datapath *od)
> > > >                            1, (1u << 15) - 1, &od->port_key_hint);
> > > >  }
> > > >
> > > > +/* Returns true if the logical switch port 'enabled' column is empty or
> > > > + * set to true.  Otherwise, returns false. */
> > > > +static bool
> > > > +lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> > > > +{
> > > > +    return !lsp->n_enabled || *lsp->enabled;
> > > > +}
> > > > +
> > > > +/* Returns true only if the logical switch port 'up' column is set to true.
> > > > + * Otherwise, if the column is not set or set to false, returns false. */
> > > > +static bool
> > > > +lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> > > > +{
> > > > +    return lsp->n_up && *lsp->up;
> > > > +}
> > > > +
> > > > +static bool
> > > > +lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> > > > +{
> > > > +    return !strcmp(nbsp->type, "external");
> > > > +}
> > > > +
> > > > +static bool
> > > > +lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> > > > +{
> > > > +    return !lrport->enabled || *lrport->enabled;
> > > > +}
> > > > +
> > > >  static char *
> > > >  chassis_redirect_name(const char *port_name)
> > > >  {
> > > > @@ -3750,28 +3780,6 @@ build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
> > > >
> > > >  }
> > > >
> > > > -/* Returns true if the logical switch port 'enabled' column is empty or
> > > > - * set to true.  Otherwise, returns false. */
> > > > -static bool
> > > > -lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> > > > -{
> > > > -    return !lsp->n_enabled || *lsp->enabled;
> > > > -}
> > > > -
> > > > -/* Returns true only if the logical switch port 'up' column is set to true.
> > > > - * Otherwise, if the column is not set or set to false, returns false. */
> > > > -static bool
> > > > -lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> > > > -{
> > > > -    return lsp->n_up && *lsp->up;
> > > > -}
> > > > -
> > > > -static bool
> > > > -lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> > > > -{
> > > > -    return !strcmp(nbsp->type, "external");
> > > > -}
> > > > -
> > > >  static bool
> > > >  build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
> > > >                      struct ds *options_action, struct ds *response_action,
> > > > @@ -5174,6 +5182,170 @@ build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
> > > >      }
> > > >  }
> > > >
> > > > +/*
> > > > + * Ingress table 17: Flows that flood self originated ARP/ND packets in the
> > > > + * switching domain.
> > > > + */
> > > > +static void
> > > > +build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,
> > > > +                                           uint32_t priority,
> > > > +                                           struct ovn_datapath *od,
> > > > +                                           struct hmap *lflows)
> > > > +{
> > > > +    struct ds match = DS_EMPTY_INITIALIZER;
> > > > +    struct ds eth_src = DS_EMPTY_INITIALIZER;
> > > > +
> > > > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> > > > +     * Determine that packets are self originated by also matching on
> > > > +     * source MAC. Matching on ingress port is not reliable in case this
> > > > +     * is a VLAN-backed network.
> > > > +     * Priority: 80.
> > > > +     */
> > > > +    ds_put_format(&eth_src, "{ %s, ", op->lrp_networks.ea_s);
> > > > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> > > > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> > > > +
> > > > +        if (!nat->external_mac) {
> > > > +            continue;
> > > > +        }
> > > > +
> > > > +        ds_put_format(&eth_src, "%s, ", nat->external_mac);
> > > > +    }
> > >
> > > As discussed we need to add chassis unique MAC that are configured in external-ids:ovn-chassis-mac-mappings of Chassis records, but I didn't find this in the patch. VLAN backed logical router may not work without this.
> >
> > Hi Han,
> >
> > Maybe I misunderstood but in the discussion on v6 I mentioned that I
> > don't think we need to add the MACs from
> > external-ids:ovn-chassis-mac-mappings.
> >
> > Whenever chassis MACs are configured, in ovn-controller we create a
> > conjunctive flow matching on any of the remote chassis MAC addresses:
> > https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L501
> >
> > And for all incoming traffic that matches this conjunction and VLAN-id
> > we change the MAC back to that of the logical router port:
> > https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L558
> >
> > Isn't this enough to cover the self originated ARP packets?
> >
> > Thanks,
> > Dumitru
> >
>
> Dumitru, sorry, I misunderstood; you actually meant it was OK not to add the chassis unique MACs. Also, I didn't realize that there are already flows to change the chassis unique MACs back to the logical router port's MACs.
> With this precondition I think your patch should be good enough.
>
> However, I revisited the function put_replace_chassis_mac_flows() and had some difficulty understanding how it would work. For these flows, the match conditions are:
> - in_port of the localnet port
> - conjunction id: CHASSIS_MAC_TO_ROUTER_MAC_CONJID (value 100)
> - vlan tag associated with the localnet port
> The flow is added in a loop for each peer port to replace the MAC for each router port on that logical switch. Since the match conditions are all the same, wouldn't it result in only one flow taking effect and the others getting dropped? I wonder if any other port-specific match condition should be added so that the MAC can be replaced back to its original router port MAC accordingly.

If I understand correctly the differentiator is the ingress localnet
port id and VLAN-ID.

Looking at the autotest for "ovn -- 2 HVs, 2 lports/HV, localnet
ports, DVR chassis mac" the network is:

$ ovn-nbctl --db=unix:$PWD/./ovn-nb/ovn-nb.sock show
switch df662f28-4a42-4ac4-aadb-89563347cae1 (ls1)
    port ls1-to-router
        type: router
        router-port: router-to-ls1
    port lp11
        addresses: ["f0:00:00:00:00:11 192.168.1.1"]
    port ln1
        type: localnet
        parent:
        tag: 101
        addresses: ["unknown"]
switch 91ddb7b2-df8c-42de-a899-6bb35ee08a16 (ls2)
    port ls2-to-router
        type: router
        router-port: router-to-ls2
    port lp22
        addresses: ["f0:00:00:00:00:22 192.168.2.2"]
    port ln2
        type: localnet
        parent:
        tag: 201
        addresses: ["unknown"]
router 4186cb04-3370-401c-9d65-29cb2af48af1 (router)
    port router-to-ls1
        mac: "00:00:01:01:02:03"
        networks: ["192.168.1.3/24"]
    port router-to-ls2
        mac: "00:00:01:01:02:05"
        networks: ["192.168.2.3/24"]

$ ovn-sbctl --db=unix:$PWD/./ovn-sb/ovn-sb.sock list chassis | grep -E "uuid|external_ids"
_uuid               : 31ba7484-e0af-4326-a71b-c3f32e52e547
external_ids        : {datapath-type="",
iface-types="dummy,dummy-internal,dummy-pmd,erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan",
ovn-bridge-mappings="phys:br-phys",
ovn-chassis-mac-mappings="phys:aa:bb:cc:dd:ee:11", ovn-cms-options=""}

_uuid               : a04b5ad2-d76f-42c4-9f2b-21816e2624e2
external_ids        : {datapath-type="",
iface-types="dummy,dummy-internal,dummy-pmd,erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan",
ovn-bridge-mappings="phys:br-phys",
ovn-chassis-mac-mappings="phys:aa:bb:cc:dd:ee:22", ovn-cms-options=""}

On HV1 (chassis-mac aa:bb:cc:dd:ee:11) we have the following flow to
replace source MAC for already routed packets:

$ OVS_RUNDIR=$PWD/hv1 ovs-ofctl dump-flows br-int | grep aa:bb:cc:dd:ee:11
 cookie=0xe0f6b198, duration=580.460s, table=65, n_packets=0,
n_bytes=0, idle_age=580,
priority=150,reg15=0x1,metadata=0x1,dl_src=00:00:01:01:02:03
actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:101,output:1
 cookie=0xfddc60a, duration=580.420s, table=65, n_packets=1,
n_bytes=42, idle_age=580,
priority=150,reg15=0x1,metadata=0x2,dl_src=00:00:01:01:02:05
actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:201,output:3

In the above output, the first flow is for packets destined to hosts
in LS1 (VLAN 101) and the second flow is for packets destined to hosts
in LS2 (VLAN 201).
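
As a rough illustration (not part of the patch), the per-VLAN rewrite can be read off a saved copy of the two flows above. The variable names `hv1_flows` and `rewrites` and the sed expression are made up for this sketch, and the flow text is trimmed to the match and actions:

```shell
# Trimmed copy of the two table=65 flows from the HV1 dump above.
hv1_flows='priority=150,reg15=0x1,metadata=0x1,dl_src=00:00:01:01:02:03 actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:101,output:1
priority=150,reg15=0x1,metadata=0x2,dl_src=00:00:01:01:02:05 actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:201,output:3'

# For each flow, print "<router port MAC> -> <chassis MAC> vlan <VLAN id>".
rewrites=$(printf '%s\n' "$hv1_flows" | \
    sed -n 's/.*dl_src=\([0-9a-f:]*\) actions=mod_dl_src:\([0-9a-f:]*\),mod_vlan_vid:\([0-9]*\),.*/\1 -> \2 vlan \3/p')
printf '%s\n' "$rewrites"
```

Both router port MACs map to the same chassis MAC, so the VLAN tag is what still tells the two networks apart once the packet is on the wire.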

If we inject a packet from lp11:
in_port=lp11, eth.src=f0:00:00:00:00:11, eth.dst=00:00:01:01:02:03,
ip.src=192.168.1.1, ip.dst=192.168.2.2

The router pipeline is executed on HV1, eth.src is set to the address of
the router-to-ls2 port (00:00:01:01:02:05), and finally the entry in
table=65 is hit and eth.src is changed to the configured chassis-mac
(aa:bb:cc:dd:ee:11).

The packet is sent out on port ln2:
$ OVS_RUNDIR=$PWD/hv1 ovs-vsctl --column ofport find interface name=patch-br-int-to-ln2
ofport              : 3

Then it is received on HV2, where we have the following flows:
$ OVS_RUNDIR=$PWD/hv2 ovs-ofctl dump-flows br-int | grep conj
 cookie=0x31ba7484, duration=1545.430s, table=0, n_packets=0,
n_bytes=0, idle_age=1545, priority=180,dl_src=aa:bb:cc:dd:ee:11
actions=conjunction(100,1/2)
 cookie=0x2c6a49b3, duration=1545.368s, table=0, n_packets=1,
n_bytes=46, idle_age=1545,
priority=180,conj_id=100,in_port=2,dl_vlan=201
actions=strip_vlan,load:0x4->NXM_NX_REG13[],load:0x3->NXM_NX_REG11[],load:0x2->NXM_NX_REG12[],load:0x2->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:00:00:01:01:02:05,resubmit(,8)
 cookie=0xd8937e1d, duration=1545.332s, table=0, n_packets=0,
n_bytes=0, idle_age=1545,
priority=180,conj_id=100,in_port=3,dl_vlan=101
actions=strip_vlan,load:0x9->NXM_NX_REG13[],load:0x5->NXM_NX_REG11[],load:0x6->NXM_NX_REG12[],load:0x1->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:00:00:01:01:02:03,resubmit(,8)
 cookie=0x2c6a49b3, duration=1545.368s, table=0, n_packets=0,
n_bytes=0, idle_age=1545, priority=180,dl_vlan=201
actions=conjunction(100,2/2)
 cookie=0xd8937e1d, duration=1545.332s, table=0, n_packets=0,
n_bytes=0, idle_age=1545, priority=180,dl_vlan=101
actions=conjunction(100,2/2)

The packet matches:
- clause 1 of the conjunction (100) because eth.src is the chassis-mac
of HV1 (aa:bb:cc:dd:ee:11).
- clause 2 of the conjunction because VLAN_ID is 201.

And then matches this flow that fixes the eth.src in the packet to
that of router-to-ls2:
priority=180,conj_id=100,in_port=2,dl_vlan=201
actions=.....,mod_dl_src:00:00:01:01:02:05,....
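
To make the point about the differentiator concrete, here is a small sketch that works on a saved, trimmed copy of the two conj_id flows from the HV2 dump. The helper name `restored_mac` and the variable `hv2_flows` are hypothetical, and the actions are shortened to the parts that matter here:

```shell
# Trimmed copy of the conj_id=100 flows from the HV2 dump above.
hv2_flows='priority=180,conj_id=100,in_port=2,dl_vlan=201 actions=strip_vlan,mod_dl_src:00:00:01:01:02:05,resubmit(,8)
priority=180,conj_id=100,in_port=3,dl_vlan=101 actions=strip_vlan,mod_dl_src:00:00:01:01:02:03,resubmit(,8)'

# restored_mac IN_PORT VLAN
# Print the router port MAC that eth.src is rewritten to for a packet that
# satisfied conjunction 100 and arrived on IN_PORT tagged with VLAN.
restored_mac() {
    printf '%s\n' "$hv2_flows" | \
        grep "conj_id=100,in_port=$1,dl_vlan=$2 " | \
        sed -n 's/.*mod_dl_src:\([0-9a-f:]*\),.*/\1/p'
}

restored_mac 2 201
restored_mac 3 101
```

Each (in_port, dl_vlan) pair selects exactly one flow in this dump, so the flows added in the loop do not shadow each other even though they share the same conj_id.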

>
> cc Ankur who is the author of VLAN backed router to help clarify.
>
> This question is not directly related to the current patch. So for the patch:
> Acked-by: Han Zhou <hzhou@ovn.org>
>
> I think it is better to wait until the above question is confirmed before merging it.

Sure, thanks again for reviewing this!

>
> > >
> > > > +    ds_chomp(&eth_src, ' ');
> > > > +    ds_chomp(&eth_src, ',');
> > > > +    ds_put_cstr(&eth_src, "}");
> > > > +
> > > > +    ds_put_format(&match, "eth.src == %s && (arp.op == 1 || nd_ns)",
> > > > +                  ds_cstr(&eth_src));
> > > > +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> > > > +                  ds_cstr(&match),
> > > > +                  "outport = \""MC_FLOOD"\"; output;");
> > > > +
> > > > +    ds_destroy(&match);
> > > > +    ds_destroy(&eth_src);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
> > > > + * that own the addresses. Other ARP/ND packets are still flooded in the
> > > > + * switching domain as regular broadcast.
> > > > + */
> > > > +static void
> > > > +build_lswitch_rport_arp_req_flow_for_ip(struct sset *ips,
> > > > +                                        int addr_family,
> > > > +                                        struct ovn_port *patch_op,
> > > > +                                        struct ovn_datapath *od,
> > > > +                                        uint32_t priority,
> > > > +                                        struct hmap *lflows)
> > > > +{
> > > > +    struct ds match   = DS_EMPTY_INITIALIZER;
> > > > +    struct ds actions = DS_EMPTY_INITIALIZER;
> > > > +
> > > > +    /* Packets received from VXLAN tunnels have already been through the
> > > > +     * router pipeline so we should skip them. Normally this is done by the
> > > > +     * multicast_group implementation (VXLAN packets skip table 32 which
> > > > +     * delivers to patch ports) but we're bypassing multicast_groups.
> > > > +     */
> > > > +    ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
> > > > +
> > > > +    if (addr_family == AF_INET) {
> > > > +        ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
> > > > +    } else {
> > > > +        ds_put_cstr(&match, "nd_ns && nd.target == { ");
> > > > +    }
> > > > +
> > > > +    const char *ip_address;
> > > > +    SSET_FOR_EACH (ip_address, ips) {
> > > > +        ds_put_format(&match, "%s, ", ip_address);
> > > > +    }
> > > > +
> > > > +    ds_chomp(&match, ' ');
> > > > +    ds_chomp(&match, ',');
> > > > +    ds_put_cstr(&match, "}");
> > > > +
> > > > > +    /* Send the packet only to the router pipeline and skip flooding it
> > > > +     * in the broadcast domain.
> > > > +     */
> > > > +    ds_put_format(&actions, "outport = %s; output;", patch_op->json_key);
> > > > +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> > > > +                  ds_cstr(&match), ds_cstr(&actions));
> > > > +
> > > > +    ds_destroy(&match);
> > > > +    ds_destroy(&actions);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
> > > > + * that own the addresses.
> > > > + * Priorities:
> > > > + * - 80: self originated GARPs that need to follow regular processing.
> > > > + * - 75: ARP requests to router owned IPs (interface IP/LB/NAT).
> > > > + */
> > > > +static void
> > > > +build_lswitch_rport_arp_req_flows(struct ovn_port *op,
> > > > +                                  struct ovn_datapath *sw_od,
> > > > +                                  struct ovn_port *sw_op,
> > > > +                                  struct hmap *lflows)
> > > > +{
> > > > +    if (!op || !op->nbrp) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    if (!lrport_is_enabled(op->nbrp)) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> > > > +     * Priority: 80.
> > > > +     */
> > > > +    build_lswitch_rport_arp_req_self_orig_flow(op, 80, sw_od, lflows);
> > > > +
> > > > +    /* Forward ARP requests for owned IP addresses (L3, VIP, NAT) only to this
> > > > +     * router port.
> > > > +     * Priority: 75.
> > > > +     */
> > > > +    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
> > > > +    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
> > > > +
> > > > +    for (size_t i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
> > > > +        sset_add(&all_ips_v4, op->lrp_networks.ipv4_addrs[i].addr_s);
> > > > +    }
> > > > +    for (size_t i = 0; i < op->lrp_networks.n_ipv6_addrs; i++) {
> > > > +        sset_add(&all_ips_v6, op->lrp_networks.ipv6_addrs[i].addr_s);
> > > > +    }
> > > > +
> > > > +    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
> > > > +
> > > > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> > > > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> > > > +
> > > > +        if (!strcmp(nat->type, "snat")) {
> > > > +            continue;
> > > > +        }
> > > > +
> > > > +        ovs_be32 ip;
> > > > +        ovs_be32 mask;
> > > > +        struct in6_addr ipv6;
> > > > +        struct in6_addr mask_v6;
> > > > +
> > > > +        if (ip_parse_masked(nat->external_ip, &ip, &mask)) {
> > > > +            if (!ipv6_parse_masked(nat->external_ip, &ipv6, &mask_v6)) {
> > > > +                sset_add(&all_ips_v6, nat->external_ip);
> > > > +            }
> > > > +        } else {
> > > > +            sset_add(&all_ips_v4, nat->external_ip);
> > > > +        }
> > > > +    }
> > > > +
> > > > +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v4, AF_INET, sw_op,
> > > > +                                            sw_od, 75, lflows);
> > > > +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v6, AF_INET6, sw_op,
> > > > +                                            sw_od, 75, lflows);
> > > > +
> > > > +    sset_destroy(&all_ips_v4);
> > > > +    sset_destroy(&all_ips_v6);
> > > > +}
> > > > +
> > > >  static void
> > > >  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
> > > >                      struct hmap *port_groups, struct hmap *lflows,
> > > > @@ -5761,6 +5933,14 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
> > > >              continue;
> > > >          }
> > > >
> > > > +        /* For ports connected to logical routers add flows to bypass the
> > > > +         * broadcast flooding of ARP/ND requests in table 17. We direct the
> > > > +         * requests only to the router port that owns the IP address.
> > > > +         */
> > > > +        if (!strcmp(op->nbsp->type, "router")) {
> > > > +            build_lswitch_rport_arp_req_flows(op->peer, op->od, op, lflows);
> > > > +        }
> > > > +
> > > >          for (size_t i = 0; i < op->nbsp->n_addresses; i++) {
> > > >              /* Addresses are owned by the logical port.
> > > >               * Ethernet address followed by zero or more IPv4
> > > > @@ -5892,12 +6072,6 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
> > > >      ds_destroy(&actions);
> > > >  }
> > > >
> > > > -static bool
> > > > -lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> > > > -{
> > > > -    return !lrport->enabled || *lrport->enabled;
> > > > -}
> > > > -
> > > >  /* Returns a string of the IP address of the router port 'op' that
> > > >   * overlaps with 'ip_s".  If one is not found, returns NULL.
> > > >   *
> > > > diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
> > > > index 7966b65..c43f16d 100644
> > > > --- a/ovn-architecture.7.xml
> > > > +++ b/ovn-architecture.7.xml
> > > > @@ -1390,6 +1390,25 @@
> > > >      http://docs.openvswitch.org/en/latest/topics/high-availability.
> > > >    </p>
> > > >
> > > > +  <h3>ARP request and ND NS packet processing</h3>
> > > > +
> > > > +  <p>
> > > > > +    Due to the fact that ARP requests and ND NS packets are usually broadcast
> > > > +    packets, for performance reasons, OVN deals with requests that target OVN
> > > > +    owned IP addresses (i.e., IP addresses configured on the router ports,
> > > > +    VIPs, NAT IPs) in a specific way and only forwards them to the logical
> > > > +    router that owns the target IP address. This behavior is different than
> > > > > +    that of traditional switches and implies that other routers/hosts
> > > > +    connected to the logical switch will not learn the MAC/IP binding from
> > > > +    the request packet.
> > > > +  </p>
> > > > +
> > > > +  <p>
> > > > +    All other ARP and ND packets are flooded in the L2 broadcast domain and
> > > > +    to all attached logical patch ports.
> > > > +  </p>
> > > > +
> > > > +
> > > >    <h2>Multiple localnet logical switches connected to a Logical Router</h2>
> > > >
> > > >    <p>
> > > > diff --git a/tests/ovn.at b/tests/ovn.at
> > > > index 3e429e3..26e33d2 100644
> > > > --- a/tests/ovn.at
> > > > +++ b/tests/ovn.at
> > > > @@ -2877,7 +2877,7 @@ test_ip() {
> > > >      done
> > > >  }
> > > >
> > > > -# test_arp INPORT SHA SPA TPA [REPLY_HA]
> > > > +# test_arp INPORT SHA SPA TPA FLOOD [REPLY_HA]
> > > >  #
> > > >  # Causes a packet to be received on INPORT.  The packet is an ARP
> > > >  # request with SHA, SPA, and TPA as specified.  If REPLY_HA is provided, then
> > > > @@ -2888,21 +2888,25 @@ test_ip() {
> > > >  # SHA and REPLY_HA are each 12 hex digits.
> > > >  # SPA and TPA are each 8 hex digits.
> > > >  test_arp() {
> > > > -    local inport=$1 sha=$2 spa=$3 tpa=$4 reply_ha=$5
> > > > +    local inport=$1 sha=$2 spa=$3 tpa=$4 flood=$5 reply_ha=$6
> > > >      local request=ffffffffffff${sha}08060001080006040001${sha}${spa}ffffffffffff${tpa}
> > > >      hv=hv`vif_to_hv $inport`
> > > >      as $hv ovs-appctl netdev-dummy/receive vif$inport $request
> > > >      as $hv ovs-appctl ofproto/trace br-int in_port=$inport $request
> > > >
> > > >      # Expect to receive the broadcast ARP on the other logical switch ports if
> > > > -    # IP address is not configured to the switch patch port.
> > > > +    # IP address is not configured on the switch patch port or on the router
> > > > > +    # port (i.e., $flood == 1).
> > > >      local i=`vif_to_ls $inport`
> > > >      local j k
> > > >      for j in 1 2 3; do
> > > >          for k in 1 2 3; do
> > > > -            # 192.168.33.254 is configured to the switch patch port for lrp33,
> > > > -            # so no ARP flooding expected for it.
> > > > -            if test $i$j$k != $inport && test $tpa != `ip_to_hex 192 168 33 254`; then
> > > > +            # Skip ingress port.
> > > > +            if test $i$j$k == $inport; then
> > > > +                continue
> > > > +            fi
> > > > +
> > > > +            if test X$flood == X1; then
> > > >                  echo $request >> $i$j$k.expected
> > > >              fi
> > > >          done
> > > > @@ -3039,9 +3043,9 @@ for i in 1 2 3; do
> > > >        otherip=`ip_to_hex 192 168 $i$j 55` # Some other IP in subnet
> > > >        externalip=`ip_to_hex 1 2 3 4`      # Some other IP not in subnet
> > > >
> > > > -      test_arp $i$j$k $smac $sip        $rip        $rmac      #4
> > > > -      test_arp $i$j$k $smac $otherip    $rip        $rmac      #5
> > > > -      test_arp $i$j$k $smac $sip        $otherip               #6
> > > > +      test_arp $i$j$k $smac $sip        $rip       0     $rmac       #4
> > > > +      test_arp $i$j$k $smac $otherip    $rip       0     $rmac       #5
> > > > +      test_arp $i$j$k $smac $sip        $otherip   1                 #6
> > > >
> > > >        # When rip is 192.168.33.254, ARP request from externalip won't be
> > > >        # filtered, because 192.168.33.254 is configured to switch peer port
> > > > @@ -3050,7 +3054,7 @@ for i in 1 2 3; do
> > > >        if test $i = 3 && test $j = 3; then
> > > >          lrp33_rsp=$rmac
> > > >        fi
> > > > -      test_arp $i$j$k $smac $externalip $rip        $lrp33_rsp #7
> > > > +      test_arp $i$j$k $smac $externalip $rip       0      $lrp33_rsp #7
> > > >
> > > >        # MAC binding should be learned from ARP request.
> > > >        host_mac_pretty=f0:00:00:00:0$i:$j$k
> > > > @@ -9595,7 +9599,7 @@ ovn-nbctl --wait=hv --timeout=3 sync
> > > >  # Check that there is a logical flow in logical switch foo's pipeline
> > > >  # to set the outport to rp-foo (which is expected).
> > > >  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
> > > > -grep rp-foo | grep -v is_chassis_resident | wc -l`])
> > > > +grep rp-foo | grep -v is_chassis_resident | grep priority=50 -c`])
> > > >
> > > >  # Set the option 'reside-on-redirect-chassis' for foo
> > > >  ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
> > > > @@ -9603,7 +9607,7 @@ ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
> > > >  # to set the outport to rp-foo with the condition is_chassis_redirect.
> > > >  ovn-sbctl dump-flows foo
> > > >  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
> > > > -grep rp-foo | grep is_chassis_resident | wc -l`])
> > > > +grep rp-foo | grep is_chassis_resident | grep priority=50 -c`])
> > > >
> > > >  echo "---------NB dump-----"
> > > >  ovn-nbctl show
> > > > @@ -16694,3 +16698,282 @@ as hv4 ovs-appctl fdb/show br-phys
> > > >  OVN_CLEANUP([hv1],[hv2],[hv3],[hv4])
> > > >
> > > >  AT_CLEANUP
> > > > +
> > > > +AT_SETUP([ovn -- ARP/ND request broadcast limiting])
> > > > +AT_SKIP_IF([test $HAVE_PYTHON = no])
> > > > +ovn_start
> > > > +
> > > > +ip_to_hex() {
> > > > +    printf "%02x%02x%02x%02x" "$@"
> > > > +}
> > > > +
> > > > +send_arp_request() {
> > > > +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5
> > > > +    local eth_dst=ffffffffffff
> > > > +    local eth_type=0806
> > > > +    local eth=${eth_dst}${eth_src}${eth_type}
> > > > +
> > > > +    local arp=0001080006040001${eth_src}${spa}${eth_dst}${tpa}
> > > > +
> > > > +    local request=${eth}${arp}
> > > > +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
> > > > +}
> > > > +
> > > > +send_nd_ns() {
> > > > +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5 cksum=$6
> > > > +
> > > > +    local eth_dst=ffffffffffff
> > > > +    local eth_type=86dd
> > > > +    local eth=${eth_dst}${eth_src}${eth_type}
> > > > +
> > > > +    local ip_vhlen=60000000
> > > > +    local ip_plen=0020
> > > > +    local ip_next=3a
> > > > +    local ip_ttl=ff
> > > > +    local ip=${ip_vhlen}${ip_plen}${ip_next}${ip_ttl}${spa}${tpa}
> > > > +
> > > > +    # Neighbor Solicitation
> > > > +    local icmp6_type=87
> > > > +    local icmp6_code=00
> > > > +    local icmp6_rsvd=00000000
> > > > +    # ICMPv6 source lla option
> > > > +    local icmp6_opt=01
> > > > +    local icmp6_optlen=01
> > > > +    local icmp6=${icmp6_type}${icmp6_code}${cksum}${icmp6_rsvd}${tpa}${icmp6_opt}${icmp6_optlen}${eth_src}
> > > > +
> > > > +    local request=${eth}${ip}${icmp6}
> > > > +
> > > > +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
> > > > +}
> > > > +
> > > > +src_mac=000000000001
> > > > +
> > > > +net_add n1
> > > > +sim_add hv1
> > > > +as hv1
> > > > +ovs-vsctl add-br br-phys
> > > > +ovn_attach n1 br-phys 192.168.0.1
> > > > +
> > > > +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> > > > +    set interface hv1-vif1 external-ids:iface-id=sw-agg-ext \
> > > > +    options:tx_pcap=hv1/vif1-tx.pcap \
> > > > +    options:rxq_pcap=hv1/vif1-rx.pcap \
> > > > +    ofport-request=1
> > > > +
> > > > +# One Aggregation Switch connected to two Logical networks (routers).
> > > > +ovn-nbctl ls-add sw-agg
> > > > +ovn-nbctl lsp-add sw-agg sw-agg-ext \
> > > > +    -- lsp-set-addresses sw-agg-ext 00:00:00:00:00:01
> > > > +
> > > > +ovn-nbctl lsp-add sw-agg sw-rtr1                   \
> > > > +    -- lsp-set-type sw-rtr1 router                 \
> > > > +    -- lsp-set-addresses sw-rtr1 00:00:00:00:01:00 \
> > > > +    -- lsp-set-options sw-rtr1 router-port=rtr1-sw
> > > > +ovn-nbctl lsp-add sw-agg sw-rtr2                   \
> > > > +    -- lsp-set-type sw-rtr2 router                 \
> > > > +    -- lsp-set-addresses sw-rtr2 00:00:00:00:02:00 \
> > > > +    -- lsp-set-options sw-rtr2 router-port=rtr2-sw
> > > > +
> > > > +# Configure L3 interface IPv4 & IPv6 on both routers
> > > > +ovn-nbctl lr-add rtr1
> > > > +ovn-nbctl lrp-add rtr1 rtr1-sw 00:00:00:00:01:00 10.0.0.1/24 10::1/64
> > > > +
> > > > +ovn-nbctl lr-add rtr2
> > > > +ovn-nbctl lrp-add rtr2 rtr2-sw 00:00:00:00:02:00 10.0.0.2/24 10::2/64
> > > > +
> > > > +OVN_POPULATE_ARP
> > > > +ovn-nbctl --wait=hv sync
> > > > +
> > > > +sw_dp_uuid=$(ovn-sbctl --bare --columns _uuid list datapath_binding sw-agg)
> > > > +sw_dp_key=$(ovn-sbctl --bare --columns tunnel_key list datapath_binding sw-agg)
> > > > +
> > > > +r1_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr1)
> > > > +r2_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr2)
> > > > +
> > > > +mc_key=$(ovn-sbctl --bare --columns tunnel_key find multicast_group datapath=${sw_dp_uuid} name="_MC_flood")
> > > > +mc_key=$(printf "%04x" $mc_key)
> > > > +
> > > > +match_sw_metadata="metadata=0x${sw_dp_key}"
> > > > +
> > > > +# Inject ARP request for first router owned IP address.
> > > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 1)
> > > > +
> > > > +# Verify that the ARP request is sent only to rtr1.
> > > > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.1,arp_op=1"
> > > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > > > +
> > > > +as hv1
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "1" = "${pkts_to_rtr1}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "0" = "${pkts_to_rtr2}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > > +    test "0" = "${pkts_flooded}"
> > > > +])
> > > > +
> > > > +# Inject ND_NS for first router owned IP address.
> > > > +src_ipv6=00100000000000000000000000000254
> > > > +dst_ipv6=00100000000000000000000000000001
> > > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > > > +
> > > > +# Verify that the ND_NS is sent only to rtr1.
> > > > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::1"
> > > > +
> > > > +as hv1
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "1" = "${pkts_to_rtr1}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "0" = "${pkts_to_rtr2}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > > +    test "0" = "${pkts_flooded}"
> > > > +])
> > > > +
> > > > +# Configure load balancing on both routers.
> > > > +ovn-nbctl lb-add lb1-v4 10.0.0.11 42.42.42.1
> > > > +ovn-nbctl lb-add lb1-v6 10::11 42::1
> > > > +ovn-nbctl lr-lb-add rtr1 lb1-v4
> > > > +ovn-nbctl lr-lb-add rtr1 lb1-v6
> > > > +
> > > > +ovn-nbctl lb-add lb2-v4 10.0.0.22 42.42.42.2
> > > > +ovn-nbctl lb-add lb2-v6 10::22 42::2
> > > > +ovn-nbctl lr-lb-add rtr2 lb2-v4
> > > > +ovn-nbctl lr-lb-add rtr2 lb2-v6
> > > > +ovn-nbctl --wait=hv sync
> > > > +
> > > > +# Inject ARP request for first router owned VIP address.
> > > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 11)
> > > > +
> > > > +# Verify that the ARP request is sent only to rtr1.
> > > > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.11,arp_op=1"
> > > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > > > +
> > > > +as hv1
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "1" = "${pkts_to_rtr1}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "0" = "${pkts_to_rtr2}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > > +    test "0" = "${pkts_flooded}"
> > > > +])
> > > > +
> > > > +# Inject ND_NS for first router owned VIP address.
> > > > +src_ipv6=00100000000000000000000000000254
> > > > +dst_ipv6=00100000000000000000000000000011
> > > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > > > +
> > > > +# Verify that the ND_NS is sent only to rtr1.
> > > > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::11"
> > > > +
> > > > +as hv1
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "1" = "${pkts_to_rtr1}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "0" = "${pkts_to_rtr2}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > > +    test "0" = "${pkts_flooded}"
> > > > +])
> > > > +
> > > > +# Configure NAT on both routers
> > > > +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10.0.0.111 42.42.42.1
> > > > +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10::111 42::1
> > > > +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10.0.0.222 42.42.42.2
> > > > +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10::222 42::2
> > > > +
> > > > +# Inject ARP request for first router owned NAT address.
> > > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 111)
> > > > +
> > > > +# Verify that the ARP request is sent only to rtr1.
> > > > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.111,arp_op=1"
> > > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > > > +
> > > > +as hv1
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "1" = "${pkts_to_rtr1}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "0" = "${pkts_to_rtr2}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > > +    test "0" = "${pkts_flooded}"
> > > > +])
> > > > +
> > > > +# Inject ND_NS for first router owned NAT address.
> > > > +src_ipv6=00100000000000000000000000000254
> > > > +dst_ipv6=00100000000000000000000000000111
> > > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > > > +
> > > > +# Verify that the ND_NS is sent only to rtr1.
> > > > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::111"
> > > > +
> > > > +as hv1
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "1" = "${pkts_to_rtr1}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > > > +    grep n_packets=1 -c)
> > > > +    test "0" = "${pkts_to_rtr2}"
> > > > +])
> > > > +OVS_WAIT_UNTIL([
> > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> > > > +    test "0" = "${pkts_flooded}"
> > > > +])
> > > > +
> > > > +OVN_CLEANUP([hv1])
> > > > +AT_CLEANUP
> > > >
> >
Han Zhou Nov. 13, 2019, 3:30 p.m. UTC | #5
On Wed, Nov 13, 2019 at 2:42 AM Dumitru Ceara <dceara@redhat.com> wrote:

> On Tue, Nov 12, 2019 at 8:50 PM Han Zhou <hzhou@ovn.org> wrote:
> >
> >
> >
> >
> > On Tue, Nov 12, 2019 at 10:10 AM Dumitru Ceara <dceara@redhat.com>
> wrote:
> > >
> > > On Tue, Nov 12, 2019 at 6:17 PM Han Zhou <zhouhan@gmail.com> wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Nov 12, 2019 at 2:29 AM Dumitru Ceara <dceara@redhat.com>
> wrote:
> > > > >
> > > > > ARP request and ND NS packets for router owned IPs were being
> > > > > flooded in the complete L2 domain (using the MC_FLOOD multicast
> group).
> > > > > However this creates a scaling issue in scenarios where aggregation
> > > > > logical switches are connected to more logical routers (~350). The
> > > > > logical pipelines of all routers would have to be executed before
> the
> > > > > packet is finally replied to by a single router, the owner of the
> IP
> > > > > address.
> > > > >
> > > > > This commit limits the broadcast domain by bypassing the L2 Lookup
> stage
> > > > > > for ARP requests that will be replied to by a single router. The
> packets
> > > > > are forwarded only to the router port that owns the target IP
> address.
> > > > >
> > > > > IPs that are owned by the routers and for which this fix applies
> are:
> > > > > - IP addresses configured on the router ports.
> > > > > - VIPs.
> > > > > - NAT IPs.
> > > > >
> > > > > Reported-at: https://bugzilla.redhat.com/1756945
> > > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > > > >
> > > > > ---
> > > > > v7:
> > > > > - Address Han's comments:
> > > > >     - Remove flooding for all ARPs received on VLAN networks. To
> avoid
> > > > >       that we now identify self originated (G)ARPs by matching on
> source
> > > > >       MAC address too.
> > > > >     - Rename REGBIT_NOT_VXLAN to FLAGBIT_NOT_VXLAN.
> > > > > - Fix ovn-sb manpage.
> > > > > - Split patch in a series of 2:
> > > > >     - patch1: fixes the get_router_load_balancer_ips() function.
> > > > >     - patch2: limits the ARP/ND broadcast domain.
> > > > > v6:
> > > > > - Address Han's comments:
> > > > >     - remove flooding of ARPs targeting OVN owned IP addresses.
> > > > >     - update ovn-architecture documentation.
> > > > >     - rename ARP handling functions.
> > > > >     - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to
> take into
> > > > >     account the new way of forwarding ARPs.
> > > > > - Also, properly deal with ARP packets on VLAN-backed networks.
> > > > > v5: Address Numan's comments: update comments & make autotest more
> > > > >     robust.
> > > > > v4: Rebase.
> > > > > v3: Properly deal with VXLAN traffic. Address review comments from
> > > > >     Numan (add autotests). Fix function
> get_router_load_balancer_ips.
> > > > >     Rebase -> deal with IPv6 NAT too.
> > > > > v2: Move ARP broadcast domain limiting to table
> S_SWITCH_IN_L2_LKUP to
> > > > > address localnet ports too.
> > > > > ---
> > > > >  northd/ovn-northd.8.xml |   14 ++
> > > > >  northd/ovn-northd.c     |  230 +++++++++++++++++++++++++++++++----
> > > > >  ovn-architecture.7.xml  |   19 +++
> > > > >  tests/ovn.at            |  307
> +++++++++++++++++++++++++++++++++++++++++++++--
> > > > >  4 files changed, 530 insertions(+), 40 deletions(-)
> > > > >
> > > > > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > > > > index 0a33dcd..344cc0d 100644
> > > > > --- a/northd/ovn-northd.8.xml
> > > > > +++ b/northd/ovn-northd.8.xml
> > > > > @@ -1005,6 +1005,20 @@ output;
> > > > >        </li>
> > > > >
> > > > >        <li>
> > > > > +        Priority-80 flows for each port connected to a logical
> router
> > > > > +        matching self originated GARP/ARP request/ND packets.
> These packets
> > > > > +        are flooded to the <code>MC_FLOOD</code> which contains
> all logical
> > > > > +        ports.
> > > > > +      </li>
> > > > > +
> > > > > +      <li>
> > > > > +        Priority-75 flows for each IP address/VIP/NAT address
> owned by a
> > > > > +        router port connected to the switch. These flows match
> ARP requests
> > > > > +        and ND packets for the specific IP addresses.  Matched
> packets are
> > > > > +        forwarded only to the router that owns the IP address.
> > > > > +      </li>
> > > > > +
> > > > > +      <li>
> > > > >          A priority-70 flow that outputs all packets with an
> Ethernet broadcast
> > > > >          or multicast <code>eth.dst</code> to the
> <code>MC_FLOOD</code>
> > > > >          multicast group.
> > > > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > > > > index 32f3200..d6beb97 100644
> > > > > --- a/northd/ovn-northd.c
> > > > > +++ b/northd/ovn-northd.c
> > > > > @@ -210,6 +210,8 @@ enum ovn_stage {
> > > > >  #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
> > > > >  #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
> > > > >
> > > > > +#define FLAGBIT_NOT_VXLAN "flags[1] == 0"
> > > > > +
> > > > >  /* Returns an "enum ovn_stage" built from the arguments. */
> > > > >  static enum ovn_stage
> > > > >  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline
> pipeline,
> > > > > @@ -1202,6 +1204,34 @@ ovn_port_allocate_key(struct ovn_datapath
> *od)
> > > > >                            1, (1u << 15) - 1, &od->port_key_hint);
> > > > >  }
> > > > >
> > > > > +/* Returns true if the logical switch port 'enabled' column is
> empty or
> > > > > + * set to true.  Otherwise, returns false. */
> > > > > +static bool
> > > > > +lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> > > > > +{
> > > > > +    return !lsp->n_enabled || *lsp->enabled;
> > > > > +}
> > > > > +
> > > > > +/* Returns true only if the logical switch port 'up' column is
> set to true.
> > > > > + * Otherwise, if the column is not set or set to false, returns
> false. */
> > > > > +static bool
> > > > > +lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> > > > > +{
> > > > > +    return lsp->n_up && *lsp->up;
> > > > > +}
> > > > > +
> > > > > +static bool
> > > > > +lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> > > > > +{
> > > > > +    return !strcmp(nbsp->type, "external");
> > > > > +}
> > > > > +
> > > > > +static bool
> > > > > +lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> > > > > +{
> > > > > +    return !lrport->enabled || *lrport->enabled;
> > > > > +}
> > > > > +
> > > > >  static char *
> > > > >  chassis_redirect_name(const char *port_name)
> > > > >  {
> > > > > @@ -3750,28 +3780,6 @@ build_port_security_ip(enum ovn_pipeline
> pipeline, struct ovn_port *op,
> > > > >
> > > > >  }
> > > > >
> > > > > -/* Returns true if the logical switch port 'enabled' column is
> empty or
> > > > > - * set to true.  Otherwise, returns false. */
> > > > > -static bool
> > > > > -lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> > > > > -{
> > > > > -    return !lsp->n_enabled || *lsp->enabled;
> > > > > -}
> > > > > -
> > > > > -/* Returns true only if the logical switch port 'up' column is
> set to true.
> > > > > - * Otherwise, if the column is not set or set to false, returns
> false. */
> > > > > -static bool
> > > > > -lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> > > > > -{
> > > > > -    return lsp->n_up && *lsp->up;
> > > > > -}
> > > > > -
> > > > > -static bool
> > > > > -lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> > > > > -{
> > > > > -    return !strcmp(nbsp->type, "external");
> > > > > -}
> > > > > -
> > > > >  static bool
> > > > >  build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
> > > > >                      struct ds *options_action, struct ds
> *response_action,
> > > > > @@ -5174,6 +5182,170 @@ build_lrouter_groups(struct hmap *ports,
> struct ovs_list *lr_list)
> > > > >      }
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * Ingress table 17: Flows that flood self originated ARP/ND
> packets in the
> > > > > + * switching domain.
> > > > > + */
> > > > > +static void
> > > > > +build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,
> > > > > +                                           uint32_t priority,
> > > > > +                                           struct ovn_datapath
> *od,
> > > > > +                                           struct hmap *lflows)
> > > > > +{
> > > > > +    struct ds match = DS_EMPTY_INITIALIZER;
> > > > > +    struct ds eth_src = DS_EMPTY_INITIALIZER;
> > > > > +
> > > > > +    /* Self originated (G)ARP requests/ND need to be flooded as
> usual.
> > > > > +     * Determine that packets are self originated by also
> matching on
> > > > > +     * source MAC. Matching on ingress port is not reliable in
> case this
> > > > > +     * is a VLAN-backed network.
> > > > > +     * Priority: 80.
> > > > > +     */
> > > > > +    ds_put_format(&eth_src, "{ %s, ", op->lrp_networks.ea_s);
> > > > > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> > > > > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> > > > > +
> > > > > +        if (!nat->external_mac) {
> > > > > +            continue;
> > > > > +        }
> > > > > +
> > > > > +        ds_put_format(&eth_src, "%s, ", nat->external_mac);
> > > > > +    }
> > > >
> > > > As discussed we need to add chassis unique MAC that are configured
> in external-ids:ovn-chassis-mac-mappings of Chassis records, but I didn't
> find this in the patch. VLAN backed logical router may not work without
> this.
> > >
> > > Hi Han,
> > >
> > > Maybe I misunderstood but in the discussion on v6 I mentioned that I
> > > don't think we need to add the MACs from
> > > external-ids:ovn-chassis-mac-mappings.
> > >
> > > Whenever chassis MACs are configured, in ovn-controller we create a
> > > conjunctive flow matching on any of the remote chassis MAC addresses:
> > > https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L501
> > >
> > > And for all incoming traffic that matches this conjunction and VLAN-id
> > > we change the MAC back to that of the logical router port:
> > > https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L558
> > >
> > > Isn't this enough to cover the self originated ARP packets?
> > >
> > > Thanks,
> > > Dumitru
> > >
> >
> > Dumitru, sorry, I misunderstood: you actually meant it was ok
> not to add chassis unique MACs. Also I didn't realize that there are
> already flows to change the chassis unique MACs back to the logical router
> port's MACs.
> > With this precondition I think your patch should be good enough.
> >
> > However, I revisited the function put_replace_chassis_mac_flows() and
> had some difficulty understanding how it would work. For these flows, the
> match conditions are:
> > - in_port of the localnet port
> > - conjunction id: CHASSIS_MAC_TO_ROUTER_MAC_CONJID (value 100)
> > - vlan tag associated with the localnet port
> > The flow is added in a loop for each peer port to replace mac for each
> router port on that logical switch. Since the match condition is all the
> same, wouldn't it result in only one flow taking effect and others getting
> dropped? I wonder if any other port-specific match condition should be
> added so that MAC can be replaced back to its original router port mac
> accordingly.
>
> If I understand correctly the differentiator is the ingress localnet
> port id and VLAN-ID.
>
> Looking at the autotest for "ovn -- 2 HVs, 2 lports/HV, localnet
> ports, DVR chassis mac" the network is:
>
> $ ovn-nbctl --db=unix:$PWD/./ovn-nb/ovn-nb.sock show
> switch df662f28-4a42-4ac4-aadb-89563347cae1 (ls1)
>     port ls1-to-router
>         type: router
>         router-port: router-to-ls1
>     port lp11
>         addresses: ["f0:00:00:00:00:11 192.168.1.1"]
>     port ln1
>         type: localnet
>         parent:
>         tag: 101
>         addresses: ["unknown"]
> switch 91ddb7b2-df8c-42de-a899-6bb35ee08a16 (ls2)
>     port ls2-to-router
>         type: router
>         router-port: router-to-ls2
>     port lp22
>         addresses: ["f0:00:00:00:00:22 192.168.2.2"]
>     port ln2
>         type: localnet
>         parent:
>         tag: 201
>         addresses: ["unknown"]
> router 4186cb04-3370-401c-9d65-29cb2af48af1 (router)
>     port router-to-ls1
>         mac: "00:00:01:01:02:03"
>         networks: ["192.168.1.3/24"]
>     port router-to-ls2
>         mac: "00:00:01:01:02:05"
>         networks: ["192.168.2.3/24"]
>
> $ ovn-sbctl --db=unix:$PWD/./ovn-sb/ovn-sb.sock list chassis | grep -E
> "uuid|external_ids"
> _uuid               : 31ba7484-e0af-4326-a71b-c3f32e52e547
> external_ids        : {datapath-type="",
>
> iface-types="dummy,dummy-internal,dummy-pmd,erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan",
> ovn-bridge-mappings="phys:br-phys",
> ovn-chassis-mac-mappings="phys:aa:bb:cc:dd:ee:11", ovn-cms-options=""}
>
> _uuid               : a04b5ad2-d76f-42c4-9f2b-21816e2624e2
> external_ids        : {datapath-type="",
>
> iface-types="dummy,dummy-internal,dummy-pmd,erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan",
> ovn-bridge-mappings="phys:br-phys",
> ovn-chassis-mac-mappings="phys:aa:bb:cc:dd:ee:22", ovn-cms-options=""}
>
> On HV1 (chassis-mac aa:bb:cc:dd:ee:11) we have the following flow to
> replace source MAC for already routed packets:
>
> $ OVS_RUNDIR=$PWD/hv1 ovs-ofctl dump-flows br-int | grep
> aa:bb:cc:dd:ee:11
>  cookie=0xe0f6b198, duration=580.460s, table=65, n_packets=0,
> n_bytes=0, idle_age=580,
> priority=150,reg15=0x1,metadata=0x1,dl_src=00:00:01:01:02:03
> actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:101,output:1
>  cookie=0xfddc60a, duration=580.420s, table=65, n_packets=1,
> n_bytes=42, idle_age=580,
> priority=150,reg15=0x1,metadata=0x2,dl_src=00:00:01:01:02:05
> actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:201,output:3
>
> In the above output, the first flow is for packets destined to hosts
> in LS1 (VLAN 101) and the second flow is for packets destined to hosts
> in LS2 (VLAN 201).
>
> If we inject a packet from lp11:
> in_port=lp11, eth.src=f0:00:00:00:00:11, eth.dst=00:00:01:01:02:03,
> ip.src=192.168.1.1, ip.dst=192.168.2.2
>
> The router pipeline is executed on HV1, the eth.src address is set to that
> of the router-to-ls2 port (00:00:01:01:02:05) and finally the entry in
> table=65 is hit and eth.src is changed to the configured chassis-mac
> (aa:bb:cc:dd:ee:11).
>
> The packet is sent out on port ln2:
> $ OVS_RUNDIR=$PWD/hv1 ovs-vsctl --column ofport find interface
> name=patch-br-int-to-ln2
> ofport              : 3
>
> Then it is received on HV2, where we have the following flows:
> $ OVS_RUNDIR=$PWD/hv2 ovs-ofctl dump-flows br-int | grep conj
>  cookie=0x31ba7484, duration=1545.430s, table=0, n_packets=0,
> n_bytes=0, idle_age=1545, priority=180,dl_src=aa:bb:cc:dd:ee:11
> actions=conjunction(100,1/2)
>  cookie=0x2c6a49b3, duration=1545.368s, table=0, n_packets=1,
> n_bytes=46, idle_age=1545,
> priority=180,conj_id=100,in_port=2,dl_vlan=201
>
> actions=strip_vlan,load:0x4->NXM_NX_REG13[],load:0x3->NXM_NX_REG11[],load:0x2->NXM_NX_REG12[],load:0x2->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:00:00:01:01:02:05,resubmit(,8)
>  cookie=0xd8937e1d, duration=1545.332s, table=0, n_packets=0,
> n_bytes=0, idle_age=1545,
> priority=180,conj_id=100,in_port=3,dl_vlan=101
>
> actions=strip_vlan,load:0x9->NXM_NX_REG13[],load:0x5->NXM_NX_REG11[],load:0x6->NXM_NX_REG12[],load:0x1->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:00:00:01:01:02:03,resubmit(,8)
>  cookie=0x2c6a49b3, duration=1545.368s, table=0, n_packets=0,
> n_bytes=0, idle_age=1545, priority=180,dl_vlan=201
> actions=conjunction(100,2/2)
>  cookie=0xd8937e1d, duration=1545.332s, table=0, n_packets=0,
> n_bytes=0, idle_age=1545, priority=180,dl_vlan=101
> actions=conjunction(100,2/2)
>
> The packet matches:
> - clause 1 of the conjunction (100) because eth.src is the chassis-mac
> of HV1 (aa:bb:cc:dd:ee:11).
> - clause 2 of the conjunction because VLAN_ID is 201.
>
> And then matches this flow that fixes the eth.src in the packet to
> that of router-to-ls2:
> priority=180,conj_id=100,in_port=2,dl_vlan=201
> actions=.....,mod_dl_src:00:00:01:01:02:05,....
>
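The two-step match described above (conjunction 100, then the per-port rewrite) can be modelled as a small shell function. This is only a sketch under stated assumptions: `restore_router_mac` is a hypothetical helper, not an OVN or OVS command, and the MACs, ofports and VLAN tags are hard-coded from the HV2 flow dump quoted above.

```shell
# Hypothetical model of the HV2 table=0 flows quoted above.
# Prints the eth.src the packet carries after table 0.
restore_router_mac() {
    local dl_src=$1 in_port=$2 dl_vlan=$3

    # Clause 1/2 of conjunction 100: dl_src must be a remote chassis MAC
    # (here only HV1's chassis MAC is known).
    case $dl_src in
        aa:bb:cc:dd:ee:11) ;;
        *) echo "$dl_src"; return ;;
    esac

    # Clause 2/2 (dl_vlan) plus the conj_id=100 flows, which also match
    # in_port and rewrite eth.src to the logical router port MAC.
    case "$in_port,$dl_vlan" in
        2,201) echo 00:00:01:01:02:05 ;;   # router-to-ls2
        3,101) echo 00:00:01:01:02:03 ;;   # router-to-ls1
        *)     echo "$dl_src" ;;           # no conj_id=100 flow matches
    esac
}

# Packet from the example: sent by HV1, received on ln2 (ofport 2), VLAN 201.
restore_router_mac aa:bb:cc:dd:ee:11 2 201
```

In this model the only differentiators are the ingress port and the VLAN tag, which is the point made in the explanation above.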

The test has only a single router with multiple lswitches, but the loop in
the code is trying to handle the case where there are multiple router ports
on the same lswitch (with the same localnet port). In that loop the inport
and VLAN don't change across iterations.
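
A minimal sketch of this concern, assuming the match of each generated flow consists only of the fields visible in the dump above (priority, conj_id, in_port, dl_vlan). `build_match` and `count_distinct_matches` are hypothetical helpers for illustration, not OVN functions, and the router port MACs are made up. The sketch only shows that the match strings coincide; how OVS treats flows with identical matches is the open question.

```shell
# Build the match part of a "replace chassis MAC" flow.
# $1 = localnet ofport, $2 = VLAN tag. The router port MAC is deliberately
# not an input: it only appears in the action (mod_dl_src), not in the match.
build_match() {
    printf 'priority=180,conj_id=100,in_port=%s,dl_vlan=%s' "$1" "$2"
}

# Emit one match per router port attached to the same switch and count how
# many distinct matches result.
# $1 = ofport, $2 = VLAN tag, remaining args = router port MACs.
count_distinct_matches() {
    local in_port=$1 vlan=$2
    shift 2
    for mac in "$@"; do
        build_match "$in_port" "$vlan"
        printf '\n'
    done | sort -u | wc -l | tr -d ' '
}

# Two router ports on one switch behind the same localnet port and VLAN:
count_distinct_matches 2 201 00:00:00:00:01:00 00:00:00:00:02:00
```

With two router ports the count is 1, i.e. both iterations produce the same match and differ only in the MAC written by the action.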

>
> >
> > cc Ankur who is the author of VLAN backed router to help clarify.
> >
> > This question is not directly related to the current patch. So for the
> patch:
> > Acked-by: Han Zhou <hzhou@ovn.org>
> >
> > I think it is better to wait until the above question is confirmed
> before merging it.
>
> Sure, thanks again for reviewing this!
>
> >
> > > >
> > > > > +    ds_chomp(&eth_src, ' ');
> > > > > +    ds_chomp(&eth_src, ',');
> > > > > +    ds_put_cstr(&eth_src, "}");
> > > > > +
> > > > > +    ds_put_format(&match, "eth.src == %s && (arp.op == 1 ||
> nd_ns)",
> > > > > +                  ds_cstr(&eth_src));
> > > > > +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> > > > > +                  ds_cstr(&match),
> > > > > +                  "outport = \""MC_FLOOD"\"; output;");
> > > > > +
> > > > > +    ds_destroy(&match);
> > > > > +    ds_destroy(&eth_src);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Ingress table 17: Flows that forward ARP/ND requests only to
> the routers
> > > > > + * that own the addresses. Other ARP/ND packets are still flooded
> in the
> > > > > + * switching domain as regular broadcast.
> > > > > + */
> > > > > +static void
> > > > > +build_lswitch_rport_arp_req_flow_for_ip(struct sset *ips,
> > > > > +                                        int addr_family,
> > > > > +                                        struct ovn_port *patch_op,
> > > > > +                                        struct ovn_datapath *od,
> > > > > +                                        uint32_t priority,
> > > > > +                                        struct hmap *lflows)
> > > > > +{
> > > > > +    struct ds match   = DS_EMPTY_INITIALIZER;
> > > > > +    struct ds actions = DS_EMPTY_INITIALIZER;
> > > > > +
> > > > > +    /* Packets received from VXLAN tunnels have already been
> through the
> > > > > +     * router pipeline so we should skip them. Normally this is
> done by the
> > > > > +     * multicast_group implementation (VXLAN packets skip table
> 32 which
> > > > > +     * delivers to patch ports) but we're bypassing
> multicast_groups.
> > > > > +     */
> > > > > +    ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
> > > > > +
> > > > > +    if (addr_family == AF_INET) {
> > > > > +        ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
> > > > > +    } else {
> > > > > +        ds_put_cstr(&match, "nd_ns && nd.target == { ");
> > > > > +    }
> > > > > +
> > > > > +    const char *ip_address;
> > > > > +    SSET_FOR_EACH (ip_address, ips) {
> > > > > +        ds_put_format(&match, "%s, ", ip_address);
> > > > > +    }
> > > > > +
> > > > > +    ds_chomp(&match, ' ');
> > > > > +    ds_chomp(&match, ',');
> > > > > +    ds_put_cstr(&match, "}");
> > > > > +
> > > > > +    /* Send the packet only to the router pipeline and skip
> flooding it
> > > > > +     * in the broadcast domain.
> > > > > +     */
> > > > > +    ds_put_format(&actions, "outport = %s; output;",
> patch_op->json_key);
> > > > > +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> > > > > +                  ds_cstr(&match), ds_cstr(&actions));
> > > > > +
> > > > > +    ds_destroy(&match);
> > > > > +    ds_destroy(&actions);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Ingress table 17: Flows that forward ARP/ND requests only to
> the routers
> > > > > + * that own the addresses.
> > > > > + * Priorities:
> > > > > + * - 80: self originated GARPs that need to follow regular
> processing.
> > > > > + * - 75: ARP requests to router owned IPs (interface IP/LB/NAT).
> > > > > + */
> > > > > +static void
> > > > > +build_lswitch_rport_arp_req_flows(struct ovn_port *op,
> > > > > +                                  struct ovn_datapath *sw_od,
> > > > > +                                  struct ovn_port *sw_op,
> > > > > +                                  struct hmap *lflows)
> > > > > +{
> > > > > +    if (!op || !op->nbrp) {
> > > > > +        return;
> > > > > +    }
> > > > > +
> > > > > +    if (!lrport_is_enabled(op->nbrp)) {
> > > > > +        return;
> > > > > +    }
> > > > > +
> > > > > > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> > > > > > +     * Priority: 80.
> > > > > > +     */
> > > > > > +    build_lswitch_rport_arp_req_self_orig_flow(op, 80, sw_od, lflows);
> > > > > > +
> > > > > > +    /* Forward ARP requests for owned IP addresses (L3, VIP, NAT) only to this
> > > > > > +     * router port.
> > > > > > +     * Priority: 75.
> > > > > > +     */
> > > > > > +    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
> > > > > > +    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
> > > > > > +
> > > > > > +    for (size_t i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
> > > > > > +        sset_add(&all_ips_v4, op->lrp_networks.ipv4_addrs[i].addr_s);
> > > > > > +    }
> > > > > > +    for (size_t i = 0; i < op->lrp_networks.n_ipv6_addrs; i++) {
> > > > > > +        sset_add(&all_ips_v6, op->lrp_networks.ipv6_addrs[i].addr_s);
> > > > > > +    }
> > > > > > +
> > > > > > +    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
> > > > > > +
> > > > > > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> > > > > > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> > > > > > +
> > > > > > +        if (!strcmp(nat->type, "snat")) {
> > > > > > +            continue;
> > > > > > +        }
> > > > > > +
> > > > > > +        ovs_be32 ip;
> > > > > > +        ovs_be32 mask;
> > > > > > +        struct in6_addr ipv6;
> > > > > > +        struct in6_addr mask_v6;
> > > > > > +
> > > > > > +        if (ip_parse_masked(nat->external_ip, &ip, &mask)) {
> > > > > > +            if (!ipv6_parse_masked(nat->external_ip, &ipv6, &mask_v6)) {
> > > > > > +                sset_add(&all_ips_v6, nat->external_ip);
> > > > > > +            }
> > > > > > +        } else {
> > > > > > +            sset_add(&all_ips_v4, nat->external_ip);
> > > > > > +        }
> > > > > > +    }
> > > > > > +
> > > > > > +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v4, AF_INET, sw_op,
> > > > > > +                                            sw_od, 75, lflows);
> > > > > > +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v6, AF_INET6, sw_op,
> > > > > > +                                            sw_od, 75, lflows);
> > > > > +
> > > > > +    sset_destroy(&all_ips_v4);
> > > > > +    sset_destroy(&all_ips_v6);
> > > > > +}
> > > > > +
> > > > >  static void
> > > > >  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
> > > > >                      struct hmap *port_groups, struct hmap *lflows,
> > > > > @@ -5761,6 +5933,14 @@ build_lswitch_flows(struct hmap *datapaths,
> struct hmap *ports,
> > > > >              continue;
> > > > >          }
> > > > >
> > > > > +        /* For ports connected to logical routers add flows to
> bypass the
> > > > > +         * broadcast flooding of ARP/ND requests in table 17. We
> direct the
> > > > > +         * requests only to the router port that owns the IP
> address.
> > > > > +         */
> > > > > +        if (!strcmp(op->nbsp->type, "router")) {
> > > > > +            build_lswitch_rport_arp_req_flows(op->peer, op->od,
> op, lflows);
> > > > > +        }
> > > > > +
> > > > >          for (size_t i = 0; i < op->nbsp->n_addresses; i++) {
> > > > >              /* Addresses are owned by the logical port.
> > > > >               * Ethernet address followed by zero or more IPv4
> > > > > @@ -5892,12 +6072,6 @@ build_lswitch_flows(struct hmap *datapaths,
> struct hmap *ports,
> > > > >      ds_destroy(&actions);
> > > > >  }
> > > > >
> > > > > -static bool
> > > > > -lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> > > > > -{
> > > > > -    return !lrport->enabled || *lrport->enabled;
> > > > > -}
> > > > > -
> > > > >  /* Returns a string of the IP address of the router port 'op' that
> > > > >   * overlaps with 'ip_s".  If one is not found, returns NULL.
> > > > >   *
> > > > > diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
> > > > > index 7966b65..c43f16d 100644
> > > > > --- a/ovn-architecture.7.xml
> > > > > +++ b/ovn-architecture.7.xml
> > > > > @@ -1390,6 +1390,25 @@
> > > > >
> http://docs.openvswitch.org/en/latest/topics/high-availability.
> > > > >    </p>
> > > > >
> > > > > +  <h3>ARP request and ND NS packet processing</h3>
> > > > > +
> > > > > +  <p>
> > > > > +    Due to the fact that ARP requests and ND NA packets are
> usually broadcast
> > > > > +    packets, for performance reasons, OVN deals with requests
> that target OVN
> > > > > +    owned IP addresses (i.e., IP addresses configured on the
> router ports,
> > > > > +    VIPs, NAT IPs) in a specific way and only forwards them to
> the logical
> > > > > +    router that owns the target IP address. This behavior is
> different than
> > > > > +    that of traditional swithces and implies that other
> routers/hosts
> > > > > +    connected to the logical switch will not learn the MAC/IP
> binding from
> > > > > +    the request packet.
> > > > > +  </p>
> > > > > +
> > > > > +  <p>
> > > > > +    All other ARP and ND packets are flooded in the L2 broadcast
> domain and
> > > > > +    to all attached logical patch ports.
> > > > > +  </p>
> > > > > +
> > > > > +
> > > > >    <h2>Multiple localnet logical switches connected to a Logical
> Router</h2>
> > > > >
> > > > >    <p>
> > > > > diff --git a/tests/ovn.at b/tests/ovn.at
> > > > > index 3e429e3..26e33d2 100644
> > > > > --- a/tests/ovn.at
> > > > > +++ b/tests/ovn.at
> > > > > @@ -2877,7 +2877,7 @@ test_ip() {
> > > > >      done
> > > > >  }
> > > > >
> > > > > -# test_arp INPORT SHA SPA TPA [REPLY_HA]
> > > > > +# test_arp INPORT SHA SPA TPA FLOOD [REPLY_HA]
> > > > >  #
> > > > >  # Causes a packet to be received on INPORT.  The packet is an ARP
> > > > >  # request with SHA, SPA, and TPA as specified.  If REPLY_HA is
> provided, then
> > > > > @@ -2888,21 +2888,25 @@ test_ip() {
> > > > >  # SHA and REPLY_HA are each 12 hex digits.
> > > > >  # SPA and TPA are each 8 hex digits.
> > > > >  test_arp() {
> > > > > -    local inport=$1 sha=$2 spa=$3 tpa=$4 reply_ha=$5
> > > > > +    local inport=$1 sha=$2 spa=$3 tpa=$4 flood=$5 reply_ha=$6
> > > > >      local
> request=ffffffffffff${sha}08060001080006040001${sha}${spa}ffffffffffff${tpa}
> > > > >      hv=hv`vif_to_hv $inport`
> > > > >      as $hv ovs-appctl netdev-dummy/receive vif$inport $request
> > > > >      as $hv ovs-appctl ofproto/trace br-int in_port=$inport
> $request
> > > > >
> > > > >      # Expect to receive the broadcast ARP on the other logical
> switch ports if
> > > > > -    # IP address is not configured to the switch patch port.
> > > > > +    # IP address is not configured on the switch patch port or on
> the router
> > > > > +    # port (i.e, $flood == 1).
> > > > >      local i=`vif_to_ls $inport`
> > > > >      local j k
> > > > >      for j in 1 2 3; do
> > > > >          for k in 1 2 3; do
> > > > > -            # 192.168.33.254 is configured to the switch patch
> port for lrp33,
> > > > > -            # so no ARP flooding expected for it.
> > > > > -            if test $i$j$k != $inport && test $tpa != `ip_to_hex
> 192 168 33 254`; then
> > > > > +            # Skip ingress port.
> > > > > +            if test $i$j$k == $inport; then
> > > > > +                continue
> > > > > +            fi
> > > > > +
> > > > > +            if test X$flood == X1; then
> > > > >                  echo $request >> $i$j$k.expected
> > > > >              fi
> > > > >          done
> > > > > @@ -3039,9 +3043,9 @@ for i in 1 2 3; do
> > > > >        otherip=`ip_to_hex 192 168 $i$j 55` # Some other IP in
> subnet
> > > > >        externalip=`ip_to_hex 1 2 3 4`      # Some other IP not in
> subnet
> > > > >
> > > > > -      test_arp $i$j$k $smac $sip        $rip        $rmac      #4
> > > > > -      test_arp $i$j$k $smac $otherip    $rip        $rmac      #5
> > > > > -      test_arp $i$j$k $smac $sip        $otherip               #6
> > > > > +      test_arp $i$j$k $smac $sip        $rip       0     $rmac
>    #4
> > > > > +      test_arp $i$j$k $smac $otherip    $rip       0     $rmac
>    #5
> > > > > +      test_arp $i$j$k $smac $sip        $otherip   1
>    #6
> > > > >
> > > > >        # When rip is 192.168.33.254, ARP request from externalip
> won't be
> > > > >        # filtered, because 192.168.33.254 is configured to switch
> peer port
> > > > > @@ -3050,7 +3054,7 @@ for i in 1 2 3; do
> > > > >        if test $i = 3 && test $j = 3; then
> > > > >          lrp33_rsp=$rmac
> > > > >        fi
> > > > > -      test_arp $i$j$k $smac $externalip $rip        $lrp33_rsp #7
> > > > > +      test_arp $i$j$k $smac $externalip $rip       0
> $lrp33_rsp #7
> > > > >
> > > > >        # MAC binding should be learned from ARP request.
> > > > >        host_mac_pretty=f0:00:00:00:0$i:$j$k
> > > > > @@ -9595,7 +9599,7 @@ ovn-nbctl --wait=hv --timeout=3 sync
> > > > >  # Check that there is a logical flow in logical switch foo's
> pipeline
> > > > >  # to set the outport to rp-foo (which is expected).
> > > > >  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep
> ls_in_l2_lkup | \
> > > > > -grep rp-foo | grep -v is_chassis_resident | wc -l`])
> > > > > +grep rp-foo | grep -v is_chassis_resident | grep priority=50 -c`])
> > > > >
> > > > >  # Set the option 'reside-on-redirect-chassis' for foo
> > > > >  ovn-nbctl set logical_router_port foo
> options:reside-on-redirect-chassis=true
> > > > > @@ -9603,7 +9607,7 @@ ovn-nbctl set logical_router_port foo
> options:reside-on-redirect-chassis=true
> > > > >  # to set the outport to rp-foo with the condition
> is_chassis_redirect.
> > > > >  ovn-sbctl dump-flows foo
> > > > >  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep
> ls_in_l2_lkup | \
> > > > > -grep rp-foo | grep is_chassis_resident | wc -l`])
> > > > > +grep rp-foo | grep is_chassis_resident | grep priority=50 -c`])
> > > > >
> > > > >  echo "---------NB dump-----"
> > > > >  ovn-nbctl show
> > > > > @@ -16694,3 +16698,282 @@ as hv4 ovs-appctl fdb/show br-phys
> > > > >  OVN_CLEANUP([hv1],[hv2],[hv3],[hv4])
> > > > >
> > > > >  AT_CLEANUP
> > > > > +
> > > > > +AT_SETUP([ovn -- ARP/ND request broadcast limiting])
> > > > > +AT_SKIP_IF([test $HAVE_PYTHON = no])
> > > > > +ovn_start
> > > > > +
> > > > > +ip_to_hex() {
> > > > > +    printf "%02x%02x%02x%02x" "$@"
> > > > > +}
> > > > > +
> > > > > +send_arp_request() {
> > > > > +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5
> > > > > +    local eth_dst=ffffffffffff
> > > > > +    local eth_type=0806
> > > > > +    local eth=${eth_dst}${eth_src}${eth_type}
> > > > > +
> > > > > +    local arp=0001080006040001${eth_src}${spa}${eth_dst}${tpa}
> > > > > +
> > > > > +    local request=${eth}${arp}
> > > > > +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport
> $request
> > > > > +}
> > > > > +
> > > > > +send_nd_ns() {
> > > > > +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5 cksum=$6
> > > > > +
> > > > > +    local eth_dst=ffffffffffff
> > > > > +    local eth_type=86dd
> > > > > +    local eth=${eth_dst}${eth_src}${eth_type}
> > > > > +
> > > > > +    local ip_vhlen=60000000
> > > > > +    local ip_plen=0020
> > > > > +    local ip_next=3a
> > > > > +    local ip_ttl=ff
> > > > > +    local ip=${ip_vhlen}${ip_plen}${ip_next}${ip_ttl}${spa}${tpa}
> > > > > +
> > > > > +    # Neighbor Solicitation
> > > > > +    local icmp6_type=87
> > > > > +    local icmp6_code=00
> > > > > +    local icmp6_rsvd=00000000
> > > > > +    # ICMPv6 source lla option
> > > > > +    local icmp6_opt=01
> > > > > +    local icmp6_optlen=01
> > > > > +    local
> icmp6=${icmp6_type}${icmp6_code}${cksum}${icmp6_rsvd}${tpa}${icmp6_opt}${icmp6_optlen}${eth_src}
> > > > > +
> > > > > +    local request=${eth}${ip}${icmp6}
> > > > > +
> > > > > +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport
> $request
> > > > > +}
> > > > > +
> > > > > +src_mac=000000000001
> > > > > +
> > > > > +net_add n1
> > > > > +sim_add hv1
> > > > > +as hv1
> > > > > +ovs-vsctl add-br br-phys
> > > > > +ovn_attach n1 br-phys 192.168.0.1
> > > > > +
> > > > > +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> > > > > +    set interface hv1-vif1 external-ids:iface-id=sw-agg-ext \
> > > > > +    options:tx_pcap=hv1/vif1-tx.pcap \
> > > > > +    options:rxq_pcap=hv1/vif1-rx.pcap \
> > > > > +    ofport-request=1
> > > > > +
> > > > > +# One Aggregation Switch connected to two Logical networks
> (routers).
> > > > > +ovn-nbctl ls-add sw-agg
> > > > > +ovn-nbctl lsp-add sw-agg sw-agg-ext \
> > > > > +    -- lsp-set-addresses sw-agg-ext 00:00:00:00:00:01
> > > > > +
> > > > > +ovn-nbctl lsp-add sw-agg sw-rtr1                   \
> > > > > +    -- lsp-set-type sw-rtr1 router                 \
> > > > > +    -- lsp-set-addresses sw-rtr1 00:00:00:00:01:00 \
> > > > > +    -- lsp-set-options sw-rtr1 router-port=rtr1-sw
> > > > > +ovn-nbctl lsp-add sw-agg sw-rtr2                   \
> > > > > +    -- lsp-set-type sw-rtr2 router                 \
> > > > > +    -- lsp-set-addresses sw-rtr2 00:00:00:00:02:00 \
> > > > > +    -- lsp-set-options sw-rtr2 router-port=rtr2-sw
> > > > > +
> > > > > +# Configure L3 interface IPv4 & IPv6 on both routers
> > > > > +ovn-nbctl lr-add rtr1
> > > > > +ovn-nbctl lrp-add rtr1 rtr1-sw 00:00:00:00:01:00 10.0.0.1/24
> 10::1/64
> > > > > +
> > > > > +ovn-nbctl lr-add rtr2
> > > > > +ovn-nbctl lrp-add rtr2 rtr2-sw 00:00:00:00:02:00 10.0.0.2/24
> 10::2/64
> > > > > +
> > > > > +OVN_POPULATE_ARP
> > > > > +ovn-nbctl --wait=hv sync
> > > > > +
> > > > > +sw_dp_uuid=$(ovn-sbctl --bare --columns _uuid list
> datapath_binding sw-agg)
> > > > > +sw_dp_key=$(ovn-sbctl --bare --columns tunnel_key list
> datapath_binding sw-agg)
> > > > > +
> > > > > +r1_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list
> port_binding sw-rtr1)
> > > > > +r2_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list
> port_binding sw-rtr2)
> > > > > +
> > > > > +mc_key=$(ovn-sbctl --bare --columns tunnel_key find
> multicast_group datapath=${sw_dp_uuid} name="_MC_flood")
> > > > > +mc_key=$(printf "%04x" $mc_key)
> > > > > +
> > > > > +match_sw_metadata="metadata=0x${sw_dp_key}"
> > > > > +
> > > > > +# Inject ARP request for first router owned IP address.
> > > > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254)
> $(ip_to_hex 10 0 0 1)
> > > > > +
> > > > > +# Verify that the ARP request is sent only to rtr1.
> > > > >
> +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.1,arp_op=1"
> > > > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > > > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > > > > +
> > > > > +as hv1
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "1" = "${pkts_to_rtr1}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "0" = "${pkts_to_rtr2}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v
> n_packets=0 -c)
> > > > > +    test "0" = "${pkts_flooded}"
> > > > > +])
> > > > > +
> > > > > > +# Inject ND_NS for first router owned IP address.
> > > > > +src_ipv6=00100000000000000000000000000254
> > > > > +dst_ipv6=00100000000000000000000000000001
> > > > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > > > > +
> > > > > +# Verify that the ND_NS is sent only to rtr1.
> > > > >
> +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::1"
> > > > > +
> > > > > +as hv1
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "1" = "${pkts_to_rtr1}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "0" = "${pkts_to_rtr2}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v
> n_packets=0 -c)
> > > > > +    test "0" = "${pkts_flooded}"
> > > > > +])
> > > > > +
> > > > > +# Configure load balancing on both routers.
> > > > > +ovn-nbctl lb-add lb1-v4 10.0.0.11 42.42.42.1
> > > > > +ovn-nbctl lb-add lb1-v6 10::11 42::1
> > > > > +ovn-nbctl lr-lb-add rtr1 lb1-v4
> > > > > +ovn-nbctl lr-lb-add rtr1 lb1-v6
> > > > > +
> > > > > +ovn-nbctl lb-add lb2-v4 10.0.0.22 42.42.42.2
> > > > > +ovn-nbctl lb-add lb2-v6 10::22 42::2
> > > > > +ovn-nbctl lr-lb-add rtr2 lb2-v4
> > > > > +ovn-nbctl lr-lb-add rtr2 lb2-v6
> > > > > +ovn-nbctl --wait=hv sync
> > > > > +
> > > > > +# Inject ARP request for first router owned VIP address.
> > > > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254)
> $(ip_to_hex 10 0 0 11)
> > > > > +
> > > > > +# Verify that the ARP request is sent only to rtr1.
> > > > >
> +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.11,arp_op=1"
> > > > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > > > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > > > > +
> > > > > +as hv1
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "1" = "${pkts_to_rtr1}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "0" = "${pkts_to_rtr2}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v
> n_packets=0 -c)
> > > > > +    test "0" = "${pkts_flooded}"
> > > > > +])
> > > > > +
> > > > > +# Inject ND_NS for first router owned VIP address.
> > > > > +src_ipv6=00100000000000000000000000000254
> > > > > +dst_ipv6=00100000000000000000000000000011
> > > > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > > > > +
> > > > > +# Verify that the ND_NS is sent only to rtr1.
> > > > >
> +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::11"
> > > > > +
> > > > > +as hv1
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "1" = "${pkts_to_rtr1}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "0" = "${pkts_to_rtr2}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v
> n_packets=0 -c)
> > > > > +    test "0" = "${pkts_flooded}"
> > > > > +])
> > > > > +
> > > > > +# Configure NAT on both routers
> > > > > +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10.0.0.111 42.42.42.1
> > > > > +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10::111 42::1
> > > > > +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10.0.0.222 42.42.42.2
> > > > > +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10::222 42::2
> > > > > +
> > > > > +# Inject ARP request for first router owned NAT address.
> > > > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254)
> $(ip_to_hex 10 0 0 111)
> > > > > +
> > > > > +# Verify that the ARP request is sent only to rtr1.
> > > > >
> +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.111,arp_op=1"
> > > > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
> > > > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
> > > > > +
> > > > > +as hv1
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "1" = "${pkts_to_rtr1}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "0" = "${pkts_to_rtr2}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v
> n_packets=0 -c)
> > > > > +    test "0" = "${pkts_flooded}"
> > > > > +])
> > > > > +
> > > > > +# Inject ND_NS for first router owned IP address.
> > > > > +src_ipv6=00100000000000000000000000000254
> > > > > +dst_ipv6=00100000000000000000000000000111
> > > > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> > > > > +
> > > > > +# Verify that the ND_NS is sent only to rtr1.
> > > > >
> +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::111"
> > > > > +
> > > > > +as hv1
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "1" = "${pkts_to_rtr1}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> > > > > +    grep n_packets=1 -c)
> > > > > +    test "0" = "${pkts_to_rtr2}"
> > > > > +])
> > > > > +OVS_WAIT_UNTIL([
> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v
> n_packets=0 -c)
> > > > > +    test "0" = "${pkts_flooded}"
> > > > > +])
> > > > > +
> > > > > +OVN_CLEANUP([hv1])
> > > > > +AT_CLEANUP
> > > > >
> > >
>
>
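For readers following the quoted test, the frame that send_arp_request
injects can be sketched as a standalone shell snippet. This is only an
illustration under stated assumptions: build_arp_request is a hypothetical
helper name that does not appear in the patch, and, as in the quoted test,
the ARP target hardware address field is simply filled with the broadcast
MAC. ip_to_hex is copied from the test.

```shell
#!/bin/sh
# Convert four decimal octets to 8 hex digits (same helper as in the test).
ip_to_hex() {
    printf "%02x%02x%02x%02x" "$@"
}

# Hypothetical helper (not part of the patch): print the hex string of a
# broadcast ARP request. Arguments: source MAC (12 hex digits), sender IP
# and target IP (8 hex digits each).
build_arp_request() {
    eth_src=$1 spa=$2 tpa=$3
    eth_dst=ffffffffffff        # Ethernet broadcast destination
    eth_type=0806               # EtherType: ARP
    # htype=0001 (Ethernet), ptype=0800 (IPv4), hlen=06, plen=04,
    # oper=0001 (request)
    arp_hdr=0001080006040001
    # Layout: Ethernet header, ARP header, SHA, SPA, THA, TPA.
    printf "%s" "${eth_dst}${eth_src}${eth_type}${arp_hdr}${eth_src}${spa}${eth_dst}${tpa}"
}

# Same request as the first injection in the test: 10.0.0.254 asks for 10.0.0.1.
build_arp_request 000000000001 "$(ip_to_hex 10 0 0 254)" "$(ip_to_hex 10 0 0 1)"
echo
```

The resulting string is 84 hex digits (14 bytes of Ethernet header plus 28
bytes of ARP payload), which is the form the test passes to
"ovs-appctl netdev-dummy/receive".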
Dumitru Ceara Nov. 13, 2019, 4:13 p.m. UTC | #6
On Wed, Nov 13, 2019 at 4:30 PM Han Zhou <hzhou@ovn.org> wrote:
>
>
>
> On Wed, Nov 13, 2019 at 2:42 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> On Tue, Nov 12, 2019 at 8:50 PM Han Zhou <hzhou@ovn.org> wrote:
>> >
>> >
>> >
>> >
>> > On Tue, Nov 12, 2019 at 10:10 AM Dumitru Ceara <dceara@redhat.com> wrote:
>> > >
>> > > On Tue, Nov 12, 2019 at 6:17 PM Han Zhou <zhouhan@gmail.com> wrote:
>> > > >
>> > > >
>> > > >
>> > > > On Tue, Nov 12, 2019 at 2:29 AM Dumitru Ceara <dceara@redhat.com> wrote:
>> > > > >
>> > > > > ARP request and ND NS packets for router owned IPs were being
>> > > > > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
>> > > > > However this creates a scaling issue in scenarios where aggregation
>> > > > > logical switches are connected to more logical routers (~350). The
>> > > > > logical pipelines of all routers would have to be executed before the
>> > > > > packet is finally replied to by a single router, the owner of the IP
>> > > > > address.
>> > > > >
>> > > > > This commit limits the broadcast domain by bypassing the L2 Lookup stage
>> > > > > for ARP requests that will be replied by a single router. The packets
>> > > > > are forwarded only to the router port that owns the target IP address.
>> > > > >
>> > > > > IPs that are owned by the routers and for which this fix applies are:
>> > > > > - IP addresses configured on the router ports.
>> > > > > - VIPs.
>> > > > > - NAT IPs.
>> > > > >
>> > > > > Reported-at: https://bugzilla.redhat.com/1756945
>> > > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
>> > > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>> > > > >
>> > > > > ---
>> > > > > v7:
>> > > > > - Address Han's comments:
>> > > > >     - Remove flooding for all ARPs received on VLAN networks. To avoid
>> > > > >       that we now identify self originated (G)ARPs by matching on source
>> > > > >       MAC address too.
>> > > > >     - Rename REGBIT_NOT_VXLAN to FLAGBIT_NOT_VXLAN.
>> > > > > - Fix ovn-sb manpage.
>> > > > > - Split patch in a series of 2:
>> > > > >     - patch1: fixes the get_router_load_balancer_ips() function.
>> > > > >     - patch2: limits the ARP/ND broadcast domain.
>> > > > > v6:
>> > > > > - Address Han's comments:
>> > > > >     - remove flooding of ARPs targeting OVN owned IP addresses.
>> > > > >     - update ovn-architecture documentation.
>> > > > >     - rename ARP handling functions.
>> > > > >     - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
>> > > > >     account the new way of forwarding ARPs.
>> > > > > - Also, properly deal with ARP packets on VLAN-backed networks.
>> > > > > v5: Address Numan's comments: update comments & make autotest more
>> > > > >     robust.
>> > > > > v4: Rebase.
>> > > > > v3: Properly deal with VXLAN traffic. Address review comments from
>> > > > >     Numan (add autotests). Fix function get_router_load_balancer_ips.
>> > > > >     Rebase -> deal with IPv6 NAT too.
>> > > > > v2: Move ARP broadcast domain limiting to table S_SWITCH_IN_L2_LKUP to
>> > > > > address localnet ports too.
>> > > > > ---
>> > > > >  northd/ovn-northd.8.xml |   14 ++
>> > > > >  northd/ovn-northd.c     |  230 +++++++++++++++++++++++++++++++----
>> > > > >  ovn-architecture.7.xml  |   19 +++
>> > > > >  tests/ovn.at            |  307 +++++++++++++++++++++++++++++++++++++++++++++--
>> > > > >  4 files changed, 530 insertions(+), 40 deletions(-)
>> > > > >
>> > > > > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
>> > > > > index 0a33dcd..344cc0d 100644
>> > > > > --- a/northd/ovn-northd.8.xml
>> > > > > +++ b/northd/ovn-northd.8.xml
>> > > > > @@ -1005,6 +1005,20 @@ output;
>> > > > >        </li>
>> > > > >
>> > > > >        <li>
>> > > > > +        Priority-80 flows for each port connected to a logical router
>> > > > > +        matching self originated GARP/ARP request/ND packets. These packets
>> > > > > +        are flooded to the <code>MC_FLOOD</code> which contains all logical
>> > > > > +        ports.
>> > > > > +      </li>
>> > > > > +
>> > > > > +      <li>
>> > > > > +        Priority-75 flows for each IP address/VIP/NAT address owned by a
>> > > > > +        router port connected to the switch. These flows match ARP requests
>> > > > > +        and ND packets for the specific IP addresses.  Matched packets are
>> > > > > +        forwarded only to the router that owns the IP address.
>> > > > > +      </li>
>> > > > > +
>> > > > > +      <li>
>> > > > >          A priority-70 flow that outputs all packets with an Ethernet broadcast
>> > > > >          or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
>> > > > >          multicast group.
>> > > > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
>> > > > > index 32f3200..d6beb97 100644
>> > > > > --- a/northd/ovn-northd.c
>> > > > > +++ b/northd/ovn-northd.c
>> > > > > @@ -210,6 +210,8 @@ enum ovn_stage {
>> > > > >  #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
>> > > > >  #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
>> > > > >
>> > > > > +#define FLAGBIT_NOT_VXLAN "flags[1] == 0"
>> > > > > +
>> > > > >  /* Returns an "enum ovn_stage" built from the arguments. */
>> > > > >  static enum ovn_stage
>> > > > >  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
>> > > > > @@ -1202,6 +1204,34 @@ ovn_port_allocate_key(struct ovn_datapath *od)
>> > > > >                            1, (1u << 15) - 1, &od->port_key_hint);
>> > > > >  }
>> > > > >
>> > > > > +/* Returns true if the logical switch port 'enabled' column is empty or
>> > > > > + * set to true.  Otherwise, returns false. */
>> > > > > +static bool
>> > > > > +lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
>> > > > > +{
>> > > > > +    return !lsp->n_enabled || *lsp->enabled;
>> > > > > +}
>> > > > > +
>> > > > > +/* Returns true only if the logical switch port 'up' column is set to true.
>> > > > > + * Otherwise, if the column is not set or set to false, returns false. */
>> > > > > +static bool
>> > > > > +lsp_is_up(const struct nbrec_logical_switch_port *lsp)
>> > > > > +{
>> > > > > +    return lsp->n_up && *lsp->up;
>> > > > > +}
>> > > > > +
>> > > > > +static bool
>> > > > > +lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
>> > > > > +{
>> > > > > +    return !strcmp(nbsp->type, "external");
>> > > > > +}
>> > > > > +
>> > > > > +static bool
>> > > > > +lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
>> > > > > +{
>> > > > > +    return !lrport->enabled || *lrport->enabled;
>> > > > > +}
>> > > > > +
>> > > > >  static char *
>> > > > >  chassis_redirect_name(const char *port_name)
>> > > > >  {
>> > > > > @@ -3750,28 +3780,6 @@ build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
>> > > > >
>> > > > >  }
>> > > > >
>> > > > > -/* Returns true if the logical switch port 'enabled' column is empty or
>> > > > > - * set to true.  Otherwise, returns false. */
>> > > > > -static bool
>> > > > > -lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
>> > > > > -{
>> > > > > -    return !lsp->n_enabled || *lsp->enabled;
>> > > > > -}
>> > > > > -
>> > > > > -/* Returns true only if the logical switch port 'up' column is set to true.
>> > > > > - * Otherwise, if the column is not set or set to false, returns false. */
>> > > > > -static bool
>> > > > > -lsp_is_up(const struct nbrec_logical_switch_port *lsp)
>> > > > > -{
>> > > > > -    return lsp->n_up && *lsp->up;
>> > > > > -}
>> > > > > -
>> > > > > -static bool
>> > > > > -lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
>> > > > > -{
>> > > > > -    return !strcmp(nbsp->type, "external");
>> > > > > -}
>> > > > > -
>> > > > >  static bool
>> > > > >  build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
>> > > > >                      struct ds *options_action, struct ds *response_action,
>> > > > > @@ -5174,6 +5182,170 @@ build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
>> > > > >      }
>> > > > >  }
>> > > > >
>> > > > > +/*
>> > > > > + * Ingress table 17: Flows that flood self originated ARP/ND packets in the
>> > > > > + * switching domain.
>> > > > > + */
>> > > > > +static void
>> > > > > +build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,
>> > > > > +                                           uint32_t priority,
>> > > > > +                                           struct ovn_datapath *od,
>> > > > > +                                           struct hmap *lflows)
>> > > > > +{
>> > > > > +    struct ds match = DS_EMPTY_INITIALIZER;
>> > > > > +    struct ds eth_src = DS_EMPTY_INITIALIZER;
>> > > > > +
>> > > > > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
>> > > > > +     * Determine that packets are self originated by also matching on
>> > > > > +     * source MAC. Matching on ingress port is not reliable in case this
>> > > > > +     * is a VLAN-backed network.
>> > > > > +     * Priority: 80.
>> > > > > +     */
>> > > > > +    ds_put_format(&eth_src, "{ %s, ", op->lrp_networks.ea_s);
>> > > > > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
>> > > > > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
>> > > > > +
>> > > > > +        if (!nat->external_mac) {
>> > > > > +            continue;
>> > > > > +        }
>> > > > > +
>> > > > > +        ds_put_format(&eth_src, "%s, ", nat->external_mac);
>> > > > > +    }
>> > > >
>> > > > As discussed, we need to add the chassis-unique MACs that are configured in external-ids:ovn-chassis-mac-mappings of the Chassis records, but I didn't find this in the patch. VLAN-backed logical routers may not work without this.
>> > >
>> > > Hi Han,
>> > >
>> > > Maybe I misunderstood but in the discussion on v6 I mentioned that I
>> > > don't think we need to add the MACs from
>> > > external-ids:ovn-chassis-mac-mappings.
>> > >
>> > > Whenever chassis MACs are configured, in ovn-controller we create a
>> > > conjunctive flow matching on any of the remote chassis MAC addresses:
>> > > https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L501
>> > >
>> > > And for all incoming traffic that matches this conjunction and VLAN-id
>> > > we change the MAC back to that of the logical router port:
>> > > https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L558
>> > >
>> > > Isn't this enough to cover the self originated ARP packets?
>> > >
>> > > Thanks,
>> > > Dumitru
>> > >
>> >
>> > Dumitru, sorry, I misunderstood: you actually meant that it was OK not to add the chassis-unique MACs. Also, I didn't realize that there are already flows that change the chassis-unique MACs back to the logical router port's MACs.
>> > With this precondition I think your patch should be good enough.
>> >
>> > However, I revisited the function put_replace_chassis_mac_flows() and had some difficulty understanding how it would work. For these flows, the match conditions are:
>> > - in_port of the localnet port
>> > - conjunction id: CHASSIS_MAC_TO_ROUTER_MAC_CONJID (value 100)
>> > - vlan tag associated with the localnet port
>> > The flow is added in a loop for each peer port, to replace the MAC for each router port on that logical switch. Since the match conditions are all the same, wouldn't this result in only one flow taking effect and the others being dropped? I wonder if some other port-specific match condition should be added so that the MAC can be replaced with that of its original router port accordingly.
>>
>> If I understand correctly the differentiator is the ingress localnet
>> port id and VLAN-ID.
>>
>> Looking at the autotest for "ovn -- 2 HVs, 2 lports/HV, localnet
>> ports, DVR chassis mac" the network is:
>>
>> $ ovn-nbctl --db=unix:$PWD/./ovn-nb/ovn-nb.sock show
>> switch df662f28-4a42-4ac4-aadb-89563347cae1 (ls1)
>>     port ls1-to-router
>>         type: router
>>         router-port: router-to-ls1
>>     port lp11
>>         addresses: ["f0:00:00:00:00:11 192.168.1.1"]
>>     port ln1
>>         type: localnet
>>         parent:
>>         tag: 101
>>         addresses: ["unknown"]
>> switch 91ddb7b2-df8c-42de-a899-6bb35ee08a16 (ls2)
>>     port ls2-to-router
>>         type: router
>>         router-port: router-to-ls2
>>     port lp22
>>         addresses: ["f0:00:00:00:00:22 192.168.2.2"]
>>     port ln2
>>         type: localnet
>>         parent:
>>         tag: 201
>>         addresses: ["unknown"]
>> router 4186cb04-3370-401c-9d65-29cb2af48af1 (router)
>>     port router-to-ls1
>>         mac: "00:00:01:01:02:03"
>>         networks: ["192.168.1.3/24"]
>>     port router-to-ls2
>>         mac: "00:00:01:01:02:05"
>>         networks: ["192.168.2.3/24"]
>>
>> $ ovn-sbctl --db=unix:$PWD/./ovn-sb/ovn-sb.sock list chassis | grep -E "uuid|external_ids"
>> _uuid               : 31ba7484-e0af-4326-a71b-c3f32e52e547
>> external_ids        : {datapath-type="",
>> iface-types="dummy,dummy-internal,dummy-pmd,erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan",
>> ovn-bridge-mappings="phys:br-phys",
>> ovn-chassis-mac-mappings="phys:aa:bb:cc:dd:ee:11", ovn-cms-options=""}
>>
>> _uuid               : a04b5ad2-d76f-42c4-9f2b-21816e2624e2
>> external_ids        : {datapath-type="",
>> iface-types="dummy,dummy-internal,dummy-pmd,erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan",
>> ovn-bridge-mappings="phys:br-phys",
>> ovn-chassis-mac-mappings="phys:aa:bb:cc:dd:ee:22", ovn-cms-options=""}
>>
>> On HV1 (chassis-mac aa:bb:cc:dd:ee:11) we have the following flow to
>> replace source MAC for already routed packets:
>>
>> $ OVS_RUNDIR=$PWD/hv1 ovs-ofctl dump-flows br-int | grep aa:bb:cc:dd:ee:11
>>  cookie=0xe0f6b198, duration=580.460s, table=65, n_packets=0, n_bytes=0, idle_age=580, priority=150,reg15=0x1,metadata=0x1,dl_src=00:00:01:01:02:03 actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:101,output:1
>>  cookie=0xfddc60a, duration=580.420s, table=65, n_packets=1, n_bytes=42, idle_age=580, priority=150,reg15=0x1,metadata=0x2,dl_src=00:00:01:01:02:05 actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:201,output:3
>>
>> In the above output, the first flow is for packets destined to hosts
>> in LS1 (VLAN 101) and the second flow is for packets destined to hosts
>> in LS2 (VLAN 201).
>>
>> If we inject a packet from lp11:
>> in_port=lp11, eth.src=f0:00:00:00:00:11, eth.dst=00:00:01:01:02:03,
>> ip.src=192.168.1.1, ip.dst=192.168.2.2
>>
>> The router pipeline is executed on HV1, the eth.src address is set to that
>> of the router-to-ls2 port (00:00:01:01:02:05), and finally the entry in
>> table=65 is hit and eth.src is changed to the configured chassis-mac
>> (aa:bb:cc:dd:ee:11).
>>
>> The packet is sent out on port ln2:
>> $ OVS_RUNDIR=$PWD/hv1 ovs-vsctl --column ofport find interface name=patch-br-int-to-ln2
>> ofport              : 3
>>
>> Then it is received on HV2, where we have the following flows:
>> $ OVS_RUNDIR=$PWD/hv2 ovs-ofctl dump-flows br-int | grep conj
>>  cookie=0x31ba7484, duration=1545.430s, table=0, n_packets=0, n_bytes=0, idle_age=1545, priority=180,dl_src=aa:bb:cc:dd:ee:11 actions=conjunction(100,1/2)
>>  cookie=0x2c6a49b3, duration=1545.368s, table=0, n_packets=1, n_bytes=46, idle_age=1545, priority=180,conj_id=100,in_port=2,dl_vlan=201 actions=strip_vlan,load:0x4->NXM_NX_REG13[],load:0x3->NXM_NX_REG11[],load:0x2->NXM_NX_REG12[],load:0x2->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:00:00:01:01:02:05,resubmit(,8)
>>  cookie=0xd8937e1d, duration=1545.332s, table=0, n_packets=0, n_bytes=0, idle_age=1545, priority=180,conj_id=100,in_port=3,dl_vlan=101 actions=strip_vlan,load:0x9->NXM_NX_REG13[],load:0x5->NXM_NX_REG11[],load:0x6->NXM_NX_REG12[],load:0x1->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:00:00:01:01:02:03,resubmit(,8)
>>  cookie=0x2c6a49b3, duration=1545.368s, table=0, n_packets=0, n_bytes=0, idle_age=1545, priority=180,dl_vlan=201 actions=conjunction(100,2/2)
>>  cookie=0xd8937e1d, duration=1545.332s, table=0, n_packets=0, n_bytes=0, idle_age=1545, priority=180,dl_vlan=101 actions=conjunction(100,2/2)
>>
>> The packet matches:
>> - clause 1 of the conjunction (100) because eth.src is the chassis-mac
>> of HV1 (aa:bb:cc:dd:ee:11).
>> - clause 2 of the conjunction because VLAN_ID is 201.
>>
>> And then matches this flow that fixes the eth.src in the packet to
>> that of router-to-ls2:
>> priority=180,conj_id=100,in_port=2,dl_vlan=201
>> actions=.....,mod_dl_src:00:00:01:01:02:05,....
>
>
> The test has only a single router with multiple lswitches, but the loop in the code is trying to handle the case when there are multiple router ports on the same lswitch (with same localnet port). In the loop the inport and vlan doesn’t change across iterations.

Ok, I see what you mean now, I was under the impression that we'd
always have a single router port per lswitch with vlan networks.

I'll let Ankur comment on this because it seems to me that we can't
restore the router port MAC if multiple router-ports map to the same
VLAN-ID + localnet-port.
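To make the concern concrete, here is a toy model of it (not the ovn-controller code). The helper flow_match is hypothetical, and so is the second MAC; the sketch only assumes that the conj_id flow's match is derived from the localnet in_port and the VLAN tag and nothing else:

```shell
# Hypothetical model: the conj_id flow match is built only from the localnet
# in_port and the VLAN tag, so it does not depend on the router port.
flow_match() (
    in_port=$1 vlan=$2
    printf 'priority=180,conj_id=100,in_port=%s,dl_vlan=%s' "$in_port" "$vlan"
)

# Two router ports attached to the same switch (same localnet port and tag).
# The second MAC below is made up for illustration.
m1=$(flow_match 2 201)   # would restore 00:00:01:01:02:05
m2=$(flow_match 2 201)   # would restore 00:00:01:01:02:06 (hypothetical)

if [ "$m1" = "$m2" ]; then
    echo "identical match: only one mod_dl_src action can take effect"
fi
```

With identical matches at the same priority, the two flows differ only in their actions, so at most one of the rewrites can apply to a given packet.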

>>
>>
>> >
>> > cc Ankur who is the author of VLAN backed router to help clarify.
>> >
>> > This question is not directly related to the current patch. So for the patch:
>> > Acked-by: Han Zhou <hzhou@ovn.org>
>> >
>> > I think it is better to wait until the above question is confirmed before merging it.
>>
>> Sure, thanks again for reviewing this!
>>
>> >
>> > > >
>> > > > > +    ds_chomp(&eth_src, ' ');
>> > > > > +    ds_chomp(&eth_src, ',');
>> > > > > +    ds_put_cstr(&eth_src, "}");
>> > > > > +
>> > > > > +    ds_put_format(&match, "eth.src == %s && (arp.op == 1 || nd_ns)",
>> > > > > +                  ds_cstr(&eth_src));
>> > > > > +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
>> > > > > +                  ds_cstr(&match),
>> > > > > +                  "outport = \""MC_FLOOD"\"; output;");
>> > > > > +
>> > > > > +    ds_destroy(&match);
>> > > > > +    ds_destroy(&eth_src);
>> > > > > +}
>> > > > > +
>> > > > > +/*
>> > > > > + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
>> > > > > + * that own the addresses. Other ARP/ND packets are still flooded in the
>> > > > > + * switching domain as regular broadcast.
>> > > > > + */
>> > > > > +static void
>> > > > > +build_lswitch_rport_arp_req_flow_for_ip(struct sset *ips,
>> > > > > +                                        int addr_family,
>> > > > > +                                        struct ovn_port *patch_op,
>> > > > > +                                        struct ovn_datapath *od,
>> > > > > +                                        uint32_t priority,
>> > > > > +                                        struct hmap *lflows)
>> > > > > +{
>> > > > > +    struct ds match   = DS_EMPTY_INITIALIZER;
>> > > > > +    struct ds actions = DS_EMPTY_INITIALIZER;
>> > > > > +
>> > > > > +    /* Packets received from VXLAN tunnels have already been through the
>> > > > > +     * router pipeline so we should skip them. Normally this is done by the
>> > > > > +     * multicast_group implementation (VXLAN packets skip table 32 which
>> > > > > +     * delivers to patch ports) but we're bypassing multicast_groups.
>> > > > > +     */
>> > > > > +    ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
>> > > > > +
>> > > > > +    if (addr_family == AF_INET) {
>> > > > > +        ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
>> > > > > +    } else {
>> > > > > +        ds_put_cstr(&match, "nd_ns && nd.target == { ");
>> > > > > +    }
>> > > > > +
>> > > > > +    const char *ip_address;
>> > > > > +    SSET_FOR_EACH (ip_address, ips) {
>> > > > > +        ds_put_format(&match, "%s, ", ip_address);
>> > > > > +    }
>> > > > > +
>> > > > > +    ds_chomp(&match, ' ');
>> > > > > +    ds_chomp(&match, ',');
>> > > > > +    ds_put_cstr(&match, "}");
>> > > > > +
>> > > > > +    /* Send the packet only to the router pipeline and skip flooding it
>> > > > > +     * in the broadcast domain.
>> > > > > +     */
>> > > > > +    ds_put_format(&actions, "outport = %s; output;", patch_op->json_key);
>> > > > > +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
>> > > > > +                  ds_cstr(&match), ds_cstr(&actions));
>> > > > > +
>> > > > > +    ds_destroy(&match);
>> > > > > +    ds_destroy(&actions);
>> > > > > +}
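For readers following along, a rough shell sketch of the match string this function assembles. The helper names ovn_set and arp_nd_match are made up for illustration, the FLAGBIT_NOT_VXLAN prefix is left out, and the spacing inside the braces is not identical to what the ds_*() calls produce:

```shell
# Join the arguments into an OVN-style set: "{a, b, c}".
ovn_set() (
    out=
    for item in "$@"; do
        out="${out:+$out, }$item"
    done
    printf '{%s}' "$out"
)

# Build the match for ARP requests (family "4") or ND NS (family "6").
arp_nd_match() (
    family=$1; shift
    if [ "$family" = 4 ]; then
        printf 'arp.op == 1 && arp.tpa == %s' "$(ovn_set "$@")"
    else
        printf 'nd_ns && nd.target == %s' "$(ovn_set "$@")"
    fi
)

arp_nd_match 4 10.0.0.1 10.0.0.11 10.0.0.111; echo
arp_nd_match 6 10::1 10::11; echo
```

One flow per address family with a set match keeps the number of logical flows per router port constant, regardless of how many IPs the router owns.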
>> > > > > +
>> > > > > +/*
>> > > > > + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
>> > > > > + * that own the addresses.
>> > > > > + * Priorities:
>> > > > > + * - 80: self originated GARPs that need to follow regular processing.
>> > > > > + * - 75: ARP requests to router owned IPs (interface IP/LB/NAT).
>> > > > > + */
>> > > > > +static void
>> > > > > +build_lswitch_rport_arp_req_flows(struct ovn_port *op,
>> > > > > +                                  struct ovn_datapath *sw_od,
>> > > > > +                                  struct ovn_port *sw_op,
>> > > > > +                                  struct hmap *lflows)
>> > > > > +{
>> > > > > +    if (!op || !op->nbrp) {
>> > > > > +        return;
>> > > > > +    }
>> > > > > +
>> > > > > +    if (!lrport_is_enabled(op->nbrp)) {
>> > > > > +        return;
>> > > > > +    }
>> > > > > +
>> > > > > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
>> > > > > +     * Priority: 80.
>> > > > > +     */
>> > > > > +    build_lswitch_rport_arp_req_self_orig_flow(op, 80, sw_od, lflows);
>> > > > > +
>> > > > > +    /* Forward ARP requests for owned IP addresses (L3, VIP, NAT) only to this
>> > > > > +     * router port.
>> > > > > +     * Priority: 75.
>> > > > > +     */
>> > > > > +    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
>> > > > > +    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
>> > > > > +
>> > > > > +    for (size_t i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
>> > > > > +        sset_add(&all_ips_v4, op->lrp_networks.ipv4_addrs[i].addr_s);
>> > > > > +    }
>> > > > > +    for (size_t i = 0; i < op->lrp_networks.n_ipv6_addrs; i++) {
>> > > > > +        sset_add(&all_ips_v6, op->lrp_networks.ipv6_addrs[i].addr_s);
>> > > > > +    }
>> > > > > +
>> > > > > +    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
>> > > > > +
>> > > > > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
>> > > > > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
>> > > > > +
>> > > > > +        if (!strcmp(nat->type, "snat")) {
>> > > > > +            continue;
>> > > > > +        }
>> > > > > +
>> > > > > +        ovs_be32 ip;
>> > > > > +        ovs_be32 mask;
>> > > > > +        struct in6_addr ipv6;
>> > > > > +        struct in6_addr mask_v6;
>> > > > > +
>> > > > > +        if (ip_parse_masked(nat->external_ip, &ip, &mask)) {
>> > > > > +            if (!ipv6_parse_masked(nat->external_ip, &ipv6, &mask_v6)) {
>> > > > > +                sset_add(&all_ips_v6, nat->external_ip);
>> > > > > +            }
>> > > > > +        } else {
>> > > > > +            sset_add(&all_ips_v4, nat->external_ip);
>> > > > > +        }
>> > > > > +    }
>> > > > > +
>> > > > > +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v4, AF_INET, sw_op,
>> > > > > +                                            sw_od, 75, lflows);
>> > > > > +    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v6, AF_INET6, sw_op,
>> > > > > +                                            sw_od, 75, lflows);
>> > > > > +
>> > > > > +    sset_destroy(&all_ips_v4);
>> > > > > +    sset_destroy(&all_ips_v6);
>> > > > > +}
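The NAT loop above can be modelled roughly as follows. This is only a sketch: classify_nat_ips is a hypothetical helper, and testing for a ':' in the address is a crude stand-in for the ip_parse_masked()/ipv6_parse_masked() calls in the patch:

```shell
# Arguments are pairs: NAT type, external IP.
# Entries of type "snat" are skipped, as in the patch.
classify_nat_ips() (
    v4= v6=
    while [ $# -ge 2 ]; do
        nat_type=$1 ip=$2; shift 2
        [ "$nat_type" = snat ] && continue
        case $ip in
            *:*) v6="${v6:+$v6 }$ip" ;;   # contains ':' -> treat as IPv6
            *)   v4="${v4:+$v4 }$ip" ;;   # otherwise -> treat as IPv4
        esac
    done
    printf 'v4=%s\nv6=%s\n' "$v4" "$v6"
)

classify_nat_ips dnat_and_snat 10.0.0.111 snat 10.0.0.5 dnat_and_snat 10::111
```

The two resulting lists correspond to all_ips_v4 and all_ips_v6, which then feed the two priority-75 flows.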
>> > > > > +
>> > > > >  static void
>> > > > >  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
>> > > > >                      struct hmap *port_groups, struct hmap *lflows,
>> > > > > @@ -5761,6 +5933,14 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
>> > > > >              continue;
>> > > > >          }
>> > > > >
>> > > > > +        /* For ports connected to logical routers add flows to bypass the
>> > > > > +         * broadcast flooding of ARP/ND requests in table 17. We direct the
>> > > > > +         * requests only to the router port that owns the IP address.
>> > > > > +         */
>> > > > > +        if (!strcmp(op->nbsp->type, "router")) {
>> > > > > +            build_lswitch_rport_arp_req_flows(op->peer, op->od, op, lflows);
>> > > > > +        }
>> > > > > +
>> > > > >          for (size_t i = 0; i < op->nbsp->n_addresses; i++) {
>> > > > >              /* Addresses are owned by the logical port.
>> > > > >               * Ethernet address followed by zero or more IPv4
>> > > > > @@ -5892,12 +6072,6 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
>> > > > >      ds_destroy(&actions);
>> > > > >  }
>> > > > >
>> > > > > -static bool
>> > > > > -lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
>> > > > > -{
>> > > > > -    return !lrport->enabled || *lrport->enabled;
>> > > > > -}
>> > > > > -
>> > > > >  /* Returns a string of the IP address of the router port 'op' that
>> > > > >   * overlaps with 'ip_s".  If one is not found, returns NULL.
>> > > > >   *
>> > > > > diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
>> > > > > index 7966b65..c43f16d 100644
>> > > > > --- a/ovn-architecture.7.xml
>> > > > > +++ b/ovn-architecture.7.xml
>> > > > > @@ -1390,6 +1390,25 @@
>> > > > >      http://docs.openvswitch.org/en/latest/topics/high-availability.
>> > > > >    </p>
>> > > > >
>> > > > > +  <h3>ARP request and ND NS packet processing</h3>
>> > > > > +
>> > > > > +  <p>
>> > > > > +    Due to the fact that ARP requests and ND NS packets are usually broadcast
>> > > > > +    packets, for performance reasons, OVN deals with requests that target OVN
>> > > > > +    owned IP addresses (i.e., IP addresses configured on the router ports,
>> > > > > +    VIPs, NAT IPs) in a specific way and only forwards them to the logical
>> > > > > +    router that owns the target IP address. This behavior is different than
>> > > > > +    that of traditional switches and implies that other routers/hosts
>> > > > > +    connected to the logical switch will not learn the MAC/IP binding from
>> > > > > +    the request packet.
>> > > > > +  </p>
>> > > > > +
>> > > > > +  <p>
>> > > > > +    All other ARP and ND packets are flooded in the L2 broadcast domain and
>> > > > > +    to all attached logical patch ports.
>> > > > > +  </p>
>> > > > > +
>> > > > > +
>> > > > >    <h2>Multiple localnet logical switches connected to a Logical Router</h2>
>> > > > >
>> > > > >    <p>
>> > > > > diff --git a/tests/ovn.at b/tests/ovn.at
>> > > > > index 3e429e3..26e33d2 100644
>> > > > > --- a/tests/ovn.at
>> > > > > +++ b/tests/ovn.at
>> > > > > @@ -2877,7 +2877,7 @@ test_ip() {
>> > > > >      done
>> > > > >  }
>> > > > >
>> > > > > -# test_arp INPORT SHA SPA TPA [REPLY_HA]
>> > > > > +# test_arp INPORT SHA SPA TPA FLOOD [REPLY_HA]
>> > > > >  #
>> > > > >  # Causes a packet to be received on INPORT.  The packet is an ARP
>> > > > >  # request with SHA, SPA, and TPA as specified.  If REPLY_HA is provided, then
>> > > > > @@ -2888,21 +2888,25 @@ test_ip() {
>> > > > >  # SHA and REPLY_HA are each 12 hex digits.
>> > > > >  # SPA and TPA are each 8 hex digits.
>> > > > >  test_arp() {
>> > > > > -    local inport=$1 sha=$2 spa=$3 tpa=$4 reply_ha=$5
>> > > > > +    local inport=$1 sha=$2 spa=$3 tpa=$4 flood=$5 reply_ha=$6
>> > > > >      local request=ffffffffffff${sha}08060001080006040001${sha}${spa}ffffffffffff${tpa}
>> > > > >      hv=hv`vif_to_hv $inport`
>> > > > >      as $hv ovs-appctl netdev-dummy/receive vif$inport $request
>> > > > >      as $hv ovs-appctl ofproto/trace br-int in_port=$inport $request
>> > > > >
>> > > > >      # Expect to receive the broadcast ARP on the other logical switch ports if
>> > > > > -    # IP address is not configured to the switch patch port.
>> > > > > +    # IP address is not configured on the switch patch port or on the router
>> > > > > +    # port (i.e., $flood == 1).
>> > > > >      local i=`vif_to_ls $inport`
>> > > > >      local j k
>> > > > >      for j in 1 2 3; do
>> > > > >          for k in 1 2 3; do
>> > > > > -            # 192.168.33.254 is configured to the switch patch port for lrp33,
>> > > > > -            # so no ARP flooding expected for it.
>> > > > > -            if test $i$j$k != $inport && test $tpa != `ip_to_hex 192 168 33 254`; then
>> > > > > +            # Skip ingress port.
>> > > > > +            if test $i$j$k == $inport; then
>> > > > > +                continue
>> > > > > +            fi
>> > > > > +
>> > > > > +            if test X$flood == X1; then
>> > > > >                  echo $request >> $i$j$k.expected
>> > > > >              fi
>> > > > >          done
>> > > > > @@ -3039,9 +3043,9 @@ for i in 1 2 3; do
>> > > > >        otherip=`ip_to_hex 192 168 $i$j 55` # Some other IP in subnet
>> > > > >        externalip=`ip_to_hex 1 2 3 4`      # Some other IP not in subnet
>> > > > >
>> > > > > -      test_arp $i$j$k $smac $sip        $rip        $rmac      #4
>> > > > > -      test_arp $i$j$k $smac $otherip    $rip        $rmac      #5
>> > > > > -      test_arp $i$j$k $smac $sip        $otherip               #6
>> > > > > +      test_arp $i$j$k $smac $sip        $rip       0     $rmac       #4
>> > > > > +      test_arp $i$j$k $smac $otherip    $rip       0     $rmac       #5
>> > > > > +      test_arp $i$j$k $smac $sip        $otherip   1                 #6
>> > > > >
>> > > > >        # When rip is 192.168.33.254, ARP request from externalip won't be
>> > > > >        # filtered, because 192.168.33.254 is configured to switch peer port
>> > > > > @@ -3050,7 +3054,7 @@ for i in 1 2 3; do
>> > > > >        if test $i = 3 && test $j = 3; then
>> > > > >          lrp33_rsp=$rmac
>> > > > >        fi
>> > > > > -      test_arp $i$j$k $smac $externalip $rip        $lrp33_rsp #7
>> > > > > +      test_arp $i$j$k $smac $externalip $rip       0      $lrp33_rsp #7
>> > > > >
>> > > > >        # MAC binding should be learned from ARP request.
>> > > > >        host_mac_pretty=f0:00:00:00:0$i:$j$k
>> > > > > @@ -9595,7 +9599,7 @@ ovn-nbctl --wait=hv --timeout=3 sync
>> > > > >  # Check that there is a logical flow in logical switch foo's pipeline
>> > > > >  # to set the outport to rp-foo (which is expected).
>> > > > >  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
>> > > > > -grep rp-foo | grep -v is_chassis_resident | wc -l`])
>> > > > > +grep rp-foo | grep -v is_chassis_resident | grep priority=50 -c`])
>> > > > >
>> > > > >  # Set the option 'reside-on-redirect-chassis' for foo
>> > > > >  ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
>> > > > > @@ -9603,7 +9607,7 @@ ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
>> > > > >  # to set the outport to rp-foo with the condition is_chassis_redirect.
>> > > > >  ovn-sbctl dump-flows foo
>> > > > >  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
>> > > > > -grep rp-foo | grep is_chassis_resident | wc -l`])
>> > > > > +grep rp-foo | grep is_chassis_resident | grep priority=50 -c`])
>> > > > >
>> > > > >  echo "---------NB dump-----"
>> > > > >  ovn-nbctl show
>> > > > > @@ -16694,3 +16698,282 @@ as hv4 ovs-appctl fdb/show br-phys
>> > > > >  OVN_CLEANUP([hv1],[hv2],[hv3],[hv4])
>> > > > >
>> > > > >  AT_CLEANUP
>> > > > > +
>> > > > > +AT_SETUP([ovn -- ARP/ND request broadcast limiting])
>> > > > > +AT_SKIP_IF([test $HAVE_PYTHON = no])
>> > > > > +ovn_start
>> > > > > +
>> > > > > +ip_to_hex() {
>> > > > > +    printf "%02x%02x%02x%02x" "$@"
>> > > > > +}
>> > > > > +
>> > > > > +send_arp_request() {
>> > > > > +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5
>> > > > > +    local eth_dst=ffffffffffff
>> > > > > +    local eth_type=0806
>> > > > > +    local eth=${eth_dst}${eth_src}${eth_type}
>> > > > > +
>> > > > > +    local arp=0001080006040001${eth_src}${spa}${eth_dst}${tpa}
>> > > > > +
>> > > > > +    local request=${eth}${arp}
>> > > > > +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
>> > > > > +}
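As a reading aid, the frame that send_arp_request() assembles can be taken apart again by character offset. The sketch below rebuilds the same hex string with a hypothetical helper, build_arp_request, and slices out two fields; it assumes nothing beyond the field order visible in the test:

```shell
ip_to_hex() {
    printf "%02x%02x%02x%02x" "$@"
}

# Ethernet header (dst, src, type 0x0806) followed by an ARP request:
# htype=0001 ptype=0800 hlen=06 plen=04 op=0001, then SHA, SPA, THA, TPA.
build_arp_request() (
    eth_src=$1 spa=$2 tpa=$3
    eth_dst=ffffffffffff
    printf '%s' "${eth_dst}${eth_src}0806" "0001080006040001" \
                "${eth_src}${spa}${eth_dst}${tpa}"
)

req=$(build_arp_request 000000000001 "$(ip_to_hex 10 0 0 254)" "$(ip_to_hex 10 0 0 1)")

# Two hex digits per byte; offsets below are 1-based character positions.
echo "length (bytes): $(( ${#req} / 2 ))"
echo "arp.op:  $(printf '%s' "$req" | cut -c41-44)"
echo "arp.tpa: $(printf '%s' "$req" | cut -c77-84)"
```

The 42-byte length agrees with the n_bytes=42 counter seen for the single ARP packet in the table=65 flow dump earlier in the thread.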
>> > > > > +
>> > > > > +send_nd_ns() {
>> > > > > +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5 cksum=$6
>> > > > > +
>> > > > > +    local eth_dst=ffffffffffff
>> > > > > +    local eth_type=86dd
>> > > > > +    local eth=${eth_dst}${eth_src}${eth_type}
>> > > > > +
>> > > > > +    local ip_vhlen=60000000
>> > > > > +    local ip_plen=0020
>> > > > > +    local ip_next=3a
>> > > > > +    local ip_ttl=ff
>> > > > > +    local ip=${ip_vhlen}${ip_plen}${ip_next}${ip_ttl}${spa}${tpa}
>> > > > > +
>> > > > > +    # Neighbor Solicitation
>> > > > > +    local icmp6_type=87
>> > > > > +    local icmp6_code=00
>> > > > > +    local icmp6_rsvd=00000000
>> > > > > +    # ICMPv6 source lla option
>> > > > > +    local icmp6_opt=01
>> > > > > +    local icmp6_optlen=01
>> > > > > +    local icmp6=${icmp6_type}${icmp6_code}${cksum}${icmp6_rsvd}${tpa}${icmp6_opt}${icmp6_optlen}${eth_src}
>> > > > > +
>> > > > > +    local request=${eth}${ip}${icmp6}
>> > > > > +
>> > > > > +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
>> > > > > +}
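A quick arithmetic check on the ip_plen=0020 constant used above: the ICMPv6 part built by send_nd_ns() should be 32 bytes long. The sketch below recomputes that from the same fields; nd_ns_payload is a hypothetical helper, and the checksum value is simply copied from the test rather than verified:

```shell
# Assemble the ICMPv6 part of an ND NS frame from hex-string fields, in the
# same field order as send_nd_ns() in the test.
nd_ns_payload() (
    eth_src=$1 tpa=$2 cksum=$3
    # type=0x87 (135), code=0, checksum, 4 reserved bytes, target address,
    # then a source link-layer address option (type 1, length 1 x 8 bytes).
    printf '%s' 87 00 "$cksum" 00000000 "$tpa" 01 01 "$eth_src"
)

# 10::1 written out as 32 hex digits: "0010", 24 zeros, "0001".
target=$(printf '0010%024d0001' 0)
payload=$(nd_ns_payload 000000000001 "$target" 751d)

# Each byte is two hex digits; format the byte count the way ip_plen is.
plen=$(printf '%04x' $(( ${#payload} / 2 )))
echo "payload bytes as ip_plen: $plen"
```

The breakdown is 4 bytes of ICMPv6 header, 4 reserved bytes, 16 bytes of target address and 8 bytes of option, which adds up to 0x20.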
>> > > > > +
>> > > > > +src_mac=000000000001
>> > > > > +
>> > > > > +net_add n1
>> > > > > +sim_add hv1
>> > > > > +as hv1
>> > > > > +ovs-vsctl add-br br-phys
>> > > > > +ovn_attach n1 br-phys 192.168.0.1
>> > > > > +
>> > > > > +ovs-vsctl -- add-port br-int hv1-vif1 -- \
>> > > > > +    set interface hv1-vif1 external-ids:iface-id=sw-agg-ext \
>> > > > > +    options:tx_pcap=hv1/vif1-tx.pcap \
>> > > > > +    options:rxq_pcap=hv1/vif1-rx.pcap \
>> > > > > +    ofport-request=1
>> > > > > +
>> > > > > +# One Aggregation Switch connected to two Logical networks (routers).
>> > > > > +ovn-nbctl ls-add sw-agg
>> > > > > +ovn-nbctl lsp-add sw-agg sw-agg-ext \
>> > > > > +    -- lsp-set-addresses sw-agg-ext 00:00:00:00:00:01
>> > > > > +
>> > > > > +ovn-nbctl lsp-add sw-agg sw-rtr1                   \
>> > > > > +    -- lsp-set-type sw-rtr1 router                 \
>> > > > > +    -- lsp-set-addresses sw-rtr1 00:00:00:00:01:00 \
>> > > > > +    -- lsp-set-options sw-rtr1 router-port=rtr1-sw
>> > > > > +ovn-nbctl lsp-add sw-agg sw-rtr2                   \
>> > > > > +    -- lsp-set-type sw-rtr2 router                 \
>> > > > > +    -- lsp-set-addresses sw-rtr2 00:00:00:00:02:00 \
>> > > > > +    -- lsp-set-options sw-rtr2 router-port=rtr2-sw
>> > > > > +
>> > > > > +# Configure L3 interface IPv4 & IPv6 on both routers
>> > > > > +ovn-nbctl lr-add rtr1
>> > > > > +ovn-nbctl lrp-add rtr1 rtr1-sw 00:00:00:00:01:00 10.0.0.1/24 10::1/64
>> > > > > +
>> > > > > +ovn-nbctl lr-add rtr2
>> > > > > +ovn-nbctl lrp-add rtr2 rtr2-sw 00:00:00:00:02:00 10.0.0.2/24 10::2/64
>> > > > > +
>> > > > > +OVN_POPULATE_ARP
>> > > > > +ovn-nbctl --wait=hv sync
>> > > > > +
>> > > > > +sw_dp_uuid=$(ovn-sbctl --bare --columns _uuid list datapath_binding sw-agg)
>> > > > > +sw_dp_key=$(ovn-sbctl --bare --columns tunnel_key list datapath_binding sw-agg)
>> > > > > +
>> > > > > +r1_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr1)
>> > > > > +r2_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr2)
>> > > > > +
>> > > > > +mc_key=$(ovn-sbctl --bare --columns tunnel_key find multicast_group datapath=${sw_dp_uuid} name="_MC_flood")
>> > > > > +mc_key=$(printf "%04x" $mc_key)
>> > > > > +
>> > > > > +match_sw_metadata="metadata=0x${sw_dp_key}"
>> > > > > +
>> > > > > +# Inject ARP request for first router owned IP address.
>> > > > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 1)
>> > > > > +
>> > > > > +# Verify that the ARP request is sent only to rtr1.
>> > > > > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.1,arp_op=1"
>> > > > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
>> > > > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
>> > > > > +
>> > > > > +as hv1
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "1" = "${pkts_to_rtr1}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "0" = "${pkts_to_rtr2}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
>> > > > > +    test "0" = "${pkts_flooded}"
>> > > > > +])
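The three checks above all follow the same pattern: filter the flow dump by match, then by action, then count lines by packet counter. Below is a small offline sketch of that pattern. The two flow lines are made up (table number, cookies and register values are illustrative only), count_hits is a hypothetical helper, and the trailing comma in 'n_packets=1,' is my addition so that a counter such as n_packets=10 is not counted as well:

```shell
# Made-up sample resembling "ovs-ofctl dump-flows" output.
flows='cookie=0x1, table=25, n_packets=1, priority=75,arp,metadata=0x1,arp_tpa=10.0.0.1,arp_op=1 actions=load:0x2->NXM_NX_REG15[],resubmit(,32)
cookie=0x2, table=25, n_packets=0, priority=75,arp,metadata=0x1,arp_tpa=10.0.0.2,arp_op=1 actions=load:0x3->NXM_NX_REG15[],resubmit(,32)'

# $1: flow dump text, $2: extended regex for the match, $3: action substring.
# Prints the number of matching flows that were hit exactly once.
count_hits() (
    printf '%s\n' "$1" | grep -E "$2" | grep -F -- "$3" \
        | grep -c 'n_packets=1,' || true   # grep -c exits 1 on a zero count
)

match='priority=75.*metadata=0x1.*arp_tpa=10\.0\.0\.1,arp_op=1'
echo "to rtr1: $(count_hits "$flows" "$match" 'load:0x2->NXM_NX_REG15')"
echo "to rtr2: $(count_hits "$flows" "$match" 'load:0x3->NXM_NX_REG15')"
```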
>> > > > > +
>> > > > > +# Inject ND_NS for first router owned IP address.
>> > > > > +src_ipv6=00100000000000000000000000000254
>> > > > > +dst_ipv6=00100000000000000000000000000001
>> > > > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
>> > > > > +
>> > > > > +# Verify that the ND_NS is sent only to rtr1.
>> > > > > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::1"
>> > > > > +
>> > > > > +as hv1
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "1" = "${pkts_to_rtr1}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "0" = "${pkts_to_rtr2}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
>> > > > > +    test "0" = "${pkts_flooded}"
>> > > > > +])
>> > > > > +
>> > > > > +# Configure load balancing on both routers.
>> > > > > +ovn-nbctl lb-add lb1-v4 10.0.0.11 42.42.42.1
>> > > > > +ovn-nbctl lb-add lb1-v6 10::11 42::1
>> > > > > +ovn-nbctl lr-lb-add rtr1 lb1-v4
>> > > > > +ovn-nbctl lr-lb-add rtr1 lb1-v6
>> > > > > +
>> > > > > +ovn-nbctl lb-add lb2-v4 10.0.0.22 42.42.42.2
>> > > > > +ovn-nbctl lb-add lb2-v6 10::22 42::2
>> > > > > +ovn-nbctl lr-lb-add rtr2 lb2-v4
>> > > > > +ovn-nbctl lr-lb-add rtr2 lb2-v6
>> > > > > +ovn-nbctl --wait=hv sync
>> > > > > +
>> > > > > +# Inject ARP request for first router owned VIP address.
>> > > > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 11)
>> > > > > +
>> > > > > +# Verify that the ARP request is sent only to rtr1.
>> > > > > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.11,arp_op=1"
>> > > > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
>> > > > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
>> > > > > +
>> > > > > +as hv1
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "1" = "${pkts_to_rtr1}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "0" = "${pkts_to_rtr2}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
>> > > > > +    test "0" = "${pkts_flooded}"
>> > > > > +])
>> > > > > +
>> > > > > +# Inject ND_NS for first router owned VIP address.
>> > > > > +src_ipv6=00100000000000000000000000000254
>> > > > > +dst_ipv6=00100000000000000000000000000011
>> > > > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
>> > > > > +
>> > > > > +# Verify that the ND_NS is sent only to rtr1.
>> > > > > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::11"
>> > > > > +
>> > > > > +as hv1
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "1" = "${pkts_to_rtr1}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "0" = "${pkts_to_rtr2}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
>> > > > > +    test "0" = "${pkts_flooded}"
>> > > > > +])
>> > > > > +
>> > > > > +# Configure NAT on both routers
>> > > > > +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10.0.0.111 42.42.42.1
>> > > > > +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10::111 42::1
>> > > > > +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10.0.0.222 42.42.42.2
>> > > > > +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10::222 42::2
>> > > > > +
>> > > > > +# Inject ARP request for first router owned NAT address.
>> > > > > +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 111)
>> > > > > +
>> > > > > +# Verify that the ARP request is sent only to rtr1.
>> > > > > +match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.111,arp_op=1"
>> > > > > +match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
>> > > > > +match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
>> > > > > +
>> > > > > +as hv1
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "1" = "${pkts_to_rtr1}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "0" = "${pkts_to_rtr2}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
>> > > > > +    test "0" = "${pkts_flooded}"
>> > > > > +])
>> > > > > +
>> > > > > +# Inject ND_NS for first router owned NAT address.
>> > > > > +src_ipv6=00100000000000000000000000000254
>> > > > > +dst_ipv6=00100000000000000000000000000111
>> > > > > +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
>> > > > > +
>> > > > > +# Verify that the ND_NS is sent only to rtr1.
>> > > > > +match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::111"
>> > > > > +
>> > > > > +as hv1
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "1" = "${pkts_to_rtr1}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
>> > > > > +    grep n_packets=1 -c)
>> > > > > +    test "0" = "${pkts_to_rtr2}"
>> > > > > +])
>> > > > > +OVS_WAIT_UNTIL([
>> > > > > +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
>> > > > > +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
>> > > > > +    test "0" = "${pkts_flooded}"
>> > > > > +])
>> > > > > +
>> > > > > +OVN_CLEANUP([hv1])
>> > > > > +AT_CLEANUP
>> > > > >
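The flow-counter checks in the test above (grep the `ovs-ofctl dump-flows` output for a match pattern, then for the output port, then for the packet count) can be sketched in Python as follows. This is only an illustration: `count_flows`, `SAMPLE_DUMP`, the table number, `metadata=0x1`, the tunnel keys `0x2`/`0x3`, and the second target IP are hypothetical stand-ins for the values the test computes at run time.

```python
import re

# Hypothetical excerpt of "ovs-ofctl dump-flows br-int" output; the
# metadata value, table number and tunnel keys are illustrative only.
SAMPLE_DUMP = """\
 cookie=0x0, table=25, n_packets=1, priority=75,arp,metadata=0x1,arp_tpa=10.0.0.111,arp_op=1 actions=load:0x2->NXM_NX_REG15[],resubmit(,32)
 cookie=0x0, table=25, n_packets=0, priority=75,arp,metadata=0x1,arp_tpa=10.0.0.222,arp_op=1 actions=load:0x3->NXM_NX_REG15[],resubmit(,32)
"""


def count_flows(dump, match_regex, action_substr, n_packets):
    """Count flows whose text matches all three filters, mirroring
    `grep -E match | grep action | grep n_packets=N -c`."""
    counter = f"n_packets={n_packets}"
    return sum(
        1
        for line in dump.splitlines()
        if re.search(match_regex, line)
        and action_substr in line
        and counter in line
    )


MATCH_ARP_REQ = r"priority=75.*metadata=0x1.*arp_tpa=10\.0\.0\.111,arp_op=1"
pkts_to_rtr1 = count_flows(SAMPLE_DUMP, MATCH_ARP_REQ, "load:0x2->NXM_NX_REG15", 1)
pkts_to_rtr2 = count_flows(SAMPLE_DUMP, MATCH_ARP_REQ, "load:0x3->NXM_NX_REG15", 1)
```

Like the grep pipeline it mirrors, the packet-count filter is a plain substring test, so `n_packets=1` would also match a counter such as `n_packets=10`.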
>> > > > > _______________________________________________
>> > > > > dev mailing list
>> > > > > dev@openvswitch.org
>> > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>> > >
>>
Han Zhou Nov. 13, 2019, 7:51 p.m. UTC | #7
On Wed, Nov 13, 2019 at 8:14 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On Wed, Nov 13, 2019 at 4:30 PM Han Zhou <hzhou@ovn.org> wrote:
> >
> >
> >
> > On Wed, Nov 13, 2019 at 2:42 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >> On Tue, Nov 12, 2019 at 8:50 PM Han Zhou <hzhou@ovn.org> wrote:
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, Nov 12, 2019 at 10:10 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >> > >
> >> > > On Tue, Nov 12, 2019 at 6:17 PM Han Zhou <zhouhan@gmail.com> wrote:
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Tue, Nov 12, 2019 at 2:29 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >> > > > >
> >> > > > > ARP request and ND NS packets for router owned IPs were being
> >> > > > > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> >> > > > > However this creates a scaling issue in scenarios where aggregation
> >> > > > > logical switches are connected to more logical routers (~350). The
> >> > > > > logical pipelines of all routers would have to be executed before the
> >> > > > > packet is finally replied to by a single router, the owner of the IP
> >> > > > > address.
> >> > > > >
> >> > > > > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> >> > > > > for ARP requests that will be replied by a single router. The packets
> >> > > > > are forwarded only to the router port that owns the target IP address.
> >> > > > >
> >> > > > > IPs that are owned by the routers and for which this fix applies are:
> >> > > > > - IP addresses configured on the router ports.
> >> > > > > - VIPs.
> >> > > > > - NAT IPs.
> >> > > > >
> >> > > > > Reported-at: https://bugzilla.redhat.com/1756945
> >> > > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> >> > > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >> > > > >
> >> > > > > ---
> >> > > > > v7:
> >> > > > > - Address Han's comments:
> >> > > > >     - Remove flooding for all ARPs received on VLAN networks. To avoid
> >> > > > >       that we now identify self originated (G)ARPs by matching on source
> >> > > > >       MAC address too.
> >> > > > >     - Rename REGBIT_NOT_VXLAN to FLAGBIT_NOT_VXLAN.
> >> > > > > - Fix ovn-sb manpage.
> >> > > > > - Split patch in a series of 2:
> >> > > > >     - patch1: fixes the get_router_load_balancer_ips() function.
> >> > > > >     - patch2: limits the ARP/ND broadcast domain.
> >> > > > > v6:
> >> > > > > - Address Han's comments:
> >> > > > >     - remove flooding of ARPs targeting OVN owned IP addresses.
> >> > > > >     - update ovn-architecture documentation.
> >> > > > >     - rename ARP handling functions.
> >> > > > >     - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
> >> > > > >     account the new way of forwarding ARPs.
> >> > > > > - Also, properly deal with ARP packets on VLAN-backed networks.
> >> > > > > v5: Address Numan's comments: update comments & make autotest more
> >> > > > >     robust.
> >> > > > > v4: Rebase.
> >> > > > > v3: Properly deal with VXLAN traffic. Address review comments from
> >> > > > >     Numan (add autotests). Fix function get_router_load_balancer_ips.
> >> > > > >     Rebase -> deal with IPv6 NAT too.
> >> > > > > v2: Move ARP broadcast domain limiting to table S_SWITCH_IN_L2_LKUP to
> >> > > > > address localnet ports too.
> >> > > > > ---
> >> > > > >  northd/ovn-northd.8.xml |   14 ++
> >> > > > >  northd/ovn-northd.c     |  230 +++++++++++++++++++++++++++++++----
> >> > > > >  ovn-architecture.7.xml  |   19 +++
> >> > > > >  tests/ovn.at            |  307 +++++++++++++++++++++++++++++++++++++++++++++--
> >> > > > >  4 files changed, 530 insertions(+), 40 deletions(-)
> >> > > > >
> >> > > > > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> >> > > > > index 0a33dcd..344cc0d 100644
> >> > > > > --- a/northd/ovn-northd.8.xml
> >> > > > > +++ b/northd/ovn-northd.8.xml
> >> > > > > @@ -1005,6 +1005,20 @@ output;
> >> > > > >        </li>
> >> > > > >
> >> > > > >        <li>
> >> > > > > +        Priority-80 flows for each port connected to a logical router
> >> > > > > +        matching self originated GARP/ARP request/ND packets. These packets
> >> > > > > +        are flooded to the <code>MC_FLOOD</code> which contains all logical
> >> > > > > +        ports.
> >> > > > > +      </li>
> >> > > > > +
> >> > > > > +      <li>
> >> > > > > +        Priority-75 flows for each IP address/VIP/NAT address owned by a
> >> > > > > +        router port connected to the switch. These flows match ARP requests
> >> > > > > +        and ND packets for the specific IP addresses.  Matched packets are
> >> > > > > +        forwarded only to the router that owns the IP address.
> >> > > > > +      </li>
> >> > > > > +
> >> > > > > +      <li>
> >> > > > >          A priority-70 flow that outputs all packets with an Ethernet broadcast
> >> > > > >          or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
> >> > > > >          multicast group.
> >> > > > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> >> > > > > index 32f3200..d6beb97 100644
> >> > > > > --- a/northd/ovn-northd.c
> >> > > > > +++ b/northd/ovn-northd.c
> >> > > > > @@ -210,6 +210,8 @@ enum ovn_stage {
> >> > > > >  #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
> >> > > > >  #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
> >> > > > >
> >> > > > > +#define FLAGBIT_NOT_VXLAN "flags[1] == 0"
> >> > > > > +
> >> > > > >  /* Returns an "enum ovn_stage" built from the arguments. */
> >> > > > >  static enum ovn_stage
> >> > > > >  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
> >> > > > > @@ -1202,6 +1204,34 @@ ovn_port_allocate_key(struct ovn_datapath *od)
> >> > > > >                            1, (1u << 15) - 1, &od->port_key_hint);
> >> > > > >  }
> >> > > > >
> >> > > > > +/* Returns true if the logical switch port 'enabled' column is empty or
> >> > > > > + * set to true.  Otherwise, returns false. */
> >> > > > > +static bool
> >> > > > > +lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> >> > > > > +{
> >> > > > > +    return !lsp->n_enabled || *lsp->enabled;
> >> > > > > +}
> >> > > > > +
> >> > > > > +/* Returns true only if the logical switch port 'up' column is set to true.
> >> > > > > + * Otherwise, if the column is not set or set to false, returns false. */
> >> > > > > +static bool
> >> > > > > +lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> >> > > > > +{
> >> > > > > +    return lsp->n_up && *lsp->up;
> >> > > > > +}
> >> > > > > +
> >> > > > > +static bool
> >> > > > > +lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> >> > > > > +{
> >> > > > > +    return !strcmp(nbsp->type, "external");
> >> > > > > +}
> >> > > > > +
> >> > > > > +static bool
> >> > > > > +lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> >> > > > > +{
> >> > > > > +    return !lrport->enabled || *lrport->enabled;
> >> > > > > +}
> >> > > > > +
> >> > > > >  static char *
> >> > > > >  chassis_redirect_name(const char *port_name)
> >> > > > >  {
> >> > > > > @@ -3750,28 +3780,6 @@ build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
> >> > > > >
> >> > > > >  }
> >> > > > >
> >> > > > > -/* Returns true if the logical switch port 'enabled' column is empty or
> >> > > > > - * set to true.  Otherwise, returns false. */
> >> > > > > -static bool
> >> > > > > -lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> >> > > > > -{
> >> > > > > -    return !lsp->n_enabled || *lsp->enabled;
> >> > > > > -}
> >> > > > > -
> >> > > > > -/* Returns true only if the logical switch port 'up' column is set to true.
> >> > > > > - * Otherwise, if the column is not set or set to false, returns false. */
> >> > > > > -static bool
> >> > > > > -lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> >> > > > > -{
> >> > > > > -    return lsp->n_up && *lsp->up;
> >> > > > > -}
> >> > > > > -
> >> > > > > -static bool
> >> > > > > -lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> >> > > > > -{
> >> > > > > -    return !strcmp(nbsp->type, "external");
> >> > > > > -}
> >> > > > > -
> >> > > > >  static bool
> >> > > > >  build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
> >> > > > >                      struct ds *options_action, struct ds *response_action,
> >> > > > > @@ -5174,6 +5182,170 @@ build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
> >> > > > >      }
> >> > > > >  }
> >> > > > >
> >> > > > > +/*
> >> > > > > + * Ingress table 17: Flows that flood self originated ARP/ND packets in the
> >> > > > > + * switching domain.
> >> > > > > + */
> >> > > > > +static void
> >> > > > > +build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,
> >> > > > > +                                           uint32_t priority,
> >> > > > > +                                           struct ovn_datapath *od,
> >> > > > > +                                           struct hmap *lflows)
> >> > > > > +{
> >> > > > > +    struct ds match = DS_EMPTY_INITIALIZER;
> >> > > > > +    struct ds eth_src = DS_EMPTY_INITIALIZER;
> >> > > > > +
> >> > > > > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> >> > > > > +     * Determine that packets are self originated by also matching on
> >> > > > > +     * source MAC. Matching on ingress port is not reliable in case this
> >> > > > > +     * is a VLAN-backed network.
> >> > > > > +     * Priority: 80.
> >> > > > > +     */
> >> > > > > +    ds_put_format(&eth_src, "{ %s, ", op->lrp_networks.ea_s);
> >> > > > > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> >> > > > > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> >> > > > > +
> >> > > > > +        if (!nat->external_mac) {
> >> > > > > +            continue;
> >> > > > > +        }
> >> > > > > +
> >> > > > > +        ds_put_format(&eth_src, "%s, ", nat->external_mac);
> >> > > > > +    }
> >> > > >
> >> > > > As discussed, we need to add the chassis-unique MACs that are
> >> > > > configured in external-ids:ovn-chassis-mac-mappings of the Chassis
> >> > > > records, but I didn't find this in the patch. A VLAN-backed logical
> >> > > > router may not work without this.
> >> > >
> >> > > Hi Han,
> >> > >
> >> > > Maybe I misunderstood but in the discussion on v6 I mentioned that I
> >> > > don't think we need to add the MACs from
> >> > > external-ids:ovn-chassis-mac-mappings.
> >> > >
> >> > > Whenever chassis MACs are configured, in ovn-controller we create a
> >> > > conjunctive flow matching on any of the remote chassis MAC addresses:
> >> > > https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L501
> >> > >
> >> > > And for all incoming traffic that matches this conjunction and VLAN-id
> >> > > we change the MAC back to that of the logical router port:
> >> > > https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L558
> >> > >
> >> > > Isn't this enough to cover the self originated ARP packets?
> >> > >
> >> > > Thanks,
> >> > > Dumitru
> >> > >
> >> >
> >> > Dumitru, sorry, I misunderstood: you actually meant that it was OK
> >> > not to add the chassis-unique MACs. I also didn't realize that there
> >> > are already flows that change the chassis-unique MACs back to the
> >> > logical router port's MACs.
> >> > With this precondition I think your patch should be good enough.
> >> >
> >> > However, I revisited the function put_replace_chassis_mac_flows()
> >> > and had some difficulty understanding how it would work. For these
> >> > flows, the match conditions are:
> >> > - in_port of the localnet port
> >> > - conjunction id: CHASSIS_MAC_TO_ROUTER_MAC_CONJID (value 100)
> >> > - vlan tag associated with the localnet port
> >> > The flow is added in a loop over the peer ports, to replace the MAC
> >> > for each router port on that logical switch. Since the match
> >> > conditions are all the same, wouldn't that result in only one flow
> >> > taking effect and the others getting dropped? I wonder whether some
> >> > other port-specific match condition should be added so that the MAC
> >> > can be replaced back to its original router port MAC accordingly.
> >>
> >> If I understand correctly the differentiator is the ingress localnet
> >> port id and VLAN-ID.
> >>
> >> Looking at the autotest for "ovn -- 2 HVs, 2 lports/HV, localnet
> >> ports, DVR chassis mac" the network is:
> >>
> >> $ ovn-nbctl --db=unix:$PWD/./ovn-nb/ovn-nb.sock show
> >> switch df662f28-4a42-4ac4-aadb-89563347cae1 (ls1)
> >>     port ls1-to-router
> >>         type: router
> >>         router-port: router-to-ls1
> >>     port lp11
> >>         addresses: ["f0:00:00:00:00:11 192.168.1.1"]
> >>     port ln1
> >>         type: localnet
> >>         parent:
> >>         tag: 101
> >>         addresses: ["unknown"]
> >> switch 91ddb7b2-df8c-42de-a899-6bb35ee08a16 (ls2)
> >>     port ls2-to-router
> >>         type: router
> >>         router-port: router-to-ls2
> >>     port lp22
> >>         addresses: ["f0:00:00:00:00:22 192.168.2.2"]
> >>     port ln2
> >>         type: localnet
> >>         parent:
> >>         tag: 201
> >>         addresses: ["unknown"]
> >> router 4186cb04-3370-401c-9d65-29cb2af48af1 (router)
> >>     port router-to-ls1
> >>         mac: "00:00:01:01:02:03"
> >>         networks: ["192.168.1.3/24"]
> >>     port router-to-ls2
> >>         mac: "00:00:01:01:02:05"
> >>         networks: ["192.168.2.3/24"]
> >>
> >> $ ovn-sbctl --db=unix:$PWD/./ovn-sb/ovn-sb.sock list chassis | grep -E
> >> "uuid|external_ids"
> >> _uuid               : 31ba7484-e0af-4326-a71b-c3f32e52e547
> >> external_ids        : {datapath-type="",
> >> iface-types="dummy,dummy-internal,dummy-pmd,erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan",
> >> ovn-bridge-mappings="phys:br-phys",
> >> ovn-chassis-mac-mappings="phys:aa:bb:cc:dd:ee:11", ovn-cms-options=""}
> >>
> >> _uuid               : a04b5ad2-d76f-42c4-9f2b-21816e2624e2
> >> external_ids        : {datapath-type="",
> >> iface-types="dummy,dummy-internal,dummy-pmd,erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan",
> >> ovn-bridge-mappings="phys:br-phys",
> >> ovn-chassis-mac-mappings="phys:aa:bb:cc:dd:ee:22", ovn-cms-options=""}
> >>
> >> On HV1 (chassis-mac aa:bb:cc:dd:ee:11) we have the following flow to
> >> replace source MAC for already routed packets:
> >>
> >> $ OVS_RUNDIR=$PWD/hv1 ovs-ofctl dump-flows br-int | grep
> >> aa:bb:cc:dd:ee:11
> >>  cookie=0xe0f6b198, duration=580.460s, table=65, n_packets=0,
> >> n_bytes=0, idle_age=580,
> >> priority=150,reg15=0x1,metadata=0x1,dl_src=00:00:01:01:02:03
> >> actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:101,output:1
> >>  cookie=0xfddc60a, duration=580.420s, table=65, n_packets=1,
> >> n_bytes=42, idle_age=580,
> >> priority=150,reg15=0x1,metadata=0x2,dl_src=00:00:01:01:02:05
> >> actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:201,output:3
> >>
> >> In the above output, the first flow is for packets destined to hosts
> >> in LS1 (VLAN 101) and the second flow is for packets destined to hosts
> >> in LS2 (VLAN 201).
> >>
> >> If we inject a packet from lp11:
> >> in_port=lp11, eth.src=f0:00:00:00:00:11, eth.dst=00:00:01:01:02:03,
> >> ip.src=192.168.1.1, ip.dst=192.168.2.2
> >>
> >> The router pipeline is executed on HV1, eth.src is set to the address
> >> of the router-to-ls2 port (00:00:01:01:02:05) and finally the entry in
> >> table=65 is hit and eth.src is changed to the configured chassis-mac
> >> (aa:bb:cc:dd:ee:11).
> >>
> >> The packet is sent out on port ln2:
> >> $ OVS_RUNDIR=$PWD/hv1 ovs-vsctl --column ofport find interface
> >> name=patch-br-int-to-ln2
> >> ofport              : 3
> >>
> >> Then it is received on HV2, where we have the following flows:
> >> $ OVS_RUNDIR=$PWD/hv2 ovs-ofctl dump-flows br-int | grep conj
> >>  cookie=0x31ba7484, duration=1545.430s, table=0, n_packets=0,
> >> n_bytes=0, idle_age=1545, priority=180,dl_src=aa:bb:cc:dd:ee:11
> >> actions=conjunction(100,1/2)
> >>  cookie=0x2c6a49b3, duration=1545.368s, table=0, n_packets=1,
> >> n_bytes=46, idle_age=1545,
> >> priority=180,conj_id=100,in_port=2,dl_vlan=201
> >> actions=strip_vlan,load:0x4->NXM_NX_REG13[],load:0x3->NXM_NX_REG11[],load:0x2->NXM_NX_REG12[],load:0x2->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:00:00:01:01:02:05,resubmit(,8)
> >>  cookie=0xd8937e1d, duration=1545.332s, table=0, n_packets=0,
> >> n_bytes=0, idle_age=1545,
> >> priority=180,conj_id=100,in_port=3,dl_vlan=101
> >> actions=strip_vlan,load:0x9->NXM_NX_REG13[],load:0x5->NXM_NX_REG11[],load:0x6->NXM_NX_REG12[],load:0x1->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:00:00:01:01:02:03,resubmit(,8)
> >>  cookie=0x2c6a49b3, duration=1545.368s, table=0, n_packets=0,
> >> n_bytes=0, idle_age=1545, priority=180,dl_vlan=201
> >> actions=conjunction(100,2/2)
> >>  cookie=0xd8937e1d, duration=1545.332s, table=0, n_packets=0,
> >> n_bytes=0, idle_age=1545, priority=180,dl_vlan=101
> >> actions=conjunction(100,2/2)
> >>
> >> The packet matches:
> >> - clause 1 of the conjunction (100) because eth.src is the chassis-mac
> >> of HV1 (aa:bb:cc:dd:ee:11).
> >> - clause 2 of the conjunction because VLAN_ID is 201.
> >>
> >> And then matches this flow that fixes the eth.src in the packet to
> >> that of router-to-ls2:
> >> priority=180,conj_id=100,in_port=2,dl_vlan=201
> >> actions=.....,mod_dl_src:00:00:01:01:02:05,....
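The two-clause conjunctive match walked through above can be sketched as a small Python model. This is only an illustration under stated assumptions, not OVS or ovn-controller code: the `Packet` type and the helpers `matches`, `satisfied_conjunctions` and `restore_router_mac` are hypothetical names, while the MACs, in_port numbers, VLAN IDs and conjunction id 100 are taken from the HV2 flow dump.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    in_port: int
    dl_src: str
    dl_vlan: int


# Clause flows: (match fields, conj_id, clause index, number of clauses),
# i.e. the flows whose action is conjunction(100,k/2).
CLAUSE_FLOWS = [
    ({"dl_src": "aa:bb:cc:dd:ee:11"}, 100, 1, 2),
    ({"dl_vlan": 201}, 100, 2, 2),
    ({"dl_vlan": 101}, 100, 2, 2),
]

# Flows matching on conj_id plus in_port and VLAN; the last element is the
# router port MAC written back by mod_dl_src.
CONJ_FLOWS = [
    ({"in_port": 2, "dl_vlan": 201}, 100, "00:00:01:01:02:05"),
    ({"in_port": 3, "dl_vlan": 101}, 100, "00:00:01:01:02:03"),
]


def matches(pkt, fields):
    """True if every listed field of the packet has the given value."""
    return all(getattr(pkt, name) == value for name, value in fields.items())


def satisfied_conjunctions(pkt):
    """Return the conj_ids for which every clause has a matching flow."""
    hit = {}
    for fields, conj_id, clause, n_clauses in CLAUSE_FLOWS:
        if matches(pkt, fields):
            hit.setdefault((conj_id, n_clauses), set()).add(clause)
    return {cid for (cid, n), clauses in hit.items() if len(clauses) == n}


def restore_router_mac(pkt):
    """Return the router port MAC to restore, or None if nothing matches."""
    conj_ids = satisfied_conjunctions(pkt)
    for fields, conj_id, router_mac in CONJ_FLOWS:
        if conj_id in conj_ids and matches(pkt, fields):
            return router_mac
    return None


routed = Packet(in_port=2, dl_src="aa:bb:cc:dd:ee:11", dl_vlan=201)
local = Packet(in_port=2, dl_src="f0:00:00:00:00:22", dl_vlan=201)
```

In this model `restore_router_mac(routed)` yields the router-to-ls2 MAC, while `local` satisfies only the VLAN clause and is left untouched.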
> >
> >
> > The test has only a single router with multiple lswitches, but the loop
> > in the code is trying to handle the case when there are multiple router
> > ports on the same lswitch (with the same localnet port). In the loop, the
> > inport and VLAN don't change across iterations.
>
> Ok, I see what you mean now, I was under the impression that we'd
> always have a single router port per lswitch with vlan networks.
>
> I'll let Ankur comment on this because it seems to me that we can't
> restore the router port MAC if multiple router-ports map to the same
> VLAN-ID + localnet-port.

Ok, now that this is clarified (or at least we have a common understanding
of the problem), I think we can move this forward by merging the current
patch first, and address the source MAC conversion problem separately in a
new thread with Ankur.
I don't think it affects the solution in this patch: whether we keep the
current implementation, which replaces the chassis MAC with a possibly wrong
router MAC, or move to a new implementation that properly replaces the
chassis MAC with the corresponding router's MAC, the current patch should
work well without considering the chassis MAC.
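To make the source MAC conversion problem concrete, here is a minimal Python sketch of what happens when two router ports on the same lswitch produce flows with identical match conditions. It is a toy model, not ovn-controller code: `install` is a hypothetical helper, the first-one-wins handling of duplicates is an assumption of the sketch, and the second router MAC (00:00:01:01:02:07) is made up for illustration.

```python
def install(flow_table, priority, match, actions):
    """Add a flow keyed by (priority, match).

    Returns False and leaves the table unchanged when a flow with the same
    priority and match already exists (assumed first-one-wins behaviour).
    """
    key = (priority, frozenset(match.items()))
    if key in flow_table:
        return False
    flow_table[key] = actions
    return True


flows = {}
# Same localnet in_port and VLAN for both router ports, so the match is
# identical across loop iterations; only the action differs.
match = {"conj_id": 100, "in_port": 2, "dl_vlan": 201}
first = install(flows, 180, match, "mod_dl_src:00:00:01:01:02:05")
# Hypothetical second router port on the same lswitch.
second = install(flows, 180, match, "mod_dl_src:00:00:01:01:02:07")
```

Only one `mod_dl_src` action survives, so packets routed by the second router would have their source MAC rewritten to the first router port's MAC.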
I applied this to master.

Thanks,
Han
Dumitru Ceara Nov. 14, 2019, 8:47 a.m. UTC | #8
On Wed, Nov 13, 2019 at 8:52 PM Han Zhou <hzhou@ovn.org> wrote:
>
>
>
> On Wed, Nov 13, 2019 at 8:14 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > On Wed, Nov 13, 2019 at 4:30 PM Han Zhou <hzhou@ovn.org> wrote:
> > >
> > >
> > >
> > > On Wed, Nov 13, 2019 at 2:42 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > >>
> > >> On Tue, Nov 12, 2019 at 8:50 PM Han Zhou <hzhou@ovn.org> wrote:
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Tue, Nov 12, 2019 at 10:10 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > >> > >
> > >> > > On Tue, Nov 12, 2019 at 6:17 PM Han Zhou <zhouhan@gmail.com> wrote:
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Tue, Nov 12, 2019 at 2:29 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > >> > > > >
> > >> > > > > ARP request and ND NS packets for router owned IPs were being
> > >> > > > > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > >> > > > > However this creates a scaling issue in scenarios where aggregation
> > >> > > > > logical switches are connected to more logical routers (~350). The
> > >> > > > > logical pipelines of all routers would have to be executed before the
> > >> > > > > packet is finally replied to by a single router, the owner of the IP
> > >> > > > > address.
> > >> > > > >
> > >> > > > > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > >> > > > > for ARP requests that will be replied by a single router. The packets
> > >> > > > > are forwarded only to the router port that owns the target IP address.
> > >> > > > >
> > >> > > > > IPs that are owned by the routers and for which this fix applies are:
> > >> > > > > - IP addresses configured on the router ports.
> > >> > > > > - VIPs.
> > >> > > > > - NAT IPs.
> > >> > > > >
> > >> > > > > Reported-at: https://bugzilla.redhat.com/1756945
> > >> > > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > >> > > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > >> > > > >
> > >> > > > > ---
> > >> > > > > v7:
> > >> > > > > - Address Han's comments:
> > >> > > > >     - Remove flooding for all ARPs received on VLAN networks. To avoid
> > >> > > > >       that we now identify self originated (G)ARPs by matching on source
> > >> > > > >       MAC address too.
> > >> > > > >     - Rename REGBIT_NOT_VXLAN to FLAGBIT_NOT_VXLAN.
> > >> > > > > - Fix ovn-sb manpage.
> > >> > > > > - Split patch in a series of 2:
> > >> > > > >     - patch1: fixes the get_router_load_balancer_ips() function.
> > >> > > > >     - patch2: limits the ARP/ND broadcast domain.
> > >> > > > > v6:
> > >> > > > > - Address Han's comments:
> > >> > > > >     - remove flooding of ARPs targeting OVN owned IP addresses.
> > >> > > > >     - update ovn-architecture documentation.
> > >> > > > >     - rename ARP handling functions.
> > >> > > > >     - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
> > >> > > > >     account the new way of forwarding ARPs.
> > >> > > > > - Also, properly deal with ARP packets on VLAN-backed networks.
> > >> > > > > v5: Address Numan's comments: update comments & make autotest more
> > >> > > > >     robust.
> > >> > > > > v4: Rebase.
> > >> > > > > v3: Properly deal with VXLAN traffic. Address review comments from
> > >> > > > >     Numan (add autotests). Fix function get_router_load_balancer_ips.
> > >> > > > >     Rebase -> deal with IPv6 NAT too.
> > >> > > > > v2: Move ARP broadcast domain limiting to table S_SWITCH_IN_L2_LKUP to
> > >> > > > > address localnet ports too.
> > >> > > > > ---
> > >> > > > >  northd/ovn-northd.8.xml |   14 ++
> > >> > > > >  northd/ovn-northd.c     |  230 +++++++++++++++++++++++++++++++----
> > >> > > > >  ovn-architecture.7.xml  |   19 +++
> > >> > > > >  tests/ovn.at            |  307 +++++++++++++++++++++++++++++++++++++++++++++--
> > >> > > > >  4 files changed, 530 insertions(+), 40 deletions(-)
> > >> > > > >
> > >> > > > > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > >> > > > > index 0a33dcd..344cc0d 100644
> > >> > > > > --- a/northd/ovn-northd.8.xml
> > >> > > > > +++ b/northd/ovn-northd.8.xml
> > >> > > > > @@ -1005,6 +1005,20 @@ output;
> > >> > > > >        </li>
> > >> > > > >
> > >> > > > >        <li>
> > >> > > > > +        Priority-80 flows for each port connected to a logical router
> > >> > > > > +        matching self originated GARP/ARP request/ND packets. These packets
> > >> > > > > +        are flooded to the <code>MC_FLOOD</code> which contains all logical
> > >> > > > > +        ports.
> > >> > > > > +      </li>
> > >> > > > > +
> > >> > > > > +      <li>
> > >> > > > > +        Priority-75 flows for each IP address/VIP/NAT address owned by a
> > >> > > > > +        router port connected to the switch. These flows match ARP requests
> > >> > > > > +        and ND packets for the specific IP addresses.  Matched packets are
> > >> > > > > +        forwarded only to the router that owns the IP address.
> > >> > > > > +      </li>
> > >> > > > > +
> > >> > > > > +      <li>
> > >> > > > >          A priority-70 flow that outputs all packets with an Ethernet broadcast
> > >> > > > >          or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
> > >> > > > >          multicast group.
> > >> > > > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > >> > > > > index 32f3200..d6beb97 100644
> > >> > > > > --- a/northd/ovn-northd.c
> > >> > > > > +++ b/northd/ovn-northd.c
> > >> > > > > @@ -210,6 +210,8 @@ enum ovn_stage {
> > >> > > > >  #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
> > >> > > > >  #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
> > >> > > > >
> > >> > > > > +#define FLAGBIT_NOT_VXLAN "flags[1] == 0"
> > >> > > > > +
> > >> > > > >  /* Returns an "enum ovn_stage" built from the arguments. */
> > >> > > > >  static enum ovn_stage
> > >> > > > >  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
> > >> > > > > @@ -1202,6 +1204,34 @@ ovn_port_allocate_key(struct ovn_datapath *od)
> > >> > > > >                            1, (1u << 15) - 1, &od->port_key_hint);
> > >> > > > >  }
> > >> > > > >
> > >> > > > > +/* Returns true if the logical switch port 'enabled' column is empty or
> > >> > > > > + * set to true.  Otherwise, returns false. */
> > >> > > > > +static bool
> > >> > > > > +lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> > >> > > > > +{
> > >> > > > > +    return !lsp->n_enabled || *lsp->enabled;
> > >> > > > > +}
> > >> > > > > +
> > >> > > > > +/* Returns true only if the logical switch port 'up' column is set to true.
> > >> > > > > + * Otherwise, if the column is not set or set to false, returns false. */
> > >> > > > > +static bool
> > >> > > > > +lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> > >> > > > > +{
> > >> > > > > +    return lsp->n_up && *lsp->up;
> > >> > > > > +}
> > >> > > > > +
> > >> > > > > +static bool
> > >> > > > > +lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> > >> > > > > +{
> > >> > > > > +    return !strcmp(nbsp->type, "external");
> > >> > > > > +}
> > >> > > > > +
> > >> > > > > +static bool
> > >> > > > > +lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> > >> > > > > +{
> > >> > > > > +    return !lrport->enabled || *lrport->enabled;
> > >> > > > > +}
> > >> > > > > +
> > >> > > > >  static char *
> > >> > > > >  chassis_redirect_name(const char *port_name)
> > >> > > > >  {
> > >> > > > > @@ -3750,28 +3780,6 @@ build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
> > >> > > > >
> > >> > > > >  }
> > >> > > > >
> > >> > > > > -/* Returns true if the logical switch port 'enabled' column is empty or
> > >> > > > > - * set to true.  Otherwise, returns false. */
> > >> > > > > -static bool
> > >> > > > > -lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> > >> > > > > -{
> > >> > > > > -    return !lsp->n_enabled || *lsp->enabled;
> > >> > > > > -}
> > >> > > > > -
> > >> > > > > -/* Returns true only if the logical switch port 'up' column is set to true.
> > >> > > > > - * Otherwise, if the column is not set or set to false, returns false. */
> > >> > > > > -static bool
> > >> > > > > -lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> > >> > > > > -{
> > >> > > > > -    return lsp->n_up && *lsp->up;
> > >> > > > > -}
> > >> > > > > -
> > >> > > > > -static bool
> > >> > > > > -lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> > >> > > > > -{
> > >> > > > > -    return !strcmp(nbsp->type, "external");
> > >> > > > > -}
> > >> > > > > -
> > >> > > > >  static bool
> > >> > > > >  build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
> > >> > > > >                      struct ds *options_action, struct ds *response_action,
> > >> > > > > @@ -5174,6 +5182,170 @@ build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
> > >> > > > >      }
> > >> > > > >  }
> > >> > > > >
> > >> > > > > +/*
> > >> > > > > + * Ingress table 17: Flows that flood self originated ARP/ND packets in the
> > >> > > > > + * switching domain.
> > >> > > > > + */
> > >> > > > > +static void
> > >> > > > > +build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,
> > >> > > > > +                                           uint32_t priority,
> > >> > > > > +                                           struct ovn_datapath *od,
> > >> > > > > +                                           struct hmap *lflows)
> > >> > > > > +{
> > >> > > > > +    struct ds match = DS_EMPTY_INITIALIZER;
> > >> > > > > +    struct ds eth_src = DS_EMPTY_INITIALIZER;
> > >> > > > > +
> > >> > > > > +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> > >> > > > > +     * Determine that packets are self originated by also matching on
> > >> > > > > +     * source MAC. Matching on ingress port is not reliable in case this
> > >> > > > > +     * is a VLAN-backed network.
> > >> > > > > +     * Priority: 80.
> > >> > > > > +     */
> > >> > > > > +    ds_put_format(&eth_src, "{ %s, ", op->lrp_networks.ea_s);
> > >> > > > > +    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
> > >> > > > > +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> > >> > > > > +
> > >> > > > > +        if (!nat->external_mac) {
> > >> > > > > +            continue;
> > >> > > > > +        }
> > >> > > > > +
> > >> > > > > +        ds_put_format(&eth_src, "%s, ", nat->external_mac);
> > >> > > > > +    }
> > >> > > >
> > >> > > > As discussed, we need to add the chassis-unique MACs that are
> > >> > > > configured in external-ids:ovn-chassis-mac-mappings of the Chassis
> > >> > > > records, but I didn't find this in the patch. A VLAN-backed logical
> > >> > > > router may not work without this.
> > >> > >
> > >> > > Hi Han,
> > >> > >
> > >> > > Maybe I misunderstood but in the discussion on v6 I mentioned that I
> > >> > > don't think we need to add the MACs from
> > >> > > external-ids:ovn-chassis-mac-mappings.
> > >> > >
> > >> > > Whenever chassis MACs are configured, in ovn-controller we create a
> > >> > > conjunctive flow matching on any of the remote chassis MAC addresses:
> > >> > > https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L501
> > >> > >
> > >> > > And for all incoming traffic that matches this conjunction and VLAN-id
> > >> > > we change the MAC back to that of the logical router port:
> > >> > > https://github.com/ovn-org/ovn/blob/master/controller/physical.c#L558
> > >> > >
> > >> > > Isn't this enough to cover the self originated ARP packets?
> > >> > >
> > >> > > Thanks,
> > >> > > Dumitru
> > >> > >
> > >> >
> > >> > Dumitru, sorry, I misunderstood: you actually meant that it was OK
> > >> > not to add the chassis-unique MACs. I also didn't realize that there
> > >> > are already flows that change the chassis-unique MACs back to the
> > >> > logical router port's MACs.
> > >> > With this precondition I think your patch should be good enough.
> > >> >
> > >> > However, I revisited the function put_replace_chassis_mac_flows()
> > >> > and had some difficulty understanding how it would work. For these
> > >> > flows, the match conditions are:
> > >> > - in_port of the localnet port
> > >> > - conjunction id: CHASSIS_MAC_TO_ROUTER_MAC_CONJID (value 100)
> > >> > - vlan tag associated with the localnet port
> > >> > The flow is added in a loop over the peer ports, to replace the MAC
> > >> > for each router port on that logical switch. Since the match
> > >> > conditions are all the same, wouldn't that result in only one flow
> > >> > taking effect and the others getting dropped? I wonder whether some
> > >> > other port-specific match condition should be added so that the MAC
> > >> > can be replaced back to its original router port MAC accordingly.
> > >>
> > >> If I understand correctly the differentiator is the ingress localnet
> > >> port id and VLAN-ID.
> > >>
> > >> Looking at the autotest for "ovn -- 2 HVs, 2 lports/HV, localnet
> > >> ports, DVR chassis mac" the network is:
> > >>
> > >> $ ovn-nbctl --db=unix:$PWD/./ovn-nb/ovn-nb.sock show
> > >> switch df662f28-4a42-4ac4-aadb-89563347cae1 (ls1)
> > >>     port ls1-to-router
> > >>         type: router
> > >>         router-port: router-to-ls1
> > >>     port lp11
> > >>         addresses: ["f0:00:00:00:00:11 192.168.1.1"]
> > >>     port ln1
> > >>         type: localnet
> > >>         parent:
> > >>         tag: 101
> > >>         addresses: ["unknown"]
> > >> switch 91ddb7b2-df8c-42de-a899-6bb35ee08a16 (ls2)
> > >>     port ls2-to-router
> > >>         type: router
> > >>         router-port: router-to-ls2
> > >>     port lp22
> > >>         addresses: ["f0:00:00:00:00:22 192.168.2.2"]
> > >>     port ln2
> > >>         type: localnet
> > >>         parent:
> > >>         tag: 201
> > >>         addresses: ["unknown"]
> > >> router 4186cb04-3370-401c-9d65-29cb2af48af1 (router)
> > >>     port router-to-ls1
> > >>         mac: "00:00:01:01:02:03"
> > >>         networks: ["192.168.1.3/24"]
> > >>     port router-to-ls2
> > >>         mac: "00:00:01:01:02:05"
> > >>         networks: ["192.168.2.3/24"]
> > >>
> > >> $ ovn-sbctl --db=unix:$PWD/./ovn-sb/ovn-sb.sock list chassis | grep -E
> > >> "uuid|external_ids"
> > >> _uuid               : 31ba7484-e0af-4326-a71b-c3f32e52e547
> > >> external_ids        : {datapath-type="",
> > >> iface-types="dummy,dummy-internal,dummy-pmd,erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan",
> > >> ovn-bridge-mappings="phys:br-phys",
> > >> ovn-chassis-mac-mappings="phys:aa:bb:cc:dd:ee:11", ovn-cms-options=""}
> > >>
> > >> _uuid               : a04b5ad2-d76f-42c4-9f2b-21816e2624e2
> > >> external_ids        : {datapath-type="",
> > >> iface-types="dummy,dummy-internal,dummy-pmd,erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan",
> > >> ovn-bridge-mappings="phys:br-phys",
> > >> ovn-chassis-mac-mappings="phys:aa:bb:cc:dd:ee:22", ovn-cms-options=""}
> > >>
> > >> On HV1 (chassis-mac aa:bb:cc:dd:ee:11) we have the following flow to
> > >> replace source MAC for already routed packets:
> > >>
> > >> $ OVS_RUNDIR=$PWD/hv1 ovs-ofctl dump-flows br-int | grep
> > >> aa:bb:cc:dd:ee:11
> > >>  cookie=0xe0f6b198, duration=580.460s, table=65, n_packets=0,
> > >> n_bytes=0, idle_age=580,
> > >> priority=150,reg15=0x1,metadata=0x1,dl_src=00:00:01:01:02:03
> > >> actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:101,output:1
> > >>  cookie=0xfddc60a, duration=580.420s, table=65, n_packets=1,
> > >> n_bytes=42, idle_age=580,
> > >> priority=150,reg15=0x1,metadata=0x2,dl_src=00:00:01:01:02:05
> > >> actions=mod_dl_src:aa:bb:cc:dd:ee:11,mod_vlan_vid:201,output:3
> > >>
> > >> In the above output, the first flow is for packets destined to hosts
> > >> in LS1 (VLAN 101) and the second flow is for packets destined to hosts
> > >> in LS2 (VLAN 201).
> > >>
> > >> If we inject a packet from lp11:
> > >> in_port=lp11, eth.src=f0:00:00:00:00:11, eth.dst=00:00:01:01:02:03,
> > >> ip.src=192.168.1.1, ip.dst=192.168.2.2
> > >>
> > >> The router pipeline is executed on HV1, the eth.src address is set to
> > >> that of the router-to-ls2 port (00:00:01:01:02:05) and finally the
> > >> entry in table=65 is hit and eth.src is changed to the configured
> > >> chassis-mac (aa:bb:cc:dd:ee:11).
> > >>
> > >> The packet is sent out on port ln2:
> > >> $ OVS_RUNDIR=$PWD/hv1 ovs-vsctl --column ofport find interface
> > >> name=patch-br-int-to-ln2
> > >> ofport              : 3
> > >>
> > >> Then it is received on HV2, where we have the following flows:
> > >> $ OVS_RUNDIR=$PWD/hv2 ovs-ofctl dump-flows br-int | grep conj
> > >>  cookie=0x31ba7484, duration=1545.430s, table=0, n_packets=0,
> > >> n_bytes=0, idle_age=1545, priority=180,dl_src=aa:bb:cc:dd:ee:11
> > >> actions=conjunction(100,1/2)
> > >>  cookie=0x2c6a49b3, duration=1545.368s, table=0, n_packets=1,
> > >> n_bytes=46, idle_age=1545,
> > >> priority=180,conj_id=100,in_port=2,dl_vlan=201
> > >> actions=strip_vlan,load:0x4->NXM_NX_REG13[],load:0x3->NXM_NX_REG11[],load:0x2->NXM_NX_REG12[],load:0x2->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:00:00:01:01:02:05,resubmit(,8)
> > >>  cookie=0xd8937e1d, duration=1545.332s, table=0, n_packets=0,
> > >> n_bytes=0, idle_age=1545,
> > >> priority=180,conj_id=100,in_port=3,dl_vlan=101
> > >> actions=strip_vlan,load:0x9->NXM_NX_REG13[],load:0x5->NXM_NX_REG11[],load:0x6->NXM_NX_REG12[],load:0x1->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],mod_dl_src:00:00:01:01:02:03,resubmit(,8)
> > >>  cookie=0x2c6a49b3, duration=1545.368s, table=0, n_packets=0,
> > >> n_bytes=0, idle_age=1545, priority=180,dl_vlan=201
> > >> actions=conjunction(100,2/2)
> > >>  cookie=0xd8937e1d, duration=1545.332s, table=0, n_packets=0,
> > >> n_bytes=0, idle_age=1545, priority=180,dl_vlan=101
> > >> actions=conjunction(100,2/2)
> > >>
> > >> The packet matches:
> > >> - clause 1 of the conjunction (100) because eth.src is the chassis-mac
> > >> of HV1 (aa:bb:cc:dd:ee:11).
> > >> - clause 2 of the conjunction because VLAN_ID is 201.
> > >>
> > >> And then matches this flow that fixes the eth.src in the packet to
> > >> that of router-to-ls2:
> > >> priority=180,conj_id=100,in_port=2,dl_vlan=201
> > >> actions=.....,mod_dl_src:00:00:01:01:02:05,....
> > >
> > >
> > > The test has only a single router with multiple lswitches, but the loop in the code is trying to handle the case where there are multiple router ports on the same lswitch (with the same localnet port). In the loop, the inport and VLAN don't change across iterations.
> >
> > Ok, I see what you mean now, I was under the impression that we'd
> > always have a single router port per lswitch with vlan networks.
> >
> > I'll let Ankur comment on this because it seems to me that we can't
> > restore the router port MAC if multiple router-ports map to the same
> > VLAN-ID + localnet-port.
>
> Ok, now that this is clarified (we at least have a common understanding of the problem), I think we can move this forward by merging the current patch first, and address the source MAC conversion problem separately in a new thread with Ankur.
> I think it shouldn't affect the solution in this patch: whether the current implementation replaces the chassis MAC with a possibly wrong router MAC, or a new implementation properly replaces the chassis MAC with the corresponding router's MAC, the current patch should work well without considering the chassis MAC.
> I applied this to master.
>
> Thanks,
> Han
>


Thank you!
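
The ambiguity raised in the thread above (several router ports behind the same localnet port and VLAN yielding flows with an identical match) can be illustrated with a minimal sketch. The "flows" here are plain text lines written for illustration, not real ovn-controller output; the second MAC (00:00:01:01:02:07) and the helper name find_ambiguous_matches are hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration: each line is "<match> <action>".
# The second router MAC (00:00:01:01:02:07) is made up for this example.

# find_ambiguous_matches reads "<match> <action>" lines on stdin and prints
# every match that is paired with more than one distinct action.
find_ambiguous_matches() {
    sort -u |
        awk '{ count[$1]++ } END { for (m in count) if (count[m] > 1) print m }' |
        sort
}

flows='conj_id=100,in_port=2,dl_vlan=201 mod_dl_src:00:00:01:01:02:05
conj_id=100,in_port=2,dl_vlan=201 mod_dl_src:00:00:01:01:02:07
conj_id=100,in_port=3,dl_vlan=101 mod_dl_src:00:00:01:01:02:03'

# Two router ports share in_port=2 and dl_vlan=201, so their flows collide.
printf '%s\n' "$flows" | find_ambiguous_matches
# prints: conj_id=100,in_port=2,dl_vlan=201
```

With identical match fields, only one of the colliding actions can take effect for a given packet, which is why a port-specific match condition would be needed to restore the right router port MAC.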

Patch

diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index 0a33dcd..344cc0d 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -1005,6 +1005,20 @@  output;
       </li>
 
       <li>
+        Priority-80 flows for each port connected to a logical router
+        matching self originated GARP/ARP request/ND packets. These packets
+        are flooded to the <code>MC_FLOOD</code> multicast group, which
+        contains all logical ports.
+      </li>
+
+      <li>
+        Priority-75 flows for each IP address/VIP/NAT address owned by a
+        router port connected to the switch. These flows match ARP requests
+        and ND packets for the specific IP addresses.  Matched packets are
+        forwarded only to the router that owns the IP address.
+      </li>
+
+      <li>
         A priority-70 flow that outputs all packets with an Ethernet broadcast
         or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
         multicast group.
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 32f3200..d6beb97 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -210,6 +210,8 @@  enum ovn_stage {
 #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
 #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
 
+#define FLAGBIT_NOT_VXLAN "flags[1] == 0"
+
 /* Returns an "enum ovn_stage" built from the arguments. */
 static enum ovn_stage
 ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
@@ -1202,6 +1204,34 @@  ovn_port_allocate_key(struct ovn_datapath *od)
                           1, (1u << 15) - 1, &od->port_key_hint);
 }
 
+/* Returns true if the logical switch port 'enabled' column is empty or
+ * set to true.  Otherwise, returns false. */
+static bool
+lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
+{
+    return !lsp->n_enabled || *lsp->enabled;
+}
+
+/* Returns true only if the logical switch port 'up' column is set to true.
+ * Otherwise, if the column is not set or set to false, returns false. */
+static bool
+lsp_is_up(const struct nbrec_logical_switch_port *lsp)
+{
+    return lsp->n_up && *lsp->up;
+}
+
+static bool
+lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
+{
+    return !strcmp(nbsp->type, "external");
+}
+
+static bool
+lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
+{
+    return !lrport->enabled || *lrport->enabled;
+}
+
 static char *
 chassis_redirect_name(const char *port_name)
 {
@@ -3750,28 +3780,6 @@  build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
 
 }
 
-/* Returns true if the logical switch port 'enabled' column is empty or
- * set to true.  Otherwise, returns false. */
-static bool
-lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
-{
-    return !lsp->n_enabled || *lsp->enabled;
-}
-
-/* Returns true only if the logical switch port 'up' column is set to true.
- * Otherwise, if the column is not set or set to false, returns false. */
-static bool
-lsp_is_up(const struct nbrec_logical_switch_port *lsp)
-{
-    return lsp->n_up && *lsp->up;
-}
-
-static bool
-lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
-{
-    return !strcmp(nbsp->type, "external");
-}
-
 static bool
 build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
                     struct ds *options_action, struct ds *response_action,
@@ -5174,6 +5182,170 @@  build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
     }
 }
 
+/*
+ * Ingress table 17: Flows that flood self originated ARP/ND packets in the
+ * switching domain.
+ */
+static void
+build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,
+                                           uint32_t priority,
+                                           struct ovn_datapath *od,
+                                           struct hmap *lflows)
+{
+    struct ds match = DS_EMPTY_INITIALIZER;
+    struct ds eth_src = DS_EMPTY_INITIALIZER;
+
+    /* Self originated (G)ARP requests/ND need to be flooded as usual.
+     * Determine that packets are self originated by also matching on
+     * source MAC. Matching on ingress port is not reliable in case this
+     * is a VLAN-backed network.
+     * Priority: 80.
+     */
+    ds_put_format(&eth_src, "{ %s, ", op->lrp_networks.ea_s);
+    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
+        const struct nbrec_nat *nat = op->od->nbr->nat[i];
+
+        if (!nat->external_mac) {
+            continue;
+        }
+
+        ds_put_format(&eth_src, "%s, ", nat->external_mac);
+    }
+    ds_chomp(&eth_src, ' ');
+    ds_chomp(&eth_src, ',');
+    ds_put_cstr(&eth_src, "}");
+
+    ds_put_format(&match, "eth.src == %s && (arp.op == 1 || nd_ns)",
+                  ds_cstr(&eth_src));
+    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
+                  ds_cstr(&match),
+                  "outport = \""MC_FLOOD"\"; output;");
+
+    ds_destroy(&match);
+    ds_destroy(&eth_src);
+}
+
+/*
+ * Ingress table 17: Flows that forward ARP/ND requests only to the routers
+ * that own the addresses. Other ARP/ND packets are still flooded in the
+ * switching domain as regular broadcast.
+ */
+static void
+build_lswitch_rport_arp_req_flow_for_ip(struct sset *ips,
+                                        int addr_family,
+                                        struct ovn_port *patch_op,
+                                        struct ovn_datapath *od,
+                                        uint32_t priority,
+                                        struct hmap *lflows)
+{
+    struct ds match   = DS_EMPTY_INITIALIZER;
+    struct ds actions = DS_EMPTY_INITIALIZER;
+
+    /* Packets received from VXLAN tunnels have already been through the
+     * router pipeline so we should skip them. Normally this is done by the
+     * multicast_group implementation (VXLAN packets skip table 32 which
+     * delivers to patch ports) but we're bypassing multicast_groups.
+     */
+    ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
+
+    if (addr_family == AF_INET) {
+        ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
+    } else {
+        ds_put_cstr(&match, "nd_ns && nd.target == { ");
+    }
+
+    const char *ip_address;
+    SSET_FOR_EACH (ip_address, ips) {
+        ds_put_format(&match, "%s, ", ip_address);
+    }
+
+    ds_chomp(&match, ' ');
+    ds_chomp(&match, ',');
+    ds_put_cstr(&match, "}");
+
+    /* Send the packet only to the router pipeline and skip flooding it
+     * in the broadcast domain.
+     */
+    ds_put_format(&actions, "outport = %s; output;", patch_op->json_key);
+    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
+                  ds_cstr(&match), ds_cstr(&actions));
+
+    ds_destroy(&match);
+    ds_destroy(&actions);
+}
+
+/*
+ * Ingress table 17: Flows that forward ARP/ND requests only to the routers
+ * that own the addresses.
+ * Priorities:
+ * - 80: self originated GARPs that need to follow regular processing.
+ * - 75: ARP requests to router owned IPs (interface IP/LB/NAT).
+ */
+static void
+build_lswitch_rport_arp_req_flows(struct ovn_port *op,
+                                  struct ovn_datapath *sw_od,
+                                  struct ovn_port *sw_op,
+                                  struct hmap *lflows)
+{
+    if (!op || !op->nbrp) {
+        return;
+    }
+
+    if (!lrport_is_enabled(op->nbrp)) {
+        return;
+    }
+
+    /* Self originated (G)ARP requests/ND need to be flooded as usual.
+     * Priority: 80.
+     */
+    build_lswitch_rport_arp_req_self_orig_flow(op, 80, sw_od, lflows);
+
+    /* Forward ARP requests for owned IP addresses (L3, VIP, NAT) only to this
+     * router port.
+     * Priority: 75.
+     */
+    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
+    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
+
+    for (size_t i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
+        sset_add(&all_ips_v4, op->lrp_networks.ipv4_addrs[i].addr_s);
+    }
+    for (size_t i = 0; i < op->lrp_networks.n_ipv6_addrs; i++) {
+        sset_add(&all_ips_v6, op->lrp_networks.ipv6_addrs[i].addr_s);
+    }
+
+    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
+
+    for (size_t i = 0; i < op->od->nbr->n_nat; i++) {
+        const struct nbrec_nat *nat = op->od->nbr->nat[i];
+
+        if (!strcmp(nat->type, "snat")) {
+            continue;
+        }
+
+        ovs_be32 ip;
+        ovs_be32 mask;
+        struct in6_addr ipv6;
+        struct in6_addr mask_v6;
+
+        if (ip_parse_masked(nat->external_ip, &ip, &mask)) {
+            if (!ipv6_parse_masked(nat->external_ip, &ipv6, &mask_v6)) {
+                sset_add(&all_ips_v6, nat->external_ip);
+            }
+        } else {
+            sset_add(&all_ips_v4, nat->external_ip);
+        }
+    }
+
+    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v4, AF_INET, sw_op,
+                                            sw_od, 75, lflows);
+    build_lswitch_rport_arp_req_flow_for_ip(&all_ips_v6, AF_INET6, sw_op,
+                                            sw_od, 75, lflows);
+
+    sset_destroy(&all_ips_v4);
+    sset_destroy(&all_ips_v6);
+}
+
 static void
 build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
                     struct hmap *port_groups, struct hmap *lflows,
@@ -5761,6 +5933,14 @@  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
             continue;
         }
 
+        /* For ports connected to logical routers add flows to bypass the
+         * broadcast flooding of ARP/ND requests in table 17. We direct the
+         * requests only to the router port that owns the IP address.
+         */
+        if (!strcmp(op->nbsp->type, "router")) {
+            build_lswitch_rport_arp_req_flows(op->peer, op->od, op, lflows);
+        }
+
         for (size_t i = 0; i < op->nbsp->n_addresses; i++) {
             /* Addresses are owned by the logical port.
              * Ethernet address followed by zero or more IPv4
@@ -5892,12 +6072,6 @@  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
     ds_destroy(&actions);
 }
 
-static bool
-lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
-{
-    return !lrport->enabled || *lrport->enabled;
-}
-
 /* Returns a string of the IP address of the router port 'op' that
  * overlaps with 'ip_s".  If one is not found, returns NULL.
  *
diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
index 7966b65..c43f16d 100644
--- a/ovn-architecture.7.xml
+++ b/ovn-architecture.7.xml
@@ -1390,6 +1390,25 @@ 
     http://docs.openvswitch.org/en/latest/topics/high-availability.
   </p>
 
+  <h3>ARP request and ND NS packet processing</h3>
+
+  <p>
+    Due to the fact that ARP requests and ND NS packets are usually broadcast
+    packets, for performance reasons, OVN deals with requests that target OVN
+    owned IP addresses (i.e., IP addresses configured on the router ports,
+    VIPs, NAT IPs) in a specific way and only forwards them to the logical
+    router that owns the target IP address. This behavior is different than
+    that of traditional switches and implies that other routers/hosts
+    connected to the logical switch will not learn the MAC/IP binding from
+    the request packet.
+  </p>
+
+  <p>
+    All other ARP and ND packets are flooded in the L2 broadcast domain and
+    to all attached logical patch ports.
+  </p>
+
+
   <h2>Multiple localnet logical switches connected to a Logical Router</h2>
 
   <p>
diff --git a/tests/ovn.at b/tests/ovn.at
index 3e429e3..26e33d2 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -2877,7 +2877,7 @@  test_ip() {
     done
 }
 
-# test_arp INPORT SHA SPA TPA [REPLY_HA]
+# test_arp INPORT SHA SPA TPA FLOOD [REPLY_HA]
 #
 # Causes a packet to be received on INPORT.  The packet is an ARP
 # request with SHA, SPA, and TPA as specified.  If REPLY_HA is provided, then
@@ -2888,21 +2888,25 @@  test_ip() {
 # SHA and REPLY_HA are each 12 hex digits.
 # SPA and TPA are each 8 hex digits.
 test_arp() {
-    local inport=$1 sha=$2 spa=$3 tpa=$4 reply_ha=$5
+    local inport=$1 sha=$2 spa=$3 tpa=$4 flood=$5 reply_ha=$6
     local request=ffffffffffff${sha}08060001080006040001${sha}${spa}ffffffffffff${tpa}
     hv=hv`vif_to_hv $inport`
     as $hv ovs-appctl netdev-dummy/receive vif$inport $request
     as $hv ovs-appctl ofproto/trace br-int in_port=$inport $request
 
     # Expect to receive the broadcast ARP on the other logical switch ports if
-    # IP address is not configured to the switch patch port.
+    # IP address is not configured on the switch patch port or on the router
+    # port (i.e., $flood == 1).
     local i=`vif_to_ls $inport`
     local j k
     for j in 1 2 3; do
         for k in 1 2 3; do
-            # 192.168.33.254 is configured to the switch patch port for lrp33,
-            # so no ARP flooding expected for it.
-            if test $i$j$k != $inport && test $tpa != `ip_to_hex 192 168 33 254`; then
+            # Skip ingress port.
+            if test $i$j$k == $inport; then
+                continue
+            fi
+
+            if test X$flood == X1; then
                 echo $request >> $i$j$k.expected
             fi
         done
@@ -3039,9 +3043,9 @@  for i in 1 2 3; do
       otherip=`ip_to_hex 192 168 $i$j 55` # Some other IP in subnet
       externalip=`ip_to_hex 1 2 3 4`      # Some other IP not in subnet
 
-      test_arp $i$j$k $smac $sip        $rip        $rmac      #4
-      test_arp $i$j$k $smac $otherip    $rip        $rmac      #5
-      test_arp $i$j$k $smac $sip        $otherip               #6
+      test_arp $i$j$k $smac $sip        $rip       0     $rmac       #4
+      test_arp $i$j$k $smac $otherip    $rip       0     $rmac       #5
+      test_arp $i$j$k $smac $sip        $otherip   1                 #6
 
       # When rip is 192.168.33.254, ARP request from externalip won't be
       # filtered, because 192.168.33.254 is configured to switch peer port
@@ -3050,7 +3054,7 @@  for i in 1 2 3; do
       if test $i = 3 && test $j = 3; then
         lrp33_rsp=$rmac
       fi
-      test_arp $i$j$k $smac $externalip $rip        $lrp33_rsp #7
+      test_arp $i$j$k $smac $externalip $rip       0      $lrp33_rsp #7
 
       # MAC binding should be learned from ARP request.
       host_mac_pretty=f0:00:00:00:0$i:$j$k
@@ -9595,7 +9599,7 @@  ovn-nbctl --wait=hv --timeout=3 sync
 # Check that there is a logical flow in logical switch foo's pipeline
 # to set the outport to rp-foo (which is expected).
 OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
-grep rp-foo | grep -v is_chassis_resident | wc -l`])
+grep rp-foo | grep -v is_chassis_resident | grep priority=50 -c`])
 
 # Set the option 'reside-on-redirect-chassis' for foo
 ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
@@ -9603,7 +9607,7 @@  ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
 # to set the outport to rp-foo with the condition is_chassis_redirect.
 ovn-sbctl dump-flows foo
 OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
-grep rp-foo | grep is_chassis_resident | wc -l`])
+grep rp-foo | grep is_chassis_resident | grep priority=50 -c`])
 
 echo "---------NB dump-----"
 ovn-nbctl show
@@ -16694,3 +16698,282 @@  as hv4 ovs-appctl fdb/show br-phys
 OVN_CLEANUP([hv1],[hv2],[hv3],[hv4])
 
 AT_CLEANUP
+
+AT_SETUP([ovn -- ARP/ND request broadcast limiting])
+AT_SKIP_IF([test $HAVE_PYTHON = no])
+ovn_start
+
+ip_to_hex() {
+    printf "%02x%02x%02x%02x" "$@"
+}
+
+send_arp_request() {
+    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5
+    local eth_dst=ffffffffffff
+    local eth_type=0806
+    local eth=${eth_dst}${eth_src}${eth_type}
+
+    local arp=0001080006040001${eth_src}${spa}${eth_dst}${tpa}
+
+    local request=${eth}${arp}
+    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
+}
+
+send_nd_ns() {
+    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5 cksum=$6
+
+    local eth_dst=ffffffffffff
+    local eth_type=86dd
+    local eth=${eth_dst}${eth_src}${eth_type}
+
+    local ip_vhlen=60000000
+    local ip_plen=0020
+    local ip_next=3a
+    local ip_ttl=ff
+    local ip=${ip_vhlen}${ip_plen}${ip_next}${ip_ttl}${spa}${tpa}
+
+    # Neighbor Solicitation
+    local icmp6_type=87
+    local icmp6_code=00
+    local icmp6_rsvd=00000000
+    # ICMPv6 source lla option
+    local icmp6_opt=01
+    local icmp6_optlen=01
+    local icmp6=${icmp6_type}${icmp6_code}${cksum}${icmp6_rsvd}${tpa}${icmp6_opt}${icmp6_optlen}${eth_src}
+
+    local request=${eth}${ip}${icmp6}
+
+    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
+}
+
+src_mac=000000000001
+
+net_add n1
+sim_add hv1
+as hv1
+ovs-vsctl add-br br-phys
+ovn_attach n1 br-phys 192.168.0.1
+
+ovs-vsctl -- add-port br-int hv1-vif1 -- \
+    set interface hv1-vif1 external-ids:iface-id=sw-agg-ext \
+    options:tx_pcap=hv1/vif1-tx.pcap \
+    options:rxq_pcap=hv1/vif1-rx.pcap \
+    ofport-request=1
+
+# One Aggregation Switch connected to two Logical networks (routers).
+ovn-nbctl ls-add sw-agg
+ovn-nbctl lsp-add sw-agg sw-agg-ext \
+    -- lsp-set-addresses sw-agg-ext 00:00:00:00:00:01
+
+ovn-nbctl lsp-add sw-agg sw-rtr1                   \
+    -- lsp-set-type sw-rtr1 router                 \
+    -- lsp-set-addresses sw-rtr1 00:00:00:00:01:00 \
+    -- lsp-set-options sw-rtr1 router-port=rtr1-sw
+ovn-nbctl lsp-add sw-agg sw-rtr2                   \
+    -- lsp-set-type sw-rtr2 router                 \
+    -- lsp-set-addresses sw-rtr2 00:00:00:00:02:00 \
+    -- lsp-set-options sw-rtr2 router-port=rtr2-sw
+
+# Configure L3 interface IPv4 & IPv6 on both routers
+ovn-nbctl lr-add rtr1
+ovn-nbctl lrp-add rtr1 rtr1-sw 00:00:00:00:01:00 10.0.0.1/24 10::1/64
+
+ovn-nbctl lr-add rtr2
+ovn-nbctl lrp-add rtr2 rtr2-sw 00:00:00:00:02:00 10.0.0.2/24 10::2/64
+
+OVN_POPULATE_ARP
+ovn-nbctl --wait=hv sync
+
+sw_dp_uuid=$(ovn-sbctl --bare --columns _uuid list datapath_binding sw-agg)
+sw_dp_key=$(ovn-sbctl --bare --columns tunnel_key list datapath_binding sw-agg)
+
+r1_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr1)
+r2_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr2)
+
+mc_key=$(ovn-sbctl --bare --columns tunnel_key find multicast_group datapath=${sw_dp_uuid} name="_MC_flood")
+mc_key=$(printf "%04x" $mc_key)
+
+match_sw_metadata="metadata=0x${sw_dp_key}"
+
+# Inject ARP request for first router owned IP address.
+send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 1)
+
+# Verify that the ARP request is sent only to rtr1.
+match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.1,arp_op=1"
+match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
+match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Inject ND_NS for first router owned IP address.
+src_ipv6=00100000000000000000000000000254
+dst_ipv6=00100000000000000000000000000001
+send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
+
+# Verify that the ND_NS is sent only to rtr1.
+match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::1"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Configure load balancing on both routers.
+ovn-nbctl lb-add lb1-v4 10.0.0.11 42.42.42.1
+ovn-nbctl lb-add lb1-v6 10::11 42::1
+ovn-nbctl lr-lb-add rtr1 lb1-v4
+ovn-nbctl lr-lb-add rtr1 lb1-v6
+
+ovn-nbctl lb-add lb2-v4 10.0.0.22 42.42.42.2
+ovn-nbctl lb-add lb2-v6 10::22 42::2
+ovn-nbctl lr-lb-add rtr2 lb2-v4
+ovn-nbctl lr-lb-add rtr2 lb2-v6
+ovn-nbctl --wait=hv sync
+
+# Inject ARP request for first router owned VIP address.
+send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 11)
+
+# Verify that the ARP request is sent only to rtr1.
+match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.11,arp_op=1"
+match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
+match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Inject ND_NS for first router owned VIP address.
+src_ipv6=00100000000000000000000000000254
+dst_ipv6=00100000000000000000000000000011
+send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
+
+# Verify that the ND_NS is sent only to rtr1.
+match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::11"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Configure NAT on both routers
+ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10.0.0.111 42.42.42.1
+ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10::111 42::1
+ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10.0.0.222 42.42.42.2
+ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10::222 42::2
+
+# Inject ARP request for first router owned NAT address.
+send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 111)
+
+# Verify that the ARP request is sent only to rtr1.
+match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.111,arp_op=1"
+match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
+match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Inject ND_NS for first router owned NAT address.
+src_ipv6=00100000000000000000000000000254
+dst_ipv6=00100000000000000000000000000111
+send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
+
+# Verify that the ND_NS is sent only to rtr1.
+match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::111"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+OVN_CLEANUP([hv1])
+AT_CLEANUP
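
The send_arp_request helper in the new test assembles the injected frame as a hex string. A minimal standalone sketch of that assembly follows, so the field layout can be inspected without a running sandbox. The function name build_arp_request is hypothetical; the field values mirror those used in the test.

```shell
#!/bin/sh
# Sketch only: builds the same hex string as the test helper, but prints it
# instead of injecting it with ovs-appctl.

ip_to_hex() {
    printf "%02x%02x%02x%02x" "$@"
}

# build_arp_request ETH_SRC SPA TPA (hypothetical helper name)
# ETH_SRC is 12 hex digits; SPA and TPA are 8 hex digits each.
build_arp_request() {
    eth_src=$1 spa=$2 tpa=$3
    eth_dst=ffffffffffff        # broadcast destination MAC
    eth_type=0806               # EtherType: ARP
    # htype=0001 (Ethernet), ptype=0800 (IPv4), hlen=06, plen=04,
    # op=0001 (request)
    arp_hdr=0001080006040001
    # As in the test, the target hardware address field reuses eth_dst.
    printf '%s' "${eth_dst}${eth_src}${eth_type}${arp_hdr}${eth_src}${spa}${eth_dst}${tpa}"
}

frame=$(build_arp_request 000000000001 \
    "$(ip_to_hex 10 0 0 254)" "$(ip_to_hex 10 0 0 1)")
echo "$frame"
echo "${#frame}"    # prints 84 (hex digits, i.e. a 42-byte frame)
```

The 14-byte Ethernet header plus the 28-byte ARP payload gives the 42-byte frame that the priority-75 flow is expected to count once.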