diff mbox series

[ovs-dev,ovn,v5] ovn-northd: Limit ARP/ND broadcast domain whenever possible.

Message ID 1572879961-30041-1-git-send-email-dceara@redhat.com
State Superseded
Headers show
Series [ovs-dev,ovn,v5] ovn-northd: Limit ARP/ND broadcast domain whenever possible. | expand

Commit Message

Dumitru Ceara Nov. 4, 2019, 3:06 p.m. UTC
ARP request and ND NS packets for router owned IPs were being
flooded in the complete L2 domain (using the MC_FLOOD multicast group).
However this creates a scaling issue in scenarios where aggregation
logical switches are connected to more logical routers (~350). The
logical pipelines of all routers would have to be executed before the
packet is finally replied to by a single router, the owner of the IP
address.

This commit limits the broadcast domain by bypassing the L2 Lookup stage
for ARP requests that will be replied by a single router. The packets
are still flooded in the L2 domain but not on any of the other patch
ports towards other routers connected to the switch. This restricted
flooding is done by using a new multicast group (MC_ARP_ND_FLOOD).

IPs that are owned by the routers and for which this fix applies are:
- IP addresses configured on the router ports.
- VIPs.
- NAT IPs.

This commit also fixes function get_router_load_balancer_ips() which
was incorrectly returning a single address_family even though the
IP set could contain a mix of IPv4 and IPv6 addresses.

Reported-at: https://bugzilla.redhat.com/1756945
Reported-by: Anil Venkata <vkommadi@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>

---
v5: Address Numan's comments: update comments & make autotest more
    robust.
v4: Rebase.
v3: Properly deal with VXLAN traffic. Address review comments from
    Numan (add autotests). Fix function get_router_load_balancer_ips.
    Rebase -> deal with IPv6 NAT too.
v2: Move ARP broadcast domain limiting to table S_SWITCH_IN_L2_LKUP to
address localnet ports too.
---
 lib/mcast-group-index.h |   1 +
 northd/ovn-northd.8.xml |  16 +++
 northd/ovn-northd.c     | 340 ++++++++++++++++++++++++++++++++++++------------
 tests/ovn.at            | 283 +++++++++++++++++++++++++++++++++++++++-
 4 files changed, 556 insertions(+), 84 deletions(-)

Comments

Numan Siddique Nov. 4, 2019, 7:32 p.m. UTC | #1
On Mon, Nov 4, 2019 at 8:37 PM Dumitru Ceara <dceara@redhat.com> wrote:
>
> ARP request and ND NS packets for router owned IPs were being
> flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> However this creates a scaling issue in scenarios where aggregation
> logical switches are connected to more logical routers (~350). The
> logical pipelines of all routers would have to be executed before the
> packet is finally replied to by a single router, the owner of the IP
> address.
>
> This commit limits the broadcast domain by bypassing the L2 Lookup stage
> for ARP requests that will be replied by a single router. The packets
> are still flooded in the L2 domain but not on any of the other patch
> ports towards other routers connected to the switch. This restricted
> flooding is done by using a new multicast group (MC_ARP_ND_FLOOD).
>
> IPs that are owned by the routers and for which this fix applies are:
> - IP addresses configured on the router ports.
> - VIPs.
> - NAT IPs.
>
> This commit also fixes function get_router_load_balancer_ips() which
> was incorrectly returning a single address_family even though the
> IP set could contain a mix of IPv4 and IPv6 addresses.
>
> Reported-at: https://bugzilla.redhat.com/1756945
> Reported-by: Anil Venkata <vkommadi@redhat.com>
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>

Thanks Dumitru, for addressing the review comments.

Acked-by: Numan Siddique <numans@ovn.org>

Han, if you can take a look in this patch and provide your comments,
that would be great.

Thanks
Numan

>
> ---
> v5: Address Numan's comments: update comments & make autotest more
>     robust.
> v4: Rebase.
> v3: Properly deal with VXLAN traffic. Address review comments from
>     Numan (add autotests). Fix function get_router_load_balancer_ips.
>     Rebase -> deal with IPv6 NAT too.
> v2: Move ARP broadcast domain limiting to table S_SWITCH_IN_L2_LKUP to
> address localnet ports too.
> ---
>  lib/mcast-group-index.h |   1 +
>  northd/ovn-northd.8.xml |  16 +++
>  northd/ovn-northd.c     | 340 ++++++++++++++++++++++++++++++++++++------------
>  tests/ovn.at            | 283 +++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 556 insertions(+), 84 deletions(-)
>
> diff --git a/lib/mcast-group-index.h b/lib/mcast-group-index.h
> index ba995ba..06bd8b3 100644
> --- a/lib/mcast-group-index.h
> +++ b/lib/mcast-group-index.h
> @@ -27,6 +27,7 @@ enum ovn_mcast_tunnel_keys {
>
>      OVN_MCAST_FLOOD_TUNNEL_KEY = OVN_MIN_MULTICAST,
>      OVN_MCAST_UNKNOWN_TUNNEL_KEY,
> +    OVN_MCAST_ARP_ND_TUNNEL_KEY,
>      OVN_MCAST_MROUTER_FLOOD_TUNNEL_KEY,
>      OVN_MCAST_MROUTER_STATIC_TUNNEL_KEY,
>      OVN_MCAST_STATIC_TUNNEL_KEY,
> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> index 0a33dcd..6fbb3ab 100644
> --- a/northd/ovn-northd.8.xml
> +++ b/northd/ovn-northd.8.xml
> @@ -1005,6 +1005,22 @@ output;
>        </li>
>
>        <li>
> +        Priority-80 flows for each port connected to a logical router
> +        matching self originated GARP/ARP request/ND packets. These packets
> +        are flooded to the <code>MC_FLOOD</code> which contains all logical
> +        ports.
> +      </li>
> +
> +      <li>
> +        Priority-75 flows for each IP address/VIP/NAT address owned by a
> +        router port connected to the switch. These flows match ARP requests
> +        and ND packets for the specific IP addresses.  Matched packets are
> +        forwarded in the L3 domain only to the router that owns the IP
> +        address and flooded in the L2 domain on all ports except patch
> +        ports connected to logical routers.
> +      </li>
> +
> +      <li>
>          A priority-70 flow that outputs all packets with an Ethernet broadcast
>          or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
>          multicast group.
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index c23c270..31e3c78 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -210,6 +210,8 @@ enum ovn_stage {
>  #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
>  #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
>
> +#define REGBIT_NOT_VXLAN "flags[1] == 0"
> +
>  /* Returns an "enum ovn_stage" built from the arguments. */
>  static enum ovn_stage
>  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
> @@ -1202,6 +1204,34 @@ ovn_port_allocate_key(struct ovn_datapath *od)
>                            1, (1u << 15) - 1, &od->port_key_hint);
>  }
>
> +/* Returns true if the logical switch port 'enabled' column is empty or
> + * set to true.  Otherwise, returns false. */
> +static bool
> +lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> +{
> +    return !lsp->n_enabled || *lsp->enabled;
> +}
> +
> +/* Returns true only if the logical switch port 'up' column is set to true.
> + * Otherwise, if the column is not set or set to false, returns false. */
> +static bool
> +lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> +{
> +    return lsp->n_up && *lsp->up;
> +}
> +
> +static bool
> +lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> +{
> +    return !strcmp(nbsp->type, "external");
> +}
> +
> +static bool
> +lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> +{
> +    return !lrport->enabled || *lrport->enabled;
> +}
> +
>  static char *
>  chassis_redirect_name(const char *port_name)
>  {
> @@ -2184,7 +2214,7 @@ ip_address_and_port_from_lb_key(const char *key, char **ip_address,
>
>  static void
>  get_router_load_balancer_ips(const struct ovn_datapath *od,
> -                             struct sset *all_ips, int *addr_family)
> +                             struct sset *all_ips_v4, struct sset *all_ips_v6)
>  {
>      if (!od->nbr) {
>          return;
> @@ -2199,13 +2229,21 @@ get_router_load_balancer_ips(const struct ovn_datapath *od,
>              /* node->key contains IP:port or just IP. */
>              char *ip_address = NULL;
>              uint16_t port;
> +            int addr_family;
>
>              ip_address_and_port_from_lb_key(node->key, &ip_address, &port,
> -                                            addr_family);
> +                                            &addr_family);
>              if (!ip_address) {
>                  continue;
>              }
>
> +            struct sset *all_ips;
> +            if (addr_family == AF_INET) {
> +                all_ips = all_ips_v4;
> +            } else {
> +                all_ips = all_ips_v6;
> +            }
> +
>              if (!sset_contains(all_ips, ip_address)) {
>                  sset_add(all_ips, ip_address);
>              }
> @@ -2299,17 +2337,22 @@ get_nat_addresses(const struct ovn_port *op, size_t *n)
>          }
>      }
>
> -    /* A set to hold all load-balancer vips. */
> -    struct sset all_ips = SSET_INITIALIZER(&all_ips);
> -    int addr_family;
> -    get_router_load_balancer_ips(op->od, &all_ips, &addr_family);
> +    /* Two sets to hold all load-balancer vips. */
> +    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
> +    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
> +    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
>
>      const char *ip_address;
> -    SSET_FOR_EACH (ip_address, &all_ips) {
> +    SSET_FOR_EACH (ip_address, &all_ips_v4) {
>          ds_put_format(&c_addresses, " %s", ip_address);
>          central_ip_address = true;
>      }
> -    sset_destroy(&all_ips);
> +    SSET_FOR_EACH (ip_address, &all_ips_v6) {
> +        ds_put_format(&c_addresses, " %s", ip_address);
> +        central_ip_address = true;
> +    }
> +    sset_destroy(&all_ips_v4);
> +    sset_destroy(&all_ips_v6);
>
>      if (central_ip_address) {
>          /* Gratuitous ARP for centralized NAT rules on distributed gateway
> @@ -3036,6 +3079,10 @@ static const struct multicast_group mc_static =
>  static const struct multicast_group mc_unknown =
>      { MC_UNKNOWN, OVN_MCAST_UNKNOWN_TUNNEL_KEY };
>
> +#define MC_ARP_ND "_MC_arp_nd"
> +static const struct multicast_group mc_arp_nd =
> +    { MC_ARP_ND, OVN_MCAST_ARP_ND_TUNNEL_KEY };
> +
>  static bool
>  multicast_group_equal(const struct multicast_group *a,
>                        const struct multicast_group *b)
> @@ -3737,28 +3784,6 @@ build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
>
>  }
>
> -/* Returns true if the logical switch port 'enabled' column is empty or
> - * set to true.  Otherwise, returns false. */
> -static bool
> -lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
> -{
> -    return !lsp->n_enabled || *lsp->enabled;
> -}
> -
> -/* Returns true only if the logical switch port 'up' column is set to true.
> - * Otherwise, if the column is not set or set to false, returns false. */
> -static bool
> -lsp_is_up(const struct nbrec_logical_switch_port *lsp)
> -{
> -    return lsp->n_up && *lsp->up;
> -}
> -
> -static bool
> -lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
> -{
> -    return !strcmp(nbsp->type, "external");
> -}
> -
>  static bool
>  build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
>                      struct ds *options_action, struct ds *response_action,
> @@ -5161,6 +5186,143 @@ build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
>      }
>  }
>
> +/*
> + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
> + * that own the addresses. Packets are still flooded in the switching domain
> + * as regular broadcast.
> + */
> +static void
> +build_lswitch_rport_arp_flow(const char *target_address, int addr_family,
> +                             struct ovn_port *patch_op,
> +                             struct ovn_datapath *od,
> +                             uint32_t priority,
> +                             struct hmap *lflows)
> +{
> +    struct ds match   = DS_EMPTY_INITIALIZER;
> +    struct ds actions = DS_EMPTY_INITIALIZER;
> +
> +    if (addr_family == AF_INET) {
> +        ds_put_format(&match, "arp.tpa == %s && arp.op == 1", target_address);
> +    } else {
> +        ds_put_format(&match, "nd.target == %s && nd_ns", target_address);
> +    }
> +
> +    /* Packets received from VXLAN tunnels have already been through the
> +     * router pipeline so we should skip them. Normally this is done by the
> +     * multicast_group implementation (VXLAN packets skip table 32 which
> +     * delivers to patch ports) but we're bypassing multicast_groups.
> +     */
> +    ds_put_format(&match, " && "REGBIT_NOT_VXLAN);
> +
> +    /* Send a clone of the packet to the router pipeline and flood the
> +     * original in the broadcast domain (skipping router ports). */
> +    ds_put_format(&actions,
> +                  "clone { outport = %s; output; }; "
> +                  "outport = \""MC_ARP_ND"\"; output;",
> +                  patch_op->json_key);
> +    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
> +                  ds_cstr(&match), ds_cstr(&actions));
> +
> +    ds_destroy(&match);
> +    ds_destroy(&actions);
> +}
> +
> +/*
> + * Ingress table 17: Flows that forward ARP/ND requests only to the routers
> + * that own the addresses.
> + * Priorities:
> + * - 80: self originated GARPs that need to follow regular processing.
> + * - 75: ARP requests to router owned IPs (interface IP/LB/NAT).
> + */
> +static void
> +build_lswitch_rport_arp_responders(struct ovn_port *op,
> +                                   struct ovn_datapath *sw_od,
> +                                   struct ovn_port *sw_op,
> +                                   struct hmap *lflows)
> +{
> +    if (!op || !op->nbrp) {
> +        return;
> +    }
> +
> +    if (!lrport_is_enabled(op->nbrp)) {
> +        return;
> +    }
> +
> +    struct ds match = DS_EMPTY_INITIALIZER;
> +
> +    /* Self originated (G)ARP requests/ND need to be flooded as usual.
> +     * Priority: 80.
> +     */
> +    ds_put_format(&match, "inport == %s && (arp.op == 1 || nd_ns)",
> +                  sw_op->json_key);
> +    ovn_lflow_add(lflows, sw_od, S_SWITCH_IN_L2_LKUP, 80,
> +                  ds_cstr(&match),
> +                  "outport = \""MC_FLOOD"\"; output;");
> +
> +    ds_destroy(&match);
> +
> +    /* Forward ARP requests for IPs configured on the router only to this
> +     * router port.
> +     * Priority: 75.
> +     */
> +    for (int i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
> +        build_lswitch_rport_arp_flow(op->lrp_networks.ipv4_addrs[i].addr_s,
> +                                     AF_INET, sw_op, sw_od, 75, lflows);
> +    }
> +    for (int i = 0; i < op->lrp_networks.n_ipv6_addrs; i++) {
> +        build_lswitch_rport_arp_flow(op->lrp_networks.ipv6_addrs[i].addr_s,
> +                                     AF_INET6, sw_op, sw_od, 75, lflows);
> +    }
> +
> +    /* Forward ARP requests to load-balancer VIPs configured on the router
> +     * only to this router port.
> +     * Priority: 75.
> +     */
> +    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
> +    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
> +    const char *ip_address;
> +
> +    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
> +
> +    SSET_FOR_EACH (ip_address, &all_ips_v4) {
> +        build_lswitch_rport_arp_flow(ip_address, AF_INET, sw_op, sw_od,
> +                                     75, lflows);
> +    }
> +    SSET_FOR_EACH (ip_address, &all_ips_v6) {
> +        build_lswitch_rport_arp_flow(ip_address, AF_INET6, sw_op, sw_od,
> +                                     75, lflows);
> +    }
> +    sset_destroy(&all_ips_v4);
> +    sset_destroy(&all_ips_v6);
> +
> +    /* Forward ARP requests to NAT addresses configured on the router
> +     * only to this router port.
> +     * Priority: 75.
> +     */
> +    for (int i = 0; i < op->od->nbr->n_nat; i++) {
> +        const struct nbrec_nat *nat = op->od->nbr->nat[i];
> +
> +        if (!strcmp(nat->type, "snat")) {
> +            continue;
> +        }
> +
> +        ovs_be32 ip;
> +        ovs_be32 mask;
> +        struct in6_addr ipv6;
> +        struct in6_addr mask_v6;
> +
> +        if (ip_parse_masked(nat->external_ip, &ip, &mask)) {
> +            if (!ipv6_parse_masked(nat->external_ip, &ipv6, &mask_v6)) {
> +                build_lswitch_rport_arp_flow(nat->external_ip, AF_INET6, sw_op,
> +                                             sw_od, 75, lflows);
> +            }
> +        } else {
> +            build_lswitch_rport_arp_flow(nat->external_ip, AF_INET, sw_op,
> +                                         sw_od, 75, lflows);
> +        }
> +    }
> +}
> +
>  static void
>  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
>                      struct hmap *port_groups, struct hmap *lflows,
> @@ -5748,6 +5910,15 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
>              continue;
>          }
>
> +        /* For ports connected to logical routers add flows to bypass the
> +         * broadcast flooding of ARP/ND requests in table 17. We direct the
> +         * requests only to the router port that owns the IP address.
> +         */
> +        if (!strcmp(op->nbsp->type, "router")) {
> +            build_lswitch_rport_arp_responders(op->peer, op->od, op,
> +                                               lflows);
> +        }
> +
>          for (size_t i = 0; i < op->nbsp->n_addresses; i++) {
>              /* Addresses are owned by the logical port.
>               * Ethernet address followed by zero or more IPv4
> @@ -5879,12 +6050,6 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
>      ds_destroy(&actions);
>  }
>
> -static bool
> -lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
> -{
> -    return !lrport->enabled || *lrport->enabled;
> -}
> -
>  /* Returns a string of the IP address of the router port 'op' that
>   * overlaps with 'ip_s".  If one is not found, returns NULL.
>   *
> @@ -6904,61 +7069,66 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
>          }
>
>          /* A set to hold all load-balancer vips that need ARP responses. */
> -        struct sset all_ips = SSET_INITIALIZER(&all_ips);
> -        int addr_family;
> -        get_router_load_balancer_ips(op->od, &all_ips, &addr_family);
> +        struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
> +        struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
> +        get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
>
>          const char *ip_address;
> -        SSET_FOR_EACH(ip_address, &all_ips) {
> +        SSET_FOR_EACH (ip_address, &all_ips_v4) {
>              ds_clear(&match);
> -            if (addr_family == AF_INET) {
> -                ds_put_format(&match,
> -                              "inport == %s && arp.tpa == %s && arp.op == 1",
> -                              op->json_key, ip_address);
> -            } else {
> -                ds_put_format(&match,
> -                              "inport == %s && nd_ns && nd.target == %s",
> -                              op->json_key, ip_address);
> -            }
> +            ds_put_format(&match,
> +                          "inport == %s && arp.tpa == %s && arp.op == 1",
> +                          op->json_key, ip_address);
>
>              ds_clear(&actions);
> -            if (addr_family == AF_INET) {
> -                ds_put_format(&actions,
> -                "eth.dst = eth.src; "
> -                "eth.src = %s; "
> -                "arp.op = 2; /* ARP reply */ "
> -                "arp.tha = arp.sha; "
> -                "arp.sha = %s; "
> -                "arp.tpa = arp.spa; "
> -                "arp.spa = %s; "
> -                "outport = %s; "
> -                "flags.loopback = 1; "
> -                "output;",
> -                op->lrp_networks.ea_s,
> -                op->lrp_networks.ea_s,
> -                ip_address,
> -                op->json_key);
> -            } else {
> -                ds_put_format(&actions,
> -                "nd_na { "
> -                "eth.src = %s; "
> -                "ip6.src = %s; "
> -                "nd.target = %s; "
> -                "nd.tll = %s; "
> -                "outport = inport; "
> -                "flags.loopback = 1; "
> -                "output; "
> -                "};",
> -                op->lrp_networks.ea_s,
> -                ip_address,
> -                ip_address,
> -                op->lrp_networks.ea_s);
> -            }
> +            ds_put_format(&actions,
> +                          "eth.dst = eth.src; "
> +                          "eth.src = %s; "
> +                          "arp.op = 2; /* ARP reply */ "
> +                          "arp.tha = arp.sha; "
> +                          "arp.sha = %s; "
> +                          "arp.tpa = arp.spa; "
> +                          "arp.spa = %s; "
> +                          "outport = %s; "
> +                          "flags.loopback = 1; "
> +                          "output;",
> +                          op->lrp_networks.ea_s,
> +                          op->lrp_networks.ea_s,
> +                          ip_address,
> +                          op->json_key);
> +
>              ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 90,
>                            ds_cstr(&match), ds_cstr(&actions));
>          }
>
> -        sset_destroy(&all_ips);
> +        SSET_FOR_EACH (ip_address, &all_ips_v6) {
> +            ds_clear(&match);
> +            ds_put_format(&match,
> +                          "inport == %s && nd_ns && nd.target == %s",
> +                          op->json_key, ip_address);
> +
> +            ds_clear(&actions);
> +            ds_put_format(&actions,
> +                          "nd_na { "
> +                          "eth.src = %s; "
> +                          "ip6.src = %s; "
> +                          "nd.target = %s; "
> +                          "nd.tll = %s; "
> +                          "outport = inport; "
> +                          "flags.loopback = 1; "
> +                          "output; "
> +                          "};",
> +                          op->lrp_networks.ea_s,
> +                          ip_address,
> +                          ip_address,
> +                          op->lrp_networks.ea_s);
> +
> +            ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 90,
> +                          ds_cstr(&match), ds_cstr(&actions));
> +        }
> +
> +        sset_destroy(&all_ips_v4);
> +        sset_destroy(&all_ips_v6);
>
>          /* A gateway router can have 2 SNAT IP addresses to force DNATed and
>           * LBed traffic respectively to be SNATed.  In addition, there can be
> @@ -9392,6 +9562,12 @@ build_mcast_groups(struct northd_context *ctx,
>          } else if (op->nbsp && lsp_is_enabled(op->nbsp)) {
>              ovn_multicast_add(mcast_groups, &mc_flood, op);
>
> +            /* Add all non-router ports to the ARP ND L2 broadcast flood
> +             * domain entry. */
> +            if (strcmp(op->nbsp->type, "router")) {
> +                ovn_multicast_add(mcast_groups, &mc_arp_nd, op);
> +            }
> +
>              /* If this port is connected to a multicast router then add it
>               * to the MC_MROUTER_FLOOD group.
>               */
> diff --git a/tests/ovn.at b/tests/ovn.at
> index 410f4b5..196d379 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -9595,7 +9595,7 @@ ovn-nbctl --wait=hv --timeout=3 sync
>  # Check that there is a logical flow in logical switch foo's pipeline
>  # to set the outport to rp-foo (which is expected).
>  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
> -grep rp-foo | grep -v is_chassis_resident | wc -l`])
> +grep rp-foo | grep -v is_chassis_resident | grep priority=50 -c`])
>
>  # Set the option 'reside-on-redirect-chassis' for foo
>  ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
> @@ -9603,7 +9603,7 @@ ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
>  # to set the outport to rp-foo with the condition is_chassis_redirect.
>  ovn-sbctl dump-flows foo
>  OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
> -grep rp-foo | grep is_chassis_resident | wc -l`])
> +grep rp-foo | grep is_chassis_resident | grep priority=50 -c`])
>
>  echo "---------NB dump-----"
>  ovn-nbctl show
> @@ -16676,3 +16676,282 @@ as hv4 ovs-appctl fdb/show br-phys
>  OVN_CLEANUP([hv1],[hv2],[hv3],[hv4])
>
>  AT_CLEANUP
> +
> +AT_SETUP([ovn -- ARP/ND request broadcast limiting])
> +AT_SKIP_IF([test $HAVE_PYTHON = no])
> +ovn_start
> +
> +ip_to_hex() {
> +    printf "%02x%02x%02x%02x" "$@"
> +}
> +
> +send_arp_request() {
> +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5
> +    local eth_dst=ffffffffffff
> +    local eth_type=0806
> +    local eth=${eth_dst}${eth_src}${eth_type}
> +
> +    local arp=0001080006040001${eth_src}${spa}${eth_dst}${tpa}
> +
> +    local request=${eth}${arp}
> +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
> +}
> +
> +send_nd_ns() {
> +    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5 cksum=$6
> +
> +    local eth_dst=ffffffffffff
> +    local eth_type=86dd
> +    local eth=${eth_dst}${eth_src}${eth_type}
> +
> +    local ip_vhlen=60000000
> +    local ip_plen=0020
> +    local ip_next=3a
> +    local ip_ttl=ff
> +    local ip=${ip_vhlen}${ip_plen}${ip_next}${ip_ttl}${spa}${tpa}
> +
> +    # Neighbor Solicitation
> +    local icmp6_type=87
> +    local icmp6_code=00
> +    local icmp6_rsvd=00000000
> +    # ICMPv6 source lla option
> +    local icmp6_opt=01
> +    local icmp6_optlen=01
> +    local icmp6=${icmp6_type}${icmp6_code}${cksum}${icmp6_rsvd}${tpa}${icmp6_opt}${icmp6_optlen}${eth_src}
> +
> +    local request=${eth}${ip}${icmp6}
> +
> +    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
> +}
> +
> +src_mac=000000000001
> +
> +net_add n1
> +sim_add hv1
> +as hv1
> +ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.1
> +
> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> +    set interface hv1-vif1 external-ids:iface-id=sw-agg-ext \
> +    options:tx_pcap=hv1/vif1-tx.pcap \
> +    options:rxq_pcap=hv1/vif1-rx.pcap \
> +    ofport-request=1
> +
> +# One Aggregation Switch connected to two Logical networks (routers).
> +ovn-nbctl ls-add sw-agg
> +ovn-nbctl lsp-add sw-agg sw-agg-ext \
> +    -- lsp-set-addresses sw-agg-ext 00:00:00:00:00:01
> +
> +ovn-nbctl lsp-add sw-agg sw-rtr1                   \
> +    -- lsp-set-type sw-rtr1 router                 \
> +    -- lsp-set-addresses sw-rtr1 00:00:00:00:01:00 \
> +    -- lsp-set-options sw-rtr1 router-port=rtr1-sw
> +ovn-nbctl lsp-add sw-agg sw-rtr2                   \
> +    -- lsp-set-type sw-rtr2 router                 \
> +    -- lsp-set-addresses sw-rtr2 00:00:00:00:02:00 \
> +    -- lsp-set-options sw-rtr2 router-port=rtr2-sw
> +
> +# Configure L3 interface IPv4 & IPv6 on both routers
> +ovn-nbctl lr-add rtr1
> +ovn-nbctl lrp-add rtr1 rtr1-sw 00:00:00:00:01:00 10.0.0.1/24 10::1/64
> +
> +ovn-nbctl lr-add rtr2
> +ovn-nbctl lrp-add rtr2 rtr2-sw 00:00:00:00:02:00 10.0.0.2/24 10::2/64
> +
> +OVN_POPULATE_ARP
> +ovn-nbctl --wait=hv sync
> +
> +sw_dp_uuid=$(ovn-sbctl --bare --columns _uuid list datapath_binding sw-agg)
> +sw_dp_key=$(ovn-sbctl --bare --columns tunnel_key list datapath_binding sw-agg)
> +
> +r1_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr1)
> +r2_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr2)
> +
> +mc_key=$(ovn-sbctl --bare --columns tunnel_key find multicast_group datapath=${sw_dp_uuid} name="_MC_flood")
> +mc_key=$(printf "%04x" $mc_key)
> +
> +match_sw_metadata="metadata=0x${sw_dp_key}"
> +
> +# Inject ARP request for first router owned IP address.
> +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 1)
> +
> +# Verify that the ARP request is sent only to rtr1.
> +match_arp_req="${match_sw_metadata}.*arp_tpa=10.0.0.1,arp_op=1"
> +match_send_rtr1="clone(load:0x${r1_tnl_key}->NXM_NX_REG15"
> +match_send_rtr2="clone(load:0x${r2_tnl_key}->NXM_NX_REG15"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +# Inject ND_NS for ofirst router owned IP address.
> +src_ipv6=00100000000000000000000000000254
> +dst_ipv6=00100000000000000000000000000001
> +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> +
> +# Verify that the ND_NS is sent only to rtr1.
> +match_nd_ns="${match_sw_metadata}.*icmp_type=135.*nd_target=10::1"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +# Configure load balancing on both routers.
> +ovn-nbctl lb-add lb1-v4 10.0.0.11 42.42.42.1
> +ovn-nbctl lb-add lb1-v6 10::11 42::1
> +ovn-nbctl lr-lb-add rtr1 lb1-v4
> +ovn-nbctl lr-lb-add rtr1 lb1-v6
> +
> +ovn-nbctl lb-add lb2-v4 10.0.0.22 42.42.42.2
> +ovn-nbctl lb-add lb2-v6 10::22 42::2
> +ovn-nbctl lr-lb-add rtr2 lb2-v4
> +ovn-nbctl lr-lb-add rtr2 lb2-v6
> +ovn-nbctl --wait=hv sync
> +
> +# Inject ARP request for first router owned VIP address.
> +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 11)
> +
> +# Verify that the ARP request is sent only to rtr1.
> +match_arp_req="${match_sw_metadata}.*arp_tpa=10.0.0.11,arp_op=1"
> +match_send_rtr1="clone(load:0x${r1_tnl_key}->NXM_NX_REG15"
> +match_send_rtr2="clone(load:0x${r2_tnl_key}->NXM_NX_REG15"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +# Inject ND_NS for first router owned VIP address.
> +src_ipv6=00100000000000000000000000000254
> +dst_ipv6=00100000000000000000000000000011
> +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> +
> +# Verify that the ND_NS is sent only to rtr1.
> +match_nd_ns="${match_sw_metadata}.*icmp_type=135.*nd_target=10::11"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +# Configure NAT on both routers
> +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10.0.0.111 42.42.42.1
> +ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10::111 42::1
> +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10.0.0.222 42.42.42.2
> +ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10::222 42::2
> +
> +# Inject ARP request for first router owned NAT address.
> +send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 111)
> +
> +# Verify that the ARP request is sent only to rtr1.
> +match_arp_req="${match_sw_metadata}.*arp_tpa=10.0.0.111,arp_op=1"
> +match_send_rtr1="clone(load:0x${r1_tnl_key}->NXM_NX_REG15"
> +match_send_rtr2="clone(load:0x${r2_tnl_key}->NXM_NX_REG15"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +# Inject ND_NS for first router owned IP address.
> +src_ipv6=00100000000000000000000000000254
> +dst_ipv6=00100000000000000000000000000111
> +send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
> +
> +# Verify that the ND_NS is sent only to rtr1.
> +match_nd_ns="${match_sw_metadata}.*icmp_type=135.*nd_target=10::111"
> +
> +as hv1
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
> +    grep n_packets=1 -c)
> +    test "1" = "${pkts_to_rtr1}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
> +    grep n_packets=1 -c)
> +    test "0" = "${pkts_to_rtr2}"
> +])
> +OVS_WAIT_UNTIL([
> +    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
> +    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
> +    test "0" = "${pkts_flooded}"
> +])
> +
> +OVN_CLEANUP([hv1])
> +AT_CLEANUP
> --
> 1.8.3.1
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Han Zhou Nov. 7, 2019, 2:51 a.m. UTC | #2
On Mon, Nov 4, 2019 at 11:32 AM Numan Siddique <numans@ovn.org> wrote:
>
> On Mon, Nov 4, 2019 at 8:37 PM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > ARP request and ND NS packets for router owned IPs were being
> > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > However this creates a scaling issue in scenarios where aggregation
> > logical switches are connected to more logical routers (~350). The
> > logical pipelines of all routers would have to be executed before the
> > packet is finally replied to by a single router, the owner of the IP
> > address.
> >
> > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > for ARP requests that will be replied by a single router. The packets
> > are still flooded in the L2 domain but not on any of the other patch
> > ports towards other routers connected to the switch. This restricted
> > flooding is done by using a new multicast group (MC_ARP_ND_FLOOD).
> >
> > IPs that are owned by the routers and for which this fix applies are:
> > - IP addresses configured on the router ports.
> > - VIPs.
> > - NAT IPs.
> >
> > This commit also fixes function get_router_load_balancer_ips() which
> > was incorrectly returning a single address_family even though the
> > IP set could contain a mix of IPv4 and IPv6 addresses.
> >
> > Reported-at: https://bugzilla.redhat.com/1756945
> > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>
> Thanks Dumitru, for addressing the review comments.
>
> Acked-by: Numan Siddique <numans@ovn.org>
>
> Han, if you can take a look in this patch and provide your comments,
> that would be great.
>

Sorry for delayed response :(

The patch looks good to me except:

1. It changes the behavior that when an ARP request for a LRP's IP is
coming to the logical switch, other routers will no longer learn the MAC-IP
binding of the ARP sender. This has been discussed and I tend to agree with
Dumitru that it should not cause real issue. I think it just worth to be
documented clearly in the ovn-architecture, probably in the section:
Logical Routers and Logical Patch Ports, because this is something
different from a traditional switch's behavior, and network administrators
may get suprised.

2. Since we are anyway changing the behavior of ARP request handling and
bypassing logical router ports, and we think it should not cause real
problem, then I wonder why adding the MC_ARP_ND group to still flood to the
non-router ports. Is it really useful? Maybe it is not a big deal, but I
think the extra MC group would add some performance cost and it is almost
redundant with the regular MC_FLOOD except that there is no router ports in
it - it is a lot of redundancy considering that most LSPs are regular VIFs
in typical large scale environments. I would suggest to simplify this
unless there is clear concerns of not flooding to other ports.

3. Not a big thing, but the name of the functions
build_lswitch_rport_arp_responders() and build_lswitch_rport_arp_flow() are
a little bit confusing. It is not for arp-responder, but for bypassing
router ports. Would it be better to be something like:
build_lswitch_rport_arp_req_flows() and
build_lswitch_rport_arp_request_flow_for_ip()?

In addition, not for the patch, but for the problem itself, I think we'd
better quantify the number of router ports we support with current resubmit
limit (as suggested by Ben), and document somewhere the limit and the
impact if user exceeds the limit. If the limit is easily exceed in very
common deployments, we probably should consider increasing the resubmit
limit.

Thanks,
Han
Dumitru Ceara Nov. 7, 2019, 3:42 p.m. UTC | #3
On Thu, Nov 7, 2019 at 3:51 AM Han Zhou <hzhou@ovn.org> wrote:
>
>
>
> On Mon, Nov 4, 2019 at 11:32 AM Numan Siddique <numans@ovn.org> wrote:
> >
> > On Mon, Nov 4, 2019 at 8:37 PM Dumitru Ceara <dceara@redhat.com> wrote:
> > >
> > > ARP request and ND NS packets for router owned IPs were being
> > > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > > However this creates a scaling issue in scenarios where aggregation
> > > logical switches are connected to more logical routers (~350). The
> > > logical pipelines of all routers would have to be executed before the
> > > packet is finally replied to by a single router, the owner of the IP
> > > address.
> > >
> > > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > > for ARP requests that will be replied by a single router. The packets
> > > are still flooded in the L2 domain but not on any of the other patch
> > > ports towards other routers connected to the switch. This restricted
> > > flooding is done by using a new multicast group (MC_ARP_ND_FLOOD).
> > >
> > > IPs that are owned by the routers and for which this fix applies are:
> > > - IP addresses configured on the router ports.
> > > - VIPs.
> > > - NAT IPs.
> > >
> > > This commit also fixes function get_router_load_balancer_ips() which
> > > was incorrectly returning a single address_family even though the
> > > IP set could contain a mix of IPv4 and IPv6 addresses.
> > >
> > > Reported-at: https://bugzilla.redhat.com/1756945
> > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >
> > Thanks Dumitru, for addressing the review comments.
> >
> > Acked-by: Numan Siddique <numans@ovn.org>
> >
> > Han, if you can take a look in this patch and provide your comments,
> > that would be great.
> >

Hi Han,

>
> Sorry for delayed response :(

No worries and thanks for reviewing this change.

>
> The patch looks good to me except:
>
> 1. It changes the behavior that when an ARP request for a LRP's IP is coming to the logical switch, other routers will no longer learn the MAC-IP binding of the ARP sender. This has been discussed and I tend to agree with Dumitru that it should not cause real issue. I think it just worth to be documented clearly in the ovn-architecture, probably in the section: Logical Routers and Logical Patch Ports, because this is something different from a traditional switch's behavior, and network administrators may get suprised.

Ack, I'll add this in v6.

>
> 2. Since we are anyway changing the behavior of ARP request handling and bypassing logical router ports, and we think it should not cause real problem, then I wonder why adding the MC_ARP_ND group to still flood to the non-router ports. Is it really useful? Maybe it is not a big deal, but I think the extra MC group would add some performance cost and it is almost redundant with the regular MC_FLOOD except that there is no router ports in it - it is a lot of redundancy considering that most LSPs are regular VIFs in typical large scale environments. I would suggest to simplify this unless there is clear concerns of not flooding to other ports.

If I skip flooding to non-router ports the following test fails
because one of the things checked by the test is that we flood
broadcasts (including ARPs) in the broadcast domain:
 36: ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR           FAILED (ovs-macros.at:220)

So my concern was that people might expect the ARP requests to be
flooded within the L2 broadcast domain.
However, we could add configuration knob to change the behavior and
only flood them on localnet ports. Like this we could maintain the
current L2 flooding but in deployments where this is not necessary the
users could disable the flooding to VIF ports and this would remove
the MC_ARP_ND group unless there's a localnet port in the datapath.

What do you think?

>
> 3. Not a big thing, but the name of the functions build_lswitch_rport_arp_responders() and build_lswitch_rport_arp_flow() are a little bit confusing. It is not for arp-responder, but for bypassing router ports. Would it be better to be something like: build_lswitch_rport_arp_req_flows() and build_lswitch_rport_arp_request_flow_for_ip()?

Sure, I'll change them.
>
> In addition, not for the patch, but for the problem itself, I think we'd better quantify the number of router ports we support with current resubmit limit (as suggested by Ben), and document somewhere the limit and the impact if user exceeds the limit. If the limit is easily exceed in very common deployments, we probably should consider increasing the resubmit limit.

Agreed, that should be done too.

Thanks,
Dumitru

>
> Thanks,
> Han
Han Zhou Nov. 7, 2019, 5:02 p.m. UTC | #4
On Thu, Nov 7, 2019 at 7:43 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On Thu, Nov 7, 2019 at 3:51 AM Han Zhou <hzhou@ovn.org> wrote:
> >
> >
> >
> > On Mon, Nov 4, 2019 at 11:32 AM Numan Siddique <numans@ovn.org> wrote:
> > >
> > > On Mon, Nov 4, 2019 at 8:37 PM Dumitru Ceara <dceara@redhat.com>
wrote:
> > > >
> > > > ARP request and ND NS packets for router owned IPs were being
> > > > flooded in the complete L2 domain (using the MC_FLOOD multicast
group).
> > > > However this creates a scaling issue in scenarios where aggregation
> > > > logical switches are connected to more logical routers (~350). The
> > > > logical pipelines of all routers would have to be executed before
the
> > > > packet is finally replied to by a single router, the owner of the IP
> > > > address.
> > > >
> > > > This commit limits the broadcast domain by bypassing the L2 Lookup
stage
> > > > for ARP requests that will be replied by a single router. The
packets
> > > > are still flooded in the L2 domain but not on any of the other patch
> > > > ports towards other routers connected to the switch. This restricted
> > > > flooding is done by using a new multicast group (MC_ARP_ND_FLOOD).
> > > >
> > > > IPs that are owned by the routers and for which this fix applies
are:
> > > > - IP addresses configured on the router ports.
> > > > - VIPs.
> > > > - NAT IPs.
> > > >
> > > > This commit also fixes function get_router_load_balancer_ips() which
> > > > was incorrectly returning a single address_family even though the
> > > > IP set could contain a mix of IPv4 and IPv6 addresses.
> > > >
> > > > Reported-at: https://bugzilla.redhat.com/1756945
> > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > >
> > > Thanks Dumitru, for addressing the review comments.
> > >
> > > Acked-by: Numan Siddique <numans@ovn.org>
> > >
> > > Han, if you can take a look in this patch and provide your comments,
> > > that would be great.
> > >
>
> Hi Han,
>
> >
> > Sorry for delayed response :(
>
> No worries and thanks for reviewing this change.
>
> >
> > The patch looks good to me except:
> >
> > 1. It changes the behavior that when an ARP request for a LRP's IP is
coming to the logical switch, other routers will no longer learn the MAC-IP
binding of the ARP sender. This has been discussed and I tend to agree with
Dumitru that it should not cause real issue. I think it just worth to be
documented clearly in the ovn-architecture, probably in the section:
Logical Routers and Logical Patch Ports, because this is something
different from a traditional switch's behavior, and network administrators
may get suprised.
>
> Ack, I'll add this in v6.
>
> >
> > 2. Since we are anyway changing the behavior of ARP request handling
and bypassing logical router ports, and we think it should not cause real
problem, then I wonder why adding the MC_ARP_ND group to still flood to the
non-router ports. Is it really useful? Maybe it is not a big deal, but I
think the extra MC group would add some performance cost and it is almost
redundant with the regular MC_FLOOD except that there is no router ports in
it - it is a lot of redundancy considering that most LSPs are regular VIFs
in typical large scale environments. I would suggest to simplify this
unless there is clear concerns of not flooding to other ports.
>
> If I skip flooding to non-router ports the following test fails
> because one of the things checked by the test is that we flood
> broadcasts (including ARPs) in the broadcast domain:
>  36: ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR           FAILED (
ovs-macros.at:220)
>
> So my concern was that people might expect the ARP requests to be
> flooded within the L2 broadcast domain.
> However, we could add configuration knob to change the behavior and
> only flood them on localnet ports. Like this we could maintain the
> current L2 flooding but in deployments where this is not necessary the
> users could disable the flooding to VIF ports and this would remove
> the MC_ARP_ND group unless there's a localnet port in the datapath.
>
> What do you think?
>

Adding configuration knob is a valid option, but I hoped we could simplify
the change instead of making it even more complex :)

My point is that since we changed the behavior, it might be better to
change it with a more straightforward philosophy: for ARP request for an IP
that is owned by a LRP, it is not flooded. This is in fact comply with our
ARP responder behavior in LS: for a known IP-MAC binding, the ARP request
is not flooded but directly responded to the sender. Now with your change
we are just extending this behavior for IPs owned by LRPs behind the LSPs
with type "router".

I don't think we need to worry about the test case "36: ovn -- 3 HVs, 3 LS,
3 lports/LS, 1 LR". In fact the test already considered a case where
flooding is not expected:
            # 192.168.33.254 is configured to the switch patch port for
lrp33,
            # so no ARP flooding expected for it.
Now we can just change the test case a little more to exclude the IPs owned
by LRPs, since this is new expected behavior. What do you think?

Also I am confused why you pointed out localnet port as an exception. I
think we should just treat it with the same philosophy. If we are not
flooding, then just don't flood to any port, no matter if it is VIF or
localnet.

(If we do believe there is a need to flood with a new group MC_ARP_ND, I'd
say let's don't bother about the configuration knob now, until we see
performance impact from scale test)


> >
> > 3. Not a big thing, but the name of the functions
build_lswitch_rport_arp_responders() and build_lswitch_rport_arp_flow() are
a little bit confusing. It is not for arp-responder, but for bypassing
router ports. Would it be better to be something like:
build_lswitch_rport_arp_req_flows() and
build_lswitch_rport_arp_request_flow_for_ip()?
>
> Sure, I'll change them.
> >
> > In addition, not for the patch, but for the problem itself, I think
we'd better quantify the number of router ports we support with current
resubmit limit (as suggested by Ben), and document somewhere the limit and
the impact if user exceeds the limit. If the limit is easily exceed in very
common deployments, we probably should consider increasing the resubmit
limit.
>
> Agreed, that should be done too.
>
> Thanks,
> Dumitru
>
> >
> > Thanks,
> > Han
>
Dumitru Ceara Nov. 7, 2019, 6:02 p.m. UTC | #5
On Thu, Nov 7, 2019 at 6:02 PM Han Zhou <hzhou@ovn.org> wrote:
>
>
>
> On Thu, Nov 7, 2019 at 7:43 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > On Thu, Nov 7, 2019 at 3:51 AM Han Zhou <hzhou@ovn.org> wrote:
> > >
> > >
> > >
> > > On Mon, Nov 4, 2019 at 11:32 AM Numan Siddique <numans@ovn.org> wrote:
> > > >
> > > > On Mon, Nov 4, 2019 at 8:37 PM Dumitru Ceara <dceara@redhat.com> wrote:
> > > > >
> > > > > ARP request and ND NS packets for router owned IPs were being
> > > > > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > > > > However this creates a scaling issue in scenarios where aggregation
> > > > > logical switches are connected to more logical routers (~350). The
> > > > > logical pipelines of all routers would have to be executed before the
> > > > > packet is finally replied to by a single router, the owner of the IP
> > > > > address.
> > > > >
> > > > > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > > > > for ARP requests that will be replied by a single router. The packets
> > > > > are still flooded in the L2 domain but not on any of the other patch
> > > > > ports towards other routers connected to the switch. This restricted
> > > > > flooding is done by using a new multicast group (MC_ARP_ND_FLOOD).
> > > > >
> > > > > IPs that are owned by the routers and for which this fix applies are:
> > > > > - IP addresses configured on the router ports.
> > > > > - VIPs.
> > > > > - NAT IPs.
> > > > >
> > > > > This commit also fixes function get_router_load_balancer_ips() which
> > > > > was incorrectly returning a single address_family even though the
> > > > > IP set could contain a mix of IPv4 and IPv6 addresses.
> > > > >
> > > > > Reported-at: https://bugzilla.redhat.com/1756945
> > > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > > >
> > > > Thanks Dumitru, for addressing the review comments.
> > > >
> > > > Acked-by: Numan Siddique <numans@ovn.org>
> > > >
> > > > Han, if you can take a look in this patch and provide your comments,
> > > > that would be great.
> > > >
> >
> > Hi Han,
> >
> > >
> > > Sorry for delayed response :(
> >
> > No worries and thanks for reviewing this change.
> >
> > >
> > > The patch looks good to me except:
> > >
> > > 1. It changes the behavior that when an ARP request for a LRP's IP is coming to the logical switch, other routers will no longer learn the MAC-IP binding of the ARP sender. This has been discussed and I tend to agree with Dumitru that it should not cause real issue. I think it just worth to be documented clearly in the ovn-architecture, probably in the section: Logical Routers and Logical Patch Ports, because this is something different from a traditional switch's behavior, and network administrators may get suprised.
> >
> > Ack, I'll add this in v6.
> >
> > >
> > > 2. Since we are anyway changing the behavior of ARP request handling and bypassing logical router ports, and we think it should not cause real problem, then I wonder why adding the MC_ARP_ND group to still flood to the non-router ports. Is it really useful? Maybe it is not a big deal, but I think the extra MC group would add some performance cost and it is almost redundant with the regular MC_FLOOD except that there is no router ports in it - it is a lot of redundancy considering that most LSPs are regular VIFs in typical large scale environments. I would suggest to simplify this unless there is clear concerns of not flooding to other ports.
> >
> > If I skip flooding to non-router ports the following test fails
> > because one of the things checked by the test is that we flood
> > broadcasts (including ARPs) in the broadcast domain:
> >  36: ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR           FAILED (ovs-macros.at:220)
> >
> > So my concern was that people might expect the ARP requests to be
> > flooded within the L2 broadcast domain.
> > However, we could add configuration knob to change the behavior and
> > only flood them on localnet ports. Like this we could maintain the
> > current L2 flooding but in deployments where this is not necessary the
> > users could disable the flooding to VIF ports and this would remove
> > the MC_ARP_ND group unless there's a localnet port in the datapath.
> >
> > What do you think?
> >
>
> Adding configuration knob is a valid option, but I hoped we could simplify the change instead of making it even more complex :)
>
> My point is that since we changed the behavior, it might be better to change it with a more straightforward philosophy: for ARP request for an IP that is owned by a LRP, it is not flooded. This is in fact comply with our ARP responder behavior in LS: for a known IP-MAC binding, the ARP request is not flooded but directly responded to the sender. Now with your change we are just extending this behavior for IPs owned by LRPs behind the LSPs with type "router".

Sounds good to me.

>
> I don't think we need to worry about the test case "36: ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR". In fact the test already considered a case where flooding is not expected:
>             # 192.168.33.254 is configured to the switch patch port for lrp33,
>             # so no ARP flooding expected for it.
> Now we can just change the test case a little more to exclude the IPs owned by LRPs, since this is new expected behavior. What do you think?

Sure, this can be done, I was just raising the point that I'm not sure
whether this would bother users or not.

>
> Also I am confused why you pointed out localnet port as an exception. I think we should just treat it with the same philosophy. If we are not flooding, then just don't flood to any port, no matter if it is VIF or localnet.

My reason was that behind localnet ports we might have larger physical
networks and flooding the ARP requests there might help external
devices already update their FDBs/FIBs and avoid more broadcasts
later. I'm not completely sure if the benefit is worth the effort
though.

>
> (If we do believe there is a need to flood with a new group MC_ARP_ND, I'd say let's don't bother about the configuration knob now, until we see performance impact from scale test)
>

Sure, I'm inclining though towards removing the MC_ARP_ND. The only
concern I have is regarding the localnet ports but if you think it's
ok to skip flooding ARPs there then I can send a follow-up v6 soon.

Thanks,
Dumitru

>
> > >
> > > 3. Not a big thing, but the name of the functions build_lswitch_rport_arp_responders() and build_lswitch_rport_arp_flow() are a little bit confusing. It is not for arp-responder, but for bypassing router ports. Would it be better to be something like: build_lswitch_rport_arp_req_flows() and build_lswitch_rport_arp_request_flow_for_ip()?
> >
> > Sure, I'll change them.
> > >
> > > In addition, not for the patch, but for the problem itself, I think we'd better quantify the number of router ports we support with current resubmit limit (as suggested by Ben), and document somewhere the limit and the impact if user exceeds the limit. If the limit is easily exceed in very common deployments, we probably should consider increasing the resubmit limit.
> >
> > Agreed, that should be done too.
> >
> > Thanks,
> > Dumitru
> >
> > >
> > > Thanks,
> > > Han
> >
Dumitru Ceara Nov. 8, 2019, 2:46 p.m. UTC | #6
On Thu, Nov 7, 2019 at 7:02 PM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On Thu, Nov 7, 2019 at 6:02 PM Han Zhou <hzhou@ovn.org> wrote:
> >
> >
> >
> > On Thu, Nov 7, 2019 at 7:43 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > >
> > > On Thu, Nov 7, 2019 at 3:51 AM Han Zhou <hzhou@ovn.org> wrote:
> > > >
> > > >
> > > >
> > > > On Mon, Nov 4, 2019 at 11:32 AM Numan Siddique <numans@ovn.org> wrote:
> > > > >
> > > > > On Mon, Nov 4, 2019 at 8:37 PM Dumitru Ceara <dceara@redhat.com> wrote:
> > > > > >
> > > > > > ARP request and ND NS packets for router owned IPs were being
> > > > > > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > > > > > However this creates a scaling issue in scenarios where aggregation
> > > > > > logical switches are connected to more logical routers (~350). The
> > > > > > logical pipelines of all routers would have to be executed before the
> > > > > > packet is finally replied to by a single router, the owner of the IP
> > > > > > address.
> > > > > >
> > > > > > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > > > > > for ARP requests that will be replied by a single router. The packets
> > > > > > are still flooded in the L2 domain but not on any of the other patch
> > > > > > ports towards other routers connected to the switch. This restricted
> > > > > > flooding is done by using a new multicast group (MC_ARP_ND_FLOOD).
> > > > > >
> > > > > > IPs that are owned by the routers and for which this fix applies are:
> > > > > > - IP addresses configured on the router ports.
> > > > > > - VIPs.
> > > > > > - NAT IPs.
> > > > > >
> > > > > > This commit also fixes function get_router_load_balancer_ips() which
> > > > > > was incorrectly returning a single address_family even though the
> > > > > > IP set could contain a mix of IPv4 and IPv6 addresses.
> > > > > >
> > > > > > Reported-at: https://bugzilla.redhat.com/1756945
> > > > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > > > >
> > > > > Thanks Dumitru, for addressing the review comments.
> > > > >
> > > > > Acked-by: Numan Siddique <numans@ovn.org>
> > > > >
> > > > > Han, if you can take a look in this patch and provide your comments,
> > > > > that would be great.
> > > > >
> > >
> > > Hi Han,
> > >
> > > >
> > > > Sorry for delayed response :(
> > >
> > > No worries and thanks for reviewing this change.
> > >
> > > >
> > > > The patch looks good to me except:
> > > >
> > > > 1. It changes the behavior that when an ARP request for a LRP's IP is coming to the logical switch, other routers will no longer learn the MAC-IP binding of the ARP sender. This has been discussed and I tend to agree with Dumitru that it should not cause real issue. I think it just worth to be documented clearly in the ovn-architecture, probably in the section: Logical Routers and Logical Patch Ports, because this is something different from a traditional switch's behavior, and network administrators may get suprised.
> > >
> > > Ack, I'll add this in v6.
> > >
> > > >
> > > > 2. Since we are anyway changing the behavior of ARP request handling and bypassing logical router ports, and we think it should not cause real problem, then I wonder why adding the MC_ARP_ND group to still flood to the non-router ports. Is it really useful? Maybe it is not a big deal, but I think the extra MC group would add some performance cost and it is almost redundant with the regular MC_FLOOD except that there is no router ports in it - it is a lot of redundancy considering that most LSPs are regular VIFs in typical large scale environments. I would suggest to simplify this unless there is clear concerns of not flooding to other ports.
> > >
> > > If I skip flooding to non-router ports the following test fails
> > > because one of the things checked by the test is that we flood
> > > broadcasts (including ARPs) in the broadcast domain:
> > >  36: ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR           FAILED (ovs-macros.at:220)
> > >
> > > So my concern was that people might expect the ARP requests to be
> > > flooded within the L2 broadcast domain.
> > > However, we could add configuration knob to change the behavior and
> > > only flood them on localnet ports. Like this we could maintain the
> > > current L2 flooding but in deployments where this is not necessary the
> > > users could disable the flooding to VIF ports and this would remove
> > > the MC_ARP_ND group unless there's a localnet port in the datapath.
> > >
> > > What do you think?
> > >
> >
> > Adding configuration knob is a valid option, but I hoped we could simplify the change instead of making it even more complex :)
> >
> > My point is that since we changed the behavior, it might be better to change it with a more straightforward philosophy: for ARP request for an IP that is owned by a LRP, it is not flooded. This is in fact comply with our ARP responder behavior in LS: for a known IP-MAC binding, the ARP request is not flooded but directly responded to the sender. Now with your change we are just extending this behavior for IPs owned by LRPs behind the LSPs with type "router".
>
> Sounds good to me.
>
> >
> > I don't think we need to worry about the test case "36: ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR". In fact the test already considered a case where flooding is not expected:
> >             # 192.168.33.254 is configured to the switch patch port for lrp33,
> >             # so no ARP flooding expected for it.
> > Now we can just change the test case a little more to exclude the IPs owned by LRPs, since this is new expected behavior. What do you think?
>
> Sure, this can be done, I was just raising the point that I'm not sure
> whether this would bother users or not.
>
> >
> > Also I am confused why you pointed out localnet port as an exception. I think we should just treat it with the same philosophy. If we are not flooding, then just don't flood to any port, no matter if it is VIF or localnet.
>
> My reason was that behind localnet ports we might have larger physical
> networks and flooding the ARP requests there might help external
> devices already update their FDBs/FIBs and avoid more broadcasts
> later. I'm not completely sure if the benefit is worth the effort
> though.
>
> >
> > (If we do believe there is a need to flood with a new group MC_ARP_ND, I'd say let's don't bother about the configuration knob now, until we see performance impact from scale test)
> >
>
> Sure, I'm inclining though towards removing the MC_ARP_ND. The only
> concern I have is regarding the localnet ports but if you think it's
> ok to skip flooding ARPs there then I can send a follow-up v6 soon.
>
> Thanks,
> Dumitru

Hi Han, Numan,

I sent v6 addressing the points discussed above:
https://patchwork.ozlabs.org/patch/1191926/

Numan, I forgot to add your Acked-by to the patch but it might be
better to have another look as I did change a bit the code between v5
and v6.

Thanks,
Dumitru

>
> >
> > > >
> > > > 3. Not a big thing, but the name of the functions build_lswitch_rport_arp_responders() and build_lswitch_rport_arp_flow() are a little bit confusing. It is not for arp-responder, but for bypassing router ports. Would it be better to be something like: build_lswitch_rport_arp_req_flows() and build_lswitch_rport_arp_request_flow_for_ip()?
> > >
> > > Sure, I'll change them.
> > > >
> > > > In addition, not for the patch, but for the problem itself, I think we'd better quantify the number of router ports we support with current resubmit limit (as suggested by Ben), and document somewhere the limit and the impact if user exceeds the limit. If the limit is easily exceed in very common deployments, we probably should consider increasing the resubmit limit.
> > >
> > > Agreed, that should be done too.
> > >
> > > Thanks,
> > > Dumitru
> > >
> > > >
> > > > Thanks,
> > > > Han
> > >
diff mbox series

Patch

diff --git a/lib/mcast-group-index.h b/lib/mcast-group-index.h
index ba995ba..06bd8b3 100644
--- a/lib/mcast-group-index.h
+++ b/lib/mcast-group-index.h
@@ -27,6 +27,7 @@  enum ovn_mcast_tunnel_keys {
 
     OVN_MCAST_FLOOD_TUNNEL_KEY = OVN_MIN_MULTICAST,
     OVN_MCAST_UNKNOWN_TUNNEL_KEY,
+    OVN_MCAST_ARP_ND_TUNNEL_KEY,
     OVN_MCAST_MROUTER_FLOOD_TUNNEL_KEY,
     OVN_MCAST_MROUTER_STATIC_TUNNEL_KEY,
     OVN_MCAST_STATIC_TUNNEL_KEY,
diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index 0a33dcd..6fbb3ab 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -1005,6 +1005,22 @@  output;
       </li>
 
       <li>
+        Priority-80 flows for each port connected to a logical router
+        matching self originated GARP/ARP request/ND packets. These packets
+        are flooded to the <code>MC_FLOOD</code> which contains all logical
+        ports.
+      </li>
+
+      <li>
+        Priority-75 flows for each IP address/VIP/NAT address owned by a
+        router port connected to the switch. These flows match ARP requests
+        and ND packets for the specific IP addresses.  Matched packets are
+        forwarded in the L3 domain only to the router that owns the IP
+        address and flooded in the L2 domain on all ports except patch
+        ports connected to logical routers.
+      </li>
+
+      <li>
         A priority-70 flow that outputs all packets with an Ethernet broadcast
         or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
         multicast group.
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index c23c270..31e3c78 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -210,6 +210,8 @@  enum ovn_stage {
 #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
 #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
 
+#define REGBIT_NOT_VXLAN "flags[1] == 0"
+
 /* Returns an "enum ovn_stage" built from the arguments. */
 static enum ovn_stage
 ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
@@ -1202,6 +1204,34 @@  ovn_port_allocate_key(struct ovn_datapath *od)
                           1, (1u << 15) - 1, &od->port_key_hint);
 }
 
+/* Returns true if the logical switch port 'enabled' column is empty or
+ * set to true.  Otherwise, returns false. */
+static bool
+lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
+{
+    return !lsp->n_enabled || *lsp->enabled;
+}
+
+/* Returns true only if the logical switch port 'up' column is set to true.
+ * Otherwise, if the column is not set or set to false, returns false. */
+static bool
+lsp_is_up(const struct nbrec_logical_switch_port *lsp)
+{
+    return lsp->n_up && *lsp->up;
+}
+
+static bool
+lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
+{
+    return !strcmp(nbsp->type, "external");
+}
+
+static bool
+lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
+{
+    return !lrport->enabled || *lrport->enabled;
+}
+
 static char *
 chassis_redirect_name(const char *port_name)
 {
@@ -2184,7 +2214,7 @@  ip_address_and_port_from_lb_key(const char *key, char **ip_address,
 
 static void
 get_router_load_balancer_ips(const struct ovn_datapath *od,
-                             struct sset *all_ips, int *addr_family)
+                             struct sset *all_ips_v4, struct sset *all_ips_v6)
 {
     if (!od->nbr) {
         return;
@@ -2199,13 +2229,21 @@  get_router_load_balancer_ips(const struct ovn_datapath *od,
             /* node->key contains IP:port or just IP. */
             char *ip_address = NULL;
             uint16_t port;
+            int addr_family;
 
             ip_address_and_port_from_lb_key(node->key, &ip_address, &port,
-                                            addr_family);
+                                            &addr_family);
             if (!ip_address) {
                 continue;
             }
 
+            struct sset *all_ips;
+            if (addr_family == AF_INET) {
+                all_ips = all_ips_v4;
+            } else {
+                all_ips = all_ips_v6;
+            }
+
             if (!sset_contains(all_ips, ip_address)) {
                 sset_add(all_ips, ip_address);
             }
@@ -2299,17 +2337,22 @@  get_nat_addresses(const struct ovn_port *op, size_t *n)
         }
     }
 
-    /* A set to hold all load-balancer vips. */
-    struct sset all_ips = SSET_INITIALIZER(&all_ips);
-    int addr_family;
-    get_router_load_balancer_ips(op->od, &all_ips, &addr_family);
+    /* Two sets to hold all load-balancer vips. */
+    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
+    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
+    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
 
     const char *ip_address;
-    SSET_FOR_EACH (ip_address, &all_ips) {
+    SSET_FOR_EACH (ip_address, &all_ips_v4) {
         ds_put_format(&c_addresses, " %s", ip_address);
         central_ip_address = true;
     }
-    sset_destroy(&all_ips);
+    SSET_FOR_EACH (ip_address, &all_ips_v6) {
+        ds_put_format(&c_addresses, " %s", ip_address);
+        central_ip_address = true;
+    }
+    sset_destroy(&all_ips_v4);
+    sset_destroy(&all_ips_v6);
 
     if (central_ip_address) {
         /* Gratuitous ARP for centralized NAT rules on distributed gateway
@@ -3036,6 +3079,10 @@  static const struct multicast_group mc_static =
 static const struct multicast_group mc_unknown =
     { MC_UNKNOWN, OVN_MCAST_UNKNOWN_TUNNEL_KEY };
 
+#define MC_ARP_ND "_MC_arp_nd"
+static const struct multicast_group mc_arp_nd =
+    { MC_ARP_ND, OVN_MCAST_ARP_ND_TUNNEL_KEY };
+
 static bool
 multicast_group_equal(const struct multicast_group *a,
                       const struct multicast_group *b)
@@ -3737,28 +3784,6 @@  build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
 
 }
 
-/* Returns true if the logical switch port 'enabled' column is empty or
- * set to true.  Otherwise, returns false. */
-static bool
-lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
-{
-    return !lsp->n_enabled || *lsp->enabled;
-}
-
-/* Returns true only if the logical switch port 'up' column is set to true.
- * Otherwise, if the column is not set or set to false, returns false. */
-static bool
-lsp_is_up(const struct nbrec_logical_switch_port *lsp)
-{
-    return lsp->n_up && *lsp->up;
-}
-
-static bool
-lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
-{
-    return !strcmp(nbsp->type, "external");
-}
-
 static bool
 build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
                     struct ds *options_action, struct ds *response_action,
@@ -5161,6 +5186,143 @@  build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
     }
 }
 
+/*
+ * Ingress table 17: Flows that forward ARP/ND requests only to the routers
+ * that own the addresses. Packets are still flooded in the switching domain
+ * as regular broadcast.
+ */
+static void
+build_lswitch_rport_arp_flow(const char *target_address, int addr_family,
+                             struct ovn_port *patch_op,
+                             struct ovn_datapath *od,
+                             uint32_t priority,
+                             struct hmap *lflows)
+{
+    struct ds match   = DS_EMPTY_INITIALIZER;
+    struct ds actions = DS_EMPTY_INITIALIZER;
+
+    if (addr_family == AF_INET) {
+        ds_put_format(&match, "arp.tpa == %s && arp.op == 1", target_address);
+    } else {
+        ds_put_format(&match, "nd.target == %s && nd_ns", target_address);
+    }
+
+    /* Packets received from VXLAN tunnels have already been through the
+     * router pipeline so we should skip them. Normally this is done by the
+     * multicast_group implementation (VXLAN packets skip table 32 which
+     * delivers to patch ports) but we're bypassing multicast_groups.
+     */
+    ds_put_format(&match, " && "REGBIT_NOT_VXLAN);
+
+    /* Send a clone of the packet to the router pipeline and flood the
+     * original in the broadcast domain (skipping router ports). */
+    ds_put_format(&actions,
+                  "clone { outport = %s; output; }; "
+                  "outport = \""MC_ARP_ND"\"; output;",
+                  patch_op->json_key);
+    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
+                  ds_cstr(&match), ds_cstr(&actions));
+
+    ds_destroy(&match);
+    ds_destroy(&actions);
+}
+
+/*
+ * Ingress table 17: Flows that forward ARP/ND requests only to the routers
+ * that own the addresses.
+ * Priorities:
+ * - 80: self originated GARPs that need to follow regular processing.
+ * - 75: ARP requests to router owned IPs (interface IP/LB/NAT).
+ */
+static void
+build_lswitch_rport_arp_responders(struct ovn_port *op,
+                                   struct ovn_datapath *sw_od,
+                                   struct ovn_port *sw_op,
+                                   struct hmap *lflows)
+{
+    if (!op || !op->nbrp) {
+        return;
+    }
+
+    if (!lrport_is_enabled(op->nbrp)) {
+        return;
+    }
+
+    struct ds match = DS_EMPTY_INITIALIZER;
+
+    /* Self originated (G)ARP requests/ND need to be flooded as usual.
+     * Priority: 80.
+     */
+    ds_put_format(&match, "inport == %s && (arp.op == 1 || nd_ns)",
+                  sw_op->json_key);
+    ovn_lflow_add(lflows, sw_od, S_SWITCH_IN_L2_LKUP, 80,
+                  ds_cstr(&match),
+                  "outport = \""MC_FLOOD"\"; output;");
+
+    ds_destroy(&match);
+
+    /* Forward ARP requests for IPs configured on the router only to this
+     * router port.
+     * Priority: 75.
+     */
+    for (int i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
+        build_lswitch_rport_arp_flow(op->lrp_networks.ipv4_addrs[i].addr_s,
+                                     AF_INET, sw_op, sw_od, 75, lflows);
+    }
+    for (int i = 0; i < op->lrp_networks.n_ipv6_addrs; i++) {
+        build_lswitch_rport_arp_flow(op->lrp_networks.ipv6_addrs[i].addr_s,
+                                     AF_INET6, sw_op, sw_od, 75, lflows);
+    }
+
+    /* Forward ARP requests to load-balancer VIPs configured on the router
+     * only to this router port.
+     * Priority: 75.
+     */
+    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
+    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
+    const char *ip_address;
+
+    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
+
+    SSET_FOR_EACH (ip_address, &all_ips_v4) {
+        build_lswitch_rport_arp_flow(ip_address, AF_INET, sw_op, sw_od,
+                                     75, lflows);
+    }
+    SSET_FOR_EACH (ip_address, &all_ips_v6) {
+        build_lswitch_rport_arp_flow(ip_address, AF_INET6, sw_op, sw_od,
+                                     75, lflows);
+    }
+    sset_destroy(&all_ips_v4);
+    sset_destroy(&all_ips_v6);
+
+    /* Forward ARP requests to NAT addresses configured on the router
+     * only to this router port.
+     * Priority: 75.
+     */
+    for (int i = 0; i < op->od->nbr->n_nat; i++) {
+        const struct nbrec_nat *nat = op->od->nbr->nat[i];
+
+        if (!strcmp(nat->type, "snat")) {
+            continue;
+        }
+
+        ovs_be32 ip;
+        ovs_be32 mask;
+        struct in6_addr ipv6;
+        struct in6_addr mask_v6;
+
+        if (ip_parse_masked(nat->external_ip, &ip, &mask)) {
+            if (!ipv6_parse_masked(nat->external_ip, &ipv6, &mask_v6)) {
+                build_lswitch_rport_arp_flow(nat->external_ip, AF_INET6, sw_op,
+                                             sw_od, 75, lflows);
+            }
+        } else {
+            build_lswitch_rport_arp_flow(nat->external_ip, AF_INET, sw_op,
+                                         sw_od, 75, lflows);
+        }
+    }
+}
+
 static void
 build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
                     struct hmap *port_groups, struct hmap *lflows,
@@ -5748,6 +5910,15 @@  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
             continue;
         }
 
+        /* For ports connected to logical routers add flows to bypass the
+         * broadcast flooding of ARP/ND requests in table 17. We direct the
+         * requests only to the router port that owns the IP address.
+         */
+        if (!strcmp(op->nbsp->type, "router")) {
+            build_lswitch_rport_arp_responders(op->peer, op->od, op,
+                                               lflows);
+        }
+
         for (size_t i = 0; i < op->nbsp->n_addresses; i++) {
             /* Addresses are owned by the logical port.
              * Ethernet address followed by zero or more IPv4
@@ -5879,12 +6050,6 @@  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
     ds_destroy(&actions);
 }
 
-static bool
-lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
-{
-    return !lrport->enabled || *lrport->enabled;
-}
-
 /* Returns a string of the IP address of the router port 'op' that
  * overlaps with 'ip_s".  If one is not found, returns NULL.
  *
@@ -6904,61 +7069,66 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
         }
 
         /* A set to hold all load-balancer vips that need ARP responses. */
-        struct sset all_ips = SSET_INITIALIZER(&all_ips);
-        int addr_family;
-        get_router_load_balancer_ips(op->od, &all_ips, &addr_family);
+        struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
+        struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
+        get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
 
         const char *ip_address;
-        SSET_FOR_EACH(ip_address, &all_ips) {
+        SSET_FOR_EACH (ip_address, &all_ips_v4) {
             ds_clear(&match);
-            if (addr_family == AF_INET) {
-                ds_put_format(&match,
-                              "inport == %s && arp.tpa == %s && arp.op == 1",
-                              op->json_key, ip_address);
-            } else {
-                ds_put_format(&match,
-                              "inport == %s && nd_ns && nd.target == %s",
-                              op->json_key, ip_address);
-            }
+            ds_put_format(&match,
+                          "inport == %s && arp.tpa == %s && arp.op == 1",
+                          op->json_key, ip_address);
 
             ds_clear(&actions);
-            if (addr_family == AF_INET) {
-                ds_put_format(&actions,
-                "eth.dst = eth.src; "
-                "eth.src = %s; "
-                "arp.op = 2; /* ARP reply */ "
-                "arp.tha = arp.sha; "
-                "arp.sha = %s; "
-                "arp.tpa = arp.spa; "
-                "arp.spa = %s; "
-                "outport = %s; "
-                "flags.loopback = 1; "
-                "output;",
-                op->lrp_networks.ea_s,
-                op->lrp_networks.ea_s,
-                ip_address,
-                op->json_key);
-            } else {
-                ds_put_format(&actions,
-                "nd_na { "
-                "eth.src = %s; "
-                "ip6.src = %s; "
-                "nd.target = %s; "
-                "nd.tll = %s; "
-                "outport = inport; "
-                "flags.loopback = 1; "
-                "output; "
-                "};",
-                op->lrp_networks.ea_s,
-                ip_address,
-                ip_address,
-                op->lrp_networks.ea_s);
-            }
+            ds_put_format(&actions,
+                          "eth.dst = eth.src; "
+                          "eth.src = %s; "
+                          "arp.op = 2; /* ARP reply */ "
+                          "arp.tha = arp.sha; "
+                          "arp.sha = %s; "
+                          "arp.tpa = arp.spa; "
+                          "arp.spa = %s; "
+                          "outport = %s; "
+                          "flags.loopback = 1; "
+                          "output;",
+                          op->lrp_networks.ea_s,
+                          op->lrp_networks.ea_s,
+                          ip_address,
+                          op->json_key);
+
             ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 90,
                           ds_cstr(&match), ds_cstr(&actions));
         }
 
-        sset_destroy(&all_ips);
+        SSET_FOR_EACH (ip_address, &all_ips_v6) {
+            ds_clear(&match);
+            ds_put_format(&match,
+                          "inport == %s && nd_ns && nd.target == %s",
+                          op->json_key, ip_address);
+
+            ds_clear(&actions);
+            ds_put_format(&actions,
+                          "nd_na { "
+                          "eth.src = %s; "
+                          "ip6.src = %s; "
+                          "nd.target = %s; "
+                          "nd.tll = %s; "
+                          "outport = inport; "
+                          "flags.loopback = 1; "
+                          "output; "
+                          "};",
+                          op->lrp_networks.ea_s,
+                          ip_address,
+                          ip_address,
+                          op->lrp_networks.ea_s);
+
+            ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 90,
+                          ds_cstr(&match), ds_cstr(&actions));
+        }
+
+        sset_destroy(&all_ips_v4);
+        sset_destroy(&all_ips_v6);
 
         /* A gateway router can have 2 SNAT IP addresses to force DNATed and
          * LBed traffic respectively to be SNATed.  In addition, there can be
@@ -9392,6 +9562,12 @@  build_mcast_groups(struct northd_context *ctx,
         } else if (op->nbsp && lsp_is_enabled(op->nbsp)) {
             ovn_multicast_add(mcast_groups, &mc_flood, op);
 
+            /* Add all non-router ports to the ARP ND L2 broadcast flood
+             * domain entry. */
+            if (strcmp(op->nbsp->type, "router")) {
+                ovn_multicast_add(mcast_groups, &mc_arp_nd, op);
+            }
+
             /* If this port is connected to a multicast router then add it
              * to the MC_MROUTER_FLOOD group.
              */
diff --git a/tests/ovn.at b/tests/ovn.at
index 410f4b5..196d379 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -9595,7 +9595,7 @@  ovn-nbctl --wait=hv --timeout=3 sync
 # Check that there is a logical flow in logical switch foo's pipeline
 # to set the outport to rp-foo (which is expected).
 OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
-grep rp-foo | grep -v is_chassis_resident | wc -l`])
+grep rp-foo | grep -v is_chassis_resident | grep priority=50 -c`])
 
 # Set the option 'reside-on-redirect-chassis' for foo
 ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
@@ -9603,7 +9603,7 @@  ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
 # to set the outport to rp-foo with the condition is_chassis_redirect.
 ovn-sbctl dump-flows foo
 OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
-grep rp-foo | grep is_chassis_resident | wc -l`])
+grep rp-foo | grep is_chassis_resident | grep priority=50 -c`])
 
 echo "---------NB dump-----"
 ovn-nbctl show
@@ -16676,3 +16676,282 @@  as hv4 ovs-appctl fdb/show br-phys
 OVN_CLEANUP([hv1],[hv2],[hv3],[hv4])
 
 AT_CLEANUP
+
+AT_SETUP([ovn -- ARP/ND request broadcast limiting])
+AT_SKIP_IF([test $HAVE_PYTHON = no])
+ovn_start
+
+ip_to_hex() {
+    printf "%02x%02x%02x%02x" "$@"
+}
+
+send_arp_request() {
+    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5
+    local eth_dst=ffffffffffff
+    local eth_type=0806
+    local eth=${eth_dst}${eth_src}${eth_type}
+
+    local arp=0001080006040001${eth_src}${spa}${eth_dst}${tpa}
+
+    local request=${eth}${arp}
+    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
+}
+
+send_nd_ns() {
+    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5 cksum=$6
+
+    local eth_dst=ffffffffffff
+    local eth_type=86dd
+    local eth=${eth_dst}${eth_src}${eth_type}
+
+    local ip_vhlen=60000000
+    local ip_plen=0020
+    local ip_next=3a
+    local ip_ttl=ff
+    local ip=${ip_vhlen}${ip_plen}${ip_next}${ip_ttl}${spa}${tpa}
+
+    # Neighbor Solicitation
+    local icmp6_type=87
+    local icmp6_code=00
+    local icmp6_rsvd=00000000
+    # ICMPv6 source lla option
+    local icmp6_opt=01
+    local icmp6_optlen=01
+    local icmp6=${icmp6_type}${icmp6_code}${cksum}${icmp6_rsvd}${tpa}${icmp6_opt}${icmp6_optlen}${eth_src}
+
+    local request=${eth}${ip}${icmp6}
+
+    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
+}
+
+src_mac=000000000001
+
+net_add n1
+sim_add hv1
+as hv1
+ovs-vsctl add-br br-phys
+ovn_attach n1 br-phys 192.168.0.1
+
+ovs-vsctl -- add-port br-int hv1-vif1 -- \
+    set interface hv1-vif1 external-ids:iface-id=sw-agg-ext \
+    options:tx_pcap=hv1/vif1-tx.pcap \
+    options:rxq_pcap=hv1/vif1-rx.pcap \
+    ofport-request=1
+
+# One Aggregation Switch connected to two Logical networks (routers).
+ovn-nbctl ls-add sw-agg
+ovn-nbctl lsp-add sw-agg sw-agg-ext \
+    -- lsp-set-addresses sw-agg-ext 00:00:00:00:00:01
+
+ovn-nbctl lsp-add sw-agg sw-rtr1                   \
+    -- lsp-set-type sw-rtr1 router                 \
+    -- lsp-set-addresses sw-rtr1 00:00:00:00:01:00 \
+    -- lsp-set-options sw-rtr1 router-port=rtr1-sw
+ovn-nbctl lsp-add sw-agg sw-rtr2                   \
+    -- lsp-set-type sw-rtr2 router                 \
+    -- lsp-set-addresses sw-rtr2 00:00:00:00:02:00 \
+    -- lsp-set-options sw-rtr2 router-port=rtr2-sw
+
+# Configure L3 interface IPv4 & IPv6 on both routers
+ovn-nbctl lr-add rtr1
+ovn-nbctl lrp-add rtr1 rtr1-sw 00:00:00:00:01:00 10.0.0.1/24 10::1/64
+
+ovn-nbctl lr-add rtr2
+ovn-nbctl lrp-add rtr2 rtr2-sw 00:00:00:00:02:00 10.0.0.2/24 10::2/64
+
+OVN_POPULATE_ARP
+ovn-nbctl --wait=hv sync
+
+sw_dp_uuid=$(ovn-sbctl --bare --columns _uuid list datapath_binding sw-agg)
+sw_dp_key=$(ovn-sbctl --bare --columns tunnel_key list datapath_binding sw-agg)
+
+r1_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr1)
+r2_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr2)
+
+mc_key=$(ovn-sbctl --bare --columns tunnel_key find multicast_group datapath=${sw_dp_uuid} name="_MC_flood")
+mc_key=$(printf "%04x" $mc_key)
+
+match_sw_metadata="metadata=0x${sw_dp_key}"
+
+# Inject ARP request for first router owned IP address.
+send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 1)
+
+# Verify that the ARP request is sent only to rtr1.
+match_arp_req="${match_sw_metadata}.*arp_tpa=10.0.0.1,arp_op=1"
+match_send_rtr1="clone(load:0x${r1_tnl_key}->NXM_NX_REG15"
+match_send_rtr2="clone(load:0x${r2_tnl_key}->NXM_NX_REG15"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Inject ND_NS for ofirst router owned IP address.
+src_ipv6=00100000000000000000000000000254
+dst_ipv6=00100000000000000000000000000001
+send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
+
+# Verify that the ND_NS is sent only to rtr1.
+match_nd_ns="${match_sw_metadata}.*icmp_type=135.*nd_target=10::1"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Configure load balancing on both routers.
+ovn-nbctl lb-add lb1-v4 10.0.0.11 42.42.42.1
+ovn-nbctl lb-add lb1-v6 10::11 42::1
+ovn-nbctl lr-lb-add rtr1 lb1-v4
+ovn-nbctl lr-lb-add rtr1 lb1-v6
+
+ovn-nbctl lb-add lb2-v4 10.0.0.22 42.42.42.2
+ovn-nbctl lb-add lb2-v6 10::22 42::2
+ovn-nbctl lr-lb-add rtr2 lb2-v4
+ovn-nbctl lr-lb-add rtr2 lb2-v6
+ovn-nbctl --wait=hv sync
+
+# Inject ARP request for first router owned VIP address.
+send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 11)
+
+# Verify that the ARP request is sent only to rtr1.
+match_arp_req="${match_sw_metadata}.*arp_tpa=10.0.0.11,arp_op=1"
+match_send_rtr1="clone(load:0x${r1_tnl_key}->NXM_NX_REG15"
+match_send_rtr2="clone(load:0x${r2_tnl_key}->NXM_NX_REG15"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Inject ND_NS for first router owned VIP address.
+src_ipv6=00100000000000000000000000000254
+dst_ipv6=00100000000000000000000000000011
+send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
+
+# Verify that the ND_NS is sent only to rtr1.
+match_nd_ns="${match_sw_metadata}.*icmp_type=135.*nd_target=10::11"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Configure NAT on both routers
+ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10.0.0.111 42.42.42.1
+ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10::111 42::1
+ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10.0.0.222 42.42.42.2
+ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10::222 42::2
+
+# Inject ARP request for first router owned NAT address.
+send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 111)
+
+# Verify that the ARP request is sent only to rtr1.
+match_arp_req="${match_sw_metadata}.*arp_tpa=10.0.0.111,arp_op=1"
+match_send_rtr1="clone(load:0x${r1_tnl_key}->NXM_NX_REG15"
+match_send_rtr2="clone(load:0x${r2_tnl_key}->NXM_NX_REG15"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Inject ND_NS for first router owned IP address.
+src_ipv6=00100000000000000000000000000254
+dst_ipv6=00100000000000000000000000000111
+send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
+
+# Verify that the ND_NS is sent only to rtr1.
+match_nd_ns="${match_sw_metadata}.*icmp_type=135.*nd_target=10::111"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+OVN_CLEANUP([hv1])
+AT_CLEANUP