[ovs-dev,ovn,v6] ovn-northd: Limit ARP/ND broadcast domain whenever possible.
diff mbox series

Message ID 1573223879-9535-1-git-send-email-dceara@redhat.com
State Superseded
Headers show
Series
  • [ovs-dev,ovn,v6] ovn-northd: Limit ARP/ND broadcast domain whenever possible.
Related show

Commit Message

Dumitru Ceara Nov. 8, 2019, 2:37 p.m. UTC
ARP request and ND NS packets for router owned IPs were being
flooded in the complete L2 domain (using the MC_FLOOD multicast group).
However this creates a scaling issue in scenarios where aggregation
logical switches are connected to more logical routers (~350). The
logical pipelines of all routers would have to be executed before the
packet is finally replied to by a single router, the owner of the IP
address.

This commit limits the broadcast domain by bypassing the L2 Lookup stage
for ARP requests that will be replied by a single router. The packets
are forwarded only to the router port that owns the target IP address.

IPs that are owned by the routers and for which this fix applies are:
- IP addresses configured on the router ports.
- VIPs.
- NAT IPs.

This commit also fixes function get_router_load_balancer_ips() which
was incorrectly returning a single address_family even though the
IP set could contain a mix of IPv4 and IPv6 addresses.

Reported-at: https://bugzilla.redhat.com/1756945
Reported-by: Anil Venkata <vkommadi@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>

---
v6:
- Address Han's comments:
  - remove flooding of ARPs targeting OVN owned IP addresses.
  - update ovn-architecture documentation.
  - rename ARP handling functions.
  - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
    account the new way of forwarding ARPs.
- Also, properly deal with ARP packets on VLAN-backed networks.
v5: Address Numan's comments: update comments & make autotest more
    robust.
v4: Rebase.
v3: Properly deal with VXLAN traffic. Address review comments from
    Numan (add autotests). Fix function get_router_load_balancer_ips.
    Rebase -> deal with IPv6 NAT too.
v2: Move ARP broadcast domain limiting to table S_SWITCH_IN_L2_LKUP to
address localnet ports too.
---
 controller/physical.c        |   2 +
 include/ovn/logical-fields.h |   4 +
 northd/ovn-northd.8.xml      |  16 ++
 northd/ovn-northd.c          | 337 ++++++++++++++++++++++++++++++++-----------
 ovn-architecture.7.xml       |  19 +++
 tests/ovn.at                 | 307 +++++++++++++++++++++++++++++++++++++--
 6 files changed, 591 insertions(+), 94 deletions(-)

Comments

Han Zhou Nov. 9, 2019, 7:34 a.m. UTC | #1
On Fri, Nov 8, 2019 at 6:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> ARP request and ND NS packets for router owned IPs were being
> flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> However this creates a scaling issue in scenarios where aggregation
> logical switches are connected to more logical routers (~350). The
> logical pipelines of all routers would have to be executed before the
> packet is finally replied to by a single router, the owner of the IP
> address.
>
> This commit limits the broadcast domain by bypassing the L2 Lookup stage
> for ARP requests that will be replied by a single router. The packets
> are forwarded only to the router port that owns the target IP address.
>
> IPs that are owned by the routers and for which this fix applies are:
> - IP addresses configured on the router ports.
> - VIPs.
> - NAT IPs.
>
> This commit also fixes function get_router_load_balancer_ips() which
> was incorrectly returning a single address_family even though the
> IP set could contain a mix of IPv4 and IPv6 addresses.
>
> Reported-at: https://bugzilla.redhat.com/1756945
> Reported-by: Anil Venkata <vkommadi@redhat.com>
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>
> ---
> v6:
> - Address Han's comments:
>   - remove flooding of ARPs targeting OVN owned IP addresses.
>   - update ovn-architecture documentation.
>   - rename ARP handling functions.
>   - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
>     account the new way of forwarding ARPs.
> - Also, properly deal with ARP packets on VLAN-backed networks.

I am confused by this additional VLAN related change. VLAN is just handled
as bridged logical switch where a localnet port is used as *inport*. It
seems to me no difference from regular localnet port no matter with or
without VLAN tag. When an ARP request is going through the ingress pipeline
of the bridged logical switch, the logic of bypassing the flooding added by
this patch should just apply, right? It is also very common scenario that
the *aggregation switch* for the routers is an external physical network
backed by VLAN. I think the purpose of this patch is to make sure scenario
scale. Did I misunderstand anything here?

In addition, the below macro definition may be renamed to FLAGBIT_...,
because it is for the bits of MFF_LOG_FLAGS, which is one of the
pre-defined logical fields, instead of for the MFF_LOG_REG0-9 registers.
>
> +#define REGBIT_NOT_VXLAN "flags[1] == 0"
> +#define REGBIT_NOT_VLAN "flags[7] == 0"
> +

The other part looks good to me. Thanks for simply the patch.

Han
Dumitru Ceara Nov. 11, 2019, 1:32 p.m. UTC | #2
On Sat, Nov 9, 2019 at 8:35 AM Han Zhou <zhouhan@gmail.com> wrote:
>
>
>
> On Fri, Nov 8, 2019 at 6:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > ARP request and ND NS packets for router owned IPs were being
> > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > However this creates a scaling issue in scenarios where aggregation
> > logical switches are connected to more logical routers (~350). The
> > logical pipelines of all routers would have to be executed before the
> > packet is finally replied to by a single router, the owner of the IP
> > address.
> >
> > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > for ARP requests that will be replied by a single router. The packets
> > are forwarded only to the router port that owns the target IP address.
> >
> > IPs that are owned by the routers and for which this fix applies are:
> > - IP addresses configured on the router ports.
> > - VIPs.
> > - NAT IPs.
> >
> > This commit also fixes function get_router_load_balancer_ips() which
> > was incorrectly returning a single address_family even though the
> > IP set could contain a mix of IPv4 and IPv6 addresses.
> >
> > Reported-at: https://bugzilla.redhat.com/1756945
> > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >
> > ---
> > v6:
> > - Address Han's comments:
> >   - remove flooding of ARPs targeting OVN owned IP addresses.
> >   - update ovn-architecture documentation.
> >   - rename ARP handling functions.
> >   - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
> >     account the new way of forwarding ARPs.
> > - Also, properly deal with ARP packets on VLAN-backed networks.
>
> I am confused by this additional VLAN related change. VLAN is just handled as bridged logical switch where a localnet port is used as *inport*. It seems to me no difference from regular localnet port no matter with or without VLAN tag. When an ARP request is going through the ingress pipeline of the bridged logical switch, the logic of bypassing the flooding added by this patch should just apply, right? It is also very common scenario that the *aggregation switch* for the routers is an external physical network backed by VLAN. I think the purpose of this patch is to make sure scenario scale. Did I misunderstand anything here?

Hi Han,

The reason behind it was that with VLAN backed networks when packets
move between hypervisors without going through geneve we rerun the
ingress pipeline. That would mean that the flows I added for self
originated (G)ARP packets won't be hit for ARP requests originated by
ovn-controller on a remote hypervisor:

priority=80 match: "inport=<patch-port-to-lr>, arp.op == 1" action:
"outport=MC_FLOOD"

But instead we would hit:
priority=75 match: "arp.op == 1 && arp.tpa == <OVN-owned-IP" action:
"outport=<patch-port-to-lr>" and never send flood the packet out on
the second HV.

You're right, the way I added the check for all VLAN packets is not OK
as we fall back to the old behavior too often. However, I'm not sure
what the best option is. Do you think it's fine if I change the
priority 80 flow to the following?

priority=80 match: "eth.src={lrp.ea, lr.nat.external_macs} && arp.op
== 1" action: "outport=MC_FLOOD"

The idea would be to identify self originated ARP requests by matching
the source mac instead of logical ingress port (as this might not be
present). I tried it locally and it works fine but we do need to add
more flows than before.

Thanks,
Dumitru

>
> In addition, the below macro definition may be renamed to FLAGBIT_..., because it is for the bits of MFF_LOG_FLAGS, which is one of the pre-defined logical fields, instead of for the MFF_LOG_REG0-9 registers.
> >
> > +#define REGBIT_NOT_VXLAN "flags[1] == 0"
> > +#define REGBIT_NOT_VLAN "flags[7] == 0"
> > +
>
> The other part looks good to me. Thanks for simply the patch.
>
> Han
Han Zhou Nov. 11, 2019, 5:10 p.m. UTC | #3
1. Would there be problem for VLAN backed logical router use case, since a
chassis MAC is used as src MAC to send packets from router ports?
2. How about checking if tpa == spa to make sure GARP is always flooded?
(not directly supported by OF)


On Mon, Nov 11, 2019 at 5:32 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On Sat, Nov 9, 2019 at 8:35 AM Han Zhou <zhouhan@gmail.com> wrote:
> >
> >
> >
> > On Fri, Nov 8, 2019 at 6:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > >
> > > ARP request and ND NS packets for router owned IPs were being
> > > flooded in the complete L2 domain (using the MC_FLOOD multicast
group).
> > > However this creates a scaling issue in scenarios where aggregation
> > > logical switches are connected to more logical routers (~350). The
> > > logical pipelines of all routers would have to be executed before the
> > > packet is finally replied to by a single router, the owner of the IP
> > > address.
> > >
> > > This commit limits the broadcast domain by bypassing the L2 Lookup
stage
> > > for ARP requests that will be replied by a single router. The packets
> > > are forwarded only to the router port that owns the target IP address.
> > >
> > > IPs that are owned by the routers and for which this fix applies are:
> > > - IP addresses configured on the router ports.
> > > - VIPs.
> > > - NAT IPs.
> > >
> > > This commit also fixes function get_router_load_balancer_ips() which
> > > was incorrectly returning a single address_family even though the
> > > IP set could contain a mix of IPv4 and IPv6 addresses.
> > >
> > > Reported-at: https://bugzilla.redhat.com/1756945
> > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > >
> > > ---
> > > v6:
> > > - Address Han's comments:
> > >   - remove flooding of ARPs targeting OVN owned IP addresses.
> > >   - update ovn-architecture documentation.
> > >   - rename ARP handling functions.
> > >   - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take
into
> > >     account the new way of forwarding ARPs.
> > > - Also, properly deal with ARP packets on VLAN-backed networks.
> >
> > I am confused by this additional VLAN related change. VLAN is just
handled as bridged logical switch where a localnet port is used as
*inport*. It seems to me no difference from regular localnet port no matter
with or without VLAN tag. When an ARP request is going through the ingress
pipeline of the bridged logical switch, the logic of bypassing the flooding
added by this patch should just apply, right? It is also very common
scenario that the *aggregation switch* for the routers is an external
physical network backed by VLAN. I think the purpose of this patch is to
make sure scenario scale. Did I misunderstand anything here?
>
> Hi Han,
>
> The reason behind it was that with VLAN backed networks when packets
> move between hypervisors without going through geneve we rerun the
> ingress pipeline. That would mean that the flows I added for self
> originated (G)ARP packets won't be hit for ARP requests originated by
> ovn-controller on a remote hypervisor:
>
> priority=80 match: "inport=<patch-port-to-lr>, arp.op == 1" action:
> "outport=MC_FLOOD"
>
> But instead we would hit:
> priority=75 match: "arp.op == 1 && arp.tpa == <OVN-owned-IP" action:
> "outport=<patch-port-to-lr>" and never send flood the packet out on
> the second HV.
>

Thanks for the explain. Now I understand the problem.

> You're right, the way I added the check for all VLAN packets is not OK
> as we fall back to the old behavior too often. However, I'm not sure
> what the best option is. Do you think it's fine if I change the
> priority 80 flow to the following?
>
> priority=80 match: "eth.src={lrp.ea, lr.nat.external_macs} && arp.op
> == 1" action: "outport=MC_FLOOD"
>
> The idea would be to identify self originated ARP requests by matching
> the source mac instead of logical ingress port (as this might not be
> present). I tried it locally and it works fine but we do need to add
> more flows than before.
>

Would this work when "ovn-chassis-mac-mappings" is configured for VLAN
backed logical routers? We might end up adding chassis unique MACs, too?

Alternatively, I think we may change the priority 80 flow to match for each
OVN router owned IP: arp.tpa == IP && arp.spa == IP ... Would this ensure
OVN router generated ARPs are flooded?

>
> >
> > In addition, the below macro definition may be renamed to FLAGBIT_...,
because it is for the bits of MFF_LOG_FLAGS, which is one of the
pre-defined logical fields, instead of for the MFF_LOG_REG0-9 registers.
> > >
> > > +#define REGBIT_NOT_VXLAN "flags[1] == 0"
> > > +#define REGBIT_NOT_VLAN "flags[7] == 0"
> > > +
> >
> > The other part looks good to me. Thanks for simply the patch.
> >
> > Han
>
Dumitru Ceara Nov. 11, 2019, 7:03 p.m. UTC | #4
On Mon, Nov 11, 2019 at 6:11 PM Han Zhou <zhouhan@gmail.com> wrote:
>
> 1. Would there be problem for VLAN backed logical router use case, since a chassis MAC is used as src MAC to send packets from router ports?
> 2. How about checking if tpa == spa to make sure GARP is always flooded? (not directly supported by OF)

This would've been nice to have in OF :)

>
>
> On Mon, Nov 11, 2019 at 5:32 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > On Sat, Nov 9, 2019 at 8:35 AM Han Zhou <zhouhan@gmail.com> wrote:
> > >
> > >
> > >
> > > On Fri, Nov 8, 2019 at 6:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > > >
> > > > ARP request and ND NS packets for router owned IPs were being
> > > > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > > > However this creates a scaling issue in scenarios where aggregation
> > > > logical switches are connected to more logical routers (~350). The
> > > > logical pipelines of all routers would have to be executed before the
> > > > packet is finally replied to by a single router, the owner of the IP
> > > > address.
> > > >
> > > > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > > > for ARP requests that will be replied by a single router. The packets
> > > > are forwarded only to the router port that owns the target IP address.
> > > >
> > > > IPs that are owned by the routers and for which this fix applies are:
> > > > - IP addresses configured on the router ports.
> > > > - VIPs.
> > > > - NAT IPs.
> > > >
> > > > This commit also fixes function get_router_load_balancer_ips() which
> > > > was incorrectly returning a single address_family even though the
> > > > IP set could contain a mix of IPv4 and IPv6 addresses.
> > > >
> > > > Reported-at: https://bugzilla.redhat.com/1756945
> > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > > >
> > > > ---
> > > > v6:
> > > > - Address Han's comments:
> > > >   - remove flooding of ARPs targeting OVN owned IP addresses.
> > > >   - update ovn-architecture documentation.
> > > >   - rename ARP handling functions.
> > > >   - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
> > > >     account the new way of forwarding ARPs.
> > > > - Also, properly deal with ARP packets on VLAN-backed networks.
> > >
> > > I am confused by this additional VLAN related change. VLAN is just handled as bridged logical switch where a localnet port is used as *inport*. It seems to me no difference from regular localnet port no matter with or without VLAN tag. When an ARP request is going through the ingress pipeline of the bridged logical switch, the logic of bypassing the flooding added by this patch should just apply, right? It is also very common scenario that the *aggregation switch* for the routers is an external physical network backed by VLAN. I think the purpose of this patch is to make sure scenario scale. Did I misunderstand anything here?
> >
> > Hi Han,
> >
> > The reason behind it was that with VLAN backed networks when packets
> > move between hypervisors without going through geneve we rerun the
> > ingress pipeline. That would mean that the flows I added for self
> > originated (G)ARP packets won't be hit for ARP requests originated by
> > ovn-controller on a remote hypervisor:
> >
> > priority=80 match: "inport=<patch-port-to-lr>, arp.op == 1" action:
> > "outport=MC_FLOOD"
> >
> > But instead we would hit:
> > priority=75 match: "arp.op == 1 && arp.tpa == <OVN-owned-IP" action:
> > "outport=<patch-port-to-lr>" and never send flood the packet out on
> > the second HV.
> >
>
> Thanks for the explain. Now I understand the problem.
>
> > You're right, the way I added the check for all VLAN packets is not OK
> > as we fall back to the old behavior too often. However, I'm not sure
> > what the best option is. Do you think it's fine if I change the
> > priority 80 flow to the following?
> >
> > priority=80 match: "eth.src={lrp.ea, lr.nat.external_macs} && arp.op
> > == 1" action: "outport=MC_FLOOD"
> >
> > The idea would be to identify self originated ARP requests by matching
> > the source mac instead of logical ingress port (as this might not be
> > present). I tried it locally and it works fine but we do need to add
> > more flows than before.
> >
>
> Would this work when "ovn-chassis-mac-mappings" is configured for VLAN backed logical routers? We might end up adding chassis unique MACs, too?

I think it will work fine because when ovn-chassis-mac-mappings is
configured we add an entry in table OFTABLE_PHY_TO_LOG to match on
CHASSIS_MAC_TO_ROUTER_MAC_CONJID (comparing eth.src to all
ovn-chassis-macs) to rewrite the eth.src address with that of the
router port.

>
> Alternatively, I think we may change the priority 80 flow to match for each OVN router owned IP: arp.tpa == IP && arp.spa == IP ... Would this ensure OVN router generated ARPs are flooded?
>

I considered this too but because a router port can have an
"unlimited" number of networks configured I decided to go for MAC to
minimize the number of flows.

Also, if we match on eth.src we can create a single priority 80 logical flow:
priority=80 match: "eth.src={lrp.ea, lr.nat.external_macs} && arp.op
== 1" action: "outport=MC_FLOOD"

Whereas if we match on each of the IPs we need individual flows:
priority=80 match: "arp.tpa == IP && arp.spa == IP && arp.op == 1"
action: "outport=MC_FLOOD"

In the end they translate to the same number of OF entries but we
store less logical flows in the database.

Do you think there's anything else missing?

Thanks,
Dumitru

> >
> > >
> > > In addition, the below macro definition may be renamed to FLAGBIT_..., because it is for the bits of MFF_LOG_FLAGS, which is one of the pre-defined logical fields, instead of for the MFF_LOG_REG0-9 registers.
> > > >
> > > > +#define REGBIT_NOT_VXLAN "flags[1] == 0"
> > > > +#define REGBIT_NOT_VLAN "flags[7] == 0"
> > > > +
> > >
> > > The other part looks good to me. Thanks for simply the patch.
> > >
> > > Han
> >
Han Zhou Nov. 12, 2019, 12:56 a.m. UTC | #5
On Mon, Nov 11, 2019 at 11:03 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On Mon, Nov 11, 2019 at 6:11 PM Han Zhou <zhouhan@gmail.com> wrote:
> >
> > 1. Would there be problem for VLAN backed logical router use case,
since a chassis MAC is used as src MAC to send packets from router ports?
> > 2. How about checking if tpa == spa to make sure GARP is always
flooded? (not directly supported by OF)
>
> This would've been nice to have in OF :)
>
> >
> >
> > On Mon, Nov 11, 2019 at 5:32 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > >
> > > On Sat, Nov 9, 2019 at 8:35 AM Han Zhou <zhouhan@gmail.com> wrote:
> > > >
> > > >
> > > >
> > > > On Fri, Nov 8, 2019 at 6:38 AM Dumitru Ceara <dceara@redhat.com>
wrote:
> > > > >
> > > > > ARP request and ND NS packets for router owned IPs were being
> > > > > flooded in the complete L2 domain (using the MC_FLOOD multicast
group).
> > > > > However this creates a scaling issue in scenarios where
aggregation
> > > > > logical switches are connected to more logical routers (~350). The
> > > > > logical pipelines of all routers would have to be executed before
the
> > > > > packet is finally replied to by a single router, the owner of the
IP
> > > > > address.
> > > > >
> > > > > This commit limits the broadcast domain by bypassing the L2
Lookup stage
> > > > > for ARP requests that will be replied by a single router. The
packets
> > > > > are forwarded only to the router port that owns the target IP
address.
> > > > >
> > > > > IPs that are owned by the routers and for which this fix applies
are:
> > > > > - IP addresses configured on the router ports.
> > > > > - VIPs.
> > > > > - NAT IPs.
> > > > >
> > > > > This commit also fixes function get_router_load_balancer_ips()
which
> > > > > was incorrectly returning a single address_family even though the
> > > > > IP set could contain a mix of IPv4 and IPv6 addresses.
> > > > >
> > > > > Reported-at: https://bugzilla.redhat.com/1756945
> > > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > > > >
> > > > > ---
> > > > > v6:
> > > > > - Address Han's comments:
> > > > >   - remove flooding of ARPs targeting OVN owned IP addresses.
> > > > >   - update ovn-architecture documentation.
> > > > >   - rename ARP handling functions.
> > > > >   - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to
take into
> > > > >     account the new way of forwarding ARPs.
> > > > > - Also, properly deal with ARP packets on VLAN-backed networks.
> > > >
> > > > I am confused by this additional VLAN related change. VLAN is just
handled as bridged logical switch where a localnet port is used as
*inport*. It seems to me no difference from regular localnet port no matter
with or without VLAN tag. When an ARP request is going through the ingress
pipeline of the bridged logical switch, the logic of bypassing the flooding
added by this patch should just apply, right? It is also very common
scenario that the *aggregation switch* for the routers is an external
physical network backed by VLAN. I think the purpose of this patch is to
make sure scenario scale. Did I misunderstand anything here?
> > >
> > > Hi Han,
> > >
> > > The reason behind it was that with VLAN backed networks when packets
> > > move between hypervisors without going through geneve we rerun the
> > > ingress pipeline. That would mean that the flows I added for self
> > > originated (G)ARP packets won't be hit for ARP requests originated by
> > > ovn-controller on a remote hypervisor:
> > >
> > > priority=80 match: "inport=<patch-port-to-lr>, arp.op == 1" action:
> > > "outport=MC_FLOOD"
> > >
> > > But instead we would hit:
> > > priority=75 match: "arp.op == 1 && arp.tpa == <OVN-owned-IP" action:
> > > "outport=<patch-port-to-lr>" and never send flood the packet out on
> > > the second HV.
> > >
> >
> > Thanks for the explain. Now I understand the problem.
> >
> > > You're right, the way I added the check for all VLAN packets is not OK
> > > as we fall back to the old behavior too often. However, I'm not sure
> > > what the best option is. Do you think it's fine if I change the
> > > priority 80 flow to the following?
> > >
> > > priority=80 match: "eth.src={lrp.ea, lr.nat.external_macs} && arp.op
> > > == 1" action: "outport=MC_FLOOD"
> > >
> > > The idea would be to identify self originated ARP requests by matching
> > > the source mac instead of logical ingress port (as this might not be
> > > present). I tried it locally and it works fine but we do need to add
> > > more flows than before.
> > >
> >
> > Would this work when "ovn-chassis-mac-mappings" is configured for VLAN
backed logical routers? We might end up adding chassis unique MACs, too?
>
> I think it will work fine because when ovn-chassis-mac-mappings is
> configured we add an entry in table OFTABLE_PHY_TO_LOG to match on
> CHASSIS_MAC_TO_ROUTER_MAC_CONJID (comparing eth.src to all
> ovn-chassis-macs) to rewrite the eth.src address with that of the
> router port.
>
> >
> > Alternatively, I think we may change the priority 80 flow to match for
each OVN router owned IP: arp.tpa == IP && arp.spa == IP ... Would this
ensure OVN router generated ARPs are flooded?
> >
>
> I considered this too but because a router port can have an
> "unlimited" number of networks configured I decided to go for MAC to
> minimize the number of flows.
>
> Also, if we match on eth.src we can create a single priority 80 logical
flow:
> priority=80 match: "eth.src={lrp.ea, lr.nat.external_macs} && arp.op
> == 1" action: "outport=MC_FLOOD"
>
> Whereas if we match on each of the IPs we need individual flows:
> priority=80 match: "arp.tpa == IP && arp.spa == IP && arp.op == 1"
> action: "outport=MC_FLOOD"
>
> In the end they translate to the same number of OF entries but we
> store less logical flows in the database.
>
> Do you think there's anything else missing?

Ok. I agree that adding chassis unique MACs should be ok and it is likely
to result in less number of flows than using IP. Please let me know when
you send the new version.

>
> Thanks,
> Dumitru
>
> > >
> > > >
> > > > In addition, the below macro definition may be renamed to
FLAGBIT_..., because it is for the bits of MFF_LOG_FLAGS, which is one of
the pre-defined logical fields, instead of for the MFF_LOG_REG0-9 registers.
> > > > >
> > > > > +#define REGBIT_NOT_VXLAN "flags[1] == 0"
> > > > > +#define REGBIT_NOT_VLAN "flags[7] == 0"
> > > > > +
> > > >
> > > > The other part looks good to me. Thanks for simply the patch.
> > > >
> > > > Han
> > >
>
Dumitru Ceara Nov. 12, 2019, 10:33 a.m. UTC | #6
On Tue, Nov 12, 2019 at 1:56 AM Han Zhou <zhouhan@gmail.com> wrote:
>
>
>
> On Mon, Nov 11, 2019 at 11:03 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > On Mon, Nov 11, 2019 at 6:11 PM Han Zhou <zhouhan@gmail.com> wrote:
> > >
> > > 1. Would there be problem for VLAN backed logical router use case, since a chassis MAC is used as src MAC to send packets from router ports?
> > > 2. How about checking if tpa == spa to make sure GARP is always flooded? (not directly supported by OF)
> >
> > This would've been nice to have in OF :)
> >
> > >
> > >
> > > On Mon, Nov 11, 2019 at 5:32 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > > >
> > > > On Sat, Nov 9, 2019 at 8:35 AM Han Zhou <zhouhan@gmail.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Nov 8, 2019 at 6:38 AM Dumitru Ceara <dceara@redhat.com> wrote:
> > > > > >
> > > > > > ARP request and ND NS packets for router owned IPs were being
> > > > > > flooded in the complete L2 domain (using the MC_FLOOD multicast group).
> > > > > > However this creates a scaling issue in scenarios where aggregation
> > > > > > logical switches are connected to more logical routers (~350). The
> > > > > > logical pipelines of all routers would have to be executed before the
> > > > > > packet is finally replied to by a single router, the owner of the IP
> > > > > > address.
> > > > > >
> > > > > > This commit limits the broadcast domain by bypassing the L2 Lookup stage
> > > > > > for ARP requests that will be replied by a single router. The packets
> > > > > > are forwarded only to the router port that owns the target IP address.
> > > > > >
> > > > > > IPs that are owned by the routers and for which this fix applies are:
> > > > > > - IP addresses configured on the router ports.
> > > > > > - VIPs.
> > > > > > - NAT IPs.
> > > > > >
> > > > > > This commit also fixes function get_router_load_balancer_ips() which
> > > > > > was incorrectly returning a single address_family even though the
> > > > > > IP set could contain a mix of IPv4 and IPv6 addresses.
> > > > > >
> > > > > > Reported-at: https://bugzilla.redhat.com/1756945
> > > > > > Reported-by: Anil Venkata <vkommadi@redhat.com>
> > > > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > > > > >
> > > > > > ---
> > > > > > v6:
> > > > > > - Address Han's comments:
> > > > > >   - remove flooding of ARPs targeting OVN owned IP addresses.
> > > > > >   - update ovn-architecture documentation.
> > > > > >   - rename ARP handling functions.
> > > > > >   - Adapt "ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR" autotest to take into
> > > > > >     account the new way of forwarding ARPs.
> > > > > > - Also, properly deal with ARP packets on VLAN-backed networks.
> > > > >
> > > > > I am confused by this additional VLAN related change. VLAN is just handled as bridged logical switch where a localnet port is used as *inport*. It seems to me no difference from regular localnet port no matter with or without VLAN tag. When an ARP request is going through the ingress pipeline of the bridged logical switch, the logic of bypassing the flooding added by this patch should just apply, right? It is also very common scenario that the *aggregation switch* for the routers is an external physical network backed by VLAN. I think the purpose of this patch is to make sure scenario scale. Did I misunderstand anything here?
> > > >
> > > > Hi Han,
> > > >
> > > > The reason behind it was that with VLAN backed networks when packets
> > > > move between hypervisors without going through geneve we rerun the
> > > > ingress pipeline. That would mean that the flows I added for self
> > > > originated (G)ARP packets won't be hit for ARP requests originated by
> > > > ovn-controller on a remote hypervisor:
> > > >
> > > > priority=80 match: "inport=<patch-port-to-lr>, arp.op == 1" action:
> > > > "outport=MC_FLOOD"
> > > >
> > > > But instead we would hit:
> > > > priority=75 match: "arp.op == 1 && arp.tpa == <OVN-owned-IP" action:
> > > > "outport=<patch-port-to-lr>" and never send flood the packet out on
> > > > the second HV.
> > > >
> > >
> > > Thanks for the explain. Now I understand the problem.
> > >
> > > > You're right, the way I added the check for all VLAN packets is not OK
> > > > as we fall back to the old behavior too often. However, I'm not sure
> > > > what the best option is. Do you think it's fine if I change the
> > > > priority 80 flow to the following?
> > > >
> > > > priority=80 match: "eth.src={lrp.ea, lr.nat.external_macs} && arp.op
> > > > == 1" action: "outport=MC_FLOOD"
> > > >
> > > > The idea would be to identify self originated ARP requests by matching
> > > > the source mac instead of logical ingress port (as this might not be
> > > > present). I tried it locally and it works fine but we do need to add
> > > > more flows than before.
> > > >
> > >
> > > Would this work when "ovn-chassis-mac-mappings" is configured for VLAN backed logical routers? We might end up adding chassis unique MACs, too?
> >
> > I think it will work fine because when ovn-chassis-mac-mappings is
> > configured we add an entry in table OFTABLE_PHY_TO_LOG to match on
> > CHASSIS_MAC_TO_ROUTER_MAC_CONJID (comparing eth.src to all
> > ovn-chassis-macs) to rewrite the eth.src address with that of the
> > router port.
> >
> > >
> > > Alternatively, I think we may change the priority 80 flow to match for each OVN router owned IP: arp.tpa == IP && arp.spa == IP ... Would this ensure OVN router generated ARPs are flooded?
> > >
> >
> > I considered this too but because a router port can have an
> > "unlimited" number of networks configured I decided to go for MAC to
> > minimize the number of flows.
> >
> > Also, if we match on eth.src we can create a single priority 80 logical flow:
> > priority=80 match: "eth.src={lrp.ea, lr.nat.external_macs} && arp.op
> > == 1" action: "outport=MC_FLOOD"
> >
> > Whereas if we match on each of the IPs we need individual flows:
> > priority=80 match: "arp.tpa == IP && arp.spa == IP && arp.op == 1"
> > action: "outport=MC_FLOOD"
> >
> > In the end they translate to the same number of OF entries but we
> > store less logical flows in the database.
> >
> > Do you think there's anything else missing?
>
> Ok. I agree that adding chassis unique MACs should be ok and it is likely to result in less number of flows than using IP. Please let me know when you send the new version.

Hi Han,

I sent a v7 split in series:
- patch1: the get_router_load_balancer_ips() fix as it is independent
from the ARP/ND broadcast limiting
- patch2: the ARP/ND broadcast domain limiting

https://patchwork.ozlabs.org/project/openvswitch/list/?series=142268

Thanks,
Dumitru

>
> >
> > Thanks,
> > Dumitru
> >
> > > >
> > > > >
> > > > > In addition, the below macro definition may be renamed to FLAGBIT_..., because it is for the bits of MFF_LOG_FLAGS, which is one of the pre-defined logical fields, instead of for the MFF_LOG_REG0-9 registers.
> > > > > >
> > > > > > +#define REGBIT_NOT_VXLAN "flags[1] == 0"
> > > > > > +#define REGBIT_NOT_VLAN "flags[7] == 0"
> > > > > > +
> > > > >
> > > > > The other part looks good to me. Thanks for simply the patch.
> > > > >
> > > > > Han
> > > >
> >

Patch
diff mbox series

diff --git a/controller/physical.c b/controller/physical.c
index 500d419..751dbbf 100644
--- a/controller/physical.c
+++ b/controller/physical.c
@@ -567,6 +567,7 @@  put_replace_chassis_mac_flows(const struct simap *ct_zones,
 
         if (tag) {
             ofpact_put_STRIP_VLAN(ofpacts_p);
+            put_load(1, MFF_LOG_FLAGS, MLF_RCV_FROM_VLAN_BIT, 1, ofpacts_p);
         }
         load_logical_ingress_metadata(localnet_port, &zone_ids, ofpacts_p);
         replace_mac = ofpact_put_SET_ETH_SRC(ofpacts_p);
@@ -1124,6 +1125,7 @@  consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
                          ofpacts_p);
             }
             ofpact_put_STRIP_VLAN(ofpacts_p);
+            put_load(1, MFF_LOG_FLAGS, MLF_RCV_FROM_VLAN_BIT, 1, ofpacts_p);
         }
 
         /* Remember the size with just strip vlan added so far,
diff --git a/include/ovn/logical-fields.h b/include/ovn/logical-fields.h
index 9b7c34f..15e0d1e 100644
--- a/include/ovn/logical-fields.h
+++ b/include/ovn/logical-fields.h
@@ -57,6 +57,7 @@  enum mff_log_flags_bits {
     MLF_LOCAL_ONLY_BIT = 4,
     MLF_NESTED_CONTAINER_BIT = 5,
     MLF_LOOKUP_MAC_BIT = 6,
+    MLF_RCV_FROM_VLAN_BIT = 7,
 };
 
 /* MFF_LOG_FLAGS_REG flag assignments */
@@ -88,6 +89,9 @@  enum mff_log_flags {
 
     /* Indicate that the lookup in the mac binding table was successful. */
     MLF_LOOKUP_MAC = (1 << MLF_LOOKUP_MAC_BIT),
+
+    /* Indicate that a packet was received on a VLAN backed network. */
+    MLF_RCV_FROM_VLAN = (1 << MLF_RCV_FROM_VLAN_BIT),
 };
 
 /* OVN logical fields
diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index 0a33dcd..6fbb3ab 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -1005,6 +1005,22 @@  output;
       </li>
 
       <li>
+        Priority-80 flows for each port connected to a logical router
+        matching self originated GARP/ARP request/ND packets. These packets
+        are flooded to the <code>MC_FLOOD</code> which contains all logical
+        ports.
+      </li>
+
+      <li>
+        Priority-75 flows for each IP address/VIP/NAT address owned by a
+        router port connected to the switch. These flows match ARP requests
+        and ND packets for the specific IP addresses.  Matched packets are
+        forwarded in the L3 domain only to the router that owns the IP
+        address and flooded in the L2 domain on all ports except patch
+        ports connected to logical routers.
+      </li>
+
+      <li>
         A priority-70 flow that outputs all packets with an Ethernet broadcast
         or multicast <code>eth.dst</code> to the <code>MC_FLOOD</code>
         multicast group.
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 2f0f501..9de3d20 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -210,6 +210,9 @@  enum ovn_stage {
 #define REGBIT_LOOKUP_NEIGHBOR_RESULT "reg9[4]"
 #define REGBIT_SKIP_LOOKUP_NEIGHBOR "reg9[5]"
 
+#define REGBIT_NOT_VXLAN "flags[1] == 0"
+#define REGBIT_NOT_VLAN "flags[7] == 0"
+
 /* Returns an "enum ovn_stage" built from the arguments. */
 static enum ovn_stage
 ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
@@ -1202,6 +1205,34 @@  ovn_port_allocate_key(struct ovn_datapath *od)
                           1, (1u << 15) - 1, &od->port_key_hint);
 }
 
+/* Returns true if the logical switch port 'enabled' column is empty or
+ * set to true.  Otherwise, returns false. */
+static bool
+lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
+{
+    return !lsp->n_enabled || *lsp->enabled;
+}
+
+/* Returns true only if the logical switch port 'up' column is set to true.
+ * Otherwise, if the column is not set or set to false, returns false. */
+static bool
+lsp_is_up(const struct nbrec_logical_switch_port *lsp)
+{
+    return lsp->n_up && *lsp->up;
+}
+
+static bool
+lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
+{
+    return !strcmp(nbsp->type, "external");
+}
+
+static bool
+lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
+{
+    return !lrport->enabled || *lrport->enabled;
+}
+
 static char *
 chassis_redirect_name(const char *port_name)
 {
@@ -2184,7 +2215,7 @@  ip_address_and_port_from_lb_key(const char *key, char **ip_address,
 
 static void
 get_router_load_balancer_ips(const struct ovn_datapath *od,
-                             struct sset *all_ips, int *addr_family)
+                             struct sset *all_ips_v4, struct sset *all_ips_v6)
 {
     if (!od->nbr) {
         return;
@@ -2199,13 +2230,21 @@  get_router_load_balancer_ips(const struct ovn_datapath *od,
             /* node->key contains IP:port or just IP. */
             char *ip_address = NULL;
             uint16_t port;
+            int addr_family;
 
             ip_address_and_port_from_lb_key(node->key, &ip_address, &port,
-                                            addr_family);
+                                            &addr_family);
             if (!ip_address) {
                 continue;
             }
 
+            struct sset *all_ips;
+            if (addr_family == AF_INET) {
+                all_ips = all_ips_v4;
+            } else {
+                all_ips = all_ips_v6;
+            }
+
             if (!sset_contains(all_ips, ip_address)) {
                 sset_add(all_ips, ip_address);
             }
@@ -2299,17 +2338,22 @@  get_nat_addresses(const struct ovn_port *op, size_t *n)
         }
     }
 
-    /* A set to hold all load-balancer vips. */
-    struct sset all_ips = SSET_INITIALIZER(&all_ips);
-    int addr_family;
-    get_router_load_balancer_ips(op->od, &all_ips, &addr_family);
+    /* Two sets to hold all load-balancer vips. */
+    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
+    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
+    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
 
     const char *ip_address;
-    SSET_FOR_EACH (ip_address, &all_ips) {
+    SSET_FOR_EACH (ip_address, &all_ips_v4) {
         ds_put_format(&c_addresses, " %s", ip_address);
         central_ip_address = true;
     }
-    sset_destroy(&all_ips);
+    SSET_FOR_EACH (ip_address, &all_ips_v6) {
+        ds_put_format(&c_addresses, " %s", ip_address);
+        central_ip_address = true;
+    }
+    sset_destroy(&all_ips_v4);
+    sset_destroy(&all_ips_v6);
 
     if (central_ip_address) {
         /* Gratuitous ARP for centralized NAT rules on distributed gateway
@@ -3737,28 +3781,6 @@  build_port_security_ip(enum ovn_pipeline pipeline, struct ovn_port *op,
 
 }
 
-/* Returns true if the logical switch port 'enabled' column is empty or
- * set to true.  Otherwise, returns false. */
-static bool
-lsp_is_enabled(const struct nbrec_logical_switch_port *lsp)
-{
-    return !lsp->n_enabled || *lsp->enabled;
-}
-
-/* Returns true only if the logical switch port 'up' column is set to true.
- * Otherwise, if the column is not set or set to false, returns false. */
-static bool
-lsp_is_up(const struct nbrec_logical_switch_port *lsp)
-{
-    return lsp->n_up && *lsp->up;
-}
-
-static bool
-lsp_is_external(const struct nbrec_logical_switch_port *nbsp)
-{
-    return !strcmp(nbsp->type, "external");
-}
-
 static bool
 build_dhcpv4_action(struct ovn_port *op, ovs_be32 offer_ip,
                     struct ds *options_action, struct ds *response_action,
@@ -5161,6 +5183,150 @@  build_lrouter_groups(struct hmap *ports, struct ovs_list *lr_list)
     }
 }
 
+/*
+ * Ingress table 17: Flows that forward ARP/ND requests only to the routers
+ * that own the addresses. Packets are still flooded in the switching domain
+ * as regular broadcast.
+ */
+static void
+build_lswitch_rport_arp_req_flow_for_ip(const char *target_address,
+                                        int addr_family,
+                                        struct ovn_port *patch_op,
+                                        struct ovn_datapath *od,
+                                        uint32_t priority,
+                                        struct hmap *lflows)
+{
+    struct ds match   = DS_EMPTY_INITIALIZER;
+    struct ds actions = DS_EMPTY_INITIALIZER;
+
+    if (addr_family == AF_INET) {
+        ds_put_format(&match, "arp.tpa == %s && arp.op == 1", target_address);
+    } else {
+        ds_put_format(&match, "nd.target == %s && nd_ns", target_address);
+    }
+
+    /* Packets received from VXLAN tunnels have already been through the
+     * router pipeline so we should skip them. Normally this is done by the
+     * multicast_group implementation (VXLAN packets skip table 32 which
+     * delivers to patch ports) but we're bypassing multicast_groups.
+     *
+     * Packets received on VLAN backed networks were also already routed at
+     * source.
+     */
+    ds_put_format(&match, " && " REGBIT_NOT_VXLAN " && " REGBIT_NOT_VLAN);
+
+    /* Send a the packet only to the router pipeline and skip flooding it
+     * in the broadcast domain.
+     */
+    ds_put_format(&actions, "outport = %s; output;", patch_op->json_key);
+    ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, priority,
+                  ds_cstr(&match), ds_cstr(&actions));
+
+    ds_destroy(&match);
+    ds_destroy(&actions);
+}
+
+/*
+ * Ingress table 17: Flows that forward ARP/ND requests only to the routers
+ * that own the addresses.
+ * Priorities:
+ * - 80: self originated GARPs that need to follow regular processing.
+ * - 75: ARP requests to router owned IPs (interface IP/LB/NAT).
+ */
+static void
+build_lswitch_rport_arp_req_flows(struct ovn_port *op,
+                                  struct ovn_datapath *sw_od,
+                                  struct ovn_port *sw_op,
+                                  struct hmap *lflows)
+{
+    if (!op || !op->nbrp) {
+        return;
+    }
+
+    if (!lrport_is_enabled(op->nbrp)) {
+        return;
+    }
+
+    struct ds match = DS_EMPTY_INITIALIZER;
+
+    /* Self originated (G)ARP requests/ND need to be flooded as usual.
+     * Priority: 80.
+     */
+    ds_put_format(&match, "inport == %s && (arp.op == 1 || nd_ns)",
+                  sw_op->json_key);
+    ovn_lflow_add(lflows, sw_od, S_SWITCH_IN_L2_LKUP, 80,
+                  ds_cstr(&match),
+                  "outport = \""MC_FLOOD"\"; output;");
+
+    ds_destroy(&match);
+
+    /* Forward ARP requests for IPs configured on the router only to this
+     * router port.
+     * Priority: 75.
+     */
+    for (int i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
+        struct ipv4_netaddr *ip4 = &op->lrp_networks.ipv4_addrs[i];
+
+        build_lswitch_rport_arp_req_flow_for_ip(ip4->addr_s, AF_INET, sw_op,
+                                                sw_od, 75, lflows);
+    }
+    for (int i = 0; i < op->lrp_networks.n_ipv6_addrs; i++) {
+        struct ipv6_netaddr *ip6 = &op->lrp_networks.ipv6_addrs[i];
+
+        build_lswitch_rport_arp_req_flow_for_ip(ip6->addr_s, AF_INET6, sw_op,
+                                                sw_od, 75, lflows);
+    }
+
+    /* Forward ARP requests to load-balancer VIPs configured on the router
+     * only to this router port.
+     * Priority: 75.
+     */
+    struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
+    struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
+    const char *ip_address;
+
+    get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
+
+    SSET_FOR_EACH (ip_address, &all_ips_v4) {
+        build_lswitch_rport_arp_req_flow_for_ip(ip_address, AF_INET, sw_op,
+                                                sw_od, 75, lflows);
+    }
+    SSET_FOR_EACH (ip_address, &all_ips_v6) {
+        build_lswitch_rport_arp_req_flow_for_ip(ip_address, AF_INET6, sw_op,
+                                                sw_od, 75, lflows);
+    }
+    sset_destroy(&all_ips_v4);
+    sset_destroy(&all_ips_v6);
+
+    /* Forward ARP requests to NAT addresses configured on the router
+     * only to this router port.
+     * Priority: 75.
+     */
+    for (int i = 0; i < op->od->nbr->n_nat; i++) {
+        const struct nbrec_nat *nat = op->od->nbr->nat[i];
+
+        if (!strcmp(nat->type, "snat")) {
+            continue;
+        }
+
+        ovs_be32 ip;
+        ovs_be32 mask;
+        struct in6_addr ipv6;
+        struct in6_addr mask_v6;
+
+        if (ip_parse_masked(nat->external_ip, &ip, &mask)) {
+            if (!ipv6_parse_masked(nat->external_ip, &ipv6, &mask_v6)) {
+                build_lswitch_rport_arp_req_flow_for_ip(nat->external_ip,
+                                                        AF_INET6, sw_op,
+                                                        sw_od, 75, lflows);
+            }
+        } else {
+            build_lswitch_rport_arp_req_flow_for_ip(nat->external_ip, AF_INET,
+                                                    sw_op, sw_od, 75, lflows);
+        }
+    }
+}
+
 static void
 build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
                     struct hmap *port_groups, struct hmap *lflows,
@@ -5748,6 +5914,14 @@  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
             continue;
         }
 
+        /* For ports connected to logical routers add flows to bypass the
+         * broadcast flooding of ARP/ND requests in table 17. We direct the
+         * requests only to the router port that owns the IP address.
+         */
+        if (!strcmp(op->nbsp->type, "router")) {
+            build_lswitch_rport_arp_req_flows(op->peer, op->od, op, lflows);
+        }
+
         for (size_t i = 0; i < op->nbsp->n_addresses; i++) {
             /* Addresses are owned by the logical port.
              * Ethernet address followed by zero or more IPv4
@@ -5879,12 +6053,6 @@  build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
     ds_destroy(&actions);
 }
 
-static bool
-lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
-{
-    return !lrport->enabled || *lrport->enabled;
-}
-
 /* Returns a string of the IP address of the router port 'op' that
  * overlaps with 'ip_s".  If one is not found, returns NULL.
  *
@@ -6911,61 +7079,66 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
         }
 
         /* A set to hold all load-balancer vips that need ARP responses. */
-        struct sset all_ips = SSET_INITIALIZER(&all_ips);
-        int addr_family;
-        get_router_load_balancer_ips(op->od, &all_ips, &addr_family);
+        struct sset all_ips_v4 = SSET_INITIALIZER(&all_ips_v4);
+        struct sset all_ips_v6 = SSET_INITIALIZER(&all_ips_v6);
+        get_router_load_balancer_ips(op->od, &all_ips_v4, &all_ips_v6);
 
         const char *ip_address;
-        SSET_FOR_EACH(ip_address, &all_ips) {
+        SSET_FOR_EACH (ip_address, &all_ips_v4) {
             ds_clear(&match);
-            if (addr_family == AF_INET) {
-                ds_put_format(&match,
-                              "inport == %s && arp.tpa == %s && arp.op == 1",
-                              op->json_key, ip_address);
-            } else {
-                ds_put_format(&match,
-                              "inport == %s && nd_ns && nd.target == %s",
-                              op->json_key, ip_address);
-            }
+            ds_put_format(&match,
+                          "inport == %s && arp.tpa == %s && arp.op == 1",
+                          op->json_key, ip_address);
 
             ds_clear(&actions);
-            if (addr_family == AF_INET) {
-                ds_put_format(&actions,
-                "eth.dst = eth.src; "
-                "eth.src = %s; "
-                "arp.op = 2; /* ARP reply */ "
-                "arp.tha = arp.sha; "
-                "arp.sha = %s; "
-                "arp.tpa = arp.spa; "
-                "arp.spa = %s; "
-                "outport = %s; "
-                "flags.loopback = 1; "
-                "output;",
-                op->lrp_networks.ea_s,
-                op->lrp_networks.ea_s,
-                ip_address,
-                op->json_key);
-            } else {
-                ds_put_format(&actions,
-                "nd_na { "
-                "eth.src = %s; "
-                "ip6.src = %s; "
-                "nd.target = %s; "
-                "nd.tll = %s; "
-                "outport = inport; "
-                "flags.loopback = 1; "
-                "output; "
-                "};",
-                op->lrp_networks.ea_s,
-                ip_address,
-                ip_address,
-                op->lrp_networks.ea_s);
-            }
+            ds_put_format(&actions,
+                          "eth.dst = eth.src; "
+                          "eth.src = %s; "
+                          "arp.op = 2; /* ARP reply */ "
+                          "arp.tha = arp.sha; "
+                          "arp.sha = %s; "
+                          "arp.tpa = arp.spa; "
+                          "arp.spa = %s; "
+                          "outport = %s; "
+                          "flags.loopback = 1; "
+                          "output;",
+                          op->lrp_networks.ea_s,
+                          op->lrp_networks.ea_s,
+                          ip_address,
+                          op->json_key);
+
             ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 90,
                           ds_cstr(&match), ds_cstr(&actions));
         }
 
-        sset_destroy(&all_ips);
+        SSET_FOR_EACH (ip_address, &all_ips_v6) {
+            ds_clear(&match);
+            ds_put_format(&match,
+                          "inport == %s && nd_ns && nd.target == %s",
+                          op->json_key, ip_address);
+
+            ds_clear(&actions);
+            ds_put_format(&actions,
+                          "nd_na { "
+                          "eth.src = %s; "
+                          "ip6.src = %s; "
+                          "nd.target = %s; "
+                          "nd.tll = %s; "
+                          "outport = inport; "
+                          "flags.loopback = 1; "
+                          "output; "
+                          "};",
+                          op->lrp_networks.ea_s,
+                          ip_address,
+                          ip_address,
+                          op->lrp_networks.ea_s);
+
+            ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 90,
+                          ds_cstr(&match), ds_cstr(&actions));
+        }
+
+        sset_destroy(&all_ips_v4);
+        sset_destroy(&all_ips_v6);
 
         /* A gateway router can have 2 SNAT IP addresses to force DNATed and
          * LBed traffic respectively to be SNATed.  In addition, there can be
diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
index 7966b65..c43f16d 100644
--- a/ovn-architecture.7.xml
+++ b/ovn-architecture.7.xml
@@ -1390,6 +1390,25 @@ 
     http://docs.openvswitch.org/en/latest/topics/high-availability.
   </p>
 
+  <h3>ARP request and ND NS packet processing</h3>
+
+  <p>
+    Due to the fact that ARP requests and ND NA packets are usually broadcast
+    packets, for performance reasons, OVN deals with requests that target OVN
+    owned IP addresses (i.e., IP addresses configured on the router ports,
+    VIPs, NAT IPs) in a specific way and only forwards them to the logical
+    router that owns the target IP address. This behavior is different than
+    that of traditional swithces and implies that other routers/hosts
+    connected to the logical switch will not learn the MAC/IP binding from
+    the request packet.
+  </p>
+
+  <p>
+    All other ARP and ND packets are flooded in the L2 broadcast domain and
+    to all attached logical patch ports.
+  </p>
+
+
   <h2>Multiple localnet logical switches connected to a Logical Router</h2>
 
   <p>
diff --git a/tests/ovn.at b/tests/ovn.at
index cb7903d..38d02d7 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -2877,7 +2877,7 @@  test_ip() {
     done
 }
 
-# test_arp INPORT SHA SPA TPA [REPLY_HA]
+# test_arp INPORT SHA SPA TPA FLOOD [REPLY_HA]
 #
 # Causes a packet to be received on INPORT.  The packet is an ARP
 # request with SHA, SPA, and TPA as specified.  If REPLY_HA is provided, then
@@ -2888,21 +2888,25 @@  test_ip() {
 # SHA and REPLY_HA are each 12 hex digits.
 # SPA and TPA are each 8 hex digits.
 test_arp() {
-    local inport=$1 sha=$2 spa=$3 tpa=$4 reply_ha=$5
+    local inport=$1 sha=$2 spa=$3 tpa=$4 flood=$5 reply_ha=$6
     local request=ffffffffffff${sha}08060001080006040001${sha}${spa}ffffffffffff${tpa}
     hv=hv`vif_to_hv $inport`
     as $hv ovs-appctl netdev-dummy/receive vif$inport $request
     as $hv ovs-appctl ofproto/trace br-int in_port=$inport $request
 
     # Expect to receive the broadcast ARP on the other logical switch ports if
-    # IP address is not configured to the switch patch port.
+    # IP address is not configured on the switch patch port or on the router
+    # port (i.e, $flood == 1).
     local i=`vif_to_ls $inport`
     local j k
     for j in 1 2 3; do
         for k in 1 2 3; do
-            # 192.168.33.254 is configured to the switch patch port for lrp33,
-            # so no ARP flooding expected for it.
-            if test $i$j$k != $inport && test $tpa != `ip_to_hex 192 168 33 254`; then
+            # Skip ingress port.
+            if test $i$j$k == $inport; then
+                continue
+            fi
+
+            if test X$flood == X1; then
                 echo $request >> $i$j$k.expected
             fi
         done
@@ -3039,9 +3043,9 @@  for i in 1 2 3; do
       otherip=`ip_to_hex 192 168 $i$j 55` # Some other IP in subnet
       externalip=`ip_to_hex 1 2 3 4`      # Some other IP not in subnet
 
-      test_arp $i$j$k $smac $sip        $rip        $rmac      #4
-      test_arp $i$j$k $smac $otherip    $rip        $rmac      #5
-      test_arp $i$j$k $smac $sip        $otherip               #6
+      test_arp $i$j$k $smac $sip        $rip       0     $rmac       #4
+      test_arp $i$j$k $smac $otherip    $rip       0     $rmac       #5
+      test_arp $i$j$k $smac $sip        $otherip   1                 #6
 
       # When rip is 192.168.33.254, ARP request from externalip won't be
       # filtered, because 192.168.33.254 is configured to switch peer port
@@ -3050,7 +3054,7 @@  for i in 1 2 3; do
       if test $i = 3 && test $j = 3; then
         lrp33_rsp=$rmac
       fi
-      test_arp $i$j$k $smac $externalip $rip        $lrp33_rsp #7
+      test_arp $i$j$k $smac $externalip $rip       0      $lrp33_rsp #7
 
       # MAC binding should be learned from ARP request.
       host_mac_pretty=f0:00:00:00:0$i:$j$k
@@ -9595,7 +9599,7 @@  ovn-nbctl --wait=hv --timeout=3 sync
 # Check that there is a logical flow in logical switch foo's pipeline
 # to set the outport to rp-foo (which is expected).
 OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
-grep rp-foo | grep -v is_chassis_resident | wc -l`])
+grep rp-foo | grep -v is_chassis_resident | grep priority=50 -c`])
 
 # Set the option 'reside-on-redirect-chassis' for foo
 ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
@@ -9603,7 +9607,7 @@  ovn-nbctl set logical_router_port foo options:reside-on-redirect-chassis=true
 # to set the outport to rp-foo with the condition is_chassis_redirect.
 ovn-sbctl dump-flows foo
 OVS_WAIT_UNTIL([test 1 = `ovn-sbctl dump-flows foo | grep ls_in_l2_lkup | \
-grep rp-foo | grep is_chassis_resident | wc -l`])
+grep rp-foo | grep is_chassis_resident | grep priority=50 -c`])
 
 echo "---------NB dump-----"
 ovn-nbctl show
@@ -16694,3 +16698,282 @@  as hv4 ovs-appctl fdb/show br-phys
 OVN_CLEANUP([hv1],[hv2],[hv3],[hv4])
 
 AT_CLEANUP
+
+AT_SETUP([ovn -- ARP/ND request broadcast limiting])
+AT_SKIP_IF([test $HAVE_PYTHON = no])
+ovn_start
+
+ip_to_hex() {
+    printf "%02x%02x%02x%02x" "$@"
+}
+
+send_arp_request() {
+    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5
+    local eth_dst=ffffffffffff
+    local eth_type=0806
+    local eth=${eth_dst}${eth_src}${eth_type}
+
+    local arp=0001080006040001${eth_src}${spa}${eth_dst}${tpa}
+
+    local request=${eth}${arp}
+    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
+}
+
+send_nd_ns() {
+    local hv=$1 inport=$2 eth_src=$3 spa=$4 tpa=$5 cksum=$6
+
+    local eth_dst=ffffffffffff
+    local eth_type=86dd
+    local eth=${eth_dst}${eth_src}${eth_type}
+
+    local ip_vhlen=60000000
+    local ip_plen=0020
+    local ip_next=3a
+    local ip_ttl=ff
+    local ip=${ip_vhlen}${ip_plen}${ip_next}${ip_ttl}${spa}${tpa}
+
+    # Neighbor Solicitation
+    local icmp6_type=87
+    local icmp6_code=00
+    local icmp6_rsvd=00000000
+    # ICMPv6 source lla option
+    local icmp6_opt=01
+    local icmp6_optlen=01
+    local icmp6=${icmp6_type}${icmp6_code}${cksum}${icmp6_rsvd}${tpa}${icmp6_opt}${icmp6_optlen}${eth_src}
+
+    local request=${eth}${ip}${icmp6}
+
+    as hv$hv ovs-appctl netdev-dummy/receive hv${hv}-vif$inport $request
+}
+
+src_mac=000000000001
+
+net_add n1
+sim_add hv1
+as hv1
+ovs-vsctl add-br br-phys
+ovn_attach n1 br-phys 192.168.0.1
+
+ovs-vsctl -- add-port br-int hv1-vif1 -- \
+    set interface hv1-vif1 external-ids:iface-id=sw-agg-ext \
+    options:tx_pcap=hv1/vif1-tx.pcap \
+    options:rxq_pcap=hv1/vif1-rx.pcap \
+    ofport-request=1
+
+# One Aggregation Switch connected to two Logical networks (routers).
+ovn-nbctl ls-add sw-agg
+ovn-nbctl lsp-add sw-agg sw-agg-ext \
+    -- lsp-set-addresses sw-agg-ext 00:00:00:00:00:01
+
+ovn-nbctl lsp-add sw-agg sw-rtr1                   \
+    -- lsp-set-type sw-rtr1 router                 \
+    -- lsp-set-addresses sw-rtr1 00:00:00:00:01:00 \
+    -- lsp-set-options sw-rtr1 router-port=rtr1-sw
+ovn-nbctl lsp-add sw-agg sw-rtr2                   \
+    -- lsp-set-type sw-rtr2 router                 \
+    -- lsp-set-addresses sw-rtr2 00:00:00:00:02:00 \
+    -- lsp-set-options sw-rtr2 router-port=rtr2-sw
+
+# Configure L3 interface IPv4 & IPv6 on both routers
+ovn-nbctl lr-add rtr1
+ovn-nbctl lrp-add rtr1 rtr1-sw 00:00:00:00:01:00 10.0.0.1/24 10::1/64
+
+ovn-nbctl lr-add rtr2
+ovn-nbctl lrp-add rtr2 rtr2-sw 00:00:00:00:02:00 10.0.0.2/24 10::2/64
+
+OVN_POPULATE_ARP
+ovn-nbctl --wait=hv sync
+
+sw_dp_uuid=$(ovn-sbctl --bare --columns _uuid list datapath_binding sw-agg)
+sw_dp_key=$(ovn-sbctl --bare --columns tunnel_key list datapath_binding sw-agg)
+
+r1_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr1)
+r2_tnl_key=$(ovn-sbctl --bare --columns tunnel_key list port_binding sw-rtr2)
+
+mc_key=$(ovn-sbctl --bare --columns tunnel_key find multicast_group datapath=${sw_dp_uuid} name="_MC_flood")
+mc_key=$(printf "%04x" $mc_key)
+
+match_sw_metadata="metadata=0x${sw_dp_key}"
+
+# Inject ARP request for first router owned IP address.
+send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 1)
+
+# Verify that the ARP request is sent only to rtr1.
+match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.1,arp_op=1"
+match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
+match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Inject ND_NS for ofirst router owned IP address.
+src_ipv6=00100000000000000000000000000254
+dst_ipv6=00100000000000000000000000000001
+send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
+
+# Verify that the ND_NS is sent only to rtr1.
+match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::1"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Configure load balancing on both routers.
+ovn-nbctl lb-add lb1-v4 10.0.0.11 42.42.42.1
+ovn-nbctl lb-add lb1-v6 10::11 42::1
+ovn-nbctl lr-lb-add rtr1 lb1-v4
+ovn-nbctl lr-lb-add rtr1 lb1-v6
+
+ovn-nbctl lb-add lb2-v4 10.0.0.22 42.42.42.2
+ovn-nbctl lb-add lb2-v6 10::22 42::2
+ovn-nbctl lr-lb-add rtr2 lb2-v4
+ovn-nbctl lr-lb-add rtr2 lb2-v6
+ovn-nbctl --wait=hv sync
+
+# Inject ARP request for first router owned VIP address.
+send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 11)
+
+# Verify that the ARP request is sent only to rtr1.
+match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.11,arp_op=1"
+match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
+match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Inject ND_NS for first router owned VIP address.
+src_ipv6=00100000000000000000000000000254
+dst_ipv6=00100000000000000000000000000011
+send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
+
+# Verify that the ND_NS is sent only to rtr1.
+match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::11"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Configure NAT on both routers
+ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10.0.0.111 42.42.42.1
+ovn-nbctl lr-nat-add rtr1 dnat_and_snat 10::111 42::1
+ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10.0.0.222 42.42.42.2
+ovn-nbctl lr-nat-add rtr2 dnat_and_snat 10::222 42::2
+
+# Inject ARP request for first router owned NAT address.
+send_arp_request 1 1 ${src_mac} $(ip_to_hex 10 0 0 254) $(ip_to_hex 10 0 0 111)
+
+# Verify that the ARP request is sent only to rtr1.
+match_arp_req="priority=75.*${match_sw_metadata}.*arp_tpa=10.0.0.111,arp_op=1"
+match_send_rtr1="load:0x${r1_tnl_key}->NXM_NX_REG15"
+match_send_rtr2="load:0x${r2_tnl_key}->NXM_NX_REG15"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_arp_req}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+# Inject ND_NS for first router owned IP address.
+src_ipv6=00100000000000000000000000000254
+dst_ipv6=00100000000000000000000000000111
+send_nd_ns 1 1 ${src_mac} ${src_ipv6} ${dst_ipv6} 751d
+
+# Verify that the ND_NS is sent only to rtr1.
+match_nd_ns="priority=75.*${match_sw_metadata}.*icmp_type=135.*nd_target=10::111"
+
+as hv1
+OVS_WAIT_UNTIL([
+    pkts_to_rtr1=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr1}" | \
+    grep n_packets=1 -c)
+    test "1" = "${pkts_to_rtr1}"
+])
+OVS_WAIT_UNTIL([
+    pkts_to_rtr2=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_nd_ns}" | grep "${match_send_rtr2}" | \
+    grep n_packets=1 -c)
+    test "0" = "${pkts_to_rtr2}"
+])
+OVS_WAIT_UNTIL([
+    pkts_flooded=$(ovs-ofctl dump-flows br-int | \
+    grep -E "${match_sw_metadata}" | grep ${mc_key} | grep -v n_packets=0 -c)
+    test "0" = "${pkts_flooded}"
+])
+
+OVN_CLEANUP([hv1])
+AT_CLEANUP