diff mbox

[ovs-dev] ovn-northd: Support connecting multiple routers to a switch.

Message ID 1462551705-15458-1-git-send-email-guru@ovn.org
State Superseded
Headers show

Commit Message

Gurucharan Shetty May 6, 2016, 4:21 p.m. UTC
Currently we can connect routers via "peer"ing. This limits
the number of routers that can be connected with each other
directly to 2.

One of the design goals for L3 Gateway is to be able to
have multiple gateways (each with their own router)
connected to a distributed router via a switch.

With the above goal in mind, this commit gives the general
ability to connect multiple routers via a switch.

Signed-off-by: Gurucharan Shetty <guru@ovn.org>
---
This needs the following 2 commits under review to
first go in.
1. ofproto-dpif: Rename "recurse" to "indentation".
2. ofproto-dpif: Do not count resubmit to later tables against limit.
---
 ovn/controller/lflow.c  |   19 ++--
 ovn/northd/ovn-northd.c |   56 ++++++++++-
 ovn/ovn-nb.xml          |    7 --
 tests/ovn.at            |  236 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 299 insertions(+), 19 deletions(-)

Comments

Darrell Ball May 9, 2016, 9:51 p.m. UTC | #1
I have some initial clarifications first before review

On Fri, May 6, 2016 at 9:21 AM, Gurucharan Shetty <guru@ovn.org> wrote:

> Currently we can connect routers via "peer"ing. This limits
> the number of routers that can be connected with each other
> directly to 2.
>


This sounds like you are saying that cannot have a topology like

R1-------R2
 \           /
  \        /
   \      /
    \   /
     R3

which is common.
Did I read the above comment correctly ?



>
> One of the design goals for L3 Gateway is to be able to
> have multiple gateways (each with their own router)
> connected to a distributed router via a switch.
>

Its not ideal to have a switch required to connect routers - it somewhat
defeats the advantages
of having routers in the first place.
This is usually only done if there is existing switching equipment than
must be traversed
But in the case, we are just dealing with software where we have total
flexibility.



>
> With the above goal in mind, this commit gives the general
> ability to connect multiple routers via a switch.
>
> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
> ---
> This needs the following 2 commits under review to
> first go in.
> 1. ofproto-dpif: Rename "recurse" to "indentation".
> 2. ofproto-dpif: Do not count resubmit to later tables against limit.
> ---
>  ovn/controller/lflow.c  |   19 ++--
>  ovn/northd/ovn-northd.c |   56 ++++++++++-
>  ovn/ovn-nb.xml          |    7 --
>  tests/ovn.at            |  236
> +++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 299 insertions(+), 19 deletions(-)
>
> diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
> index 96b7c66..efc427d 100644
> --- a/ovn/controller/lflow.c
> +++ b/ovn/controller/lflow.c
> @@ -215,15 +215,15 @@ add_logical_flows(struct controller_ctx *ctx, const
> struct lport_index *lports,
>          if (is_switch(ldp)) {
>              /* For a logical switch datapath, local_datapaths tells us if
> there
>               * are any local ports for this datapath.  If not, we can skip
> -             * processing logical flows if the flow belongs to egress
> pipeline
> -             * or if that logical switch datapath is not patched to any
> logical
> -             * router.
> +             * processing logical flows if that logical switch datapath
> is not
> +             * patched to any logical router.
>               *
> -             * Otherwise, we still need the ingress pipeline because even
> if
> -             * there are no local ports, we still may need to execute the
> ingress
> -             * pipeline after a packet leaves a logical router.  Further
> -             * optimization is possible, but not based on what we know
> with
> -             * local_datapaths right now.
> +             * Otherwise, we still need both ingress and egress pipeline
> +             * because even if there are no local ports, we still may
> need to
> +             * execute the ingress pipeline after a packet leaves a
> logical
> +             * router and we need to do egress pipeline for a switch that
> +             * is connected to only routers.  Further optimization is
> possible,
> +             * but not based on what we know with local_datapaths right
> now.
>               *
>               * A better approach would be a kind of "flood fill"
> algorithm:
>               *
> @@ -242,9 +242,6 @@ add_logical_flows(struct controller_ctx *ctx, const
> struct lport_index *lports,
>               */
>
>              if (!get_local_datapath(local_datapaths, ldp->tunnel_key)) {
> -                if (!ingress) {
> -                    continue;
> -                }
>

This is removing a previous change that was done for some optimization for
avoiding
unnecessary flow creation.
I am not sure about the process here, but maybe this should be tracked as a
separate
patch ?




>                  if (!get_patched_datapath(patched_datapaths,
>                                            ldp->tunnel_key)) {
>                      continue;
> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
> index 9e03606..3a29770 100644
> --- a/ovn/northd/ovn-northd.c
> +++ b/ovn/northd/ovn-northd.c
> @@ -2110,7 +2110,13 @@ build_lrouter_flows(struct hmap *datapaths, struct
> hmap *ports,
>                  free(actions);
>                  free(match);
>              }
> -        } else if (op->od->n_router_ports) {
> +        } else if (op->od->n_router_ports && strcmp(op->nbs->type,
> "router")) {
> +            /* This is a logical switch port that backs a VM or a
> container.
> +             * Extract its addresses. For each of the address, go through
> all
> +             * the router ports attached to the switch (to which this port
> +             * connects) and if the address in question is reachable from
> the
> +             * router port, add an ARP entry in that router's pipeline. */
> +
>              for (size_t i = 0; i < op->nbs->n_addresses; i++) {
>                  struct lport_addresses laddrs;
>                  if (!extract_lport_addresses(op->nbs->addresses[i],
> &laddrs,
> @@ -2158,8 +2164,56 @@ build_lrouter_flows(struct hmap *datapaths, struct
> hmap *ports,
>
>                  free(laddrs.ipv4_addrs);
>              }
> +        } else if (!strcmp(op->nbs->type, "router")) {
> +            /* This is a logical switch port that connects to a router. */
> +
> +            /* The peer of this switch port is the router port for which
> +             * we need to add logical flows such that it can resolve
> +             * ARP entries for all the other router ports connected to
> +             * the switch in question. */
> +
> +            const char *peer_name = smap_get(&op->nbs->options,
> +                                             "router-port");
> +            if (!peer_name) {
> +                continue;
> +            }
> +
> +            struct ovn_port *peer = ovn_port_find(ports, peer_name);
> +            if (!peer || !peer->nbr || !peer->ip) {
> +                continue;
> +            }
> +
> +            for (size_t j = 0; j < op->od->n_router_ports; j++) {
> +                const char *router_port_name = smap_get(
> +
> &op->od->router_ports[j]->nbs->options,
> +                                    "router-port");
> +                struct ovn_port *router_port = ovn_port_find(ports,
> +
>  router_port_name);
> +                if (!router_port || !router_port->nbr ||
> !router_port->ip) {
> +                    continue;
> +                }
> +
> +                /* Skip the router port under consideration. */
> +                if (router_port == peer) {
> +                   continue;
> +                }
> +
> +                if (!router_port->ip) {
> +                    continue;
> +                }
> +                char *match = xasprintf("outport == %s && reg0 == "IP_FMT,
> +                                        peer->json_key,
> +                                        IP_ARGS(router_port->ip));
> +                char *actions = xasprintf("eth.dst = "ETH_ADDR_FMT";
> next;",
> +
> ETH_ADDR_ARGS(router_port->mac));
> +                ovn_lflow_add(lflows, peer->od, S_ROUTER_IN_ARP_RESOLVE,
> +                              101, match, actions);
> +                free(actions);
> +                free(match);
> +            }
>          }
>      }
> +
>      HMAP_FOR_EACH (od, key_node, datapaths) {
>          if (!od->nbr) {
>              continue;
> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
> index c01455d..d7fd595 100644
> --- a/ovn/ovn-nb.xml
> +++ b/ovn/ovn-nb.xml
> @@ -154,13 +154,6 @@
>            These options apply when <ref column="type"/> is
> <code>router</code>.
>          </p>
>
> -        <p>
> -          If a given logical switch has multiple <code>router</code>
> ports, the
> -          <ref table="Logical_Router_Port"/> rows that they reference
> must be
> -          all on the same <ref table="Logical_Router"/> (for different
> -          subnets).
> -        </p>
> -
>          <column name="options" key="router-port">
>            Required.  The <ref column="name"/> of the <ref
>            table="Logical_Router_Port"/> to which this logical switch port
> is
> diff --git a/tests/ovn.at b/tests/ovn.at
> index b9d7ada..6961be0 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -2545,3 +2545,239 @@ OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>
>  AT_CLEANUP
>
> +AT_SETUP([ovn -- 2 HVs, 3 LRs connected via LS, static routes])
> +AT_KEYWORDS([ovnstaticroutes])
> +AT_SKIP_IF([test $HAVE_PYTHON = no])
> +ovn_start
> +
> +# Logical network:
> +# Three LRs - R1, R2 and R3 that are connected to each other via LS "join"
> +# in 20.0.0.0/24 network. R1 has switchess foo (192.168.1.0/24)
> +# connected to it. R2 has alice (172.16.1.0/24) and R3 has bob (
> 10.32.1.0/24)
> +# connected to it.
> +
> +ovn-nbctl create Logical_Router name=R1
> +ovn-nbctl create Logical_Router name=R2
> +ovn-nbctl create Logical_Router name=R3
> +
> +ovn-nbctl lswitch-add foo
> +ovn-nbctl lswitch-add alice
> +ovn-nbctl lswitch-add bob
> +ovn-nbctl lswitch-add join
> +
> +# Connect foo to R1
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=foo \
> +network=192.168.1.1/24 mac=\"00:00:01:01:02:03\" -- add Logical_Router
> R1 \
> +ports @lrp -- lport-add foo rp-foo
> +
> +ovn-nbctl set Logical_port rp-foo type=router options:router-port=foo \
> +addresses=\"00:00:01:01:02:03\"
> +
> +# Connect alice to R2
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=alice \
> +network=172.16.1.1/24 mac=\"00:00:02:01:02:03\" -- add Logical_Router R2
> \
> +ports @lrp -- lport-add alice rp-alice
> +
> +ovn-nbctl set Logical_port rp-alice type=router options:router-port=alice
> \
> +addresses=\"00:00:02:01:02:03\"
> +
> +# Connect bob to R3
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=bob \
> +network=10.32.1.1/24 mac=\"00:00:03:01:02:03\" -- add Logical_Router R3 \
> +ports @lrp -- lport-add bob rp-bob
> +
> +ovn-nbctl set Logical_port rp-bob type=router options:router-port=bob \
> +addresses=\"00:00:03:01:02:03\"
> +
> +# Connect R1 to join
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R1_join \
> +network=20.0.0.1/24 mac=\"00:00:04:01:02:03\" -- add Logical_Router R1 \
> +ports @lrp -- lport-add join r1-join
> +
> +ovn-nbctl set Logical_port r1-join type=router
> options:router-port=R1_join \
> +addresses='"00:00:04:01:02:03"'
> +
> +# Connect R2 to join
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R2_join \
> +network=20.0.0.2/24 mac=\"00:00:04:01:02:04\" -- add Logical_Router R2 \
> +ports @lrp -- lport-add join r2-join
> +
> +ovn-nbctl set Logical_port r2-join type=router
> options:router-port=R2_join \
> +addresses='"00:00:04:01:02:04"'
> +
> +
> +# Connect R3 to join
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R3_join \
> +network=20.0.0.3/24 mac=\"00:00:04:01:02:05\" -- add Logical_Router R3 \
> +ports @lrp -- lport-add join r3-join
> +
> +ovn-nbctl set Logical_port r3-join type=router
> options:router-port=R3_join \
> +addresses='"00:00:04:01:02:05"'
> +
> +#install static routes
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
> +R1 static_routes @lrt
> +
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
> +R1 static_routes @lrt
> +
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
> +R2 static_routes @lrt
> +
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
> +R2 static_routes @lrt
> +
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
> +R3 static_routes @lrt
> +
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
> +R3 static_routes @lrt
> +
> +# Create logical port foo1 in foo
> +ovn-nbctl lport-add foo foo1 \
> +-- lport-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
> +
> +# Create logical port alice1 in alice
> +ovn-nbctl lport-add alice alice1 \
> +-- lport-set-addresses alice1 "f0:00:00:01:02:04 172.16.1.2"
> +
> +# Create logical port bob1 in bob
> +ovn-nbctl lport-add bob bob1 \
> +-- lport-set-addresses bob1 "f0:00:00:01:02:05 10.32.1.2"
> +
> +# Create two hypervisor and create OVS ports corresponding to logical
> ports.
> +net_add n1
> +
> +sim_add hv1
> +as hv1
> +ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.1
> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> +    set interface hv1-vif1 external-ids:iface-id=foo1 \
> +    options:tx_pcap=hv1/vif1-tx.pcap \
> +    options:rxq_pcap=hv1/vif1-rx.pcap \
> +    ofport-request=1
> +
> +ovs-vsctl -- add-port br-int hv1-vif2 -- \
> +    set interface hv1-vif2 external-ids:iface-id=alice1 \
> +    options:tx_pcap=hv1/vif2-tx.pcap \
> +    options:rxq_pcap=hv1/vif2-rx.pcap \
> +    ofport-request=2
> +
> +sim_add hv2
> +as hv2
> +ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.2
> +ovs-vsctl -- add-port br-int hv2-vif1 -- \
> +    set interface hv2-vif1 external-ids:iface-id=bob1 \
> +    options:tx_pcap=hv2/vif1-tx.pcap \
> +    options:rxq_pcap=hv2/vif1-rx.pcap \
> +    ofport-request=1
> +
> +
> +# Pre-populate the hypervisors' ARP tables so that we don't lose any
> +# packets for ARP resolution (native tunneling doesn't queue packets
> +# for ARP resolution).
> +ovn_populate_arp
> +
> +# Allow some time for ovn-northd and ovn-controller to catch up.
> +# XXX This should be more systematic.
> +sleep 1
> +
> +ip_to_hex() {
> +    printf "%02x%02x%02x%02x" "$@"
> +}
> +trim_zeros() {
> +    sed 's/\(00\)\{1,\}$//'
> +}
> +
> +# Send ip packets between foo1 and alice1
> +src_mac="f00000010203"
> +dst_mac="000001010203"
> +src_ip=`ip_to_hex 192 168 1 2`
> +dst_ip=`ip_to_hex 172 16 1 2`
>
> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
> +as hv1 ovs-appctl ofproto/trace br-int in_port=1 $packet
> +
> +# Send ip packets between foo1 and bob1
> +src_mac="f00000010203"
> +dst_mac="000001010203"
> +src_ip=`ip_to_hex 192 168 1 2`
> +dst_ip=`ip_to_hex 10 32 1 2`
>
> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
> +
> +echo "---------NB dump-----"
> +ovn-nbctl show
> +echo "---------------------"
> +ovn-nbctl list logical_router
> +echo "---------------------"
> +ovn-nbctl list logical_router_port
> +echo "---------------------"
> +
> +echo "---------SB dump-----"
> +ovn-sbctl list datapath_binding
> +echo "---------------------"
> +ovn-sbctl list port_binding
> +echo "---------------------"
> +ovn-sbctl dump-flows
> +echo "---------------------"
> +
> +echo "------ hv1 dump ----------"
> +as hv1 ovs-ofctl show br-int
> +as hv1 ovs-ofctl dump-flows br-int
> +echo "------ hv2 dump ----------"
> +as hv2 ovs-ofctl show br-int
> +as hv2 ovs-ofctl dump-flows br-int
> +echo "----------------------------"
> +
> +# Packet to Expect at bob1
> +src_mac="000003010203"
> +dst_mac="f00000010205"
> +src_ip=`ip_to_hex 192 168 1 2`
> +dst_ip=`ip_to_hex 10 32 1 2`
>
> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
> +
> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/vif1-tx.pcap |
> trim_zeros > received.packets
> +echo $expected | trim_zeros > expout
> +AT_CHECK([cat received.packets], [0], [expout])
> +
> +# Packet to Expect at alice1
> +src_mac="000002010203"
> +dst_mac="f00000010204"
> +src_ip=`ip_to_hex 192 168 1 2`
> +dst_ip=`ip_to_hex 172 16 1 2`
>
> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
> +
> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap |
> trim_zeros > received1.packets
> +echo $expected | trim_zeros > expout
> +AT_CHECK([cat received1.packets], [0], [expout])
> +
> +for sim in hv1 hv2; do
> +    as $sim
> +    OVS_APP_EXIT_AND_WAIT([ovn-controller])
> +    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
> +    OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +done
> +
> +as ovn-sb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as ovn-nb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as northd
> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> +
> +as main
> +OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +AT_CLEANUP
> --
> 1.7.9.5
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
>
Gurucharan Shetty May 10, 2016, 2:38 p.m. UTC | #2
On 9 May 2016 at 14:51, Darrell Ball <dlu998@gmail.com> wrote:

> I have some initial clarifications first before review
>
> On Fri, May 6, 2016 at 9:21 AM, Gurucharan Shetty <guru@ovn.org> wrote:
>
>> Currently we can connect routers via "peer"ing. This limits
>> the number of routers that can be connected with each other
>> directly to 2.
>>
>
>
> This sounds like you are saying that cannot have a topology like
>
> R1-------R2
>  \           /
>   \        /
>    \      /
>     \   /
>      R3
>
> which is common.
> Did I read the above comment correctly ?
>

The above drawn topology will start getting complex with more gateways. (In
case of k8s each machine is a l3 gateway.). I agree that my usage of word
"directly" may be confusing. I meant routers connected to each other in the
same subnet.


>
>
>
>>
>> One of the design goals for L3 Gateway is to be able to
>> have multiple gateways (each with their own router)
>> connected to a distributed router via a switch.
>>
>
> Its not ideal to have a switch required to connect routers - it somewhat
> defeats the advantages
> of having routers in the first place.
>

I do not have much experience with real world networking setups. There are
probably reasons why your statement is true. From my perspective , all I
see is that a switch is used to connect multiple endpoints in the same
subnet and If I want multiple routers connected to each other in the same
subnet, I need to use a switch. VMware NSX uses something similar in
customer deployments, so I believe it is not out of ordinary to do
something like this.


> This is usually only done if there is existing switching equipment than
> must be traversed
> But in the case, we are just dealing with software where we have total
> flexibility.
>

From OpenStack perspective, each tenant gets a router and multiple switches
with different subnets. The idea is that the OpenStack network plugin, can
at best split this single tenant router into 2 with a switch in between on
a  subnet that neutron does not use. Taking the same logic forward for k8s,
I can support multiple gateways.

Connecting multiple routers to each other without a switch makes it a pain
to remember the interconnections and create multiple subnets (which may
simply not be available for a networking backend for internal use).


>
>
>>
>> With the above goal in mind, this commit gives the general
>> ability to connect multiple routers via a switch.
>>
>> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
>> ---
>> This needs the following 2 commits under review to
>> first go in.
>> 1. ofproto-dpif: Rename "recurse" to "indentation".
>> 2. ofproto-dpif: Do not count resubmit to later tables against limit.
>> ---
>>  ovn/controller/lflow.c  |   19 ++--
>>  ovn/northd/ovn-northd.c |   56 ++++++++++-
>>  ovn/ovn-nb.xml          |    7 --
>>  tests/ovn.at            |  236
>> +++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 299 insertions(+), 19 deletions(-)
>>
>> diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
>> index 96b7c66..efc427d 100644
>> --- a/ovn/controller/lflow.c
>> +++ b/ovn/controller/lflow.c
>> @@ -215,15 +215,15 @@ add_logical_flows(struct controller_ctx *ctx, const
>> struct lport_index *lports,
>>          if (is_switch(ldp)) {
>>              /* For a logical switch datapath, local_datapaths tells us
>> if there
>>               * are any local ports for this datapath.  If not, we can
>> skip
>> -             * processing logical flows if the flow belongs to egress
>> pipeline
>> -             * or if that logical switch datapath is not patched to any
>> logical
>> -             * router.
>> +             * processing logical flows if that logical switch datapath
>> is not
>> +             * patched to any logical router.
>>               *
>> -             * Otherwise, we still need the ingress pipeline because
>> even if
>> -             * there are no local ports, we still may need to execute
>> the ingress
>> -             * pipeline after a packet leaves a logical router.  Further
>> -             * optimization is possible, but not based on what we know
>> with
>> -             * local_datapaths right now.
>> +             * Otherwise, we still need both ingress and egress pipeline
>> +             * because even if there are no local ports, we still may
>> need to
>> +             * execute the ingress pipeline after a packet leaves a
>> logical
>> +             * router and we need to do egress pipeline for a switch that
>> +             * is connected to only routers.  Further optimization is
>> possible,
>> +             * but not based on what we know with local_datapaths right
>> now.
>>               *
>>               * A better approach would be a kind of "flood fill"
>> algorithm:
>>               *
>> @@ -242,9 +242,6 @@ add_logical_flows(struct controller_ctx *ctx, const
>> struct lport_index *lports,
>>               */
>>
>>              if (!get_local_datapath(local_datapaths, ldp->tunnel_key)) {
>> -                if (!ingress) {
>> -                    continue;
>> -                }
>>
>
> This is removing a previous change that was done for some optimization for
> avoiding
> unnecessary flow creation.
> I am not sure about the process here, but maybe this should be tracked as
> a separate
> patch ?
>
The above change is needed for this patch to work. The optimization above
assumes that a switch always has atleast one real physical endpoint. With
this change, since a switch can only be connected to routers, we need to
remove the optimization. The optimization itself will need more careful
consideration and with more information provided to ovn-controller, it can
ideally be improved, but I would ideally not want l3 gateway work delayed
because of it.



>
>
>
>
>>                  if (!get_patched_datapath(patched_datapaths,
>>                                            ldp->tunnel_key)) {
>>                      continue;
>> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
>> index 9e03606..3a29770 100644
>> --- a/ovn/northd/ovn-northd.c
>> +++ b/ovn/northd/ovn-northd.c
>> @@ -2110,7 +2110,13 @@ build_lrouter_flows(struct hmap *datapaths, struct
>> hmap *ports,
>>                  free(actions);
>>                  free(match);
>>              }
>> -        } else if (op->od->n_router_ports) {
>> +        } else if (op->od->n_router_ports && strcmp(op->nbs->type,
>> "router")) {
>> +            /* This is a logical switch port that backs a VM or a
>> container.
>> +             * Extract its addresses. For each of the address, go
>> through all
>> +             * the router ports attached to the switch (to which this
>> port
>> +             * connects) and if the address in question is reachable
>> from the
>> +             * router port, add an ARP entry in that router's pipeline.
>> */
>> +
>>              for (size_t i = 0; i < op->nbs->n_addresses; i++) {
>>                  struct lport_addresses laddrs;
>>                  if (!extract_lport_addresses(op->nbs->addresses[i],
>> &laddrs,
>> @@ -2158,8 +2164,56 @@ build_lrouter_flows(struct hmap *datapaths, struct
>> hmap *ports,
>>
>>                  free(laddrs.ipv4_addrs);
>>              }
>> +        } else if (!strcmp(op->nbs->type, "router")) {
>> +            /* This is a logical switch port that connects to a router.
>> */
>> +
>> +            /* The peer of this switch port is the router port for which
>> +             * we need to add logical flows such that it can resolve
>> +             * ARP entries for all the other router ports connected to
>> +             * the switch in question. */
>> +
>> +            const char *peer_name = smap_get(&op->nbs->options,
>> +                                             "router-port");
>> +            if (!peer_name) {
>> +                continue;
>> +            }
>> +
>> +            struct ovn_port *peer = ovn_port_find(ports, peer_name);
>> +            if (!peer || !peer->nbr || !peer->ip) {
>> +                continue;
>> +            }
>> +
>> +            for (size_t j = 0; j < op->od->n_router_ports; j++) {
>> +                const char *router_port_name = smap_get(
>> +
>> &op->od->router_ports[j]->nbs->options,
>> +                                    "router-port");
>> +                struct ovn_port *router_port = ovn_port_find(ports,
>> +
>>  router_port_name);
>> +                if (!router_port || !router_port->nbr ||
>> !router_port->ip) {
>> +                    continue;
>> +                }
>> +
>> +                /* Skip the router port under consideration. */
>> +                if (router_port == peer) {
>> +                   continue;
>> +                }
>> +
>> +                if (!router_port->ip) {
>> +                    continue;
>> +                }
>> +                char *match = xasprintf("outport == %s && reg0 ==
>> "IP_FMT,
>> +                                        peer->json_key,
>> +                                        IP_ARGS(router_port->ip));
>> +                char *actions = xasprintf("eth.dst = "ETH_ADDR_FMT";
>> next;",
>> +
>> ETH_ADDR_ARGS(router_port->mac));
>> +                ovn_lflow_add(lflows, peer->od, S_ROUTER_IN_ARP_RESOLVE,
>> +                              101, match, actions);
>> +                free(actions);
>> +                free(match);
>> +            }
>>          }
>>      }
>> +
>>      HMAP_FOR_EACH (od, key_node, datapaths) {
>>          if (!od->nbr) {
>>              continue;
>> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
>> index c01455d..d7fd595 100644
>> --- a/ovn/ovn-nb.xml
>> +++ b/ovn/ovn-nb.xml
>> @@ -154,13 +154,6 @@
>>            These options apply when <ref column="type"/> is
>> <code>router</code>.
>>          </p>
>>
>> -        <p>
>> -          If a given logical switch has multiple <code>router</code>
>> ports, the
>> -          <ref table="Logical_Router_Port"/> rows that they reference
>> must be
>> -          all on the same <ref table="Logical_Router"/> (for different
>> -          subnets).
>> -        </p>
>> -
>>          <column name="options" key="router-port">
>>            Required.  The <ref column="name"/> of the <ref
>>            table="Logical_Router_Port"/> to which this logical switch
>> port is
>> diff --git a/tests/ovn.at b/tests/ovn.at
>> index b9d7ada..6961be0 100644
>> --- a/tests/ovn.at
>> +++ b/tests/ovn.at
>> @@ -2545,3 +2545,239 @@ OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>>
>>  AT_CLEANUP
>>
>> +AT_SETUP([ovn -- 2 HVs, 3 LRs connected via LS, static routes])
>> +AT_KEYWORDS([ovnstaticroutes])
>> +AT_SKIP_IF([test $HAVE_PYTHON = no])
>> +ovn_start
>> +
>> +# Logical network:
>> +# Three LRs - R1, R2 and R3 that are connected to each other via LS
>> "join"
>> +# in 20.0.0.0/24 network. R1 has switchess foo (192.168.1.0/24)
>> +# connected to it. R2 has alice (172.16.1.0/24) and R3 has bob (
>> 10.32.1.0/24)
>> +# connected to it.
>> +
>> +ovn-nbctl create Logical_Router name=R1
>> +ovn-nbctl create Logical_Router name=R2
>> +ovn-nbctl create Logical_Router name=R3
>> +
>> +ovn-nbctl lswitch-add foo
>> +ovn-nbctl lswitch-add alice
>> +ovn-nbctl lswitch-add bob
>> +ovn-nbctl lswitch-add join
>> +
>> +# Connect foo to R1
>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=foo \
>> +network=192.168.1.1/24 mac=\"00:00:01:01:02:03\" -- add Logical_Router
>> R1 \
>> +ports @lrp -- lport-add foo rp-foo
>> +
>> +ovn-nbctl set Logical_port rp-foo type=router options:router-port=foo \
>> +addresses=\"00:00:01:01:02:03\"
>> +
>> +# Connect alice to R2
>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=alice \
>> +network=172.16.1.1/24 mac=\"00:00:02:01:02:03\" -- add Logical_Router
>> R2 \
>> +ports @lrp -- lport-add alice rp-alice
>> +
>> +ovn-nbctl set Logical_port rp-alice type=router
>> options:router-port=alice \
>> +addresses=\"00:00:02:01:02:03\"
>> +
>> +# Connect bob to R3
>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=bob \
>> +network=10.32.1.1/24 mac=\"00:00:03:01:02:03\" -- add Logical_Router R3
>> \
>> +ports @lrp -- lport-add bob rp-bob
>> +
>> +ovn-nbctl set Logical_port rp-bob type=router options:router-port=bob \
>> +addresses=\"00:00:03:01:02:03\"
>> +
>> +# Connect R1 to join
>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R1_join \
>> +network=20.0.0.1/24 mac=\"00:00:04:01:02:03\" -- add Logical_Router R1 \
>> +ports @lrp -- lport-add join r1-join
>> +
>> +ovn-nbctl set Logical_port r1-join type=router
>> options:router-port=R1_join \
>> +addresses='"00:00:04:01:02:03"'
>> +
>> +# Connect R2 to join
>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R2_join \
>> +network=20.0.0.2/24 mac=\"00:00:04:01:02:04\" -- add Logical_Router R2 \
>> +ports @lrp -- lport-add join r2-join
>> +
>> +ovn-nbctl set Logical_port r2-join type=router
>> options:router-port=R2_join \
>> +addresses='"00:00:04:01:02:04"'
>> +
>> +
>> +# Connect R3 to join
>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R3_join \
>> +network=20.0.0.3/24 mac=\"00:00:04:01:02:05\" -- add Logical_Router R3 \
>> +ports @lrp -- lport-add join r3-join
>> +
>> +ovn-nbctl set Logical_port r3-join type=router
>> options:router-port=R3_join \
>> +addresses='"00:00:04:01:02:05"'
>> +
>> +#install static routes
>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
>> +R1 static_routes @lrt
>> +
>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
>> +R1 static_routes @lrt
>> +
>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
>> +R2 static_routes @lrt
>> +
>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
>> +R2 static_routes @lrt
>> +
>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
>> +R3 static_routes @lrt
>> +
>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
>> +R3 static_routes @lrt
>> +
>> +# Create logical port foo1 in foo
>> +ovn-nbctl lport-add foo foo1 \
>> +-- lport-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
>> +
>> +# Create logical port alice1 in alice
>> +ovn-nbctl lport-add alice alice1 \
>> +-- lport-set-addresses alice1 "f0:00:00:01:02:04 172.16.1.2"
>> +
>> +# Create logical port bob1 in bob
>> +ovn-nbctl lport-add bob bob1 \
>> +-- lport-set-addresses bob1 "f0:00:00:01:02:05 10.32.1.2"
>> +
>> +# Create two hypervisor and create OVS ports corresponding to logical
>> ports.
>> +net_add n1
>> +
>> +sim_add hv1
>> +as hv1
>> +ovs-vsctl add-br br-phys
>> +ovn_attach n1 br-phys 192.168.0.1
>> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
>> +    set interface hv1-vif1 external-ids:iface-id=foo1 \
>> +    options:tx_pcap=hv1/vif1-tx.pcap \
>> +    options:rxq_pcap=hv1/vif1-rx.pcap \
>> +    ofport-request=1
>> +
>> +ovs-vsctl -- add-port br-int hv1-vif2 -- \
>> +    set interface hv1-vif2 external-ids:iface-id=alice1 \
>> +    options:tx_pcap=hv1/vif2-tx.pcap \
>> +    options:rxq_pcap=hv1/vif2-rx.pcap \
>> +    ofport-request=2
>> +
>> +sim_add hv2
>> +as hv2
>> +ovs-vsctl add-br br-phys
>> +ovn_attach n1 br-phys 192.168.0.2
>> +ovs-vsctl -- add-port br-int hv2-vif1 -- \
>> +    set interface hv2-vif1 external-ids:iface-id=bob1 \
>> +    options:tx_pcap=hv2/vif1-tx.pcap \
>> +    options:rxq_pcap=hv2/vif1-rx.pcap \
>> +    ofport-request=1
>> +
>> +
>> +# Pre-populate the hypervisors' ARP tables so that we don't lose any
>> +# packets for ARP resolution (native tunneling doesn't queue packets
>> +# for ARP resolution).
>> +ovn_populate_arp
>> +
>> +# Allow some time for ovn-northd and ovn-controller to catch up.
>> +# XXX This should be more systematic.
>> +sleep 1
>> +
>> +ip_to_hex() {
>> +    printf "%02x%02x%02x%02x" "$@"
>> +}
>> +trim_zeros() {
>> +    sed 's/\(00\)\{1,\}$//'
>> +}
>> +
>> +# Send ip packets between foo1 and alice1
>> +src_mac="f00000010203"
>> +dst_mac="000001010203"
>> +src_ip=`ip_to_hex 192 168 1 2`
>> +dst_ip=`ip_to_hex 172 16 1 2`
>>
>> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>> +as hv1 ovs-appctl ofproto/trace br-int in_port=1 $packet
>> +
>> +# Send ip packets between foo1 and bob1
>> +src_mac="f00000010203"
>> +dst_mac="000001010203"
>> +src_ip=`ip_to_hex 192 168 1 2`
>> +dst_ip=`ip_to_hex 10 32 1 2`
>>
>> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>> +
>> +echo "---------NB dump-----"
>> +ovn-nbctl show
>> +echo "---------------------"
>> +ovn-nbctl list logical_router
>> +echo "---------------------"
>> +ovn-nbctl list logical_router_port
>> +echo "---------------------"
>> +
>> +echo "---------SB dump-----"
>> +ovn-sbctl list datapath_binding
>> +echo "---------------------"
>> +ovn-sbctl list port_binding
>> +echo "---------------------"
>> +ovn-sbctl dump-flows
>> +echo "---------------------"
>> +
>> +echo "------ hv1 dump ----------"
>> +as hv1 ovs-ofctl show br-int
>> +as hv1 ovs-ofctl dump-flows br-int
>> +echo "------ hv2 dump ----------"
>> +as hv2 ovs-ofctl show br-int
>> +as hv2 ovs-ofctl dump-flows br-int
>> +echo "----------------------------"
>> +
>> +# Packet to Expect at bob1
>> +src_mac="000003010203"
>> +dst_mac="f00000010205"
>> +src_ip=`ip_to_hex 192 168 1 2`
>> +dst_ip=`ip_to_hex 10 32 1 2`
>>
>> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
>> +
>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/vif1-tx.pcap |
>> trim_zeros > received.packets
>> +echo $expected | trim_zeros > expout
>> +AT_CHECK([cat received.packets], [0], [expout])
>> +
>> +# Packet to Expect at alice1
>> +src_mac="000002010203"
>> +dst_mac="f00000010204"
>> +src_ip=`ip_to_hex 192 168 1 2`
>> +dst_ip=`ip_to_hex 172 16 1 2`
>>
>> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
>> +
>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap |
>> trim_zeros > received1.packets
>> +echo $expected | trim_zeros > expout
>> +AT_CHECK([cat received1.packets], [0], [expout])
>> +
>> +for sim in hv1 hv2; do
>> +    as $sim
>> +    OVS_APP_EXIT_AND_WAIT([ovn-controller])
>> +    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>> +    OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> +done
>> +
>> +as ovn-sb
>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> +
>> +as ovn-nb
>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> +
>> +as northd
>> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
>> +
>> +as main
>> +OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> +
>> +AT_CLEANUP
>> --
>> 1.7.9.5
>>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> http://openvswitch.org/mailman/listinfo/dev
>>
>
>
Darrell Ball May 11, 2016, 11:01 p.m. UTC | #3
On Tue, May 10, 2016 at 7:38 AM, Guru Shetty <guru@ovn.org> wrote:

>
>
> On 9 May 2016 at 14:51, Darrell Ball <dlu998@gmail.com> wrote:
>
>> I have some initial clarifications first before review
>>
>> On Fri, May 6, 2016 at 9:21 AM, Gurucharan Shetty <guru@ovn.org> wrote:
>>
>>> Currently we can connect routers via "peer"ing. This limits
>>> the number of routers that can be connected with each other
>>> directly to 2.
>>>
>>
>>
>> This sounds like you are saying that cannot have a topology like
>>
>> R1-------R2
>>  \           /
>>   \        /
>>    \      /
>>     \   /
>>      R3
>>
>> which is common.
>> Did I read the above comment correctly ?
>>
>
> The above drawn topology will start getting complex with more gateways.
> (In case of k8s each machine is a l3 gateway.). I agree that my usage of
> word "directly" may be confusing. I meant routers connected to each other
> in the same subnet.
>

My question was:

Is it required with your implementation, that multiple logical routers be
connected to each other via a glue or transit LS ?

It seems you have this requirement, meaning the answer is Yes; if not,
please let me know otherwise.




>
>
>>
>>
>>
>>>
>>> One of the design goals for L3 Gateway is to be able to
>>> have multiple gateways (each with their own router)
>>> connected to a distributed router via a switch.
>>>
>>
>> Its not ideal to have a switch required to connect routers - it somewhat
>> defeats the advantages
>> of having routers in the first place.
>>
>
> I do not have much experience with real world networking setups. There are
> probably reasons why your statement is true. From my perspective , all I
> see is that a switch is used to connect multiple endpoints in the same
> subnet and If I want multiple routers connected to each other in the same
> subnet, I need to use a switch. VMware NSX uses something similar in
> customer deployments, so I believe it is not out of ordinary to do
> something like this.
>

Some reasons why having a “transit LS” is “undesirable” is:

1)    1)  It creates additional requirements at the CMS layer for setting
up networks; i.e. additional programming is required at the OVN northbound
interface for the special transit LSs, interactions with the logical router
peers.

In cases where some NSX products do this, it is hidden from the user, as
one would minimally expect.

2)     2) From OVN POV, it adds an additional OVN datapath to all
processing to the packet path and programming/processing for that datapath.

because you have

R1<->Transit LS<->R2

vs

R1<->R2

3)     3) You have to coordinate the transit LS subnet to handle all
addresses in this same subnet for all the logical routers and all their
transit LS peers.

4)    4) Seems like L3 load balancing, ECMP, would be more complicated at
best.

5)    5)  1000s of additional arp resolve flows rules are needed in normal
cases in addition to added burden of the special transit LS others flows.




>
>
>> This is usually only done if there is existing switching equipment than
>> must be traversed
>> But in the case, we are just dealing with software where we have total
>> flexibility.
>>
>
> From OpenStack perspective, each tenant gets a router and multiple
> switches with different subnets. The idea is that the OpenStack network
> plugin, can at best split this single tenant router into 2 with a switch in
> between on a  subnet that neutron does not use. Taking the same logic
> forward for k8s, I can support multiple gateways.
>

As far as I have heard from Openstack folks, each tenant can create
multiple logical routers; they can even overlap subnets if they wanted. So,
I am not sure the above assumption holds true.



>
> Connecting multiple routers to each other without a switch makes it a pain
> to remember the interconnections and create multiple subnets (which may
> simply not be available for a networking backend for internal use).
>

Well, just looking at the 10.x.x.x private IP range, there are more than 16
million IP addresses

Point to point direct router links, Rx<->Ry subnets, can be /31. In the
most extreme case, maybe 20,000 addresses would be used. I don’t think
there should be a subnet depletion problem here, unless it’s a contrived
misconfiguration.




>
>
>>
>>
>>>
>>> With the above goal in mind, this commit gives the general
>>> ability to connect multiple routers via a switch.
>>>
>>> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
>>> ---
>>> This needs the following 2 commits under review to
>>> first go in.
>>> 1. ofproto-dpif: Rename "recurse" to "indentation".
>>> 2. ofproto-dpif: Do not count resubmit to later tables against limit.
>>> ---
>>>  ovn/controller/lflow.c  |   19 ++--
>>>  ovn/northd/ovn-northd.c |   56 ++++++++++-
>>>  ovn/ovn-nb.xml          |    7 --
>>>  tests/ovn.at            |  236
>>> +++++++++++++++++++++++++++++++++++++++++++++++
>>>  4 files changed, 299 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
>>> index 96b7c66..efc427d 100644
>>> --- a/ovn/controller/lflow.c
>>> +++ b/ovn/controller/lflow.c
>>> @@ -215,15 +215,15 @@ add_logical_flows(struct controller_ctx *ctx,
>>> const struct lport_index *lports,
>>>          if (is_switch(ldp)) {
>>>              /* For a logical switch datapath, local_datapaths tells us
>>> if there
>>>               * are any local ports for this datapath.  If not, we can
>>> skip
>>> -             * processing logical flows if the flow belongs to egress
>>> pipeline
>>> -             * or if that logical switch datapath is not patched to any
>>> logical
>>> -             * router.
>>> +             * processing logical flows if that logical switch datapath
>>> is not
>>> +             * patched to any logical router.
>>>               *
>>> -             * Otherwise, we still need the ingress pipeline because
>>> even if
>>> -             * there are no local ports, we still may need to execute
>>> the ingress
>>> -             * pipeline after a packet leaves a logical router.  Further
>>> -             * optimization is possible, but not based on what we know
>>> with
>>> -             * local_datapaths right now.
>>> +             * Otherwise, we still need both ingress and egress pipeline
>>> +             * because even if there are no local ports, we still may
>>> need to
>>> +             * execute the ingress pipeline after a packet leaves a
>>> logical
>>> +             * router and we need to do egress pipeline for a switch
>>> that
>>> +             * is connected to only routers.  Further optimization is
>>> possible,
>>> +             * but not based on what we know with local_datapaths right
>>> now.
>>>               *
>>>               * A better approach would be a kind of "flood fill"
>>> algorithm:
>>>               *
>>> @@ -242,9 +242,6 @@ add_logical_flows(struct controller_ctx *ctx, const
>>> struct lport_index *lports,
>>>               */
>>>
>>>              if (!get_local_datapath(local_datapaths, ldp->tunnel_key)) {
>>> -                if (!ingress) {
>>> -                    continue;
>>> -                }
>>>
>>
>> This is removing a previous change that was done for some optimization
>> for avoiding
>> unnecessary flow creation.
>> I am not sure about the process here, but maybe this should be tracked as
>> a separate
>> patch ?
>>
> The above change is needed for this patch to work. The optimization above
> assumes that a switch always has atleast one real physical endpoint. With
> this change, since a switch can only be connected to routers, we need to
> remove the optimization. The optimization itself will need more careful
> consideration and with more information provided to ovn-controller, it can
> ideally be improved, but I would ideally not want l3 gateway work delayed
> because of it.
>
>
My point was that this code change removes a previous optimization. This
leads to 2 questions – is it worth removing the optimization for the new
code as is and is it worth adding back optimizations with the added
complexity of the new code.



>
>
>>
>>
>>
>>
>>>                  if (!get_patched_datapath(patched_datapaths,
>>>                                            ldp->tunnel_key)) {
>>>                      continue;
>>> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
>>> index 9e03606..3a29770 100644
>>> --- a/ovn/northd/ovn-northd.c
>>> +++ b/ovn/northd/ovn-northd.c
>>> @@ -2110,7 +2110,13 @@ build_lrouter_flows(struct hmap *datapaths,
>>> struct hmap *ports,
>>>                  free(actions);
>>>                  free(match);
>>>              }
>>> -        } else if (op->od->n_router_ports) {
>>> +        } else if (op->od->n_router_ports && strcmp(op->nbs->type,
>>> "router")) {
>>> +            /* This is a logical switch port that backs a VM or a
>>> container.
>>> +             * Extract its addresses. For each of the address, go
>>> through all
>>> +             * the router ports attached to the switch (to which this
>>> port
>>> +             * connects) and if the address in question is reachable
>>> from the
>>> +             * router port, add an ARP entry in that router's pipeline.
>>> */
>>> +
>>>              for (size_t i = 0; i < op->nbs->n_addresses; i++) {
>>>                  struct lport_addresses laddrs;
>>>                  if (!extract_lport_addresses(op->nbs->addresses[i],
>>> &laddrs,
>>> @@ -2158,8 +2164,56 @@ build_lrouter_flows(struct hmap *datapaths,
>>> struct hmap *ports,
>>>
>>>                  free(laddrs.ipv4_addrs);
>>>              }
>>> +        } else if (!strcmp(op->nbs->type, "router")) {
>>> +            /* This is a logical switch port that connects to a router.
>>> */
>>> +
>>> +            /* The peer of this switch port is the router port for which
>>> +             * we need to add logical flows such that it can resolve
>>> +             * ARP entries for all the other router ports connected to
>>> +             * the switch in question. */
>>> +
>>> +            const char *peer_name = smap_get(&op->nbs->options,
>>> +                                             "router-port");
>>> +            if (!peer_name) {
>>> +                continue;
>>> +            }
>>> +
>>> +            struct ovn_port *peer = ovn_port_find(ports, peer_name);
>>> +            if (!peer || !peer->nbr || !peer->ip) {
>>> +                continue;
>>> +            }
>>> +
>>> +            for (size_t j = 0; j < op->od->n_router_ports; j++) {
>>> +                const char *router_port_name = smap_get(
>>> +
>>> &op->od->router_ports[j]->nbs->options,
>>> +                                    "router-port");
>>> +                struct ovn_port *router_port = ovn_port_find(ports,
>>> +
>>>  router_port_name);
>>> +                if (!router_port || !router_port->nbr ||
>>> !router_port->ip) {
>>> +                    continue;
>>> +                }
>>> +
>>> +                /* Skip the router port under consideration. */
>>> +                if (router_port == peer) {
>>> +                   continue;
>>> +                }
>>> +
>>> +                if (!router_port->ip) {
>>> +                    continue;
>>> +                }
>>> +                char *match = xasprintf("outport == %s && reg0 ==
>>> "IP_FMT,
>>> +                                        peer->json_key,
>>> +                                        IP_ARGS(router_port->ip));
>>> +                char *actions = xasprintf("eth.dst = "ETH_ADDR_FMT";
>>> next;",
>>> +
>>> ETH_ADDR_ARGS(router_port->mac));
>>> +                ovn_lflow_add(lflows, peer->od, S_ROUTER_IN_ARP_RESOLVE,
>>> +                              101, match, actions);
>>> +                free(actions);
>>> +                free(match);
>>> +            }
>>>          }
>>>      }
>>> +
>>>      HMAP_FOR_EACH (od, key_node, datapaths) {
>>>          if (!od->nbr) {
>>>              continue;
>>> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
>>> index c01455d..d7fd595 100644
>>> --- a/ovn/ovn-nb.xml
>>> +++ b/ovn/ovn-nb.xml
>>> @@ -154,13 +154,6 @@
>>>            These options apply when <ref column="type"/> is
>>> <code>router</code>.
>>>          </p>
>>>
>>> -        <p>
>>> -          If a given logical switch has multiple <code>router</code>
>>> ports, the
>>> -          <ref table="Logical_Router_Port"/> rows that they reference
>>> must be
>>> -          all on the same <ref table="Logical_Router"/> (for different
>>> -          subnets).
>>> -        </p>
>>> -
>>>          <column name="options" key="router-port">
>>>            Required.  The <ref column="name"/> of the <ref
>>>            table="Logical_Router_Port"/> to which this logical switch
>>> port is
>>> diff --git a/tests/ovn.at b/tests/ovn.at
>>> index b9d7ada..6961be0 100644
>>> --- a/tests/ovn.at
>>> +++ b/tests/ovn.at
>>> @@ -2545,3 +2545,239 @@ OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>>>
>>>  AT_CLEANUP
>>>
>>> +AT_SETUP([ovn -- 2 HVs, 3 LRs connected via LS, static routes])
>>> +AT_KEYWORDS([ovnstaticroutes])
>>> +AT_SKIP_IF([test $HAVE_PYTHON = no])
>>> +ovn_start
>>> +
>>> +# Logical network:
>>> +# Three LRs - R1, R2 and R3 that are connected to each other via LS
>>> "join"
>>> +# in 20.0.0.0/24 network. R1 has switchess foo (192.168.1.0/24)
>>> +# connected to it. R2 has alice (172.16.1.0/24) and R3 has bob (
>>> 10.32.1.0/24)
>>> +# connected to it.
>>> +
>>> +ovn-nbctl create Logical_Router name=R1
>>> +ovn-nbctl create Logical_Router name=R2
>>> +ovn-nbctl create Logical_Router name=R3
>>> +
>>> +ovn-nbctl lswitch-add foo
>>> +ovn-nbctl lswitch-add alice
>>> +ovn-nbctl lswitch-add bob
>>> +ovn-nbctl lswitch-add join
>>> +
>>> +# Connect foo to R1
>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=foo \
>>> +network=192.168.1.1/24 mac=\"00:00:01:01:02:03\" -- add Logical_Router
>>> R1 \
>>> +ports @lrp -- lport-add foo rp-foo
>>> +
>>> +ovn-nbctl set Logical_port rp-foo type=router options:router-port=foo \
>>> +addresses=\"00:00:01:01:02:03\"
>>> +
>>> +# Connect alice to R2
>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=alice \
>>> +network=172.16.1.1/24 mac=\"00:00:02:01:02:03\" -- add Logical_Router
>>> R2 \
>>> +ports @lrp -- lport-add alice rp-alice
>>> +
>>> +ovn-nbctl set Logical_port rp-alice type=router
>>> options:router-port=alice \
>>> +addresses=\"00:00:02:01:02:03\"
>>> +
>>> +# Connect bob to R3
>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=bob \
>>> +network=10.32.1.1/24 mac=\"00:00:03:01:02:03\" -- add Logical_Router
>>> R3 \
>>> +ports @lrp -- lport-add bob rp-bob
>>> +
>>> +ovn-nbctl set Logical_port rp-bob type=router options:router-port=bob \
>>> +addresses=\"00:00:03:01:02:03\"
>>> +
>>> +# Connect R1 to join
>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R1_join \
>>> +network=20.0.0.1/24 mac=\"00:00:04:01:02:03\" -- add Logical_Router R1
>>> \
>>> +ports @lrp -- lport-add join r1-join
>>> +
>>> +ovn-nbctl set Logical_port r1-join type=router
>>> options:router-port=R1_join \
>>> +addresses='"00:00:04:01:02:03"'
>>> +
>>> +# Connect R2 to join
>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R2_join \
>>> +network=20.0.0.2/24 mac=\"00:00:04:01:02:04\" -- add Logical_Router R2
>>> \
>>> +ports @lrp -- lport-add join r2-join
>>> +
>>> +ovn-nbctl set Logical_port r2-join type=router
>>> options:router-port=R2_join \
>>> +addresses='"00:00:04:01:02:04"'
>>> +
>>> +
>>> +# Connect R3 to join
>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R3_join \
>>> +network=20.0.0.3/24 mac=\"00:00:04:01:02:05\" -- add Logical_Router R3
>>> \
>>> +ports @lrp -- lport-add join r3-join
>>> +
>>> +ovn-nbctl set Logical_port r3-join type=router
>>> options:router-port=R3_join \
>>> +addresses='"00:00:04:01:02:05"'
>>> +
>>> +#install static routes
>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
>>> +R1 static_routes @lrt
>>> +
>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
>>> +R1 static_routes @lrt
>>> +
>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
>>> +R2 static_routes @lrt
>>> +
>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
>>> +R2 static_routes @lrt
>>> +
>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
>>> +R3 static_routes @lrt
>>> +
>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
>>> +R3 static_routes @lrt
>>> +
>>> +# Create logical port foo1 in foo
>>> +ovn-nbctl lport-add foo foo1 \
>>> +-- lport-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
>>> +
>>> +# Create logical port alice1 in alice
>>> +ovn-nbctl lport-add alice alice1 \
>>> +-- lport-set-addresses alice1 "f0:00:00:01:02:04 172.16.1.2"
>>> +
>>> +# Create logical port bob1 in bob
>>> +ovn-nbctl lport-add bob bob1 \
>>> +-- lport-set-addresses bob1 "f0:00:00:01:02:05 10.32.1.2"
>>> +
>>> +# Create two hypervisor and create OVS ports corresponding to logical
>>> ports.
>>> +net_add n1
>>> +
>>> +sim_add hv1
>>> +as hv1
>>> +ovs-vsctl add-br br-phys
>>> +ovn_attach n1 br-phys 192.168.0.1
>>> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
>>> +    set interface hv1-vif1 external-ids:iface-id=foo1 \
>>> +    options:tx_pcap=hv1/vif1-tx.pcap \
>>> +    options:rxq_pcap=hv1/vif1-rx.pcap \
>>> +    ofport-request=1
>>> +
>>> +ovs-vsctl -- add-port br-int hv1-vif2 -- \
>>> +    set interface hv1-vif2 external-ids:iface-id=alice1 \
>>> +    options:tx_pcap=hv1/vif2-tx.pcap \
>>> +    options:rxq_pcap=hv1/vif2-rx.pcap \
>>> +    ofport-request=2
>>> +
>>> +sim_add hv2
>>> +as hv2
>>> +ovs-vsctl add-br br-phys
>>> +ovn_attach n1 br-phys 192.168.0.2
>>> +ovs-vsctl -- add-port br-int hv2-vif1 -- \
>>> +    set interface hv2-vif1 external-ids:iface-id=bob1 \
>>> +    options:tx_pcap=hv2/vif1-tx.pcap \
>>> +    options:rxq_pcap=hv2/vif1-rx.pcap \
>>> +    ofport-request=1
>>> +
>>> +
>>> +# Pre-populate the hypervisors' ARP tables so that we don't lose any
>>> +# packets for ARP resolution (native tunneling doesn't queue packets
>>> +# for ARP resolution).
>>> +ovn_populate_arp
>>> +
>>> +# Allow some time for ovn-northd and ovn-controller to catch up.
>>> +# XXX This should be more systematic.
>>> +sleep 1
>>> +
>>> +ip_to_hex() {
>>> +    printf "%02x%02x%02x%02x" "$@"
>>> +}
>>> +trim_zeros() {
>>> +    sed 's/\(00\)\{1,\}$//'
>>> +}
>>> +
>>> +# Send ip packets between foo1 and alice1
>>> +src_mac="f00000010203"
>>> +dst_mac="000001010203"
>>> +src_ip=`ip_to_hex 192 168 1 2`
>>> +dst_ip=`ip_to_hex 172 16 1 2`
>>>
>>> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
>>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>>> +as hv1 ovs-appctl ofproto/trace br-int in_port=1 $packet
>>> +
>>> +# Send ip packets between foo1 and bob1
>>> +src_mac="f00000010203"
>>> +dst_mac="000001010203"
>>> +src_ip=`ip_to_hex 192 168 1 2`
>>> +dst_ip=`ip_to_hex 10 32 1 2`
>>>
>>> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
>>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>>> +
>>> +echo "---------NB dump-----"
>>> +ovn-nbctl show
>>> +echo "---------------------"
>>> +ovn-nbctl list logical_router
>>> +echo "---------------------"
>>> +ovn-nbctl list logical_router_port
>>> +echo "---------------------"
>>> +
>>> +echo "---------SB dump-----"
>>> +ovn-sbctl list datapath_binding
>>> +echo "---------------------"
>>> +ovn-sbctl list port_binding
>>> +echo "---------------------"
>>> +ovn-sbctl dump-flows
>>> +echo "---------------------"
>>> +
>>> +echo "------ hv1 dump ----------"
>>> +as hv1 ovs-ofctl show br-int
>>> +as hv1 ovs-ofctl dump-flows br-int
>>> +echo "------ hv2 dump ----------"
>>> +as hv2 ovs-ofctl show br-int
>>> +as hv2 ovs-ofctl dump-flows br-int
>>> +echo "----------------------------"
>>> +
>>> +# Packet to Expect at bob1
>>> +src_mac="000003010203"
>>> +dst_mac="f00000010205"
>>> +src_ip=`ip_to_hex 192 168 1 2`
>>> +dst_ip=`ip_to_hex 10 32 1 2`
>>>
>>> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
>>> +
>>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/vif1-tx.pcap |
>>> trim_zeros > received.packets
>>> +echo $expected | trim_zeros > expout
>>> +AT_CHECK([cat received.packets], [0], [expout])
>>> +
>>> +# Packet to Expect at alice1
>>> +src_mac="000002010203"
>>> +dst_mac="f00000010204"
>>> +src_ip=`ip_to_hex 192 168 1 2`
>>> +dst_ip=`ip_to_hex 172 16 1 2`
>>>
>>> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
>>> +
>>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap |
>>> trim_zeros > received1.packets
>>> +echo $expected | trim_zeros > expout
>>> +AT_CHECK([cat received1.packets], [0], [expout])
>>> +
>>> +for sim in hv1 hv2; do
>>> +    as $sim
>>> +    OVS_APP_EXIT_AND_WAIT([ovn-controller])
>>> +    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>>> +    OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>>> +done
>>> +
>>> +as ovn-sb
>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>>> +
>>> +as ovn-nb
>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>>> +
>>> +as northd
>>> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
>>> +
>>> +as main
>>> +OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>>> +
>>> +AT_CLEANUP
>>> --
>>> 1.7.9.5
>>>
>>> _______________________________________________
>>> dev mailing list
>>> dev@openvswitch.org
>>> http://openvswitch.org/mailman/listinfo/dev
>>>
>>
>>
>
Gurucharan Shetty May 11, 2016, 11:42 p.m. UTC | #4
>
>
>
> Some reasons why having a “transit LS” is “undesirable” is:
>
> 1)    1)  It creates additional requirements at the CMS layer for setting
> up networks; i.e. additional programming is required at the OVN northbound
> interface for the special transit LSs, interactions with the logical router
> peers.
>

Agreed that there is additional work needed for the CMS plugin. That work
is needed even if it is just peering as they need to convert one router in
to two in OVN (unless OVN automatically makes this split)


>
> In cases where some NSX products do this, it is hidden from the user, as
> one would minimally expect.
>
> 2)     2) From OVN POV, it adds an additional OVN datapath to all
> processing to the packet path and programming/processing for that datapath.
>
> because you have
>
> R1<->Transit LS<->R2
>
> vs
>
> R1<->R2
>

Agreed that there is an additional datapath.


>
> 3)     3) You have to coordinate the transit LS subnet to handle all
> addresses in this same subnet for all the logical routers and all their
> transit LS peers.
>

I don't understand what you mean. If a user uses one gateway, a transit LS
only gets connected by 2 routers.
Other routers get their own transit LS.


>
> 4)    4) Seems like L3 load balancing, ECMP, would be more complicated at
> best.
>
> 5)    5)  1000s of additional arp resolve flows rules are needed in normal
> cases in addition to added burden of the special transit LS others flows.
>

I don't understand why that would be the case.


>
>
>
>
> >
> >
> >> This is usually only done if there is existing switching equipment than
> >> must be traversed
> >> But in the case, we are just dealing with software where we have total
> >> flexibility.
> >>
> >
> > From OpenStack perspective, each tenant gets a router and multiple
> > switches with different subnets. The idea is that the OpenStack network
> > plugin, can at best split this single tenant router into 2 with a switch
> in
> > between on a  subnet that neutron does not use. Taking the same logic
> > forward for k8s, I can support multiple gateways.
> >
>
> As far as I have heard from Openstack folks, each tenant can create
> multiple logical routers; they can even overlap subnets if they wanted. So,
> I am not sure the above assumption holds true.
>

A tenant can create multiple logical routers. But a connected logical
topology can have one router connected to the gateway, atleast that is the
common use case.

>
>
>
> >
> > Connecting multiple routers to each other without a switch makes it a
> pain
> > to remember the interconnections and create multiple subnets (which may
> > simply not be available for a networking backend for internal use).
> >
>
> Well, just looking at the 10.x.x.x private IP range, there are more than 16
> million IP addresses
>
> Point to point direct router links, Rx<->Ry subnets, can be /31. In the
> most extreme case, maybe 20,000 addresses would be used. I don’t think
> there should be a subnet depletion problem here, unless it’s a contrived
> misconfiguration.
>

Who is going to manage all the interconnections? As far as I see a transit
LS is much more simpler.


>
>
>
>
> >
> >
> >>
> >>
> >>>
> >>> With the above goal in mind, this commit gives the general
> >>> ability to connect multiple routers via a switch.
> >>>
> >>> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
> >>> ---
> >>> This needs the following 2 commits under review to
> >>> first go in.
> >>> 1. ofproto-dpif: Rename "recurse" to "indentation".
> >>> 2. ofproto-dpif: Do not count resubmit to later tables against limit.
> >>> ---
> >>>  ovn/controller/lflow.c  |   19 ++--
> >>>  ovn/northd/ovn-northd.c |   56 ++++++++++-
> >>>  ovn/ovn-nb.xml          |    7 --
> >>>  tests/ovn.at            |  236
> >>> +++++++++++++++++++++++++++++++++++++++++++++++
> >>>  4 files changed, 299 insertions(+), 19 deletions(-)
> >>>
> >>> diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
> >>> index 96b7c66..efc427d 100644
> >>> --- a/ovn/controller/lflow.c
> >>> +++ b/ovn/controller/lflow.c
> >>> @@ -215,15 +215,15 @@ add_logical_flows(struct controller_ctx *ctx,
> >>> const struct lport_index *lports,
> >>>          if (is_switch(ldp)) {
> >>>              /* For a logical switch datapath, local_datapaths tells us
> >>> if there
> >>>               * are any local ports for this datapath.  If not, we can
> >>> skip
> >>> -             * processing logical flows if the flow belongs to egress
> >>> pipeline
> >>> -             * or if that logical switch datapath is not patched to
> any
> >>> logical
> >>> -             * router.
> >>> +             * processing logical flows if that logical switch
> datapath
> >>> is not
> >>> +             * patched to any logical router.
> >>>               *
> >>> -             * Otherwise, we still need the ingress pipeline because
> >>> even if
> >>> -             * there are no local ports, we still may need to execute
> >>> the ingress
> >>> -             * pipeline after a packet leaves a logical router.
> Further
> >>> -             * optimization is possible, but not based on what we know
> >>> with
> >>> -             * local_datapaths right now.
> >>> +             * Otherwise, we still need both ingress and egress
> pipeline
> >>> +             * because even if there are no local ports, we still may
> >>> need to
> >>> +             * execute the ingress pipeline after a packet leaves a
> >>> logical
> >>> +             * router and we need to do egress pipeline for a switch
> >>> that
> >>> +             * is connected to only routers.  Further optimization is
> >>> possible,
> >>> +             * but not based on what we know with local_datapaths
> right
> >>> now.
> >>>               *
> >>>               * A better approach would be a kind of "flood fill"
> >>> algorithm:
> >>>               *
> >>> @@ -242,9 +242,6 @@ add_logical_flows(struct controller_ctx *ctx, const
> >>> struct lport_index *lports,
> >>>               */
> >>>
> >>>              if (!get_local_datapath(local_datapaths,
> ldp->tunnel_key)) {
> >>> -                if (!ingress) {
> >>> -                    continue;
> >>> -                }
> >>>
> >>
> >> This is removing a previous change that was done for some optimization
> >> for avoiding
> >> unnecessary flow creation.
> >> I am not sure about the process here, but maybe this should be tracked
> as
> >> a separate
> >> patch ?
> >>
> > The above change is needed for this patch to work. The optimization above
> > assumes that a switch always has atleast one real physical endpoint. With
> > this change, since a switch can only be connected to routers, we need to
> > remove the optimization. The optimization itself will need more careful
> > consideration and with more information provided to ovn-controller, it
> can
> > ideally be improved, but I would ideally not want l3 gateway work delayed
> > because of it.
> >
> >
> My point was that this code change removes a previous optimization. This
> leads to 2 questions – is it worth removing the optimization for the new
> code as is and is it worth adding back optimizations with the added
> complexity of the new code.
>
>
>
> >
> >
> >>
> >>
> >>
> >>
> >>>                  if (!get_patched_datapath(patched_datapaths,
> >>>                                            ldp->tunnel_key)) {
> >>>                      continue;
> >>> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
> >>> index 9e03606..3a29770 100644
> >>> --- a/ovn/northd/ovn-northd.c
> >>> +++ b/ovn/northd/ovn-northd.c
> >>> @@ -2110,7 +2110,13 @@ build_lrouter_flows(struct hmap *datapaths,
> >>> struct hmap *ports,
> >>>                  free(actions);
> >>>                  free(match);
> >>>              }
> >>> -        } else if (op->od->n_router_ports) {
> >>> +        } else if (op->od->n_router_ports && strcmp(op->nbs->type,
> >>> "router")) {
> >>> +            /* This is a logical switch port that backs a VM or a
> >>> container.
> >>> +             * Extract its addresses. For each of the address, go
> >>> through all
> >>> +             * the router ports attached to the switch (to which this
> >>> port
> >>> +             * connects) and if the address in question is reachable
> >>> from the
> >>> +             * router port, add an ARP entry in that router's
> pipeline.
> >>> */
> >>> +
> >>>              for (size_t i = 0; i < op->nbs->n_addresses; i++) {
> >>>                  struct lport_addresses laddrs;
> >>>                  if (!extract_lport_addresses(op->nbs->addresses[i],
> >>> &laddrs,
> >>> @@ -2158,8 +2164,56 @@ build_lrouter_flows(struct hmap *datapaths,
> >>> struct hmap *ports,
> >>>
> >>>                  free(laddrs.ipv4_addrs);
> >>>              }
> >>> +        } else if (!strcmp(op->nbs->type, "router")) {
> >>> +            /* This is a logical switch port that connects to a
> router.
> >>> */
> >>> +
> >>> +            /* The peer of this switch port is the router port for
> which
> >>> +             * we need to add logical flows such that it can resolve
> >>> +             * ARP entries for all the other router ports connected to
> >>> +             * the switch in question. */
> >>> +
> >>> +            const char *peer_name = smap_get(&op->nbs->options,
> >>> +                                             "router-port");
> >>> +            if (!peer_name) {
> >>> +                continue;
> >>> +            }
> >>> +
> >>> +            struct ovn_port *peer = ovn_port_find(ports, peer_name);
> >>> +            if (!peer || !peer->nbr || !peer->ip) {
> >>> +                continue;
> >>> +            }
> >>> +
> >>> +            for (size_t j = 0; j < op->od->n_router_ports; j++) {
> >>> +                const char *router_port_name = smap_get(
> >>> +
> >>> &op->od->router_ports[j]->nbs->options,
> >>> +                                    "router-port");
> >>> +                struct ovn_port *router_port = ovn_port_find(ports,
> >>> +
> >>>  router_port_name);
> >>> +                if (!router_port || !router_port->nbr ||
> >>> !router_port->ip) {
> >>> +                    continue;
> >>> +                }
> >>> +
> >>> +                /* Skip the router port under consideration. */
> >>> +                if (router_port == peer) {
> >>> +                   continue;
> >>> +                }
> >>> +
> >>> +                if (!router_port->ip) {
> >>> +                    continue;
> >>> +                }
> >>> +                char *match = xasprintf("outport == %s && reg0 ==
> >>> "IP_FMT,
> >>> +                                        peer->json_key,
> >>> +                                        IP_ARGS(router_port->ip));
> >>> +                char *actions = xasprintf("eth.dst = "ETH_ADDR_FMT";
> >>> next;",
> >>> +
> >>> ETH_ADDR_ARGS(router_port->mac));
> >>> +                ovn_lflow_add(lflows, peer->od,
> S_ROUTER_IN_ARP_RESOLVE,
> >>> +                              101, match, actions);
> >>> +                free(actions);
> >>> +                free(match);
> >>> +            }
> >>>          }
> >>>      }
> >>> +
> >>>      HMAP_FOR_EACH (od, key_node, datapaths) {
> >>>          if (!od->nbr) {
> >>>              continue;
> >>> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
> >>> index c01455d..d7fd595 100644
> >>> --- a/ovn/ovn-nb.xml
> >>> +++ b/ovn/ovn-nb.xml
> >>> @@ -154,13 +154,6 @@
> >>>            These options apply when <ref column="type"/> is
> >>> <code>router</code>.
> >>>          </p>
> >>>
> >>> -        <p>
> >>> -          If a given logical switch has multiple <code>router</code>
> >>> ports, the
> >>> -          <ref table="Logical_Router_Port"/> rows that they reference
> >>> must be
> >>> -          all on the same <ref table="Logical_Router"/> (for different
> >>> -          subnets).
> >>> -        </p>
> >>> -
> >>>          <column name="options" key="router-port">
> >>>            Required.  The <ref column="name"/> of the <ref
> >>>            table="Logical_Router_Port"/> to which this logical switch
> >>> port is
> >>> diff --git a/tests/ovn.at b/tests/ovn.at
> >>> index b9d7ada..6961be0 100644
> >>> --- a/tests/ovn.at
> >>> +++ b/tests/ovn.at
> >>> @@ -2545,3 +2545,239 @@ OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >>>
> >>>  AT_CLEANUP
> >>>
> >>> +AT_SETUP([ovn -- 2 HVs, 3 LRs connected via LS, static routes])
> >>> +AT_KEYWORDS([ovnstaticroutes])
> >>> +AT_SKIP_IF([test $HAVE_PYTHON = no])
> >>> +ovn_start
> >>> +
> >>> +# Logical network:
> >>> +# Three LRs - R1, R2 and R3 that are connected to each other via LS
> >>> "join"
> >>> +# in 20.0.0.0/24 network. R1 has switchess foo (192.168.1.0/24)
> >>> +# connected to it. R2 has alice (172.16.1.0/24) and R3 has bob (
> >>> 10.32.1.0/24)
> >>> +# connected to it.
> >>> +
> >>> +ovn-nbctl create Logical_Router name=R1
> >>> +ovn-nbctl create Logical_Router name=R2
> >>> +ovn-nbctl create Logical_Router name=R3
> >>> +
> >>> +ovn-nbctl lswitch-add foo
> >>> +ovn-nbctl lswitch-add alice
> >>> +ovn-nbctl lswitch-add bob
> >>> +ovn-nbctl lswitch-add join
> >>> +
> >>> +# Connect foo to R1
> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=foo \
> >>> +network=192.168.1.1/24 mac=\"00:00:01:01:02:03\" -- add
> Logical_Router
> >>> R1 \
> >>> +ports @lrp -- lport-add foo rp-foo
> >>> +
> >>> +ovn-nbctl set Logical_port rp-foo type=router options:router-port=foo
> \
> >>> +addresses=\"00:00:01:01:02:03\"
> >>> +
> >>> +# Connect alice to R2
> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=alice \
> >>> +network=172.16.1.1/24 mac=\"00:00:02:01:02:03\" -- add Logical_Router
> >>> R2 \
> >>> +ports @lrp -- lport-add alice rp-alice
> >>> +
> >>> +ovn-nbctl set Logical_port rp-alice type=router
> >>> options:router-port=alice \
> >>> +addresses=\"00:00:02:01:02:03\"
> >>> +
> >>> +# Connect bob to R3
> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=bob \
> >>> +network=10.32.1.1/24 mac=\"00:00:03:01:02:03\" -- add Logical_Router
> >>> R3 \
> >>> +ports @lrp -- lport-add bob rp-bob
> >>> +
> >>> +ovn-nbctl set Logical_port rp-bob type=router options:router-port=bob
> \
> >>> +addresses=\"00:00:03:01:02:03\"
> >>> +
> >>> +# Connect R1 to join
> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R1_join \
> >>> +network=20.0.0.1/24 mac=\"00:00:04:01:02:03\" -- add Logical_Router
> R1
> >>> \
> >>> +ports @lrp -- lport-add join r1-join
> >>> +
> >>> +ovn-nbctl set Logical_port r1-join type=router
> >>> options:router-port=R1_join \
> >>> +addresses='"00:00:04:01:02:03"'
> >>> +
> >>> +# Connect R2 to join
> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R2_join \
> >>> +network=20.0.0.2/24 mac=\"00:00:04:01:02:04\" -- add Logical_Router
> R2
> >>> \
> >>> +ports @lrp -- lport-add join r2-join
> >>> +
> >>> +ovn-nbctl set Logical_port r2-join type=router
> >>> options:router-port=R2_join \
> >>> +addresses='"00:00:04:01:02:04"'
> >>> +
> >>> +
> >>> +# Connect R3 to join
> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R3_join \
> >>> +network=20.0.0.3/24 mac=\"00:00:04:01:02:05\" -- add Logical_Router
> R3
> >>> \
> >>> +ports @lrp -- lport-add join r3-join
> >>> +
> >>> +ovn-nbctl set Logical_port r3-join type=router
> >>> options:router-port=R3_join \
> >>> +addresses='"00:00:04:01:02:05"'
> >>> +
> >>> +#install static routes
> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
> >>> +R1 static_routes @lrt
> >>> +
> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
> >>> +R1 static_routes @lrt
> >>> +
> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
> >>> +R2 static_routes @lrt
> >>> +
> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
> >>> +R2 static_routes @lrt
> >>> +
> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
> >>> +R3 static_routes @lrt
> >>> +
> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
> >>> +R3 static_routes @lrt
> >>> +
> >>> +# Create logical port foo1 in foo
> >>> +ovn-nbctl lport-add foo foo1 \
> >>> +-- lport-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
> >>> +
> >>> +# Create logical port alice1 in alice
> >>> +ovn-nbctl lport-add alice alice1 \
> >>> +-- lport-set-addresses alice1 "f0:00:00:01:02:04 172.16.1.2"
> >>> +
> >>> +# Create logical port bob1 in bob
> >>> +ovn-nbctl lport-add bob bob1 \
> >>> +-- lport-set-addresses bob1 "f0:00:00:01:02:05 10.32.1.2"
> >>> +
> >>> +# Create two hypervisor and create OVS ports corresponding to logical
> >>> ports.
> >>> +net_add n1
> >>> +
> >>> +sim_add hv1
> >>> +as hv1
> >>> +ovs-vsctl add-br br-phys
> >>> +ovn_attach n1 br-phys 192.168.0.1
> >>> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> >>> +    set interface hv1-vif1 external-ids:iface-id=foo1 \
> >>> +    options:tx_pcap=hv1/vif1-tx.pcap \
> >>> +    options:rxq_pcap=hv1/vif1-rx.pcap \
> >>> +    ofport-request=1
> >>> +
> >>> +ovs-vsctl -- add-port br-int hv1-vif2 -- \
> >>> +    set interface hv1-vif2 external-ids:iface-id=alice1 \
> >>> +    options:tx_pcap=hv1/vif2-tx.pcap \
> >>> +    options:rxq_pcap=hv1/vif2-rx.pcap \
> >>> +    ofport-request=2
> >>> +
> >>> +sim_add hv2
> >>> +as hv2
> >>> +ovs-vsctl add-br br-phys
> >>> +ovn_attach n1 br-phys 192.168.0.2
> >>> +ovs-vsctl -- add-port br-int hv2-vif1 -- \
> >>> +    set interface hv2-vif1 external-ids:iface-id=bob1 \
> >>> +    options:tx_pcap=hv2/vif1-tx.pcap \
> >>> +    options:rxq_pcap=hv2/vif1-rx.pcap \
> >>> +    ofport-request=1
> >>> +
> >>> +
> >>> +# Pre-populate the hypervisors' ARP tables so that we don't lose any
> >>> +# packets for ARP resolution (native tunneling doesn't queue packets
> >>> +# for ARP resolution).
> >>> +ovn_populate_arp
> >>> +
> >>> +# Allow some time for ovn-northd and ovn-controller to catch up.
> >>> +# XXX This should be more systematic.
> >>> +sleep 1
> >>> +
> >>> +ip_to_hex() {
> >>> +    printf "%02x%02x%02x%02x" "$@"
> >>> +}
> >>> +trim_zeros() {
> >>> +    sed 's/\(00\)\{1,\}$//'
> >>> +}
> >>> +
> >>> +# Send ip packets between foo1 and alice1
> >>> +src_mac="f00000010203"
> >>> +dst_mac="000001010203"
> >>> +src_ip=`ip_to_hex 192 168 1 2`
> >>> +dst_ip=`ip_to_hex 172 16 1 2`
> >>>
> >>>
> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
> >>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
> >>> +as hv1 ovs-appctl ofproto/trace br-int in_port=1 $packet
> >>> +
> >>> +# Send ip packets between foo1 and bob1
> >>> +src_mac="f00000010203"
> >>> +dst_mac="000001010203"
> >>> +src_ip=`ip_to_hex 192 168 1 2`
> >>> +dst_ip=`ip_to_hex 10 32 1 2`
> >>>
> >>>
> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
> >>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
> >>> +
> >>> +echo "---------NB dump-----"
> >>> +ovn-nbctl show
> >>> +echo "---------------------"
> >>> +ovn-nbctl list logical_router
> >>> +echo "---------------------"
> >>> +ovn-nbctl list logical_router_port
> >>> +echo "---------------------"
> >>> +
> >>> +echo "---------SB dump-----"
> >>> +ovn-sbctl list datapath_binding
> >>> +echo "---------------------"
> >>> +ovn-sbctl list port_binding
> >>> +echo "---------------------"
> >>> +ovn-sbctl dump-flows
> >>> +echo "---------------------"
> >>> +
> >>> +echo "------ hv1 dump ----------"
> >>> +as hv1 ovs-ofctl show br-int
> >>> +as hv1 ovs-ofctl dump-flows br-int
> >>> +echo "------ hv2 dump ----------"
> >>> +as hv2 ovs-ofctl show br-int
> >>> +as hv2 ovs-ofctl dump-flows br-int
> >>> +echo "----------------------------"
> >>> +
> >>> +# Packet to Expect at bob1
> >>> +src_mac="000003010203"
> >>> +dst_mac="f00000010205"
> >>> +src_ip=`ip_to_hex 192 168 1 2`
> >>> +dst_ip=`ip_to_hex 10 32 1 2`
> >>>
> >>>
> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
> >>> +
> >>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/vif1-tx.pcap |
> >>> trim_zeros > received.packets
> >>> +echo $expected | trim_zeros > expout
> >>> +AT_CHECK([cat received.packets], [0], [expout])
> >>> +
> >>> +# Packet to Expect at alice1
> >>> +src_mac="000002010203"
> >>> +dst_mac="f00000010204"
> >>> +src_ip=`ip_to_hex 192 168 1 2`
> >>> +dst_ip=`ip_to_hex 172 16 1 2`
> >>>
> >>>
> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
> >>> +
> >>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap |
> >>> trim_zeros > received1.packets
> >>> +echo $expected | trim_zeros > expout
> >>> +AT_CHECK([cat received1.packets], [0], [expout])
> >>> +
> >>> +for sim in hv1 hv2; do
> >>> +    as $sim
> >>> +    OVS_APP_EXIT_AND_WAIT([ovn-controller])
> >>> +    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
> >>> +    OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >>> +done
> >>> +
> >>> +as ovn-sb
> >>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >>> +
> >>> +as ovn-nb
> >>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >>> +
> >>> +as northd
> >>> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> >>> +
> >>> +as main
> >>> +OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
> >>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >>> +
> >>> +AT_CLEANUP
> >>> --
> >>> 1.7.9.5
> >>>
> >>> _______________________________________________
> >>> dev mailing list
> >>> dev@openvswitch.org
> >>> http://openvswitch.org/mailman/listinfo/dev
> >>>
> >>
> >>
> >
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
>
Darrell Ball May 12, 2016, 3:45 a.m. UTC | #5
On Wed, May 11, 2016 at 4:42 PM, Guru Shetty <guru@ovn.org> wrote:

>
>>
>> Some reasons why having a “transit LS” is “undesirable” is:
>>
>> 1)    1)  It creates additional requirements at the CMS layer for setting
>> up networks; i.e. additional programming is required at the OVN northbound
>> interface for the special transit LSs, interactions with the logical
>> router
>> peers.
>>
>
> Agreed that there is additional work needed for the CMS plugin. That work
> is needed even if it is just peering as they need to convert one router in
> to two in OVN (unless OVN automatically makes this split)
>

The work to coordinate 2 logical routers and one special LS is more and
also more complicated than
to coordinate 2 logical routers.


>
>
>>
>> In cases where some NSX products do this, it is hidden from the user, as
>> one would minimally expect.
>>
>> 2)     2) From OVN POV, it adds an additional OVN datapath to all
>> processing to the packet path and programming/processing for that
>> datapath.
>>
>> because you have
>>
>> R1<->Transit LS<->R2
>>
>> vs
>>
>> R1<->R2
>>
>
> Agreed that there is an additional datapath.
>
>
>>
>> 3)     3) You have to coordinate the transit LS subnet to handle all
>> addresses in this same subnet for all the logical routers and all their
>> transit LS peers.
>>
>
> I don't understand what you mean. If a user uses one gateway, a transit LS
> only gets connected by 2 routers.
> Other routers get their own transit LS.
>


Each group of logical routers communicating has it own Transit LS.

Take an example with one gateway router and 1000 distributed logical
routers for 1000 tenants/hv,
connecting 1000 HVs for now.
Lets assume each distributed logical router only talks to the gateway
router.
So thats 1000 Transit LSs.
-> 1001 addresses per subnet for each of 1000 subnets (1 for each Transit
LS) ?




>
>
>>
>> 4)    4) Seems like L3 load balancing, ECMP, would be more complicated at
>> best.
>>
>> 5)    5)  1000s of additional arp resolve flows rules are needed in normal
>> cases in addition to added burden of the special transit LS others flows.
>>
>
> I don't understand why that would be the case.
>


Each Transit LS creates an arp resolve flow for each peer router port.
Lets say we have 1000 HVs, each HV has 1000 tenants.
Lets says we have 1000 distributed logical routers, minimally 1 per tenant.
For simplicity, lets assume the 1000 distributed logical routers are
connected to a single lonely gateway (GR).

How many Transit LS (i.e. extra) distributed datapaths ?
1000 ?

How many Transit LS router type logical ports in total across all 1000
Transit LSs would you
expect with this 1000 HV network ?
1001 per Transit LS * 1000 Transit LSs = 1,001,000
Each of these requires a special arp resolve flow, as well - 1,001,000.

For distributed logical router to gateway connections via Transit LSs, we
have
1001 address per subnet for each of 1000 subnets = 1,001,000 addresses
but again we need 1001 addresses per subnet !



>
>
>>
>>
>>
>>
>> >
>> >
>> >> This is usually only done if there is existing switching equipment than
>> >> must be traversed
>> >> But in the case, we are just dealing with software where we have total
>> >> flexibility.
>> >>
>> >
>> > From OpenStack perspective, each tenant gets a router and multiple
>> > switches with different subnets. The idea is that the OpenStack network
>> > plugin, can at best split this single tenant router into 2 with a
>> switch in
>> > between on a  subnet that neutron does not use. Taking the same logic
>> > forward for k8s, I can support multiple gateways.
>> >
>>
>> As far as I have heard from Openstack folks, each tenant can create
>> multiple logical routers; they can even overlap subnets if they wanted.
>> So,
>> I am not sure the above assumption holds true.
>>
>
> A tenant can create multiple logical routers. But a connected logical
> topology can have one router connected to the gateway, atleast that is the
> common use case.
>

hmm, I thought there was some kind of limitation that required creating
Transit LSs,
but there seems no such limitation ? I guess we need to dig more on this
one.



>
>>
>>
>> >
>> > Connecting multiple routers to each other without a switch makes it a
>> pain
>> > to remember the interconnections and create multiple subnets (which may
>> > simply not be available for a networking backend for internal use).
>> >
>>
>> Well, just looking at the 10.x.x.x private IP range, there are more than
>> 16
>> million IP addresses
>>
>> Point to point direct router links, Rx<->Ry subnets, can be /31. In the
>> most extreme case, maybe 20,000 addresses would be used. I don’t think
>> there should be a subnet depletion problem here, unless it’s a contrived
>> misconfiguration.
>>
>
> Who is going to manage all the interconnections? As far as I see a transit
> LS is much more simpler.
>


Lets compare.

To be consistent with the previous example, I described earlier, lets again
take the 1000 HV
example, using 1000 distributed logical routers for 1000 tenants,
connected to a single and lonely gateway.

For distributed logical router to gateway connections only, we have
for direct DR<->GR connections:

2000 subnet address per distributed logical router * 1000 distributed
logical routers
= 2,000,000 vs 1,001,000 for Transit LSs usage; same magnitude,
but we only need 2 addresses per subnet compared to 1001 addresses per
subnet for Transit LS
usage.

There is no subnet depletion problem with direct DR<->GR connections, but
there may be
with DR<->Transit LS<->GR.

The question is who is going to manage 1000 subnets needing 1001 addresses
each, using Transit LSs.



>
>
>>
>>
>>
>>
>> >
>> >
>> >>
>> >>
>> >>>
>> >>> With the above goal in mind, this commit gives the general
>> >>> ability to connect multiple routers via a switch.
>> >>>
>> >>> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
>> >>> ---
>> >>> This needs the following 2 commits under review to
>> >>> first go in.
>> >>> 1. ofproto-dpif: Rename "recurse" to "indentation".
>> >>> 2. ofproto-dpif: Do not count resubmit to later tables against limit.
>> >>> ---
>> >>>  ovn/controller/lflow.c  |   19 ++--
>> >>>  ovn/northd/ovn-northd.c |   56 ++++++++++-
>> >>>  ovn/ovn-nb.xml          |    7 --
>> >>>  tests/ovn.at            |  236
>> >>> +++++++++++++++++++++++++++++++++++++++++++++++
>> >>>  4 files changed, 299 insertions(+), 19 deletions(-)
>> >>>
>> >>> diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
>> >>> index 96b7c66..efc427d 100644
>> >>> --- a/ovn/controller/lflow.c
>> >>> +++ b/ovn/controller/lflow.c
>> >>> @@ -215,15 +215,15 @@ add_logical_flows(struct controller_ctx *ctx,
>> >>> const struct lport_index *lports,
>> >>>          if (is_switch(ldp)) {
>> >>>              /* For a logical switch datapath, local_datapaths tells
>> us
>> >>> if there
>> >>>               * are any local ports for this datapath.  If not, we can
>> >>> skip
>> >>> -             * processing logical flows if the flow belongs to egress
>> >>> pipeline
>> >>> -             * or if that logical switch datapath is not patched to
>> any
>> >>> logical
>> >>> -             * router.
>> >>> +             * processing logical flows if that logical switch
>> datapath
>> >>> is not
>> >>> +             * patched to any logical router.
>> >>>               *
>> >>> -             * Otherwise, we still need the ingress pipeline because
>> >>> even if
>> >>> -             * there are no local ports, we still may need to execute
>> >>> the ingress
>> >>> -             * pipeline after a packet leaves a logical router.
>> Further
>> >>> -             * optimization is possible, but not based on what we
>> know
>> >>> with
>> >>> -             * local_datapaths right now.
>> >>> +             * Otherwise, we still need both ingress and egress
>> pipeline
>> >>> +             * because even if there are no local ports, we still may
>> >>> need to
>> >>> +             * execute the ingress pipeline after a packet leaves a
>> >>> logical
>> >>> +             * router and we need to do egress pipeline for a switch
>> >>> that
>> >>> +             * is connected to only routers.  Further optimization is
>> >>> possible,
>> >>> +             * but not based on what we know with local_datapaths
>> right
>> >>> now.
>> >>>               *
>> >>>               * A better approach would be a kind of "flood fill"
>> >>> algorithm:
>> >>>               *
>> >>> @@ -242,9 +242,6 @@ add_logical_flows(struct controller_ctx *ctx,
>> const
>> >>> struct lport_index *lports,
>> >>>               */
>> >>>
>> >>>              if (!get_local_datapath(local_datapaths,
>> ldp->tunnel_key)) {
>> >>> -                if (!ingress) {
>> >>> -                    continue;
>> >>> -                }
>> >>>
>> >>
>> >> This is removing a previous change that was done for some optimization
>> >> for avoiding
>> >> unnecessary flow creation.
>> >> I am not sure about the process here, but maybe this should be tracked
>> as
>> >> a separate
>> >> patch ?
>> >>
>> > The above change is needed for this patch to work. The optimization
>> above
>> > assumes that a switch always has atleast one real physical endpoint.
>> With
>> > this change, since a switch can only be connected to routers, we need to
>> > remove the optimization. The optimization itself will need more careful
>> > consideration and with more information provided to ovn-controller, it
>> can
>> > ideally be improved, but I would ideally not want l3 gateway work
>> delayed
>> > because of it.
>> >
>> >
>> My point was that this code change removes a previous optimization. This
>> leads to 2 questions – is it worth removing the optimization for the new
>> code as is and is it worth adding back optimizations with the added
>> complexity of the new code.
>>
>>
>>
>> >
>> >
>> >>
>> >>
>> >>
>> >>
>> >>>                  if (!get_patched_datapath(patched_datapaths,
>> >>>                                            ldp->tunnel_key)) {
>> >>>                      continue;
>> >>> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
>> >>> index 9e03606..3a29770 100644
>> >>> --- a/ovn/northd/ovn-northd.c
>> >>> +++ b/ovn/northd/ovn-northd.c
>> >>> @@ -2110,7 +2110,13 @@ build_lrouter_flows(struct hmap *datapaths,
>> >>> struct hmap *ports,
>> >>>                  free(actions);
>> >>>                  free(match);
>> >>>              }
>> >>> -        } else if (op->od->n_router_ports) {
>> >>> +        } else if (op->od->n_router_ports && strcmp(op->nbs->type,
>> >>> "router")) {
>> >>> +            /* This is a logical switch port that backs a VM or a
>> >>> container.
>> >>> +             * Extract its addresses. For each of the address, go
>> >>> through all
>> >>> +             * the router ports attached to the switch (to which this
>> >>> port
>> >>> +             * connects) and if the address in question is reachable
>> >>> from the
>> >>> +             * router port, add an ARP entry in that router's
>> pipeline.
>> >>> */
>> >>> +
>> >>>              for (size_t i = 0; i < op->nbs->n_addresses; i++) {
>> >>>                  struct lport_addresses laddrs;
>> >>>                  if (!extract_lport_addresses(op->nbs->addresses[i],
>> >>> &laddrs,
>> >>> @@ -2158,8 +2164,56 @@ build_lrouter_flows(struct hmap *datapaths,
>> >>> struct hmap *ports,
>> >>>
>> >>>                  free(laddrs.ipv4_addrs);
>> >>>              }
>> >>> +        } else if (!strcmp(op->nbs->type, "router")) {
>> >>> +            /* This is a logical switch port that connects to a
>> router.
>> >>> */
>> >>> +
>> >>> +            /* The peer of this switch port is the router port for
>> which
>> >>> +             * we need to add logical flows such that it can resolve
>> >>> +             * ARP entries for all the other router ports connected
>> to
>> >>> +             * the switch in question. */
>> >>> +
>> >>> +            const char *peer_name = smap_get(&op->nbs->options,
>> >>> +                                             "router-port");
>> >>> +            if (!peer_name) {
>> >>> +                continue;
>> >>> +            }
>> >>> +
>> >>> +            struct ovn_port *peer = ovn_port_find(ports, peer_name);
>> >>> +            if (!peer || !peer->nbr || !peer->ip) {
>> >>> +                continue;
>> >>> +            }
>> >>> +
>> >>> +            for (size_t j = 0; j < op->od->n_router_ports; j++) {
>> >>> +                const char *router_port_name = smap_get(
>> >>> +
>> >>> &op->od->router_ports[j]->nbs->options,
>> >>> +                                    "router-port");
>> >>> +                struct ovn_port *router_port = ovn_port_find(ports,
>> >>> +
>> >>>  router_port_name);
>> >>> +                if (!router_port || !router_port->nbr ||
>> >>> !router_port->ip) {
>> >>> +                    continue;
>> >>> +                }
>> >>> +
>> >>> +                /* Skip the router port under consideration. */
>> >>> +                if (router_port == peer) {
>> >>> +                   continue;
>> >>> +                }
>> >>> +
>> >>> +                if (!router_port->ip) {
>> >>> +                    continue;
>> >>> +                }
>> >>> +                char *match = xasprintf("outport == %s && reg0 ==
>> >>> "IP_FMT,
>> >>> +                                        peer->json_key,
>> >>> +                                        IP_ARGS(router_port->ip));
>> >>> +                char *actions = xasprintf("eth.dst = "ETH_ADDR_FMT";
>> >>> next;",
>> >>> +
>> >>> ETH_ADDR_ARGS(router_port->mac));
>> >>> +                ovn_lflow_add(lflows, peer->od,
>> S_ROUTER_IN_ARP_RESOLVE,
>> >>> +                              101, match, actions);
>> >>> +                free(actions);
>> >>> +                free(match);
>> >>> +            }
>> >>>          }
>> >>>      }
>> >>> +
>> >>>      HMAP_FOR_EACH (od, key_node, datapaths) {
>> >>>          if (!od->nbr) {
>> >>>              continue;
>> >>> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
>> >>> index c01455d..d7fd595 100644
>> >>> --- a/ovn/ovn-nb.xml
>> >>> +++ b/ovn/ovn-nb.xml
>> >>> @@ -154,13 +154,6 @@
>> >>>            These options apply when <ref column="type"/> is
>> >>> <code>router</code>.
>> >>>          </p>
>> >>>
>> >>> -        <p>
>> >>> -          If a given logical switch has multiple <code>router</code>
>> >>> ports, the
>> >>> -          <ref table="Logical_Router_Port"/> rows that they reference
>> >>> must be
>> >>> -          all on the same <ref table="Logical_Router"/> (for
>> different
>> >>> -          subnets).
>> >>> -        </p>
>> >>> -
>> >>>          <column name="options" key="router-port">
>> >>>            Required.  The <ref column="name"/> of the <ref
>> >>>            table="Logical_Router_Port"/> to which this logical switch
>> >>> port is
>> >>> diff --git a/tests/ovn.at b/tests/ovn.at
>> >>> index b9d7ada..6961be0 100644
>> >>> --- a/tests/ovn.at
>> >>> +++ b/tests/ovn.at
>> >>> @@ -2545,3 +2545,239 @@ OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>
>> >>>  AT_CLEANUP
>> >>>
>> >>> +AT_SETUP([ovn -- 2 HVs, 3 LRs connected via LS, static routes])
>> >>> +AT_KEYWORDS([ovnstaticroutes])
>> >>> +AT_SKIP_IF([test $HAVE_PYTHON = no])
>> >>> +ovn_start
>> >>> +
>> >>> +# Logical network:
>> >>> +# Three LRs - R1, R2 and R3 that are connected to each other via LS
>> >>> "join"
>> >>> +# in 20.0.0.0/24 network. R1 has switchess foo (192.168.1.0/24)
>> >>> +# connected to it. R2 has alice (172.16.1.0/24) and R3 has bob (
>> >>> 10.32.1.0/24)
>> >>> +# connected to it.
>> >>> +
>> >>> +ovn-nbctl create Logical_Router name=R1
>> >>> +ovn-nbctl create Logical_Router name=R2
>> >>> +ovn-nbctl create Logical_Router name=R3
>> >>> +
>> >>> +ovn-nbctl lswitch-add foo
>> >>> +ovn-nbctl lswitch-add alice
>> >>> +ovn-nbctl lswitch-add bob
>> >>> +ovn-nbctl lswitch-add join
>> >>> +
>> >>> +# Connect foo to R1
>> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=foo \
>> >>> +network=192.168.1.1/24 mac=\"00:00:01:01:02:03\" -- add
>> Logical_Router
>> >>> R1 \
>> >>> +ports @lrp -- lport-add foo rp-foo
>> >>> +
>> >>> +ovn-nbctl set Logical_port rp-foo type=router
>> options:router-port=foo \
>> >>> +addresses=\"00:00:01:01:02:03\"
>> >>> +
>> >>> +# Connect alice to R2
>> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=alice \
>> >>> +network=172.16.1.1/24 mac=\"00:00:02:01:02:03\" -- add
>> Logical_Router
>> >>> R2 \
>> >>> +ports @lrp -- lport-add alice rp-alice
>> >>> +
>> >>> +ovn-nbctl set Logical_port rp-alice type=router
>> >>> options:router-port=alice \
>> >>> +addresses=\"00:00:02:01:02:03\"
>> >>> +
>> >>> +# Connect bob to R3
>> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=bob \
>> >>> +network=10.32.1.1/24 mac=\"00:00:03:01:02:03\" -- add Logical_Router
>> >>> R3 \
>> >>> +ports @lrp -- lport-add bob rp-bob
>> >>> +
>> >>> +ovn-nbctl set Logical_port rp-bob type=router
>> options:router-port=bob \
>> >>> +addresses=\"00:00:03:01:02:03\"
>> >>> +
>> >>> +# Connect R1 to join
>> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R1_join \
>> >>> +network=20.0.0.1/24 mac=\"00:00:04:01:02:03\" -- add Logical_Router
>> R1
>> >>> \
>> >>> +ports @lrp -- lport-add join r1-join
>> >>> +
>> >>> +ovn-nbctl set Logical_port r1-join type=router
>> >>> options:router-port=R1_join \
>> >>> +addresses='"00:00:04:01:02:03"'
>> >>> +
>> >>> +# Connect R2 to join
>> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R2_join \
>> >>> +network=20.0.0.2/24 mac=\"00:00:04:01:02:04\" -- add Logical_Router
>> R2
>> >>> \
>> >>> +ports @lrp -- lport-add join r2-join
>> >>> +
>> >>> +ovn-nbctl set Logical_port r2-join type=router
>> >>> options:router-port=R2_join \
>> >>> +addresses='"00:00:04:01:02:04"'
>> >>> +
>> >>> +
>> >>> +# Connect R3 to join
>> >>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R3_join \
>> >>> +network=20.0.0.3/24 mac=\"00:00:04:01:02:05\" -- add Logical_Router
>> R3
>> >>> \
>> >>> +ports @lrp -- lport-add join r3-join
>> >>> +
>> >>> +ovn-nbctl set Logical_port r3-join type=router
>> >>> options:router-port=R3_join \
>> >>> +addresses='"00:00:04:01:02:05"'
>> >>> +
>> >>> +#install static routes
>> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
>> >>> +R1 static_routes @lrt
>> >>> +
>> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
>> >>> +R1 static_routes @lrt
>> >>> +
>> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
>> >>> +R2 static_routes @lrt
>> >>> +
>> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
>> >>> +R2 static_routes @lrt
>> >>> +
>> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
>> >>> +R3 static_routes @lrt
>> >>> +
>> >>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
>> >>> +R3 static_routes @lrt
>> >>> +
>> >>> +# Create logical port foo1 in foo
>> >>> +ovn-nbctl lport-add foo foo1 \
>> >>> +-- lport-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
>> >>> +
>> >>> +# Create logical port alice1 in alice
>> >>> +ovn-nbctl lport-add alice alice1 \
>> >>> +-- lport-set-addresses alice1 "f0:00:00:01:02:04 172.16.1.2"
>> >>> +
>> >>> +# Create logical port bob1 in bob
>> >>> +ovn-nbctl lport-add bob bob1 \
>> >>> +-- lport-set-addresses bob1 "f0:00:00:01:02:05 10.32.1.2"
>> >>> +
>> >>> +# Create two hypervisor and create OVS ports corresponding to logical
>> >>> ports.
>> >>> +net_add n1
>> >>> +
>> >>> +sim_add hv1
>> >>> +as hv1
>> >>> +ovs-vsctl add-br br-phys
>> >>> +ovn_attach n1 br-phys 192.168.0.1
>> >>> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
>> >>> +    set interface hv1-vif1 external-ids:iface-id=foo1 \
>> >>> +    options:tx_pcap=hv1/vif1-tx.pcap \
>> >>> +    options:rxq_pcap=hv1/vif1-rx.pcap \
>> >>> +    ofport-request=1
>> >>> +
>> >>> +ovs-vsctl -- add-port br-int hv1-vif2 -- \
>> >>> +    set interface hv1-vif2 external-ids:iface-id=alice1 \
>> >>> +    options:tx_pcap=hv1/vif2-tx.pcap \
>> >>> +    options:rxq_pcap=hv1/vif2-rx.pcap \
>> >>> +    ofport-request=2
>> >>> +
>> >>> +sim_add hv2
>> >>> +as hv2
>> >>> +ovs-vsctl add-br br-phys
>> >>> +ovn_attach n1 br-phys 192.168.0.2
>> >>> +ovs-vsctl -- add-port br-int hv2-vif1 -- \
>> >>> +    set interface hv2-vif1 external-ids:iface-id=bob1 \
>> >>> +    options:tx_pcap=hv2/vif1-tx.pcap \
>> >>> +    options:rxq_pcap=hv2/vif1-rx.pcap \
>> >>> +    ofport-request=1
>> >>> +
>> >>> +
>> >>> +# Pre-populate the hypervisors' ARP tables so that we don't lose any
>> >>> +# packets for ARP resolution (native tunneling doesn't queue packets
>> >>> +# for ARP resolution).
>> >>> +ovn_populate_arp
>> >>> +
>> >>> +# Allow some time for ovn-northd and ovn-controller to catch up.
>> >>> +# XXX This should be more systematic.
>> >>> +sleep 1
>> >>> +
>> >>> +ip_to_hex() {
>> >>> +    printf "%02x%02x%02x%02x" "$@"
>> >>> +}
>> >>> +trim_zeros() {
>> >>> +    sed 's/\(00\)\{1,\}$//'
>> >>> +}
>> >>> +
>> >>> +# Send ip packets between foo1 and alice1
>> >>> +src_mac="f00000010203"
>> >>> +dst_mac="000001010203"
>> >>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>> +dst_ip=`ip_to_hex 172 16 1 2`
>> >>>
>> >>>
>> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
>> >>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>> >>> +as hv1 ovs-appctl ofproto/trace br-int in_port=1 $packet
>> >>> +
>> >>> +# Send ip packets between foo1 and bob1
>> >>> +src_mac="f00000010203"
>> >>> +dst_mac="000001010203"
>> >>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>> +dst_ip=`ip_to_hex 10 32 1 2`
>> >>>
>> >>>
>> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
>> >>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>> >>> +
>> >>> +echo "---------NB dump-----"
>> >>> +ovn-nbctl show
>> >>> +echo "---------------------"
>> >>> +ovn-nbctl list logical_router
>> >>> +echo "---------------------"
>> >>> +ovn-nbctl list logical_router_port
>> >>> +echo "---------------------"
>> >>> +
>> >>> +echo "---------SB dump-----"
>> >>> +ovn-sbctl list datapath_binding
>> >>> +echo "---------------------"
>> >>> +ovn-sbctl list port_binding
>> >>> +echo "---------------------"
>> >>> +ovn-sbctl dump-flows
>> >>> +echo "---------------------"
>> >>> +
>> >>> +echo "------ hv1 dump ----------"
>> >>> +as hv1 ovs-ofctl show br-int
>> >>> +as hv1 ovs-ofctl dump-flows br-int
>> >>> +echo "------ hv2 dump ----------"
>> >>> +as hv2 ovs-ofctl show br-int
>> >>> +as hv2 ovs-ofctl dump-flows br-int
>> >>> +echo "----------------------------"
>> >>> +
>> >>> +# Packet to Expect at bob1
>> >>> +src_mac="000003010203"
>> >>> +dst_mac="f00000010205"
>> >>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>> +dst_ip=`ip_to_hex 10 32 1 2`
>> >>>
>> >>>
>> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
>> >>> +
>> >>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/vif1-tx.pcap |
>> >>> trim_zeros > received.packets
>> >>> +echo $expected | trim_zeros > expout
>> >>> +AT_CHECK([cat received.packets], [0], [expout])
>> >>> +
>> >>> +# Packet to Expect at alice1
>> >>> +src_mac="000002010203"
>> >>> +dst_mac="f00000010204"
>> >>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>> +dst_ip=`ip_to_hex 172 16 1 2`
>> >>>
>> >>>
>> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
>> >>> +
>> >>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap |
>> >>> trim_zeros > received1.packets
>> >>> +echo $expected | trim_zeros > expout
>> >>> +AT_CHECK([cat received1.packets], [0], [expout])
>> >>> +
>> >>> +for sim in hv1 hv2; do
>> >>> +    as $sim
>> >>> +    OVS_APP_EXIT_AND_WAIT([ovn-controller])
>> >>> +    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>> >>> +    OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>> +done
>> >>> +
>> >>> +as ovn-sb
>> >>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>> +
>> >>> +as ovn-nb
>> >>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>> +
>> >>> +as northd
>> >>> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
>> >>> +
>> >>> +as main
>> >>> +OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>> >>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>> +
>> >>> +AT_CLEANUP
>> >>> --
>> >>> 1.7.9.5
>> >>>
>> >>> _______________________________________________
>> >>> dev mailing list
>> >>> dev@openvswitch.org
>> >>> http://openvswitch.org/mailman/listinfo/dev
>> >>>
>> >>
>> >>
>> >
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> http://openvswitch.org/mailman/listinfo/dev
>>
>
>
Darrell Ball May 12, 2016, 5:45 a.m. UTC | #6
On Wed, May 11, 2016 at 8:51 PM, Guru Shetty <guru.ovn@gmail.com> wrote:

>
>
>
>
> > On May 11, 2016, at 8:45 PM, Darrell Ball <dlu998@gmail.com> wrote:
> >
> >> On Wed, May 11, 2016 at 4:42 PM, Guru Shetty <guru@ovn.org> wrote:
> >>
> >>
> >>>
> >>> Some reasons why having a “transit LS” is “undesirable” is:
> >>>
> >>> 1)    1)  It creates additional requirements at the CMS layer for
> setting
> >>> up networks; i.e. additional programming is required at the OVN
> northbound
> >>> interface for the special transit LSs, interactions with the logical
> >>> router
> >>> peers.
> >>
> >> Agreed that there is additional work needed for the CMS plugin. That
> work
> >> is needed even if it is just peering as they need to convert one router
> in
> >> to two in OVN (unless OVN automatically makes this split)
> >
> > The work to coordinate 2 logical routers and one special LS is more and
> > also more complicated than
> > to coordinate 2 logical routers.
> >
> >
> >>
> >>
> >>>
> >>> In cases where some NSX products do this, it is hidden from the user,
> as
> >>> one would minimally expect.
> >>>
> >>> 2)     2) From OVN POV, it adds an additional OVN datapath to all
> >>> processing to the packet path and programming/processing for that
> >>> datapath.
> >>>
> >>> because you have
> >>>
> >>> R1<->Transit LS<->R2
> >>>
> >>> vs
> >>>
> >>> R1<->R2
> >>
> >> Agreed that there is an additional datapath.
> >>
> >>
> >>>
> >>> 3)     3) You have to coordinate the transit LS subnet to handle all
> >>> addresses in this same subnet for all the logical routers and all their
> >>> transit LS peers.
> >>
> >> I don't understand what you mean. If a user uses one gateway, a transit
> LS
> >> only gets connected by 2 routers.
> >> Other routers get their own transit LS.
> >
> >
> > Each group of logical routers communicating has it own Transit LS.
> >
> > Take an example with one gateway router and 1000 distributed logical
> > routers for 1000 tenants/hv,
> > connecting 1000 HVs for now.
> > Lets assume each distributed logical router only talks to the gateway
> > router.
>
> That is a wrong assumption. Each tenant has his own gateway router (or
> more)
>

Its less of an assumption but more of an example for illustrative purposes;
but its good that you
mention it.

The DR<->GR direct connection approach as well as the transit LS approach
can re-use private
IP pools across internal distributed logical routers, which amount to VRF
contexts for tenants networks.

The Transit LS approach does not scale due to the large number of
distributed datapaths required and
other high special flow requirements. It has more complex and higher
subnetting requirements. In addition, there is greater complexity for
northbound management.







>
>
> > So thats 1000 Transit LSs.
> > -> 1001 addresses per subnet for each of 1000 subnets (1 for each Transit
> > LS) ?
> >
> >
> >
> >
> >>
> >>
> >>>
> >>> 4)    4) Seems like L3 load balancing, ECMP, would be more complicated
> at
> >>> best.
> >>>
> >>> 5)    5)  1000s of additional arp resolve flows rules are needed in
> normal
> >>> cases in addition to added burden of the special transit LS others
> flows.
> >>
> >> I don't understand why that would be the case.
> >
> >
> > Each Transit LS creates an arp resolve flow for each peer router port.
> > Lets say we have 1000 HVs, each HV has 1000 tenants.
> > Lets says we have 1000 distributed logical routers, minimally 1 per
> tenant.
> > For simplicity, lets assume the 1000 distributed logical routers are
> > connected to a single lonely gateway (GR).
> >
> > How many Transit LS (i.e. extra) distributed datapaths ?
> > 1000 ?
> >
> > How many Transit LS router type logical ports in total across all 1000
> > Transit LSs would you
> > expect with this 1000 HV network ?
> > 1001 per Transit LS * 1000 Transit LSs = 1,001,000
> > Each of these requires a special arp resolve flow, as well - 1,001,000.
> >
> > For distributed logical router to gateway connections via Transit LSs, we
> > have
> > 1001 address per subnet for each of 1000 subnets = 1,001,000 addresses
> > but again we need 1001 addresses per subnet !
> >
> >
> >
> >>
> >>
> >>>
> >>>
> >>>
> >>>
> >>>>
> >>>>
> >>>>> This is usually only done if there is existing switching equipment
> than
> >>>>> must be traversed
> >>>>> But in the case, we are just dealing with software where we have
> total
> >>>>> flexibility.
> >>>>
> >>>> From OpenStack perspective, each tenant gets a router and multiple
> >>>> switches with different subnets. The idea is that the OpenStack
> network
> >>>> plugin, can at best split this single tenant router into 2 with a
> >>> switch in
> >>>> between on a  subnet that neutron does not use. Taking the same logic
> >>>> forward for k8s, I can support multiple gateways.
> >>>
> >>> As far as I have heard from Openstack folks, each tenant can create
> >>> multiple logical routers; they can even overlap subnets if they wanted.
> >>> So,
> >>> I am not sure the above assumption holds true.
> >>
> >> A tenant can create multiple logical routers. But a connected logical
> >> topology can have one router connected to the gateway, atleast that is
> the
> >> common use case.
> >
> > hmm, I thought there was some kind of limitation that required creating
> > Transit LSs,
> > but there seems no such limitation ? I guess we need to dig more on this
> > one.
> >
> >
> >
> >>
> >>>
> >>>
> >>>>
> >>>> Connecting multiple routers to each other without a switch makes it a
> >>> pain
> >>>> to remember the interconnections and create multiple subnets (which
> may
> >>>> simply not be available for a networking backend for internal use).
> >>>
> >>> Well, just looking at the 10.x.x.x private IP range, there are more
> than
> >>> 16
> >>> million IP addresses
> >>>
> >>> Point to point direct router links, Rx<->Ry subnets, can be /31. In the
> >>> most extreme case, maybe 20,000 addresses would be used. I don’t think
> >>> there should be a subnet depletion problem here, unless it’s a
> contrived
> >>> misconfiguration.
> >>
> >> Who is going to manage all the interconnections? As far as I see a
> transit
> >> LS is much more simpler.
> >
> >
> > Lets compare.
> >
> > To be consistent with the previous example, I described earlier, lets
> again
> > take the 1000 HV
> > example, using 1000 distributed logical routers for 1000 tenants,
> > connected to a single and lonely gateway.
> >
> > For distributed logical router to gateway connections only, we have
> > for direct DR<->GR connections:
> >
> > 2000 subnet address per distributed logical router * 1000 distributed
> > logical routers
> > = 2,000,000 vs 1,001,000 for Transit LSs usage; same magnitude,
> > but we only need 2 addresses per subnet compared to 1001 addresses per
> > subnet for Transit LS
> > usage.
> >
> > There is no subnet depletion problem with direct DR<->GR connections, but
> > there may be
> > with DR<->Transit LS<->GR.
> >
> > The question is who is going to manage 1000 subnets needing 1001
> addresses
> > each, using Transit LSs.
> >
> >
> >
> >>
> >>
> >>>
> >>>
> >>>
> >>>
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> With the above goal in mind, this commit gives the general
> >>>>>> ability to connect multiple routers via a switch.
> >>>>>>
> >>>>>> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
> >>>>>> ---
> >>>>>> This needs the following 2 commits under review to
> >>>>>> first go in.
> >>>>>> 1. ofproto-dpif: Rename "recurse" to "indentation".
> >>>>>> 2. ofproto-dpif: Do not count resubmit to later tables against
> limit.
> >>>>>> ---
> >>>>>> ovn/controller/lflow.c  |   19 ++--
> >>>>>> ovn/northd/ovn-northd.c |   56 ++++++++++-
> >>>>>> ovn/ovn-nb.xml          |    7 --
> >>>>>> tests/ovn.at            |  236
> >>>>>> +++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>> 4 files changed, 299 insertions(+), 19 deletions(-)
> >>>>>>
> >>>>>> diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
> >>>>>> index 96b7c66..efc427d 100644
> >>>>>> --- a/ovn/controller/lflow.c
> >>>>>> +++ b/ovn/controller/lflow.c
> >>>>>> @@ -215,15 +215,15 @@ add_logical_flows(struct controller_ctx *ctx,
> >>>>>> const struct lport_index *lports,
> >>>>>>         if (is_switch(ldp)) {
> >>>>>>             /* For a logical switch datapath, local_datapaths tells
> >>> us
> >>>>>> if there
> >>>>>>              * are any local ports for this datapath.  If not, we
> can
> >>>>>> skip
> >>>>>> -             * processing logical flows if the flow belongs to
> egress
> >>>>>> pipeline
> >>>>>> -             * or if that logical switch datapath is not patched to
> >>> any
> >>>>>> logical
> >>>>>> -             * router.
> >>>>>> +             * processing logical flows if that logical switch
> >>> datapath
> >>>>>> is not
> >>>>>> +             * patched to any logical router.
> >>>>>>              *
> >>>>>> -             * Otherwise, we still need the ingress pipeline
> because
> >>>>>> even if
> >>>>>> -             * there are no local ports, we still may need to
> execute
> >>>>>> the ingress
> >>>>>> -             * pipeline after a packet leaves a logical router.
> >>> Further
> >>>>>> -             * optimization is possible, but not based on what we
> >>> know
> >>>>>> with
> >>>>>> -             * local_datapaths right now.
> >>>>>> +             * Otherwise, we still need both ingress and egress
> >>> pipeline
> >>>>>> +             * because even if there are no local ports, we still
> may
> >>>>>> need to
> >>>>>> +             * execute the ingress pipeline after a packet leaves a
> >>>>>> logical
> >>>>>> +             * router and we need to do egress pipeline for a
> switch
> >>>>>> that
> >>>>>> +             * is connected to only routers.  Further optimization
> is
> >>>>>> possible,
> >>>>>> +             * but not based on what we know with local_datapaths
> >>> right
> >>>>>> now.
> >>>>>>              *
> >>>>>>              * A better approach would be a kind of "flood fill"
> >>>>>> algorithm:
> >>>>>>              *
> >>>>>> @@ -242,9 +242,6 @@ add_logical_flows(struct controller_ctx *ctx,
> >>> const
> >>>>>> struct lport_index *lports,
> >>>>>>              */
> >>>>>>
> >>>>>>             if (!get_local_datapath(local_datapaths,
> >>> ldp->tunnel_key)) {
> >>>>>> -                if (!ingress) {
> >>>>>> -                    continue;
> >>>>>> -                }
> >>>>>
> >>>>> This is removing a previous change that was done for some
> optimization
> >>>>> for avoiding
> >>>>> unnecessary flow creation.
> >>>>> I am not sure about the process here, but maybe this should be
> tracked
> >>> as
> >>>>> a separate
> >>>>> patch ?
> >>>> The above change is needed for this patch to work. The optimization
> >>> above
> >>>> assumes that a switch always has atleast one real physical endpoint.
> >>> With
> >>>> this change, since a switch can only be connected to routers, we need
> to
> >>>> remove the optimization. The optimization itself will need more
> careful
> >>>> consideration and with more information provided to ovn-controller, it
> >>> can
> >>>> ideally be improved, but I would ideally not want l3 gateway work
> >>> delayed
> >>>> because of it.
> >>> My point was that this code change removes a previous optimization.
> This
> >>> leads to 2 questions – is it worth removing the optimization for the
> new
> >>> code as is and is it worth adding back optimizations with the added
> >>> complexity of the new code.
> >>>
> >>>
> >>>
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>                 if (!get_patched_datapath(patched_datapaths,
> >>>>>>                                           ldp->tunnel_key)) {
> >>>>>>                     continue;
> >>>>>> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
> >>>>>> index 9e03606..3a29770 100644
> >>>>>> --- a/ovn/northd/ovn-northd.c
> >>>>>> +++ b/ovn/northd/ovn-northd.c
> >>>>>> @@ -2110,7 +2110,13 @@ build_lrouter_flows(struct hmap *datapaths,
> >>>>>> struct hmap *ports,
> >>>>>>                 free(actions);
> >>>>>>                 free(match);
> >>>>>>             }
> >>>>>> -        } else if (op->od->n_router_ports) {
> >>>>>> +        } else if (op->od->n_router_ports && strcmp(op->nbs->type,
> >>>>>> "router")) {
> >>>>>> +            /* This is a logical switch port that backs a VM or a
> >>>>>> container.
> >>>>>> +             * Extract its addresses. For each of the address, go
> >>>>>> through all
> >>>>>> +             * the router ports attached to the switch (to which
> this
> >>>>>> port
> >>>>>> +             * connects) and if the address in question is
> reachable
> >>>>>> from the
> >>>>>> +             * router port, add an ARP entry in that router's
> >>> pipeline.
> >>>>>> */
> >>>>>> +
> >>>>>>             for (size_t i = 0; i < op->nbs->n_addresses; i++) {
> >>>>>>                 struct lport_addresses laddrs;
> >>>>>>                 if (!extract_lport_addresses(op->nbs->addresses[i],
> >>>>>> &laddrs,
> >>>>>> @@ -2158,8 +2164,56 @@ build_lrouter_flows(struct hmap *datapaths,
> >>>>>> struct hmap *ports,
> >>>>>>
> >>>>>>                 free(laddrs.ipv4_addrs);
> >>>>>>             }
> >>>>>> +        } else if (!strcmp(op->nbs->type, "router")) {
> >>>>>> +            /* This is a logical switch port that connects to a
> >>> router.
> >>>>>> */
> >>>>>> +
> >>>>>> +            /* The peer of this switch port is the router port for
> >>> which
> >>>>>> +             * we need to add logical flows such that it can
> resolve
> >>>>>> +             * ARP entries for all the other router ports connected
> >>> to
> >>>>>> +             * the switch in question. */
> >>>>>> +
> >>>>>> +            const char *peer_name = smap_get(&op->nbs->options,
> >>>>>> +                                             "router-port");
> >>>>>> +            if (!peer_name) {
> >>>>>> +                continue;
> >>>>>> +            }
> >>>>>> +
> >>>>>> +            struct ovn_port *peer = ovn_port_find(ports,
> peer_name);
> >>>>>> +            if (!peer || !peer->nbr || !peer->ip) {
> >>>>>> +                continue;
> >>>>>> +            }
> >>>>>> +
> >>>>>> +            for (size_t j = 0; j < op->od->n_router_ports; j++) {
> >>>>>> +                const char *router_port_name = smap_get(
> >>>>>> +
> >>>>>> &op->od->router_ports[j]->nbs->options,
> >>>>>> +                                    "router-port");
> >>>>>> +                struct ovn_port *router_port = ovn_port_find(ports,
> >>>>>> +
> >>>>>> router_port_name);
> >>>>>> +                if (!router_port || !router_port->nbr ||
> >>>>>> !router_port->ip) {
> >>>>>> +                    continue;
> >>>>>> +                }
> >>>>>> +
> >>>>>> +                /* Skip the router port under consideration. */
> >>>>>> +                if (router_port == peer) {
> >>>>>> +                   continue;
> >>>>>> +                }
> >>>>>> +
> >>>>>> +                if (!router_port->ip) {
> >>>>>> +                    continue;
> >>>>>> +                }
> >>>>>> +                char *match = xasprintf("outport == %s && reg0 ==
> >>>>>> "IP_FMT,
> >>>>>> +                                        peer->json_key,
> >>>>>> +                                        IP_ARGS(router_port->ip));
> >>>>>> +                char *actions = xasprintf("eth.dst =
> "ETH_ADDR_FMT";
> >>>>>> next;",
> >>>>>> +
> >>>>>> ETH_ADDR_ARGS(router_port->mac));
> >>>>>> +                ovn_lflow_add(lflows, peer->od,
> >>> S_ROUTER_IN_ARP_RESOLVE,
> >>>>>> +                              101, match, actions);
> >>>>>> +                free(actions);
> >>>>>> +                free(match);
> >>>>>> +            }
> >>>>>>         }
> >>>>>>     }
> >>>>>> +
> >>>>>>     HMAP_FOR_EACH (od, key_node, datapaths) {
> >>>>>>         if (!od->nbr) {
> >>>>>>             continue;
> >>>>>> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
> >>>>>> index c01455d..d7fd595 100644
> >>>>>> --- a/ovn/ovn-nb.xml
> >>>>>> +++ b/ovn/ovn-nb.xml
> >>>>>> @@ -154,13 +154,6 @@
> >>>>>>           These options apply when <ref column="type"/> is
> >>>>>> <code>router</code>.
> >>>>>>         </p>
> >>>>>>
> >>>>>> -        <p>
> >>>>>> -          If a given logical switch has multiple
> <code>router</code>
> >>>>>> ports, the
> >>>>>> -          <ref table="Logical_Router_Port"/> rows that they
> reference
> >>>>>> must be
> >>>>>> -          all on the same <ref table="Logical_Router"/> (for
> >>> different
> >>>>>> -          subnets).
> >>>>>> -        </p>
> >>>>>> -
> >>>>>>         <column name="options" key="router-port">
> >>>>>>           Required.  The <ref column="name"/> of the <ref
> >>>>>>           table="Logical_Router_Port"/> to which this logical switch
> >>>>>> port is
> >>>>>> diff --git a/tests/ovn.at b/tests/ovn.at
> >>>>>> index b9d7ada..6961be0 100644
> >>>>>> --- a/tests/ovn.at
> >>>>>> +++ b/tests/ovn.at
> >>>>>> @@ -2545,3 +2545,239 @@ OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >>>>>>
> >>>>>> AT_CLEANUP
> >>>>>>
> >>>>>> +AT_SETUP([ovn -- 2 HVs, 3 LRs connected via LS, static routes])
> >>>>>> +AT_KEYWORDS([ovnstaticroutes])
> >>>>>> +AT_SKIP_IF([test $HAVE_PYTHON = no])
> >>>>>> +ovn_start
> >>>>>> +
> >>>>>> +# Logical network:
> >>>>>> +# Three LRs - R1, R2 and R3 that are connected to each other via LS
> >>>>>> "join"
> >>>>>> +# in 20.0.0.0/24 network. R1 has switchess foo (192.168.1.0/24)
> >>>>>> +# connected to it. R2 has alice (172.16.1.0/24) and R3 has bob (
> >>>>>> 10.32.1.0/24)
> >>>>>> +# connected to it.
> >>>>>> +
> >>>>>> +ovn-nbctl create Logical_Router name=R1
> >>>>>> +ovn-nbctl create Logical_Router name=R2
> >>>>>> +ovn-nbctl create Logical_Router name=R3
> >>>>>> +
> >>>>>> +ovn-nbctl lswitch-add foo
> >>>>>> +ovn-nbctl lswitch-add alice
> >>>>>> +ovn-nbctl lswitch-add bob
> >>>>>> +ovn-nbctl lswitch-add join
> >>>>>> +
> >>>>>> +# Connect foo to R1
> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=foo \
> >>>>>> +network=192.168.1.1/24 mac=\"00:00:01:01:02:03\" -- add
> >>> Logical_Router
> >>>>>> R1 \
> >>>>>> +ports @lrp -- lport-add foo rp-foo
> >>>>>> +
> >>>>>> +ovn-nbctl set Logical_port rp-foo type=router
> >>> options:router-port=foo \
> >>>>>> +addresses=\"00:00:01:01:02:03\"
> >>>>>> +
> >>>>>> +# Connect alice to R2
> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=alice \
> >>>>>> +network=172.16.1.1/24 mac=\"00:00:02:01:02:03\" -- add
> >>> Logical_Router
> >>>>>> R2 \
> >>>>>> +ports @lrp -- lport-add alice rp-alice
> >>>>>> +
> >>>>>> +ovn-nbctl set Logical_port rp-alice type=router
> >>>>>> options:router-port=alice \
> >>>>>> +addresses=\"00:00:02:01:02:03\"
> >>>>>> +
> >>>>>> +# Connect bob to R3
> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=bob \
> >>>>>> +network=10.32.1.1/24 mac=\"00:00:03:01:02:03\" -- add
> Logical_Router
> >>>>>> R3 \
> >>>>>> +ports @lrp -- lport-add bob rp-bob
> >>>>>> +
> >>>>>> +ovn-nbctl set Logical_port rp-bob type=router
> >>> options:router-port=bob \
> >>>>>> +addresses=\"00:00:03:01:02:03\"
> >>>>>> +
> >>>>>> +# Connect R1 to join
> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R1_join \
> >>>>>> +network=20.0.0.1/24 mac=\"00:00:04:01:02:03\" -- add
> Logical_Router
> >>> R1
> >>>>>> \
> >>>>>> +ports @lrp -- lport-add join r1-join
> >>>>>> +
> >>>>>> +ovn-nbctl set Logical_port r1-join type=router
> >>>>>> options:router-port=R1_join \
> >>>>>> +addresses='"00:00:04:01:02:03"'
> >>>>>> +
> >>>>>> +# Connect R2 to join
> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R2_join \
> >>>>>> +network=20.0.0.2/24 mac=\"00:00:04:01:02:04\" -- add
> Logical_Router
> >>> R2
> >>>>>> \
> >>>>>> +ports @lrp -- lport-add join r2-join
> >>>>>> +
> >>>>>> +ovn-nbctl set Logical_port r2-join type=router
> >>>>>> options:router-port=R2_join \
> >>>>>> +addresses='"00:00:04:01:02:04"'
> >>>>>> +
> >>>>>> +
> >>>>>> +# Connect R3 to join
> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R3_join \
> >>>>>> +network=20.0.0.3/24 mac=\"00:00:04:01:02:05\" -- add
> Logical_Router
> >>> R3
> >>>>>> \
> >>>>>> +ports @lrp -- lport-add join r3-join
> >>>>>> +
> >>>>>> +ovn-nbctl set Logical_port r3-join type=router
> >>>>>> options:router-port=R3_join \
> >>>>>> +addresses='"00:00:04:01:02:05"'
> >>>>>> +
> >>>>>> +#install static routes
> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>>>>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
> >>>>>> +R1 static_routes @lrt
> >>>>>> +
> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>>>>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
> >>>>>> +R1 static_routes @lrt
> >>>>>> +
> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>>>>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
> >>>>>> +R2 static_routes @lrt
> >>>>>> +
> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>>>>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
> >>>>>> +R2 static_routes @lrt
> >>>>>> +
> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>>>>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
> >>>>>> +R3 static_routes @lrt
> >>>>>> +
> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> >>>>>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
> >>>>>> +R3 static_routes @lrt
> >>>>>> +
> >>>>>> +# Create logical port foo1 in foo
> >>>>>> +ovn-nbctl lport-add foo foo1 \
> >>>>>> +-- lport-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
> >>>>>> +
> >>>>>> +# Create logical port alice1 in alice
> >>>>>> +ovn-nbctl lport-add alice alice1 \
> >>>>>> +-- lport-set-addresses alice1 "f0:00:00:01:02:04 172.16.1.2"
> >>>>>> +
> >>>>>> +# Create logical port bob1 in bob
> >>>>>> +ovn-nbctl lport-add bob bob1 \
> >>>>>> +-- lport-set-addresses bob1 "f0:00:00:01:02:05 10.32.1.2"
> >>>>>> +
> >>>>>> +# Create two hypervisor and create OVS ports corresponding to
> logical
> >>>>>> ports.
> >>>>>> +net_add n1
> >>>>>> +
> >>>>>> +sim_add hv1
> >>>>>> +as hv1
> >>>>>> +ovs-vsctl add-br br-phys
> >>>>>> +ovn_attach n1 br-phys 192.168.0.1
> >>>>>> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> >>>>>> +    set interface hv1-vif1 external-ids:iface-id=foo1 \
> >>>>>> +    options:tx_pcap=hv1/vif1-tx.pcap \
> >>>>>> +    options:rxq_pcap=hv1/vif1-rx.pcap \
> >>>>>> +    ofport-request=1
> >>>>>> +
> >>>>>> +ovs-vsctl -- add-port br-int hv1-vif2 -- \
> >>>>>> +    set interface hv1-vif2 external-ids:iface-id=alice1 \
> >>>>>> +    options:tx_pcap=hv1/vif2-tx.pcap \
> >>>>>> +    options:rxq_pcap=hv1/vif2-rx.pcap \
> >>>>>> +    ofport-request=2
> >>>>>> +
> >>>>>> +sim_add hv2
> >>>>>> +as hv2
> >>>>>> +ovs-vsctl add-br br-phys
> >>>>>> +ovn_attach n1 br-phys 192.168.0.2
> >>>>>> +ovs-vsctl -- add-port br-int hv2-vif1 -- \
> >>>>>> +    set interface hv2-vif1 external-ids:iface-id=bob1 \
> >>>>>> +    options:tx_pcap=hv2/vif1-tx.pcap \
> >>>>>> +    options:rxq_pcap=hv2/vif1-rx.pcap \
> >>>>>> +    ofport-request=1
> >>>>>> +
> >>>>>> +
> >>>>>> +# Pre-populate the hypervisors' ARP tables so that we don't lose
> any
> >>>>>> +# packets for ARP resolution (native tunneling doesn't queue
> packets
> >>>>>> +# for ARP resolution).
> >>>>>> +ovn_populate_arp
> >>>>>> +
> >>>>>> +# Allow some time for ovn-northd and ovn-controller to catch up.
> >>>>>> +# XXX This should be more systematic.
> >>>>>> +sleep 1
> >>>>>> +
> >>>>>> +ip_to_hex() {
> >>>>>> +    printf "%02x%02x%02x%02x" "$@"
> >>>>>> +}
> >>>>>> +trim_zeros() {
> >>>>>> +    sed 's/\(00\)\{1,\}$//'
> >>>>>> +}
> >>>>>> +
> >>>>>> +# Send ip packets between foo1 and alice1
> >>>>>> +src_mac="f00000010203"
> >>>>>> +dst_mac="000001010203"
> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
> >>>>>> +dst_ip=`ip_to_hex 172 16 1 2`
> >>>
> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
> >>>>>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
> >>>>>> +as hv1 ovs-appctl ofproto/trace br-int in_port=1 $packet
> >>>>>> +
> >>>>>> +# Send ip packets between foo1 and bob1
> >>>>>> +src_mac="f00000010203"
> >>>>>> +dst_mac="000001010203"
> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
> >>>>>> +dst_ip=`ip_to_hex 10 32 1 2`
> >>>
> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
> >>>>>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
> >>>>>> +
> >>>>>> +echo "---------NB dump-----"
> >>>>>> +ovn-nbctl show
> >>>>>> +echo "---------------------"
> >>>>>> +ovn-nbctl list logical_router
> >>>>>> +echo "---------------------"
> >>>>>> +ovn-nbctl list logical_router_port
> >>>>>> +echo "---------------------"
> >>>>>> +
> >>>>>> +echo "---------SB dump-----"
> >>>>>> +ovn-sbctl list datapath_binding
> >>>>>> +echo "---------------------"
> >>>>>> +ovn-sbctl list port_binding
> >>>>>> +echo "---------------------"
> >>>>>> +ovn-sbctl dump-flows
> >>>>>> +echo "---------------------"
> >>>>>> +
> >>>>>> +echo "------ hv1 dump ----------"
> >>>>>> +as hv1 ovs-ofctl show br-int
> >>>>>> +as hv1 ovs-ofctl dump-flows br-int
> >>>>>> +echo "------ hv2 dump ----------"
> >>>>>> +as hv2 ovs-ofctl show br-int
> >>>>>> +as hv2 ovs-ofctl dump-flows br-int
> >>>>>> +echo "----------------------------"
> >>>>>> +
> >>>>>> +# Packet to Expect at bob1
> >>>>>> +src_mac="000003010203"
> >>>>>> +dst_mac="f00000010205"
> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
> >>>>>> +dst_ip=`ip_to_hex 10 32 1 2`
> >>>
> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
> >>>>>> +
> >>>>>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/vif1-tx.pcap |
> >>>>>> trim_zeros > received.packets
> >>>>>> +echo $expected | trim_zeros > expout
> >>>>>> +AT_CHECK([cat received.packets], [0], [expout])
> >>>>>> +
> >>>>>> +# Packet to Expect at alice1
> >>>>>> +src_mac="000002010203"
> >>>>>> +dst_mac="f00000010204"
> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
> >>>>>> +dst_ip=`ip_to_hex 172 16 1 2`
> >>>
> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
> >>>>>> +
> >>>>>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap |
> >>>>>> trim_zeros > received1.packets
> >>>>>> +echo $expected | trim_zeros > expout
> >>>>>> +AT_CHECK([cat received1.packets], [0], [expout])
> >>>>>> +
> >>>>>> +for sim in hv1 hv2; do
> >>>>>> +    as $sim
> >>>>>> +    OVS_APP_EXIT_AND_WAIT([ovn-controller])
> >>>>>> +    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
> >>>>>> +    OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >>>>>> +done
> >>>>>> +
> >>>>>> +as ovn-sb
> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >>>>>> +
> >>>>>> +as ovn-nb
> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >>>>>> +
> >>>>>> +as northd
> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> >>>>>> +
> >>>>>> +as main
> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >>>>>> +
> >>>>>> +AT_CLEANUP
> >>>>>> --
> >>>>>> 1.7.9.5
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> dev mailing list
> >>>>>> dev@openvswitch.org
> >>>>>> http://openvswitch.org/mailman/listinfo/dev
> >>> _______________________________________________
> >>> dev mailing list
> >>> dev@openvswitch.org
> >>> http://openvswitch.org/mailman/listinfo/dev
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > http://openvswitch.org/mailman/listinfo/dev
>
Guru Shetty May 12, 2016, 1:03 p.m. UTC | #7
> On May 11, 2016, at 10:45 PM, Darrell Ball <dlu998@gmail.com> wrote:
> 
> 
> 
>> On Wed, May 11, 2016 at 8:51 PM, Guru Shetty <guru.ovn@gmail.com> wrote:
>> 
>> 
>> 
>> 
>> > On May 11, 2016, at 8:45 PM, Darrell Ball <dlu998@gmail.com> wrote:
>> >
>> >> On Wed, May 11, 2016 at 4:42 PM, Guru Shetty <guru@ovn.org> wrote:
>> >>
>> >>
>> >>>
>> >>> Some reasons why having a “transit LS” is “undesirable” is:
>> >>>
>> >>> 1)    1)  It creates additional requirements at the CMS layer for setting
>> >>> up networks; i.e. additional programming is required at the OVN northbound
>> >>> interface for the special transit LSs, interactions with the logical
>> >>> router
>> >>> peers.
>> >>
>> >> Agreed that there is additional work needed for the CMS plugin. That work
>> >> is needed even if it is just peering as they need to convert one router in
>> >> to two in OVN (unless OVN automatically makes this split)
>> >
>> > The work to coordinate 2 logical routers and one special LS is more and
>> > also more complicated than
>> > to coordinate 2 logical routers.
>> >
>> >
>> >>
>> >>
>> >>>
>> >>> In cases where some NSX products do this, it is hidden from the user, as
>> >>> one would minimally expect.
>> >>>
>> >>> 2)     2) From OVN POV, it adds an additional OVN datapath to all
>> >>> processing to the packet path and programming/processing for that
>> >>> datapath.
>> >>>
>> >>> because you have
>> >>>
>> >>> R1<->Transit LS<->R2
>> >>>
>> >>> vs
>> >>>
>> >>> R1<->R2
>> >>
>> >> Agreed that there is an additional datapath.
>> >>
>> >>
>> >>>
>> >>> 3)     3) You have to coordinate the transit LS subnet to handle all
>> >>> addresses in this same subnet for all the logical routers and all their
>> >>> transit LS peers.
>> >>
>> >> I don't understand what you mean. If a user uses one gateway, a transit LS
>> >> only gets connected by 2 routers.
>> >> Other routers get their own transit LS.
>> >
>> >
>> > Each group of logical routers communicating has it own Transit LS.
>> >
>> > Take an example with one gateway router and 1000 distributed logical
>> > routers for 1000 tenants/hv,
>> > connecting 1000 HVs for now.
>> > Lets assume each distributed logical router only talks to the gateway
>> > router.
>> 
>> That is a wrong assumption. Each tenant has his own gateway router (or more)
> 
> Its less of an assumption but more of an example for illustrative purposes; but its good that you
> mention it.

I think one of the main discussion points was needing thousands of arp flows and thousands of subnets, and it was on an incorrect logical topology, I am glad that it is not an issue any more. 
> 
> The DR<->GR direct connection approach as well as the transit LS approach can re-use private
> IP pools across internal distributed logical routers, which amount to VRF contexts for tenants networks.

> 
> The Transit LS approach does not scale due to the large number of distributed datapaths required and 
> other high special flow requirements. It has more complex and higher subnetting requirements. In addition, there is greater complexity for northbound management.

Okay, now to summarize from my understanding:
* A transit LS uses one internal subnet to connect multiple GR with one DR whereas direct multiple GR to one DR  via peering uses multiple internal subnets. 
* A transit LS uses an additional logical datapath (split between 2 machines via tunnel) per logical topology which is a disadvantage as it involves going through an additional egress or ingress openflow pipeline in one host.
* A transit LS lets you split a DR and GR in such a way that the packet entering physical gateway gets into egress pipeline of a switch and can be made to renter the ingress pipeline of a router making it easier to apply stateful policies as packets always enter ingress pipeline of a router in all directions (NS, SN and east west)
* The general ability to connect multiple router to a switch (which this patch is about) also lets you connect your physical interface of your physical gateway connected to a physical topology to a LS in ovn which inturn is connected to multiple GRs. Each GR will have floating ips and will respond to ARPs for those floating IPs.

Overall, Though I see the possibility of implementing direct GR to DR connections via peering, it feels right now that it will be additional work for not a lot of added benefits.





> 
> 
> 
> 
> 
>  
>> 
>> 
>> > So thats 1000 Transit LSs.
>> > -> 1001 addresses per subnet for each of 1000 subnets (1 for each Transit
>> > LS) ?
>> >
>> >
>> >
>> >
>> >>
>> >>
>> >>>
>> >>> 4)    4) Seems like L3 load balancing, ECMP, would be more complicated at
>> >>> best.
>> >>>
>> >>> 5)    5)  1000s of additional arp resolve flows rules are needed in normal
>> >>> cases in addition to added burden of the special transit LS others flows.
>> >>
>> >> I don't understand why that would be the case.
>> >
>> >
>> > Each Transit LS creates an arp resolve flow for each peer router port.
>> > Lets say we have 1000 HVs, each HV has 1000 tenants.
>> > Lets says we have 1000 distributed logical routers, minimally 1 per tenant.
>> > For simplicity, lets assume the 1000 distributed logical routers are
>> > connected to a single lonely gateway (GR).
>> >
>> > How many Transit LS (i.e. extra) distributed datapaths ?
>> > 1000 ?
>> >
>> > How many Transit LS router type logical ports in total across all 1000
>> > Transit LSs would you
>> > expect with this 1000 HV network ?
>> > 1001 per Transit LS * 1000 Transit LSs = 1,001,000
>> > Each of these requires a special arp resolve flow, as well - 1,001,000.
>> >
>> > For distributed logical router to gateway connections via Transit LSs, we
>> > have
>> > 1001 address per subnet for each of 1000 subnets = 1,001,000 addresses
>> > but again we need 1001 addresses per subnet !
>> >
>> >
>> >
>> >>
>> >>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>>
>> >>>>
>> >>>>> This is usually only done if there is existing switching equipment than
>> >>>>> must be traversed
>> >>>>> But in the case, we are just dealing with software where we have total
>> >>>>> flexibility.
>> >>>>
>> >>>> From OpenStack perspective, each tenant gets a router and multiple
>> >>>> switches with different subnets. The idea is that the OpenStack network
>> >>>> plugin, can at best split this single tenant router into 2 with a
>> >>> switch in
>> >>>> between on a  subnet that neutron does not use. Taking the same logic
>> >>>> forward for k8s, I can support multiple gateways.
>> >>>
>> >>> As far as I have heard from Openstack folks, each tenant can create
>> >>> multiple logical routers; they can even overlap subnets if they wanted.
>> >>> So,
>> >>> I am not sure the above assumption holds true.
>> >>
>> >> A tenant can create multiple logical routers. But a connected logical
>> >> topology can have one router connected to the gateway, atleast that is the
>> >> common use case.
>> >
>> > hmm, I thought there was some kind of limitation that required creating
>> > Transit LSs,
>> > but there seems no such limitation ? I guess we need to dig more on this
>> > one.
>> >
>> >
>> >
>> >>
>> >>>
>> >>>
>> >>>>
>> >>>> Connecting multiple routers to each other without a switch makes it a
>> >>> pain
>> >>>> to remember the interconnections and create multiple subnets (which may
>> >>>> simply not be available for a networking backend for internal use).
>> >>>
>> >>> Well, just looking at the 10.x.x.x private IP range, there are more than
>> >>> 16
>> >>> million IP addresses
>> >>>
>> >>> Point to point direct router links, Rx<->Ry subnets, can be /31. In the
>> >>> most extreme case, maybe 20,000 addresses would be used. I don’t think
>> >>> there should be a subnet depletion problem here, unless it’s a contrived
>> >>> misconfiguration.
>> >>
>> >> Who is going to manage all the interconnections? As far as I see a transit
>> >> LS is much more simpler.
>> >
>> >
>> > Lets compare.
>> >
>> > To be consistent with the previous example, I described earlier, lets again
>> > take the 1000 HV
>> > example, using 1000 distributed logical routers for 1000 tenants,
>> > connected to a single and lonely gateway.
>> >
>> > For distributed logical router to gateway connections only, we have
>> > for direct DR<->GR connections:
>> >
>> > 2000 subnet address per distributed logical router * 1000 distributed
>> > logical routers
>> > = 2,000,000 vs 1,001,000 for Transit LSs usage; same magnitude,
>> > but we only need 2 addresses per subnet compared to 1001 addresses per
>> > subnet for Transit LS
>> > usage.
>> >
>> > There is no subnet depletion problem with direct DR<->GR connections, but
>> > there may be
>> > with DR<->Transit LS<->GR.
>> >
>> > The question is who is going to manage 1000 subnets needing 1001 addresses
>> > each, using Transit LSs.
>> >
>> >
>> >
>> >>
>> >>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>>
>> >>>>
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>> With the above goal in mind, this commit gives the general
>> >>>>>> ability to connect multiple routers via a switch.
>> >>>>>>
>> >>>>>> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
>> >>>>>> ---
>> >>>>>> This needs the following 2 commits under review to
>> >>>>>> first go in.
>> >>>>>> 1. ofproto-dpif: Rename "recurse" to "indentation".
>> >>>>>> 2. ofproto-dpif: Do not count resubmit to later tables against limit.
>> >>>>>> ---
>> >>>>>> ovn/controller/lflow.c  |   19 ++--
>> >>>>>> ovn/northd/ovn-northd.c |   56 ++++++++++-
>> >>>>>> ovn/ovn-nb.xml          |    7 --
>> >>>>>> tests/ovn.at            |  236
>> >>>>>> +++++++++++++++++++++++++++++++++++++++++++++++
>> >>>>>> 4 files changed, 299 insertions(+), 19 deletions(-)
>> >>>>>>
>> >>>>>> diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
>> >>>>>> index 96b7c66..efc427d 100644
>> >>>>>> --- a/ovn/controller/lflow.c
>> >>>>>> +++ b/ovn/controller/lflow.c
>> >>>>>> @@ -215,15 +215,15 @@ add_logical_flows(struct controller_ctx *ctx,
>> >>>>>> const struct lport_index *lports,
>> >>>>>>         if (is_switch(ldp)) {
>> >>>>>>             /* For a logical switch datapath, local_datapaths tells
>> >>> us
>> >>>>>> if there
>> >>>>>>              * are any local ports for this datapath.  If not, we can
>> >>>>>> skip
>> >>>>>> -             * processing logical flows if the flow belongs to egress
>> >>>>>> pipeline
>> >>>>>> -             * or if that logical switch datapath is not patched to
>> >>> any
>> >>>>>> logical
>> >>>>>> -             * router.
>> >>>>>> +             * processing logical flows if that logical switch
>> >>> datapath
>> >>>>>> is not
>> >>>>>> +             * patched to any logical router.
>> >>>>>>              *
>> >>>>>> -             * Otherwise, we still need the ingress pipeline because
>> >>>>>> even if
>> >>>>>> -             * there are no local ports, we still may need to execute
>> >>>>>> the ingress
>> >>>>>> -             * pipeline after a packet leaves a logical router.
>> >>> Further
>> >>>>>> -             * optimization is possible, but not based on what we
>> >>> know
>> >>>>>> with
>> >>>>>> -             * local_datapaths right now.
>> >>>>>> +             * Otherwise, we still need both ingress and egress
>> >>> pipeline
>> >>>>>> +             * because even if there are no local ports, we still may
>> >>>>>> need to
>> >>>>>> +             * execute the ingress pipeline after a packet leaves a
>> >>>>>> logical
>> >>>>>> +             * router and we need to do egress pipeline for a switch
>> >>>>>> that
>> >>>>>> +             * is connected to only routers.  Further optimization is
>> >>>>>> possible,
>> >>>>>> +             * but not based on what we know with local_datapaths
>> >>> right
>> >>>>>> now.
>> >>>>>>              *
>> >>>>>>              * A better approach would be a kind of "flood fill"
>> >>>>>> algorithm:
>> >>>>>>              *
>> >>>>>> @@ -242,9 +242,6 @@ add_logical_flows(struct controller_ctx *ctx,
>> >>> const
>> >>>>>> struct lport_index *lports,
>> >>>>>>              */
>> >>>>>>
>> >>>>>>             if (!get_local_datapath(local_datapaths,
>> >>> ldp->tunnel_key)) {
>> >>>>>> -                if (!ingress) {
>> >>>>>> -                    continue;
>> >>>>>> -                }
>> >>>>>
>> >>>>> This is removing a previous change that was done for some optimization
>> >>>>> for avoiding
>> >>>>> unnecessary flow creation.
>> >>>>> I am not sure about the process here, but maybe this should be tracked
>> >>> as
>> >>>>> a separate
>> >>>>> patch ?
>> >>>> The above change is needed for this patch to work. The optimization
>> >>> above
>> >>>> assumes that a switch always has atleast one real physical endpoint.
>> >>> With
>> >>>> this change, since a switch can only be connected to routers, we need to
>> >>>> remove the optimization. The optimization itself will need more careful
>> >>>> consideration and with more information provided to ovn-controller, it
>> >>> can
>> >>>> ideally be improved, but I would ideally not want l3 gateway work
>> >>> delayed
>> >>>> because of it.
>> >>> My point was that this code change removes a previous optimization. This
>> >>> leads to 2 questions – is it worth removing the optimization for the new
>> >>> code as is and is it worth adding back optimizations with the added
>> >>> complexity of the new code.
>> >>>
>> >>>
>> >>>
>> >>>>
>> >>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>>                 if (!get_patched_datapath(patched_datapaths,
>> >>>>>>                                           ldp->tunnel_key)) {
>> >>>>>>                     continue;
>> >>>>>> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
>> >>>>>> index 9e03606..3a29770 100644
>> >>>>>> --- a/ovn/northd/ovn-northd.c
>> >>>>>> +++ b/ovn/northd/ovn-northd.c
>> >>>>>> @@ -2110,7 +2110,13 @@ build_lrouter_flows(struct hmap *datapaths,
>> >>>>>> struct hmap *ports,
>> >>>>>>                 free(actions);
>> >>>>>>                 free(match);
>> >>>>>>             }
>> >>>>>> -        } else if (op->od->n_router_ports) {
>> >>>>>> +        } else if (op->od->n_router_ports && strcmp(op->nbs->type,
>> >>>>>> "router")) {
>> >>>>>> +            /* This is a logical switch port that backs a VM or a
>> >>>>>> container.
>> >>>>>> +             * Extract its addresses. For each of the address, go
>> >>>>>> through all
>> >>>>>> +             * the router ports attached to the switch (to which this
>> >>>>>> port
>> >>>>>> +             * connects) and if the address in question is reachable
>> >>>>>> from the
>> >>>>>> +             * router port, add an ARP entry in that router's
>> >>> pipeline.
>> >>>>>> */
>> >>>>>> +
>> >>>>>>             for (size_t i = 0; i < op->nbs->n_addresses; i++) {
>> >>>>>>                 struct lport_addresses laddrs;
>> >>>>>>                 if (!extract_lport_addresses(op->nbs->addresses[i],
>> >>>>>> &laddrs,
>> >>>>>> @@ -2158,8 +2164,56 @@ build_lrouter_flows(struct hmap *datapaths,
>> >>>>>> struct hmap *ports,
>> >>>>>>
>> >>>>>>                 free(laddrs.ipv4_addrs);
>> >>>>>>             }
>> >>>>>> +        } else if (!strcmp(op->nbs->type, "router")) {
>> >>>>>> +            /* This is a logical switch port that connects to a
>> >>> router.
>> >>>>>> */
>> >>>>>> +
>> >>>>>> +            /* The peer of this switch port is the router port for
>> >>> which
>> >>>>>> +             * we need to add logical flows such that it can resolve
>> >>>>>> +             * ARP entries for all the other router ports connected
>> >>> to
>> >>>>>> +             * the switch in question. */
>> >>>>>> +
>> >>>>>> +            const char *peer_name = smap_get(&op->nbs->options,
>> >>>>>> +                                             "router-port");
>> >>>>>> +            if (!peer_name) {
>> >>>>>> +                continue;
>> >>>>>> +            }
>> >>>>>> +
>> >>>>>> +            struct ovn_port *peer = ovn_port_find(ports, peer_name);
>> >>>>>> +            if (!peer || !peer->nbr || !peer->ip) {
>> >>>>>> +                continue;
>> >>>>>> +            }
>> >>>>>> +
>> >>>>>> +            for (size_t j = 0; j < op->od->n_router_ports; j++) {
>> >>>>>> +                const char *router_port_name = smap_get(
>> >>>>>> +
>> >>>>>> &op->od->router_ports[j]->nbs->options,
>> >>>>>> +                                    "router-port");
>> >>>>>> +                struct ovn_port *router_port = ovn_port_find(ports,
>> >>>>>> +
>> >>>>>> router_port_name);
>> >>>>>> +                if (!router_port || !router_port->nbr ||
>> >>>>>> !router_port->ip) {
>> >>>>>> +                    continue;
>> >>>>>> +                }
>> >>>>>> +
>> >>>>>> +                /* Skip the router port under consideration. */
>> >>>>>> +                if (router_port == peer) {
>> >>>>>> +                   continue;
>> >>>>>> +                }
>> >>>>>> +
>> >>>>>> +                if (!router_port->ip) {
>> >>>>>> +                    continue;
>> >>>>>> +                }
>> >>>>>> +                char *match = xasprintf("outport == %s && reg0 ==
>> >>>>>> "IP_FMT,
>> >>>>>> +                                        peer->json_key,
>> >>>>>> +                                        IP_ARGS(router_port->ip));
>> >>>>>> +                char *actions = xasprintf("eth.dst = "ETH_ADDR_FMT";
>> >>>>>> next;",
>> >>>>>> +
>> >>>>>> ETH_ADDR_ARGS(router_port->mac));
>> >>>>>> +                ovn_lflow_add(lflows, peer->od,
>> >>> S_ROUTER_IN_ARP_RESOLVE,
>> >>>>>> +                              101, match, actions);
>> >>>>>> +                free(actions);
>> >>>>>> +                free(match);
>> >>>>>> +            }
>> >>>>>>         }
>> >>>>>>     }
>> >>>>>> +
>> >>>>>>     HMAP_FOR_EACH (od, key_node, datapaths) {
>> >>>>>>         if (!od->nbr) {
>> >>>>>>             continue;
>> >>>>>> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
>> >>>>>> index c01455d..d7fd595 100644
>> >>>>>> --- a/ovn/ovn-nb.xml
>> >>>>>> +++ b/ovn/ovn-nb.xml
>> >>>>>> @@ -154,13 +154,6 @@
>> >>>>>>           These options apply when <ref column="type"/> is
>> >>>>>> <code>router</code>.
>> >>>>>>         </p>
>> >>>>>>
>> >>>>>> -        <p>
>> >>>>>> -          If a given logical switch has multiple <code>router</code>
>> >>>>>> ports, the
>> >>>>>> -          <ref table="Logical_Router_Port"/> rows that they reference
>> >>>>>> must be
>> >>>>>> -          all on the same <ref table="Logical_Router"/> (for
>> >>> different
>> >>>>>> -          subnets).
>> >>>>>> -        </p>
>> >>>>>> -
>> >>>>>>         <column name="options" key="router-port">
>> >>>>>>           Required.  The <ref column="name"/> of the <ref
>> >>>>>>           table="Logical_Router_Port"/> to which this logical switch
>> >>>>>> port is
>> >>>>>> diff --git a/tests/ovn.at b/tests/ovn.at
>> >>>>>> index b9d7ada..6961be0 100644
>> >>>>>> --- a/tests/ovn.at
>> >>>>>> +++ b/tests/ovn.at
>> >>>>>> @@ -2545,3 +2545,239 @@ OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>>>>
>> >>>>>> AT_CLEANUP
>> >>>>>>
>> >>>>>> +AT_SETUP([ovn -- 2 HVs, 3 LRs connected via LS, static routes])
>> >>>>>> +AT_KEYWORDS([ovnstaticroutes])
>> >>>>>> +AT_SKIP_IF([test $HAVE_PYTHON = no])
>> >>>>>> +ovn_start
>> >>>>>> +
>> >>>>>> +# Logical network:
>> >>>>>> +# Three LRs - R1, R2 and R3 that are connected to each other via LS
>> >>>>>> "join"
>> >>>>>> +# in 20.0.0.0/24 network. R1 has switchess foo (192.168.1.0/24)
>> >>>>>> +# connected to it. R2 has alice (172.16.1.0/24) and R3 has bob (
>> >>>>>> 10.32.1.0/24)
>> >>>>>> +# connected to it.
>> >>>>>> +
>> >>>>>> +ovn-nbctl create Logical_Router name=R1
>> >>>>>> +ovn-nbctl create Logical_Router name=R2
>> >>>>>> +ovn-nbctl create Logical_Router name=R3
>> >>>>>> +
>> >>>>>> +ovn-nbctl lswitch-add foo
>> >>>>>> +ovn-nbctl lswitch-add alice
>> >>>>>> +ovn-nbctl lswitch-add bob
>> >>>>>> +ovn-nbctl lswitch-add join
>> >>>>>> +
>> >>>>>> +# Connect foo to R1
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=foo \
>> >>>>>> +network=192.168.1.1/24 mac=\"00:00:01:01:02:03\" -- add
>> >>> Logical_Router
>> >>>>>> R1 \
>> >>>>>> +ports @lrp -- lport-add foo rp-foo
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port rp-foo type=router
>> >>> options:router-port=foo \
>> >>>>>> +addresses=\"00:00:01:01:02:03\"
>> >>>>>> +
>> >>>>>> +# Connect alice to R2
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=alice \
>> >>>>>> +network=172.16.1.1/24 mac=\"00:00:02:01:02:03\" -- add
>> >>> Logical_Router
>> >>>>>> R2 \
>> >>>>>> +ports @lrp -- lport-add alice rp-alice
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port rp-alice type=router
>> >>>>>> options:router-port=alice \
>> >>>>>> +addresses=\"00:00:02:01:02:03\"
>> >>>>>> +
>> >>>>>> +# Connect bob to R3
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=bob \
>> >>>>>> +network=10.32.1.1/24 mac=\"00:00:03:01:02:03\" -- add Logical_Router
>> >>>>>> R3 \
>> >>>>>> +ports @lrp -- lport-add bob rp-bob
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port rp-bob type=router
>> >>> options:router-port=bob \
>> >>>>>> +addresses=\"00:00:03:01:02:03\"
>> >>>>>> +
>> >>>>>> +# Connect R1 to join
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R1_join \
>> >>>>>> +network=20.0.0.1/24 mac=\"00:00:04:01:02:03\" -- add Logical_Router
>> >>> R1
>> >>>>>> \
>> >>>>>> +ports @lrp -- lport-add join r1-join
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port r1-join type=router
>> >>>>>> options:router-port=R1_join \
>> >>>>>> +addresses='"00:00:04:01:02:03"'
>> >>>>>> +
>> >>>>>> +# Connect R2 to join
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R2_join \
>> >>>>>> +network=20.0.0.2/24 mac=\"00:00:04:01:02:04\" -- add Logical_Router
>> >>> R2
>> >>>>>> \
>> >>>>>> +ports @lrp -- lport-add join r2-join
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port r2-join type=router
>> >>>>>> options:router-port=R2_join \
>> >>>>>> +addresses='"00:00:04:01:02:04"'
>> >>>>>> +
>> >>>>>> +
>> >>>>>> +# Connect R3 to join
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R3_join \
>> >>>>>> +network=20.0.0.3/24 mac=\"00:00:04:01:02:05\" -- add Logical_Router
>> >>> R3
>> >>>>>> \
>> >>>>>> +ports @lrp -- lport-add join r3-join
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port r3-join type=router
>> >>>>>> options:router-port=R3_join \
>> >>>>>> +addresses='"00:00:04:01:02:05"'
>> >>>>>> +
>> >>>>>> +#install static routes
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
>> >>>>>> +R1 static_routes @lrt
>> >>>>>> +
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
>> >>>>>> +R1 static_routes @lrt
>> >>>>>> +
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
>> >>>>>> +R2 static_routes @lrt
>> >>>>>> +
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
>> >>>>>> +R2 static_routes @lrt
>> >>>>>> +
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
>> >>>>>> +R3 static_routes @lrt
>> >>>>>> +
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
>> >>>>>> +R3 static_routes @lrt
>> >>>>>> +
>> >>>>>> +# Create logical port foo1 in foo
>> >>>>>> +ovn-nbctl lport-add foo foo1 \
>> >>>>>> +-- lport-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
>> >>>>>> +
>> >>>>>> +# Create logical port alice1 in alice
>> >>>>>> +ovn-nbctl lport-add alice alice1 \
>> >>>>>> +-- lport-set-addresses alice1 "f0:00:00:01:02:04 172.16.1.2"
>> >>>>>> +
>> >>>>>> +# Create logical port bob1 in bob
>> >>>>>> +ovn-nbctl lport-add bob bob1 \
>> >>>>>> +-- lport-set-addresses bob1 "f0:00:00:01:02:05 10.32.1.2"
>> >>>>>> +
>> >>>>>> +# Create two hypervisor and create OVS ports corresponding to logical
>> >>>>>> ports.
>> >>>>>> +net_add n1
>> >>>>>> +
>> >>>>>> +sim_add hv1
>> >>>>>> +as hv1
>> >>>>>> +ovs-vsctl add-br br-phys
>> >>>>>> +ovn_attach n1 br-phys 192.168.0.1
>> >>>>>> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
>> >>>>>> +    set interface hv1-vif1 external-ids:iface-id=foo1 \
>> >>>>>> +    options:tx_pcap=hv1/vif1-tx.pcap \
>> >>>>>> +    options:rxq_pcap=hv1/vif1-rx.pcap \
>> >>>>>> +    ofport-request=1
>> >>>>>> +
>> >>>>>> +ovs-vsctl -- add-port br-int hv1-vif2 -- \
>> >>>>>> +    set interface hv1-vif2 external-ids:iface-id=alice1 \
>> >>>>>> +    options:tx_pcap=hv1/vif2-tx.pcap \
>> >>>>>> +    options:rxq_pcap=hv1/vif2-rx.pcap \
>> >>>>>> +    ofport-request=2
>> >>>>>> +
>> >>>>>> +sim_add hv2
>> >>>>>> +as hv2
>> >>>>>> +ovs-vsctl add-br br-phys
>> >>>>>> +ovn_attach n1 br-phys 192.168.0.2
>> >>>>>> +ovs-vsctl -- add-port br-int hv2-vif1 -- \
>> >>>>>> +    set interface hv2-vif1 external-ids:iface-id=bob1 \
>> >>>>>> +    options:tx_pcap=hv2/vif1-tx.pcap \
>> >>>>>> +    options:rxq_pcap=hv2/vif1-rx.pcap \
>> >>>>>> +    ofport-request=1
>> >>>>>> +
>> >>>>>> +
>> >>>>>> +# Pre-populate the hypervisors' ARP tables so that we don't lose any
>> >>>>>> +# packets for ARP resolution (native tunneling doesn't queue packets
>> >>>>>> +# for ARP resolution).
>> >>>>>> +ovn_populate_arp
>> >>>>>> +
>> >>>>>> +# Allow some time for ovn-northd and ovn-controller to catch up.
>> >>>>>> +# XXX This should be more systematic.
>> >>>>>> +sleep 1
>> >>>>>> +
>> >>>>>> +ip_to_hex() {
>> >>>>>> +    printf "%02x%02x%02x%02x" "$@"
>> >>>>>> +}
>> >>>>>> +trim_zeros() {
>> >>>>>> +    sed 's/\(00\)\{1,\}$//'
>> >>>>>> +}
>> >>>>>> +
>> >>>>>> +# Send ip packets between foo1 and alice1
>> >>>>>> +src_mac="f00000010203"
>> >>>>>> +dst_mac="000001010203"
>> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>>>>> +dst_ip=`ip_to_hex 172 16 1 2`
>> >>> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
>> >>>>>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>> >>>>>> +as hv1 ovs-appctl ofproto/trace br-int in_port=1 $packet
>> >>>>>> +
>> >>>>>> +# Send ip packets between foo1 and bob1
>> >>>>>> +src_mac="f00000010203"
>> >>>>>> +dst_mac="000001010203"
>> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>>>>> +dst_ip=`ip_to_hex 10 32 1 2`
>> >>> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
>> >>>>>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>> >>>>>> +
>> >>>>>> +echo "---------NB dump-----"
>> >>>>>> +ovn-nbctl show
>> >>>>>> +echo "---------------------"
>> >>>>>> +ovn-nbctl list logical_router
>> >>>>>> +echo "---------------------"
>> >>>>>> +ovn-nbctl list logical_router_port
>> >>>>>> +echo "---------------------"
>> >>>>>> +
>> >>>>>> +echo "---------SB dump-----"
>> >>>>>> +ovn-sbctl list datapath_binding
>> >>>>>> +echo "---------------------"
>> >>>>>> +ovn-sbctl list port_binding
>> >>>>>> +echo "---------------------"
>> >>>>>> +ovn-sbctl dump-flows
>> >>>>>> +echo "---------------------"
>> >>>>>> +
>> >>>>>> +echo "------ hv1 dump ----------"
>> >>>>>> +as hv1 ovs-ofctl show br-int
>> >>>>>> +as hv1 ovs-ofctl dump-flows br-int
>> >>>>>> +echo "------ hv2 dump ----------"
>> >>>>>> +as hv2 ovs-ofctl show br-int
>> >>>>>> +as hv2 ovs-ofctl dump-flows br-int
>> >>>>>> +echo "----------------------------"
>> >>>>>> +
>> >>>>>> +# Packet to Expect at bob1
>> >>>>>> +src_mac="000003010203"
>> >>>>>> +dst_mac="f00000010205"
>> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>>>>> +dst_ip=`ip_to_hex 10 32 1 2`
>> >>> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
>> >>>>>> +
>> >>>>>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/vif1-tx.pcap |
>> >>>>>> trim_zeros > received.packets
>> >>>>>> +echo $expected | trim_zeros > expout
>> >>>>>> +AT_CHECK([cat received.packets], [0], [expout])
>> >>>>>> +
>> >>>>>> +# Packet to Expect at alice1
>> >>>>>> +src_mac="000002010203"
>> >>>>>> +dst_mac="f00000010204"
>> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>>>>> +dst_ip=`ip_to_hex 172 16 1 2`
>> >>> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
>> >>>>>> +
>> >>>>>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap |
>> >>>>>> trim_zeros > received1.packets
>> >>>>>> +echo $expected | trim_zeros > expout
>> >>>>>> +AT_CHECK([cat received1.packets], [0], [expout])
>> >>>>>> +
>> >>>>>> +for sim in hv1 hv2; do
>> >>>>>> +    as $sim
>> >>>>>> +    OVS_APP_EXIT_AND_WAIT([ovn-controller])
>> >>>>>> +    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>> >>>>>> +    OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>>>> +done
>> >>>>>> +
>> >>>>>> +as ovn-sb
>> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>>>> +
>> >>>>>> +as ovn-nb
>> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>>>> +
>> >>>>>> +as northd
>> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
>> >>>>>> +
>> >>>>>> +as main
>> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>>>> +
>> >>>>>> +AT_CLEANUP
>> >>>>>> --
>> >>>>>> 1.7.9.5
>> >>>>>>
>> >>>>>> _______________________________________________
>> >>>>>> dev mailing list
>> >>>>>> dev@openvswitch.org
>> >>>>>> http://openvswitch.org/mailman/listinfo/dev
>> >>> _______________________________________________
>> >>> dev mailing list
>> >>> dev@openvswitch.org
>> >>> http://openvswitch.org/mailman/listinfo/dev
>> > _______________________________________________
>> > dev mailing list
>> > dev@openvswitch.org
>> > http://openvswitch.org/mailman/listinfo/dev
>
Darrell Ball May 12, 2016, 4:41 p.m. UTC | #8
On Thu, May 12, 2016 at 6:03 AM, Guru Shetty <guru.ovn@gmail.com> wrote:

>
>
>
> On May 11, 2016, at 10:45 PM, Darrell Ball <dlu998@gmail.com> wrote:
>
>
>
> On Wed, May 11, 2016 at 8:51 PM, Guru Shetty <guru.ovn@gmail.com> wrote:
>
>>
>>
>>
>>
>> > On May 11, 2016, at 8:45 PM, Darrell Ball <dlu998@gmail.com> wrote:
>> >
>> >> On Wed, May 11, 2016 at 4:42 PM, Guru Shetty <guru@ovn.org> wrote:
>> >>
>> >>
>> >>>
>> >>> Some reasons why having a “transit LS” is “undesirable” is:
>> >>>
>> >>> 1)    1)  It creates additional requirements at the CMS layer for
>> setting
>> >>> up networks; i.e. additional programming is required at the OVN
>> northbound
>> >>> interface for the special transit LSs, interactions with the logical
>> >>> router
>> >>> peers.
>> >>
>> >> Agreed that there is additional work needed for the CMS plugin. That
>> work
>> >> is needed even if it is just peering as they need to convert one
>> router in
>> >> to two in OVN (unless OVN automatically makes this split)
>> >
>> > The work to coordinate 2 logical routers and one special LS is more and
>> > also more complicated than
>> > to coordinate 2 logical routers.
>> >
>> >
>> >>
>> >>
>> >>>
>> >>> In cases where some NSX products do this, it is hidden from the user,
>> as
>> >>> one would minimally expect.
>> >>>
>> >>> 2)     2) From OVN POV, it adds an additional OVN datapath to all
>> >>> processing to the packet path and programming/processing for that
>> >>> datapath.
>> >>>
>> >>> because you have
>> >>>
>> >>> R1<->Transit LS<->R2
>> >>>
>> >>> vs
>> >>>
>> >>> R1<->R2
>> >>
>> >> Agreed that there is an additional datapath.
>> >>
>> >>
>> >>>
>> >>> 3)     3) You have to coordinate the transit LS subnet to handle all
>> >>> addresses in this same subnet for all the logical routers and all
>> their
>> >>> transit LS peers.
>> >>
>> >> I don't understand what you mean. If a user uses one gateway, a
>> transit LS
>> >> only gets connected by 2 routers.
>> >> Other routers get their own transit LS.
>> >
>> >
>> > Each group of logical routers communicating has it own Transit LS.
>> >
>> > Take an example with one gateway router and 1000 distributed logical
>> > routers for 1000 tenants/hv,
>> > connecting 1000 HVs for now.
>> > Lets assume each distributed logical router only talks to the gateway
>> > router.
>>
>> That is a wrong assumption. Each tenant has his own gateway router (or
>> more)
>>
>
> Its less of an assumption but more of an example for illustrative
> purposes; but its good that you
> mention it.
>
>
> I think one of the main discussion points was needing thousands of arp
> flows and thousands of subnets, and it was on an incorrect logical
> topology, I am glad that it is not an issue any more.
>

I think you misunderstood - having one or more gateway per tenant does not
make Transit LS better in flow scale.
The size of a Transit LS subnet and management across Transit LSs is one
the 5 issues I mentioned and it remains the same
as do the other issues.

Based on the example with 1000 HVs, 1000 tenants, 1000 HV/tenant, one
distributed logical router per tenant
spanning 1000 HVs, one gateway per tenant, we have a Transit LS with 1001
router type logical ports (1000 HVs + one gateway).

Now, based on your previous assertion earlier:
 "If a user uses one gateway, a transit LS only gets connected by 2 routers.
Other routers get their own transit LS."

This translates:
one Transit LS per tenant => 1000 Transit LS datapaths in total
1001 Transit LS logical ports per Transit LS => 1,001,000 Transit LS
logical ports in total
1001 arp resolve flows per Transit LS => 1,001,000 flows just for arp
resolve.
Each Transit LS comes with many other flows: so we multiply that number of
flows * 1000 Transit LSs = ? flows
1001 addresses per subnet per Transit LS; I suggested addresses should be
reused across subnets, but when each subnet is large
          as it with Transit LS, and there are 1000 subnets instances to
keep context for - one for each Transit LS, its get harder to manage.
          These Transit LSs and their subnets will not be identical across
Transit LSs in real scenarios.


We can go to more complex examples and larger HV scale (10000 is the later
goal ?) if you wish,
but I think the minimal case is enough to point out the issues.




>
>
> The DR<->GR direct connection approach as well as the transit LS approach
> can re-use private
> IP pools across internal distributed logical routers, which amount to VRF
> contexts for tenants networks.
>
>
>
> The Transit LS approach does not scale due to the large number of
> distributed datapaths required and
> other high special flow requirements. It has more complex and higher
> subnetting requirements. In addition, there is greater complexity for
> northbound management.
>
>
> Okay, now to summarize from my understanding:
> * A transit LS uses one internal subnet to connect multiple GR with one DR
> whereas direct multiple GR to one DR  via peering uses multiple internal
> subnets.
> * A transit LS uses an additional logical datapath (split between 2
> machines via tunnel) per logical topology which is a disadvantage as it
> involves going through an additional egress or ingress openflow pipeline in
> one host.
>
* A transit LS lets you split a DR and GR in such a way that the packet
> entering physical gateway gets into egress pipeline of a switch and can be
> made to renter the ingress pipeline of a router making it easier to apply
> stateful policies as packets always enter ingress pipeline of a router in
> all directions (NS, SN and east west)
>

I think its better to have diagram here explaining what the challenge is
that you facing and so that "easier"
is better defined.





> * The general ability to connect multiple router to a switch (which this
> patch is about) also lets you connect your physical interface of your
> physical gateway connected to a physical topology to a LS in ovn which
> inturn is connected to multiple GRs. Each GR will have floating ips and
> will respond to ARPs for those floating IPs.
>

> Overall, Though I see the possibility of implementing direct GR to DR
> connections via peering, it feels right now that it will be additional work
> for not a lot of added benefits.
>
>
>
>
>
>
>
>
>
>
>
>
>>
>>
>> > So thats 1000 Transit LSs.
>> > -> 1001 addresses per subnet for each of 1000 subnets (1 for each
>> Transit
>> > LS) ?
>> >
>> >
>> >
>> >
>> >>
>> >>
>> >>>
>> >>> 4)    4) Seems like L3 load balancing, ECMP, would be more
>> complicated at
>> >>> best.
>> >>>
>> >>> 5)    5)  1000s of additional arp resolve flows rules are needed in
>> normal
>> >>> cases in addition to added burden of the special transit LS others
>> flows.
>> >>
>> >> I don't understand why that would be the case.
>> >
>> >
>> > Each Transit LS creates an arp resolve flow for each peer router port.
>> > Lets say we have 1000 HVs, each HV has 1000 tenants.
>> > Lets says we have 1000 distributed logical routers, minimally 1 per
>> tenant.
>> > For simplicity, lets assume the 1000 distributed logical routers are
>> > connected to a single lonely gateway (GR).
>> >
>> > How many Transit LS (i.e. extra) distributed datapaths ?
>> > 1000 ?
>> >
>> > How many Transit LS router type logical ports in total across all 1000
>> > Transit LSs would you
>> > expect with this 1000 HV network ?
>> > 1001 per Transit LS * 1000 Transit LSs = 1,001,000
>> > Each of these requires a special arp resolve flow, as well - 1,001,000.
>> >
>> > For distributed logical router to gateway connections via Transit LSs,
>> we
>> > have
>> > 1001 address per subnet for each of 1000 subnets = 1,001,000 addresses
>> > but again we need 1001 addresses per subnet !
>> >
>> >
>> >
>> >>
>> >>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>>
>> >>>>
>> >>>>> This is usually only done if there is existing switching equipment
>> than
>> >>>>> must be traversed
>> >>>>> But in the case, we are just dealing with software where we have
>> total
>> >>>>> flexibility.
>> >>>>
>> >>>> From OpenStack perspective, each tenant gets a router and multiple
>> >>>> switches with different subnets. The idea is that the OpenStack
>> network
>> >>>> plugin, can at best split this single tenant router into 2 with a
>> >>> switch in
>> >>>> between on a  subnet that neutron does not use. Taking the same logic
>> >>>> forward for k8s, I can support multiple gateways.
>> >>>
>> >>> As far as I have heard from Openstack folks, each tenant can create
>> >>> multiple logical routers; they can even overlap subnets if they
>> wanted.
>> >>> So,
>> >>> I am not sure the above assumption holds true.
>> >>
>> >> A tenant can create multiple logical routers. But a connected logical
>> >> topology can have one router connected to the gateway, atleast that is
>> the
>> >> common use case.
>> >
>> > hmm, I thought there was some kind of limitation that required creating
>> > Transit LSs,
>> > but there seems no such limitation ? I guess we need to dig more on this
>> > one.
>> >
>> >
>> >
>> >>
>> >>>
>> >>>
>> >>>>
>> >>>> Connecting multiple routers to each other without a switch makes it a
>> >>> pain
>> >>>> to remember the interconnections and create multiple subnets (which
>> may
>> >>>> simply not be available for a networking backend for internal use).
>> >>>
>> >>> Well, just looking at the 10.x.x.x private IP range, there are more
>> than
>> >>> 16
>> >>> million IP addresses
>> >>>
>> >>> Point to point direct router links, Rx<->Ry subnets, can be /31. In
>> the
>> >>> most extreme case, maybe 20,000 addresses would be used. I don’t think
>> >>> there should be a subnet depletion problem here, unless it’s a
>> contrived
>> >>> misconfiguration.
>> >>
>> >> Who is going to manage all the interconnections? As far as I see a
>> transit
>> >> LS is much more simpler.
>> >
>> >
>> > Lets compare.
>> >
>> > To be consistent with the previous example, I described earlier, lets
>> again
>> > take the 1000 HV
>> > example, using 1000 distributed logical routers for 1000 tenants,
>> > connected to a single and lonely gateway.
>> >
>> > For distributed logical router to gateway connections only, we have
>> > for direct DR<->GR connections:
>> >
>> > 2000 subnet address per distributed logical router * 1000 distributed
>> > logical routers
>> > = 2,000,000 vs 1,001,000 for Transit LSs usage; same magnitude,
>> > but we only need 2 addresses per subnet compared to 1001 addresses per
>> > subnet for Transit LS
>> > usage.
>> >
>> > There is no subnet depletion problem with direct DR<->GR connections,
>> but
>> > there may be
>> > with DR<->Transit LS<->GR.
>> >
>> > The question is who is going to manage 1000 subnets needing 1001
>> addresses
>> > each, using Transit LSs.
>> >
>> >
>> >
>> >>
>> >>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>>
>> >>>>
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>> With the above goal in mind, this commit gives the general
>> >>>>>> ability to connect multiple routers via a switch.
>> >>>>>>
>> >>>>>> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
>> >>>>>> ---
>> >>>>>> This needs the following 2 commits under review to
>> >>>>>> first go in.
>> >>>>>> 1. ofproto-dpif: Rename "recurse" to "indentation".
>> >>>>>> 2. ofproto-dpif: Do not count resubmit to later tables against
>> limit.
>> >>>>>> ---
>> >>>>>> ovn/controller/lflow.c  |   19 ++--
>> >>>>>> ovn/northd/ovn-northd.c |   56 ++++++++++-
>> >>>>>> ovn/ovn-nb.xml          |    7 --
>> >>>>>> tests/ovn.at            |  236
>> >>>>>> +++++++++++++++++++++++++++++++++++++++++++++++
>> >>>>>> 4 files changed, 299 insertions(+), 19 deletions(-)
>> >>>>>>
>> >>>>>> diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
>> >>>>>> index 96b7c66..efc427d 100644
>> >>>>>> --- a/ovn/controller/lflow.c
>> >>>>>> +++ b/ovn/controller/lflow.c
>> >>>>>> @@ -215,15 +215,15 @@ add_logical_flows(struct controller_ctx *ctx,
>> >>>>>> const struct lport_index *lports,
>> >>>>>>         if (is_switch(ldp)) {
>> >>>>>>             /* For a logical switch datapath, local_datapaths tells
>> >>> us
>> >>>>>> if there
>> >>>>>>              * are any local ports for this datapath.  If not, we
>> can
>> >>>>>> skip
>> >>>>>> -             * processing logical flows if the flow belongs to
>> egress
>> >>>>>> pipeline
>> >>>>>> -             * or if that logical switch datapath is not patched
>> to
>> >>> any
>> >>>>>> logical
>> >>>>>> -             * router.
>> >>>>>> +             * processing logical flows if that logical switch
>> >>> datapath
>> >>>>>> is not
>> >>>>>> +             * patched to any logical router.
>> >>>>>>              *
>> >>>>>> -             * Otherwise, we still need the ingress pipeline
>> because
>> >>>>>> even if
>> >>>>>> -             * there are no local ports, we still may need to
>> execute
>> >>>>>> the ingress
>> >>>>>> -             * pipeline after a packet leaves a logical router.
>> >>> Further
>> >>>>>> -             * optimization is possible, but not based on what we
>> >>> know
>> >>>>>> with
>> >>>>>> -             * local_datapaths right now.
>> >>>>>> +             * Otherwise, we still need both ingress and egress
>> >>> pipeline
>> >>>>>> +             * because even if there are no local ports, we still
>> may
>> >>>>>> need to
>> >>>>>> +             * execute the ingress pipeline after a packet leaves
>> a
>> >>>>>> logical
>> >>>>>> +             * router and we need to do egress pipeline for a
>> switch
>> >>>>>> that
>> >>>>>> +             * is connected to only routers.  Further
>> optimization is
>> >>>>>> possible,
>> >>>>>> +             * but not based on what we know with local_datapaths
>> >>> right
>> >>>>>> now.
>> >>>>>>              *
>> >>>>>>              * A better approach would be a kind of "flood fill"
>> >>>>>> algorithm:
>> >>>>>>              *
>> >>>>>> @@ -242,9 +242,6 @@ add_logical_flows(struct controller_ctx *ctx,
>> >>> const
>> >>>>>> struct lport_index *lports,
>> >>>>>>              */
>> >>>>>>
>> >>>>>>             if (!get_local_datapath(local_datapaths,
>> >>> ldp->tunnel_key)) {
>> >>>>>> -                if (!ingress) {
>> >>>>>> -                    continue;
>> >>>>>> -                }
>> >>>>>
>> >>>>> This is removing a previous change that was done for some
>> optimization
>> >>>>> for avoiding
>> >>>>> unnecessary flow creation.
>> >>>>> I am not sure about the process here, but maybe this should be
>> tracked
>> >>> as
>> >>>>> a separate
>> >>>>> patch ?
>> >>>> The above change is needed for this patch to work. The optimization
>> >>> above
>> >>>> assumes that a switch always has atleast one real physical endpoint.
>> >>> With
>> >>>> this change, since a switch can only be connected to routers, we
>> need to
>> >>>> remove the optimization. The optimization itself will need more
>> careful
>> >>>> consideration and with more information provided to ovn-controller,
>> it
>> >>> can
>> >>>> ideally be improved, but I would ideally not want l3 gateway work
>> >>> delayed
>> >>>> because of it.
>> >>> My point was that this code change removes a previous optimization.
>> This
>> >>> leads to 2 questions – is it worth removing the optimization for the
>> new
>> >>> code as is and is it worth adding back optimizations with the added
>> >>> complexity of the new code.
>> >>>
>> >>>
>> >>>
>> >>>>
>> >>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>>                 if (!get_patched_datapath(patched_datapaths,
>> >>>>>>                                           ldp->tunnel_key)) {
>> >>>>>>                     continue;
>> >>>>>> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
>> >>>>>> index 9e03606..3a29770 100644
>> >>>>>> --- a/ovn/northd/ovn-northd.c
>> >>>>>> +++ b/ovn/northd/ovn-northd.c
>> >>>>>> @@ -2110,7 +2110,13 @@ build_lrouter_flows(struct hmap *datapaths,
>> >>>>>> struct hmap *ports,
>> >>>>>>                 free(actions);
>> >>>>>>                 free(match);
>> >>>>>>             }
>> >>>>>> -        } else if (op->od->n_router_ports) {
>> >>>>>> +        } else if (op->od->n_router_ports && strcmp(op->nbs->type,
>> >>>>>> "router")) {
>> >>>>>> +            /* This is a logical switch port that backs a VM or a
>> >>>>>> container.
>> >>>>>> +             * Extract its addresses. For each of the address, go
>> >>>>>> through all
>> >>>>>> +             * the router ports attached to the switch (to which
>> this
>> >>>>>> port
>> >>>>>> +             * connects) and if the address in question is
>> reachable
>> >>>>>> from the
>> >>>>>> +             * router port, add an ARP entry in that router's
>> >>> pipeline.
>> >>>>>> */
>> >>>>>> +
>> >>>>>>             for (size_t i = 0; i < op->nbs->n_addresses; i++) {
>> >>>>>>                 struct lport_addresses laddrs;
>> >>>>>>                 if (!extract_lport_addresses(op->nbs->addresses[i],
>> >>>>>> &laddrs,
>> >>>>>> @@ -2158,8 +2164,56 @@ build_lrouter_flows(struct hmap *datapaths,
>> >>>>>> struct hmap *ports,
>> >>>>>>
>> >>>>>>                 free(laddrs.ipv4_addrs);
>> >>>>>>             }
>> >>>>>> +        } else if (!strcmp(op->nbs->type, "router")) {
>> >>>>>> +            /* This is a logical switch port that connects to a
>> >>> router.
>> >>>>>> */
>> >>>>>> +
>> >>>>>> +            /* The peer of this switch port is the router port for
>> >>> which
>> >>>>>> +             * we need to add logical flows such that it can
>> resolve
>> >>>>>> +             * ARP entries for all the other router ports
>> connected
>> >>> to
>> >>>>>> +             * the switch in question. */
>> >>>>>> +
>> >>>>>> +            const char *peer_name = smap_get(&op->nbs->options,
>> >>>>>> +                                             "router-port");
>> >>>>>> +            if (!peer_name) {
>> >>>>>> +                continue;
>> >>>>>> +            }
>> >>>>>> +
>> >>>>>> +            struct ovn_port *peer = ovn_port_find(ports,
>> peer_name);
>> >>>>>> +            if (!peer || !peer->nbr || !peer->ip) {
>> >>>>>> +                continue;
>> >>>>>> +            }
>> >>>>>> +
>> >>>>>> +            for (size_t j = 0; j < op->od->n_router_ports; j++) {
>> >>>>>> +                const char *router_port_name = smap_get(
>> >>>>>> +
>> >>>>>> &op->od->router_ports[j]->nbs->options,
>> >>>>>> +                                    "router-port");
>> >>>>>> +                struct ovn_port *router_port =
>> ovn_port_find(ports,
>> >>>>>> +
>> >>>>>> router_port_name);
>> >>>>>> +                if (!router_port || !router_port->nbr ||
>> >>>>>> !router_port->ip) {
>> >>>>>> +                    continue;
>> >>>>>> +                }
>> >>>>>> +
>> >>>>>> +                /* Skip the router port under consideration. */
>> >>>>>> +                if (router_port == peer) {
>> >>>>>> +                   continue;
>> >>>>>> +                }
>> >>>>>> +
>> >>>>>> +                if (!router_port->ip) {
>> >>>>>> +                    continue;
>> >>>>>> +                }
>> >>>>>> +                char *match = xasprintf("outport == %s && reg0 ==
>> >>>>>> "IP_FMT,
>> >>>>>> +                                        peer->json_key,
>> >>>>>> +                                        IP_ARGS(router_port->ip));
>> >>>>>> +                char *actions = xasprintf("eth.dst =
>> "ETH_ADDR_FMT";
>> >>>>>> next;",
>> >>>>>> +
>> >>>>>> ETH_ADDR_ARGS(router_port->mac));
>> >>>>>> +                ovn_lflow_add(lflows, peer->od,
>> >>> S_ROUTER_IN_ARP_RESOLVE,
>> >>>>>> +                              101, match, actions);
>> >>>>>> +                free(actions);
>> >>>>>> +                free(match);
>> >>>>>> +            }
>> >>>>>>         }
>> >>>>>>     }
>> >>>>>> +
>> >>>>>>     HMAP_FOR_EACH (od, key_node, datapaths) {
>> >>>>>>         if (!od->nbr) {
>> >>>>>>             continue;
>> >>>>>> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
>> >>>>>> index c01455d..d7fd595 100644
>> >>>>>> --- a/ovn/ovn-nb.xml
>> >>>>>> +++ b/ovn/ovn-nb.xml
>> >>>>>> @@ -154,13 +154,6 @@
>> >>>>>>           These options apply when <ref column="type"/> is
>> >>>>>> <code>router</code>.
>> >>>>>>         </p>
>> >>>>>>
>> >>>>>> -        <p>
>> >>>>>> -          If a given logical switch has multiple
>> <code>router</code>
>> >>>>>> ports, the
>> >>>>>> -          <ref table="Logical_Router_Port"/> rows that they
>> reference
>> >>>>>> must be
>> >>>>>> -          all on the same <ref table="Logical_Router"/> (for
>> >>> different
>> >>>>>> -          subnets).
>> >>>>>> -        </p>
>> >>>>>> -
>> >>>>>>         <column name="options" key="router-port">
>> >>>>>>           Required.  The <ref column="name"/> of the <ref
>> >>>>>>           table="Logical_Router_Port"/> to which this logical
>> switch
>> >>>>>> port is
>> >>>>>> diff --git a/tests/ovn.at b/tests/ovn.at
>> >>>>>> index b9d7ada..6961be0 100644
>> >>>>>> --- a/tests/ovn.at
>> >>>>>> +++ b/tests/ovn.at
>> >>>>>> @@ -2545,3 +2545,239 @@ OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>>>>
>> >>>>>> AT_CLEANUP
>> >>>>>>
>> >>>>>> +AT_SETUP([ovn -- 2 HVs, 3 LRs connected via LS, static routes])
>> >>>>>> +AT_KEYWORDS([ovnstaticroutes])
>> >>>>>> +AT_SKIP_IF([test $HAVE_PYTHON = no])
>> >>>>>> +ovn_start
>> >>>>>> +
>> >>>>>> +# Logical network:
>> >>>>>> +# Three LRs - R1, R2 and R3 that are connected to each other via
>> LS
>> >>>>>> "join"
>> >>>>>> +# in 20.0.0.0/24 network. R1 has switchess foo (192.168.1.0/24)
>> >>>>>> +# connected to it. R2 has alice (172.16.1.0/24) and R3 has bob (
>> >>>>>> 10.32.1.0/24)
>> >>>>>> +# connected to it.
>> >>>>>> +
>> >>>>>> +ovn-nbctl create Logical_Router name=R1
>> >>>>>> +ovn-nbctl create Logical_Router name=R2
>> >>>>>> +ovn-nbctl create Logical_Router name=R3
>> >>>>>> +
>> >>>>>> +ovn-nbctl lswitch-add foo
>> >>>>>> +ovn-nbctl lswitch-add alice
>> >>>>>> +ovn-nbctl lswitch-add bob
>> >>>>>> +ovn-nbctl lswitch-add join
>> >>>>>> +
>> >>>>>> +# Connect foo to R1
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=foo \
>> >>>>>> +network=192.168.1.1/24 mac=\"00:00:01:01:02:03\" -- add
>> >>> Logical_Router
>> >>>>>> R1 \
>> >>>>>> +ports @lrp -- lport-add foo rp-foo
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port rp-foo type=router
>> >>> options:router-port=foo \
>> >>>>>> +addresses=\"00:00:01:01:02:03\"
>> >>>>>> +
>> >>>>>> +# Connect alice to R2
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=alice \
>> >>>>>> +network=172.16.1.1/24 mac=\"00:00:02:01:02:03\" -- add
>> >>> Logical_Router
>> >>>>>> R2 \
>> >>>>>> +ports @lrp -- lport-add alice rp-alice
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port rp-alice type=router
>> >>>>>> options:router-port=alice \
>> >>>>>> +addresses=\"00:00:02:01:02:03\"
>> >>>>>> +
>> >>>>>> +# Connect bob to R3
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=bob \
>> >>>>>> +network=10.32.1.1/24 mac=\"00:00:03:01:02:03\" -- add
>> Logical_Router
>> >>>>>> R3 \
>> >>>>>> +ports @lrp -- lport-add bob rp-bob
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port rp-bob type=router
>> >>> options:router-port=bob \
>> >>>>>> +addresses=\"00:00:03:01:02:03\"
>> >>>>>> +
>> >>>>>> +# Connect R1 to join
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R1_join \
>> >>>>>> +network=20.0.0.1/24 mac=\"00:00:04:01:02:03\" -- add
>> Logical_Router
>> >>> R1
>> >>>>>> \
>> >>>>>> +ports @lrp -- lport-add join r1-join
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port r1-join type=router
>> >>>>>> options:router-port=R1_join \
>> >>>>>> +addresses='"00:00:04:01:02:03"'
>> >>>>>> +
>> >>>>>> +# Connect R2 to join
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R2_join \
>> >>>>>> +network=20.0.0.2/24 mac=\"00:00:04:01:02:04\" -- add
>> Logical_Router
>> >>> R2
>> >>>>>> \
>> >>>>>> +ports @lrp -- lport-add join r2-join
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port r2-join type=router
>> >>>>>> options:router-port=R2_join \
>> >>>>>> +addresses='"00:00:04:01:02:04"'
>> >>>>>> +
>> >>>>>> +
>> >>>>>> +# Connect R3 to join
>> >>>>>> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R3_join \
>> >>>>>> +network=20.0.0.3/24 mac=\"00:00:04:01:02:05\" -- add
>> Logical_Router
>> >>> R3
>> >>>>>> \
>> >>>>>> +ports @lrp -- lport-add join r3-join
>> >>>>>> +
>> >>>>>> +ovn-nbctl set Logical_port r3-join type=router
>> >>>>>> options:router-port=R3_join \
>> >>>>>> +addresses='"00:00:04:01:02:05"'
>> >>>>>> +
>> >>>>>> +#install static routes
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
>> >>>>>> +R1 static_routes @lrt
>> >>>>>> +
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
>> >>>>>> +R1 static_routes @lrt
>> >>>>>> +
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
>> >>>>>> +R2 static_routes @lrt
>> >>>>>> +
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
>> >>>>>> +R2 static_routes @lrt
>> >>>>>> +
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
>> >>>>>> +R3 static_routes @lrt
>> >>>>>> +
>> >>>>>> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
>> >>>>>> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
>> >>>>>> +R3 static_routes @lrt
>> >>>>>> +
>> >>>>>> +# Create logical port foo1 in foo
>> >>>>>> +ovn-nbctl lport-add foo foo1 \
>> >>>>>> +-- lport-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
>> >>>>>> +
>> >>>>>> +# Create logical port alice1 in alice
>> >>>>>> +ovn-nbctl lport-add alice alice1 \
>> >>>>>> +-- lport-set-addresses alice1 "f0:00:00:01:02:04 172.16.1.2"
>> >>>>>> +
>> >>>>>> +# Create logical port bob1 in bob
>> >>>>>> +ovn-nbctl lport-add bob bob1 \
>> >>>>>> +-- lport-set-addresses bob1 "f0:00:00:01:02:05 10.32.1.2"
>> >>>>>> +
>> >>>>>> +# Create two hypervisor and create OVS ports corresponding to
>> logical
>> >>>>>> ports.
>> >>>>>> +net_add n1
>> >>>>>> +
>> >>>>>> +sim_add hv1
>> >>>>>> +as hv1
>> >>>>>> +ovs-vsctl add-br br-phys
>> >>>>>> +ovn_attach n1 br-phys 192.168.0.1
>> >>>>>> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
>> >>>>>> +    set interface hv1-vif1 external-ids:iface-id=foo1 \
>> >>>>>> +    options:tx_pcap=hv1/vif1-tx.pcap \
>> >>>>>> +    options:rxq_pcap=hv1/vif1-rx.pcap \
>> >>>>>> +    ofport-request=1
>> >>>>>> +
>> >>>>>> +ovs-vsctl -- add-port br-int hv1-vif2 -- \
>> >>>>>> +    set interface hv1-vif2 external-ids:iface-id=alice1 \
>> >>>>>> +    options:tx_pcap=hv1/vif2-tx.pcap \
>> >>>>>> +    options:rxq_pcap=hv1/vif2-rx.pcap \
>> >>>>>> +    ofport-request=2
>> >>>>>> +
>> >>>>>> +sim_add hv2
>> >>>>>> +as hv2
>> >>>>>> +ovs-vsctl add-br br-phys
>> >>>>>> +ovn_attach n1 br-phys 192.168.0.2
>> >>>>>> +ovs-vsctl -- add-port br-int hv2-vif1 -- \
>> >>>>>> +    set interface hv2-vif1 external-ids:iface-id=bob1 \
>> >>>>>> +    options:tx_pcap=hv2/vif1-tx.pcap \
>> >>>>>> +    options:rxq_pcap=hv2/vif1-rx.pcap \
>> >>>>>> +    ofport-request=1
>> >>>>>> +
>> >>>>>> +
>> >>>>>> +# Pre-populate the hypervisors' ARP tables so that we don't lose
>> any
>> >>>>>> +# packets for ARP resolution (native tunneling doesn't queue
>> packets
>> >>>>>> +# for ARP resolution).
>> >>>>>> +ovn_populate_arp
>> >>>>>> +
>> >>>>>> +# Allow some time for ovn-northd and ovn-controller to catch up.
>> >>>>>> +# XXX This should be more systematic.
>> >>>>>> +sleep 1
>> >>>>>> +
>> >>>>>> +ip_to_hex() {
>> >>>>>> +    printf "%02x%02x%02x%02x" "$@"
>> >>>>>> +}
>> >>>>>> +trim_zeros() {
>> >>>>>> +    sed 's/\(00\)\{1,\}$//'
>> >>>>>> +}
>> >>>>>> +
>> >>>>>> +# Send ip packets between foo1 and alice1
>> >>>>>> +src_mac="f00000010203"
>> >>>>>> +dst_mac="000001010203"
>> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>>>>> +dst_ip=`ip_to_hex 172 16 1 2`
>> >>>
>> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
>> >>>>>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>> >>>>>> +as hv1 ovs-appctl ofproto/trace br-int in_port=1 $packet
>> >>>>>> +
>> >>>>>> +# Send ip packets between foo1 and bob1
>> >>>>>> +src_mac="f00000010203"
>> >>>>>> +dst_mac="000001010203"
>> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>>>>> +dst_ip=`ip_to_hex 10 32 1 2`
>> >>>
>> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
>> >>>>>> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>> >>>>>> +
>> >>>>>> +echo "---------NB dump-----"
>> >>>>>> +ovn-nbctl show
>> >>>>>> +echo "---------------------"
>> >>>>>> +ovn-nbctl list logical_router
>> >>>>>> +echo "---------------------"
>> >>>>>> +ovn-nbctl list logical_router_port
>> >>>>>> +echo "---------------------"
>> >>>>>> +
>> >>>>>> +echo "---------SB dump-----"
>> >>>>>> +ovn-sbctl list datapath_binding
>> >>>>>> +echo "---------------------"
>> >>>>>> +ovn-sbctl list port_binding
>> >>>>>> +echo "---------------------"
>> >>>>>> +ovn-sbctl dump-flows
>> >>>>>> +echo "---------------------"
>> >>>>>> +
>> >>>>>> +echo "------ hv1 dump ----------"
>> >>>>>> +as hv1 ovs-ofctl show br-int
>> >>>>>> +as hv1 ovs-ofctl dump-flows br-int
>> >>>>>> +echo "------ hv2 dump ----------"
>> >>>>>> +as hv2 ovs-ofctl show br-int
>> >>>>>> +as hv2 ovs-ofctl dump-flows br-int
>> >>>>>> +echo "----------------------------"
>> >>>>>> +
>> >>>>>> +# Packet to Expect at bob1
>> >>>>>> +src_mac="000003010203"
>> >>>>>> +dst_mac="f00000010205"
>> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>>>>> +dst_ip=`ip_to_hex 10 32 1 2`
>> >>>
>> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
>> >>>>>> +
>> >>>>>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/vif1-tx.pcap |
>> >>>>>> trim_zeros > received.packets
>> >>>>>> +echo $expected | trim_zeros > expout
>> >>>>>> +AT_CHECK([cat received.packets], [0], [expout])
>> >>>>>> +
>> >>>>>> +# Packet to Expect at alice1
>> >>>>>> +src_mac="000002010203"
>> >>>>>> +dst_mac="f00000010204"
>> >>>>>> +src_ip=`ip_to_hex 192 168 1 2`
>> >>>>>> +dst_ip=`ip_to_hex 172 16 1 2`
>> >>>
>> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
>> >>>>>> +
>> >>>>>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap |
>> >>>>>> trim_zeros > received1.packets
>> >>>>>> +echo $expected | trim_zeros > expout
>> >>>>>> +AT_CHECK([cat received1.packets], [0], [expout])
>> >>>>>> +
>> >>>>>> +for sim in hv1 hv2; do
>> >>>>>> +    as $sim
>> >>>>>> +    OVS_APP_EXIT_AND_WAIT([ovn-controller])
>> >>>>>> +    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>> >>>>>> +    OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>>>> +done
>> >>>>>> +
>> >>>>>> +as ovn-sb
>> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>>>> +
>> >>>>>> +as ovn-nb
>> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>>>> +
>> >>>>>> +as northd
>> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
>> >>>>>> +
>> >>>>>> +as main
>> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>> >>>>>> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>> >>>>>> +
>> >>>>>> +AT_CLEANUP
>> >>>>>> --
>> >>>>>> 1.7.9.5
>> >>>>>>
>> >>>>>> _______________________________________________
>> >>>>>> dev mailing list
>> >>>>>> dev@openvswitch.org
>> >>>>>> http://openvswitch.org/mailman/listinfo/dev
>> >>> _______________________________________________
>> >>> dev mailing list
>> >>> dev@openvswitch.org
>> >>> http://openvswitch.org/mailman/listinfo/dev
>> > _______________________________________________
>> > dev mailing list
>> > dev@openvswitch.org
>> > http://openvswitch.org/mailman/listinfo/dev
>>
>
>
Gurucharan Shetty May 12, 2016, 5:15 p.m. UTC | #9
>
>
>>
>> I think one of the main discussion points was needing thousands of arp
>> flows and thousands of subnets, and it was on an incorrect logical
>> topology, I am glad that it is not an issue any more.
>>
>
> I think you misunderstood - having one or more gateway per tenant does not
> make Transit LS better in flow scale.
> The size of a Transit LS subnet and management across Transit LSs is one
> the 5 issues I mentioned and it remains the same
> as do the other issues.
>
> Based on the example with 1000 HVs, 1000 tenants, 1000 HV/tenant, one
> distributed logical router per tenant
> spanning 1000 HVs, one gateway per tenant, we have a Transit LS with 1001
> router type logical ports (1000 HVs + one gateway).
>

A transit LS does not have 1001 router type ports. It has just two. One of
them only resides in the gateway. The other one resides in every
hypervisor. This is the same as a router peer port. Transit LS adds one
extra per hypervisor, which I have agreed as a disadvantage. If that is
what you mean, then it is right.

>
>
Gurucharan Shetty May 12, 2016, 5:54 p.m. UTC | #10
>
>
> I think you misunderstood - having one or more gateway per tenant does not
> make Transit LS better in flow scale.
> The size of a Transit LS subnet and management across Transit LSs is one
> the 5 issues I mentioned and it remains the same
> as do the other issues.
>
> Based on the example with 1000 HVs, 1000 tenants, 1000 HV/tenant, one
> distributed logical router per tenant
> spanning 1000 HVs, one gateway per tenant, we have a Transit LS with 1001
> router type logical ports (1000 HVs + one gateway).
>
> Now, based on your previous assertion earlier:
>  "If a user uses one gateway, a transit LS only gets connected by 2
> routers.
> Other routers get their own transit LS."
>
> This translates:
> one Transit LS per tenant => 1000 Transit LS datapaths in total
> 1001 Transit LS logical ports per Transit LS => 1,001,000 Transit LS
> logical ports in total
> 1001 arp resolve flows per Transit LS => 1,001,000 flows just for arp
> resolve.
> Each Transit LS comes with many other flows: so we multiply that number of
> flows * 1000 Transit LSs = ? flows
> 1001 addresses per subnet per Transit LS; I suggested addresses should be
> reused across subnets, but when each subnet is large
>

Re-reading. The above is a wrong conclusion making me believe that there is
a big disconnect. A subnet in transit LS has only 2 IP addresses (if it is
only one physical gateway). Every additional physical gateway can add one
additional IP address to the subnet (depending on whether the new physical
gateway has a gateway router added for that logical topology.).
Darrell Ball May 12, 2016, 11:34 p.m. UTC | #11
On Thu, May 12, 2016 at 10:54 AM, Guru Shetty <guru@ovn.org> wrote:

>
>> I think you misunderstood - having one or more gateway per tenant does
>> not make Transit LS better in flow scale.
>> The size of a Transit LS subnet and management across Transit LSs is one
>> the 5 issues I mentioned and it remains the same
>> as do the other issues.
>>
>> Based on the example with 1000 HVs, 1000 tenants, 1000 HV/tenant, one
>> distributed logical router per tenant
>> spanning 1000 HVs, one gateway per tenant, we have a Transit LS with 1001
>> router type logical ports (1000 HVs + one gateway).
>>
>> Now, based on your previous assertion earlier:
>>  "If a user uses one gateway, a transit LS only gets connected by 2
>> routers.
>> Other routers get their own transit LS."
>>
>> This translates:
>> one Transit LS per tenant => 1000 Transit LS datapaths in total
>> 1001 Transit LS logical ports per Transit LS => 1,001,000 Transit LS
>> logical ports in total
>> 1001 arp resolve flows per Transit LS => 1,001,000 flows just for arp
>> resolve.
>> Each Transit LS comes with many other flows: so we multiply that number
>> of flows * 1000 Transit LSs = ? flows
>> 1001 addresses per subnet per Transit LS; I suggested addresses should be
>> reused across subnets, but when each subnet is large
>>
>
> Re-reading. The above is a wrong conclusion making me believe that there
> is a big disconnect. A subnet in transit LS has only 2 IP addresses (if it
> is only one physical gateway). Every additional physical gateway can add
> one additional IP address to the subnet (depending on whether the new
> physical gateway has a gateway router added for that logical topology.).
>


With respect to the IP address usage. I think a diagram would help
especially the K8 case,
which I had heard in other conversations may have a separate gateway on
every HV ?. Hence, I would like to know what that means - i.e. were you
thinking to run separate gateways routers on every HV for K8 ?

With respect to the other questions, I think its best approach would be to
ask direct questions so those
direct questions get answered.

1) With 1000 HVs, 1000 HVs/tenant, 1 distributed router per tenant, you
choose the number of gateways/tenant:

a) How many Transit LS distributed datapaths are expected in total ?

b) How many Transit LS logical ports are needed at the HV level  ?

what I mean by that is lets say we have one additional logical port at
northd level and 1000 HVs then if we need to download that port to 1000
HVs, I consider that to be 1000 logical ports at the HV level because
downloading and maintaining state across HVs at scale is more expensive
than for a single hypervisor.

c) How many Transit LS arp resolve entries at the HV level ?

what I mean by that is lets say we have one additional arp resolve flow at
northd level and 1000 HVs then if we need to download that arp resolve flow
to 1000 HVs, I consider that to be 1000 flows at the HV level because
downloading and maintaining state across multiple HVs is more expensive
that a single hypervisor.
Gurucharan Shetty May 12, 2016, 11:55 p.m. UTC | #12
On 12 May 2016 at 16:34, Darrell Ball <dlu998@gmail.com> wrote:

> On Thu, May 12, 2016 at 10:54 AM, Guru Shetty <guru@ovn.org> wrote:
>
> >
> >> I think you misunderstood - having one or more gateway per tenant does
> >> not make Transit LS better in flow scale.
> >> The size of a Transit LS subnet and management across Transit LSs is one
> >> the 5 issues I mentioned and it remains the same
> >> as do the other issues.
> >>
> >> Based on the example with 1000 HVs, 1000 tenants, 1000 HV/tenant, one
> >> distributed logical router per tenant
> >> spanning 1000 HVs, one gateway per tenant, we have a Transit LS with
> 1001
> >> router type logical ports (1000 HVs + one gateway).
> >>
> >> Now, based on your previous assertion earlier:
> >>  "If a user uses one gateway, a transit LS only gets connected by 2
> >> routers.
> >> Other routers get their own transit LS."
> >>
> >> This translates:
> >> one Transit LS per tenant => 1000 Transit LS datapaths in total
> >> 1001 Transit LS logical ports per Transit LS => 1,001,000 Transit LS
> >> logical ports in total
> >> 1001 arp resolve flows per Transit LS => 1,001,000 flows just for arp
> >> resolve.
> >> Each Transit LS comes with many other flows: so we multiply that number
> >> of flows * 1000 Transit LSs = ? flows
> >> 1001 addresses per subnet per Transit LS; I suggested addresses should
> be
> >> reused across subnets, but when each subnet is large
> >>
> >
> > Re-reading. The above is a wrong conclusion making me believe that there
> > is a big disconnect. A subnet in transit LS has only 2 IP addresses (if
> it
> > is only one physical gateway). Every additional physical gateway can add
> > one additional IP address to the subnet (depending on whether the new
> > physical gateway has a gateway router added for that logical topology.).
> >
>
>
> With respect to the IP address usage. I think a diagram would help
> especially the K8 case,
>
Drawing a diagram here is not feasible. Happy to do it on a whiteboard
though.


> which I had heard in other conversations may have a separate gateway on
> every HV ?. Hence, I would like to know what that means - i.e. were you
> thinking to run separate gateways routers on every HV for K8 ?
>
Yes, thats the plan (as many as possible). 100 routers is a target. Not HV,
but a VM.



>
> With respect to the other questions, I think its best approach would be to
> ask direct questions so those
> direct questions get answered.
>
> 1) With 1000 HVs, 1000 HVs/tenant, 1 distributed router per tenant, you
> choose the number of gateways/tenant:
>
> a) How many Transit LS distributed datapaths are expected in total ?
>

One (i.e the same as the distributed router).


>
> b) How many Transit LS logical ports are needed at the HV level  ?
>
> what I mean by that is lets say we have one additional logical port at
> northd level and 1000 HVs then if we need to download that port to 1000
> HVs, I consider that to be 1000 logical ports at the HV level because
> downloading and maintaining state across HVs at scale is more expensive
> than for a single hypervisor.
>

1000 additional ones. It is the same as your distributed logical switch or
logical router (this is the case even with the peer routers)



>
> c) How many Transit LS arp resolve entries at the HV level ?
>
> what I mean by that is lets say we have one additional arp resolve flow at
> northd level and 1000 HVs then if we need to download that arp resolve flow
> to 1000 HVs, I consider that to be 1000 flows at the HV level because
> downloading and maintaining state across multiple HVs is more expensive
> that a single hypervisor.
>

2 ARP flows per transit LS * 1000 HVs. Do realize that a single bridge on a
single hypervisor typically has flows in the 100,000 range. Even a million
is feasbile. Microsegmentation use cases has 10000 ACLs per logical switch.
So that is 10000 * 1000 for your case form single LS. So do you have some
comparative perspectives.



dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
>
Darrell Ball May 13, 2016, 12:44 a.m. UTC | #13
On Thu, May 12, 2016 at 4:55 PM, Guru Shetty <guru@ovn.org> wrote:

>
>
> On 12 May 2016 at 16:34, Darrell Ball <dlu998@gmail.com> wrote:
>
>> On Thu, May 12, 2016 at 10:54 AM, Guru Shetty <guru@ovn.org> wrote:
>>
>> >
>> >> I think you misunderstood - having one or more gateway per tenant does
>> >> not make Transit LS better in flow scale.
>> >> The size of a Transit LS subnet and management across Transit LSs is
>> one
>> >> the 5 issues I mentioned and it remains the same
>> >> as do the other issues.
>> >>
>> >> Based on the example with 1000 HVs, 1000 tenants, 1000 HV/tenant, one
>> >> distributed logical router per tenant
>> >> spanning 1000 HVs, one gateway per tenant, we have a Transit LS with
>> 1001
>> >> router type logical ports (1000 HVs + one gateway).
>> >>
>> >> Now, based on your previous assertion earlier:
>> >>  "If a user uses one gateway, a transit LS only gets connected by 2
>> >> routers.
>> >> Other routers get their own transit LS."
>> >>
>> >> This translates:
>> >> one Transit LS per tenant => 1000 Transit LS datapaths in total
>> >> 1001 Transit LS logical ports per Transit LS => 1,001,000 Transit LS
>> >> logical ports in total
>> >> 1001 arp resolve flows per Transit LS => 1,001,000 flows just for arp
>> >> resolve.
>> >> Each Transit LS comes with many other flows: so we multiply that number
>> >> of flows * 1000 Transit LSs = ? flows
>> >> 1001 addresses per subnet per Transit LS; I suggested addresses should
>> be
>> >> reused across subnets, but when each subnet is large
>> >>
>> >
>> > Re-reading. The above is a wrong conclusion making me believe that there
>> > is a big disconnect. A subnet in transit LS has only 2 IP addresses (if
>> it
>> > is only one physical gateway). Every additional physical gateway can add
>> > one additional IP address to the subnet (depending on whether the new
>> > physical gateway has a gateway router added for that logical topology.).
>> >
>>
>>
>> With respect to the IP address usage. I think a diagram would help
>> especially the K8 case,
>>
> Drawing a diagram here is not feasible. Happy to do it on a whiteboard
> though.
>

Thanks - lets do that; I would like to clarify the addressing requirements
and full scope of
distributed/gateway router interconnects for K8s.


>
>
>> which I had heard in other conversations may have a separate gateway on
>> every HV ?. Hence, I would like to know what that means - i.e. were you
>> thinking to run separate gateways routers on every HV for K8 ?
>>
> Yes, thats the plan (as many as possible). 100 routers is a target. Not
> HV, but a VM.
>
>
>
>>
>> With respect to the other questions, I think its best approach would be to
>> ask direct questions so those
>> direct questions get answered.
>>
>> 1) With 1000 HVs, 1000 HVs/tenant, 1 distributed router per tenant, you
>> choose the number of gateways/tenant:
>>
>> a) How many Transit LS distributed datapaths are expected in total ?
>>
>
> One (i.e the same as the distributed router).
>


i.e.
1000 distributed routers => 1000 Transit LSs



>
>
>>
>> b) How many Transit LS logical ports are needed at the HV level  ?
>>
>> what I mean by that is lets say we have one additional logical port at
>> northd level and 1000 HVs then if we need to download that port to 1000
>> HVs, I consider that to be 1000 logical ports at the HV level because
>> downloading and maintaining state across HVs at scale is more expensive
>> than for a single hypervisor.
>>
>
> 1000 additional ones. It is the same as your distributed logical switch or
> logical router (this is the case even with the peer routers)
>

Did you mean 2000 including both ends of the Transit LS ?



>
>
>
>>
>> c) How many Transit LS arp resolve entries at the HV level ?
>>
>> what I mean by that is lets say we have one additional arp resolve flow at
>> northd level and 1000 HVs then if we need to download that arp resolve
>> flow
>> to 1000 HVs, I consider that to be 1000 flows at the HV level because
>> downloading and maintaining state across multiple HVs is more expensive
>> that a single hypervisor.
>>
>
> 2 ARP flows per transit LS * 1000 HVs.
>

oops; I underestimated by half



> Do realize that a single bridge on a single hypervisor typically has flows
> in the 100,000 range. Even a million is feasbile.
>

I know.
I am thinking about the coordination across many HVs.



> Microsegmentation use cases has 10000 ACLs per logical switch. So that is
> 10000 * 1000 for your case form single LS. So do you have some comparative
> perspectives.
>
>
>
> dev mailing list
>> dev@openvswitch.org
>> http://openvswitch.org/mailman/listinfo/dev
>>
>
>
Gurucharan Shetty May 13, 2016, 12:59 a.m. UTC | #14
>
>
> >>
> >> With respect to the other questions, I think its best approach would be
> to
> >> ask direct questions so those
> >> direct questions get answered.
> >>
> >> 1) With 1000 HVs, 1000 HVs/tenant, 1 distributed router per tenant, you
> >> choose the number of gateways/tenant:
> >>
> >> a) How many Transit LS distributed datapaths are expected in total ?
> >>
> >
> > One (i.e the same as the distributed router).
> >
>
>
> i.e.
> 1000 distributed routers => 1000 Transit LSs
>
Yes.

>
>
>
> >
> >
> >>
> >> b) How many Transit LS logical ports are needed at the HV level  ?
> >>
> >> what I mean by that is lets say we have one additional logical port at
> >> northd level and 1000 HVs then if we need to download that port to 1000
> >> HVs, I consider that to be 1000 logical ports at the HV level because
> >> downloading and maintaining state across HVs at scale is more expensive
> >> than for a single hypervisor.
> >>
> >
> > 1000 additional ones. It is the same as your distributed logical switch
> or
> > logical router (this is the case even with the peer routers)
> >
>
> Did you mean 2000 including both ends of the Transit LS ?
>

No. One end is only on the physical gateway to act as a physical endpoint.

>
>
>
> >
> >
> >
> >>
> >> c) How many Transit LS arp resolve entries at the HV level ?
> >>
> >> what I mean by that is lets say we have one additional arp resolve flow
> at
> >> northd level and 1000 HVs then if we need to download that arp resolve
> >> flow
> >> to 1000 HVs, I consider that to be 1000 flows at the HV level because
> >> downloading and maintaining state across multiple HVs is more expensive
> >> that a single hypervisor.
> >>
> >
> > 2 ARP flows per transit LS * 1000 HVs.
> >
>
> oops; I underestimated by half
>
>
>
> > Do realize that a single bridge on a single hypervisor typically has
> flows
> > in the 100,000 range. Even a million is feasbile.
> >
>
> I know.
> I am thinking about the coordination across many HVs.
>

There is no co-ordination. HV just downloads from ovn-sb. This is
absolutely not different than any of the other distributed datapaths that
we have. If introduction of one additional datapath is a problem, then OVN
has a problem in general because it then simply means that it can only do
one DR per logical topology. A transit LS is much less resource intensive
(as it consumes just one additional port) than a DR connected to another DR
(not GR) as peers (in this case have 2 additional ports per DR and then
whatever additional switch ports that are connected to it).

If the larger concern is about having 1000 tenants, then we need to pass
more hints to ovn-controller about interconnections so that it only
programs things relevant to local VMs and containers which are limited by
the number of CPUs and Memory and is usually in the order of 10s.


>
>
>
> > Microsegmentation use cases has 10000 ACLs per logical switch. So that is
> > 10000 * 1000 for your case form single LS. So do you have some
> comparative
> > perspectives.
> >
> >
> >
> > dev mailing list
> >> dev@openvswitch.org
> >> http://openvswitch.org/mailman/listinfo/dev
> >>
> >
> >
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
>
Flaviof May 17, 2016, 2:06 p.m. UTC | #15
On Fri, May 6, 2016 at 12:21 PM, Gurucharan Shetty <guru@ovn.org> wrote:

> Currently we can connect routers via "peer"ing. This limits
> the number of routers that can be connected with each other
> directly to 2.
>
> One of the design goals for L3 Gateway is to be able to
> have multiple gateways (each with their own router)
> connected to a distributed router via a switch.
>
> With the above goal in mind, this commit gives the general
> ability to connect multiple routers via a switch.
>
>
Very interesting work!
A minor suggestion/question on tests/ovn.at

Given that the test is a little long -- and I can see why it needs to be --
would it
be better if it was added in a separate [new] file, i.e. ovn-l3.at ? I ask
because I
see that this could be similar to what we have in regards to controller
vtep:
ovn-controller.at and ovn-controller-vtep.at

-- flaviof


Signed-off-by: Gurucharan Shetty <guru@ovn.org>
> ---
> This needs the following 2 commits under review to
> first go in.
> 1. ofproto-dpif: Rename "recurse" to "indentation".
> 2. ofproto-dpif: Do not count resubmit to later tables against limit.
> ---
>  ovn/controller/lflow.c  |   19 ++--
>  ovn/northd/ovn-northd.c |   56 ++++++++++-
>  ovn/ovn-nb.xml          |    7 --
>  tests/ovn.at            |  236
> +++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 299 insertions(+), 19 deletions(-)
>
> diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
> index 96b7c66..efc427d 100644
> --- a/ovn/controller/lflow.c
> +++ b/ovn/controller/lflow.c
> @@ -215,15 +215,15 @@ add_logical_flows(struct controller_ctx *ctx, const
> struct lport_index *lports,
>          if (is_switch(ldp)) {
>              /* For a logical switch datapath, local_datapaths tells us if
> there
>               * are any local ports for this datapath.  If not, we can skip
> -             * processing logical flows if the flow belongs to egress
> pipeline
> -             * or if that logical switch datapath is not patched to any
> logical
> -             * router.
> +             * processing logical flows if that logical switch datapath
> is not
> +             * patched to any logical router.
>               *
> -             * Otherwise, we still need the ingress pipeline because even
> if
> -             * there are no local ports, we still may need to execute the
> ingress
> -             * pipeline after a packet leaves a logical router.  Further
> -             * optimization is possible, but not based on what we know
> with
> -             * local_datapaths right now.
> +             * Otherwise, we still need both ingress and egress pipeline
> +             * because even if there are no local ports, we still may
> need to
> +             * execute the ingress pipeline after a packet leaves a
> logical
> +             * router and we need to do egress pipeline for a switch that
> +             * is connected to only routers.  Further optimization is
> possible,
> +             * but not based on what we know with local_datapaths right
> now.
>               *
>               * A better approach would be a kind of "flood fill"
> algorithm:
>               *
> @@ -242,9 +242,6 @@ add_logical_flows(struct controller_ctx *ctx, const
> struct lport_index *lports,
>               */
>
>              if (!get_local_datapath(local_datapaths, ldp->tunnel_key)) {
> -                if (!ingress) {
> -                    continue;
> -                }
>                  if (!get_patched_datapath(patched_datapaths,
>                                            ldp->tunnel_key)) {
>                      continue;
> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
> index 9e03606..3a29770 100644
> --- a/ovn/northd/ovn-northd.c
> +++ b/ovn/northd/ovn-northd.c
> @@ -2110,7 +2110,13 @@ build_lrouter_flows(struct hmap *datapaths, struct
> hmap *ports,
>                  free(actions);
>                  free(match);
>              }
> -        } else if (op->od->n_router_ports) {
> +        } else if (op->od->n_router_ports && strcmp(op->nbs->type,
> "router")) {
> +            /* This is a logical switch port that backs a VM or a
> container.
> +             * Extract its addresses. For each of the address, go through
> all
> +             * the router ports attached to the switch (to which this port
> +             * connects) and if the address in question is reachable from
> the
> +             * router port, add an ARP entry in that router's pipeline. */
> +
>              for (size_t i = 0; i < op->nbs->n_addresses; i++) {
>                  struct lport_addresses laddrs;
>                  if (!extract_lport_addresses(op->nbs->addresses[i],
> &laddrs,
> @@ -2158,8 +2164,56 @@ build_lrouter_flows(struct hmap *datapaths, struct
> hmap *ports,
>
>                  free(laddrs.ipv4_addrs);
>              }
> +        } else if (!strcmp(op->nbs->type, "router")) {
> +            /* This is a logical switch port that connects to a router. */
> +
> +            /* The peer of this switch port is the router port for which
> +             * we need to add logical flows such that it can resolve
> +             * ARP entries for all the other router ports connected to
> +             * the switch in question. */
> +
> +            const char *peer_name = smap_get(&op->nbs->options,
> +                                             "router-port");
> +            if (!peer_name) {
> +                continue;
> +            }
> +
> +            struct ovn_port *peer = ovn_port_find(ports, peer_name);
> +            if (!peer || !peer->nbr || !peer->ip) {
> +                continue;
> +            }
> +
> +            for (size_t j = 0; j < op->od->n_router_ports; j++) {
> +                const char *router_port_name = smap_get(
> +
> &op->od->router_ports[j]->nbs->options,
> +                                    "router-port");
> +                struct ovn_port *router_port = ovn_port_find(ports,
> +
>  router_port_name);
> +                if (!router_port || !router_port->nbr ||
> !router_port->ip) {
> +                    continue;
> +                }
> +
> +                /* Skip the router port under consideration. */
> +                if (router_port == peer) {
> +                   continue;
> +                }
> +
> +                if (!router_port->ip) {
> +                    continue;
> +                }
> +                char *match = xasprintf("outport == %s && reg0 == "IP_FMT,
> +                                        peer->json_key,
> +                                        IP_ARGS(router_port->ip));
> +                char *actions = xasprintf("eth.dst = "ETH_ADDR_FMT";
> next;",
> +
> ETH_ADDR_ARGS(router_port->mac));
> +                ovn_lflow_add(lflows, peer->od, S_ROUTER_IN_ARP_RESOLVE,
> +                              101, match, actions);
> +                free(actions);
> +                free(match);
> +            }
>          }
>      }
> +
>      HMAP_FOR_EACH (od, key_node, datapaths) {
>          if (!od->nbr) {
>              continue;
> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
> index c01455d..d7fd595 100644
> --- a/ovn/ovn-nb.xml
> +++ b/ovn/ovn-nb.xml
> @@ -154,13 +154,6 @@
>            These options apply when <ref column="type"/> is
> <code>router</code>.
>          </p>
>
> -        <p>
> -          If a given logical switch has multiple <code>router</code>
> ports, the
> -          <ref table="Logical_Router_Port"/> rows that they reference
> must be
> -          all on the same <ref table="Logical_Router"/> (for different
> -          subnets).
> -        </p>
> -
>          <column name="options" key="router-port">
>            Required.  The <ref column="name"/> of the <ref
>            table="Logical_Router_Port"/> to which this logical switch port
> is
> diff --git a/tests/ovn.at b/tests/ovn.at
> index b9d7ada..6961be0 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -2545,3 +2545,239 @@ OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>
>  AT_CLEANUP
>
> +AT_SETUP([ovn -- 2 HVs, 3 LRs connected via LS, static routes])
> +AT_KEYWORDS([ovnstaticroutes])
> +AT_SKIP_IF([test $HAVE_PYTHON = no])
> +ovn_start
> +
> +# Logical network:
> +# Three LRs - R1, R2 and R3 that are connected to each other via LS "join"
> +# in 20.0.0.0/24 network. R1 has switchess foo (192.168.1.0/24)
> +# connected to it. R2 has alice (172.16.1.0/24) and R3 has bob (
> 10.32.1.0/24)
> +# connected to it.
> +
> +ovn-nbctl create Logical_Router name=R1
> +ovn-nbctl create Logical_Router name=R2
> +ovn-nbctl create Logical_Router name=R3
> +
> +ovn-nbctl lswitch-add foo
> +ovn-nbctl lswitch-add alice
> +ovn-nbctl lswitch-add bob
> +ovn-nbctl lswitch-add join
> +
> +# Connect foo to R1
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=foo \
> +network=192.168.1.1/24 mac=\"00:00:01:01:02:03\" -- add Logical_Router
> R1 \
> +ports @lrp -- lport-add foo rp-foo
> +
> +ovn-nbctl set Logical_port rp-foo type=router options:router-port=foo \
> +addresses=\"00:00:01:01:02:03\"
> +
> +# Connect alice to R2
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=alice \
> +network=172.16.1.1/24 mac=\"00:00:02:01:02:03\" -- add Logical_Router R2
> \
> +ports @lrp -- lport-add alice rp-alice
> +
> +ovn-nbctl set Logical_port rp-alice type=router options:router-port=alice
> \
> +addresses=\"00:00:02:01:02:03\"
> +
> +# Connect bob to R3
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=bob \
> +network=10.32.1.1/24 mac=\"00:00:03:01:02:03\" -- add Logical_Router R3 \
> +ports @lrp -- lport-add bob rp-bob
> +
> +ovn-nbctl set Logical_port rp-bob type=router options:router-port=bob \
> +addresses=\"00:00:03:01:02:03\"
> +
> +# Connect R1 to join
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R1_join \
> +network=20.0.0.1/24 mac=\"00:00:04:01:02:03\" -- add Logical_Router R1 \
> +ports @lrp -- lport-add join r1-join
> +
> +ovn-nbctl set Logical_port r1-join type=router
> options:router-port=R1_join \
> +addresses='"00:00:04:01:02:03"'
> +
> +# Connect R2 to join
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R2_join \
> +network=20.0.0.2/24 mac=\"00:00:04:01:02:04\" -- add Logical_Router R2 \
> +ports @lrp -- lport-add join r2-join
> +
> +ovn-nbctl set Logical_port r2-join type=router
> options:router-port=R2_join \
> +addresses='"00:00:04:01:02:04"'
> +
> +
> +# Connect R3 to join
> +ovn-nbctl -- --id=@lrp create Logical_Router_port name=R3_join \
> +network=20.0.0.3/24 mac=\"00:00:04:01:02:05\" -- add Logical_Router R3 \
> +ports @lrp -- lport-add join r3-join
> +
> +ovn-nbctl set Logical_port r3-join type=router
> options:router-port=R3_join \
> +addresses='"00:00:04:01:02:05"'
> +
> +#install static routes
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
> +R1 static_routes @lrt
> +
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
> +R1 static_routes @lrt
> +
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
> +R2 static_routes @lrt
> +
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
> +R2 static_routes @lrt
> +
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
> +R3 static_routes @lrt
> +
> +ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
> +ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
> +R3 static_routes @lrt
> +
> +# Create logical port foo1 in foo
> +ovn-nbctl lport-add foo foo1 \
> +-- lport-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
> +
> +# Create logical port alice1 in alice
> +ovn-nbctl lport-add alice alice1 \
> +-- lport-set-addresses alice1 "f0:00:00:01:02:04 172.16.1.2"
> +
> +# Create logical port bob1 in bob
> +ovn-nbctl lport-add bob bob1 \
> +-- lport-set-addresses bob1 "f0:00:00:01:02:05 10.32.1.2"
> +
> +# Create two hypervisor and create OVS ports corresponding to logical
> ports.
> +net_add n1
> +
> +sim_add hv1
> +as hv1
> +ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.1
> +ovs-vsctl -- add-port br-int hv1-vif1 -- \
> +    set interface hv1-vif1 external-ids:iface-id=foo1 \
> +    options:tx_pcap=hv1/vif1-tx.pcap \
> +    options:rxq_pcap=hv1/vif1-rx.pcap \
> +    ofport-request=1
> +
> +ovs-vsctl -- add-port br-int hv1-vif2 -- \
> +    set interface hv1-vif2 external-ids:iface-id=alice1 \
> +    options:tx_pcap=hv1/vif2-tx.pcap \
> +    options:rxq_pcap=hv1/vif2-rx.pcap \
> +    ofport-request=2
> +
> +sim_add hv2
> +as hv2
> +ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.2
> +ovs-vsctl -- add-port br-int hv2-vif1 -- \
> +    set interface hv2-vif1 external-ids:iface-id=bob1 \
> +    options:tx_pcap=hv2/vif1-tx.pcap \
> +    options:rxq_pcap=hv2/vif1-rx.pcap \
> +    ofport-request=1
> +
> +
> +# Pre-populate the hypervisors' ARP tables so that we don't lose any
> +# packets for ARP resolution (native tunneling doesn't queue packets
> +# for ARP resolution).
> +ovn_populate_arp
> +
> +# Allow some time for ovn-northd and ovn-controller to catch up.
> +# XXX This should be more systematic.
> +sleep 1
> +
> +ip_to_hex() {
> +    printf "%02x%02x%02x%02x" "$@"
> +}
> +trim_zeros() {
> +    sed 's/\(00\)\{1,\}$//'
> +}
> +
> +# Send ip packets between foo1 and alice1
> +src_mac="f00000010203"
> +dst_mac="000001010203"
> +src_ip=`ip_to_hex 192 168 1 2`
> +dst_ip=`ip_to_hex 172 16 1 2`
>
> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
> +as hv1 ovs-appctl ofproto/trace br-int in_port=1 $packet
> +
> +# Send ip packets between foo1 and bob1
> +src_mac="f00000010203"
> +dst_mac="000001010203"
> +src_ip=`ip_to_hex 192 168 1 2`
> +dst_ip=`ip_to_hex 10 32 1 2`
>
> +packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
> +as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
> +
> +echo "---------NB dump-----"
> +ovn-nbctl show
> +echo "---------------------"
> +ovn-nbctl list logical_router
> +echo "---------------------"
> +ovn-nbctl list logical_router_port
> +echo "---------------------"
> +
> +echo "---------SB dump-----"
> +ovn-sbctl list datapath_binding
> +echo "---------------------"
> +ovn-sbctl list port_binding
> +echo "---------------------"
> +ovn-sbctl dump-flows
> +echo "---------------------"
> +
> +echo "------ hv1 dump ----------"
> +as hv1 ovs-ofctl show br-int
> +as hv1 ovs-ofctl dump-flows br-int
> +echo "------ hv2 dump ----------"
> +as hv2 ovs-ofctl show br-int
> +as hv2 ovs-ofctl dump-flows br-int
> +echo "----------------------------"
> +
> +# Packet to Expect at bob1
> +src_mac="000003010203"
> +dst_mac="f00000010205"
> +src_ip=`ip_to_hex 192 168 1 2`
> +dst_ip=`ip_to_hex 10 32 1 2`
>
> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
> +
> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/vif1-tx.pcap |
> trim_zeros > received.packets
> +echo $expected | trim_zeros > expout
> +AT_CHECK([cat received.packets], [0], [expout])
> +
> +# Packet to Expect at alice1
> +src_mac="000002010203"
> +dst_mac="f00000010204"
> +src_ip=`ip_to_hex 192 168 1 2`
> +dst_ip=`ip_to_hex 172 16 1 2`
>
> +expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
> +
> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap |
> trim_zeros > received1.packets
> +echo $expected | trim_zeros > expout
> +AT_CHECK([cat received1.packets], [0], [expout])
> +
> +for sim in hv1 hv2; do
> +    as $sim
> +    OVS_APP_EXIT_AND_WAIT([ovn-controller])
> +    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
> +    OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +done
> +
> +as ovn-sb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as ovn-nb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as northd
> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> +
> +as main
> +OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +AT_CLEANUP
> --
> 1.7.9.5
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
>
Darrell Ball May 21, 2016, 6:21 p.m. UTC | #16
On Thu, May 12, 2016 at 5:59 PM, Guru Shetty <guru@ovn.org> wrote:

>
>> >>
>> >> With respect to the other questions, I think its best approach would
>> be to
>> >> ask direct questions so those
>> >> direct questions get answered.
>> >>
>> >> 1) With 1000 HVs, 1000 HVs/tenant, 1 distributed router per tenant, you
>> >> choose the number of gateways/tenant:
>> >>
>> >> a) How many Transit LS distributed datapaths are expected in total ?
>> >>
>> >
>> > One (i.e the same as the distributed router).
>> >
>>
>>
>> i.e.
>> 1000 distributed routers => 1000 Transit LSs
>>
> Yes.
>
>>
>>
>>
>> >
>> >
>> >>
>> >> b) How many Transit LS logical ports are needed at the HV level  ?
>> >>
>> >> what I mean by that is lets say we have one additional logical port at
>> >> northd level and 1000 HVs then if we need to download that port to 1000
>> >> HVs, I consider that to be 1000 logical ports at the HV level because
>> >> downloading and maintaining state across HVs at scale is more expensive
>> >> than for a single hypervisor.
>> >>
>> >
>> > 1000 additional ones. It is the same as your distributed logical switch
>> or
>> > logical router (this is the case even with the peer routers)
>> >
>>
>> Did you mean 2000 including both ends of the Transit LS ?
>>
>
> No. One end is only on the physical gateway to act as a physical endpoint.
>
>>
>>
>>
>> >
>> >
>> >
>> >>
>> >> c) How many Transit LS arp resolve entries at the HV level ?
>> >>
>> >> what I mean by that is lets say we have one additional arp resolve
>> flow at
>> >> northd level and 1000 HVs then if we need to download that arp resolve
>> >> flow
>> >> to 1000 HVs, I consider that to be 1000 flows at the HV level because
>> >> downloading and maintaining state across multiple HVs is more expensive
>> >> that a single hypervisor.
>> >>
>> >
>> > 2 ARP flows per transit LS * 1000 HVs.
>> >
>>
>> oops; I underestimated by half
>>
>>
>>
>> > Do realize that a single bridge on a single hypervisor typically has
>> flows
>> > in the 100,000 range. Even a million is feasbile.
>> >
>>
>> I know.
>> I am thinking about the coordination across many HVs.
>>
>
> There is no co-ordination. HV just downloads from ovn-sb. This is
> absolutely not different than any of the other distributed datapaths that
> we have. If introduction of one additional datapath is a problem, then OVN
> has a problem in general because it then simply means that it can only do
> one DR per logical topology.
>

Coordination here does not mean strict synchronization.
Rather, something like this:
https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=coordination




> A transit LS is much less resource intensive (as it consumes just one
> additional port) than a DR connected to another DR (not GR) as peers (in
> this case have 2 additional ports per DR and then whatever additional
> switch ports that are connected to it).
>

The apples to apples comparison that I am looking at is:

DR<->Transit LS<->GR
vs
DR<->GR

I looked at this from efficiency POV and manageability POV.

I'll send an update on a combined thread.




>
> If the larger concern is about having 1000 tenants, then we need to pass
> more hints to ovn-controller about interconnections so that it only
> programs things relevant to local VMs and containers which are limited by
> the number of CPUs and Memory and is usually in the order of 10s.
>

This is a somewhat orthogonal topic.
However, I am working on some limiting scope of DR flows. I started that
with "local_router",
but it breaks VM scheduling for Openstack :-). I thought about tracking a
DR to those LSs it services instead and that seems that is also done in
other products, so that would be more viable albeit
slightly more complicated from internal OVN POV.
Some other folks may be looking at this as well.


>
>
>>
>>
>>
>> > Microsegmentation use cases has 10000 ACLs per logical switch. So that
>> is
>> > 10000 * 1000 for your case form single LS. So do you have some
>> comparative
>> > perspectives.
>
> >
>> >
>> >
>> > dev mailing list
>> >> dev@openvswitch.org
>> >> http://openvswitch.org/mailman/listinfo/dev
>> >>
>> >
>> >
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> http://openvswitch.org/mailman/listinfo/dev
>>
>
>
diff mbox

Patch

diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
index 96b7c66..efc427d 100644
--- a/ovn/controller/lflow.c
+++ b/ovn/controller/lflow.c
@@ -215,15 +215,15 @@  add_logical_flows(struct controller_ctx *ctx, const struct lport_index *lports,
         if (is_switch(ldp)) {
             /* For a logical switch datapath, local_datapaths tells us if there
              * are any local ports for this datapath.  If not, we can skip
-             * processing logical flows if the flow belongs to egress pipeline
-             * or if that logical switch datapath is not patched to any logical
-             * router.
+             * processing logical flows if that logical switch datapath is not
+             * patched to any logical router.
              *
-             * Otherwise, we still need the ingress pipeline because even if
-             * there are no local ports, we still may need to execute the ingress
-             * pipeline after a packet leaves a logical router.  Further
-             * optimization is possible, but not based on what we know with
-             * local_datapaths right now.
+             * Otherwise, we still need both ingress and egress pipeline
+             * because even if there are no local ports, we still may need to
+             * execute the ingress pipeline after a packet leaves a logical
+             * router and we need to do egress pipeline for a switch that
+             * is connected to only routers.  Further optimization is possible,
+             * but not based on what we know with local_datapaths right now.
              *
              * A better approach would be a kind of "flood fill" algorithm:
              *
@@ -242,9 +242,6 @@  add_logical_flows(struct controller_ctx *ctx, const struct lport_index *lports,
              */
 
             if (!get_local_datapath(local_datapaths, ldp->tunnel_key)) {
-                if (!ingress) {
-                    continue;
-                }
                 if (!get_patched_datapath(patched_datapaths,
                                           ldp->tunnel_key)) {
                     continue;
diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
index 9e03606..3a29770 100644
--- a/ovn/northd/ovn-northd.c
+++ b/ovn/northd/ovn-northd.c
@@ -2110,7 +2110,13 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
                 free(actions);
                 free(match);
             }
-        } else if (op->od->n_router_ports) {
+        } else if (op->od->n_router_ports && strcmp(op->nbs->type, "router")) {
+            /* This is a logical switch port that backs a VM or a container.
+             * Extract its addresses. For each of the address, go through all
+             * the router ports attached to the switch (to which this port
+             * connects) and if the address in question is reachable from the
+             * router port, add an ARP entry in that router's pipeline. */
+
             for (size_t i = 0; i < op->nbs->n_addresses; i++) {
                 struct lport_addresses laddrs;
                 if (!extract_lport_addresses(op->nbs->addresses[i], &laddrs,
@@ -2158,8 +2164,56 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
 
                 free(laddrs.ipv4_addrs);
             }
+        } else if (!strcmp(op->nbs->type, "router")) {
+            /* This is a logical switch port that connects to a router. */
+
+            /* The peer of this switch port is the router port for which
+             * we need to add logical flows such that it can resolve
+             * ARP entries for all the other router ports connected to
+             * the switch in question. */
+
+            const char *peer_name = smap_get(&op->nbs->options,
+                                             "router-port");
+            if (!peer_name) {
+                continue;
+            }
+
+            struct ovn_port *peer = ovn_port_find(ports, peer_name);
+            if (!peer || !peer->nbr || !peer->ip) {
+                continue;
+            }
+
+            for (size_t j = 0; j < op->od->n_router_ports; j++) {
+                const char *router_port_name = smap_get(
+                                    &op->od->router_ports[j]->nbs->options,
+                                    "router-port");
+                struct ovn_port *router_port = ovn_port_find(ports,
+                                                             router_port_name);
+                if (!router_port || !router_port->nbr || !router_port->ip) {
+                    continue;
+                }
+
+                /* Skip the router port under consideration. */
+                if (router_port == peer) {
+                   continue;
+                }
+
+                if (!router_port->ip) {
+                    continue;
+                }
+                char *match = xasprintf("outport == %s && reg0 == "IP_FMT,
+                                        peer->json_key,
+                                        IP_ARGS(router_port->ip));
+                char *actions = xasprintf("eth.dst = "ETH_ADDR_FMT"; next;",
+                                          ETH_ADDR_ARGS(router_port->mac));
+                ovn_lflow_add(lflows, peer->od, S_ROUTER_IN_ARP_RESOLVE,
+                              101, match, actions);
+                free(actions);
+                free(match);
+            }
         }
     }
+
     HMAP_FOR_EACH (od, key_node, datapaths) {
         if (!od->nbr) {
             continue;
diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
index c01455d..d7fd595 100644
--- a/ovn/ovn-nb.xml
+++ b/ovn/ovn-nb.xml
@@ -154,13 +154,6 @@ 
           These options apply when <ref column="type"/> is <code>router</code>.
         </p>
 
-        <p>
-          If a given logical switch has multiple <code>router</code> ports, the
-          <ref table="Logical_Router_Port"/> rows that they reference must be
-          all on the same <ref table="Logical_Router"/> (for different
-          subnets).
-        </p>
-
         <column name="options" key="router-port">
           Required.  The <ref column="name"/> of the <ref
           table="Logical_Router_Port"/> to which this logical switch port is
diff --git a/tests/ovn.at b/tests/ovn.at
index b9d7ada..6961be0 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -2545,3 +2545,239 @@  OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 AT_CLEANUP
 
+AT_SETUP([ovn -- 2 HVs, 3 LRs connected via LS, static routes])
+AT_KEYWORDS([ovnstaticroutes])
+AT_SKIP_IF([test $HAVE_PYTHON = no])
+ovn_start
+
+# Logical network:
+# Three LRs - R1, R2 and R3 that are connected to each other via LS "join"
+# in 20.0.0.0/24 network. R1 has switchess foo (192.168.1.0/24)
+# connected to it. R2 has alice (172.16.1.0/24) and R3 has bob (10.32.1.0/24)
+# connected to it.
+
+ovn-nbctl create Logical_Router name=R1
+ovn-nbctl create Logical_Router name=R2
+ovn-nbctl create Logical_Router name=R3
+
+ovn-nbctl lswitch-add foo
+ovn-nbctl lswitch-add alice
+ovn-nbctl lswitch-add bob
+ovn-nbctl lswitch-add join
+
+# Connect foo to R1
+ovn-nbctl -- --id=@lrp create Logical_Router_port name=foo \
+network=192.168.1.1/24 mac=\"00:00:01:01:02:03\" -- add Logical_Router R1 \
+ports @lrp -- lport-add foo rp-foo
+
+ovn-nbctl set Logical_port rp-foo type=router options:router-port=foo \
+addresses=\"00:00:01:01:02:03\"
+
+# Connect alice to R2
+ovn-nbctl -- --id=@lrp create Logical_Router_port name=alice \
+network=172.16.1.1/24 mac=\"00:00:02:01:02:03\" -- add Logical_Router R2 \
+ports @lrp -- lport-add alice rp-alice
+
+ovn-nbctl set Logical_port rp-alice type=router options:router-port=alice \
+addresses=\"00:00:02:01:02:03\"
+
+# Connect bob to R3
+ovn-nbctl -- --id=@lrp create Logical_Router_port name=bob \
+network=10.32.1.1/24 mac=\"00:00:03:01:02:03\" -- add Logical_Router R3 \
+ports @lrp -- lport-add bob rp-bob
+
+ovn-nbctl set Logical_port rp-bob type=router options:router-port=bob \
+addresses=\"00:00:03:01:02:03\"
+
+# Connect R1 to join
+ovn-nbctl -- --id=@lrp create Logical_Router_port name=R1_join \
+network=20.0.0.1/24 mac=\"00:00:04:01:02:03\" -- add Logical_Router R1 \
+ports @lrp -- lport-add join r1-join
+
+ovn-nbctl set Logical_port r1-join type=router options:router-port=R1_join \
+addresses='"00:00:04:01:02:03"'
+
+# Connect R2 to join
+ovn-nbctl -- --id=@lrp create Logical_Router_port name=R2_join \
+network=20.0.0.2/24 mac=\"00:00:04:01:02:04\" -- add Logical_Router R2 \
+ports @lrp -- lport-add join r2-join
+
+ovn-nbctl set Logical_port r2-join type=router options:router-port=R2_join \
+addresses='"00:00:04:01:02:04"'
+
+
+# Connect R3 to join
+ovn-nbctl -- --id=@lrp create Logical_Router_port name=R3_join \
+network=20.0.0.3/24 mac=\"00:00:04:01:02:05\" -- add Logical_Router R3 \
+ports @lrp -- lport-add join r3-join
+
+ovn-nbctl set Logical_port r3-join type=router options:router-port=R3_join \
+addresses='"00:00:04:01:02:05"'
+
+#install static routes
+ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
+ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
+R1 static_routes @lrt
+
+ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
+ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
+R1 static_routes @lrt
+
+ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
+ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
+R2 static_routes @lrt
+
+ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
+ip_prefix=10.32.1.0/24 nexthop=20.0.0.3 -- add Logical_Router \
+R2 static_routes @lrt
+
+ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
+ip_prefix=192.168.1.0/24 nexthop=20.0.0.1 -- add Logical_Router \
+R3 static_routes @lrt
+
+ovn-nbctl -- --id=@lrt create Logical_Router_Static_Route \
+ip_prefix=172.16.1.0/24 nexthop=20.0.0.2 -- add Logical_Router \
+R3 static_routes @lrt
+
+# Create logical port foo1 in foo
+ovn-nbctl lport-add foo foo1 \
+-- lport-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
+
+# Create logical port alice1 in alice
+ovn-nbctl lport-add alice alice1 \
+-- lport-set-addresses alice1 "f0:00:00:01:02:04 172.16.1.2"
+
+# Create logical port bob1 in bob
+ovn-nbctl lport-add bob bob1 \
+-- lport-set-addresses bob1 "f0:00:00:01:02:05 10.32.1.2"
+
+# Create two hypervisor and create OVS ports corresponding to logical ports.
+net_add n1
+
+sim_add hv1
+as hv1
+ovs-vsctl add-br br-phys
+ovn_attach n1 br-phys 192.168.0.1
+ovs-vsctl -- add-port br-int hv1-vif1 -- \
+    set interface hv1-vif1 external-ids:iface-id=foo1 \
+    options:tx_pcap=hv1/vif1-tx.pcap \
+    options:rxq_pcap=hv1/vif1-rx.pcap \
+    ofport-request=1
+
+ovs-vsctl -- add-port br-int hv1-vif2 -- \
+    set interface hv1-vif2 external-ids:iface-id=alice1 \
+    options:tx_pcap=hv1/vif2-tx.pcap \
+    options:rxq_pcap=hv1/vif2-rx.pcap \
+    ofport-request=2
+
+sim_add hv2
+as hv2
+ovs-vsctl add-br br-phys
+ovn_attach n1 br-phys 192.168.0.2
+ovs-vsctl -- add-port br-int hv2-vif1 -- \
+    set interface hv2-vif1 external-ids:iface-id=bob1 \
+    options:tx_pcap=hv2/vif1-tx.pcap \
+    options:rxq_pcap=hv2/vif1-rx.pcap \
+    ofport-request=1
+
+
+# Pre-populate the hypervisors' ARP tables so that we don't lose any
+# packets for ARP resolution (native tunneling doesn't queue packets
+# for ARP resolution).
+ovn_populate_arp
+
+# Allow some time for ovn-northd and ovn-controller to catch up.
+# XXX This should be more systematic.
+sleep 1
+
+ip_to_hex() {
+    printf "%02x%02x%02x%02x" "$@"
+}
+trim_zeros() {
+    sed 's/\(00\)\{1,\}$//'
+}
+
+# Send ip packets between foo1 and alice1
+src_mac="f00000010203"
+dst_mac="000001010203"
+src_ip=`ip_to_hex 192 168 1 2`
+dst_ip=`ip_to_hex 172 16 1 2`
+packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
+as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
+as hv1 ovs-appctl ofproto/trace br-int in_port=1 $packet
+
+# Send ip packets between foo1 and bob1
+src_mac="f00000010203"
+dst_mac="000001010203"
+src_ip=`ip_to_hex 192 168 1 2`
+dst_ip=`ip_to_hex 10 32 1 2`
+packet=${dst_mac}${src_mac}08004500001c0000000040110000${src_ip}${dst_ip}0035111100080000
+as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
+
+echo "---------NB dump-----"
+ovn-nbctl show
+echo "---------------------"
+ovn-nbctl list logical_router
+echo "---------------------"
+ovn-nbctl list logical_router_port
+echo "---------------------"
+
+echo "---------SB dump-----"
+ovn-sbctl list datapath_binding
+echo "---------------------"
+ovn-sbctl list port_binding
+echo "---------------------"
+ovn-sbctl dump-flows
+echo "---------------------"
+
+echo "------ hv1 dump ----------"
+as hv1 ovs-ofctl show br-int
+as hv1 ovs-ofctl dump-flows br-int
+echo "------ hv2 dump ----------"
+as hv2 ovs-ofctl show br-int
+as hv2 ovs-ofctl dump-flows br-int
+echo "----------------------------"
+
+# Packet to Expect at bob1
+src_mac="000003010203"
+dst_mac="f00000010205"
+src_ip=`ip_to_hex 192 168 1 2`
+dst_ip=`ip_to_hex 10 32 1 2`
+expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
+
+$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/vif1-tx.pcap | trim_zeros > received.packets
+echo $expected | trim_zeros > expout
+AT_CHECK([cat received.packets], [0], [expout])
+
+# Packet to Expect at alice1
+src_mac="000002010203"
+dst_mac="f00000010204"
+src_ip=`ip_to_hex 192 168 1 2`
+dst_ip=`ip_to_hex 172 16 1 2`
+expected=${dst_mac}${src_mac}08004500001c000000003e110200${src_ip}${dst_ip}0035111100080000
+
+$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap | trim_zeros > received1.packets
+echo $expected | trim_zeros > expout
+AT_CHECK([cat received1.packets], [0], [expout])
+
+for sim in hv1 hv2; do
+    as $sim
+    OVS_APP_EXIT_AND_WAIT([ovn-controller])
+    OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
+    OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+done
+
+as ovn-sb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as ovn-nb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as northd
+OVS_APP_EXIT_AND_WAIT([ovn-northd])
+
+as main
+OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+AT_CLEANUP