[ovs-dev,v2,1/3] Add multipath static router in OVN northd and north-db

Message ID 20170926095238.28826-1-sysugaozhenyu@gmail.com
State Deferred

Commit Message

Gao Zhenyu Sept. 26, 2017, 9:52 a.m. UTC
1. Update the output_port column in ovn-nb.ovsschema: raise the maximum
number of entries from 1 to unlimited.
2. Add a multipath feature to ovn-northd. When a route's output_port
contains two or more ports, northd generates multipath flows that
dispatch traffic based on the packet's IP dst address.
3. Add a new table (lr_in_multipath) to ovn-northd's router ingress stages
to dispatch traffic to ports.
4. Add a multipath flow to Table 5 (lr_in_ip_routing) that stores the hash
result in reg0. reg9[2] marks packets that need dispatching.
5. Describe the multipath feature in ovn/northd/ovn-northd.8.xml
and ovn/ovn-nb.xml.
6. Update ovn-nbctl.c to handle configuring multiple output_port values.

Signed-off-by: Zhenyu Gao <sysugaozhenyu@gmail.com>
---
 ovn/northd/ovn-northd.8.xml |  67 +++++++++++-
 ovn/northd/ovn-northd.c     | 257 +++++++++++++++++++++++++++++++++++++-------
 ovn/ovn-nb.ovsschema        |   7 +-
 ovn/ovn-nb.xml              |   4 +
 ovn/utilities/ovn-nbctl.c   |  28 +++--
 5 files changed, 311 insertions(+), 52 deletions(-)
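
The dispatch described in items 2 to 4 can be sketched in plain C as a two-step lookup: hash the destination address down to a link index (the value the patch stores in reg0), then use that index to pick a port (the job of lr_in_multipath). This is only a minimal sketch under assumptions: struct mp_port, mp_hash_ip, mp_select_link and mp_dispatch are hypothetical names, and the hash is a stand-in, not the hash that the OVS multipath action computes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one candidate output port of a route. */
struct mp_port {
    const char *name;           /* Logical router port name, e.g. "lrp1". */
};

/* Illustrative integer hash only.  This is NOT the hash used by the OVS
 * multipath action; it just spreads addresses deterministically. */
static uint32_t
mp_hash_ip(uint32_t ip_dst)
{
    uint32_t h = ip_dst;
    h ^= h >> 16;
    h *= 0x45d9f3bU;
    h ^= h >> 16;
    return h;
}

/* Step 1 (table 5 in the patch): reduce the hash to a link index in
 * [0, n_links), analogous to the modulo_n algorithm writing reg0. */
static uint32_t
mp_select_link(uint32_t ip_dst, uint32_t n_links)
{
    assert(n_links > 0);
    return mp_hash_ip(ip_dst) % n_links;
}

/* Step 2 (table 6 in the patch): map the link index to an output port. */
static const struct mp_port *
mp_dispatch(const struct mp_port *ports, uint32_t n_links, uint32_t ip_dst)
{
    return &ports[mp_select_link(ip_dst, n_links)];
}
```

Because the index depends only on the destination address, every packet to a given destination leaves through the same port; how evenly traffic spreads across ports depends on how many distinct destinations are in use.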

Comments

Gao Zhenyu Oct. 8, 2017, 9:42 a.m. UTC | #1
Comments and suggestions are welcome :)

Thanks
Zhenyu Gao

2017-09-26 17:52 GMT+08:00 Zhenyu Gao <sysugaozhenyu@gmail.com>:

> diff --git a/ovn/northd/ovn-northd.8.xml b/ovn/northd/ovn-northd.8.xml
> index 0d85ec0..b1ce9a9 100644
> --- a/ovn/northd/ovn-northd.8.xml
> +++ b/ovn/northd/ovn-northd.8.xml
> @@ -1598,6 +1598,9 @@ icmp4 {
>        port (ingress table <code>ARP Request</code> will generate an ARP
>        request, if needed, with <code>reg0</code> as the target protocol
>        address and <code>reg1</code> as the source protocol address).
> +      An IP route can be configured with multiple paths to the next hop.
> +      If a packet has multiple paths to its destination, OVN stores the
> +      port index in reg0 to indicate the packet's output port in table 6.
>      </p>
>
>      <p>
> @@ -1617,6 +1620,28 @@ icmp4 {
>
>        <li>
>          <p>
> +          IPv4/IPv6 multipath routing table. For each route to IPv4/IPv6
> +          network <var>N</var> with netmask <var>M</var>, on multipath port
> +          <var>P</var> with IP address <var>A</var> and Ethernet
> +          address <var>E</var>, a logical flow with match
> +          <code>ip4.dst == <var>N</var>/<var>M</var></code>, whose priority
> +          is the number of 1-bits in <var>M</var> plus 10,
> +          has the following actions:
> +        </p>
> +
> +        <pre>
> +ip.ttl--;
> +multipath (nw_dst, 0, modulo_n, <var>n_links</var>, 0, reg0);
> +reg9[2] = 1;
> +next;
> +        </pre>
> +        <p>
> +          <var>n_links</var> is the number of multipath ports.
> +        </p>
> +      </li>
> +
> +      <li>
> +        <p>
>            IPv4 routing table.  For each route to IPv4 network <var>N</var> with
>            netmask <var>M</var>, on router port <var>P</var> with IP address
>            <var>A</var> and Ethernet
> @@ -1686,7 +1711,43 @@ next;
>        </li>
>      </ul>
>
> -    <h3>Ingress Table 6: ARP/ND Resolution</h3>
> +    <h3>Ingress Table 6: Multipath</h3>
> +    <p>
> +      Any packet that reaches this table with reg9[2] = 1 is an IP packet
> +      that matched a multipath route. The following flows route it to the
> +      corresponding port by consuming the link index stored in reg0.
> +    </p>
> +
> +    <ul>
> +      <li>
> +        <p>
> +          For a packet with netmask <var>M</var>, IP address <var>A</var> and
> +          <code>reg9[2] = 1</code>, a flow whose priority is above 1 has the
> +          following actions:
> +        </p>
> +
> +        <pre>
> +reg0 = <var>G</var>;
> +reg1 = <var>A</var>;
> +eth.src = <var>E</var>;
> +outport = <var>P</var>;
> +flags.loopback = 1;
> +next;
> +        </pre>
> +
> +        <p>
> +          <var>G</var> is the gateway IP address. <var>A</var>, <var>E</var>
> +          and <var>P</var> are the values that were described in multipath
> +          routing in table 5.
> +        </p>
> +
> +        <p>
> +          A priority-0 logical flow with match <code>1</code> has actions <code>next;</code>.
> +        </p>
> +      </li>
> +    </ul>
> +
> +    <h3>Ingress Table 7: ARP/ND Resolution</h3>
>
>      <p>
>        Any packet that reaches this table is an IP packet whose next-hop
> @@ -1779,7 +1840,7 @@ next;
>        </li>
>      </ul>
>
> -    <h3>Ingress Table 7: Gateway Redirect</h3>
> +    <h3>Ingress Table 8: Gateway Redirect</h3>
>
>      <p>
>        For distributed logical routers where one of the logical router
> @@ -1836,7 +1897,7 @@ next;
>        </li>
>      </ul>
>
> -    <h3>Ingress Table 8: ARP Request</h3>
> +    <h3>Ingress Table 9: ARP Request</h3>
>
>      <p>
>        In the common case where the Ethernet destination has been resolved, this
> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
> index 49e4ac3..f8bfee2 100644
> --- a/ovn/northd/ovn-northd.c
> +++ b/ovn/northd/ovn-northd.c
> @@ -135,9 +135,10 @@ enum ovn_stage {
>      PIPELINE_STAGE(ROUTER, IN,  UNSNAT,      3, "lr_in_unsnat")       \
>      PIPELINE_STAGE(ROUTER, IN,  DNAT,        4, "lr_in_dnat")         \
>      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,  5, "lr_in_ip_routing")   \
> -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 6, "lr_in_arp_resolve")  \
> -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT, 7, "lr_in_gw_redirect")  \
> -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 8, "lr_in_arp_request")  \
> +    PIPELINE_STAGE(ROUTER, IN,  MULTIPATH,   6, "lr_in_multipath")    \
> +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 7, "lr_in_arp_resolve")  \
> +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT, 8, "lr_in_gw_redirect")  \
> +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 9, "lr_in_arp_request")  \
>                                                                        \
>      /* Logical router egress stages. */                               \
>      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
> @@ -173,6 +174,11 @@ enum ovn_stage {
>   * one of the logical router's own IP addresses. */
>  #define REGBIT_EGRESS_LOOPBACK  "reg9[1]"
>
> +/* Indicates that the multipath action has processed this packet and stored
> + * the hash result in another regX.  The hash result should be consumed to
> + * determine the right output port. */
> +#define REGBIT_MULTIPATH "reg9[2]"
> +
>  /* Returns an "enum ovn_stage" built from the arguments. */
>  static enum ovn_stage
>  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
> @@ -4142,82 +4148,178 @@ add_route(struct hmap *lflows, const struct ovn_port *op,
>  }
>
>  static void
> -build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
> -                        struct hmap *ports,
> -                        const struct nbrec_logical_router_static_route *route)
> +add_multipath_route(struct hmap *lflows, uint32_t port_num,
> +                    struct ovn_port **out_ports,
> +                    const char **lrp_addr_s,
> +                    struct ovn_datapath *od,
> +                    const char *network_s, int plen,
> +                    const char *gateway, const char *policy)
> +{
> +    bool is_ipv4 = strchr(network_s, '.') ? true : false;
> +    struct ds match = DS_EMPTY_INITIALIZER;
> +    const char *dir;
> +    uint16_t priority;
> +
> +    if (policy && !strcmp(policy, "src-ip")) {
> +        dir = "src";
> +        priority = plen * 2;
> +    } else {
> +        dir = "dst";
> +        priority = (plen * 2) + 1;
> +    }
> +
> +    ds_put_format(&match, "ip%s.%s == %s/%d", is_ipv4 ? "4" : "6", dir,
> +                  network_s, plen);
> +
> +    struct ds actions = DS_EMPTY_INITIALIZER;
> +
> +    ds_put_format(&actions, "ip.ttl--; ");
> +    ds_put_format(&actions,
> +                  "multipath (nw_dst, 0, modulo_n, %u, 0, reg0); "
> +                  "%s = 1; "
> +                  "next;",
> +                  port_num, REGBIT_MULTIPATH);
> +
> +    /* The priority here is calculated to implement longest-prefix-match
> +     * routing. */
> +    ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
> +                  ds_cstr(&match), ds_cstr(&actions));
> +
> +    for (int i = 0; i < port_num; i++) {
> +        struct ds mp_match = DS_EMPTY_INITIALIZER;
> +        struct ds mp_actions = DS_EMPTY_INITIALIZER;
> +
> +        ds_put_format(&mp_match, "%s == 1 && reg0 == %d && ",
> +                      REGBIT_MULTIPATH, i);
> +        ds_put_format(&mp_match, "ip%s.%s == %s/%d",
> +                      is_ipv4 ? "4" : "6", dir,
> +                      network_s, plen);
> +
> +        ds_put_format(&mp_actions, "%sreg0 = ", is_ipv4 ? "" : "xx");
> +        if (gateway) {
> +            ds_put_cstr(&mp_actions, gateway);
> +        } else {
> +            ds_put_format(&mp_actions, "ip%s.dst", is_ipv4 ? "4" : "6");
> +        }
> +
> +        ds_put_format(&mp_actions, "; "
> +                      "%sreg1 = %s; "
> +                      "eth.src = %s; "
> +                      "outport = %s; "
> +                      "flags.loopback = 1; "
> +                      "next;",
> +                      is_ipv4 ? "" : "xx",
> +                      lrp_addr_s[i],
> +                      out_ports[i]->lrp_networks.ea_s,
> +                      out_ports[i]->json_key);
> +
> +        /* Add flow in table 6 to determine the right output port
> +         * for this traffic. */
> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_MULTIPATH, priority,
> +                      ds_cstr(&mp_match), ds_cstr(&mp_actions));
> +        ds_destroy(&mp_match);
> +        ds_destroy(&mp_actions);
> +    }
> +    ds_destroy(&match);
> +    ds_destroy(&actions);
> +}
> +
> +static bool
> +verify_nexthop_prefix(const struct nbrec_logical_router_static_route *route,
> +                      bool *is_ipv4, char **prefix_s, unsigned int *plen)
>  {
>      ovs_be32 nexthop;
> -    const char *lrp_addr_s = NULL;
> -    unsigned int plen;
> -    bool is_ipv4;
>
>      /* Verify that the next hop is an IP address with an all-ones mask. */
> -    char *error = ip_parse_cidr(route->nexthop, &nexthop, &plen);
> +    char *error = ip_parse_cidr(route->nexthop, &nexthop, plen);
>      if (!error) {
> -        if (plen != 32) {
> +        if (*plen != 32) {
>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
>              VLOG_WARN_RL(&rl, "bad next hop mask %s", route->nexthop);
> -            return;
> +            return false;
>          }
> -        is_ipv4 = true;
> +        *is_ipv4 = true;
>      } else {
>          free(error);
>
>          struct in6_addr ip6;
> -        error = ipv6_parse_cidr(route->nexthop, &ip6, &plen);
> +        error = ipv6_parse_cidr(route->nexthop, &ip6, plen);
>          if (!error) {
> -            if (plen != 128) {
> +            if (*plen != 128) {
>                  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
>                  VLOG_WARN_RL(&rl, "bad next hop mask %s", route->nexthop);
> -                return;
> +                return false;
>              }
> -            is_ipv4 = false;
> +            *is_ipv4 = false;
>          } else {
>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
>              VLOG_WARN_RL(&rl, "bad next hop ip address %s", route->nexthop);
>              free(error);
> -            return;
> +            return false;
>          }
>      }
>
> -    char *prefix_s;
> -    if (is_ipv4) {
> +    if (*is_ipv4) {
>          ovs_be32 prefix;
>          /* Verify that ip prefix is a valid IPv4 address. */
> -        error = ip_parse_cidr(route->ip_prefix, &prefix, &plen);
> +        error = ip_parse_cidr(route->ip_prefix, &prefix, plen);
>          if (error) {
>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
>              VLOG_WARN_RL(&rl, "bad 'ip_prefix' in static routes %s",
>                           route->ip_prefix);
>              free(error);
> -            return;
> +            return false;
>          }
> -        prefix_s = xasprintf(IP_FMT, IP_ARGS(prefix & be32_prefix_mask(plen)));
> +        *prefix_s = xasprintf(IP_FMT, IP_ARGS(prefix
> +                                              & be32_prefix_mask(*plen)));
>      } else {
>          /* Verify that ip prefix is a valid IPv6 address. */
>          struct in6_addr prefix;
> -        error = ipv6_parse_cidr(route->ip_prefix, &prefix, &plen);
> +        error = ipv6_parse_cidr(route->ip_prefix, &prefix, plen);
>          if (error) {
>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
>              VLOG_WARN_RL(&rl, "bad 'ip_prefix' in static routes %s",
>                           route->ip_prefix);
>              free(error);
> -            return;
> +            return false;
>          }
> -        struct in6_addr mask = ipv6_create_mask(plen);
> +        struct in6_addr mask = ipv6_create_mask(*plen);
>          struct in6_addr network = ipv6_addr_bitand(&prefix, &mask);
> -        prefix_s = xmalloc(INET6_ADDRSTRLEN);
> -        inet_ntop(AF_INET6, &network, prefix_s, INET6_ADDRSTRLEN);
> +        *prefix_s = xmalloc(INET6_ADDRSTRLEN);
> +        inet_ntop(AF_INET6, &network, *prefix_s, INET6_ADDRSTRLEN);
> +    }
> +
> +    return true;
> +}
> +
> +static void
> +build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
> +                        struct hmap *ports,
> +                        const struct nbrec_logical_router_static_route *route)
> +{
> +    const char *lrp_addr_s = NULL;
> +    unsigned int plen;
> +    bool is_ipv4;
> +    char *prefix_s = NULL;
> +
> +    if (!verify_nexthop_prefix(route, &is_ipv4, &prefix_s, &plen)) {
> +        return;
> +    }
> +
> +    /* Handle a single output_port only.  A route with multiple output_port
> +     * entries is handled by build_multipath_flow instead. */
> +    if (route->n_output_port > 1) {
> +        return;
>      }
>
>      /* Find the outgoing port. */
>      struct ovn_port *out_port = NULL;
> -    if (route->output_port) {
> -        out_port = ovn_port_find(ports, route->output_port);
> +    if (route->n_output_port) {
> +        out_port = ovn_port_find(ports, route->output_port[0]);
>          if (!out_port) {
>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
>              VLOG_WARN_RL(&rl, "Bad out port %s for static route %s",
> -                         route->output_port, route->ip_prefix);
> +                         route->output_port[0], route->ip_prefix);
>              goto free_prefix_s;
>          }
>          lrp_addr_s = find_lrp_member_ip(out_port, route->nexthop);
> @@ -4270,7 +4372,77 @@ build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>                policy);
>
>  free_prefix_s:
> -    free(prefix_s);
> +    if (prefix_s) {
> +        free(prefix_s);
> +    }
> +}
> +
> +static void
> +build_multipath_flow(struct hmap *lflows, struct ovn_datapath *od,
> +                     struct hmap *ports,
> +                     const struct nbrec_logical_router_static_route *route)
> +{
> +    unsigned int plen;
> +    bool is_ipv4;
> +    char *prefix_s = NULL;
> +
> +    if (!verify_nexthop_prefix(route, &is_ipv4, &prefix_s, &plen)) {
> +        return;
> +    }
> +
> +    /* Find the outgoing port. */
> +    struct ovn_port **out_ports = xmalloc(route->n_output_port *
> +                                             sizeof(struct ovn_port *));
> +    const char **lrp_addr_s = xmalloc(route->n_output_port *
> +                                         sizeof(const char *));
> +    uint32_t idx = 0;
> +    for (int i = 0; i < route->n_output_port; i++) {
> +        out_ports[idx] = ovn_port_find(ports, route->output_port[i]);
> +        if (!out_ports[idx]) {
> +            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
> +            VLOG_WARN_RL(&rl, "Bad out port %s for static route %s",
> +                         route->output_port[i], route->ip_prefix);
> +            continue;
> +        }
> +
> +        lrp_addr_s[idx] = find_lrp_member_ip(out_ports[idx], route->nexthop);
> +        if (!lrp_addr_s[idx]) {
> +            if (is_ipv4) {
> +                if (out_ports[idx]->lrp_networks.n_ipv4_addrs) {
> +                    lrp_addr_s[idx] = out_ports[idx]->
> +                                        lrp_networks.ipv4_addrs[0].addr_s;
> +                }
> +            } else {
> +                if (out_ports[idx]->lrp_networks.n_ipv6_addrs) {
> +                    lrp_addr_s[idx] = out_ports[idx]->
> +                                        lrp_networks.ipv6_addrs[0].addr_s;
> +                }
> +            }
> +        }
> +        if (!lrp_addr_s[idx]) {
> +            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
> +            VLOG_WARN_RL(&rl,
> +                         "%s has no path for static route %s; next hop %s",
> +                         route->output_port[i], route->ip_prefix,
> +                         route->nexthop);
> +            continue;
> +        }
> +
> +        idx++;
> +    }
> +
> +    const char *policy = route->policy ? route->policy : "dst-ip";
> +    if (idx > 0) {
> +        add_multipath_route(lflows, idx,
> +                            out_ports, lrp_addr_s, od,
> +                            prefix_s, plen, route->nexthop, policy);
> +    }
> +
> +    free(out_ports);
> +    free(lrp_addr_s);
> +    if (prefix_s) {
> +        free(prefix_s);
> +    }
>  }
>
>  static void
> @@ -5344,7 +5516,7 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
>          }
>      }
>
> -    /* Convert the static routes to flows. */
> +    /* Convert the static routes and multipath routes to flows. */
>      HMAP_FOR_EACH (od, key_node, datapaths) {
>          if (!od->nbr) {
>              continue;
> @@ -5354,13 +5526,24 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
>              const struct nbrec_logical_router_static_route *route;
>
>              route = od->nbr->static_routes[i];
> -            build_static_route_flow(lflows, od, ports, route);
> +            if (route->n_output_port > 1) {
> +                /* Logical router ingress table 5-6: Multipath Routing.
> +                 *
> +                 * If the router has been configured so that traffic has
> +                 * multiple paths to a destination, the specific output port
> +                 * is figured out by hashing the packet's IP dst address. */
> +                build_multipath_flow(lflows, od, ports, route);
> +            } else {
> +                build_static_route_flow(lflows, od, ports, route);
> +            }
>          }
> +        /* Packets are allowed by default in table 6. */
> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_MULTIPATH, 0, "1", "next;");
>      }
>
>      /* XXX destination unreachable */
>
> -    /* Local router ingress table 6: ARP Resolution.
> +    /* Local router ingress table 7: ARP Resolution.
>       *
>       * Any packet that reaches this table is an IP packet whose next-hop IP
>       * address is in reg0. (ip4.dst is the final destination.) This table
> @@ -5555,7 +5738,7 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
>                        "get_nd(outport, xxreg0); next;");
>      }
>
> -    /* Logical router ingress table 7: Gateway redirect.
> +    /* Logical router ingress table 8: Gateway redirect.
>       *
>       * For traffic with outport equal to the l3dgw_port
>       * on a distributed router, this table redirects a subset
> @@ -5595,7 +5778,7 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
>          ovn_lflow_add(lflows, od, S_ROUTER_IN_GW_REDIRECT, 0, "1", "next;");
>      }
>
> -    /* Local router ingress table 8: ARP request.
> +    /* Local router ingress table 9: ARP request.
>       *
>       * In the common case where the Ethernet destination has been resolved,
>       * this table outputs the packet (priority 0).  Otherwise, it composes
> diff --git a/ovn/ovn-nb.ovsschema b/ovn/ovn-nb.ovsschema
> index a077bfb..7a43473 100644
> --- a/ovn/ovn-nb.ovsschema
> +++ b/ovn/ovn-nb.ovsschema
> @@ -1,7 +1,7 @@
>  {
>      "name": "OVN_Northbound",
> -    "version": "5.8.0",
> -    "cksum": "2812300190 16766",
> +    "version": "5.9.0",
> +    "cksum": "1515729450 16817",
>      "tables": {
>          "NB_Global": {
>              "columns": {
> @@ -235,7 +235,8 @@
>                                                               "dst-ip"]]},
>                                      "min": 0, "max": 1}},
>                  "nexthop": {"type": "string"},
> -                "output_port": {"type": {"key": "string", "min": 0, "max": 1}}},
> +                "output_port": {"type": {"key": "string", "min": 0,
> +                                         "max": "unlimited"}}},
>              "isRoot": false},
>          "NAT": {
>              "columns": {
> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
> index 9869d7e..eaba0c8 100644
> --- a/ovn/ovn-nb.xml
> +++ b/ovn/ovn-nb.xml
> @@ -1485,6 +1485,10 @@
>          multiple IP addresses on the router port and none of them are in the
>          same subnet of <ref column="nexthop"/>, OVN chooses the first IP
>          address as the one via which the <ref column="nexthop"/> is reachable.
> +        When it contains two or more ports, the packet has multiple
> +        candidate output ports. OVN uses the packet header to determine which
> +        port the packet is delivered to.
> +        Currently, OVN uses the destination IP field to choose the port.
>        </p>
>      </column>
>    </table>
> diff --git a/ovn/utilities/ovn-nbctl.c b/ovn/utilities/ovn-nbctl.c
> index 8e5c1a4..417194f 100644
> --- a/ovn/utilities/ovn-nbctl.c
> +++ b/ovn/utilities/ovn-nbctl.c
> @@ -397,7 +397,7 @@ Logical router port commands:\n\
>                              ('enabled' or 'disabled')\n\
>  \n\
>  Route commands:\n\
> -  [--policy=POLICY] lr-route-add ROUTER PREFIX NEXTHOP [PORT]\n\
> +  [--policy=POLICY] lr-route-add ROUTER PREFIX NEXTHOP [PORT]...\n\
>                              add a route to ROUTER\n\
>    lr-route-del ROUTER [PREFIX]\n\
>                              remove routes from ROUTER\n\
> @@ -2184,13 +2184,15 @@ normalize_prefix_str(const char *orig_prefix)
>          return normalize_ipv6_prefix(ipv6, plen);
>      }
>  }
> -
> +
>  static void
>  nbctl_lr_route_add(struct ctl_context *ctx)
>  {
>      const struct nbrec_logical_router *lr;
>      lr = lr_by_name_or_uuid(ctx, ctx->argv[1], true);
>      char *prefix, *next_hop;
> +    int n_output_port = 0;
> +    const char **output_port = NULL;
>
>      const char *policy = shash_find_data(&ctx->options, "--policy");
>      if (policy && strcmp(policy, "src-ip") && strcmp(policy, "dst-ip")) {
> @@ -2224,6 +2226,11 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>          }
>      }
>
> +    if (ctx->argc > 4) {
> +        n_output_port = ctx->argc - 4;
> +        output_port = (const char **)&ctx->argv[4];
> +    }
> +
>      bool may_exist = shash_find(&ctx->options, "--may-exist") != NULL;
>      for (int i = 0; i < lr->n_static_routes; i++) {
>          const struct nbrec_logical_router_static_route *route
> @@ -2253,9 +2260,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>          nbrec_logical_router_static_route_verify_nexthop(route);
>          nbrec_logical_router_static_route_set_ip_prefix(route, prefix);
>          nbrec_logical_router_static_route_set_nexthop(route, next_hop);
> -        if (ctx->argc == 5) {
> +        if (n_output_port > 0) {
>              nbrec_logical_router_static_route_set_output_port(route,
> -                                                              ctx->argv[4]);
> +                                                              output_port,
> +                                                              n_output_port);
>          }
>          if (policy) {
>               nbrec_logical_router_static_route_set_policy(route, policy);
> @@ -2270,8 +2278,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>      route = nbrec_logical_router_static_route_insert(ctx->txn);
>      nbrec_logical_router_static_route_set_ip_prefix(route, prefix);
>      nbrec_logical_router_static_route_set_nexthop(route, next_hop);
> -    if (ctx->argc == 5) {
> -        nbrec_logical_router_static_route_set_output_port(route, ctx->argv[4]);
> +    if (n_output_port > 0) {
> +        nbrec_logical_router_static_route_set_output_port(route,
> +                                                          output_port,
> +                                                          n_output_port);
>      }
>      if (policy) {
>          nbrec_logical_router_static_route_set_policy(route, policy);
> @@ -3066,8 +3076,8 @@ print_route(const struct nbrec_logical_router_static_route *route, struct ds *s)
>          ds_put_format(s, " %s", "dst-ip");
>      }
>
> -    if (route->output_port) {
> -        ds_put_format(s, " %s", route->output_port);
> +    for (int i = 0; i < route->n_output_port; i++) {
> +        ds_put_format(s, " %s", route->output_port[i]);
>      }
>      ds_put_char(s, '\n');
>  }
> @@ -3682,7 +3692,7 @@ static const struct ctl_command_syntax nbctl_commands[] = {
>        NULL, "", RO },
>
>      /* logical router route commands. */
> -    { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
> +    { "lr-route-add", 3, INT_MAX, "ROUTER PREFIX NEXTHOP [PORT]...", NULL,
>        nbctl_lr_route_add, NULL, "--may-exist,--policy=", RW },
>      { "lr-route-del", 1, 2, "ROUTER [PREFIX]", NULL, nbctl_lr_route_del,
>        NULL, "--if-exists", RW },
> --
> 1.8.3.1
>
>
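
The priority and match strings that add_multipath_route builds in the quoted patch can be reproduced standalone. The sketch below mirrors the patch's arithmetic (plen * 2 for src-ip, plen * 2 + 1 otherwise) and its table 6 match format. route_priority and mp_match are hypothetical helper names, and snprintf stands in for OVS's ds_put_format; this is an illustration, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Priority as computed in the patch: twice the prefix length, plus one
 * unless the policy is "src-ip".  A NULL policy is treated as "dst-ip". */
static uint16_t
route_priority(unsigned int plen, const char *policy)
{
    bool is_src = policy && !strcmp(policy, "src-ip");
    return (uint16_t) (is_src ? plen * 2 : plen * 2 + 1);
}

/* Builds the per-port match used in table 6: the multipath bit, the link
 * index in reg0, and the route's network.  'dir' is "src" or "dst".
 * Returns the snprintf result (length the full string needs). */
static int
mp_match(char *buf, size_t size, unsigned int link_idx, bool is_ipv4,
         const char *dir, const char *network_s, unsigned int plen)
{
    return snprintf(buf, size,
                    "reg9[2] == 1 && reg0 == %u && ip%s.%s == %s/%u",
                    link_idx, is_ipv4 ? "4" : "6", dir, network_s, plen);
}
```

Each extra prefix bit adds 2 to the priority, so a longer prefix always outranks a shorter one, and for equal prefix lengths the dst-ip route wins by 1.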
Gao Zhenyu Oct. 11, 2017, 1:50 a.m. UTC | #2
I discussed this multipath work with Miguel in another mailing thread, and I
want to bring the discussion to the ovs mailing list in the hope of collecting
more suggestions from all of you. :)

Here is Miguel's suggestion.

=================================
Hi Gao,

   Sorry, I didn't have more time to look at it currently (although it's a
topic of my interest.)

   I'm worried about the replication of concerns inside networking-ovn related
routing, and I don't see the advantage of l3gateway mode, beyond legacy
usage.

   I understand the limitation you expressed about the
"chassisredirect"/"gatewaychassis" mode only being able to expose a single
external router leg.

   If that's a limitation that doesn't work for you, my opinion is that we
should work on fixing that limitation, and keeping all our development
efforts in a single place, with distributed E/W routing.

   In that way we could construct L3HA A/A, by having every
gateway_chassis have the same priority, and possibly some extra options.

   But again, please, this is a discussion we may have on the development
mailing list, because maybe my point of view is too narrow.

    Can you bring it up on the mailing list, or do you want me to do it?

   Best regards,
=================================

2017-10-08 17:42 GMT+08:00 Gao Zhenyu <sysugaozhenyu@gmail.com>:

> Comments and suggestions are welcome :)
>
> Thanks
> Zhenyu Gao
>
> 2017-09-26 17:52 GMT+08:00 Zhenyu Gao <sysugaozhenyu@gmail.com>:
>
>> 1. ovn-nb.ovsschema was updated output_port field. Change the max entry
>> number from 1 to unlimited.
>> 2. Add multipath feature in ovn-northd part. northd generates multipath
>> flows to dispatch traffic by using packet's IP dst address if route's
>> output_port contains two or more ports.
>> 3. Add new table(lr_in_multipath) in ovn-northd's router ingress stages
>> to dispatch traffic to ports.
>> 4. Add multipath flow in Table 5(lr_in_ip_routing) and store hash result
>> into reg0. reg9[2] was used to indicate packet which need dispatching.
>> 5. Add multipath feature description in ovn/northd/ovn-northd.8.xml
>> and ovn/ovn-nb.xml
>> 6. ovn-nbctl.c was updated to handle configuring mulitiple output_port.
>>
>> Signed-off-by: Zhenyu Gao <sysugaozhenyu@gmail.com>
>> ---
>>  ovn/northd/ovn-northd.8.xml |  67 +++++++++++-
>>  ovn/northd/ovn-northd.c     | 257 ++++++++++++++++++++++++++++++
>> +++++++-------
>>  ovn/ovn-nb.ovsschema        |   7 +-
>>  ovn/ovn-nb.xml              |   4 +
>>  ovn/utilities/ovn-nbctl.c   |  28 +++--
>>  5 files changed, 311 insertions(+), 52 deletions(-)
>>
>> diff --git a/ovn/northd/ovn-northd.8.xml b/ovn/northd/ovn-northd.8.xml
>> index 0d85ec0..b1ce9a9 100644
>> --- a/ovn/northd/ovn-northd.8.xml
>> +++ b/ovn/northd/ovn-northd.8.xml
>> @@ -1598,6 +1598,9 @@ icmp4 {
>>        port (ingress table <code>ARP Request</code> will generate an ARP
>>        request, if needed, with <code>reg0</code> as the target protocol
>>        address and <code>reg1</code> as the source protocol address).
>> +      A IP route can be configured that it has multipath to next-hop.
>> +      If a packet has multipath to destination, OVN assign the port
>> +      index into reg[0] to indicate the packet's output port in table 6.
>>      </p>
>>
>>      <p>
>> @@ -1617,6 +1620,28 @@ icmp4 {
>>
>>        <li>
>>          <p>
>> +          IPv4/IPV6 multipath routing table. For each route to IPv4/IPv6
>> +          network <var>N</var> with netmask <var>M</var>, on multipath
>> port
>> +          <var>P</var> with IP address <var>A</var> and Ethernet
>> +          address <var>E</var>, a logical flow with match
>> +          <code>ip4.dst ==<var>N</var>/<var>M</var></code>,whose
>> priority
>> +          is the number of 1-bits plus 10 in <var>M</var>,
>> +          has the following actions:
>> +        </p>
>> +
>> +        <pre>
>> +ip.ttl--;
>> +multipath (nw_dst, 0, modulo_n, <var>n_links</var>, 0, reg0);
>> +reg9[2] = 1
>> +next;
>> +        </pre>
>> +        <p>
>> +          <var>n_links</var> is the number of multipath port.
>> +        </p>
>> +      </li>
>> +
>> +      <li>
>> +        <p>
>>            IPv4 routing table.  For each route to IPv4 network
>> <var>N</var> with
>>            netmask <var>M</var>, on router port <var>P</var> with IP
>> address
>>            <var>A</var> and Ethernet
>> @@ -1686,7 +1711,43 @@ next;
>>        </li>
>>      </ul>
>>
>> -    <h3>Ingress Table 6: ARP/ND Resolution</h3>
>> +    <h3>Ingress Table 6: Multipath</h3>
>> +    <p>
>> +      Any packet taht reaches this table is an IP packet and reg9[2]=1
>> +      using the following flows to route to corresponding port. This
>> table
>> +      implement dispatching by consuming reg0.
>> +    </p>
>> +
>> +    <ul>
>> +      <li>
>> +        <p>
>> +          A packet with netmask <var>M</var>, IP address <var>A</var> and
>> +          <code>reg9[2] = 1</code>, whose priority above 1 has following
>> +          actions:
>> +        </p>
>> +
>> +        <pre>
>> +reg0 = <var>G</var>;
>> +reg1 = <var>A</var>;
>> +eth.src = <var>E</var>;
>> +outport = <var>P</var>;
>> +flags.loopback = 1;
>> +next;
>> +        </pre>
>> +
>> +        <p>
>> +          <var>G</var> is the gateway IP address. <var>A</var>, <var>E</var>
>> +          and <var>P</var> are the values that were described in multipath
>> +          routing in table 5.
>> +        </p>
>> +
>> +        <p>
>> +          A priority-0 logical flow with match <code>1</code> has actions
>> +          <code>next;</code>.
>> +        </p>
>> +      </li>
>> +    </ul>
>> +
>> +    <h3>Ingress Table 7: ARP/ND Resolution</h3>
>>
>>      <p>
>>        Any packet that reaches this table is an IP packet whose next-hop
>> @@ -1779,7 +1840,7 @@ next;
>>        </li>
>>      </ul>
>>
>> -    <h3>Ingress Table 7: Gateway Redirect</h3>
>> +    <h3>Ingress Table 8: Gateway Redirect</h3>
>>
>>      <p>
>>        For distributed logical routers where one of the logical router
>> @@ -1836,7 +1897,7 @@ next;
>>        </li>
>>      </ul>
>>
>> -    <h3>Ingress Table 8: ARP Request</h3>
>> +    <h3>Ingress Table 9: ARP Request</h3>
>>
>>      <p>
>>        In the common case where the Ethernet destination has been
>> resolved, this
>> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
>> index 49e4ac3..f8bfee2 100644
>> --- a/ovn/northd/ovn-northd.c
>> +++ b/ovn/northd/ovn-northd.c
>> @@ -135,9 +135,10 @@ enum ovn_stage {
>>      PIPELINE_STAGE(ROUTER, IN,  UNSNAT,      3, "lr_in_unsnat")       \
>>      PIPELINE_STAGE(ROUTER, IN,  DNAT,        4, "lr_in_dnat")         \
>>      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,  5, "lr_in_ip_routing")   \
>> -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 6, "lr_in_arp_resolve")  \
>> -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT, 7, "lr_in_gw_redirect")  \
>> -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 8, "lr_in_arp_request")  \
>> +    PIPELINE_STAGE(ROUTER, IN,  MULTIPATH,   6, "lr_in_multipath")    \
>> +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 7, "lr_in_arp_resolve")  \
>> +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT, 8, "lr_in_gw_redirect")  \
>> +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 9, "lr_in_arp_request")  \
>>                                                                        \
>>      /* Logical router egress stages. */                               \
>>      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
>> @@ -173,6 +174,11 @@ enum ovn_stage {
>>   * one of the logical router's own IP addresses. */
>>  #define REGBIT_EGRESS_LOOPBACK  "reg9[1]"
>>
>> +/* Indicates that the multipath action has processed this packet and stored
>> + * the hash result in another regX. The hash result should be consumed to
>> + * determine the right output port. */
>> +#define REGBIT_MULTIPATH "reg9[2]"
>> +
>>  /* Returns an "enum ovn_stage" built from the arguments. */
>>  static enum ovn_stage
>>  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline
>> pipeline,
>> @@ -4142,82 +4148,178 @@ add_route(struct hmap *lflows, const struct
>> ovn_port *op,
>>  }
>>
>>  static void
>> -build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>> -                        struct hmap *ports,
>> -                        const struct nbrec_logical_router_static_route
>> *route)
>> +add_multipath_route(struct hmap *lflows, uint32_t port_num,
>> +                    struct ovn_port **out_ports,
>> +                    const char **lrp_addr_s,
>> +                    struct ovn_datapath *od,
>> +                    const char *network_s, int plen,
>> +                    const char *gateway, const char *policy)
>> +{
>> +    bool is_ipv4 = strchr(network_s, '.') ? true : false;
>> +    struct ds match = DS_EMPTY_INITIALIZER;
>> +    const char *dir;
>> +    uint16_t priority;
>> +
>> +    if (policy && !strcmp(policy, "src-ip")) {
>> +        dir = "src";
>> +        priority = plen * 2;
>> +    } else {
>> +        dir = "dst";
>> +        priority = (plen * 2) + 1;
>> +    }
>> +
>> +    ds_put_format(&match, "ip%s.%s == %s/%d", is_ipv4 ? "4" : "6", dir,
>> +                  network_s, plen);
>> +
>> +    struct ds actions = DS_EMPTY_INITIALIZER;
>> +
>> +    ds_put_format(&actions, "ip.ttl--; ");
>> +    ds_put_format(&actions,
>> +                  "multipath (nw_dst, 0, modulo_n, %u, 0, reg0); "
>> +                  "%s = 1; "
>> +                  "next;",
>> +                  port_num, REGBIT_MULTIPATH);
>> +
>> +    /* The priority here is calculated to implement longest-prefix-match
>> +     * routing. */
>> +    ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
>> +                  ds_cstr(&match), ds_cstr(&actions));
>> +
>> +    for (int i = 0; i < port_num; i++) {
>> +        struct ds mp_match = DS_EMPTY_INITIALIZER;
>> +        struct ds mp_actions = DS_EMPTY_INITIALIZER;
>> +
>> +        ds_put_format(&mp_match, "%s == 1 && reg0 == %d && ",
>> +                      REGBIT_MULTIPATH, i);
>> +        ds_put_format(&mp_match, "ip%s.%s == %s/%d",
>> +                      is_ipv4 ? "4" : "6", dir,
>> +                      network_s, plen);
>> +
>> +        ds_put_format(&mp_actions, "%sreg0 = ", is_ipv4 ? "" : "xx");
>> +        if (gateway) {
>> +            ds_put_cstr(&mp_actions, gateway);
>> +        } else {
>> +            ds_put_format(&mp_actions, "ip%s.dst", is_ipv4 ? "4" : "6");
>> +        }
>> +
>> +        ds_put_format(&mp_actions, "; "
>> +                      "%sreg1 = %s; "
>> +                      "eth.src = %s; "
>> +                      "outport = %s; "
>> +                      "flags.loopback = 1; "
>> +                      "next;",
>> +                      is_ipv4 ? "" : "xx",
>> +                      lrp_addr_s[i],
>> +                      out_ports[i]->lrp_networks.ea_s,
>> +                      out_ports[i]->json_key);
>> +
>> +        /* Add flow in table 6 to determine the right output port
>> +         * for this traffic. */
>> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_MULTIPATH, priority,
>> +                      ds_cstr(&mp_match), ds_cstr(&mp_actions));
>> +        ds_destroy(&mp_match);
>> +        ds_destroy(&mp_actions);
>> +    }
>> +    ds_destroy(&match);
>> +    ds_destroy(&actions);
>> +}
>> +
>> +static bool
>> +verify_nexthop_prefix(const struct nbrec_logical_router_static_route
>> *route,
>> +                      bool *is_ipv4, char **prefix_s, unsigned int *plen)
>>  {
>>      ovs_be32 nexthop;
>> -    const char *lrp_addr_s = NULL;
>> -    unsigned int plen;
>> -    bool is_ipv4;
>>
>>      /* Verify that the next hop is an IP address with an all-ones mask.
>> */
>> -    char *error = ip_parse_cidr(route->nexthop, &nexthop, &plen);
>> +    char *error = ip_parse_cidr(route->nexthop, &nexthop, plen);
>>      if (!error) {
>> -        if (plen != 32) {
>> +        if (*plen != 32) {
>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>> 1);
>>              VLOG_WARN_RL(&rl, "bad next hop mask %s", route->nexthop);
>> -            return;
>> +            return false;
>>          }
>> -        is_ipv4 = true;
>> +        *is_ipv4 = true;
>>      } else {
>>          free(error);
>>
>>          struct in6_addr ip6;
>> -        error = ipv6_parse_cidr(route->nexthop, &ip6, &plen);
>> +        error = ipv6_parse_cidr(route->nexthop, &ip6, plen);
>>          if (!error) {
>> -            if (plen != 128) {
>> +            if (*plen != 128) {
>>                  static struct vlog_rate_limit rl =
>> VLOG_RATE_LIMIT_INIT(5, 1);
>>                  VLOG_WARN_RL(&rl, "bad next hop mask %s",
>> route->nexthop);
>> -                return;
>> +                return false;
>>              }
>> -            is_ipv4 = false;
>> +            *is_ipv4 = false;
>>          } else {
>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>> 1);
>>              VLOG_WARN_RL(&rl, "bad next hop ip address %s",
>> route->nexthop);
>>              free(error);
>> -            return;
>> +            return false;
>>          }
>>      }
>>
>> -    char *prefix_s;
>> -    if (is_ipv4) {
>> +    if (*is_ipv4) {
>>          ovs_be32 prefix;
>>          /* Verify that ip prefix is a valid IPv4 address. */
>> -        error = ip_parse_cidr(route->ip_prefix, &prefix, &plen);
>> +        error = ip_parse_cidr(route->ip_prefix, &prefix, plen);
>>          if (error) {
>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>> 1);
>>              VLOG_WARN_RL(&rl, "bad 'ip_prefix' in static routes %s",
>>                           route->ip_prefix);
>>              free(error);
>> -            return;
>> +            return false;
>>          }
>> -        prefix_s = xasprintf(IP_FMT, IP_ARGS(prefix &
>> be32_prefix_mask(plen)));
>> +        *prefix_s = xasprintf(IP_FMT, IP_ARGS(prefix
>> +                                              &
>> be32_prefix_mask(*plen)));
>>      } else {
>>          /* Verify that ip prefix is a valid IPv6 address. */
>>          struct in6_addr prefix;
>> -        error = ipv6_parse_cidr(route->ip_prefix, &prefix, &plen);
>> +        error = ipv6_parse_cidr(route->ip_prefix, &prefix, plen);
>>          if (error) {
>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>> 1);
>>              VLOG_WARN_RL(&rl, "bad 'ip_prefix' in static routes %s",
>>                           route->ip_prefix);
>>              free(error);
>> -            return;
>> +            return false;
>>          }
>> -        struct in6_addr mask = ipv6_create_mask(plen);
>> +        struct in6_addr mask = ipv6_create_mask(*plen);
>>          struct in6_addr network = ipv6_addr_bitand(&prefix, &mask);
>> -        prefix_s = xmalloc(INET6_ADDRSTRLEN);
>> -        inet_ntop(AF_INET6, &network, prefix_s, INET6_ADDRSTRLEN);
>> +        *prefix_s = xmalloc(INET6_ADDRSTRLEN);
>> +        inet_ntop(AF_INET6, &network, *prefix_s, INET6_ADDRSTRLEN);
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static void
>> +build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>> +                        struct hmap *ports,
>> +                        const struct nbrec_logical_router_static_route
>> *route)
>> +{
>> +    const char *lrp_addr_s = NULL;
>> +    unsigned int plen;
>> +    bool is_ipv4;
>> +    char *prefix_s = NULL;
>> +
>> +    if (!verify_nexthop_prefix(route, &is_ipv4, &prefix_s, &plen)) {
>> +        return;
>> +    }
>> +
>> +    /* Only need one output_port. If the route contains multiple
>> +     * output_ports, build_multipath_flow handles it instead. */
>> +    if (route->n_output_port > 1) {
>> +        return;
>>      }
>>
>>      /* Find the outgoing port. */
>>      struct ovn_port *out_port = NULL;
>> -    if (route->output_port) {
>> -        out_port = ovn_port_find(ports, route->output_port);
>> +    if (route->n_output_port) {
>> +        out_port = ovn_port_find(ports, route->output_port[0]);
>>          if (!out_port) {
>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>> 1);
>>              VLOG_WARN_RL(&rl, "Bad out port %s for static route %s",
>> -                         route->output_port, route->ip_prefix);
>> +                         route->output_port[0], route->ip_prefix);
>>              goto free_prefix_s;
>>          }
>>          lrp_addr_s = find_lrp_member_ip(out_port, route->nexthop);
>> @@ -4270,7 +4372,77 @@ build_static_route_flow(struct hmap *lflows,
>> struct ovn_datapath *od,
>>                policy);
>>
>>  free_prefix_s:
>> -    free(prefix_s);
>> +    if (prefix_s) {
>> +        free(prefix_s);
>> +    }
>> +}
>> +
>> +static void
>> +build_multipath_flow(struct hmap *lflows, struct ovn_datapath *od,
>> +                     struct hmap *ports,
>> +                     const struct nbrec_logical_router_static_route
>> *route)
>> +{
>> +    unsigned int plen;
>> +    bool is_ipv4;
>> +    char *prefix_s = NULL;
>> +
>> +    if (!verify_nexthop_prefix(route, &is_ipv4, &prefix_s, &plen)) {
>> +        return;
>> +    }
>> +
>> +    /* Find the outgoing port. */
>> +    struct ovn_port **out_ports = xmalloc(route->n_output_port *
>> +                                             sizeof(struct ovn_port *));
>> +    const char **lrp_addr_s = xmalloc(route->n_output_port *
>> +                                         sizeof(const char *));
>> +    uint32_t idx = 0;
>> +    for (int i = 0; i < route->n_output_port; i++) {
>> +        out_ports[idx] = ovn_port_find(ports, route->output_port[i]);
>> +        if (!out_ports[idx]) {
>> +            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>> 1);
>> +            VLOG_WARN_RL(&rl, "Bad out port %s for static route %s",
>> +                         route->output_port[i], route->ip_prefix);
>> +            continue;
>> +        }
>> +
>> +        lrp_addr_s[idx] = find_lrp_member_ip(out_ports[idx],
>> route->nexthop);
>> +        if (!lrp_addr_s[idx]) {
>> +            if (is_ipv4) {
>> +                if (out_ports[idx]->lrp_networks.n_ipv4_addrs) {
>> +                    lrp_addr_s[idx] = out_ports[idx]->
>> +                                        lrp_networks.ipv4_addrs[0].addr_s;
>> +                }
>> +            } else {
>> +                if (out_ports[idx]->lrp_networks.n_ipv6_addrs) {
>> +                    lrp_addr_s[idx] = out_ports[idx]->
>> +                                        lrp_networks.ipv6_addrs[0].addr_s;
>> +                }
>> +            }
>> +        }
>> +        if (!lrp_addr_s[idx]) {
>> +            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>> 1);
>> +            VLOG_WARN_RL(&rl,
>> +                         "%s has no path for static route %s; next hop
>> %s",
>> +                         route->output_port[i], route->ip_prefix,
>> +                         route->nexthop);
>> +            continue;
>> +        }
>> +
>> +        idx++;
>> +    }
>> +
>> +    char *policy = route->policy ? route->policy : "dst-ip";
>> +    if (idx > 0) {
>> +        add_multipath_route(lflows, idx,
>> +                            out_ports, lrp_addr_s, od,
>> +                            prefix_s, plen, route->nexthop, policy);
>> +    }
>> +
>> +    free(out_ports);
>> +    free(lrp_addr_s);
>> +    if (prefix_s) {
>> +        free(prefix_s);
>> +    }
>>  }
>>
>>  static void
>> @@ -5344,7 +5516,7 @@ build_lrouter_flows(struct hmap *datapaths, struct
>> hmap *ports,
>>          }
>>      }
>>
>> -    /* Convert the static routes to flows. */
>> +    /* Convert the static routes and multipath route to flows. */
>>      HMAP_FOR_EACH (od, key_node, datapaths) {
>>          if (!od->nbr) {
>>              continue;
>> @@ -5354,13 +5526,24 @@ build_lrouter_flows(struct hmap *datapaths,
>> struct hmap *ports,
>>              const struct nbrec_logical_router_static_route *route;
>>
>>              route = od->nbr->static_routes[i];
>> -            build_static_route_flow(lflows, od, ports, route);
>> +            if (route->n_output_port > 1) {
>> +                /* Logical router ingress table 5-6: Multipath Routing.
>> +                 *
>> +                 * If the router has been configured so that traffic has
>> +                 * multiple paths to a destination, the specific output port
>> +                 * is figured out by hashing the packet's IP dst address. */
>> +                build_multipath_flow(lflows, od, ports, route);
>> +            } else {
>> +                build_static_route_flow(lflows, od, ports, route);
>> +            }
>>          }
>> +        /* Packets are allowed by default in table 6. */
>> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_MULTIPATH, 0, "1",
>> "next;");
>>      }
>>
>>      /* XXX destination unreachable */
>>
>> -    /* Local router ingress table 6: ARP Resolution.
>> +    /* Local router ingress table 7: ARP Resolution.
>>       *
>>       * Any packet that reaches this table is an IP packet whose next-hop
>> IP
>>       * address is in reg0. (ip4.dst is the final destination.) This table
>> @@ -5555,7 +5738,7 @@ build_lrouter_flows(struct hmap *datapaths, struct
>> hmap *ports,
>>                        "get_nd(outport, xxreg0); next;");
>>      }
>>
>> -    /* Logical router ingress table 7: Gateway redirect.
>> +    /* Logical router ingress table 8: Gateway redirect.
>>       *
>>       * For traffic with outport equal to the l3dgw_port
>>       * on a distributed router, this table redirects a subset
>> @@ -5595,7 +5778,7 @@ build_lrouter_flows(struct hmap *datapaths, struct
>> hmap *ports,
>>          ovn_lflow_add(lflows, od, S_ROUTER_IN_GW_REDIRECT, 0, "1",
>> "next;");
>>      }
>>
>> -    /* Local router ingress table 8: ARP request.
>> +    /* Local router ingress table 9: ARP request.
>>       *
>>       * In the common case where the Ethernet destination has been
>> resolved,
>>       * this table outputs the packet (priority 0).  Otherwise, it
>> composes
>> diff --git a/ovn/ovn-nb.ovsschema b/ovn/ovn-nb.ovsschema
>> index a077bfb..7a43473 100644
>> --- a/ovn/ovn-nb.ovsschema
>> +++ b/ovn/ovn-nb.ovsschema
>> @@ -1,7 +1,7 @@
>>  {
>>      "name": "OVN_Northbound",
>> -    "version": "5.8.0",
>> -    "cksum": "2812300190 16766",
>> +    "version": "5.9.0",
>> +    "cksum": "1515729450 16817",
>>      "tables": {
>>          "NB_Global": {
>>              "columns": {
>> @@ -235,7 +235,8 @@
>>                                                               "dst-ip"]]},
>>                                      "min": 0, "max": 1}},
>>                  "nexthop": {"type": "string"},
>> -                "output_port": {"type": {"key": "string", "min": 0,
>> "max": 1}}},
>> +                "output_port": {"type": {"key": "string", "min": 0,
>> +                                         "max": "unlimited"}}},
>>              "isRoot": false},
>>          "NAT": {
>>              "columns": {
>> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
>> index 9869d7e..eaba0c8 100644
>> --- a/ovn/ovn-nb.xml
>> +++ b/ovn/ovn-nb.xml
>> @@ -1485,6 +1485,10 @@
>>          multiple IP addresses on the router port and none of them are in
>> the
>>          same subnet of <ref column="nexthop"/>, OVN chooses the first IP
>>          address as the one via which the <ref column="nexthop"/> is
>> reachable.
>> +        When it contains two or more ports, the packet has multiple
>> +        candidate output ports. OVN uses the packet header to determine which
>> +        port the packet is delivered to.
>> +        Currently, OVN uses the destination IP field to pick the port.
>>        </p>
>>      </column>
>>    </table>
>> diff --git a/ovn/utilities/ovn-nbctl.c b/ovn/utilities/ovn-nbctl.c
>> index 8e5c1a4..417194f 100644
>> --- a/ovn/utilities/ovn-nbctl.c
>> +++ b/ovn/utilities/ovn-nbctl.c
>> @@ -397,7 +397,7 @@ Logical router port commands:\n\
>>                              ('enabled' or 'disabled')\n\
>>  \n\
>>  Route commands:\n\
>> -  [--policy=POLICY] lr-route-add ROUTER PREFIX NEXTHOP [PORT]\n\
>> +  [--policy=POLICY] lr-route-add ROUTER PREFIX NEXTHOP [PORT]...\n\
>>                              add a route to ROUTER\n\
>>    lr-route-del ROUTER [PREFIX]\n\
>>                              remove routes from ROUTER\n\
>> @@ -2184,13 +2184,15 @@ normalize_prefix_str(const char *orig_prefix)
>>          return normalize_ipv6_prefix(ipv6, plen);
>>      }
>>  }
>> -
>> +
>>  static void
>>  nbctl_lr_route_add(struct ctl_context *ctx)
>>  {
>>      const struct nbrec_logical_router *lr;
>>      lr = lr_by_name_or_uuid(ctx, ctx->argv[1], true);
>>      char *prefix, *next_hop;
>> +    int n_output_port = 0;
>> +    const char **output_port;
>>
>>      const char *policy = shash_find_data(&ctx->options, "--policy");
>>      if (policy && strcmp(policy, "src-ip") && strcmp(policy, "dst-ip")) {
>> @@ -2224,6 +2226,11 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>>          }
>>      }
>>
>> +    if (ctx->argc > 4) {
>> +        n_output_port = ctx->argc - 4;
>> +        output_port = (const char **)&ctx->argv[4];
>> +    }
>> +
>>      bool may_exist = shash_find(&ctx->options, "--may-exist") != NULL;
>>      for (int i = 0; i < lr->n_static_routes; i++) {
>>          const struct nbrec_logical_router_static_route *route
>> @@ -2253,9 +2260,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>>          nbrec_logical_router_static_route_verify_nexthop(route);
>>          nbrec_logical_router_static_route_set_ip_prefix(route, prefix);
>>          nbrec_logical_router_static_route_set_nexthop(route, next_hop);
>> -        if (ctx->argc == 5) {
>> +        if (n_output_port > 0) {
>>              nbrec_logical_router_static_route_set_output_port(route,
>> -                                                              ctx->argv[4]);
>> +                                                              output_port,
>> +                                                              n_output_port);
>>          }
>>          if (policy) {
>>               nbrec_logical_router_static_route_set_policy(route,
>> policy);
>> @@ -2270,8 +2278,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>>      route = nbrec_logical_router_static_route_insert(ctx->txn);
>>      nbrec_logical_router_static_route_set_ip_prefix(route, prefix);
>>      nbrec_logical_router_static_route_set_nexthop(route, next_hop);
>> -    if (ctx->argc == 5) {
>> -        nbrec_logical_router_static_route_set_output_port(route,
>> ctx->argv[4]);
>> +    if (n_output_port > 0) {
>> +        nbrec_logical_router_static_route_set_output_port(route,
>> +                                                          output_port,
>> +                                                          n_output_port);
>>      }
>>      if (policy) {
>>          nbrec_logical_router_static_route_set_policy(route, policy);
>> @@ -3066,8 +3076,8 @@ print_route(const struct
>> nbrec_logical_router_static_route *route, struct ds *s)
>>          ds_put_format(s, " %s", "dst-ip");
>>      }
>>
>> -    if (route->output_port) {
>> -        ds_put_format(s, " %s", route->output_port);
>> +    for (int i = 0; i < route->n_output_port; i++) {
>> +        ds_put_format(s, " %s", route->output_port[i]);
>>      }
>>      ds_put_char(s, '\n');
>>  }
>> @@ -3682,7 +3692,7 @@ static const struct ctl_command_syntax
>> nbctl_commands[] = {
>>        NULL, "", RO },
>>
>>      /* logical router route commands. */
>> -    { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
>> +    { "lr-route-add", 3, INT_MAX, "ROUTER PREFIX NEXTHOP [PORT]...",
>> NULL,
>>        nbctl_lr_route_add, NULL, "--may-exist,--policy=", RW },
>>      { "lr-route-del", 1, 2, "ROUTER [PREFIX]", NULL, nbctl_lr_route_del,
>>        NULL, "--if-exists", RW },
>> --
>> 1.8.3.1
>>
>>
>
Gao Zhenyu Oct. 11, 2017, 1:55 a.m. UTC | #3
Hi Miguel,

   Thanks for your suggestion on it. It's very useful.
   From my point of view, whether we have a single router leg or
multiple router legs on the edge router, we still need a way to dispatch
traffic randomly, right?
   So even if we implement multiple legs on a router, we can't easily
spread traffic randomly across those legs (a static route only separates
specific traffic). The multipath action is then a good candidate for doing it.
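
   To make the dispatching idea concrete, here is a minimal sketch in C of
a modulo_n style selection, where the link index is hash % n_links. It is
only an illustration under assumptions: toy_hash_ip() and toy_select_link()
are hypothetical names, and the FNV-1a style hash is a stand-in, not the
hash that OVS actually computes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the datapath hash (FNV-1a over the 4 bytes of
 * an IPv4 destination address).  Not the hash OVS uses; illustration only. */
static uint32_t
toy_hash_ip(uint32_t ip_dst)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; i++) {
        h ^= (ip_dst >> (i * 8)) & 0xff;
        h *= 16777619u;
    }
    return h;
}

/* Picks a leg index in [0, n_links) the way modulo_n does:
 * link = hash % n_links. */
static uint32_t
toy_select_link(uint32_t ip_dst, uint32_t n_links)
{
    assert(n_links > 0);
    return toy_hash_ip(ip_dst) % n_links;
}
```

   With this kind of selection the same destination always lands on the
same leg, so one destination's packets stay on one path, while different
destinations are spread over the legs.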

   Currently, the gateway chassis are all linked to a single OVN edge logical
router port, which means the traffic output by those gateway chassis carries
the same src mac.
   I don't know if we have a good way to implement L3HA A/A in the current
architecture. (Maybe adding a gateway_chassis options field and populating
"rewrite-mac" and "rewrite-ip" to rewrite the mac address is a way, but I don't
think it is a good way and it may confuse people.)
   So if you already have an idea for it, it would be great to bring it
up so we can discuss it and move the whole process along faster. :)


Thanks
Zhenyu Gao

2017-10-11 9:50 GMT+08:00 Gao Zhenyu <sysugaozhenyu@gmail.com>:

> I discussed this multipath stuff with Miguel in another mailing thread and
> I want to bring this discussion to the ovs mailing list and hope to collect
> more suggestions from all of you. :)
>
> Here is Miguel's suggestion on it.
>
> =================================
> Hi Gao,
>
>    Sorry, I didn't have more time to look at it currently (although it's a
> topic of my interest.)
>
>    I'm worried about the replication of concerns inside networking-ovn
> related routing, and I don't see the advantage of l3gateway mode, beyond
> legacy usage.
>
>    I understand the limitation you expressed about the
> "chassisredirect"/"gatewaychassis" mode only being able to expose a
> single external router leg.
>
>    If that's a limitation that doesn't work for you, my opinion is that we
> should work on fixing that limitation, and keeping all our development
> efforts in a single place, with distributed E/W routing.
>
>    In this way we could construct L3HA A/A, by having every
> gateway_chassis have the same priority, and possibly some extra options.
>
>    But again, please, this is a discussion we may have on the development
> mailing list, because maybe my point of view is too narrow.
>
>     Can you bring it up on the mailing list, or do you want me to do it?
>
>    Best regards,
> =================================
>
> 2017-10-08 17:42 GMT+08:00 Gao Zhenyu <sysugaozhenyu@gmail.com>:
>
>> Comments and suggestions are welcome :)
>>
>> Thanks
>> Zhenyu Gao
>>
>> 2017-09-26 17:52 GMT+08:00 Zhenyu Gao <sysugaozhenyu@gmail.com>:
>>
>>> 1. ovn-nb.ovsschema was updated output_port field. Change the max entry
>>> number from 1 to unlimited.
>>> 2. Add multipath feature in ovn-northd part. northd generates multipath
>>> flows to dispatch traffic by using packet's IP dst address if route's
>>> output_port contains two or more ports.
>>> 3. Add new table(lr_in_multipath) in ovn-northd's router ingress stages
>>> to dispatch traffic to ports.
>>> 4. Add multipath flow in Table 5(lr_in_ip_routing) and store hash result
>>> into reg0. reg9[2] was used to indicate packet which need dispatching.
>>> 5. Add multipath feature description in ovn/northd/ovn-northd.8.xml
>>> and ovn/ovn-nb.xml
>>> 6. ovn-nbctl.c was updated to handle configuring mulitiple output_port.
>>>
>>> Signed-off-by: Zhenyu Gao <sysugaozhenyu@gmail.com>
>>> ---
>>>  ovn/northd/ovn-northd.8.xml |  67 +++++++++++-
>>>  ovn/northd/ovn-northd.c     | 257 +++++++++++++++++++++++++++++++++++++-------
>>>  ovn/ovn-nb.ovsschema        |   7 +-
>>>  ovn/ovn-nb.xml              |   4 +
>>>  ovn/utilities/ovn-nbctl.c   |  28 +++--
>>>  5 files changed, 311 insertions(+), 52 deletions(-)
>>>
>>> diff --git a/ovn/northd/ovn-northd.8.xml b/ovn/northd/ovn-northd.8.xml
>>> index 0d85ec0..b1ce9a9 100644
>>> --- a/ovn/northd/ovn-northd.8.xml
>>> +++ b/ovn/northd/ovn-northd.8.xml
>>> @@ -1598,6 +1598,9 @@ icmp4 {
>>>        port (ingress table <code>ARP Request</code> will generate an ARP
>>>        request, if needed, with <code>reg0</code> as the target protocol
>>>        address and <code>reg1</code> as the source protocol address).
>>> +      An IP route can be configured with multiple paths to the next hop.
>>> +      If a packet has multiple paths to its destination, OVN stores the port
>>> +      index in reg0 to indicate the packet's output port in table 6.
>>>      </p>
>>>
>>>      <p>
>>> @@ -1617,6 +1620,28 @@ icmp4 {
>>>
>>>        <li>
>>>          <p>
>>> +          IPv4/IPv6 multipath routing table. For each route to IPv4/IPv6
>>> +          network <var>N</var> with netmask <var>M</var>, on multipath port
>>> +          <var>P</var> with IP address <var>A</var> and Ethernet
>>> +          address <var>E</var>, a logical flow with match
>>> +          <code>ip4.dst == <var>N</var>/<var>M</var></code>, whose priority
>>> +          is the number of 1-bits plus 10 in <var>M</var>,
>>> +          has the following actions:
>>> +        </p>
>>> +
>>> +        <pre>
>>> +ip.ttl--;
>>> +multipath (nw_dst, 0, modulo_n, <var>n_links</var>, 0, reg0);
>>> +reg9[2] = 1;
>>> +next;
>>> +        </pre>
>>> +        <p>
>>> +          <var>n_links</var> is the number of multipath ports.
>>> +        </p>
>>> +      </li>
>>> +
>>> +      <li>
>>> +        <p>
>>>            IPv4 routing table.  For each route to IPv4 network
>>> <var>N</var> with
>>>            netmask <var>M</var>, on router port <var>P</var> with IP
>>> address
>>>            <var>A</var> and Ethernet
>>> @@ -1686,7 +1711,43 @@ next;
>>>        </li>
>>>      </ul>
>>>
>>> -    <h3>Ingress Table 6: ARP/ND Resolution</h3>
>>> +    <h3>Ingress Table 6: Multipath</h3>
>>> +    <p>
>>> +      Any packet that reaches this table is an IP packet. If reg9[2]=1,
>>> +      the following flows route it to the corresponding port. This table
>>> +      implements dispatching by consuming reg0.
>>> +    </p>
>>> +
>>> +    <ul>
>>> +      <li>
>>> +        <p>
>>> +          For a packet with netmask <var>M</var>, IP address <var>A</var> and
>>> +          <code>reg9[2] = 1</code>, a logical flow with priority above 1 has
>>> +          the following actions:
>>> +        </p>
>>> +
>>> +        <pre>
>>> +reg0 = <var>G</var>;
>>> +reg1 = <var>A</var>;
>>> +eth.src = <var>E</var>;
>>> +outport = <var>P</var>;
>>> +flags.loopback = 1;
>>> +next;
>>> +        </pre>
>>> +
>>> +        <p>
>>> +          <var>G</var> is the gateway IP address. <var>A</var>, <var>E</var>
>>> +          and <var>P</var> are the values that were described in multipath
>>> +          routing in table 5.
>>> +        </p>
>>> +
>>> +        <p>
>>> +          A priority-0 logical flow with match <code>1</code> has actions
>>> +          <code>next;</code>.
>>> +        </p>
>>> +      </li>
>>> +    </ul>
>>> +
>>> +    <h3>Ingress Table 7: ARP/ND Resolution</h3>
>>>
>>>      <p>
>>>        Any packet that reaches this table is an IP packet whose next-hop
>>> @@ -1779,7 +1840,7 @@ next;
>>>        </li>
>>>      </ul>
>>>
>>> -    <h3>Ingress Table 7: Gateway Redirect</h3>
>>> +    <h3>Ingress Table 8: Gateway Redirect</h3>
>>>
>>>      <p>
>>>        For distributed logical routers where one of the logical router
>>> @@ -1836,7 +1897,7 @@ next;
>>>        </li>
>>>      </ul>
>>>
>>> -    <h3>Ingress Table 8: ARP Request</h3>
>>> +    <h3>Ingress Table 9: ARP Request</h3>
>>>
>>>      <p>
>>>        In the common case where the Ethernet destination has been
>>> resolved, this
>>> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
>>> index 49e4ac3..f8bfee2 100644
>>> --- a/ovn/northd/ovn-northd.c
>>> +++ b/ovn/northd/ovn-northd.c
>>> @@ -135,9 +135,10 @@ enum ovn_stage {
>>>      PIPELINE_STAGE(ROUTER, IN,  UNSNAT,      3, "lr_in_unsnat")       \
>>>      PIPELINE_STAGE(ROUTER, IN,  DNAT,        4, "lr_in_dnat")         \
>>>      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,  5, "lr_in_ip_routing")   \
>>> -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 6, "lr_in_arp_resolve")  \
>>> -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT, 7, "lr_in_gw_redirect")  \
>>> -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 8, "lr_in_arp_request")  \
>>> +    PIPELINE_STAGE(ROUTER, IN,  MULTIPATH,   6, "lr_in_multipath")    \
>>> +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 7, "lr_in_arp_resolve")  \
>>> +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT, 8, "lr_in_gw_redirect")  \
>>> +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 9, "lr_in_arp_request")  \
>>>                                                                        \
>>>      /* Logical router egress stages. */                               \
>>>      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
>>> @@ -173,6 +174,11 @@ enum ovn_stage {
>>>   * one of the logical router's own IP addresses. */
>>>  #define REGBIT_EGRESS_LOOPBACK  "reg9[1]"
>>>
>>> +/* Indicates that the multipath action has processed this packet and stored
>>> + * the hash result in another regX. The hash result should be consumed to
>>> + * determine the right output port. */
>>> +#define REGBIT_MULTIPATH "reg9[2]"
>>> +
>>>  /* Returns an "enum ovn_stage" built from the arguments. */
>>>  static enum ovn_stage
>>>  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline
>>> pipeline,
>>> @@ -4142,82 +4148,178 @@ add_route(struct hmap *lflows, const struct
>>> ovn_port *op,
>>>  }
>>>
>>>  static void
>>> -build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>>> -                        struct hmap *ports,
>>> -                        const struct nbrec_logical_router_static_route
>>> *route)
>>> +add_multipath_route(struct hmap *lflows, uint32_t port_num,
>>> +                    struct ovn_port **out_ports,
>>> +                    const char **lrp_addr_s,
>>> +                    struct ovn_datapath *od,
>>> +                    const char *network_s, int plen,
>>> +                    const char *gateway, const char *policy)
>>> +{
>>> +    bool is_ipv4 = strchr(network_s, '.') ? true : false;
>>> +    struct ds match = DS_EMPTY_INITIALIZER;
>>> +    const char *dir;
>>> +    uint16_t priority;
>>> +
>>> +    if (policy && !strcmp(policy, "src-ip")) {
>>> +        dir = "src";
>>> +        priority = plen * 2;
>>> +    } else {
>>> +        dir = "dst";
>>> +        priority = (plen * 2) + 1;
>>> +    }
>>> +
>>> +    ds_put_format(&match, "ip%s.%s == %s/%d", is_ipv4 ? "4" : "6", dir,
>>> +                  network_s, plen);
>>> +
>>> +    struct ds actions = DS_EMPTY_INITIALIZER;
>>> +
>>> +    ds_put_format(&actions, "ip.ttl--; ");
>>> +    ds_put_format(&actions,
>>> +                  "multipath (nw_dst, 0, modulo_n, %u, 0, reg0); "
>>> +                  "%s = 1; "
>>> +                  "next;",
>>> +                  port_num, REGBIT_MULTIPATH);
>>> +
>>> +    /* The priority here is calculated to implement longest-prefix-match
>>> +     * routing. */
>>> +    ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
>>> +                  ds_cstr(&match), ds_cstr(&actions));
>>> +
>>> +    for (int i = 0; i < port_num; i++) {
>>> +        struct ds mp_match = DS_EMPTY_INITIALIZER;
>>> +        struct ds mp_actions = DS_EMPTY_INITIALIZER;
>>> +
>>> +        ds_put_format(&mp_match, "%s == 1 && reg0 == %d && ",
>>> +                      REGBIT_MULTIPATH, i);
>>> +        ds_put_format(&mp_match, "ip%s.%s == %s/%d",
>>> +                      is_ipv4 ? "4" : "6", dir,
>>> +                      network_s, plen);
>>> +
>>> +        ds_put_format(&mp_actions, "%sreg0 = ", is_ipv4 ? "" : "xx");
>>> +        if (gateway) {
>>> +            ds_put_cstr(&mp_actions, gateway);
>>> +        } else {
>>> +            ds_put_format(&mp_actions, "ip%s.dst", is_ipv4 ? "4" : "6");
>>> +        }
>>> +
>>> +        ds_put_format(&mp_actions, "; "
>>> +                      "%sreg1 = %s; "
>>> +                      "eth.src = %s; "
>>> +                      "outport = %s; "
>>> +                      "flags.loopback = 1; "
>>> +                      "next;",
>>> +                      is_ipv4 ? "" : "xx",
>>> +                      lrp_addr_s[i],
>>> +                      out_ports[i]->lrp_networks.ea_s,
>>> +                      out_ports[i]->json_key);
>>> +
>>> +        /* Add flow in table 6 to determine the right output port
>>> +         * for this traffic. */
>>> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_MULTIPATH, priority,
>>> +                      ds_cstr(&mp_match), ds_cstr(&mp_actions));
>>> +        ds_destroy(&mp_match);
>>> +        ds_destroy(&mp_actions);
>>> +    }
>>> +    ds_destroy(&match);
>>> +    ds_destroy(&actions);
>>> +}
>>> +
>>> +static bool
>>> +verify_nexthop_prefix(const struct nbrec_logical_router_static_route
>>> *route,
>>> +                      bool *is_ipv4, char **prefix_s, unsigned int
>>> *plen)
>>>  {
>>>      ovs_be32 nexthop;
>>> -    const char *lrp_addr_s = NULL;
>>> -    unsigned int plen;
>>> -    bool is_ipv4;
>>>
>>>      /* Verify that the next hop is an IP address with an all-ones mask.
>>> */
>>> -    char *error = ip_parse_cidr(route->nexthop, &nexthop, &plen);
>>> +    char *error = ip_parse_cidr(route->nexthop, &nexthop, plen);
>>>      if (!error) {
>>> -        if (plen != 32) {
>>> +        if (*plen != 32) {
>>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>> 1);
>>>              VLOG_WARN_RL(&rl, "bad next hop mask %s", route->nexthop);
>>> -            return;
>>> +            return false;
>>>          }
>>> -        is_ipv4 = true;
>>> +        *is_ipv4 = true;
>>>      } else {
>>>          free(error);
>>>
>>>          struct in6_addr ip6;
>>> -        error = ipv6_parse_cidr(route->nexthop, &ip6, &plen);
>>> +        error = ipv6_parse_cidr(route->nexthop, &ip6, plen);
>>>          if (!error) {
>>> -            if (plen != 128) {
>>> +            if (*plen != 128) {
>>>                  static struct vlog_rate_limit rl =
>>> VLOG_RATE_LIMIT_INIT(5, 1);
>>>                  VLOG_WARN_RL(&rl, "bad next hop mask %s",
>>> route->nexthop);
>>> -                return;
>>> +                return false;
>>>              }
>>> -            is_ipv4 = false;
>>> +            *is_ipv4 = false;
>>>          } else {
>>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>> 1);
>>>              VLOG_WARN_RL(&rl, "bad next hop ip address %s",
>>> route->nexthop);
>>>              free(error);
>>> -            return;
>>> +            return false;
>>>          }
>>>      }
>>>
>>> -    char *prefix_s;
>>> -    if (is_ipv4) {
>>> +    if (*is_ipv4) {
>>>          ovs_be32 prefix;
>>>          /* Verify that ip prefix is a valid IPv4 address. */
>>> -        error = ip_parse_cidr(route->ip_prefix, &prefix, &plen);
>>> +        error = ip_parse_cidr(route->ip_prefix, &prefix, plen);
>>>          if (error) {
>>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>> 1);
>>>              VLOG_WARN_RL(&rl, "bad 'ip_prefix' in static routes %s",
>>>                           route->ip_prefix);
>>>              free(error);
>>> -            return;
>>> +            return false;
>>>          }
>>> -        prefix_s = xasprintf(IP_FMT, IP_ARGS(prefix &
>>> be32_prefix_mask(plen)));
>>> +        *prefix_s = xasprintf(IP_FMT, IP_ARGS(prefix
>>> +                                              &
>>> be32_prefix_mask(*plen)));
>>>      } else {
>>>          /* Verify that ip prefix is a valid IPv6 address. */
>>>          struct in6_addr prefix;
>>> -        error = ipv6_parse_cidr(route->ip_prefix, &prefix, &plen);
>>> +        error = ipv6_parse_cidr(route->ip_prefix, &prefix, plen);
>>>          if (error) {
>>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>> 1);
>>>              VLOG_WARN_RL(&rl, "bad 'ip_prefix' in static routes %s",
>>>                           route->ip_prefix);
>>>              free(error);
>>> -            return;
>>> +            return false;
>>>          }
>>> -        struct in6_addr mask = ipv6_create_mask(plen);
>>> +        struct in6_addr mask = ipv6_create_mask(*plen);
>>>          struct in6_addr network = ipv6_addr_bitand(&prefix, &mask);
>>> -        prefix_s = xmalloc(INET6_ADDRSTRLEN);
>>> -        inet_ntop(AF_INET6, &network, prefix_s, INET6_ADDRSTRLEN);
>>> +        *prefix_s = xmalloc(INET6_ADDRSTRLEN);
>>> +        inet_ntop(AF_INET6, &network, *prefix_s, INET6_ADDRSTRLEN);
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static void
>>> +build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>>> +                        struct hmap *ports,
>>> +                        const struct nbrec_logical_router_static_route
>>> *route)
>>> +{
>>> +    const char *lrp_addr_s = NULL;
>>> +    unsigned int plen;
>>> +    bool is_ipv4;
>>> +    char *prefix_s = NULL;
>>> +
>>> +    if (!verify_nexthop_prefix(route, &is_ipv4, &prefix_s, &plen)) {
>>> +        return;
>>> +    }
>>> +
>>> +    /* Only one output_port is handled here. If the route contains multiple
>>> +     * output_ports, build_multipath_flow handles it instead. */
>>> +    if (route->n_output_port > 1) {
>>> +        return;
>>>      }
>>>
>>>      /* Find the outgoing port. */
>>>      struct ovn_port *out_port = NULL;
>>> -    if (route->output_port) {
>>> -        out_port = ovn_port_find(ports, route->output_port);
>>> +    if (route->n_output_port) {
>>> +        out_port = ovn_port_find(ports, route->output_port[0]);
>>>          if (!out_port) {
>>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>> 1);
>>>              VLOG_WARN_RL(&rl, "Bad out port %s for static route %s",
>>> -                         route->output_port, route->ip_prefix);
>>> +                         route->output_port[0], route->ip_prefix);
>>>              goto free_prefix_s;
>>>          }
>>>          lrp_addr_s = find_lrp_member_ip(out_port, route->nexthop);
>>> @@ -4270,7 +4372,77 @@ build_static_route_flow(struct hmap *lflows,
>>> struct ovn_datapath *od,
>>>                policy);
>>>
>>>  free_prefix_s:
>>> -    free(prefix_s);
>>> +    if (prefix_s) {
>>> +        free(prefix_s);
>>> +    }
>>> +}
>>> +
>>> +static void
>>> +build_multipath_flow(struct hmap *lflows, struct ovn_datapath *od,
>>> +                     struct hmap *ports,
>>> +                     const struct nbrec_logical_router_static_route
>>> *route)
>>> +{
>>> +    unsigned int plen;
>>> +    bool is_ipv4;
>>> +    char *prefix_s = NULL;
>>> +
>>> +    if (!verify_nexthop_prefix(route, &is_ipv4, &prefix_s, &plen)) {
>>> +        return;
>>> +    }
>>> +
>>> +    /* Find the outgoing port. */
>>> +    struct ovn_port **out_ports = xmalloc(route->n_output_port *
>>> +                                             sizeof(struct ovn_port *));
>>> +    const char **lrp_addr_s = xmalloc(route->n_output_port *
>>> +                                         sizeof(const char *));
>>> +    uint32_t idx = 0;
>>> +    for (int i = 0; i < route->n_output_port; i++) {
>>> +        out_ports[idx] = ovn_port_find(ports, route->output_port[i]);
>>> +        if (!out_ports[idx]) {
>>> +            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>> 1);
>>> +            VLOG_WARN_RL(&rl, "Bad out port %s for static route %s",
>>> +                         route->output_port[i], route->ip_prefix);
>>> +            continue;
>>> +        }
>>> +
>>> +        lrp_addr_s[idx] = find_lrp_member_ip(out_ports[idx],
>>> route->nexthop);
>>> +        if (!lrp_addr_s[idx]) {
>>> +            if (is_ipv4) {
>>> +                if (out_ports[idx]->lrp_networks.n_ipv4_addrs) {
>>> +                    lrp_addr_s[idx] = out_ports[idx]->
>>> +                                        lrp_networks.ipv4_addrs[0].addr_s;
>>> +                }
>>> +            } else {
>>> +                if (out_ports[idx]->lrp_networks.n_ipv6_addrs) {
>>> +                    lrp_addr_s[idx] = out_ports[idx]->
>>> +                                        lrp_networks.ipv6_addrs[0].addr_s;
>>> +                }
>>> +            }
>>> +        }
>>> +        if (!lrp_addr_s[idx]) {
>>> +            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>> 1);
>>> +            VLOG_WARN_RL(&rl,
>>> +                         "%s has no path for static route %s; next hop
>>> %s",
>>> +                         route->output_port[i], route->ip_prefix,
>>> +                         route->nexthop);
>>> +            continue;
>>> +        }
>>> +
>>> +        idx++;
>>> +    }
>>> +
>>> +    char *policy = route->policy ? route->policy : "dst-ip";
>>> +    if (idx > 0) {
>>> +        add_multipath_route(lflows, idx,
>>> +                            out_ports, lrp_addr_s, od,
>>> +                            prefix_s, plen, route->nexthop, policy);
>>> +    }
>>> +
>>> +    free(out_ports);
>>> +    free(lrp_addr_s);
>>> +    if (prefix_s) {
>>> +        free(prefix_s);
>>> +    }
>>>  }
>>>
>>>  static void
>>> @@ -5344,7 +5516,7 @@ build_lrouter_flows(struct hmap *datapaths, struct
>>> hmap *ports,
>>>          }
>>>      }
>>>
>>> -    /* Convert the static routes to flows. */
>>> +    /* Convert the static routes and multipath route to flows. */
>>>      HMAP_FOR_EACH (od, key_node, datapaths) {
>>>          if (!od->nbr) {
>>>              continue;
>>> @@ -5354,13 +5526,24 @@ build_lrouter_flows(struct hmap *datapaths,
>>> struct hmap *ports,
>>>              const struct nbrec_logical_router_static_route *route;
>>>
>>>              route = od->nbr->static_routes[i];
>>> -            build_static_route_flow(lflows, od, ports, route);
>>> +            if (route->n_output_port > 1) {
>>> +                /* Logical router ingress table 5-6: Multipath Routing.
>>> +                 *
>>> +                 * If the route is configured with multiple paths to the
>>> +                 * destination, the specific output port is chosen by
>>> +                 * hashing the packet's destination IP address. */
>>> +                build_multipath_flow(lflows, od, ports, route);
>>> +            } else {
>>> +                build_static_route_flow(lflows, od, ports, route);
>>> +            }
>>>          }
>>> +        /* Packets are allowed by default in table 6. */
>>> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_MULTIPATH, 0, "1",
>>> "next;");
>>>      }
>>>
>>>      /* XXX destination unreachable */
>>>
>>> -    /* Local router ingress table 6: ARP Resolution.
>>> +    /* Local router ingress table 7: ARP Resolution.
>>>       *
>>>       * Any packet that reaches this table is an IP packet whose
>>> next-hop IP
>>>       * address is in reg0. (ip4.dst is the final destination.) This
>>> table
>>> @@ -5555,7 +5738,7 @@ build_lrouter_flows(struct hmap *datapaths, struct
>>> hmap *ports,
>>>                        "get_nd(outport, xxreg0); next;");
>>>      }
>>>
>>> -    /* Logical router ingress table 7: Gateway redirect.
>>> +    /* Logical router ingress table 8: Gateway redirect.
>>>       *
>>>       * For traffic with outport equal to the l3dgw_port
>>>       * on a distributed router, this table redirects a subset
>>> @@ -5595,7 +5778,7 @@ build_lrouter_flows(struct hmap *datapaths, struct
>>> hmap *ports,
>>>          ovn_lflow_add(lflows, od, S_ROUTER_IN_GW_REDIRECT, 0, "1",
>>> "next;");
>>>      }
>>>
>>> -    /* Local router ingress table 8: ARP request.
>>> +    /* Local router ingress table 9: ARP request.
>>>       *
>>>       * In the common case where the Ethernet destination has been
>>> resolved,
>>>       * this table outputs the packet (priority 0).  Otherwise, it
>>> composes
>>> diff --git a/ovn/ovn-nb.ovsschema b/ovn/ovn-nb.ovsschema
>>> index a077bfb..7a43473 100644
>>> --- a/ovn/ovn-nb.ovsschema
>>> +++ b/ovn/ovn-nb.ovsschema
>>> @@ -1,7 +1,7 @@
>>>  {
>>>      "name": "OVN_Northbound",
>>> -    "version": "5.8.0",
>>> -    "cksum": "2812300190 16766",
>>> +    "version": "5.9.0",
>>> +    "cksum": "1515729450 16817",
>>>      "tables": {
>>>          "NB_Global": {
>>>              "columns": {
>>> @@ -235,7 +235,8 @@
>>>
>>> "dst-ip"]]},
>>>                                      "min": 0, "max": 1}},
>>>                  "nexthop": {"type": "string"},
>>> -                "output_port": {"type": {"key": "string", "min": 0,
>>> "max": 1}}},
>>> +                "output_port": {"type": {"key": "string", "min": 0,
>>> +                                         "max": "unlimited"}}},
>>>              "isRoot": false},
>>>          "NAT": {
>>>              "columns": {
>>> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
>>> index 9869d7e..eaba0c8 100644
>>> --- a/ovn/ovn-nb.xml
>>> +++ b/ovn/ovn-nb.xml
>>> @@ -1485,6 +1485,10 @@
>>>          multiple IP addresses on the router port and none of them are
>>> in the
>>>          same subnet of <ref column="nexthop"/>, OVN chooses the first IP
>>>          address as the one via which the <ref column="nexthop"/> is
>>> reachable.
>>> +        When it contains two or more ports, the packet has multiple
>>> +        candidate output ports. OVN uses the packet header to determine
>>> +        which port the packet is delivered to.
>>> +        Currently, OVN uses the destination IP field to choose the port.
>>>        </p>
>>>      </column>
>>>    </table>
>>> diff --git a/ovn/utilities/ovn-nbctl.c b/ovn/utilities/ovn-nbctl.c
>>> index 8e5c1a4..417194f 100644
>>> --- a/ovn/utilities/ovn-nbctl.c
>>> +++ b/ovn/utilities/ovn-nbctl.c
>>> @@ -397,7 +397,7 @@ Logical router port commands:\n\
>>>                              ('enabled' or 'disabled')\n\
>>>  \n\
>>>  Route commands:\n\
>>> -  [--policy=POLICY] lr-route-add ROUTER PREFIX NEXTHOP [PORT]\n\
>>> +  [--policy=POLICY] lr-route-add ROUTER PREFIX NEXTHOP [PORT]...\n\
>>>                              add a route to ROUTER\n\
>>>    lr-route-del ROUTER [PREFIX]\n\
>>>                              remove routes from ROUTER\n\
>>> @@ -2184,13 +2184,15 @@ normalize_prefix_str(const char *orig_prefix)
>>>          return normalize_ipv6_prefix(ipv6, plen);
>>>      }
>>>  }
>>> -
>>> +
>>>  static void
>>>  nbctl_lr_route_add(struct ctl_context *ctx)
>>>  {
>>>      const struct nbrec_logical_router *lr;
>>>      lr = lr_by_name_or_uuid(ctx, ctx->argv[1], true);
>>>      char *prefix, *next_hop;
>>> +    int n_output_port = 0;
>>> +    const char **output_port;
>>>
>>>      const char *policy = shash_find_data(&ctx->options, "--policy");
>>>      if (policy && strcmp(policy, "src-ip") && strcmp(policy, "dst-ip"))
>>> {
>>> @@ -2224,6 +2226,11 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>>>          }
>>>      }
>>>
>>> +    if (ctx->argc > 4) {
>>> +        n_output_port = ctx->argc - 4;
>>> +        output_port = (const char **)&ctx->argv[4];
>>> +    }
>>> +
>>>      bool may_exist = shash_find(&ctx->options, "--may-exist") != NULL;
>>>      for (int i = 0; i < lr->n_static_routes; i++) {
>>>          const struct nbrec_logical_router_static_route *route
>>> @@ -2253,9 +2260,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>>>          nbrec_logical_router_static_route_verify_nexthop(route);
>>>          nbrec_logical_router_static_route_set_ip_prefix(route, prefix);
>>>          nbrec_logical_router_static_route_set_nexthop(route, next_hop);
>>> -        if (ctx->argc == 5) {
>>> +        if (n_output_port > 0) {
>>>              nbrec_logical_router_static_route_set_output_port(route,
>>> -                                                              ctx->argv[4]);
>>> +                                                              output_port,
>>> +                                                              n_output_port);
>>>          }
>>>          if (policy) {
>>>               nbrec_logical_router_static_route_set_policy(route,
>>> policy);
>>> @@ -2270,8 +2278,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>>>      route = nbrec_logical_router_static_route_insert(ctx->txn);
>>>      nbrec_logical_router_static_route_set_ip_prefix(route, prefix);
>>>      nbrec_logical_router_static_route_set_nexthop(route, next_hop);
>>> -    if (ctx->argc == 5) {
>>> -        nbrec_logical_router_static_route_set_output_port(route,
>>> ctx->argv[4]);
>>> +    if (n_output_port > 0) {
>>> +        nbrec_logical_router_static_route_set_output_port(route,
>>> +                                                          output_port,
>>> +                                                          n_output_port);
>>>      }
>>>      if (policy) {
>>>          nbrec_logical_router_static_route_set_policy(route, policy);
>>> @@ -3066,8 +3076,8 @@ print_route(const struct
>>> nbrec_logical_router_static_route *route, struct ds *s)
>>>          ds_put_format(s, " %s", "dst-ip");
>>>      }
>>>
>>> -    if (route->output_port) {
>>> -        ds_put_format(s, " %s", route->output_port);
>>> +    for (int i = 0; i < route->n_output_port; i++) {
>>> +        ds_put_format(s, " %s", route->output_port[i]);
>>>      }
>>>      ds_put_char(s, '\n');
>>>  }
>>> @@ -3682,7 +3692,7 @@ static const struct ctl_command_syntax
>>> nbctl_commands[] = {
>>>        NULL, "", RO },
>>>
>>>      /* logical router route commands. */
>>> -    { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
>>> +    { "lr-route-add", 3, INT_MAX, "ROUTER PREFIX NEXTHOP [PORT]...",
>>> NULL,
>>>        nbctl_lr_route_add, NULL, "--may-exist,--policy=", RW },
>>>      { "lr-route-del", 1, 2, "ROUTER [PREFIX]", NULL, nbctl_lr_route_del,
>>>        NULL, "--if-exists", RW },
>>> --
>>> 1.8.3.1
>>>
>>>
>>
>
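
For readers following the quoted patch, here is a rough standalone sketch of the two calculations it relies on: the modulo_n link selection that table 5 stores in reg0, and the longest-prefix-match priority from add_multipath_route(). All sketch_* names are hypothetical, and sketch_hash() is only an FNV-1a stand-in: the real multipath action uses the switch's own hash, so the link chosen for a given address will differ on a real deployment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the datapath hash (FNV-1a over the four bytes
 * of the IPv4 destination). Not the hash that OVS actually uses. */
static uint32_t
sketch_hash(uint32_t nw_dst, uint32_t basis)
{
    uint32_t h = 2166136261u ^ basis;       /* FNV-1a offset basis. */
    for (int i = 0; i < 4; i++) {
        h ^= (nw_dst >> (i * 8)) & 0xff;
        h *= 16777619u;                     /* FNV-1a prime. */
    }
    return h;
}

/* Table 5 step: reduce the hash to a link index in [0, n_links), which is
 * what "modulo_n" asks for. Table 6 then matches reg0 against this index. */
static uint32_t
sketch_select_link(uint32_t nw_dst, uint32_t n_links)
{
    assert(n_links > 0);
    return sketch_hash(nw_dst, 0) % n_links;
}

/* Priority as computed in add_multipath_route(): twice the prefix length,
 * plus one for dst-ip routes, so that longer prefixes always win. */
static uint16_t
sketch_route_priority(unsigned int plen, bool src_policy)
{
    return src_policy ? plen * 2 : plen * 2 + 1;
}
```

Because the index depends only on the destination address, every packet to the same destination takes the same output port, while different destinations spread across the configured ports.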
Gao Zhenyu Oct. 26, 2017, 1:38 a.m. UTC | #4
ping.....


Thanks
Zhenyu Gao

2017-10-11 9:55 GMT+08:00 Gao Zhenyu <sysugaozhenyu@gmail.com>:

> Hi Miguel,
>
>    Thanks for your suggestion. It's very useful.
>    In my view, no matter whether we have a single router leg or multiple
> router legs on the edge router, we still need a way to dispatch traffic
> randomly, right?
>    So even if we implement multiple legs on a router, we can't easily
> spread traffic randomly across those legs (a static route only separates
> specific traffic). The multipath action is then a good candidate for it.
>
>    Currently, gateway chassis are linked to a single OVN edge logical
> router port, which means the traffic output by those gateway chassis
> carries the same source MAC.
>    I don't know if we have a good way to implement L3HA A/A in the current
> architecture. (Maybe adding a gateway_chassis options field and populating
> "rewrite-mac" and "rewrite-ip" to rewrite the MAC address is a way, but I
> don't think it is a good way and it may confuse people.)
>    So if you already have an idea for how to do it, it would be great to
> bring it up so we can discuss it and move the whole process faster. :)
>
>
> Thanks
> Zhenyu Gao
>
> 2017-10-11 9:50 GMT+08:00 Gao Zhenyu <sysugaozhenyu@gmail.com>:
>
>> I discussed this multipath work with Miguel in another mailing thread, and
>> I want to bring the discussion to the ovs mailing list and hope to collect
>> more suggestions from all of you. :)
>>
>> Here is Miguel's suggestion on it.
>>
>> =================================
>> Hi Gao,
>>
>>    Sorry, I didn't have more time to look at it currently (although it's
>> a topic of my interest.)
>>
>>    I'm worried about the replication of concerns inside networking-ovn
>> related routing, and I don't see the advantage of l3gateway mode, beyond
>> legacy usage.
>>
>>    I understand the limitation you expressed about the
>> "chassisredirect"/"gatewaychassis" mode only being able to expose a
>> single external router leg.
>>
>>    If that's a limitation that doesn't work for you, my opinion is that
>> we should work on fixing that limitation, and keeping all our development
>> efforts in a single place, with distributed E/W routing.
>>
>>    That way we could construct L3HA A/A by having every
>> gateway_chassis have the same priority, and possibly some extra options.
>>
>>    But again, please, this is a discussion we may have on the development
>> mailing list, because maybe my point of view is too narrow.
>>
>>     Can you bring it up on the mailing list, or do you want me to do it?
>>
>>    Best regards,
>> =================================
>>
>> 2017-10-08 17:42 GMT+08:00 Gao Zhenyu <sysugaozhenyu@gmail.com>:
>>
>>> Comments and suggestions are welcome :)
>>>
>>> Thanks
>>> Zhenyu Gao
>>>
>>> 2017-09-26 17:52 GMT+08:00 Zhenyu Gao <sysugaozhenyu@gmail.com>:
>>>
>>>> 1. ovn-nb.ovsschema was updated output_port field. Change the max entry
>>>> number from 1 to unlimited.
>>>> 2. Add multipath feature in ovn-northd part. northd generates multipath
>>>> flows to dispatch traffic by using packet's IP dst address if route's
>>>> output_port contains two or more ports.
>>>> 3. Add new table(lr_in_multipath) in ovn-northd's router ingress stages
>>>> to dispatch traffic to ports.
>>>> 4. Add multipath flow in Table 5(lr_in_ip_routing) and store hash result
>>>> into reg0. reg9[2] was used to indicate packet which need dispatching.
>>>> 5. Add multipath feature description in ovn/northd/ovn-northd.8.xml
>>>> and ovn/ovn-nb.xml
>>>> 6. ovn-nbctl.c was updated to handle configuring multiple output_port.
>>>>
>>>> Signed-off-by: Zhenyu Gao <sysugaozhenyu@gmail.com>
>>>> ---
>>>>  ovn/northd/ovn-northd.8.xml |  67 +++++++++++-
>>>>  ovn/northd/ovn-northd.c     | 257 +++++++++++++++++++++++++++++++++++++-------
>>>>  ovn/ovn-nb.ovsschema        |   7 +-
>>>>  ovn/ovn-nb.xml              |   4 +
>>>>  ovn/utilities/ovn-nbctl.c   |  28 +++--
>>>>  5 files changed, 311 insertions(+), 52 deletions(-)
>>>>
>>>> diff --git a/ovn/northd/ovn-northd.8.xml b/ovn/northd/ovn-northd.8.xml
>>>> index 0d85ec0..b1ce9a9 100644
>>>> --- a/ovn/northd/ovn-northd.8.xml
>>>> +++ b/ovn/northd/ovn-northd.8.xml
>>>> @@ -1598,6 +1598,9 @@ icmp4 {
>>>>        port (ingress table <code>ARP Request</code> will generate an ARP
>>>>        request, if needed, with <code>reg0</code> as the target protocol
>>>>        address and <code>reg1</code> as the source protocol address).
>>>> +      An IP route can be configured with multiple paths to the next hop.
>>>> +      If a packet has multiple paths to its destination, OVN stores the
>>>> +      port index in reg0 to indicate the packet's output port in table 6.
>>>>      </p>
>>>>
>>>>      <p>
>>>> @@ -1617,6 +1620,28 @@ icmp4 {
>>>>
>>>>        <li>
>>>>          <p>
>>>> +          IPv4/IPv6 multipath routing table. For each route to IPv4/IPv6
>>>> +          network <var>N</var> with netmask <var>M</var>, on multipath port
>>>> +          <var>P</var> with IP address <var>A</var> and Ethernet
>>>> +          address <var>E</var>, a logical flow with match
>>>> +          <code>ip4.dst == <var>N</var>/<var>M</var></code>, whose priority
>>>> +          is twice the number of 1-bits in <var>M</var>, plus one,
>>>> +          has the following actions:
>>>> +        </p>
>>>> +
>>>> +        <pre>
>>>> +ip.ttl--;
>>>> +multipath (nw_dst, 0, modulo_n, <var>n_links</var>, 0, reg0);
>>>> +reg9[2] = 1;
>>>> +next;
>>>> +        </pre>
>>>> +        <p>
>>>> +          <var>n_links</var> is the number of multipath ports.
>>>> +        </p>
>>>> +      </li>
>>>> +
>>>> +      <li>
>>>> +        <p>
>>>>            IPv4 routing table.  For each route to IPv4 network
>>>> <var>N</var> with
>>>>            netmask <var>M</var>, on router port <var>P</var> with IP
>>>> address
>>>>            <var>A</var> and Ethernet
>>>> @@ -1686,7 +1711,43 @@ next;
>>>>        </li>
>>>>      </ul>
>>>>
>>>> -    <h3>Ingress Table 6: ARP/ND Resolution</h3>
>>>> +    <h3>Ingress Table 6: Multipath</h3>
>>>> +    <p>
>>>> +      Any packet that reaches this table with reg9[2]=1 is an IP packet
>>>> +      that the following flows route to the corresponding port. This
>>>> +      table implements dispatching by consuming reg0.
>>>> +    </p>
>>>> +
>>>> +    <ul>
>>>> +      <li>
>>>> +        <p>
>>>> +          A logical flow that matches a packet with netmask <var>M</var>,
>>>> +          IP address <var>A</var> and <code>reg9[2] = 1</code>, and whose
>>>> +          priority is above 1, has the following actions:
>>>> +        </p>
>>>> +
>>>> +        <pre>
>>>> +reg0 = <var>G</var>;
>>>> +reg1 = <var>A</var>;
>>>> +eth.src = <var>E</var>;
>>>> +outport = <var>P</var>;
>>>> +flags.loopback = 1;
>>>> +next;
>>>> +        </pre>
>>>> +
>>>> +        <p>
>>>> +          <var>G</var> is the gateway IP address. <var>A</var>, <var>E</var>
>>>> +          and <var>P</var> are the values that were described in multipath
>>>> +          routing in table 5.
>>>> +        </p>
>>>> +
>>>> +        <p>
>>>> +          A priority-0 logical flow with match <code>1</code> has actions
>>>> +          <code>next;</code>.
>>>> +        </p>
>>>> +      </li>
>>>> +    </ul>
>>>> +
>>>> +    <h3>Ingress Table 7: ARP/ND Resolution</h3>
>>>>
>>>>      <p>
>>>>        Any packet that reaches this table is an IP packet whose next-hop
>>>> @@ -1779,7 +1840,7 @@ next;
>>>>        </li>
>>>>      </ul>
>>>>
>>>> -    <h3>Ingress Table 7: Gateway Redirect</h3>
>>>> +    <h3>Ingress Table 8: Gateway Redirect</h3>
>>>>
>>>>      <p>
>>>>        For distributed logical routers where one of the logical router
>>>> @@ -1836,7 +1897,7 @@ next;
>>>>        </li>
>>>>      </ul>
>>>>
>>>> -    <h3>Ingress Table 8: ARP Request</h3>
>>>> +    <h3>Ingress Table 9: ARP Request</h3>
>>>>
>>>>      <p>
>>>>        In the common case where the Ethernet destination has been
>>>> resolved, this
>>>> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
>>>> index 49e4ac3..f8bfee2 100644
>>>> --- a/ovn/northd/ovn-northd.c
>>>> +++ b/ovn/northd/ovn-northd.c
>>>> @@ -135,9 +135,10 @@ enum ovn_stage {
>>>>      PIPELINE_STAGE(ROUTER, IN,  UNSNAT,      3, "lr_in_unsnat")       \
>>>>      PIPELINE_STAGE(ROUTER, IN,  DNAT,        4, "lr_in_dnat")         \
>>>>      PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,  5, "lr_in_ip_routing")   \
>>>> -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 6, "lr_in_arp_resolve")  \
>>>> -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT, 7, "lr_in_gw_redirect")  \
>>>> -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 8, "lr_in_arp_request")  \
>>>> +    PIPELINE_STAGE(ROUTER, IN,  MULTIPATH,   6, "lr_in_multipath")    \
>>>> +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 7, "lr_in_arp_resolve")  \
>>>> +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT, 8, "lr_in_gw_redirect")  \
>>>> +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 9, "lr_in_arp_request")  \
>>>>                                                                        \
>>>>      /* Logical router egress stages. */                               \
>>>>      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
>>>> @@ -173,6 +174,11 @@ enum ovn_stage {
>>>>   * one of the logical router's own IP addresses. */
>>>>  #define REGBIT_EGRESS_LOOPBACK  "reg9[1]"
>>>>
>>>> +/* Indicates that the multipath action has processed this packet and stored
>>>> + * the hash result in another regX. Later tables should consume the hash
>>>> + * result to determine the right output port. */
>>>> +#define REGBIT_MULTIPATH "reg9[2]"
>>>> +
>>>>  /* Returns an "enum ovn_stage" built from the arguments. */
>>>>  static enum ovn_stage
>>>>  ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline
>>>> pipeline,
>>>> @@ -4142,82 +4148,178 @@ add_route(struct hmap *lflows, const struct
>>>> ovn_port *op,
>>>>  }
>>>>
>>>>  static void
>>>> -build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>>>> -                        struct hmap *ports,
>>>> -                        const struct nbrec_logical_router_static_route
>>>> *route)
>>>> +add_multipath_route(struct hmap *lflows, uint32_t port_num,
>>>> +                    struct ovn_port **out_ports,
>>>> +                    const char **lrp_addr_s,
>>>> +                    struct ovn_datapath *od,
>>>> +                    const char *network_s, int plen,
>>>> +                    const char *gateway, const char *policy)
>>>> +{
>>>> +    bool is_ipv4 = strchr(network_s, '.') ? true : false;
>>>> +    struct ds match = DS_EMPTY_INITIALIZER;
>>>> +    const char *dir;
>>>> +    uint16_t priority;
>>>> +
>>>> +    if (policy && !strcmp(policy, "src-ip")) {
>>>> +        dir = "src";
>>>> +        priority = plen * 2;
>>>> +    } else {
>>>> +        dir = "dst";
>>>> +        priority = (plen * 2) + 1;
>>>> +    }
>>>> +
>>>> +    ds_put_format(&match, "ip%s.%s == %s/%d", is_ipv4 ? "4" : "6", dir,
>>>> +                  network_s, plen);
>>>> +
>>>> +    struct ds actions = DS_EMPTY_INITIALIZER;
>>>> +
>>>> +    ds_put_format(&actions, "ip.ttl--; ");
>>>> +    ds_put_format(&actions,
>>>> +                  "multipath (nw_dst, 0, modulo_n, %u, 0, reg0); "
>>>> +                  "%s = 1; "
>>>> +                  "next;",
>>>> +                  port_num, REGBIT_MULTIPATH);
>>>> +
>>>> +    /* The priority here is calculated to implement
>>>> longest-prefix-match
>>>> +     * routing. */
>>>> +    ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
>>>> +                  ds_cstr(&match), ds_cstr(&actions));
>>>> +
>>>> +    for (int i = 0; i < port_num; i++) {
>>>> +        struct ds mp_match = DS_EMPTY_INITIALIZER;
>>>> +        struct ds mp_actions = DS_EMPTY_INITIALIZER;
>>>> +
>>>> +        ds_put_format(&mp_match, "%s == 1 && reg0 == %d && ",
>>>> +                      REGBIT_MULTIPATH, i);
>>>> +        ds_put_format(&mp_match, "ip%s.%s == %s/%d",
>>>> +                      is_ipv4 ? "4" : "6", dir,
>>>> +                      network_s, plen);
>>>> +
>>>> +        ds_put_format(&mp_actions, "%sreg0 = ", is_ipv4 ? "" : "xx");
>>>> +        if (gateway) {
>>>> +            ds_put_cstr(&mp_actions, gateway);
>>>> +        } else {
>>>> +            ds_put_format(&mp_actions, "ip%s.dst", is_ipv4 ? "4" :
>>>> "6");
>>>> +        }
>>>> +
>>>> +        ds_put_format(&mp_actions, "; "
>>>> +                      "%sreg1 = %s; "
>>>> +                      "eth.src = %s; "
>>>> +                      "outport = %s; "
>>>> +                      "flags.loopback = 1; "
>>>> +                      "next;",
>>>> +                      is_ipv4 ? "" : "xx",
>>>> +                      lrp_addr_s[i],
>>>> +                      out_ports[i]->lrp_networks.ea_s,
>>>> +                      out_ports[i]->json_key);
>>>> +
>>>> +        /* Add flow in table 6 to determine the right output port
>>>> +         * for this traffic. */
>>>> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_MULTIPATH, priority,
>>>> +                      ds_cstr(&mp_match), ds_cstr(&mp_actions));
>>>> +        ds_destroy(&mp_match);
>>>> +        ds_destroy(&mp_actions);
>>>> +    }
>>>> +    ds_destroy(&match);
>>>> +    ds_destroy(&actions);
>>>> +}
>>>> +
>>>> +static bool
>>>> +verify_nexthop_prefix(const struct nbrec_logical_router_static_route
>>>> *route,
>>>> +                      bool *is_ipv4, char **prefix_s, unsigned int
>>>> *plen)
>>>>  {
>>>>      ovs_be32 nexthop;
>>>> -    const char *lrp_addr_s = NULL;
>>>> -    unsigned int plen;
>>>> -    bool is_ipv4;
>>>>
>>>>      /* Verify that the next hop is an IP address with an all-ones
>>>> mask. */
>>>> -    char *error = ip_parse_cidr(route->nexthop, &nexthop, &plen);
>>>> +    char *error = ip_parse_cidr(route->nexthop, &nexthop, plen);
>>>>      if (!error) {
>>>> -        if (plen != 32) {
>>>> +        if (*plen != 32) {
>>>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>>> 1);
>>>>              VLOG_WARN_RL(&rl, "bad next hop mask %s", route->nexthop);
>>>> -            return;
>>>> +            return false;
>>>>          }
>>>> -        is_ipv4 = true;
>>>> +        *is_ipv4 = true;
>>>>      } else {
>>>>          free(error);
>>>>
>>>>          struct in6_addr ip6;
>>>> -        error = ipv6_parse_cidr(route->nexthop, &ip6, &plen);
>>>> +        error = ipv6_parse_cidr(route->nexthop, &ip6, plen);
>>>>          if (!error) {
>>>> -            if (plen != 128) {
>>>> +            if (*plen != 128) {
>>>>                  static struct vlog_rate_limit rl =
>>>> VLOG_RATE_LIMIT_INIT(5, 1);
>>>>                  VLOG_WARN_RL(&rl, "bad next hop mask %s",
>>>> route->nexthop);
>>>> -                return;
>>>> +                return false;
>>>>              }
>>>> -            is_ipv4 = false;
>>>> +            *is_ipv4 = false;
>>>>          } else {
>>>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>>> 1);
>>>>              VLOG_WARN_RL(&rl, "bad next hop ip address %s",
>>>> route->nexthop);
>>>>              free(error);
>>>> -            return;
>>>> +            return false;
>>>>          }
>>>>      }
>>>>
>>>> -    char *prefix_s;
>>>> -    if (is_ipv4) {
>>>> +    if (*is_ipv4) {
>>>>          ovs_be32 prefix;
>>>>          /* Verify that ip prefix is a valid IPv4 address. */
>>>> -        error = ip_parse_cidr(route->ip_prefix, &prefix, &plen);
>>>> +        error = ip_parse_cidr(route->ip_prefix, &prefix, plen);
>>>>          if (error) {
>>>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>>> 1);
>>>>              VLOG_WARN_RL(&rl, "bad 'ip_prefix' in static routes %s",
>>>>                           route->ip_prefix);
>>>>              free(error);
>>>> -            return;
>>>> +            return false;
>>>>          }
>>>> -        prefix_s = xasprintf(IP_FMT, IP_ARGS(prefix &
>>>> be32_prefix_mask(plen)));
>>>> +        *prefix_s = xasprintf(IP_FMT, IP_ARGS(prefix
>>>> +                                              &
>>>> be32_prefix_mask(*plen)));
>>>>      } else {
>>>>          /* Verify that ip prefix is a valid IPv6 address. */
>>>>          struct in6_addr prefix;
>>>> -        error = ipv6_parse_cidr(route->ip_prefix, &prefix, &plen);
>>>> +        error = ipv6_parse_cidr(route->ip_prefix, &prefix, plen);
>>>>          if (error) {
>>>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>>> 1);
>>>>              VLOG_WARN_RL(&rl, "bad 'ip_prefix' in static routes %s",
>>>>                           route->ip_prefix);
>>>>              free(error);
>>>> -            return;
>>>> +            return false;
>>>>          }
>>>> -        struct in6_addr mask = ipv6_create_mask(plen);
>>>> +        struct in6_addr mask = ipv6_create_mask(*plen);
>>>>          struct in6_addr network = ipv6_addr_bitand(&prefix, &mask);
>>>> -        prefix_s = xmalloc(INET6_ADDRSTRLEN);
>>>> -        inet_ntop(AF_INET6, &network, prefix_s, INET6_ADDRSTRLEN);
>>>> +        *prefix_s = xmalloc(INET6_ADDRSTRLEN);
>>>> +        inet_ntop(AF_INET6, &network, *prefix_s, INET6_ADDRSTRLEN);
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static void
>>>> +build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>>>> +                        struct hmap *ports,
>>>> +                        const struct nbrec_logical_router_static_route
>>>> *route)
>>>> +{
>>>> +    const char *lrp_addr_s = NULL;
>>>> +    unsigned int plen;
>>>> +    bool is_ipv4;
>>>> +    char *prefix_s = NULL;
>>>> +
>>>> +    if (!verify_nexthop_prefix(route, &is_ipv4, &prefix_s, &plen)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /* Only need one output_port, if route contains multiple
>>>> output_port, then
>>>> +     * we should use build_multipath_flow to handle it. */
>>>> +    if (route->n_output_port > 1) {
>>>> +        return;
>>>>      }
>>>>
>>>>      /* Find the outgoing port. */
>>>>      struct ovn_port *out_port = NULL;
>>>> -    if (route->output_port) {
>>>> -        out_port = ovn_port_find(ports, route->output_port);
>>>> +    if (route->n_output_port) {
>>>> +        out_port = ovn_port_find(ports, route->output_port[0]);
>>>>          if (!out_port) {
>>>>              static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>>> 1);
>>>>              VLOG_WARN_RL(&rl, "Bad out port %s for static route %s",
>>>> -                         route->output_port, route->ip_prefix);
>>>> +                         route->output_port[0], route->ip_prefix);
>>>>              goto free_prefix_s;
>>>>          }
>>>>          lrp_addr_s = find_lrp_member_ip(out_port, route->nexthop);
>>>> @@ -4270,7 +4372,77 @@ build_static_route_flow(struct hmap *lflows,
>>>> struct ovn_datapath *od,
>>>>                policy);
>>>>
>>>>  free_prefix_s:
>>>> -    free(prefix_s);
>>>> +    if (prefix_s) {
>>>> +        free(prefix_s);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void
>>>> +build_multipath_flow(struct hmap *lflows, struct ovn_datapath *od,
>>>> +                     struct hmap *ports,
>>>> +                     const struct nbrec_logical_router_static_route
>>>> *route)
>>>> +{
>>>> +    unsigned int plen;
>>>> +    bool is_ipv4;
>>>> +    char *prefix_s = NULL;
>>>> +
>>>> +    if (!verify_nexthop_prefix(route, &is_ipv4, &prefix_s, &plen)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /* Find the outgoing port. */
>>>> +    struct ovn_port **out_ports = xmalloc(route->n_output_port *
>>>> +                                             sizeof(struct ovn_port
>>>> *));
>>>> +    const char **lrp_addr_s = xmalloc(route->n_output_port *
>>>> +                                         sizeof(const char *));
>>>> +    uint32_t idx = 0;
>>>> +    for (int i = 0; i < route->n_output_port; i++) {
>>>> +        out_ports[idx] = ovn_port_find(ports, route->output_port[i]);
>>>> +        if (!out_ports[idx]) {
>>>> +            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>>> 1);
>>>> +            VLOG_WARN_RL(&rl, "Bad out port %s for static route %s",
>>>> +                         route->output_port[i], route->ip_prefix);
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        lrp_addr_s[idx] = find_lrp_member_ip(out_ports[idx],
>>>> route->nexthop);
>>>> +        if (!lrp_addr_s[idx]) {
>>>> +            if (is_ipv4) {
>>>> +                if (out_ports[idx]->lrp_networks.n_ipv4_addrs) {
>>>> +                    lrp_addr_s[idx] = out_ports[idx]->
>>>> +                                        lrp_networks.ipv4_addrs[0].addr_s;
>>>> +                }
>>>> +            } else {
>>>> +                if (out_ports[idx]->lrp_networks.n_ipv6_addrs) {
>>>> +                    lrp_addr_s[idx] = out_ports[idx]->
>>>> +                                        lrp_networks.ipv6_addrs[0].addr_s;
>>>> +                }
>>>> +            }
>>>> +        }
>>>> +        if (!lrp_addr_s[idx]) {
>>>> +            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5,
>>>> 1);
>>>> +            VLOG_WARN_RL(&rl,
>>>> +                         "%s has no path for static route %s; next hop
>>>> %s",
>>>> +                         route->output_port[i], route->ip_prefix,
>>>> +                         route->nexthop);
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        idx++;
>>>> +    }
>>>> +
>>>> +    char *policy = route->policy ? route->policy : "dst-ip";
>>>> +    if (idx > 0) {
>>>> +        add_multipath_route(lflows, idx,
>>>> +                            out_ports, lrp_addr_s, od,
>>>> +                            prefix_s, plen, route->nexthop, policy);
>>>> +    }
>>>> +
>>>> +    free(out_ports);
>>>> +    free(lrp_addr_s);
>>>> +    if (prefix_s) {
>>>> +        free(prefix_s);
>>>> +    }
>>>>  }
>>>>
>>>>  static void
>>>> @@ -5344,7 +5516,7 @@ build_lrouter_flows(struct hmap *datapaths,
>>>> struct hmap *ports,
>>>>          }
>>>>      }
>>>>
>>>> -    /* Convert the static routes to flows. */
>>>> +    /* Convert the static routes and multipath route to flows. */
>>>>      HMAP_FOR_EACH (od, key_node, datapaths) {
>>>>          if (!od->nbr) {
>>>>              continue;
>>>> @@ -5354,13 +5526,24 @@ build_lrouter_flows(struct hmap *datapaths,
>>>> struct hmap *ports,
>>>>              const struct nbrec_logical_router_static_route *route;
>>>>
>>>>              route = od->nbr->static_routes[i];
>>>> -            build_static_route_flow(lflows, od, ports, route);
>>>> +            if (route->n_output_port > 1) {
>>>> +                /* Logical router ingress table 5-6: Multipath Routing.
>>>> +                 *
>>>> +                 * If a route gives traffic multiple paths to the
>>>> +                 * destination, the specific output port is figured out
>>>> +                 * by hashing the packet's IP destination address. */
>>>> +                build_multipath_flow(lflows, od, ports, route);
>>>> +            } else {
>>>> +                build_static_route_flow(lflows, od, ports, route);
>>>> +            }
>>>>          }
>>>> +        /* Packets are allowed by default in table 6. */
>>>> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_MULTIPATH, 0, "1",
>>>> "next;");
>>>>      }
>>>>
>>>>      /* XXX destination unreachable */
>>>>
>>>> -    /* Local router ingress table 6: ARP Resolution.
>>>> +    /* Local router ingress table 7: ARP Resolution.
>>>>       *
>>>>       * Any packet that reaches this table is an IP packet whose
>>>> next-hop IP
>>>>       * address is in reg0. (ip4.dst is the final destination.) This
>>>> table
>>>> @@ -5555,7 +5738,7 @@ build_lrouter_flows(struct hmap *datapaths,
>>>> struct hmap *ports,
>>>>                        "get_nd(outport, xxreg0); next;");
>>>>      }
>>>>
>>>> -    /* Logical router ingress table 7: Gateway redirect.
>>>> +    /* Logical router ingress table 8: Gateway redirect.
>>>>       *
>>>>       * For traffic with outport equal to the l3dgw_port
>>>>       * on a distributed router, this table redirects a subset
>>>> @@ -5595,7 +5778,7 @@ build_lrouter_flows(struct hmap *datapaths,
>>>> struct hmap *ports,
>>>>          ovn_lflow_add(lflows, od, S_ROUTER_IN_GW_REDIRECT, 0, "1",
>>>> "next;");
>>>>      }
>>>>
>>>> -    /* Local router ingress table 8: ARP request.
>>>> +    /* Local router ingress table 9: ARP request.
>>>>       *
>>>>       * In the common case where the Ethernet destination has been
>>>> resolved,
>>>>       * this table outputs the packet (priority 0).  Otherwise, it
>>>> composes
>>>> diff --git a/ovn/ovn-nb.ovsschema b/ovn/ovn-nb.ovsschema
>>>> index a077bfb..7a43473 100644
>>>> --- a/ovn/ovn-nb.ovsschema
>>>> +++ b/ovn/ovn-nb.ovsschema
>>>> @@ -1,7 +1,7 @@
>>>>  {
>>>>      "name": "OVN_Northbound",
>>>> -    "version": "5.8.0",
>>>> -    "cksum": "2812300190 16766",
>>>> +    "version": "5.9.0",
>>>> +    "cksum": "1515729450 16817",
>>>>      "tables": {
>>>>          "NB_Global": {
>>>>              "columns": {
>>>> @@ -235,7 +235,8 @@
>>>>
>>>> "dst-ip"]]},
>>>>                                      "min": 0, "max": 1}},
>>>>                  "nexthop": {"type": "string"},
>>>> -                "output_port": {"type": {"key": "string", "min": 0,
>>>> "max": 1}}},
>>>> +                "output_port": {"type": {"key": "string", "min": 0,
>>>> +                                         "max": "unlimited"}}},
>>>>              "isRoot": false},
>>>>          "NAT": {
>>>>              "columns": {
>>>> diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
>>>> index 9869d7e..eaba0c8 100644
>>>> --- a/ovn/ovn-nb.xml
>>>> +++ b/ovn/ovn-nb.xml
>>>> @@ -1485,6 +1485,10 @@
>>>>          multiple IP addresses on the router port and none of them are
>>>> in the
>>>>          same subnet of <ref column="nexthop"/>, OVN chooses the first
>>>> IP
>>>>          address as the one via which the <ref column="nexthop"/> is
>>>> reachable.
>>>> +        When it contains two or more ports, the packet has multiple
>>>> +        candidate output ports. OVN uses the packet header to determine
>>>> +        which port the packet is delivered to.
>>>> +        Currently, OVN uses the destination IP field to choose the port.
>>>>        </p>
>>>>      </column>
>>>>    </table>
>>>> diff --git a/ovn/utilities/ovn-nbctl.c b/ovn/utilities/ovn-nbctl.c
>>>> index 8e5c1a4..417194f 100644
>>>> --- a/ovn/utilities/ovn-nbctl.c
>>>> +++ b/ovn/utilities/ovn-nbctl.c
>>>> @@ -397,7 +397,7 @@ Logical router port commands:\n\
>>>>                              ('enabled' or 'disabled')\n\
>>>>  \n\
>>>>  Route commands:\n\
>>>> -  [--policy=POLICY] lr-route-add ROUTER PREFIX NEXTHOP [PORT]\n\
>>>> +  [--policy=POLICY] lr-route-add ROUTER PREFIX NEXTHOP [PORT]...\n\
>>>>                              add a route to ROUTER\n\
>>>>    lr-route-del ROUTER [PREFIX]\n\
>>>>                              remove routes from ROUTER\n\
>>>> @@ -2184,13 +2184,15 @@ normalize_prefix_str(const char *orig_prefix)
>>>>          return normalize_ipv6_prefix(ipv6, plen);
>>>>      }
>>>>  }
>>>> -
>>>> +
>>>>  static void
>>>>  nbctl_lr_route_add(struct ctl_context *ctx)
>>>>  {
>>>>      const struct nbrec_logical_router *lr;
>>>>      lr = lr_by_name_or_uuid(ctx, ctx->argv[1], true);
>>>>      char *prefix, *next_hop;
>>>> +    int n_output_port = 0;
>>>> +    const char **output_port;
>>>>
>>>>      const char *policy = shash_find_data(&ctx->options, "--policy");
>>>>      if (policy && strcmp(policy, "src-ip") && strcmp(policy,
>>>> "dst-ip")) {
>>>> @@ -2224,6 +2226,11 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>>>>          }
>>>>      }
>>>>
>>>> +    if (ctx->argc > 4) {
>>>> +        n_output_port = ctx->argc - 4;
>>>> +        output_port = (const char **)&ctx->argv[4];
>>>> +    }
>>>> +
>>>>      bool may_exist = shash_find(&ctx->options, "--may-exist") != NULL;
>>>>      for (int i = 0; i < lr->n_static_routes; i++) {
>>>>          const struct nbrec_logical_router_static_route *route
>>>> @@ -2253,9 +2260,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>>>>          nbrec_logical_router_static_route_verify_nexthop(route);
>>>>          nbrec_logical_router_static_route_set_ip_prefix(route,
>>>> prefix);
>>>>          nbrec_logical_router_static_route_set_nexthop(route,
>>>> next_hop);
>>>> -        if (ctx->argc == 5) {
>>>> +        if (n_output_port > 0) {
>>>>              nbrec_logical_router_static_route_set_output_port(route,
>>>> -
>>>> ctx->argv[4]);
>>>> +
>>>> output_port,
>>>> +
>>>> n_output_port);
>>>>          }
>>>>          if (policy) {
>>>>               nbrec_logical_router_static_route_set_policy(route,
>>>> policy);
>>>> @@ -2270,8 +2278,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>>>>      route = nbrec_logical_router_static_route_insert(ctx->txn);
>>>>      nbrec_logical_router_static_route_set_ip_prefix(route, prefix);
>>>>      nbrec_logical_router_static_route_set_nexthop(route, next_hop);
>>>> -    if (ctx->argc == 5) {
>>>> -        nbrec_logical_router_static_route_set_output_port(route,
>>>> ctx->argv[4]);
>>>> +    if (n_output_port > 0) {
>>>> +        nbrec_logical_router_static_route_set_output_port(route,
>>>> +                                                          output_port,
>>>> +
>>>> n_output_port);
>>>>      }
>>>>      if (policy) {
>>>>          nbrec_logical_router_static_route_set_policy(route, policy);
>>>> @@ -3066,8 +3076,8 @@ print_route(const struct
>>>> nbrec_logical_router_static_route *route, struct ds *s)
>>>>          ds_put_format(s, " %s", "dst-ip");
>>>>      }
>>>>
>>>> -    if (route->output_port) {
>>>> -        ds_put_format(s, " %s", route->output_port);
>>>> +    for (int i = 0; i < route->n_output_port; i++) {
>>>> +        ds_put_format(s, " %s", route->output_port[i]);
>>>>      }
>>>>      ds_put_char(s, '\n');
>>>>  }
>>>> @@ -3682,7 +3692,7 @@ static const struct ctl_command_syntax
>>>> nbctl_commands[] = {
>>>>        NULL, "", RO },
>>>>
>>>>      /* logical router route commands. */
>>>> -    { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
>>>> +    { "lr-route-add", 3, INT_MAX, "ROUTER PREFIX NEXTHOP [PORT]...",
>>>> NULL,
>>>>        nbctl_lr_route_add, NULL, "--may-exist,--policy=", RW },
>>>>      { "lr-route-del", 1, 2, "ROUTER [PREFIX]", NULL,
>>>> nbctl_lr_route_del,
>>>>        NULL, "--if-exists", RW },
>>>> --
>>>> 1.8.3.1
>>>>
>>>>
>>>
>>
>
Ben Pfaff Oct. 30, 2017, 9:54 p.m. UTC | #5
Thank you for sending this series.  I feel bad that this series did not
receive a timely review.  Let me help a little bit.

I don't entirely understand the series.  This is mostly because of a
lack of context, which the series could be revised to provide.

One issue I have is that the term "multipath" is used for a few
different purposes in networking.  Sometimes, multiple paths are used
for reliability, and sometimes they are used to take advantage of
bandwidth available along multiple paths.  I don't know which meaning is
intended here.

I also suffer a bit from not understanding the purpose to which these
patches will be put at a higher level.  At the cloud management system
(CMS) level, such as the level of Neutron or equivalent, how will the
user use this feature?

Once I understand the purpose a bit better, perhaps I can contribute to
providing reviews, or help direct the patches to someone who can better
review them.

(The answers to these questions should probably not be just in email,
but also added in appropriate places in the documentation for the
patches themselves.)

Thanks,

Ben.
Gao Zhenyu Oct. 31, 2017, 1:39 a.m. UTC | #6
Hi Ben,

   Thanks for the comments!
   The multipath feature can spread traffic across multiple gateways. If you
have two gateway routers, each pinned to a chassis (gateway1 pinned to
chassis-A, gateway2 pinned to chassis-B), multipath can distribute traffic
across those chassis automatically.
   Otherwise, the user has to configure many static routes to achieve this.

   I had not previously considered the CMS question you raised.


Thanks
Zhenyu Gao
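
To make the dispatch idea in the reply above concrete, here is a minimal C sketch, assuming a modulo_n-style selection over the destination IP as in the patch's "multipath (nw_dst, 0, modulo_n, n_links, 0, reg0)" action. The names hash_ipv4 and pick_link are hypothetical and exist only for illustration, and the FNV-1a hash is a stand-in, not the hash OVS actually computes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in hash (32-bit FNV-1a over the four address bytes).
 * This is NOT the hash OVS uses; it only needs to be deterministic. */
static uint32_t
hash_ipv4(uint32_t ip, uint32_t basis)
{
    uint32_t h = 2166136261u ^ basis;

    for (int i = 0; i < 4; i++) {
        h ^= (ip >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h;
}

/* Hypothetical helper: returns an index in [0, n_links) that selects one
 * of the candidate output ports for 'dst_ip'.  This mirrors the role of
 * the value that the patch stores in reg0. */
static uint32_t
pick_link(uint32_t dst_ip, uint32_t n_links)
{
    assert(n_links > 0);
    return hash_ipv4(dst_ip, 0) % n_links;
}
```

With two gateway ports, pick_link(dst, 2) yields 0 or 1, and every packet to the same destination address takes the same port. That index plays the part of the "reg0 == i" match in the lr_in_multipath table.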






Patch

diff --git a/ovn/northd/ovn-northd.8.xml b/ovn/northd/ovn-northd.8.xml
index 0d85ec0..b1ce9a9 100644
--- a/ovn/northd/ovn-northd.8.xml
+++ b/ovn/northd/ovn-northd.8.xml
@@ -1598,6 +1598,9 @@  icmp4 {
       port (ingress table <code>ARP Request</code> will generate an ARP
       request, if needed, with <code>reg0</code> as the target protocol
       address and <code>reg1</code> as the source protocol address).
+      An IP route can be configured with multiple paths to the next hop.
+      If a packet has multiple paths to its destination, OVN stores the port
+      index in reg0 to indicate the packet's output port in table 6.
     </p>
 
     <p>
@@ -1617,6 +1620,28 @@  icmp4 {
 
       <li>
         <p>
+          IPv4/IPv6 multipath routing table. For each route to IPv4/IPv6
+          network <var>N</var> with netmask <var>M</var>, on multipath port
+          <var>P</var> with IP address <var>A</var> and Ethernet
+          address <var>E</var>, a logical flow with match
+          <code>ip4.dst == <var>N</var>/<var>M</var></code>, whose priority
+          is the number of 1-bits plus 10 in <var>M</var>,
+          has the following actions:
+        </p>
+
+        <pre>
+ip.ttl--;
+multipath (nw_dst, 0, modulo_n, <var>n_links</var>, 0, reg0);
+reg9[2] = 1;
+next;
+        </pre>
+        <p>
+          <var>n_links</var> is the number of multipath ports.
+        </p>
+      </li>
+
+      <li>
+        <p>
           IPv4 routing table.  For each route to IPv4 network <var>N</var> with
           netmask <var>M</var>, on router port <var>P</var> with IP address
           <var>A</var> and Ethernet
@@ -1686,7 +1711,43 @@  next;
       </li>
     </ul>
 
-    <h3>Ingress Table 6: ARP/ND Resolution</h3>
+    <h3>Ingress Table 6: Multipath</h3>
+    <p>
+      Any packet that reaches this table with reg9[2] = 1 is an IP packet;
+      the following flows route it to the corresponding port. This table
+      implements dispatching by consuming reg0.
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          For a packet with netmask <var>M</var>, IP address <var>A</var> and
+          <code>reg9[2] = 1</code>, a logical flow whose priority is above 1
+          has the following actions:
+        </p>
+
+        <pre>
+reg0 = <var>G</var>;
+reg1 = <var>A</var>;
+eth.src = <var>E</var>;
+outport = <var>P</var>;
+flags.loopback = 1;
+next;
+        </pre>
+
+        <p>
+          <var>G</var> is the gateway IP address. <var>A</var>, <var>E</var>
+          and <var>P</var> are the values that were described in multipath
+          routing in table 5.
+        </p>
+
+        <p>
+          A priority-0 logical flow with match <code>1</code> has actions <code>next;</code>.
+        </p>
+      </li>
+    </ul>
+
+    <h3>Ingress Table 7: ARP/ND Resolution</h3>
 
     <p>
       Any packet that reaches this table is an IP packet whose next-hop
@@ -1779,7 +1840,7 @@  next;
       </li>
     </ul>
 
-    <h3>Ingress Table 7: Gateway Redirect</h3>
+    <h3>Ingress Table 8: Gateway Redirect</h3>
 
     <p>
       For distributed logical routers where one of the logical router
@@ -1836,7 +1897,7 @@  next;
       </li>
     </ul>
 
-    <h3>Ingress Table 8: ARP Request</h3>
+    <h3>Ingress Table 9: ARP Request</h3>
 
     <p>
       In the common case where the Ethernet destination has been resolved, this
diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
index 49e4ac3..f8bfee2 100644
--- a/ovn/northd/ovn-northd.c
+++ b/ovn/northd/ovn-northd.c
@@ -135,9 +135,10 @@  enum ovn_stage {
     PIPELINE_STAGE(ROUTER, IN,  UNSNAT,      3, "lr_in_unsnat")       \
     PIPELINE_STAGE(ROUTER, IN,  DNAT,        4, "lr_in_dnat")         \
     PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,  5, "lr_in_ip_routing")   \
-    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 6, "lr_in_arp_resolve")  \
-    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT, 7, "lr_in_gw_redirect")  \
-    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 8, "lr_in_arp_request")  \
+    PIPELINE_STAGE(ROUTER, IN,  MULTIPATH,   6, "lr_in_multipath")    \
+    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 7, "lr_in_arp_resolve")  \
+    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT, 8, "lr_in_gw_redirect")  \
+    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 9, "lr_in_arp_request")  \
                                                                       \
     /* Logical router egress stages. */                               \
     PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
@@ -173,6 +174,11 @@  enum ovn_stage {
  * one of the logical router's own IP addresses. */
 #define REGBIT_EGRESS_LOOPBACK  "reg9[1]"
 
+/* Indicates that the multipath action has processed this packet and stored
+ * the hash result in another regX.  The hash result should be consumed to
+ * determine the right output port. */
+#define REGBIT_MULTIPATH "reg9[2]"
+
 /* Returns an "enum ovn_stage" built from the arguments. */
 static enum ovn_stage
 ovn_stage_build(enum ovn_datapath_type dp_type, enum ovn_pipeline pipeline,
@@ -4142,82 +4148,178 @@  add_route(struct hmap *lflows, const struct ovn_port *op,
 }
 
 static void
-build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
-                        struct hmap *ports,
-                        const struct nbrec_logical_router_static_route *route)
+add_multipath_route(struct hmap *lflows, uint32_t port_num,
+                    struct ovn_port **out_ports,
+                    const char **lrp_addr_s,
+                    struct ovn_datapath *od,
+                    const char *network_s, int plen,
+                    const char *gateway, const char *policy)
+{
+    bool is_ipv4 = strchr(network_s, '.') ? true : false;
+    struct ds match = DS_EMPTY_INITIALIZER;
+    const char *dir;
+    uint16_t priority;
+
+    if (policy && !strcmp(policy, "src-ip")) {
+        dir = "src";
+        priority = plen * 2;
+    } else {
+        dir = "dst";
+        priority = (plen * 2) + 1;
+    }
+
+    ds_put_format(&match, "ip%s.%s == %s/%d", is_ipv4 ? "4" : "6", dir,
+                  network_s, plen);
+
+    struct ds actions = DS_EMPTY_INITIALIZER;
+
+    ds_put_format(&actions, "ip.ttl--; ");
+    ds_put_format(&actions,
+                  "multipath (nw_dst, 0, modulo_n, %u, 0, reg0); "
+                  "%s = 1; "
+                  "next;",
+                  port_num, REGBIT_MULTIPATH);
+
+    /* The priority here is calculated to implement longest-prefix-match
+     * routing. */
+    ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
+                  ds_cstr(&match), ds_cstr(&actions));
+
+    for (int i = 0; i < port_num; i++) {
+        struct ds mp_match = DS_EMPTY_INITIALIZER;
+        struct ds mp_actions = DS_EMPTY_INITIALIZER;
+
+        ds_put_format(&mp_match, "%s == 1 && reg0 == %d && ",
+                      REGBIT_MULTIPATH, i);
+        ds_put_format(&mp_match, "ip%s.%s == %s/%d",
+                      is_ipv4 ? "4" : "6", dir,
+                      network_s, plen);
+
+        ds_put_format(&mp_actions, "%sreg0 = ", is_ipv4 ? "" : "xx");
+        if (gateway) {
+            ds_put_cstr(&mp_actions, gateway);
+        } else {
+            ds_put_format(&mp_actions, "ip%s.dst", is_ipv4 ? "4" : "6");
+        }
+
+        ds_put_format(&mp_actions, "; "
+                      "%sreg1 = %s; "
+                      "eth.src = %s; "
+                      "outport = %s; "
+                      "flags.loopback = 1; "
+                      "next;",
+                      is_ipv4 ? "" : "xx",
+                      lrp_addr_s[i],
+                      out_ports[i]->lrp_networks.ea_s,
+                      out_ports[i]->json_key);
+
+        /* Add flow in table 6 to determine the right output port
+         * for this traffic. */
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_MULTIPATH, priority,
+                      ds_cstr(&mp_match), ds_cstr(&mp_actions));
+        ds_destroy(&mp_match);
+        ds_destroy(&mp_actions);
+    }
+    ds_destroy(&match);
+    ds_destroy(&actions);
+}
+
+static bool
+verify_nexthop_prefix(const struct nbrec_logical_router_static_route *route,
+                      bool *is_ipv4, char **prefix_s, unsigned int *plen)
 {
     ovs_be32 nexthop;
-    const char *lrp_addr_s = NULL;
-    unsigned int plen;
-    bool is_ipv4;
 
     /* Verify that the next hop is an IP address with an all-ones mask. */
-    char *error = ip_parse_cidr(route->nexthop, &nexthop, &plen);
+    char *error = ip_parse_cidr(route->nexthop, &nexthop, plen);
     if (!error) {
-        if (plen != 32) {
+        if (*plen != 32) {
             static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
             VLOG_WARN_RL(&rl, "bad next hop mask %s", route->nexthop);
-            return;
+            return false;
         }
-        is_ipv4 = true;
+        *is_ipv4 = true;
     } else {
         free(error);
 
         struct in6_addr ip6;
-        error = ipv6_parse_cidr(route->nexthop, &ip6, &plen);
+        error = ipv6_parse_cidr(route->nexthop, &ip6, plen);
         if (!error) {
-            if (plen != 128) {
+            if (*plen != 128) {
                 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
                 VLOG_WARN_RL(&rl, "bad next hop mask %s", route->nexthop);
-                return;
+                return false;
             }
-            is_ipv4 = false;
+            *is_ipv4 = false;
         } else {
             static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
             VLOG_WARN_RL(&rl, "bad next hop ip address %s", route->nexthop);
             free(error);
-            return;
+            return false;
         }
     }
 
-    char *prefix_s;
-    if (is_ipv4) {
+    if (*is_ipv4) {
         ovs_be32 prefix;
         /* Verify that ip prefix is a valid IPv4 address. */
-        error = ip_parse_cidr(route->ip_prefix, &prefix, &plen);
+        error = ip_parse_cidr(route->ip_prefix, &prefix, plen);
         if (error) {
             static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
             VLOG_WARN_RL(&rl, "bad 'ip_prefix' in static routes %s",
                          route->ip_prefix);
             free(error);
-            return;
+            return false;
         }
-        prefix_s = xasprintf(IP_FMT, IP_ARGS(prefix & be32_prefix_mask(plen)));
+        *prefix_s = xasprintf(IP_FMT, IP_ARGS(prefix
+                                              & be32_prefix_mask(*plen)));
     } else {
         /* Verify that ip prefix is a valid IPv6 address. */
         struct in6_addr prefix;
-        error = ipv6_parse_cidr(route->ip_prefix, &prefix, &plen);
+        error = ipv6_parse_cidr(route->ip_prefix, &prefix, plen);
         if (error) {
             static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
             VLOG_WARN_RL(&rl, "bad 'ip_prefix' in static routes %s",
                          route->ip_prefix);
             free(error);
-            return;
+            return false;
         }
-        struct in6_addr mask = ipv6_create_mask(plen);
+        struct in6_addr mask = ipv6_create_mask(*plen);
         struct in6_addr network = ipv6_addr_bitand(&prefix, &mask);
-        prefix_s = xmalloc(INET6_ADDRSTRLEN);
-        inet_ntop(AF_INET6, &network, prefix_s, INET6_ADDRSTRLEN);
+        *prefix_s = xmalloc(INET6_ADDRSTRLEN);
+        inet_ntop(AF_INET6, &network, *prefix_s, INET6_ADDRSTRLEN);
+    }
+
+    return true;
+}
+
+static void
+build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
+                        struct hmap *ports,
+                        const struct nbrec_logical_router_static_route *route)
+{
+    const char *lrp_addr_s = NULL;
+    unsigned int plen;
+    bool is_ipv4;
+    char *prefix_s = NULL;
+
+    if (!verify_nexthop_prefix(route, &is_ipv4, &prefix_s, &plen)) {
+        return;
+    }
+
+    /* This function handles a single output_port only; routes with multiple
+     * output ports are handled by build_multipath_flow(). */
+    if (route->n_output_port > 1) {
+        goto free_prefix_s;
     }
 
     /* Find the outgoing port. */
     struct ovn_port *out_port = NULL;
-    if (route->output_port) {
-        out_port = ovn_port_find(ports, route->output_port);
+    if (route->n_output_port) {
+        out_port = ovn_port_find(ports, route->output_port[0]);
         if (!out_port) {
             static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
             VLOG_WARN_RL(&rl, "Bad out port %s for static route %s",
-                         route->output_port, route->ip_prefix);
+                         route->output_port[0], route->ip_prefix);
             goto free_prefix_s;
         }
         lrp_addr_s = find_lrp_member_ip(out_port, route->nexthop);
@@ -4270,7 +4372,77 @@  build_static_route_flow(struct hmap *lflows, struct ovn_datapath *od,
               policy);
 
 free_prefix_s:
-    free(prefix_s);
+    if (prefix_s) {
+        free(prefix_s);
+    }
+}
+
+static void
+build_multipath_flow(struct hmap *lflows, struct ovn_datapath *od,
+                     struct hmap *ports,
+                     const struct nbrec_logical_router_static_route *route)
+{
+    unsigned int plen;
+    bool is_ipv4;
+    char *prefix_s = NULL;
+
+    if (!verify_nexthop_prefix(route, &is_ipv4, &prefix_s, &plen)) {
+        return;
+    }
+
+    /* Find the outgoing ports. */
+    struct ovn_port **out_ports = xmalloc(route->n_output_port *
+                                             sizeof(struct ovn_port *));
+    const char **lrp_addr_s = xmalloc(route->n_output_port *
+                                         sizeof(const char *));
+    uint32_t idx = 0;
+    for (int i = 0; i < route->n_output_port; i++) {
+        out_ports[idx] = ovn_port_find(ports, route->output_port[i]);
+        if (!out_ports[idx]) {
+            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
+            VLOG_WARN_RL(&rl, "Bad out port %s for static route %s",
+                         route->output_port[i], route->ip_prefix);
+            continue;
+        }
+
+        lrp_addr_s[idx] = find_lrp_member_ip(out_ports[idx], route->nexthop);
+        if (!lrp_addr_s[idx]) {
+            if (is_ipv4) {
+                if (out_ports[idx]->lrp_networks.n_ipv4_addrs) {
+                    lrp_addr_s[idx] = out_ports[idx]->
+                                        lrp_networks.ipv4_addrs[0].addr_s;
+                }
+            } else {
+                if (out_ports[idx]->lrp_networks.n_ipv6_addrs) {
+                    lrp_addr_s[idx] = out_ports[idx]->
+                                        lrp_networks.ipv6_addrs[0].addr_s;
+                }
+            }
+        }
+        if (!lrp_addr_s[idx]) {
+            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
+            VLOG_WARN_RL(&rl,
+                         "%s has no path for static route %s; next hop %s",
+                         route->output_port[i], route->ip_prefix,
+                         route->nexthop);
+            continue;
+        }
+
+        idx++;
+    }
+
+    char *policy = route->policy ? route->policy : "dst-ip";
+    if (idx > 0) {
+        add_multipath_route(lflows, idx,
+                            out_ports, lrp_addr_s, od,
+                            prefix_s, plen, route->nexthop, policy);
+    }
+
+    free(out_ports);
+    free(lrp_addr_s);
+    if (prefix_s) {
+        free(prefix_s);
+    }
 }
 
 static void
@@ -5344,7 +5516,7 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
         }
     }
 
-    /* Convert the static routes to flows. */
+    /* Convert the static routes and multipath routes to flows. */
     HMAP_FOR_EACH (od, key_node, datapaths) {
         if (!od->nbr) {
             continue;
@@ -5354,13 +5526,24 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
             const struct nbrec_logical_router_static_route *route;
 
             route = od->nbr->static_routes[i];
-            build_static_route_flow(lflows, od, ports, route);
+            if (route->n_output_port > 1) {
+                /* Logical router ingress table 5-6: Multipath Routing.
+                 *
+                 * If the route is configured with multiple output ports,
+                 * traffic has multiple paths to the destination. The output
+                 * port is chosen by hashing the packet's IP dst address. */
+                build_multipath_flow(lflows, od, ports, route);
+            } else {
+                build_static_route_flow(lflows, od, ports, route);
+            }
         }
+        /* Packets are allowed by default in table 6. */
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_MULTIPATH, 0, "1", "next;");
     }
 
     /* XXX destination unreachable */
 
-    /* Local router ingress table 6: ARP Resolution.
+    /* Local router ingress table 7: ARP Resolution.
      *
      * Any packet that reaches this table is an IP packet whose next-hop IP
      * address is in reg0. (ip4.dst is the final destination.) This table
@@ -5555,7 +5738,7 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
                       "get_nd(outport, xxreg0); next;");
     }
 
-    /* Logical router ingress table 7: Gateway redirect.
+    /* Logical router ingress table 8: Gateway redirect.
      *
      * For traffic with outport equal to the l3dgw_port
      * on a distributed router, this table redirects a subset
@@ -5595,7 +5778,7 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
         ovn_lflow_add(lflows, od, S_ROUTER_IN_GW_REDIRECT, 0, "1", "next;");
     }
 
-    /* Local router ingress table 8: ARP request.
+    /* Local router ingress table 9: ARP request.
      *
      * In the common case where the Ethernet destination has been resolved,
      * this table outputs the packet (priority 0).  Otherwise, it composes
diff --git a/ovn/ovn-nb.ovsschema b/ovn/ovn-nb.ovsschema
index a077bfb..7a43473 100644
--- a/ovn/ovn-nb.ovsschema
+++ b/ovn/ovn-nb.ovsschema
@@ -1,7 +1,7 @@ 
 {
     "name": "OVN_Northbound",
-    "version": "5.8.0",
-    "cksum": "2812300190 16766",
+    "version": "5.9.0",
+    "cksum": "1515729450 16817",
     "tables": {
         "NB_Global": {
             "columns": {
@@ -235,7 +235,8 @@ 
                                                              "dst-ip"]]},
                                     "min": 0, "max": 1}},
                 "nexthop": {"type": "string"},
-                "output_port": {"type": {"key": "string", "min": 0, "max": 1}}},
+                "output_port": {"type": {"key": "string", "min": 0,
+                                         "max": "unlimited"}}},
             "isRoot": false},
         "NAT": {
             "columns": {
diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
index 9869d7e..eaba0c8 100644
--- a/ovn/ovn-nb.xml
+++ b/ovn/ovn-nb.xml
@@ -1485,6 +1485,10 @@ 
         multiple IP addresses on the router port and none of them are in the
         same subnet of <ref column="nexthop"/>, OVN chooses the first IP
         address as the one via which the <ref column="nexthop"/> is reachable.
+        When it contains two or more ports, the packet has multiple
+        candidate output ports. OVN uses the packet header to determine which
+        port the packet is delivered to.
+        Currently, OVN uses the destination IP field to select the port.
       </p>
     </column>
   </table>
diff --git a/ovn/utilities/ovn-nbctl.c b/ovn/utilities/ovn-nbctl.c
index 8e5c1a4..417194f 100644
--- a/ovn/utilities/ovn-nbctl.c
+++ b/ovn/utilities/ovn-nbctl.c
@@ -397,7 +397,7 @@  Logical router port commands:\n\
                             ('enabled' or 'disabled')\n\
 \n\
 Route commands:\n\
-  [--policy=POLICY] lr-route-add ROUTER PREFIX NEXTHOP [PORT]\n\
+  [--policy=POLICY] lr-route-add ROUTER PREFIX NEXTHOP [PORT]...\n\
                             add a route to ROUTER\n\
   lr-route-del ROUTER [PREFIX]\n\
                             remove routes from ROUTER\n\
@@ -2184,13 +2184,15 @@  normalize_prefix_str(const char *orig_prefix)
         return normalize_ipv6_prefix(ipv6, plen);
     }
 }
-
+
 static void
 nbctl_lr_route_add(struct ctl_context *ctx)
 {
     const struct nbrec_logical_router *lr;
     lr = lr_by_name_or_uuid(ctx, ctx->argv[1], true);
     char *prefix, *next_hop;
+    int n_output_port = 0;
+    const char **output_port = NULL;
 
     const char *policy = shash_find_data(&ctx->options, "--policy");
     if (policy && strcmp(policy, "src-ip") && strcmp(policy, "dst-ip")) {
@@ -2224,6 +2226,11 @@  nbctl_lr_route_add(struct ctl_context *ctx)
         }
     }
 
+    if (ctx->argc > 4) {
+        n_output_port = ctx->argc - 4;
+        output_port = (const char **)&ctx->argv[4];
+    }
+
     bool may_exist = shash_find(&ctx->options, "--may-exist") != NULL;
     for (int i = 0; i < lr->n_static_routes; i++) {
         const struct nbrec_logical_router_static_route *route
@@ -2253,9 +2260,10 @@  nbctl_lr_route_add(struct ctl_context *ctx)
         nbrec_logical_router_static_route_verify_nexthop(route);
         nbrec_logical_router_static_route_set_ip_prefix(route, prefix);
         nbrec_logical_router_static_route_set_nexthop(route, next_hop);
-        if (ctx->argc == 5) {
+        if (n_output_port > 0) {
             nbrec_logical_router_static_route_set_output_port(route,
-                                                              ctx->argv[4]);
+                                                              output_port,
+                                                              n_output_port);
         }
         if (policy) {
              nbrec_logical_router_static_route_set_policy(route, policy);
@@ -2270,8 +2278,10 @@  nbctl_lr_route_add(struct ctl_context *ctx)
     route = nbrec_logical_router_static_route_insert(ctx->txn);
     nbrec_logical_router_static_route_set_ip_prefix(route, prefix);
     nbrec_logical_router_static_route_set_nexthop(route, next_hop);
-    if (ctx->argc == 5) {
-        nbrec_logical_router_static_route_set_output_port(route, ctx->argv[4]);
+    if (n_output_port > 0) {
+        nbrec_logical_router_static_route_set_output_port(route,
+                                                          output_port,
+                                                          n_output_port);
     }
     if (policy) {
         nbrec_logical_router_static_route_set_policy(route, policy);
@@ -3066,8 +3076,8 @@  print_route(const struct nbrec_logical_router_static_route *route, struct ds *s)
         ds_put_format(s, " %s", "dst-ip");
     }
 
-    if (route->output_port) {
-        ds_put_format(s, " %s", route->output_port);
+    for (int i = 0; i < route->n_output_port; i++) {
+        ds_put_format(s, " %s", route->output_port[i]);
     }
     ds_put_char(s, '\n');
 }
@@ -3682,7 +3692,7 @@  static const struct ctl_command_syntax nbctl_commands[] = {
       NULL, "", RO },
 
     /* logical router route commands. */
-    { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
+    { "lr-route-add", 3, INT_MAX, "ROUTER PREFIX NEXTHOP [PORT]...", NULL,
       nbctl_lr_route_add, NULL, "--may-exist,--policy=", RW },
     { "lr-route-del", 1, 2, "ROUTER [PREFIX]", NULL, nbctl_lr_route_del,
       NULL, "--if-exists", RW },