[ovs-dev,ovn,v3,5/5] Add ECMP symmetric replies.

Message ID 20200727211920.3697694-6-mmichels@redhat.com
State Superseded
Series Add ECMP symmetric replies

Commit Message

Mark Michelson July 27, 2020, 9:19 p.m. UTC
When traffic arrives over an ECMP route, there is no guarantee that the
reply traffic will egress over the same route. Sometimes, the nature of
the traffic (or the intervening equipment) means that it is important
for reply traffic to go out the same route it came in.

This commit introduces optional ECMP symmetric reply behavior. If
configured, then traffic to or from the ECMP route will be sent to
conntrack. New incoming traffic over the route will have the source MAC
address and incoming port saved in the ct_label. Reply traffic then uses
this saved information to send the packet back out the same way it came
in.
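
As a rough illustration of the bookkeeping described above, the sketch below packs a 48-bit MAC address and a 16-bit port key into a 128-bit label. It follows the bit ranges in this patch's lib/logical-fields.c hunk (ct_label[32..79] and ct_label[80..95]); the tests/ovn.at hunk in the same patch lists different ranges. The struct and function names are hypothetical and are not OVS or OVN APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit label model: bit 0 is the least significant bit
 * of 'lo', bit 64 is the least significant bit of 'hi'. */
struct ct_label128 {
    uint64_t lo;
    uint64_t hi;
};

/* Store a MAC in bits 32..79 and a port key in bits 80..95, leaving
 * every other bit (for example bit 0, ct_label.blocked) untouched. */
static struct ct_label128
label_set_ecmp_reply(struct ct_label128 l, uint64_t mac48, uint16_t port_key)
{
    assert((mac48 >> 48) == 0);   /* The MAC must fit in 48 bits. */

    /* Label bits 32..63 live in the upper half of 'lo'. */
    l.lo = (l.lo & UINT64_C(0x00000000ffffffff))
           | ((mac48 & UINT64_C(0xffffffff)) << 32);
    /* Label bits 64..79 are hi[0..15]; bits 80..95 are hi[16..31]. */
    l.hi = (l.hi & ~UINT64_C(0xffffffff))
           | (mac48 >> 32)
           | ((uint64_t) port_key << 16);
    return l;
}

/* Read the MAC back out of bits 32..79. */
static uint64_t
label_get_ecmp_reply_eth(struct ct_label128 l)
{
    return (l.lo >> 32) | ((l.hi & UINT64_C(0xffff)) << 32);
}

/* Read the port key back out of bits 80..95. */
static uint16_t
label_get_ecmp_reply_port(struct ct_label128 l)
{
    return (uint16_t) ((l.hi >> 16) & UINT64_C(0xffff));
}
```

In the patch itself this work is done by the ct_commit action in the logical flow; the helpers here only make the bit layout concrete.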

To facilitate this, a new table was added to the ingress logical router
pipeline. The ECMP_STATEFUL table is responsible for committing to
conntrack and setting the ct_label when it detects new incoming traffic
from the route.

Since ingress pipeline logic on the logical router depends on ct state
of a particular hypervisor, this feature is only usable on gateway
routers.

Signed-off-by: Mark Michelson <mmichels@redhat.com>
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1849683
---
 lib/logical-fields.c      |   4 +
 northd/ovn-northd.8.xml   |  49 ++++++++++---
 northd/ovn-northd.c       | 123 +++++++++++++++++++++++++++----
 ovn-architecture.7.xml    |   7 +-
 ovn-nb.ovsschema          |   5 +-
 ovn-nb.xml                |  16 ++++
 tests/ovn.at              | 151 ++++++++++++++++++++++++++++++++++----
 tests/system-ovn.at       | 144 ++++++++++++++++++++++++++++++++++++
 utilities/ovn-nbctl.8.xml |  31 ++++++--
 utilities/ovn-nbctl.c     |  18 ++++-
 10 files changed, 496 insertions(+), 52 deletions(-)

Comments

0-day Robot July 27, 2020, 10:01 p.m. UTC | #1
Bleep bloop.  Greetings Mark Michelson, I am a robot and I have tried out your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


git-am:
error: sha1 information is lacking or useless (tests/ovn.at).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch' to see the failed patch
Patch failed at 0001 Add ECMP symmetric replies.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


Please check this out.  If you feel there has been an error, please email aconole@redhat.com

Thanks,
0-day Robot
Numan Siddique July 28, 2020, 1:23 p.m. UTC | #2
On Tue, Jul 28, 2020 at 2:51 AM Mark Michelson <mmichels@redhat.com> wrote:

> When traffic arrives over an ECMP route, there is no guarantee that the
> reply traffic will egress over the same route. Sometimes, the nature of
> the traffic (or the intervening equipment) means that it is important
> for reply traffic to go out the same route it came in.
>
> This commit introduces optional ECMP symmetric reply behavior. If
> configured, then traffic to or from the ECMP route will be sent to
> conntrack. New incoming traffic over the route will have the source MAC
> address and incoming port saved in the ct_label. Reply traffic then uses
> this saved information to send the packet back out the same way it came
> in.
>
> To facilitate this, a new table was added to the ingress logical router
> pipeline. The ECMP_STATEFUL table is responsible for committing to
> conntrack and setting the ct_label when it detects new incoming traffic
> from the route.
>
> Since ingress pipeline logic on the logical router depends on ct state
> of a particular hypervisor, this feature is only usable on gateway
> routers.
>
> Signed-off-by: Mark Michelson <mmichels@redhat.com>
> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1849683


Hi Mark,

Thanks for the new version. The first 4 patches in the series LGTM.

I have a few comments on this patch:

1. This patch series needs a rebase, as it does not apply cleanly on top of
master.

2. I think we should not exclude logical routers with distributed gateway
ports from this feature.
   For a logical router with a gateway port, I think you can add the same
   ECMP symmetric flows but with one extra match -
   "inport == cr-<gw-port> && ...."
   We do the same in many parts of the code. OpenStack may use this feature,
   and OpenStack Neutron doesn't use gateway routers.
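
To make the match under discussion concrete, here is a minimal sketch of how the ingress match string is assembled. It mirrors the ds_put_format() call in add_ecmp_symmetric_reply_flows() from this patch, where the direction is reversed relative to the route's policy. The helper name and its plain-buffer interface are hypothetical; ovn-northd uses struct ds and a pre-quoted json_key instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper for illustration; not an ovn-northd function.
 *
 * Formats the match for traffic arriving over an ECMP route.  A route
 * with the "src-ip" policy matches on the destination address here, and
 * a destination-based route matches on the source address.
 *
 * Returns true if the whole string fit into 'buf'. */
static bool
format_ecmp_ingress_match(char *buf, size_t len, const char *inport,
                          bool is_ipv4, bool is_src_route, const char *cidr)
{
    assert(buf && inport && cidr);

    int n = snprintf(buf, len, "inport == \"%s\" && ip%s.%s == %s",
                     inport,
                     is_ipv4 ? "4" : "6",
                     is_src_route ? "dst" : "src",
                     cidr);
    return n >= 0 && (size_t) n < len;
}
```

Under the suggestion in point 2, a caller handling a distributed gateway port would pass the cr-<gw-port> name as the inport argument. That is the proposal being discussed here, not something the posted patch does.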

3. In my testing with the logical resources created from system-ovn.at, I
noticed that:
     - The traffic initiated from bob1 to alice1 works as expected. The
       newly added logical flows get hit and the ct_label is set as
       expected.

     - The problem is in the traffic initiated by alice1. For the first
       packet from alice1, the select action is executed to choose one
       ECMP route (which is expected), and this packet is not committed to
       conntrack. For the reply traffic from bob1, the packet gets
       committed because of this flow:
         table=7 (lr_in_ecmp_stateful), priority=100  , match=(inport == "R1_ext" && ip4.dst == 10.0.0.0/24 && (ct.new && !ct.est)), action=(ct_commit { ct_label.ecmp_reply_eth = eth.src; ct_label.ecmp_reply_port = 2;}; next;)

     - Basically, the reverse traffic is treated as new traffic. From here
       on, packets from alice1 are considered reply traffic:
         table=10(lr_in_ip_routing   ), priority=100  , match=(ct.rpl && ct_label.ecmp_reply_port == 2 && ip4.src == 10.0.0.0/24), action=(ip.ttl--; flags.loopback = 1; eth.src = 00:00:04:01:02:03; reg1 = 20.0.0.1; outport = "R1_ext"; next;)

     - I'm not really sure whether this is a problem. Maybe it's fine, but
       is it expected? I personally don't see any harm in it.

     - I would like to know your thoughts, and maybe Han has some comments.
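
The sequence in point 3 can be sketched as a toy state machine. This is only a model of the behavior described above, under the assumption that nothing else commits the connection; it is not how conntrack or ovn-northd is implemented, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Which side of the gateway router a packet arrives from. */
enum pkt_dir { FROM_INTERNAL, FROM_ECMP_ROUTE };

/* What the modeled pipeline does with the packet. */
enum outcome {
    ECMP_SELECT,          /* Normal ECMP member selection. */
    COMMIT_AND_LABEL,     /* ct_commit with ecmp_reply_eth/port set. */
    SYMMETRIC_REPLY,      /* ct.rpl: sent out the saved port. */
    ESTABLISHED_INBOUND   /* Later inbound packet on a committed entry. */
};

/* Toy stand-in for one conntrack entry. */
struct toy_conn {
    bool committed;
    uint16_t reply_port;
};

static enum outcome
toy_handle_packet(struct toy_conn *c, enum pkt_dir dir, uint16_t inport_key)
{
    assert(c != NULL);

    if (dir == FROM_ECMP_ROUTE) {
        if (!c->committed) {
            /* Only traffic arriving over the ECMP route is committed. */
            c->committed = true;
            c->reply_port = inport_key;
            return COMMIT_AND_LABEL;
        }
        return ESTABLISHED_INBOUND;
    }

    /* Internal traffic is never committed by these flows, so it is a
     * reply only if the other side committed first. */
    return c->committed ? SYMMETRIC_REPLY : ECMP_SELECT;
}
```

Feeding it alice1, then bob1, then alice1 again yields ECMP_SELECT, COMMIT_AND_LABEL, SYMMETRIC_REPLY, which is the ordering observed in the test above.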

 4. The test case "3: ovn -- conntrack fields" is failing with this
patch. I suppose it's a small thing that you forgot to update.

 5. Since you are adding a new column to Logical_Router_Static_Route, I
think the schema version needs to be updated to "5.25.0".

Thanks
Numan


> ---
>  lib/logical-fields.c      |   4 +
>  northd/ovn-northd.8.xml   |  49 ++++++++++---
>  northd/ovn-northd.c       | 123 +++++++++++++++++++++++++++----
>  ovn-architecture.7.xml    |   7 +-
>  ovn-nb.ovsschema          |   5 +-
>  ovn-nb.xml                |  16 ++++
>  tests/ovn.at              | 151 ++++++++++++++++++++++++++++++++++----
>  tests/system-ovn.at       | 144 ++++++++++++++++++++++++++++++++++++
>  utilities/ovn-nbctl.8.xml |  31 ++++++--
>  utilities/ovn-nbctl.c     |  18 ++++-
>  10 files changed, 496 insertions(+), 52 deletions(-)
>
> diff --git a/lib/logical-fields.c b/lib/logical-fields.c
> index fde53a47e..15342dded 100644
> --- a/lib/logical-fields.c
> +++ b/lib/logical-fields.c
> @@ -130,6 +130,10 @@ ovn_init_symtab(struct shash *symtab)
>                                   WR_CT_COMMIT);
>      expr_symtab_add_subfield_scoped(symtab, "ct_label.blocked", NULL,
>                                      "ct_label[0]", WR_CT_COMMIT);
> +    expr_symtab_add_subfield_scoped(symtab, "ct_label.ecmp_reply_eth",
> NULL,
> +                                    "ct_label[32..79]", WR_CT_COMMIT);
> +    expr_symtab_add_subfield_scoped(symtab, "ct_label.ecmp_reply_port",
> NULL,
> +                                    "ct_label[80..95]", WR_CT_COMMIT);
>
>      expr_symtab_add_field(symtab, "ct_state", MFF_CT_STATE, NULL, false);
>
> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> index eb2514f15..cf251e02a 100644
> --- a/northd/ovn-northd.8.xml
> +++ b/northd/ovn-northd.8.xml
> @@ -2120,15 +2120,31 @@ icmp6 {
>      <p>
>        This is to send packets to connection tracker for tracking and
>        defragmentation.  It contains a priority-0 flow that simply moves
> traffic
> -      to the next table.  If load balancing rules with virtual IP
> addresses
> -      (and ports) are configured in <code>OVN_Northbound</code> database
> for a
> -      Gateway router, a priority-100 flow is added for each configured
> virtual
> -      IP address <var>VIP</var>. For IPv4 <var>VIPs</var> the flow matches
> -      <code>ip &amp;&amp; ip4.dst == <var>VIP</var></code>.  For IPv6
> -      <var>VIPs</var>, the flow matches <code>ip &amp;&amp; ip6.dst ==
> -      <var>VIP</var></code>.  The flow uses the action
> <code>ct_next;</code>
> -      to send IP packets to the connection tracker for packet
> de-fragmentation
> -      and tracking before sending it to the next table.
> +      to the next table.
> +    </p>
> +
> +    <p>
> +      If load balancing rules with virtual IP addresses (and ports) are
> +      configured in <code>OVN_Northbound</code> database for a Gateway
> router,
> +      a priority-100 flow is added for each configured virtual IP address
> +      <var>VIP</var>. For IPv4 <var>VIPs</var> the flow matches <code>ip
> +      &amp;&amp; ip4.dst == <var>VIP</var></code>.  For IPv6
> <var>VIPs</var>,
> +      the flow matches <code>ip &amp;&amp; ip6.dst ==
> <var>VIP</var></code>.
> +      The flow uses the action <code>ct_next;</code> to send IP packets
> to the
> +      connection tracker for packet de-fragmentation and tracking before
> +      sending it to the next table.
> +    </p>
> +
> +    <p>
> +      If ECMP routes with symmetric reply are configured in the
> +      <code>OVN_Northbound</code> database for a gateway router, a
> priority-100
> +      flow is added for each router port on which symmetric replies are
> +      configured. The matching logic for these ports essentially reverses
> the
> +      configured logic of the ECMP route. So for instance, a route with a
> +      destination routing policy will instead match if the source IP
> address
> +      matches the static route's prefix. The flow uses the action
> +      <code>ct_next</code> to send IP packets to the connection tracker
> for
> +      packet de-fragmentation and tracking before sending it to the next
> table.
>      </p>
>
>      <h3>Ingress Table 5: UNSNAT</h3>
> @@ -2489,7 +2505,15 @@ output;
>        table.  This table, instead, is responsible for determine the ECMP
>        group id and select a member id within the group based on 5-tuple
>        hashing.  It stores group id in <code>reg8[0..15]</code> and member
> id in
> -      <code>reg8[16..31]</code>.
> +      <code>reg8[16..31]</code>. This step is skipped if the traffic going
> +      out the ECMP route is reply traffic, and the ECMP route was
> configured
> +      to use symmetric replies. Instead, the stored <code>ct_label</code>
> value
> +      is used to choose the destination. The least significant 48 bits of
> the
> +      <code>ct_label</code> tell the destination MAC address to which the
> +      packet should be sent. The next 16 bits tell the logical router
> port on
> +      which the packet should be sent. These values in the
> +      <code>ct_label</code> are set when the initial ingress traffic is
> +      received over the ECMP route.
>      </p>
>
>      <p>
> @@ -2639,6 +2663,11 @@ select(reg8[16..31], <var>MID1</var>,
> <var>MID2</var>, ...);
>        address and <code>reg1</code> as the source protocol address).
>      </p>
>
> +    <p>
> +      This processing is skipped for reply traffic being sent out of an
> ECMP
> +      route if the route was configured to use symmetric replies.
> +    </p>
> +
>      <p>
>        This table contains the following logical flows:
>      </p>
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index d10e5ee5d..85f04ccde 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -172,16 +172,17 @@ enum ovn_stage {
>      PIPELINE_STAGE(ROUTER, IN,  DEFRAG,          4, "lr_in_defrag")
>  \
>      PIPELINE_STAGE(ROUTER, IN,  UNSNAT,          5, "lr_in_unsnat")
>  \
>      PIPELINE_STAGE(ROUTER, IN,  DNAT,            6, "lr_in_dnat")
>  \
> -    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   7,
> "lr_in_nd_ra_options") \
> -    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  8,
> "lr_in_nd_ra_response") \
> -    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9, "lr_in_ip_routing")
>  \
> -    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10,
> "lr_in_ip_routing_ecmp") \
> -    PIPELINE_STAGE(ROUTER, IN,  POLICY,          11, "lr_in_policy")
>  \
> -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12,
> "lr_in_arp_resolve")  \
> -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13,
> "lr_in_chk_pkt_len")   \
> -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     14,"lr_in_larger_pkts")
>  \
> -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15,
> "lr_in_gw_redirect")  \
> -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16,
> "lr_in_arp_request")  \
> +    PIPELINE_STAGE(ROUTER, IN,  ECMP_STATEFUL,   7,
> "lr_in_ecmp_stateful") \
> +    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   8,
> "lr_in_nd_ra_options") \
> +    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  9,
> "lr_in_nd_ra_response") \
> +    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      10, "lr_in_ip_routing")
>  \
> +    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 11,
> "lr_in_ip_routing_ecmp") \
> +    PIPELINE_STAGE(ROUTER, IN,  POLICY,          12, "lr_in_policy")
>  \
> +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13,
> "lr_in_arp_resolve")  \
> +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14,
> "lr_in_chk_pkt_len")   \
> +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     15,"lr_in_larger_pkts")
>  \
> +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16,
> "lr_in_gw_redirect")  \
> +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17,
> "lr_in_arp_request")  \
>                                                                        \
>      /* Logical router egress stages. */                               \
>      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
> @@ -7312,6 +7313,7 @@ struct parsed_route {
>      bool is_src_route;
>      uint32_t hash;
>      const struct nbrec_logical_router_static_route *route;
> +    bool ecmp_symmetric_reply;
>  };
>
>  static uint32_t
> @@ -7373,6 +7375,8 @@ parsed_routes_add(struct ovs_list *routes,
>                                                   "src-ip"));
>      pr->hash = route_hash(pr);
>      pr->route = route;
> +    pr->ecmp_symmetric_reply = smap_get_bool(&route->options,
> +                                             "ecmp_symmetric_reply",
> false);
>      ovs_list_insert(routes, &pr->list_node);
>      return pr;
>  }
> @@ -7621,18 +7625,95 @@ find_static_route_outport(struct ovn_datapath *od,
> struct hmap *ports,
>      return true;
>  }
>
> +static void
> +add_ecmp_symmetric_reply_flows(struct hmap *lflows,
> +                               struct ovn_datapath *od,
> +                               const char *port_ip,
> +                               struct ovn_port *out_port,
> +                               const struct parsed_route *route,
> +                               struct ds *route_match)
> +{
> +    const struct nbrec_logical_router_static_route *st_route =
> route->route;
> +    struct ds match = DS_EMPTY_INITIALIZER;
> +    struct ds actions = DS_EMPTY_INITIALIZER;
> +    struct ds ecmp_reply = DS_EMPTY_INITIALIZER;
> +    char *cidr = normalize_v46_prefix(&route->prefix, route->plen);
> +
> +    /* If symmetric ECMP replies are enabled, then packets that arrive
> over
> +     * an ECMP route need to go through conntrack.
> +     */
> +    ds_put_format(&match, "inport == %s && ip%s.%s == %s",
> +                  out_port->json_key,
> +                  route->prefix.family == AF_INET ? "4" : "6",
> +                  route->is_src_route ? "dst" : "src",
> +                  cidr);
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
> +                            ds_cstr(&match), "ct_next;",
> +                            &st_route->header_);
> +
> +    /* And packets that go out over an ECMP route need conntrack */
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
> +                            ds_cstr(route_match), "ct_next;",
> +                            &st_route->header_);
> +
> +    /* Save src eth and inport in ct_label for packets that arrive over
> +     * an ECMP route.
> +     *
> +     * NOTE: we purposely are not clearing match before this
> +     * ds_put_cstr() call. The previous contents are needed.
> +     */
> +    ds_put_cstr(&match, " && (ct.new && !ct.est)");
> +
> +    ds_put_format(&actions, "ct_commit { ct_label.ecmp_reply_eth =
> eth.src;"
> +                  " ct_label.ecmp_reply_port = %" PRId64 ";}; next;",
> +                  out_port->sb->tunnel_key);
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ECMP_STATEFUL, 100,
> +                            ds_cstr(&match), ds_cstr(&actions),
> +                            &st_route->header_);
> +
> +    /* Bypass ECMP selection if we already have ct_label information
> +     * for where to route the packet.
> +     */
> +    ds_put_format(&ecmp_reply, "ct.rpl && ct_label.ecmp_reply_port == %"
> +                  PRId64, out_port->sb->tunnel_key);
> +    ds_clear(&match);
> +    ds_put_format(&match, "%s && %s", ds_cstr(&ecmp_reply),
> +                  ds_cstr(route_match));
> +    ds_clear(&actions);
> +    ds_put_format(&actions, "ip.ttl--; flags.loopback = 1; "
> +                  "eth.src = %s; %sreg1 = %s; outport = %s; next;",
> +                  out_port->lrp_networks.ea_s,
> +                  route->prefix.family == AF_INET ? "" : "xx",
> +                  port_ip, out_port->json_key);
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_IP_ROUTING, 100,
> +                           ds_cstr(&match), ds_cstr(&actions),
> +                           &st_route->header_);
> +
> +    /* Egress reply traffic for symmetric ECMP routes skips router
> policies. */
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_POLICY, 65535,
> +                            ds_cstr(&ecmp_reply), "next;",
> +                            &st_route->header_);
> +
> +    ds_clear(&actions);
> +    ds_put_cstr(&actions, "eth.dst = ct_label.ecmp_reply_eth; next;");
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ARP_RESOLVE,
> +                            200, ds_cstr(&ecmp_reply),
> +                            ds_cstr(&actions), &st_route->header_);
> +}
> +
>  static void
>  build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>                        struct hmap *ports, struct ecmp_groups_node *eg)
>
>  {
>      bool is_ipv4 = (eg->prefix.family == AF_INET);
> -    struct ds match = DS_EMPTY_INITIALIZER;
>      uint16_t priority;
> +    struct ecmp_route_list_node *er;
> +    struct ds route_match = DS_EMPTY_INITIALIZER;
>
>      char *prefix_s = build_route_prefix_s(&eg->prefix, eg->plen);
>      build_route_match(NULL, prefix_s, eg->plen, eg->is_src_route, is_ipv4,
> -                      &match, &priority);
> +                      &route_match, &priority);
>      free(prefix_s);
>
>      struct ds actions = DS_EMPTY_INITIALIZER;
> @@ -7640,7 +7721,6 @@ build_ecmp_route_flow(struct hmap *lflows, struct
> ovn_datapath *od,
>                    "; %s = select(", REG_ECMP_GROUP_ID, eg->id,
>                    REG_ECMP_MEMBER_ID);
>
> -    struct ecmp_route_list_node *er;
>      bool is_first = true;
>      LIST_FOR_EACH (er, list_node, &eg->route_list) {
>          if (is_first) {
> @@ -7654,11 +7734,12 @@ build_ecmp_route_flow(struct hmap *lflows, struct
> ovn_datapath *od,
>      ds_put_cstr(&actions, ");");
>
>      ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
> -                  ds_cstr(&match), ds_cstr(&actions));
> +                  ds_cstr(&route_match), ds_cstr(&actions));
>
>      /* Add per member flow */
> +    struct ds match = DS_EMPTY_INITIALIZER;
> +    struct sset visited_ports = SSET_INITIALIZER(&visited_ports);
>      LIST_FOR_EACH (er, list_node, &eg->route_list) {
> -
>          const struct parsed_route *route_ = er->route;
>          const struct nbrec_logical_router_static_route *route =
> route_->route;
>          /* Find the outgoing port. */
> @@ -7668,6 +7749,15 @@ build_ecmp_route_flow(struct hmap *lflows, struct
> ovn_datapath *od,
>                                         &out_port)) {
>              continue;
>          }
> +        /* Symmetric ECMP reply is only usable on gateway routers.
> +         * It is NOT usable on distributed routers with a gateway port.
> +         */
> +        if (smap_get(&od->nbr->options, "chassis") &&
> +            route_->ecmp_symmetric_reply && sset_add(&visited_ports,
> +                                                     out_port->key)) {
> +            add_ecmp_symmetric_reply_flows(lflows, od, lrp_addr_s,
> out_port,
> +                                           route_, &route_match);
> +        }
>          ds_clear(&match);
>          ds_put_format(&match, REG_ECMP_GROUP_ID" == %"PRIu16" && "
>                        REG_ECMP_MEMBER_ID" == %"PRIu16,
> @@ -7688,7 +7778,9 @@ build_ecmp_route_flow(struct hmap *lflows, struct
> ovn_datapath *od,
>                                  ds_cstr(&match), ds_cstr(&actions),
>                                  &route->header_);
>      }
> +    sset_destroy(&visited_ports);
>      ds_destroy(&match);
> +    ds_destroy(&route_match);
>      ds_destroy(&actions);
>  }
>
> @@ -8972,6 +9064,7 @@ build_lrouter_flows(struct hmap *datapaths, struct
> hmap *ports,
>          ovn_lflow_add(lflows, od, S_ROUTER_IN_DNAT, 0, "1", "next;");
>          ovn_lflow_add(lflows, od, S_ROUTER_OUT_UNDNAT, 0, "1", "next;");
>          ovn_lflow_add(lflows, od, S_ROUTER_OUT_EGR_LOOP, 0, "1", "next;");
> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_ECMP_STATEFUL, 0, "1",
> "next;");
>
>          /* Send the IPv6 NS packets to next table. When ovn-controller
>           * generates IPv6 NS (for the action - nd_ns{}), the injected
> diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
> index 246cebc19..b1a462933 100644
> --- a/ovn-architecture.7.xml
> +++ b/ovn-architecture.7.xml
> @@ -1210,11 +1210,12 @@
>      <dd>
>        Fields that denote the connection tracking zones for routers.  These
>        values only have local significance and are not meaningful between
> -      chassis.  OVN stores the zone information for DNATting in Open
> vSwitch
> +      chassis.  OVN stores the zone information for north to south traffic
> +      (for DNATting or ECMP symmetric replies) in Open vSwitch
>          <!-- Keep the following in sync with MFF_LOG_DNAT_ZONE and
>          MFF_LOG_SNAT_ZONE in ovn/lib/logical-fields.h. -->
> -      extension register number 11 and zone information for SNATing in
> -      Open vSwitch extension register number 12.
> +      extension register number 11 and zone information for south to north
> +      traffic (for SNATing) in Open vSwitch extension register number 12.
>      </dd>
>
>      <dt>logical flow flags</dt>
> diff --git a/ovn-nb.ovsschema b/ovn-nb.ovsschema
> index da9af7157..16f7794f2 100644
> --- a/ovn-nb.ovsschema
> +++ b/ovn-nb.ovsschema
> @@ -1,7 +1,7 @@
>  {
>      "name": "OVN_Northbound",
>      "version": "5.24.0",
> -    "cksum": "1092394564 25961",
> +    "cksum": "679745602 26116",
>      "tables": {
>          "NB_Global": {
>              "columns": {
> @@ -365,6 +365,9 @@
>                                      "min": 0, "max": 1}},
>                  "nexthop": {"type": "string"},
>                  "output_port": {"type": {"key": "string", "min": 0,
> "max": 1}},
> +                "options": {
> +                    "type": {"key": "string", "value": "string",
> +                             "min": 0, "max": "unlimited"}},
>                  "external_ids": {
>                      "type": {"key": "string", "value": "string",
>                               "min": 0, "max": "unlimited"}}},
> diff --git a/ovn-nb.xml b/ovn-nb.xml
> index db5908cd5..5e434d257 100644
> --- a/ovn-nb.xml
> +++ b/ovn-nb.xml
> @@ -2481,6 +2481,22 @@
>        </column>
>      </group>
>
> +    <group title="Common options">
> +      <column name="options">
> +        This column provides general key/value settings. The supported
> +        options are described individually below.
> +      </column>
> +
> +      <column name="options" key="ecmp_symmetric_reply">
> +        If true, then new traffic that arrives over this route will have
> +        its reply traffic bypass ECMP route selection and will be sent out
> +        this route instead. Note that this option overrides any rules set
> +        in the <ref table="Logical_Router_policy" /> table. This option
> +        only works on gateway routers (routers that have
> +        <ref column="options" key="chassis" table="Logical_Router" />
> set).
> +      </column>
> +    </group>
> +
>    </table>
>
>    <table name="Logical_Router_Policy" title="Logical router policies">
> diff --git a/tests/ovn.at b/tests/ovn.at
> index f8dde14c2..c1ab6b85f 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -195,6 +195,8 @@ ct.snat = ct_state[6]
>  ct.trk = ct_state[5]
>  ct_label = NXM_NX_CT_LABEL
>  ct_label.blocked = ct_label[0]
> +ct_label.ecmp_reply_eth = ct_label[0..47]
> +ct_label.ecmp_reply_port = ct_label[48..63]
>  ct_mark = NXM_NX_CT_MARK
>  ct_state = NXM_NX_CT_STATE
>  ]])
> @@ -16065,7 +16067,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  # Since the sw0-vir is not claimed by any chassis, eth.dst should be set
> to
>  # zero if the ip4.dst is the virtual ip in the router pipeline.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
>  ])
>
>  ip_to_hex() {
> @@ -16116,7 +16118,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  # There should be an arp resolve flow to resolve the virtual_ip with the
>  # sw0-p1's MAC.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
>  ])
>
>  # Forcibly clear virtual_parent. ovn-controller should release the binding
> @@ -16157,7 +16159,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  # There should be an arp resolve flow to resolve the virtual_ip with the
>  # sw0-p2's MAC.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
>  ])
>
>  # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
> @@ -16180,7 +16182,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  # There should be an arp resolve flow to resolve the virtual_ip with the
>  # sw0-p3's MAC.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
>  ])
>
>  # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
> @@ -16201,7 +16203,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  > lflows.txt
>
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
>  ])
>
>  # Delete hv1-vif1 port. hv1 should release sw0-vir
> @@ -16219,7 +16221,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  > lflows.txt
>
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
>  ])
>
>  # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
> @@ -16240,7 +16242,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  > lflows.txt
>
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
>  ])
>
>  # Delete sw0-p2 logical port
> @@ -20274,22 +20276,22 @@ ovn-nbctl set logical_router_policy $pol5
> options:pkt_mark=5
>  ovn-nbctl --wait=hv sync
>
>  OVS_WAIT_UNTIL([
> -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x64->NXM_NX_PKT_MARK" -c)
>  ])
>
>  OVS_WAIT_UNTIL([
> -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x3->NXM_NX_PKT_MARK" -c)
>  ])
>
>  OVS_WAIT_UNTIL([
> -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x4->NXM_NX_PKT_MARK" -c)
>  ])
>
>  OVS_WAIT_UNTIL([
> -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x5->NXM_NX_PKT_MARK" -c)
>  ])
>
> @@ -20380,12 +20382,12 @@ send_ipv4_pkt hv1 hv1-vif1 505400000003
> 00000000ff01 \
>      $(ip_to_hex 10 0 0 3) $(ip_to_hex 172 168 0 120)
>
>  OVS_WAIT_UNTIL([
> -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x2->NXM_NX_PKT_MARK" -c)
>  ])
>
>  AT_CHECK([
> -    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x64->NXM_NX_PKT_MARK" -c)
>  ])
>
> @@ -20741,3 +20743,126 @@ AT_CHECK([test "$hv2_offlows" =
> "$hv2_offlows_mon"])
>
>  OVN_CLEANUP([hv1], [hv2])
>  AT_CLEANUP
> +
> +AT_SETUP([ovn -- Symmetric ECMP reply flows])
> +ovn_start
> +
> +net_add n1
> +sim_add hv1
> +as hv1
> +ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.1
> +
> +sim_add hv2
> +as hv2
> +ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.2
> +
> +# Logical network
> +#
> +#   ls1 \
> +#        \
> +#         DR -- join -- GW -- ext
> +#        /
> +#   ls2 /
> +#
> +#  ls1 and ls2 are internal switches connected to distributed router
> +#  DR. DR is then connected via a join switch to gateway router GW.
> +#  GW is then connected to external switch ext. In real life, this
> +#  would likely have a localnet port, but for the purposes of this test
> +#  it is unnecessary.
> +
> +ovn-nbctl create Logical_Router name=DR
> +gw_uuid=$(ovn-nbctl create Logical_Router name=GW)
> +
> +ovn-nbctl ls-add ls1
> +ovn-nbctl ls-add ls2
> +ovn-nbctl ls-add join
> +ovn-nbctl ls-add ext
> +
> +# Connect ls1 to DR
> +ovn-nbctl lrp-add DR dr-ls1 00:00:01:01:02:03 10.0.0.1/24
> +ovn-nbctl lsp-add ls1 ls1-dr -- set Logical_Switch_Port ls1-dr \
> +    type=router options:router-port=dr-ls1 addresses='"00:00:01:01:02:03"'
> +
> +# Connect ls2 to DR
> +ovn-nbctl lrp-add DR dr-ls2 00:00:01:01:02:04 10.0.0.2/24
> +ovn-nbctl lsp-add ls2 ls2-dr -- set Logical_Switch_Port ls2-dr \
> +    type=router options:router-port=dr-ls2 addresses='"00:00:01:01:02:04"'
> +
> +# Connect join to DR
> +ovn-nbctl lrp-add DR dr-join 00:00:02:01:02:03 20.0.0.1/24
> +ovn-nbctl lsp-add join join-dr -- set Logical_Switch_Port join-dr \
> +    type=router options:router-port=dr-join addresses='"00:00:02:01:02:03"'
> +
> +# Connect join to GW
> +ovn-nbctl lrp-add GW gw-join 00:00:02:01:02:04 20.0.0.2/24
> +ovn-nbctl lsp-add join join-gw -- set Logical_Switch_Port join-gw \
> +    type=router options:router-port=gw-join addresses='"00:00:02:01:02:04"'
> +
> +# Connect ext to GW
> +ovn-nbctl lrp-add GW gw-ext 00:00:03:01:02:03 172.16.0.1/16
> +ovn-nbctl lsp-add ext ext-gw -- set Logical_Switch_Port ext-gw \
> +    type=router options:router-port=gw-ext addresses='"00:00:03:01:02:03"'
> +
> +ovn-nbctl lr-route-add GW 10.0.0.0/24 20.0.0.1
> +ovn-nbctl --policy="src-ip" lr-route-add DR 10.0.0.0/24 20.0.0.2
> +
> +# Now add some ECMP routes to the GW router.
> +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add GW 10.0.0.0/24 172.16.0.2
> +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add GW 10.0.0.0/24 172.16.0.3
> +
> +ovn-nbctl --wait=hv sync
> +
> +# Ensure ECMP symmetric reply flows are not present on any hypervisor.
> +AT_CHECK([
> +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=15 | \
> +    grep "priority=100" | \
> +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
> +])
> +AT_CHECK([
> +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=21 | \
> +    grep "priority=200" | \
> +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
> +])
> +
> +AT_CHECK([
> +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=15 | \
> +    grep "priority=100" | \
> +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
> +])
> +AT_CHECK([
> +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=21 | \
> +    grep "priority=200" | \
> +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
> +])
> +
> +# Now make GW a gateway router on hv1
> +ovn-nbctl set Logical_Router $gw_uuid options:chassis=hv1
> +ovn-nbctl --wait=hv sync
> +
> +# And ensure that ECMP symmetric reply flows are present only on hv1
> +AT_CHECK([
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=15 | \
> +    grep "priority=100" | \
> +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
> +])
> +AT_CHECK([
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=21 | \
> +    grep "priority=200" | \
> +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
> +])
> +
> +AT_CHECK([
> +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=15 | \
> +    grep "priority=100" | \
> +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
> +])
> +AT_CHECK([
> +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=21 | \
> +    grep "priority=200" | \
> +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
> +])
> +
> +OVN_CLEANUP([hv1], [hv2])
> +AT_CLEANUP
> diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> index eddc530f9..e239b7394 100644
> --- a/tests/system-ovn.at
> +++ b/tests/system-ovn.at
> @@ -4483,3 +4483,147 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
>  /connection dropped.*/d"])
>
>  AT_CLEANUP
> +
> +AT_SETUP([ovn -- ECMP symmetric reply])
> +AT_KEYWORDS([ecmp])
> +
> +CHECK_CONNTRACK()
> +ovn_start
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-int])
> +
> +# Set external-ids in br-int needed for ovn-controller
> +ovs-vsctl \
> +        -- set Open_vSwitch . external-ids:system-id=hv1 \
> +        -- set Open_vSwitch . external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
> +        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
> +        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
> +        -- set bridge br-int fail-mode=secure other-config:disable-in-band=true
> +
> +# Start ovn-controller
> +start_daemon ovn-controller
> +
> +# Logical network:
> +# Alice is connected to gateway router R1. R1 is connected to two "external"
> +# routers, R2 and R3 via an "ext" switch.
> +# Bob is connected to both R2 and R3. R1 contains two ECMP routes, one through R2
> +# and one through R3, to Bob.
> +#
> +#     alice -- R1 -- ext ---- R2
> +#                     |         \
> +#                     |           bob
> +#                     |         /
> +#                     + ----- R3
> +#
> +# For this test, Bob sends request traffic through R2 to Alice. We want to ensure that
> +# all response traffic from Alice is routed through R2 as well.
> +
> +ovn-nbctl create Logical_Router name=R1 options:chassis=hv1
> +ovn-nbctl create Logical_Router name=R2
> +ovn-nbctl create Logical_Router name=R3
> +
> +ovn-nbctl ls-add alice
> +ovn-nbctl ls-add bob
> +ovn-nbctl ls-add ext
> +
> +# connect alice to R1
> +ovn-nbctl lrp-add R1 alice 00:00:01:01:02:03 10.0.0.1/24
> +ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
> +    type=router options:router-port=alice addresses='"00:00:01:01:02:03"'
> +
> +# connect bob to R2
> +ovn-nbctl lrp-add R2 R2_bob 00:00:02:01:02:03 172.16.0.2/16
> +ovn-nbctl lsp-add bob rp2-bob -- set Logical_Switch_Port rp2-bob \
> +    type=router options:router-port=R2_bob addresses='"00:00:02:01:02:03"'
> +
> +# connect bob to R3
> +ovn-nbctl lrp-add R3 R3_bob 00:00:02:01:02:04 172.16.0.3/16
> +ovn-nbctl lsp-add bob rp3-bob -- set Logical_Switch_Port rp3-bob \
> +    type=router options:router-port=R3_bob addresses='"00:00:02:01:02:04"'
> +
> +# Connect R1 to ext
> +ovn-nbctl lrp-add R1 R1_ext 00:00:04:01:02:03 20.0.0.1/24
> +ovn-nbctl lsp-add ext r1-ext -- set Logical_Switch_Port r1-ext \
> +    type=router options:router-port=R1_ext addresses='"00:00:04:01:02:03"'
> +
> +# Connect R2 to ext
> +ovn-nbctl lrp-add R2 R2_ext 00:00:04:01:02:04 20.0.0.2/24
> +ovn-nbctl lsp-add ext r2-ext -- set Logical_Switch_Port r2-ext \
> +    type=router options:router-port=R2_ext addresses='"00:00:04:01:02:04"'
> +
> +# Connect R3 to ext
> +ovn-nbctl lrp-add R3 R3_ext 00:00:04:01:02:05 20.0.0.3/24
> +ovn-nbctl lsp-add ext r3-ext -- set Logical_Switch_Port r3-ext \
> +    type=router options:router-port=R3_ext addresses='"00:00:04:01:02:05"'
> +
> +# Install ECMP routes for alice.
> +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1 10.0.0.0/24 20.0.0.2
> +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1 10.0.0.0/24 20.0.0.3
> +
> +# Static Routes
> +ovn-nbctl lr-route-add R2 10.0.0.0/24 20.0.0.1
> +ovn-nbctl lr-route-add R3 10.0.0.0/24 20.0.0.1
> +
> +# Logical port 'alice1' in switch 'alice'.
> +ADD_NAMESPACES(alice1)
> +ADD_VETH(alice1, alice1, br-int, "10.0.0.2/24", "f0:00:00:01:02:04", \
> +         "10.0.0.1")
> +ovn-nbctl lsp-add alice alice1 \
> +-- lsp-set-addresses alice1 "f0:00:00:01:02:04 10.0.0.2"
> +
> +# Logical port 'bob1' in switch 'bob'.
> +ADD_NAMESPACES(bob1)
> +ADD_VETH(bob1, bob1, br-int, "172.16.0.1/16", "f0:00:00:01:02:06", \
> +         "172.16.0.2")
> +ovn-nbctl lsp-add bob bob1 \
> +-- lsp-set-addresses bob1 "f0:00:00:01:02:06 172.16.0.1"
> +
> +# Ensure ovn-controller is caught up
> +ovn-nbctl --wait=hv sync
> +
> +on_exit 'ovs-ofctl dump-flows br-int'
> +
> +# 'bob1' should be able to ping 'alice1' directly.
> +NS_CHECK_EXEC([bob1], [ping -q -c 20 -i 0.3 -w 15 10.0.0.2 | FORMAT_PING], \
> +[0], [dnl
> +20 packets transmitted, 20 received, 0% packet loss, time 0ms
> +])
> +
> +# Ensure conntrack entry is present. We should not try to predict
> +# the tunnel key for the output port, so we strip it from the labels
> +# and just ensure that the known ethernet address is present.
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.0.1) | \
> +sed -e 's/zone=[[0-9]]*/zone=<cleared>/' |
> +sed -e 's/labels=0x[[0-9a-f]]*00000401020400000000/labels=0x00000401020400000000/'], [0], [dnl
>
> +icmp,orig=(src=172.16.0.1,dst=10.0.0.2,id=<cleared>,type=8,code=0),reply=(src=10.0.0.2,dst=172.16.0.1,id=<cleared>,type=0,code=0),zone=<cleared>,labels=0x00000401020400000000
> +])
> +
> +# Ensure datapaths show conntrack states as expected
> +# Like with conntrack entries, we shouldn't try to predict
> +# port binding tunnel keys. So omit them from expected labels.
> +AT_CHECK([ovs-appctl dpctl/dump-flows | grep 'ct_state(+new-est-rpl+trk).*ct(.*label=0x.*00000401020400000000/0xffffffffffffffff00000000)' -c], [0], [dnl
> +1
> +])
> +AT_CHECK([ovs-appctl dpctl/dump-flows | grep 'ct_state(-new+est+rpl+trk).*ct_label(0x.*00000401020400000000/0xffffffffffffffff00000000)' -c], [0], [dnl
> +1
> +])
> +
> +ovs-ofctl dump-flows br-int
> +
> +OVS_APP_EXIT_AND_WAIT([ovn-controller])
> +
> +as ovn-sb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as ovn-nb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as northd
> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> +
> +as
> +OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
> +/connection dropped.*/d"])
> +
> +AT_CLEANUP
> diff --git a/utilities/ovn-nbctl.8.xml b/utilities/ovn-nbctl.8.xml
> index de86b70e6..18bf90e08 100644
> --- a/utilities/ovn-nbctl.8.xml
> +++ b/utilities/ovn-nbctl.8.xml
> @@ -658,7 +658,8 @@
>
>      <dl>
>        <dt>[<code>--may-exist</code>] [<code>--policy</code>=<var>POLICY</var>]
> -        [<code>--ecmp</code>] <code>lr-route-add</code> <var>router</var>
> +        [<code>--ecmp</code>] [<code>--ecmp-symmetric-reply</code>]
> +        <code>lr-route-add</code> <var>router</var>
>          <var>prefix</var> <var>nexthop</var> [<var>port</var>]</dt>
>        <dd>
>          <p>
> @@ -680,15 +681,31 @@
>            specified, the default is "dst-ip".
>          </p>
>
> +        <p>
> +          The <code>--ecmp</code> option allows for multiple routes with the
> +          same <var>prefix</var> <var>POLICY</var> but different
> +          <var>nexthop</var> and <var>port</var> to be added.
> +        </p>
> +
> +        <p>
> +          The <code>--ecmp-symmetric-reply</code> option makes it so that
> +          traffic that arrives over an ECMP route will have its reply traffic
> +          sent out over that same route. Setting
> +          <code>--ecmp-symmetric-reply</code> implies <code>--ecmp</code> so
> +          it is not necessary to set both.
> +        </p>
> +
>          <p>
>            It is an error if a route with <var>prefix</var> and
> -          <var>POLICY</var> already exists, unless <code>--may-exist</code> or
> -          <code>--ecmp</code> is specified.  If <code>--may-exist</code> is
> -          specified but not <code>--ecmp</code>, the existed route will be
> -          updated with the new nexthop and port.  If <code>--ecmp</code> is
> +          <var>POLICY</var> already exists, unless <code>--may-exist</code>,
> +          <code>--ecmp</code>, or <code>--ecmp-symmetric-reply</code> is
> +          specified.  If <code>--may-exist</code> is specified but not
> +          <code>--ecmp</code> or <code>--ecmp-symmetric-reply</code>, the
> +          existed route will be updated with the new nexthop and port.  If
> +          <code>--ecmp</code> or <code>--ecmp-symmetric-reply</code> is
>            specified, a new route will be added, regardless of the existed
> -          route, which is useful when adding ECMP routes, i.e. routes with same
> -          <var>POLICY</var> and <var>prefix</var> but different
> +          route., which is useful when adding ECMP routes, i.e. routes with
> +          same <var>POLICY</var> and <var>prefix</var> but different
>            <var>nexthop</var> and <var>port</var>.
>          </p>
>        </dd>
> diff --git a/utilities/ovn-nbctl.c b/utilities/ovn-nbctl.c
> index 0079ad5a6..e6d8dbe63 100644
> --- a/utilities/ovn-nbctl.c
> +++ b/utilities/ovn-nbctl.c
> @@ -687,7 +687,8 @@ Logical router port commands:\n\
>                              ('overlay' or 'bridged')\n\
>  \n\
>  Route commands:\n\
> -  [--policy=POLICY] [--ecmp] lr-route-add ROUTER PREFIX NEXTHOP [PORT]\n\
> +  [--policy=POLICY] [--ecmp] [--ecmp-symmetric-reply] lr-route-add ROUTER \n\
> +                            PREFIX NEXTHOP [PORT]\n\
>                              add a route to ROUTER\n\
>    [--policy=POLICY] lr-route-del ROUTER [PREFIX [NEXTHOP [PORT]]]\n\
>                              remove routes from ROUTER\n\
> @@ -3855,7 +3856,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>      }
>
>      bool may_exist = shash_find(&ctx->options, "--may-exist") != NULL;
> -    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL;
> +    bool ecmp_symmetric_reply = shash_find(&ctx->options,
> +                                           "--ecmp-symmetric-reply") != NULL;
> +    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL ||
> +                ecmp_symmetric_reply;
>      if (!ecmp) {
>          for (int i = 0; i < lr->n_static_routes; i++) {
>              const struct nbrec_logical_router_static_route *route
> @@ -3920,6 +3924,13 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>          nbrec_logical_router_static_route_set_policy(route, policy);
>      }
>
> +    if (ecmp_symmetric_reply) {
> +        const struct smap options = SMAP_CONST1(&options,
> +                                                "ecmp_symmetric_reply",
> +                                                "true");
> +        nbrec_logical_router_static_route_set_options(route, &options);
> +    }
> +
>      nbrec_logical_router_verify_static_routes(lr);
>      struct nbrec_logical_router_static_route **new_routes
>          = xmalloc(sizeof *new_routes * (lr->n_static_routes + 1));
> @@ -6361,7 +6372,8 @@ static const struct ctl_command_syntax nbctl_commands[] = {
>
>      /* logical router route commands. */
>      { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
> -      nbctl_lr_route_add, NULL, "--may-exist,--ecmp,--policy=", RW },
> +      nbctl_lr_route_add, NULL, "--may-exist,--ecmp,--ecmp-symmetric-reply,"
> +      "--policy=", RW },
>      { "lr-route-del", 1, 4, "ROUTER [PREFIX [NEXTHOP [PORT]]]", NULL,
>        nbctl_lr_route_del, NULL, "--if-exists,--policy=", RW },
>      { "lr-route-list", 1, 1, "ROUTER", NULL, nbctl_lr_route_list, NULL,
> --
> 2.25.4
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
>
Mark Michelson July 28, 2020, 3:08 p.m. UTC | #3
On 7/28/20 9:23 AM, Numan Siddique wrote:
> 
> 
> On Tue, Jul 28, 2020 at 2:51 AM Mark Michelson <mmichels@redhat.com> wrote:
> 
>     When traffic arrives over an ECMP route, there is no guarantee that the
>     reply traffic will egress over the same route. Sometimes, the nature of
>     the traffic (or the intervening equipment) means that it is important
>     for reply traffic to go out the same route it came in.
> 
>     This commit introduces optional ECMP symmetric reply behavior. If
>     configured, then traffic to or from the ECMP route will be sent to
>     conntrack. New incoming traffic over the route will have the source MAC
>     address and incoming port saved in the ct_label. Reply traffic then uses
>     this saved information to send the packet back out the same way it came
>     in.
> 
>     To facilitate this, a new table was added to the ingress logical router
>     pipeline. The ECMP_STATEFUL table is responsible for committing to
>     conntrack and setting the ct_label when it detects new incoming traffic
>     from the route.
> 
>     Since ingress pipeline logic on the logical router depends on ct state
>     of a particular hypervisor, this feature is only usable on gateway
>     routers.
> 
>     Signed-off-by: Mark Michelson <mmichels@redhat.com>
>     Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1849683
> 
> 
> Hi Mark,
> 
> Thanks for the new version. The first 4 patches in the series LGTM.
> 
> I have a few comments on this patch:
> 
> 1. This patch series needs a rebase, as it does not apply cleanly on top
>    of master.

OK thanks, I'll get this fixed.

> 
> 2. I think we should not exclude logical routers with distributed gateway
>    ports from this feature. For a logical router with a gateway port, I
>    think you can add the same ECMP symmetric flows but with one extra
>    match - "inport == cr-<gw-port> && ...."
>    We do the same in many parts of the code. OpenStack may use this
>    feature, and OpenStack Neutron doesn't use gateway routers.

I think I need a bit of education here about how this can work. Let me 
explain how I'm viewing this and you can explain if my thinking is wrong.

Let's say that you have a router ro-1 with ECMP routes. You've set this 
up to be a distributed router with a gateway port, and the gateway port 
is bound to chassis-1 (either via ha_chassis_group, gateway_chassis, or 
options:redirect_chassis). You have switch ls-1 connected to ro-1. VMs 
connected to ls-1 are distributed across multiple chassis.

Traffic originates from outside of the logical network and comes in the 
gateway port of ro-1 on chassis-1. The ingress pipeline runs, and 
conntrack saves the source ethernet address and port so we can send 
return traffic out the same port. Next, the egress pipeline of ro-1 runs 
on chassis-1. Then the ingress pipeline of ls-1 runs on chassis-1. 
During this, it is determined that the destination output port is on 
chassis-2. So the packet is tunneled to chassis-2. There, the egress 
pipeline of ls-1 runs and the packet is output to the VM. All is fine at 
this point.

Now, the VM sends reply traffic. The ingress pipeline of ls-1 is run, 
and the output port is the port linking ls-1 to ro-1. Now here's where 
things get a bit hazy for me. Since ro-1 is a distributed router, the 
port binding type for ls-1's port to ro-1 is a "patch" port. So the 
egress pipeline of ls-1 is run on chassis-2. Then the ingress pipeline 
of ro-1 will also run on chassis-2. This is a problem, because the 
conntrack entries for symmetric ECMP reply are on chassis-1. It's not 
until the ingress pipeline of ro-1 is completed that the packet is 
tunneled to chassis-1. Then on chassis-1 the egress pipeline of ro-1 
will run.

When you use a gateway router, the port binding type for ls-1's port to 
ro-1 is "l3gateway". This means that the packet would get tunneled to 
chassis-1 before running the egress pipeline for ls-1. Then, the ingress 
pipeline of ro-1 runs on chassis-1 so everything works.
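
To make the comparison concrete, here is a toy model of the two cases. It is only a sketch of the reasoning above, not OVN code: router_ingress_chassis() and reply_can_see_ct_entry() are made-up helper names, and the model assumes the conntrack entry lives only on the gateway chassis.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns the chassis on which the router ingress
 * pipeline runs for a packet sent by a VM on 'vm_chassis', given the
 * port binding type of the switch port that links the switch to the
 * router ("patch" or "l3gateway"). */
const char *
router_ingress_chassis(const char *binding_type, const char *vm_chassis,
                       const char *gw_chassis)
{
    assert(binding_type && vm_chassis && gw_chassis);
    /* "l3gateway": the packet is tunneled to the gateway chassis first.
     * "patch": the router ingress pipeline runs where the packet is. */
    return strcmp(binding_type, "l3gateway") == 0 ? gw_chassis : vm_chassis;
}

/* Hypothetical helper: true if the router ingress pipeline runs on the
 * chassis that holds the conntrack entry (assumed to be 'gw_chassis'). */
bool
reply_can_see_ct_entry(const char *binding_type, const char *vm_chassis,
                       const char *gw_chassis)
{
    return strcmp(router_ingress_chassis(binding_type, vm_chassis,
                                         gw_chassis),
                  gw_chassis) == 0;
}
```

With a VM on chassis-2 and the gateway on chassis-1, the "patch" case returns false and the "l3gateway" case returns true, which is the problem described above.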

Have I misunderstood how this works?

Assuming I haven't...

If the logic for using ECMP symmetric reply on the return traffic could 
be moved to the egress router pipeline, then I understand how it would 
work with a distributed router with gateway port. But I don't see how 
you can do that since the ECMP symmetric reply needs to choose the 
output port. By definition that has to be done in the ingress pipeline.

I guess one option would be to limit the use of ECMP symmetric reply 
traffic to only the gateway port on a distributed router. In this case, 
there would be no need to save the input port in conntrack since there's 
only one possibility. Instead, we would only need to save the nexthop 
MAC address. This way, in the egress pipeline we could override the 
initial ECMP route selection by changing eth.dst.
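
For reference, the patch currently stores both values in ct_label: ct_label[32..79] holds the Ethernet address and ct_label[80..95] holds the port. A rough sketch of that layout follows, modeling the 128-bit label as two 64-bit halves. The struct and function names are invented for illustration and do not exist in OVN or OVS; under the option above, only the Ethernet half would be needed.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 128-bit ct_label: 'lo' is bits 0..63, 'hi' is bits 64..127. */
struct ct_label128 {
    uint64_t lo;
    uint64_t hi;
};

/* Place a 48-bit Ethernet address at bits 32..79 and a 16-bit port key
 * at bits 80..95, matching the subfields added in lib/logical-fields.c. */
struct ct_label128
ecmp_label_pack(uint64_t eth_src, uint16_t port_key)
{
    struct ct_label128 label;

    assert((eth_src >> 48) == 0);                    /* must fit in 48 bits */
    label.lo = (eth_src & 0xffffffffULL) << 32;      /* label bits 32..63 */
    label.hi = (eth_src >> 32)                       /* label bits 64..79 */
               | ((uint64_t) port_key << 16);        /* label bits 80..95 */
    return label;
}

/* Recover the Ethernet address from bits 32..79. */
uint64_t
ecmp_label_eth(struct ct_label128 label)
{
    return (label.lo >> 32) | ((label.hi & 0xffffULL) << 32);
}

/* Recover the port key from bits 80..95. */
uint16_t
ecmp_label_port(struct ct_label128 label)
{
    return (uint16_t) ((label.hi >> 16) & 0xffffULL);
}
```

Packing 00:00:04:01:02:04 this way leaves the low 80 bits of the label as 0x00000401020400000000, which is the suffix the system test greps for in the conntrack dump.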

> 
> 3. In my testing with the logical resources created from system-ovn.at,
>    I noticed that:
>    - The traffic initiated from bob1 to alice1 works as expected. The
>      newly added logical flows get hit and the ct_label is set as
>      expected.
> 
>    - The problem is in the traffic initiated by alice1. For the first
>      packet from alice1, the select action is executed to choose one
>      ECMP route (which is expected) and this packet is not committed to
>      conntrack. For the reply traffic from bob1, the packet gets
>      committed because of this flow:
>      table=7 (lr_in_ecmp_stateful), priority=100  , match=(inport == "R1_ext" && ip4.dst == 10.0.0.0/24 && (ct.new && !ct.est)), action=(ct_commit { ct_label.ecmp_reply_eth = eth.src; ct_label.ecmp_reply_port = 2;}; next;)
>    - Basically the reverse traffic is treated as new traffic. From here
>      on, packets from alice1 are considered reply traffic:
>      table=10(lr_in_ip_routing   ), priority=100  , match=(ct.rpl && ct_label.ecmp_reply_port == 2 && ip4.src == 10.0.0.0/24), action=(ip.ttl--; flags.loopback = 1; eth.src = 00:00:04:01:02:03; reg1 = 20.0.0.1; outport = "R1_ext"; next;)
>    - I'm not really sure whether this is a problem. Maybe it's fine, but
>      is it as expected? I personally don't see any harm in this.
> 
>    - But I would like to know your comments, and maybe Han has some
>      comments.

Hm, this is a bit hard to fix.

If you don't turn on symmetric replies, then traffic that originates 
from Alice for a connection *should* choose the same outgoing route 
every time since the 4-tuple will be the same throughout the life of the 
connection.

If you turn on symmetric replies, then you still get the same behavior, 
but you're adding in extra conntrack use.

So how do you detect that the traffic coming from Bob to Alice is in 
reply to Alice's traffic and avoid sending it to conntrack? You have to 
use conntrack to detect the direction, right? So in order to avoid using 
conntrack, we have to use conntrack...

> 
>   4. The test case - "3: ovn -- conntrack fields" is failing with this 
> patch. It's a small error which you forgot to change I suppose.

I actually had fixed this locally but then I guess I accidentally 
overwrote the changes and pushed an unfixed version. Sorry about that.

> 
>   5. Since you are adding a new column in Logical_Router_Static_Route, I 
> think the schema version needs to be updated to - "5.25.0"

Will do.

> 
> Thanks
> Numan
> 
> 
>     ---
>       lib/logical-fields.c      |   4 +
>       northd/ovn-northd.8.xml   |  49 ++++++++++---
>       northd/ovn-northd.c       | 123 +++++++++++++++++++++++++++----
>       ovn-architecture.7.xml    |   7 +-
>       ovn-nb.ovsschema          |   5 +-
>       ovn-nb.xml                |  16 ++++
>       tests/ovn.at              | 151 ++++++++++++++++++++++++++++++++++----
>       tests/system-ovn.at       | 144 ++++++++++++++++++++++++++++++++++++
>       utilities/ovn-nbctl.8.xml |  31 ++++++--
>       utilities/ovn-nbctl.c     |  18 ++++-
>       10 files changed, 496 insertions(+), 52 deletions(-)
> 
>     diff --git a/lib/logical-fields.c b/lib/logical-fields.c
>     index fde53a47e..15342dded 100644
>     --- a/lib/logical-fields.c
>     +++ b/lib/logical-fields.c
>     @@ -130,6 +130,10 @@ ovn_init_symtab(struct shash *symtab)
>                                        WR_CT_COMMIT);
>           expr_symtab_add_subfield_scoped(symtab, "ct_label.blocked", NULL,
>                                           "ct_label[0]", WR_CT_COMMIT);
>     +    expr_symtab_add_subfield_scoped(symtab, "ct_label.ecmp_reply_eth", NULL,
>     +                                    "ct_label[32..79]", WR_CT_COMMIT);
>     +    expr_symtab_add_subfield_scoped(symtab, "ct_label.ecmp_reply_port", NULL,
>     +                                    "ct_label[80..95]", WR_CT_COMMIT);
> 
>           expr_symtab_add_field(symtab, "ct_state", MFF_CT_STATE, NULL, false);
> 
>     diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
>     index eb2514f15..cf251e02a 100644
>     --- a/northd/ovn-northd.8.xml
>     +++ b/northd/ovn-northd.8.xml
>     @@ -2120,15 +2120,31 @@ icmp6 {
>           <p>
>             This is to send packets to connection tracker for tracking and
>             defragmentation.  It contains a priority-0 flow that simply moves traffic
>     -      to the next table.  If load balancing rules with virtual IP addresses
>     -      (and ports) are configured in <code>OVN_Northbound</code> database for a
>     -      Gateway router, a priority-100 flow is added for each configured virtual
>     -      IP address <var>VIP</var>. For IPv4 <var>VIPs</var> the flow matches
>     -      <code>ip &amp;&amp; ip4.dst == <var>VIP</var></code>.  For IPv6
>     -      <var>VIPs</var>, the flow matches <code>ip &amp;&amp; ip6.dst ==
>     -      <var>VIP</var></code>.  The flow uses the action <code>ct_next;</code>
>     -      to send IP packets to the connection tracker for packet de-fragmentation
>     -      and tracking before sending it to the next table.
>     +      to the next table.
>     +    </p>
>     +
>     +    <p>
>     +      If load balancing rules with virtual IP addresses (and ports) are
>     +      configured in <code>OVN_Northbound</code> database for a Gateway router,
>     +      a priority-100 flow is added for each configured virtual IP address
>     +      <var>VIP</var>. For IPv4 <var>VIPs</var> the flow matches <code>ip
>     +      &amp;&amp; ip4.dst == <var>VIP</var></code>.  For IPv6 <var>VIPs</var>,
>     +      the flow matches <code>ip &amp;&amp; ip6.dst == <var>VIP</var></code>.
>     +      The flow uses the action <code>ct_next;</code> to send IP packets to the
>     +      connection tracker for packet de-fragmentation and tracking before
>     +      sending it to the next table.
>     +    </p>
>     +
>     +    <p>
>     +      If ECMP routes with symmetric reply are configured in the
>     +      <code>OVN_Northbound</code> database for a gateway router, a priority-100
>     +      flow is added for each router port on which symmetric replies are
>     +      configured. The matching logic for these ports essentially reverses the
>     +      configured logic of the ECMP route. So for instance, a route with a
>     +      destination routing policy will instead match if the source IP address
>     +      matches the static route's prefix. The flow uses the action
>     +      <code>ct_next</code> to send IP packets to the connection tracker for
>     +      packet de-fragmentation and tracking before sending it to the next table.
>           </p>
> 
>           <h3>Ingress Table 5: UNSNAT</h3>
>     @@ -2489,7 +2505,15 @@ output;
>             table.  This table, instead, is responsible for determine the ECMP
>             group id and select a member id within the group based on 5-tuple
>             hashing.  It stores group id in <code>reg8[0..15]</code> and member id in
>     -      <code>reg8[16..31]</code>.
>     +      <code>reg8[16..31]</code>. This step is skipped if the traffic going
>     +      out the ECMP route is reply traffic, and the ECMP route was configured
>     +      to use symmetric replies. Instead, the stored <code>ct_label</code> value
>     +      is used to choose the destination. The least significant 48 bits of the
>     +      <code>ct_label</code> tell the destination MAC address to which the
>     +      packet should be sent. The next 16 bits tell the logical router port on
>     +      which the packet should be sent. These values in the
>     +      <code>ct_label</code> are set when the initial ingress traffic is
>     +      received over the ECMP route.
>           </p>
> 
>           <p>
>     @@ -2639,6 +2663,11 @@ select(reg8[16..31], <var>MID1</var>, <var>MID2</var>, ...);
>             address and <code>reg1</code> as the source protocol address).
>           </p>
> 
>     +    <p>
>     +      This processing is skipped for reply traffic being sent out of an ECMP
>     +      route if the route was configured to use symmetric replies.
>     +    </p>
>     +
>           <p>
>             This table contains the following logical flows:
>           </p>
>     diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
>     index d10e5ee5d..85f04ccde 100644
>     --- a/northd/ovn-northd.c
>     +++ b/northd/ovn-northd.c
>     @@ -172,16 +172,17 @@ enum ovn_stage {
>           PIPELINE_STAGE(ROUTER, IN,  DEFRAG,          4, "lr_in_defrag")       \
>           PIPELINE_STAGE(ROUTER, IN,  UNSNAT,          5, "lr_in_unsnat")       \
>           PIPELINE_STAGE(ROUTER, IN,  DNAT,            6, "lr_in_dnat")         \
>     -    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   7, "lr_in_nd_ra_options") \
>     -    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  8, "lr_in_nd_ra_response") \
>     -    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9, "lr_in_ip_routing")   \
>     -    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10, "lr_in_ip_routing_ecmp") \
>     -    PIPELINE_STAGE(ROUTER, IN,  POLICY,          11, "lr_in_policy")       \
>     -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12, "lr_in_arp_resolve")  \
>     -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13, "lr_in_chk_pkt_len")   \
>     -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     14,"lr_in_larger_pkts")   \
>     -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15, "lr_in_gw_redirect")  \
>     -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16, "lr_in_arp_request")  \
>     +    PIPELINE_STAGE(ROUTER, IN,  ECMP_STATEFUL,   7, "lr_in_ecmp_stateful") \
>     +    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   8, "lr_in_nd_ra_options") \
>     +    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  9, "lr_in_nd_ra_response") \
>     +    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      10, "lr_in_ip_routing")   \
>     +    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 11, "lr_in_ip_routing_ecmp") \
>     +    PIPELINE_STAGE(ROUTER, IN,  POLICY,          12, "lr_in_policy")       \
>     +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13, "lr_in_arp_resolve")  \
>     +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14, "lr_in_chk_pkt_len")   \
>     +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     15,"lr_in_larger_pkts")   \
>     +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16, "lr_in_gw_redirect")  \
>     +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17, "lr_in_arp_request")  \
>                                                                                 \
>           /* Logical router egress stages. */                                   \
>           PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")            \
>     @@ -7312,6 +7313,7 @@ struct parsed_route {
>           bool is_src_route;
>           uint32_t hash;
>           const struct nbrec_logical_router_static_route *route;
>     +    bool ecmp_symmetric_reply;
>       };
> 
>       static uint32_t
>     @@ -7373,6 +7375,8 @@ parsed_routes_add(struct ovs_list *routes,
>                                                        "src-ip"));
>           pr->hash = route_hash(pr);
>           pr->route = route;
>     +    pr->ecmp_symmetric_reply = smap_get_bool(&route->options,
>     +                                             "ecmp_symmetric_reply", false);
>           ovs_list_insert(routes, &pr->list_node);
>           return pr;
>       }
>     @@ -7621,18 +7625,95 @@ find_static_route_outport(struct
>     ovn_datapath *od, struct hmap *ports,
>           return true;
>       }
> 
>     +static void
>     +add_ecmp_symmetric_reply_flows(struct hmap *lflows,
>     +                               struct ovn_datapath *od,
>     +                               const char *port_ip,
>     +                               struct ovn_port *out_port,
>     +                               const struct parsed_route *route,
>     +                               struct ds *route_match)
>     +{
>     +    const struct nbrec_logical_router_static_route *st_route = route->route;
>     +    struct ds match = DS_EMPTY_INITIALIZER;
>     +    struct ds actions = DS_EMPTY_INITIALIZER;
>     +    struct ds ecmp_reply = DS_EMPTY_INITIALIZER;
>     +    char *cidr = normalize_v46_prefix(&route->prefix, route->plen);
>     +
>     +    /* If symmetric ECMP replies are enabled, then packets that arrive over
>     +     * an ECMP route need to go through conntrack.
>     +     */
>     +    ds_put_format(&match, "inport == %s && ip%s.%s == %s",
>     +                  out_port->json_key,
>     +                  route->prefix.family == AF_INET ? "4" : "6",
>     +                  route->is_src_route ? "dst" : "src",
>     +                  cidr);
>     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
>     +                            ds_cstr(&match), "ct_next;",
>     +                            &st_route->header_);
>     +
>     +    /* And packets that go out over an ECMP route need conntrack */
>     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
>     +                            ds_cstr(route_match), "ct_next;",
>     +                            &st_route->header_);
>     +
>     +    /* Save src eth and inport in ct_label for packets that arrive over
>     +     * an ECMP route.
>     +     *
>     +     * NOTE: we purposely are not clearing match before this
>     +     * ds_put_cstr() call. The previous contents are needed.
>     +     */
>     +    ds_put_cstr(&match, " && (ct.new && !ct.est)");
>     +
>     +    ds_put_format(&actions, "ct_commit { ct_label.ecmp_reply_eth = eth.src;"
>     +                  " ct_label.ecmp_reply_port = %" PRId64 ";}; next;",
>     +                  out_port->sb->tunnel_key);
>     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ECMP_STATEFUL, 100,
>     +                            ds_cstr(&match), ds_cstr(&actions),
>     +                            &st_route->header_);
>     +
>     +    /* Bypass ECMP selection if we already have ct_label information
>     +     * for where to route the packet.
>     +     */
>     +    ds_put_format(&ecmp_reply, "ct.rpl && ct_label.ecmp_reply_port == %"
>     +                  PRId64, out_port->sb->tunnel_key);
>     +    ds_clear(&match);
>     +    ds_put_format(&match, "%s && %s", ds_cstr(&ecmp_reply),
>     +                  ds_cstr(route_match));
>     +    ds_clear(&actions);
>     +    ds_put_format(&actions, "ip.ttl--; flags.loopback = 1; "
>     +                  "eth.src = %s; %sreg1 = %s; outport = %s; next;",
>     +                  out_port->lrp_networks.ea_s,
>     +                  route->prefix.family == AF_INET ? "" : "xx",
>     +                  port_ip, out_port->json_key);
>     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_IP_ROUTING, 100,
>     +                           ds_cstr(&match), ds_cstr(&actions),
>     +                           &st_route->header_);
>     +
>     +    /* Egress reply traffic for symmetric ECMP routes skips router policies. */
>     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_POLICY, 65535,
>     +                            ds_cstr(&ecmp_reply), "next;",
>     +                            &st_route->header_);
>     +
>     +    ds_clear(&actions);
>     +    ds_put_cstr(&actions, "eth.dst = ct_label.ecmp_reply_eth; next;");
>     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ARP_RESOLVE,
>     +                            200, ds_cstr(&ecmp_reply),
>     +                            ds_cstr(&actions), &st_route->header_);
>     +}
>     +
>       static void
>       build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>                             struct hmap *ports, struct ecmp_groups_node *eg)
> 
>       {
>           bool is_ipv4 = (eg->prefix.family == AF_INET);
>     -    struct ds match = DS_EMPTY_INITIALIZER;
>           uint16_t priority;
>     +    struct ecmp_route_list_node *er;
>     +    struct ds route_match = DS_EMPTY_INITIALIZER;
> 
>           char *prefix_s = build_route_prefix_s(&eg->prefix, eg->plen);
>           build_route_match(NULL, prefix_s, eg->plen, eg->is_src_route,
>     is_ipv4,
>     -                      &match, &priority);
>     +                      &route_match, &priority);
>           free(prefix_s);
> 
>           struct ds actions = DS_EMPTY_INITIALIZER;
>     @@ -7640,7 +7721,6 @@ build_ecmp_route_flow(struct hmap *lflows,
>     struct ovn_datapath *od,
>                         "; %s = select(", REG_ECMP_GROUP_ID, eg->id,
>                         REG_ECMP_MEMBER_ID);
> 
>     -    struct ecmp_route_list_node *er;
>           bool is_first = true;
>           LIST_FOR_EACH (er, list_node, &eg->route_list) {
>               if (is_first) {
>     @@ -7654,11 +7734,12 @@ build_ecmp_route_flow(struct hmap *lflows,
>     struct ovn_datapath *od,
>           ds_put_cstr(&actions, ");");
> 
>           ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
>     -                  ds_cstr(&match), ds_cstr(&actions));
>     +                  ds_cstr(&route_match), ds_cstr(&actions));
> 
>           /* Add per member flow */
>     +    struct ds match = DS_EMPTY_INITIALIZER;
>     +    struct sset visited_ports = SSET_INITIALIZER(&visited_ports);
>           LIST_FOR_EACH (er, list_node, &eg->route_list) {
>     -
>               const struct parsed_route *route_ = er->route;
>               const struct nbrec_logical_router_static_route *route =
>     route_->route;
>               /* Find the outgoing port. */
>     @@ -7668,6 +7749,15 @@ build_ecmp_route_flow(struct hmap *lflows,
>     struct ovn_datapath *od,
>                                              &out_port)) {
>                   continue;
>               }
>     +        /* Symmetric ECMP reply is only usable on gateway routers.
>     +         * It is NOT usable on distributed routers with a gateway port.
>     +         */
>     +        if (smap_get(&od->nbr->options, "chassis") &&
>     +            route_->ecmp_symmetric_reply && sset_add(&visited_ports,
>     +                                                     out_port->key)) {
>     +            add_ecmp_symmetric_reply_flows(lflows, od, lrp_addr_s, out_port,
>     +                                           route_, &route_match);
>     +        }
>               ds_clear(&match);
>               ds_put_format(&match, REG_ECMP_GROUP_ID" == %"PRIu16" && "
>                             REG_ECMP_MEMBER_ID" == %"PRIu16,
>     @@ -7688,7 +7778,9 @@ build_ecmp_route_flow(struct hmap *lflows,
>     struct ovn_datapath *od,
>                                       ds_cstr(&match), ds_cstr(&actions),
>                                       &route->header_);
>           }
>     +    sset_destroy(&visited_ports);
>           ds_destroy(&match);
>     +    ds_destroy(&route_match);
>           ds_destroy(&actions);
>       }
> 
>     @@ -8972,6 +9064,7 @@ build_lrouter_flows(struct hmap *datapaths,
>     struct hmap *ports,
>               ovn_lflow_add(lflows, od, S_ROUTER_IN_DNAT, 0, "1", "next;");
>               ovn_lflow_add(lflows, od, S_ROUTER_OUT_UNDNAT, 0, "1",
>     "next;");
>               ovn_lflow_add(lflows, od, S_ROUTER_OUT_EGR_LOOP, 0, "1",
>     "next;");
>     +        ovn_lflow_add(lflows, od, S_ROUTER_IN_ECMP_STATEFUL, 0, "1", "next;");
> 
>               /* Send the IPv6 NS packets to next table. When ovn-controller
>                * generates IPv6 NS (for the action - nd_ns{}), the injected
>     diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
>     index 246cebc19..b1a462933 100644
>     --- a/ovn-architecture.7.xml
>     +++ b/ovn-architecture.7.xml
>     @@ -1210,11 +1210,12 @@
>           <dd>
>             Fields that denote the connection tracking zones for
>     routers.  These
>             values only have local significance and are not meaningful
>     between
>     -      chassis.  OVN stores the zone information for DNATting in
>     Open vSwitch
>     +      chassis.  OVN stores the zone information for north to south
>     traffic
>     +      (for DNATting or ECMP symmetric replies) in Open vSwitch
>               <!-- Keep the following in sync with MFF_LOG_DNAT_ZONE and
>               MFF_LOG_SNAT_ZONE in ovn/lib/logical-fields.h. -->
>     -      extension register number 11 and zone information for SNATing in
>     -      Open vSwitch extension register number 12.
>     +      extension register number 11 and zone information for south
>     to north
>     +      traffic (for SNATing) in Open vSwitch extension register
>     number 12.
>           </dd>
> 
>           <dt>logical flow flags</dt>
>     diff --git a/ovn-nb.ovsschema b/ovn-nb.ovsschema
>     index da9af7157..16f7794f2 100644
>     --- a/ovn-nb.ovsschema
>     +++ b/ovn-nb.ovsschema
>     @@ -1,7 +1,7 @@
>       {
>           "name": "OVN_Northbound",
>           "version": "5.24.0",
>     -    "cksum": "1092394564 25961",
>     +    "cksum": "679745602 26116",
>           "tables": {
>               "NB_Global": {
>                   "columns": {
>     @@ -365,6 +365,9 @@
>                                           "min": 0, "max": 1}},
>                       "nexthop": {"type": "string"},
>                       "output_port": {"type": {"key": "string", "min":
>     0, "max": 1}},
>     +                "options": {
>     +                    "type": {"key": "string", "value": "string",
>     +                             "min": 0, "max": "unlimited"}},
>                       "external_ids": {
>                           "type": {"key": "string", "value": "string",
>                                    "min": 0, "max": "unlimited"}}},
>     diff --git a/ovn-nb.xml b/ovn-nb.xml
>     index db5908cd5..5e434d257 100644
>     --- a/ovn-nb.xml
>     +++ b/ovn-nb.xml
>     @@ -2481,6 +2481,22 @@
>             </column>
>           </group>
> 
>     +    <group title="Common options">
>     +      <column name="options">
>     +        This column provides general key/value settings. The supported
>     +        options are described individually below.
>     +      </column>
>     +
>     +      <column name="options" key="ecmp_symmetric_reply">
>     +        If true, then new traffic that arrives over this route will have
>     +        its reply traffic bypass ECMP route selection and will be sent out
>     +        this route instead. Note that this option overrides any rules set
>     +        in the <ref table="Logical_Router_Policy" /> table. This option
>     +        only works on gateway routers (routers that have
>     +        <ref column="options" key="chassis" table="Logical_Router" /> set).
>     +      </column>
>     +    </group>
>     +
>         </table>
> 
>         <table name="Logical_Router_Policy" title="Logical router policies">
>     diff --git a/tests/ovn.at b/tests/ovn.at
>     index f8dde14c2..c1ab6b85f 100644
>     --- a/tests/ovn.at
>     +++ b/tests/ovn.at
>     @@ -195,6 +195,8 @@ ct.snat = ct_state[6]
>       ct.trk = ct_state[5]
>       ct_label = NXM_NX_CT_LABEL
>       ct_label.blocked = ct_label[0]
>     +ct_label.ecmp_reply_eth = ct_label[0..47]
>     +ct_label.ecmp_reply_port = ct_label[48..63]
>       ct_mark = NXM_NX_CT_MARK
>       ct_state = NXM_NX_CT_STATE
>       ]])
>     @@ -16065,7 +16067,7 @@ ovn-sbctl dump-flows lr0 | grep
>     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>       # Since the sw0-vir is not claimed by any chassis, eth.dst should
>     be set to
>       # zero if the ip4.dst is the virtual ip in the router pipeline.
>       AT_CHECK([cat lflows.txt], [0], [dnl
>     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     00:00:00:00:00:00; next;)
>     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     00:00:00:00:00:00; next;)
>       ])
> 
>       ip_to_hex() {
>     @@ -16116,7 +16118,7 @@ ovn-sbctl dump-flows lr0 | grep
>     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>       # There should be an arp resolve flow to resolve the virtual_ip
>     with the
>       # sw0-p1's MAC.
>       AT_CHECK([cat lflows.txt], [0], [dnl
>     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     50:54:00:00:00:03; next;)
>     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     50:54:00:00:00:03; next;)
>       ])
> 
>       # Forcibly clear virtual_parent. ovn-controller should release the
>     binding
>     @@ -16157,7 +16159,7 @@ ovn-sbctl dump-flows lr0 | grep
>     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>       # There should be an arp resolve flow to resolve the virtual_ip
>     with the
>       # sw0-p2's MAC.
>       AT_CHECK([cat lflows.txt], [0], [dnl
>     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     50:54:00:00:00:05; next;)
>     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     50:54:00:00:00:05; next;)
>       ])
> 
>       # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
>     @@ -16180,7 +16182,7 @@ ovn-sbctl dump-flows lr0 | grep
>     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>       # There should be an arp resolve flow to resolve the virtual_ip
>     with the
>       # sw0-p3's MAC.
>       AT_CHECK([cat lflows.txt], [0], [dnl
>     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     50:54:00:00:00:04; next;)
>     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     50:54:00:00:00:04; next;)
>       ])
> 
>       # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
>     @@ -16201,7 +16203,7 @@ ovn-sbctl dump-flows lr0 | grep
>     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>       > lflows.txt
> 
>       AT_CHECK([cat lflows.txt], [0], [dnl
>     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     50:54:00:00:00:03; next;)
>     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     50:54:00:00:00:03; next;)
>       ])
> 
>       # Delete hv1-vif1 port. hv1 should release sw0-vir
>     @@ -16219,7 +16221,7 @@ ovn-sbctl dump-flows lr0 | grep
>     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>       > lflows.txt
> 
>       AT_CHECK([cat lflows.txt], [0], [dnl
>     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     00:00:00:00:00:00; next;)
>     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     00:00:00:00:00:00; next;)
>       ])
> 
>       # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
>     @@ -16240,7 +16242,7 @@ ovn-sbctl dump-flows lr0 | grep
>     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
>       > lflows.txt
> 
>       AT_CHECK([cat lflows.txt], [0], [dnl
>     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     50:54:00:00:00:04; next;)
>     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
>     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
>     50:54:00:00:00:04; next;)
>       ])
> 
>       # Delete sw0-p2 logical port
>     @@ -20274,22 +20276,22 @@ ovn-nbctl set logical_router_policy $pol5
>     options:pkt_mark=5
>       ovn-nbctl --wait=hv sync
> 
>       OVS_WAIT_UNTIL([
>     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
>     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>           grep "load:0x64->NXM_NX_PKT_MARK" -c)
>       ])
> 
>       OVS_WAIT_UNTIL([
>     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
>     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>           grep "load:0x3->NXM_NX_PKT_MARK" -c)
>       ])
> 
>       OVS_WAIT_UNTIL([
>     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
>     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>           grep "load:0x4->NXM_NX_PKT_MARK" -c)
>       ])
> 
>       OVS_WAIT_UNTIL([
>     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
>     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>           grep "load:0x5->NXM_NX_PKT_MARK" -c)
>       ])
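
A note on the table numbers in these checks: adding lr_in_ecmp_stateful as logical ingress table 7 moves every later router ingress stage up by one, so the OpenFlow table each test queries moves too. The sketch below assumes the logical ingress pipeline begins at OpenFlow table 8. That offset is only inferred from the numbers in this patch (lr_in_policy was 11 and was checked at table=19; it is now 12 and checked at table=20). The names ingress_oftable() and INGRESS_OFTABLE_BASE are hypothetical and are not OVN identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed offset between a logical ingress table number and the OpenFlow
 * table that implements it.  Inferred from the test expectations in this
 * patch, not taken from any OVN header. */
#define INGRESS_OFTABLE_BASE 8

/* Map a logical router ingress stage number to the OpenFlow table number
 * that "ovs-ofctl dump-flows br-int table=N" would be asked for. */
static inline uint8_t
ingress_oftable(uint8_t logical_table)
{
    /* Guard against wrapping past the 8-bit OpenFlow table space. */
    assert(logical_table <= UINT8_MAX - INGRESS_OFTABLE_BASE);
    return (uint8_t) (INGRESS_OFTABLE_BASE + logical_table);
}
```

Under that assumption, ingress_oftable(7) gives 15 (lr_in_ecmp_stateful) and ingress_oftable(13) gives 21 (lr_in_arp_resolve), which are the table numbers the new "Symmetric ECMP reply flows" test greps.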
> 
>     @@ -20380,12 +20382,12 @@ send_ipv4_pkt hv1 hv1-vif1 505400000003
>     00000000ff01 \
>           $(ip_to_hex 10 0 0 3) $(ip_to_hex 172 168 0 120)
> 
>       OVS_WAIT_UNTIL([
>     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
>     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>           grep "load:0x2->NXM_NX_PKT_MARK" -c)
>       ])
> 
>       AT_CHECK([
>     -    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
>     +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>           grep "load:0x64->NXM_NX_PKT_MARK" -c)
>       ])
> 
>     @@ -20741,3 +20743,126 @@ AT_CHECK([test "$hv2_offlows" =
>     "$hv2_offlows_mon"])
> 
>       OVN_CLEANUP([hv1], [hv2])
>       AT_CLEANUP
>     +
>     +AT_SETUP([ovn -- Symmetric ECMP reply flows])
>     +ovn_start
>     +
>     +net_add n1
>     +sim_add hv1
>     +as hv1
>     +ovs-vsctl add-br br-phys
>     +ovn_attach n1 br-phys 192.168.0.1
>     +
>     +sim_add hv2
>     +as hv2
>     +ovs-vsctl add-br br-phys
>     +ovn_attach n1 br-phys 192.168.0.2
>     +
>     +# Logical network
>     +#
>     +#   ls1 \
>     +#        \
>     +#         DR -- join -- GW -- ext
>     +#        /
>     +#   ls2 /
>     +#
>     +#  ls1 and ls2 are internal switches connected to distributed router
>     +#  DR. DR is then connected via a join switch to gateway router GW.
>     +#  GW is then connected to external switch ext. In real life, this
>     +#  would likely have a localnet port, but for the purposes of this test
>     +#  it is unnecessary.
>     +
>     +ovn-nbctl create Logical_Router name=DR
>     +gw_uuid=$(ovn-nbctl create Logical_Router name=GW)
>     +
>     +ovn-nbctl ls-add ls1
>     +ovn-nbctl ls-add ls2
>     +ovn-nbctl ls-add join
>     +ovn-nbctl ls-add ext
>     +
>     +# Connect ls1 to DR
>     +ovn-nbctl lrp-add DR dr-ls1 00:00:01:01:02:03 10.0.0.1/24
>     +ovn-nbctl lsp-add ls1 ls1-dr -- set Logical_Switch_Port ls1-dr \
>     +    type=router options:router-port=dr-ls1 addresses='"00:00:01:01:02:03"'
>     +
>     +# Connect ls2 to DR
>     +ovn-nbctl lrp-add DR dr-ls2 00:00:01:01:02:04 10.0.0.2/24
>     +ovn-nbctl lsp-add ls2 ls2-dr -- set Logical_Switch_Port ls2-dr \
>     +    type=router options:router-port=dr-ls2 addresses='"00:00:01:01:02:04"'
>     +
>     +# Connect join to DR
>     +ovn-nbctl lrp-add DR dr-join 00:00:02:01:02:03 20.0.0.1/24
>     +ovn-nbctl lsp-add join join-dr -- set Logical_Switch_Port join-dr \
>     +    type=router options:router-port=dr-join addresses='"00:00:02:01:02:03"'
>     +
>     +# Connect join to GW
>     +ovn-nbctl lrp-add GW gw-join 00:00:02:01:02:04 20.0.0.2/24
>     +ovn-nbctl lsp-add join join-gw -- set Logical_Switch_Port join-gw \
>     +    type=router options:router-port=gw-join addresses='"00:00:02:01:02:04"'
>     +
>     +# Connect ext to GW
>     +ovn-nbctl lrp-add GW gw-ext 00:00:03:01:02:03 172.16.0.1/16
>     +ovn-nbctl lsp-add ext ext-gw -- set Logical_Switch_Port ext-gw \
>     +    type=router options:router-port=gw-ext addresses='"00:00:03:01:02:03"'
>     +
>     +ovn-nbctl lr-route-add GW 10.0.0.0/24 20.0.0.1
>     +ovn-nbctl --policy="src-ip" lr-route-add DR 10.0.0.0/24 20.0.0.2
>     +
>     +# Now add some ECMP routes to the GW router.
>     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add GW 10.0.0.0/24 172.16.0.2
>     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add GW 10.0.0.0/24 172.16.0.3
>     +
>     +ovn-nbctl --wait=hv sync
>     +
>     +# Ensure ECMP symmetric reply flows are not present on any hypervisor.
>     +AT_CHECK([
>     +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=15 | \
>     +    grep "priority=100" | \
>     +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
>     +])
>     +AT_CHECK([
>     +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=21 | \
>     +    grep "priority=200" | \
>     +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
>     +])
>     +
>     +AT_CHECK([
>     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=15 | \
>     +    grep "priority=100" | \
>     +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
>     +])
>     +AT_CHECK([
>     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=21 | \
>     +    grep "priority=200" | \
>     +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
>     +])
>     +
>     +# Now make GW a gateway router on hv1
>     +ovn-nbctl set Logical_Router $gw_uuid options:chassis=hv1
>     +ovn-nbctl --wait=hv sync
>     +
>     +# And ensure that ECMP symmetric reply flows are present only on hv1
>     +AT_CHECK([
>     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=15 | \
>     +    grep "priority=100" | \
>     +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
>     +])
>     +AT_CHECK([
>     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=21 | \
>     +    grep "priority=200" | \
>     +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
>     +])
>     +
>     +AT_CHECK([
>     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=15 | \
>     +    grep "priority=100" | \
>     +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
>     +])
>     +AT_CHECK([
>     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=21 | \
>     +    grep "priority=200" | \
>     +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
>     +])
>     +
>     +OVN_CLEANUP([hv1], [hv2])
>     +AT_CLEANUP
>     diff --git a/tests/system-ovn.at b/tests/system-ovn.at
>     index eddc530f9..e239b7394 100644
>     --- a/tests/system-ovn.at
>     +++ b/tests/system-ovn.at
>     @@ -4483,3 +4483,147 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query
>     port patch-.*/d
>       /connection dropped.*/d"])
> 
>       AT_CLEANUP
>     +
>     +AT_SETUP([ovn -- ECMP symmetric reply])
>     +AT_KEYWORDS([ecmp])
>     +
>     +CHECK_CONNTRACK()
>     +ovn_start
>     +
>     +OVS_TRAFFIC_VSWITCHD_START()
>     +ADD_BR([br-int])
>     +
>     +# Set external-ids in br-int needed for ovn-controller
>     +ovs-vsctl \
>     +        -- set Open_vSwitch . external-ids:system-id=hv1 \
>     +        -- set Open_vSwitch .
>     external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
>     +        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
>     +        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
>     +        -- set bridge br-int fail-mode=secure
>     other-config:disable-in-band=true
>     +
>     +# Start ovn-controller
>     +start_daemon ovn-controller
>     +
>     +# Logical network:
>     +# Alice is connected to gateway router R1. R1 is connected to two
>     "external"
>     +# routers, R2 and R3 via an "ext" switch.
>     +# Bob is connected to both R2 and R3. R1 contains two ECMP routes,
>     one through R2
>     +# and one through R3, to Bob.
>     +#
>     +#     alice -- R1 -- ext ---- R2
>     +#                     |         \
>     +#                     |           bob
>     +#                     |         /
>     +#                     + ----- R3
>     +#
>     +# For this test, Bob sends request traffic through R2 to Alice. We
>     want to ensure that
>     +# all response traffic from Alice is routed through R2 as well.
>     +
>     +ovn-nbctl create Logical_Router name=R1 options:chassis=hv1
>     +ovn-nbctl create Logical_Router name=R2
>     +ovn-nbctl create Logical_Router name=R3
>     +
>     +ovn-nbctl ls-add alice
>     +ovn-nbctl ls-add bob
>     +ovn-nbctl ls-add ext
>     +
>     +# connect alice to R1
>     +ovn-nbctl lrp-add R1 alice 00:00:01:01:02:03 10.0.0.1/24
>     +ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
>     +    type=router options:router-port=alice addresses='"00:00:01:01:02:03"'
>     +
>     +# connect bob to R2
>     +ovn-nbctl lrp-add R2 R2_bob 00:00:02:01:02:03 172.16.0.2/16
>     +ovn-nbctl lsp-add bob rp2-bob -- set Logical_Switch_Port rp2-bob \
>     +    type=router options:router-port=R2_bob addresses='"00:00:02:01:02:03"'
>     +
>     +# connect bob to R3
>     +ovn-nbctl lrp-add R3 R3_bob 00:00:02:01:02:04 172.16.0.3/16
>     +ovn-nbctl lsp-add bob rp3-bob -- set Logical_Switch_Port rp3-bob \
>     +    type=router options:router-port=R3_bob addresses='"00:00:02:01:02:04"'
>     +
>     +# Connect R1 to ext
>     +ovn-nbctl lrp-add R1 R1_ext 00:00:04:01:02:03 20.0.0.1/24
>     +ovn-nbctl lsp-add ext r1-ext -- set Logical_Switch_Port r1-ext \
>     +    type=router options:router-port=R1_ext addresses='"00:00:04:01:02:03"'
>     +
>     +# Connect R2 to ext
>     +ovn-nbctl lrp-add R2 R2_ext 00:00:04:01:02:04 20.0.0.2/24
>     +ovn-nbctl lsp-add ext r2-ext -- set Logical_Switch_Port r2-ext \
>     +    type=router options:router-port=R2_ext addresses='"00:00:04:01:02:04"'
>     +
>     +# Connect R3 to ext
>     +ovn-nbctl lrp-add R3 R3_ext 00:00:04:01:02:05 20.0.0.3/24
>     +ovn-nbctl lsp-add ext r3-ext -- set Logical_Switch_Port r3-ext \
>     +    type=router options:router-port=R3_ext addresses='"00:00:04:01:02:05"'
>     +
>     +# Install ECMP routes for alice.
>     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1 10.0.0.0/24 20.0.0.2
>     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1 10.0.0.0/24 20.0.0.3
>     +
>     +# Static Routes
>     +ovn-nbctl lr-route-add R2 10.0.0.0/24 20.0.0.1
>     +ovn-nbctl lr-route-add R3 10.0.0.0/24 20.0.0.1
>     +
>     +# Logical port 'alice1' in switch 'alice'.
>     +ADD_NAMESPACES(alice1)
>     +ADD_VETH(alice1, alice1, br-int, "10.0.0.2/24", "f0:00:00:01:02:04", \
>     +         "10.0.0.1")
>     +ovn-nbctl lsp-add alice alice1 \
>     +-- lsp-set-addresses alice1 "f0:00:00:01:02:04 10.0.0.2"
>     +
>     +# Logical port 'bob1' in switch 'bob'.
>     +ADD_NAMESPACES(bob1)
>     +ADD_VETH(bob1, bob1, br-int, "172.16.0.1/16", "f0:00:00:01:02:06", \
>     +         "172.16.0.2")
>     +ovn-nbctl lsp-add bob bob1 \
>     +-- lsp-set-addresses bob1 "f0:00:00:01:02:06 172.16.0.1"
>     +
>     +# Ensure ovn-controller is caught up
>     +ovn-nbctl --wait=hv sync
>     +
>     +on_exit 'ovs-ofctl dump-flows br-int'
>     +
>     +# 'bob1' should be able to ping 'alice1' directly.
>     +NS_CHECK_EXEC([bob1], [ping -q -c 20 -i 0.3 -w 15 10.0.0.2 |
>     FORMAT_PING], \
>     +[0], [dnl
>     +20 packets transmitted, 20 received, 0% packet loss, time 0ms
>     +])
>     +
>     +# Ensure conntrack entry is present. We should not try to predict
>     +# the tunnel key for the output port, so we strip it from the labels
>     +# and just ensure that the known ethernet address is present.
>     +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.0.1) | \
>     +sed -e 's/zone=[[0-9]]*/zone=<cleared>/' |
>     +sed -e 's/labels=0x[[0-9a-f]]*00000401020400000000/labels=0x00000401020400000000/'], [0], [dnl
>     +icmp,orig=(src=172.16.0.1,dst=10.0.0.2,id=<cleared>,type=8,code=0),reply=(src=10.0.0.2,dst=172.16.0.1,id=<cleared>,type=0,code=0),zone=<cleared>,labels=0x00000401020400000000
>     +])
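
For readers checking the expected label value above: it follows if the source MAC is stored in ct_label bits 32..79 and the port tunnel key in bits 80..95, which is the layout the OpenFlow checks in tests/ovn.at grep for (NXM_NX_CT_LABEL[32..79] and [80..95]). The sketch below only illustrates that packing under that assumption. struct label128, ecmp_label_pack() and ecmp_label_eth_hex() are hypothetical helpers, not OVN or OVS functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* A 128-bit conntrack label split into two 64-bit halves. */
struct label128 {
    uint64_t hi;    /* bits 64..127 */
    uint64_t lo;    /* bits 0..63 */
};

/* Pack a 48-bit MAC into bits 32..79 and a 16-bit port key into
 * bits 80..95 (assumed layout, see the note above). */
static struct label128
ecmp_label_pack(uint64_t mac, uint16_t port_key)
{
    struct label128 l;

    assert(mac <= UINT64_C(0xffffffffffff));
    l.lo = mac << 32;                                  /* low 32 MAC bits */
    l.hi = (mac >> 32) | ((uint64_t) port_key << 16);  /* high 16 MAC bits + key */
    return l;
}

/* Format bits 0..79 as hex, i.e. the label with the port key stripped,
 * which is what the sed expression in the test leaves behind. */
static void
ecmp_label_eth_hex(struct label128 l, char buf[32])
{
    snprintf(buf, 32, "0x%04x%016llx",
             (unsigned int) (l.hi & 0xffff), (unsigned long long) l.lo);
}
```

With the MAC of R2_ext (00:00:04:01:02:04) and any port key, ecmp_label_eth_hex() produces 0x00000401020400000000, the value the AT_CHECK expects once the tunnel key digits are removed.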
>     +
>     +# Ensure datapaths show conntrack states as expected
>     +# Like with conntrack entries, we shouldn't try to predict
>     +# port binding tunnel keys. So omit them from expected labels.
>     +AT_CHECK([ovs-appctl dpctl/dump-flows | grep 'ct_state(+new-est-rpl+trk).*ct(.*label=0x.*00000401020400000000/0xffffffffffffffff00000000)' -c], [0], [dnl
>     +1
>     +])
>     +AT_CHECK([ovs-appctl dpctl/dump-flows | grep 'ct_state(-new+est+rpl+trk).*ct_label(0x.*00000401020400000000/0xffffffffffffffff00000000)' -c], [0], [dnl
>     +1
>     +])
>     +
>     +ovs-ofctl dump-flows br-int
>     +
>     +OVS_APP_EXIT_AND_WAIT([ovn-controller])
>     +
>     +as ovn-sb
>     +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>     +
>     +as ovn-nb
>     +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
>     +
>     +as northd
>     +OVS_APP_EXIT_AND_WAIT([ovn-northd])
>     +
>     +as
>     +OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
>     +/connection dropped.*/d"])
>     +
>     +AT_CLEANUP
>     diff --git a/utilities/ovn-nbctl.8.xml b/utilities/ovn-nbctl.8.xml
>     index de86b70e6..18bf90e08 100644
>     --- a/utilities/ovn-nbctl.8.xml
>     +++ b/utilities/ovn-nbctl.8.xml
>     @@ -658,7 +658,8 @@
> 
>           <dl>
>             <dt>[<code>--may-exist</code>]
>     [<code>--policy</code>=<var>POLICY</var>]
>     -        [<code>--ecmp</code>] <code>lr-route-add</code>
>     <var>router</var>
>     +        [<code>--ecmp</code>] [<code>--ecmp-symmetric-reply</code>]
>     +        <code>lr-route-add</code> <var>router</var>
>               <var>prefix</var> <var>nexthop</var> [<var>port</var>]</dt>
>             <dd>
>               <p>
>     @@ -680,15 +681,31 @@
>                 specified, the default is "dst-ip".
>               </p>
> 
>     +        <p>
>     +          The <code>--ecmp</code> option allows for multiple routes with the
>     +          same <var>prefix</var> and <var>POLICY</var> but different
>     +          <var>nexthop</var> and <var>port</var> to be added.
>     +        </p>
>     +
>     +        <p>
>     +          The <code>--ecmp-symmetric-reply</code> option makes it
>     so that
>     +          traffic that arrives over an ECMP route will have its
>     reply traffic
>     +          sent out over that same route. Setting
>     +          <code>--ecmp-symmetric-reply</code> implies
>     <code>--ecmp</code> so
>     +          it is not necessary to set both.
>     +        </p>
>     +
>               <p>
>                 It is an error if a route with <var>prefix</var> and
>     -          <var>POLICY</var> already exists, unless
>     <code>--may-exist</code> or
>     -          <code>--ecmp</code> is specified.  If
>     <code>--may-exist</code> is
>     -          specified but not <code>--ecmp</code>, the existed route
>     will be
>     -          updated with the new nexthop and port.  If
>     <code>--ecmp</code> is
>     +          <var>POLICY</var> already exists, unless
>     <code>--may-exist</code>,
>     +          <code>--ecmp</code>, or
>     <code>--ecmp-symmetric-reply</code> is
>     +          specified.  If <code>--may-exist</code> is specified but not
>     +          <code>--ecmp</code> or
>     <code>--ecmp-symmetric-reply</code>, the
>     +          existed route will be updated with the new nexthop and
>     port.  If
>     +          <code>--ecmp</code> or <code>--ecmp-symmetric-reply</code> is
>                 specified, a new route will be added, regardless of the
>     existed
>     -          route, which is useful when adding ECMP routes, i.e.
>     routes with same
>     -          <var>POLICY</var> and <var>prefix</var> but different
>     +          route, which is useful when adding ECMP routes, i.e.
>     routes with
>     +          same <var>POLICY</var> and <var>prefix</var> but different
>                 <var>nexthop</var> and <var>port</var>.
>               </p>
>             </dd>
>     diff --git a/utilities/ovn-nbctl.c b/utilities/ovn-nbctl.c
>     index 0079ad5a6..e6d8dbe63 100644
>     --- a/utilities/ovn-nbctl.c
>     +++ b/utilities/ovn-nbctl.c
>     @@ -687,7 +687,8 @@ Logical router port commands:\n\
>                                   ('overlay' or 'bridged')\n\
>       \n\
>       Route commands:\n\
>     -  [--policy=POLICY] [--ecmp] lr-route-add ROUTER PREFIX NEXTHOP
>     [PORT]\n\
>     +  [--policy=POLICY] [--ecmp] [--ecmp-symmetric-reply] lr-route-add
>     ROUTER \n\
>     +                            PREFIX NEXTHOP [PORT]\n\
>                                   add a route to ROUTER\n\
>         [--policy=POLICY] lr-route-del ROUTER [PREFIX [NEXTHOP [PORT]]]\n\
>                                   remove routes from ROUTER\n\
>     @@ -3855,7 +3856,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>           }
> 
>           bool may_exist = shash_find(&ctx->options, "--may-exist") != NULL;
>     -    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL;
>     +    bool ecmp_symmetric_reply = shash_find(&ctx->options,
>     +                                         
>       "--ecmp-symmetric-reply") != NULL;
>     +    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL ||
>     +                ecmp_symmetric_reply;
>           if (!ecmp) {
>               for (int i = 0; i < lr->n_static_routes; i++) {
>                   const struct nbrec_logical_router_static_route *route
>     @@ -3920,6 +3924,13 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>               nbrec_logical_router_static_route_set_policy(route, policy);
>           }
> 
>     +    if (ecmp_symmetric_reply) {
>     +        const struct smap options = SMAP_CONST1(&options,
>     +                                                "ecmp_symmetric_reply",
>     +                                                "true");
>     +        nbrec_logical_router_static_route_set_options(route, &options);
>     +    }
>     +
>           nbrec_logical_router_verify_static_routes(lr);
>           struct nbrec_logical_router_static_route **new_routes
>               = xmalloc(sizeof *new_routes * (lr->n_static_routes + 1));
>     @@ -6361,7 +6372,8 @@ static const struct ctl_command_syntax
>     nbctl_commands[] = {
> 
>           /* logical router route commands. */
>           { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
>     -      nbctl_lr_route_add, NULL, "--may-exist,--ecmp,--policy=", RW },
>     +      nbctl_lr_route_add, NULL,
>     "--may-exist,--ecmp,--ecmp-symmetric-reply,"
>     +      "--policy=", RW },
>           { "lr-route-del", 1, 4, "ROUTER [PREFIX [NEXTHOP [PORT]]]", NULL,
>             nbctl_lr_route_del, NULL, "--if-exists,--policy=", RW },
>           { "lr-route-list", 1, 1, "ROUTER", NULL, nbctl_lr_route_list,
>     NULL,
>     -- 
>     2.25.4
> 
>     _______________________________________________
>     dev mailing list
>     dev@openvswitch.org
>     https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Numan Siddique July 28, 2020, 4:15 p.m. UTC | #4
On Tue, Jul 28, 2020 at 8:38 PM Mark Michelson <mmichels@redhat.com> wrote:

> On 7/28/20 9:23 AM, Numan Siddique wrote:
> >
> >
> > On Tue, Jul 28, 2020 at 2:51 AM Mark Michelson <mmichels@redhat.com> wrote:
> >
> >     When traffic arrives over an ECMP route, there is no guarantee that
> the
> >     reply traffic will egress over the same route. Sometimes, the nature
> of
> >     the traffic (or the intervening equipment) means that it is important
> >     for reply traffic to go out the same route it came in.
> >
> >     This commit introduces optional ECMP symmetric reply behavior. If
> >     configured, then traffic to or from the ECMP route will be sent to
> >     conntrack. New incoming traffic over the route will have the source
> MAC
> >     address and incoming port saved in the ct_label. Reply traffic then
> uses
> >     this saved information to send the packet back out the same way it
> came
> >     in.
> >
> >     To facilitate this, a new table was added to the ingress logical
> router
> >     pipeline. The ECMP_STATEFUL table is responsible for committing to
> >     conntrack and setting the ct_label when it detects new incoming
> traffic
> >     from the route.
> >
> >     Since ingress pipeline logic on the logical router depends on ct
> state
> >     of a particular hypervisor, this feature is only usable on gateway
> >     routers.
> >
> >     Signed-off-by: Mark Michelson <mmichels@redhat.com>
> >     Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1849683
> >
> >
> > Hi Mark,
> >
> > Thanks for the new version. The first 4 patches in the series LGTM.
> >
> > I have a few comments on this patch:
> >
> > 1. This patch series needs a rebase as it's not applying cleanly on top
> > of the master.
>
> OK thanks, I'll get this fixed.
>
> >
> > 2. I think we should not exclude logical routers with
> > distributed gateway ports from this feature.
> >     For a logical router with a gw port I think you can add the same ecmp
> > symmetric flows but with
> >     one extra match - "inport == cr-<gw-port> && ...."
> >     We do the same in many parts of the code. OpenStack may use this
> > feature and OpenStack Neutron
> >     doesn't use gateway routers.
>
> I think I need a bit of education here about how this can work. Let me
> explain how I'm viewing this and you can explain if my thinking is wrong.
>
> Let's say that you have a router ro-1 with ECMP routes. You've set this
> up to be a distributed router with a gateway port, and the gateway port
> is bound to chassis-1 (either via ha_chassis_group, gateway_chassis, or
> options:redirect_chassis). You have switch ls-1 connected to ro-1. VMs
> connected to ls-1 are distributed across multiple chassis.
>
> Traffic originates from outside of the logical network and comes in the
> gateway port of ro-1 on chassis-1. The ingress pipeline runs, and
> conntrack saves the source ethernet address and port so we can send
> return traffic out the same port. Next, the egress pipeline of ro-1 runs
> on chassis-1. Then the ingress pipeline of ls-1 runs on chassis-1.
> During this, it is determined that the destination output port is on
> chassis-2. So the packet is tunneled to chassis-2. There, the egress
> pipeline of ls-1 runs and the packet is output to the VM. All is fine at
> this point.
>
> Now, the VM sends reply traffic. The ingress pipeline of ls-1 is run,
> and the output port is the port linking ls-1 to ro-1. Now here's where
> things get a bit hazy for me. Since ro-1 is a distributed router, the
> port binding type for ls-1's port to ro-1 is a "patch" port. So the
> egress pipeline of ls-1 is run on chassis-2. Then the ingress pipeline
> of ro-1 will also run on chassis-2. This is a problem, because the
> conntrack entries for symmetric ECMP reply are on chassis-1. It's not
> until the ingress pipeline of ro-1 is completed that the packet is
> tunneled to chassis-1. Then on chassis-1 the egress pipeline of ro-1
> will run.
>
> When you use a gateway router, the port binding type for ls-1's port to
> ro-1 is "l3gateway". This means that the packet would get tunneled to
> chassis-1 before running the egress pipeline for ls-1. Then, the ingress
> pipeline of ro-1 runs on chassis-1 so everything works.
>
> Have I misunderstood how this works?
>


I made a mistake earlier. I meant to say 'is_chassis_resident("cr-...")'.

I think you're right. I need to take a closer look at how that can be
supported, if it can be supported at all.

We can explore this later if there is a requirement to support this feature
on gw router ports.
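
To make the chassis-locality argument above concrete, here is a minimal
sketch in C. It is a toy model, not OVN code: the enum, both function names
and the chassis strings are hypothetical. It assumes, as in the quoted
walkthrough, that a "patch" binding leaves the router ingress pipeline on the
VM's chassis while an "l3gateway" binding moves it to the gateway chassis.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the binding type of the switch port that peers
 * with the router (names are illustrative, not OVN identifiers). */
enum peer_binding { BINDING_PATCH, BINDING_L3GATEWAY };

/* Chassis that runs the router ingress pipeline for VM-originated traffic.
 * "patch": stays on the VM's chassis.  "l3gateway": packet is tunneled to
 * the gateway chassis first. */
const char *
router_ingress_chassis(enum peer_binding binding,
                       const char *vm_chassis, const char *gw_chassis)
{
    assert(vm_chassis && gw_chassis);
    return binding == BINDING_L3GATEWAY ? gw_chassis : vm_chassis;
}

/* conntrack state is local to a chassis, so a reply can only match the
 * saved ct_label if router ingress runs where the entry was committed
 * (assumed here to be the gateway chassis). */
bool
reply_sees_ct_entry(enum peer_binding binding,
                    const char *vm_chassis, const char *gw_chassis)
{
    const char *where = router_ingress_chassis(binding, vm_chassis,
                                               gw_chassis);
    return strcmp(where, gw_chassis) == 0;
}
```

With the VM on chassis-2 and the gateway on chassis-1, the "patch" case
returns false and the "l3gateway" case returns true, which is the asymmetry
described above.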

Thanks
Numan


>
> Assuming I haven't...
>
> If the logic for using ECMP symmetric reply on the return traffic could
> be moved to the egress router pipeline, then I understand how it would
> work with a distributed router with gateway port. But I don't see how
> you can do that since the ECMP symmetric reply needs to choose the
> output port. By definition that has to be done in the ingress pipeline.
>
> I guess one option would be to limit the use of ECMP symmetric reply
> traffic to only the gateway port on a distributed router. In this case,
> there would be no need to save the input port in conntrack since there's
> only one possibility. Instead, we would only need to save the nexthop
> MAC address. This way, in the egress pipeline we could override the
> initial ECMP route selection by changing eth.dst.
>
> >
> > 3. In my testing with the logical resources created from system-ovn.at,
> > I noticed that
> >       - The traffic initiated from bob1 to alice1 works as expected. The
> > newly added logical flows get hit
> >          and the ct_label is set as expected.
> >
> >        - The problem is in the traffic initiated by alice1. For the
> > first packet from alice1, the select action is executed
> >          to choose one ecmp route (which is expected) and this packet is
> > not committed to the conntrack.
> >          For the reply traffic from bob1, the packet gets committed
> > because of this flow
> >          table=7 (lr_in_ecmp_stateful), priority=100  , match=(inport ==
> > "R1_ext" && ip4.dst == 10.0.0.0/24 && (ct.new &&
> > !ct.est)), action=(ct_commit { ct_label.ecmp_reply_eth = eth.src;
> > ct_label.ecmp_reply_port = 2;}; next;)
> >    -   Basically the reverse traffic is treated as new traffic. And from
> > here on, the packet from alice1 is considered as reply traffic.
> >        table=10(lr_in_ip_routing   ), priority=100  , match=(ct.rpl &&
> > ct_label.ecmp_reply_port == 2 && ip4.src == 10.0.0.0/24),
> > action=(ip.ttl--; flags.loopback = 1; eth.src =
> > 00:00:04:01:02:03; reg1 = 20.0.0.1; outport = "R1_ext"; next;)
> >    - I'm not really sure if it's a problem or not. Maybe it's fine. But
> > is it as expected ? I personally don't see any harm with this.
> >
> >     - But I would like to know your comments and maybe Han has some
> > comments.
>
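For readers following the flows quoted above, this is a small sketch of how
the two ct_label subfields could be packed. The bit ranges (ct_label[32..79]
for ecmp_reply_eth, ct_label[80..95] for ecmp_reply_port) are taken from the
lib/logical-fields.c hunk of this patch; the struct and function names are
hypothetical, and representing the 128-bit label as two 64-bit halves is an
assumption made only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 128-bit conntrack label split into two 64-bit halves. */
struct ct_label128 {
    uint64_t hi;   /* label bits 64..127 */
    uint64_t lo;   /* label bits 0..63 */
};

/* Packs a 48-bit source MAC into ct_label[32..79] and a 16-bit port
 * tunnel key into ct_label[80..95].  Bits 0..31 are left untouched (zero),
 * since bit 0 is already used by ct_label.blocked. */
struct ct_label128
ecmp_reply_label(uint64_t mac48, uint16_t port_key)
{
    assert(mac48 <= UINT64_C(0xffffffffffff));

    struct ct_label128 label;
    /* Low 32 bits of the MAC land in label bits 32..63. */
    label.lo = (mac48 & UINT64_C(0xffffffff)) << 32;
    /* High 16 bits of the MAC land in label bits 64..79, and the port
     * key in label bits 80..95. */
    label.hi = (mac48 >> 32) | ((uint64_t) port_key << 16);
    return label;
}
```

For the MAC 00:00:04:01:02:04 the MAC part of the label comes out as
0x00000401020400000000, which is the value the system test in this patch
greps for while ignoring the port key.
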
> Hm, this is a bit hard to fix.
>
> If you don't turn on symmetric replies, then traffic that originates
> from Alice for a connection *should* choose the same outgoing route
> every time since the 4-tuple will be the same throughout the life of the
> connection.
>
> If you turn on symmetric replies, then you still get the same behavior,
> but you're adding in extra conntrack use.
>
> So how do you detect that the traffic coming from Bob to Alice is in
> reply to Alice's traffic and avoid sending it to conntrack? You have to
> use conntrack to detect the direction, right? So in order to avoid using
> conntrack, we have to use conntrack...
>

Agreed. I'm fine with the behaviour observed with your patches.
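
As an aside on why Alice-originated traffic keeps picking the same route
without conntrack: member selection is a pure function of the flow's tuple.
Below is a minimal sketch, assuming an FNV-1a hash over a 5-tuple. This is
not the hash that the select() action or OVS actually uses; the struct and
function names are hypothetical and only illustrate the determinism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key; field names are illustrative. */
struct flow_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t proto;
};

/* Folds the low n_bytes bytes of 'value' into a 32-bit FNV-1a hash. */
uint32_t
fnv1a_bytes(uint32_t h, uint64_t value, size_t n_bytes)
{
    for (size_t i = 0; i < n_bytes; i++) {
        h ^= (uint8_t) (value >> (8 * i));
        h *= UINT32_C(16777619);            /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Returns the index of the ECMP member chosen for this flow.  The result
 * depends only on the tuple, so every packet of one connection in one
 * direction maps to the same member. */
size_t
ecmp_select_member(const struct flow_tuple *t, size_t n_members)
{
    assert(t && n_members > 0);

    uint32_t h = UINT32_C(2166136261);      /* FNV-1a 32-bit offset basis */
    h = fnv1a_bytes(h, t->src_ip, 4);
    h = fnv1a_bytes(h, t->dst_ip, 4);
    h = fnv1a_bytes(h, t->src_port, 2);
    h = fnv1a_bytes(h, t->dst_port, 2);
    h = fnv1a_bytes(h, t->proto, 1);
    return h % n_members;
}
```

Nothing in such a hash ties the reverse direction (Bob to Alice) to the same
member, which is why the symmetric reply case needs the state saved in
conntrack.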



>
> >
> >   4. The test case - "3: ovn -- conntrack fields" is failing with this
> > patch. It's a small error which you forgot to change I suppose.
>
> I actually had fixed this locally but then I guess I accidentally
> overwrote the changes and pushed an unfixed version. Sorry about that.
>
> >
> >   5. Since you are adding a new column in Logical_Router_Static_Route, I
> > think the schema version needs to be updated to - "5.25.0"
>
> Will do.
>
>

Thanks
Numan


> >
> > Thanks
> > Numan
> >
> >
> >     ---
> >       lib/logical-fields.c      |   4 +
> >       northd/ovn-northd.8.xml   |  49 ++++++++++---
> >       northd/ovn-northd.c       | 123 +++++++++++++++++++++++++++----
> >       ovn-architecture.7.xml    |   7 +-
> >       ovn-nb.ovsschema          |   5 +-
> >       ovn-nb.xml                |  16 ++++
> >       tests/ovn.at              | 151 ++++++++++++++++++++++++++++++++++----
> >       tests/system-ovn.at       | 144 ++++++++++++++++++++++++++++++++++++
> >       utilities/ovn-nbctl.8.xml |  31 ++++++--
> >       utilities/ovn-nbctl.c     |  18 ++++-
> >       10 files changed, 496 insertions(+), 52 deletions(-)
> >
> >     diff --git a/lib/logical-fields.c b/lib/logical-fields.c
> >     index fde53a47e..15342dded 100644
> >     --- a/lib/logical-fields.c
> >     +++ b/lib/logical-fields.c
> >     @@ -130,6 +130,10 @@ ovn_init_symtab(struct shash *symtab)
> >                                        WR_CT_COMMIT);
> >           expr_symtab_add_subfield_scoped(symtab, "ct_label.blocked",
> NULL,
> >                                           "ct_label[0]", WR_CT_COMMIT);
> >     +    expr_symtab_add_subfield_scoped(symtab,
> >     "ct_label.ecmp_reply_eth", NULL,
> >     +                                    "ct_label[32..79]",
> WR_CT_COMMIT);
> >     +    expr_symtab_add_subfield_scoped(symtab,
> >     "ct_label.ecmp_reply_port", NULL,
> >     +                                    "ct_label[80..95]",
> WR_CT_COMMIT);
> >
> >           expr_symtab_add_field(symtab, "ct_state", MFF_CT_STATE, NULL,
> >     false);
> >
> >     diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> >     index eb2514f15..cf251e02a 100644
> >     --- a/northd/ovn-northd.8.xml
> >     +++ b/northd/ovn-northd.8.xml
> >     @@ -2120,15 +2120,31 @@ icmp6 {
> >           <p>
> >             This is to send packets to connection tracker for tracking
> and
> >             defragmentation.  It contains a priority-0 flow that simply
> >     moves traffic
> >     -      to the next table.  If load balancing rules with virtual IP
> >     addresses
> >     -      (and ports) are configured in <code>OVN_Northbound</code>
> >     database for a
> >     -      Gateway router, a priority-100 flow is added for each
> >     configured virtual
> >     -      IP address <var>VIP</var>. For IPv4 <var>VIPs</var> the flow
> >     matches
> >     -      <code>ip &amp;&amp; ip4.dst == <var>VIP</var></code>.  For
> IPv6
> >     -      <var>VIPs</var>, the flow matches <code>ip &amp;&amp; ip6.dst
> ==
> >     -      <var>VIP</var></code>.  The flow uses the action
> >     <code>ct_next;</code>
> >     -      to send IP packets to the connection tracker for packet
> >     de-fragmentation
> >     -      and tracking before sending it to the next table.
> >     +      to the next table.
> >     +    </p>
> >     +
> >     +    <p>
> >     +      If load balancing rules with virtual IP addresses (and ports)
> are
> >     +      configured in <code>OVN_Northbound</code> database for a
> >     Gateway router,
> >     +      a priority-100 flow is added for each configured virtual IP
> >     address
> >     +      <var>VIP</var>. For IPv4 <var>VIPs</var> the flow matches
> >     <code>ip
> >     +      &amp;&amp; ip4.dst == <var>VIP</var></code>.  For IPv6
> >     <var>VIPs</var>,
> >     +      the flow matches <code>ip &amp;&amp; ip6.dst ==
> >     <var>VIP</var></code>.
> >     +      The flow uses the action <code>ct_next;</code> to send IP
> >     packets to the
> >     +      connection tracker for packet de-fragmentation and tracking
> >     before
> >     +      sending it to the next table.
> >     +    </p>
> >     +
> >     +    <p>
> >     +      If ECMP routes with symmetric reply are configured in the
> >     +      <code>OVN_Northbound</code> database for a gateway router, a
> >     priority-100
> >     +      flow is added for each router port on which symmetric replies
> are
> >     +      configured. The matching logic for these ports essentially
> >     reverses the
> >     +      configured logic of the ECMP route. So for instance, a route
> >     with a
> >     +      destination routing policy will instead match if the source
> >     IP address
> >     +      matches the static route's prefix. The flow uses the action
> >     +      <code>ct_next</code> to send IP packets to the connection
> >     tracker for
> >     +      packet de-fragmentation and tracking before sending it to the
> >     next table.
> >           </p>
> >
> >           <h3>Ingress Table 5: UNSNAT</h3>
> >     @@ -2489,7 +2505,15 @@ output;
> >             table.  This table, instead, is responsible for determine
> >     the ECMP
> >             group id and select a member id within the group based on
> >     5-tuple
> >             hashing.  It stores group id in <code>reg8[0..15]</code> and
> >     member id in
> >     -      <code>reg8[16..31]</code>.
> >     +      <code>reg8[16..31]</code>. This step is skipped if the
> >     traffic going
> >     +      out the ECMP route is reply traffic, and the ECMP route was
> >     configured
> >     +      to use symmetric replies. Instead, the stored
> >     <code>ct_label</code> value
> >     +      is used to choose the destination. The least significant 48
> >     bits of the
> >     +      <code>ct_label</code> tell the destination MAC address to
> >     which the
> >     +      packet should be sent. The next 16 bits tell the logical
> >     router port on
> >     +      which the packet should be sent. These values in the
> >     +      <code>ct_label</code> are set when the initial ingress
> traffic is
> >     +      received over the ECMP route.
> >           </p>
> >
> >           <p>
> >     @@ -2639,6 +2663,11 @@ select(reg8[16..31], <var>MID1</var>,
> >     <var>MID2</var>, ...);
> >             address and <code>reg1</code> as the source protocol
> address).
> >           </p>
> >
> >     +    <p>
> >     +      This processing is skipped for reply traffic being sent out
> >     of an ECMP
> >     +      route if the route was configured to use symmetric replies.
> >     +    </p>
> >     +
> >           <p>
> >             This table contains the following logical flows:
> >           </p>
> >     diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> >     index d10e5ee5d..85f04ccde 100644
> >     --- a/northd/ovn-northd.c
> >     +++ b/northd/ovn-northd.c
> >     @@ -172,16 +172,17 @@ enum ovn_stage {
> >           PIPELINE_STAGE(ROUTER, IN,  DEFRAG,          4,
> >     "lr_in_defrag")       \
> >           PIPELINE_STAGE(ROUTER, IN,  UNSNAT,          5,
> >     "lr_in_unsnat")       \
> >           PIPELINE_STAGE(ROUTER, IN,  DNAT,            6, "lr_in_dnat")
> >             \
> >     -    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   7,
> >     "lr_in_nd_ra_options") \
> >     -    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  8,
> >     "lr_in_nd_ra_response") \
> >     -    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9,
> >     "lr_in_ip_routing")   \
> >     -    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10,
> >     "lr_in_ip_routing_ecmp") \
> >     -    PIPELINE_STAGE(ROUTER, IN,  POLICY,          11,
> >     "lr_in_policy")       \
> >     -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12,
> >     "lr_in_arp_resolve")  \
> >     -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13,
> >     "lr_in_chk_pkt_len")   \
> >     -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
> >       14,"lr_in_larger_pkts")   \
> >     -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15,
> >     "lr_in_gw_redirect")  \
> >     -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16,
> >     "lr_in_arp_request")  \
> >     +    PIPELINE_STAGE(ROUTER, IN,  ECMP_STATEFUL,   7,
> >     "lr_in_ecmp_stateful") \
> >     +    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   8,
> >     "lr_in_nd_ra_options") \
> >     +    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  9,
> >     "lr_in_nd_ra_response") \
> >     +    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      10,
> >     "lr_in_ip_routing")   \
> >     +    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 11,
> >     "lr_in_ip_routing_ecmp") \
> >     +    PIPELINE_STAGE(ROUTER, IN,  POLICY,          12,
> >     "lr_in_policy")       \
> >     +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13,
> >     "lr_in_arp_resolve")  \
> >     +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14,
> >     "lr_in_chk_pkt_len")   \
> >     +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
> >       15,"lr_in_larger_pkts")   \
> >     +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16,
> >     "lr_in_gw_redirect")  \
> >     +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17,
> >     "lr_in_arp_request")  \
> >
> >         \
> >           /* Logical router egress stages. */
> >         \
> >           PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")
> >          \
> >     @@ -7312,6 +7313,7 @@ struct parsed_route {
> >           bool is_src_route;
> >           uint32_t hash;
> >           const struct nbrec_logical_router_static_route *route;
> >     +    bool ecmp_symmetric_reply;
> >       };
> >
> >       static uint32_t
> >     @@ -7373,6 +7375,8 @@ parsed_routes_add(struct ovs_list *routes,
> >                                                        "src-ip"));
> >           pr->hash = route_hash(pr);
> >           pr->route = route;
> >     +    pr->ecmp_symmetric_reply = smap_get_bool(&route->options,
> >     +
> >       "ecmp_symmetric_reply", false);
> >           ovs_list_insert(routes, &pr->list_node);
> >           return pr;
> >       }
> >     @@ -7621,18 +7625,95 @@ find_static_route_outport(struct
> >     ovn_datapath *od, struct hmap *ports,
> >           return true;
> >       }
> >
> >     +static void
> >     +add_ecmp_symmetric_reply_flows(struct hmap *lflows,
> >     +                               struct ovn_datapath *od,
> >     +                               const char *port_ip,
> >     +                               struct ovn_port *out_port,
> >     +                               const struct parsed_route *route,
> >     +                               struct ds *route_match)
> >     +{
> >     +    const struct nbrec_logical_router_static_route *st_route =
> >     route->route;
> >     +    struct ds match = DS_EMPTY_INITIALIZER;
> >     +    struct ds actions = DS_EMPTY_INITIALIZER;
> >     +    struct ds ecmp_reply = DS_EMPTY_INITIALIZER;
> >     +    char *cidr = normalize_v46_prefix(&route->prefix, route->plen);
> >     +
> >     +    /* If symmetric ECMP replies are enabled, then packets that
> >     arrive over
> >     +     * an ECMP route need to go through conntrack.
> >     +     */
> >     +    ds_put_format(&match, "inport == %s && ip%s.%s == %s",
> >     +                  out_port->json_key,
> >     +                  route->prefix.family == AF_INET ? "4" : "6",
> >     +                  route->is_src_route ? "dst" : "src",
> >     +                  cidr);
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
> >     +                            ds_cstr(&match), "ct_next;",
> >     +                            &st_route->header_);
> >     +
> >     +    /* And packets that go out over an ECMP route need conntrack */
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
> >     +                            ds_cstr(route_match), "ct_next;",
> >     +                            &st_route->header_);
> >     +
> >     +    /* Save src eth and inport in ct_label for packets that arrive
> over
> >     +     * an ECMP route.
> >     +     *
> >     +     * NOTE: we purposely are not clearing match before this
> >     +     * ds_put_cstr() call. The previous contents are needed.
> >     +     */
> >     +    ds_put_cstr(&match, " && (ct.new && !ct.est)");
> >     +
> >     +    ds_put_format(&actions, "ct_commit { ct_label.ecmp_reply_eth =
> >     eth.src;"
> >     +                  " ct_label.ecmp_reply_port = %" PRId64 ";};
> next;",
> >     +                  out_port->sb->tunnel_key);
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ECMP_STATEFUL,
> 100,
> >     +                            ds_cstr(&match), ds_cstr(&actions),
> >     +                            &st_route->header_);
> >     +
> >     +    /* Bypass ECMP selection if we already have ct_label information
> >     +     * for where to route the packet.
> >     +     */
> >     +    ds_put_format(&ecmp_reply, "ct.rpl && ct_label.ecmp_reply_port
> >     == %"
> >     +                  PRId64, out_port->sb->tunnel_key);
> >     +    ds_clear(&match);
> >     +    ds_put_format(&match, "%s && %s", ds_cstr(&ecmp_reply),
> >     +                  ds_cstr(route_match));
> >     +    ds_clear(&actions);
> >     +    ds_put_format(&actions, "ip.ttl--; flags.loopback = 1; "
> >     +                  "eth.src = %s; %sreg1 = %s; outport = %s; next;",
> >     +                  out_port->lrp_networks.ea_s,
> >     +                  route->prefix.family == AF_INET ? "" : "xx",
> >     +                  port_ip, out_port->json_key);
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_IP_ROUTING, 100,
> >     +                           ds_cstr(&match), ds_cstr(&actions),
> >     +                           &st_route->header_);
> >     +
> >     +    /* Egress reply traffic for symmetric ECMP routes skips router
> >     policies. */
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_POLICY, 65535,
> >     +                            ds_cstr(&ecmp_reply), "next;",
> >     +                            &st_route->header_);
> >     +
> >     +    ds_clear(&actions);
> >     +    ds_put_cstr(&actions, "eth.dst = ct_label.ecmp_reply_eth;
> next;");
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ARP_RESOLVE,
> >     +                            200, ds_cstr(&ecmp_reply),
> >     +                            ds_cstr(&actions), &st_route->header_);
> >     +}
> >     +
> >       static void
> >       build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
> >                             struct hmap *ports, struct ecmp_groups_node
> *eg)
> >
> >       {
> >           bool is_ipv4 = (eg->prefix.family == AF_INET);
> >     -    struct ds match = DS_EMPTY_INITIALIZER;
> >           uint16_t priority;
> >     +    struct ecmp_route_list_node *er;
> >     +    struct ds route_match = DS_EMPTY_INITIALIZER;
> >
> >           char *prefix_s = build_route_prefix_s(&eg->prefix, eg->plen);
> >           build_route_match(NULL, prefix_s, eg->plen, eg->is_src_route,
> >     is_ipv4,
> >     -                      &match, &priority);
> >     +                      &route_match, &priority);
> >           free(prefix_s);
> >
> >           struct ds actions = DS_EMPTY_INITIALIZER;
> >     @@ -7640,7 +7721,6 @@ build_ecmp_route_flow(struct hmap *lflows,
> >     struct ovn_datapath *od,
> >                         "; %s = select(", REG_ECMP_GROUP_ID, eg->id,
> >                         REG_ECMP_MEMBER_ID);
> >
> >     -    struct ecmp_route_list_node *er;
> >           bool is_first = true;
> >           LIST_FOR_EACH (er, list_node, &eg->route_list) {
> >               if (is_first) {
> >     @@ -7654,11 +7734,12 @@ build_ecmp_route_flow(struct hmap *lflows,
> >     struct ovn_datapath *od,
> >           ds_put_cstr(&actions, ");");
> >
> >           ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
> >     -                  ds_cstr(&match), ds_cstr(&actions));
> >     +                  ds_cstr(&route_match), ds_cstr(&actions));
> >
> >           /* Add per member flow */
> >     +    struct ds match = DS_EMPTY_INITIALIZER;
> >     +    struct sset visited_ports = SSET_INITIALIZER(&visited_ports);
> >           LIST_FOR_EACH (er, list_node, &eg->route_list) {
> >     -
> >               const struct parsed_route *route_ = er->route;
> >               const struct nbrec_logical_router_static_route *route =
> >     route_->route;
> >               /* Find the outgoing port. */
> >     @@ -7668,6 +7749,15 @@ build_ecmp_route_flow(struct hmap *lflows,
> >     struct ovn_datapath *od,
> >                                              &out_port)) {
> >                   continue;
> >               }
> >     +        /* Symmetric ECMP reply is only usable on gateway routers.
> >     +         * It is NOT usable on distributed routers with a gateway
> port.
> >     +         */
> >     +        if (smap_get(&od->nbr->options, "chassis") &&
> >     +            route_->ecmp_symmetric_reply && sset_add(&visited_ports,
> >     +
>  out_port->key)) {
> >     +            add_ecmp_symmetric_reply_flows(lflows, od, lrp_addr_s,
> >     out_port,
> >     +                                           route_, &route_match);
> >     +        }
> >               ds_clear(&match);
> >               ds_put_format(&match, REG_ECMP_GROUP_ID" == %"PRIu16" && "
> >                             REG_ECMP_MEMBER_ID" == %"PRIu16,
> >     @@ -7688,7 +7778,9 @@ build_ecmp_route_flow(struct hmap *lflows,
> >     struct ovn_datapath *od,
> >                                       ds_cstr(&match), ds_cstr(&actions),
> >                                       &route->header_);
> >           }
> >     +    sset_destroy(&visited_ports);
> >           ds_destroy(&match);
> >     +    ds_destroy(&route_match);
> >           ds_destroy(&actions);
> >       }
> >
> >     @@ -8972,6 +9064,7 @@ build_lrouter_flows(struct hmap *datapaths,
> >     struct hmap *ports,
> >               ovn_lflow_add(lflows, od, S_ROUTER_IN_DNAT, 0, "1",
> "next;");
> >               ovn_lflow_add(lflows, od, S_ROUTER_OUT_UNDNAT, 0, "1",
> >     "next;");
> >               ovn_lflow_add(lflows, od, S_ROUTER_OUT_EGR_LOOP, 0, "1",
> >     "next;");
> >     +        ovn_lflow_add(lflows, od, S_ROUTER_IN_ECMP_STATEFUL, 0,
> >     "1", "next;");
> >
> >               /* Send the IPv6 NS packets to next table. When
> ovn-controller
> >                * generates IPv6 NS (for the action - nd_ns{}), the
> injected
> >     diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
> >     index 246cebc19..b1a462933 100644
> >     --- a/ovn-architecture.7.xml
> >     +++ b/ovn-architecture.7.xml
> >     @@ -1210,11 +1210,12 @@
> >           <dd>
> >             Fields that denote the connection tracking zones for
> >     routers.  These
> >             values only have local significance and are not meaningful
> >     between
> >     -      chassis.  OVN stores the zone information for DNATting in
> >     Open vSwitch
> >     +      chassis.  OVN stores the zone information for north to south
> >     traffic
> >     +      (for DNATting or ECMP symmetric replies) in Open vSwitch
> >               <!-- Keep the following in sync with MFF_LOG_DNAT_ZONE and
> >               MFF_LOG_SNAT_ZONE in ovn/lib/logical-fields.h. -->
> >     -      extension register number 11 and zone information for SNATing
> in
> >     -      Open vSwitch extension register number 12.
> >     +      extension register number 11 and zone information for south
> >     to north
> >     +      traffic (for SNATing) in Open vSwitch extension register
> >     number 12.
> >           </dd>
> >
> >           <dt>logical flow flags</dt>
> >     diff --git a/ovn-nb.ovsschema b/ovn-nb.ovsschema
> >     index da9af7157..16f7794f2 100644
> >     --- a/ovn-nb.ovsschema
> >     +++ b/ovn-nb.ovsschema
> >     @@ -1,7 +1,7 @@
> >       {
> >           "name": "OVN_Northbound",
> >           "version": "5.24.0",
> >     -    "cksum": "1092394564 25961",
> >     +    "cksum": "679745602 26116",
> >           "tables": {
> >               "NB_Global": {
> >                   "columns": {
> >     @@ -365,6 +365,9 @@
> >                                           "min": 0, "max": 1}},
> >                       "nexthop": {"type": "string"},
> >                       "output_port": {"type": {"key": "string", "min":
> >     0, "max": 1}},
> >     +                "options": {
> >     +                    "type": {"key": "string", "value": "string",
> >     +                             "min": 0, "max": "unlimited"}},
> >                       "external_ids": {
> >                           "type": {"key": "string", "value": "string",
> >                                    "min": 0, "max": "unlimited"}}},
> >     diff --git a/ovn-nb.xml b/ovn-nb.xml
> >     index db5908cd5..5e434d257 100644
> >     --- a/ovn-nb.xml
> >     +++ b/ovn-nb.xml
> >     @@ -2481,6 +2481,22 @@
> >             </column>
> >           </group>
> >
> >     +    <group title="Common options">
> >     +      <column name="options">
> >     +        This column provides general key/value settings. The
> supported
> >     +        options are described individually below.
> >     +      </column>
> >     +
> >     +      <column name="options" key="ecmp_symmetric_reply">
> >     +        If true, then new traffic that arrives over this route will
> >     have
> >     +        its reply traffic bypass ECMP route selection and will be
> >     sent out
> >     +        this route instead. Note that this option overrides any
> >     rules set
> >     +        in the <ref table="Logical_Router_policy" /> table. This
> option
> >     +        only works on gateway routers (routers that have
> >     +        <ref column="options" key="chassis" table="Logical_Router"
> >     /> set).
> >     +      </column>
> >     +    </group>
> >     +
> >         </table>
> >
> >         <table name="Logical_Router_Policy" title="Logical router
> policies">
> >     diff --git a/tests/ovn.at b/tests/ovn.at
> >     index f8dde14c2..c1ab6b85f 100644
> >     --- a/tests/ovn.at
> >     +++ b/tests/ovn.at
> >     @@ -195,6 +195,8 @@ ct.snat = ct_state[6]
> >       ct.trk = ct_state[5]
> >       ct_label = NXM_NX_CT_LABEL
> >       ct_label.blocked = ct_label[0]
> >     +ct_label.ecmp_reply_eth = ct_label[0..47]
> >     +ct_label.ecmp_reply_port = ct_label[48..63]
> >       ct_mark = NXM_NX_CT_MARK
> >       ct_state = NXM_NX_CT_STATE
> >       ]])
> >     @@ -16065,7 +16067,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       # Since the sw0-vir is not claimed by any chassis, eth.dst should
> >     be set to
> >       # zero if the ip4.dst is the virtual ip in the router pipeline.
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     00:00:00:00:00:00; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     00:00:00:00:00:00; next;)
> >       ])
> >
> >       ip_to_hex() {
> >     @@ -16116,7 +16118,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       # There should be an arp resolve flow to resolve the virtual_ip
> >     with the
> >       # sw0-p1's MAC.
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:03; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:03; next;)
> >       ])
> >
> >       # Forcibly clear virtual_parent. ovn-controller should release the
> >     binding
> >     @@ -16157,7 +16159,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       # There should be an arp resolve flow to resolve the virtual_ip
> >     with the
> >       # sw0-p2's MAC.
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:05; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:05; next;)
> >       ])
> >
> >       # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
> >     @@ -16180,7 +16182,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       # There should be an arp resolve flow to resolve the virtual_ip
> >     with the
> >       # sw0-p3's MAC.
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:04; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:04; next;)
> >       ])
> >
> >       # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
> >     @@ -16201,7 +16203,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       > lflows.txt
> >
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:03; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:03; next;)
> >       ])
> >
> >       # Delete hv1-vif1 port. hv1 should release sw0-vir
> >     @@ -16219,7 +16221,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       > lflows.txt
> >
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     00:00:00:00:00:00; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     00:00:00:00:00:00; next;)
> >       ])
> >
> >       # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
> >     @@ -16240,7 +16242,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       > lflows.txt
> >
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:04; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:04; next;)
> >       ])
> >
> >       # Delete sw0-p2 logical port
> >     @@ -20274,22 +20276,22 @@ ovn-nbctl set logical_router_policy $pol5
> >     options:pkt_mark=5
> >       ovn-nbctl --wait=hv sync
> >
> >       OVS_WAIT_UNTIL([
> >     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x64->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >       OVS_WAIT_UNTIL([
> >     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x3->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >       OVS_WAIT_UNTIL([
> >     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x4->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >       OVS_WAIT_UNTIL([
> >     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x5->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >     @@ -20380,12 +20382,12 @@ send_ipv4_pkt hv1 hv1-vif1 505400000003
> >     00000000ff01 \
> >           $(ip_to_hex 10 0 0 3) $(ip_to_hex 172 168 0 120)
> >
> >       OVS_WAIT_UNTIL([
> >     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x2->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >       AT_CHECK([
> >     -    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x64->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >     @@ -20741,3 +20743,126 @@ AT_CHECK([test "$hv2_offlows" =
> >     "$hv2_offlows_mon"])
> >
> >       OVN_CLEANUP([hv1], [hv2])
> >       AT_CLEANUP
> >     +
> >     +AT_SETUP([ovn -- Symmetric ECMP reply flows])
> >     +ovn_start
> >     +
> >     +net_add n1
> >     +sim_add hv1
> >     +as hv1
> >     +ovs-vsctl add-br br-phys
> >     +ovn_attach n1 br-phys 192.168.0.1
> >     +
> >     +sim_add hv2
> >     +as hv2
> >     +ovs-vsctl add-br br-phys
> >     +ovn_attach n1 br-phys 192.168.0.2
> >     +
> >     +# Logical network
> >     +#
> >     +#   ls1 \
> >     +#        \
> >     +#         DR -- join -- GW -- ext
> >     +#        /
> >     +#   ls2 /
> >     +#
> >     +#  ls1 and ls2 are internal switches connected to distributed router
> >     +#  DR. DR is then connected via a join switch to gateway router GW.
> >     +#  GW is then connected to external switch ext. In real life, this
> >     +#  would likely have a localnet port, but for the purposes of this
> test
> >     +#  it is unnecessary.
> >     +
> >     +ovn-nbctl create Logical_Router name=DR
> >     +gw_uuid=$(ovn-nbctl create Logical_Router name=GW)
> >     +
> >     +ovn-nbctl ls-add ls1
> >     +ovn-nbctl ls-add ls2
> >     +ovn-nbctl ls-add join
> >     +ovn-nbctl ls-add ext
> >     +
> >     +# Connect ls1 to DR
> >     +ovn-nbctl lrp-add DR dr-ls1 00:00:01:01:02:03 10.0.0.1/24
> >     +ovn-nbctl lsp-add ls1 ls1-dr -- set Logical_Switch_Port ls1-dr \
> >     +    type=router options:router-port=dr-ls1
> >     addresses='"00:00:01:01:02:03"'
> >     +
> >     +# Connect ls2 to DR
> >     +ovn-nbctl lrp-add DR dr-ls2 00:00:01:01:02:04 10.0.0.2/24
> >     +ovn-nbctl lsp-add ls2 ls2-dr -- set Logical_Switch_Port ls2-dr \
> >     +    type=router options:router-port=dr-ls2
> >     addresses='"00:00:01:01:02:04"'
> >     +
> >     +# Connect join to DR
> >     +ovn-nbctl lrp-add DR dr-join 00:00:02:01:02:03 20.0.0.1/24
> >     +ovn-nbctl lsp-add join join-dr -- set Logical_Switch_Port join-dr \
> >     +    type=router options:router-port=dr-join
> >     addresses='"00:00:02:01:02:03"'
> >     +
> >     +# Connect join to GW
> >     +ovn-nbctl lrp-add GW gw-join 00:00:02:01:02:04 20.0.0.2/24
> >     +ovn-nbctl lsp-add join join-gw -- set Logical_Switch_Port join-gw \
> >     +    type=router options:router-port=gw-join
> >     addresses='"00:00:02:01:02:04"'
> >     +
> >     +# Connect ext to GW
> >     +ovn-nbctl lrp-add GW gw-ext 00:00:03:01:02:03 172.16.0.1/16
> >     +ovn-nbctl lsp-add ext ext-gw -- set Logical_Switch_Port ext-gw \
> >     +    type=router options:router-port=gw-ext
> >     addresses='"00:00:03:01:02:03"'
> >     +
> >     +ovn-nbctl lr-route-add GW 10.0.0.0/24 20.0.0.1
> >     +ovn-nbctl --policy="src-ip" lr-route-add DR 10.0.0.0/24 20.0.0.2
> >     +
> >     +# Now add some ECMP routes to the GW router.
> >     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add GW 10.0.0.0/24 172.16.0.2
> >     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add GW 10.0.0.0/24 172.16.0.3
> >     +
> >     +ovn-nbctl --wait=hv sync
> >     +
> >     +# Ensure ECMP symmetric reply flows are not present on any
> hypervisor.
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=15 | \
> >     +    grep "priority=100" | \
> >     +    grep
> >
>  "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))"
> >     -c)
> >     +])
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=21 | \
> >     +    grep "priority=200" | \
> >     +    grep
> >     "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]"
> -c)
> >     +])
> >     +
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=15 | \
> >     +    grep "priority=100" | \
> >     +    grep
> >
>  "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))"
> >     -c)
> >     +])
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=21 | \
> >     +    grep "priority=200" | \
> >     +    grep
> >     "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]"
> -c)
> >     +])
> >     +
> >     +# Now make GW a gateway router on hv1
> >     +ovn-nbctl set Logical_Router $gw_uuid options:chassis=hv1
> >     +ovn-nbctl --wait=hv sync
> >     +
> >     +# And ensure that ECMP symmetric reply flows are present only on hv1
> >     +AT_CHECK([
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=15 | \
> >     +    grep "priority=100" | \
> >     +    grep
> >
>  "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))"
> >     -c)
> >     +])
> >     +AT_CHECK([
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=21 | \
> >     +    grep "priority=200" | \
> >     +    grep
> >     "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]"
> -c)
> >     +])
> >     +
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=15 | \
> >     +    grep "priority=100" | \
> >     +    grep
> >
>  "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))"
> >     -c)
> >     +])
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=21 | \
> >     +    grep "priority=200" | \
> >     +    grep
> >     "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]"
> -c)
> >     +])
> >     +
> >     +OVN_CLEANUP([hv1], [hv2])
> >     +AT_CLEANUP
> >     diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> >     index eddc530f9..e239b7394 100644
> >     --- a/tests/system-ovn.at
> >     +++ b/tests/system-ovn.at
> >     @@ -4483,3 +4483,147 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query
> >     port patch-.*/d
> >       /connection dropped.*/d"])
> >
> >       AT_CLEANUP
> >     +
> >     +AT_SETUP([ovn -- ECMP symmetric reply])
> >     +AT_KEYWORDS([ecmp])
> >     +
> >     +CHECK_CONNTRACK()
> >     +ovn_start
> >     +
> >     +OVS_TRAFFIC_VSWITCHD_START()
> >     +ADD_BR([br-int])
> >     +
> >     +# Set external-ids in br-int needed for ovn-controller
> >     +ovs-vsctl \
> >     +        -- set Open_vSwitch . external-ids:system-id=hv1 \
> >     +        -- set Open_vSwitch .
> >     external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
> >     +        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
> >     +        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
> >     +        -- set bridge br-int fail-mode=secure
> >     other-config:disable-in-band=true
> >     +
> >     +# Start ovn-controller
> >     +start_daemon ovn-controller
> >     +
> >     +# Logical network:
> >     +# Alice is connected to gateway router R1. R1 is connected to two
> >     "external"
> >     +# routers, R2 and R3 via an "ext" switch.
> >     +# Bob is connected to both R2 and R3. R1 contains two ECMP routes,
> >     one through R2
> >     +# and one through R3, to Bob.
> >     +#
> >     +#     alice -- R1 -- ext ---- R2
> >     +#                     |         \
> >     +#                     |           bob
> >     +#                     |         /
> >     +#                     + ----- R3
> >     +#
> >     +# For this test, Bob sends request traffic through R2 to Alice. We
> >     want to ensure that
> >     +# all response traffic from Alice is routed through R2 as well.
> >     +
> >     +ovn-nbctl create Logical_Router name=R1 options:chassis=hv1
> >     +ovn-nbctl create Logical_Router name=R2
> >     +ovn-nbctl create Logical_Router name=R3
> >     +
> >     +ovn-nbctl ls-add alice
> >     +ovn-nbctl ls-add bob
> >     +ovn-nbctl ls-add ext
> >     +
> >     +# connect alice to R1
> >     +ovn-nbctl lrp-add R1 alice 00:00:01:01:02:03 10.0.0.1/24
> >     +ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port
> rp-alice \
> >     +    type=router options:router-port=alice
> >     addresses='"00:00:01:01:02:03"'
> >     +
> >     +# connect bob to R2
> >     +ovn-nbctl lrp-add R2 R2_bob 00:00:02:01:02:03 172.16.0.2/16
> >     +ovn-nbctl lsp-add bob rp2-bob -- set Logical_Switch_Port rp2-bob \
> >     +    type=router options:router-port=R2_bob
> >     addresses='"00:00:02:01:02:03"'
> >     +
> >     +# connect bob to R3
> >     +ovn-nbctl lrp-add R3 R3_bob 00:00:02:01:02:04 172.16.0.3/16
> >     +ovn-nbctl lsp-add bob rp3-bob -- set Logical_Switch_Port rp3-bob \
> >     +    type=router options:router-port=R3_bob
> >     addresses='"00:00:02:01:02:04"'
> >     +
> >     +# Connect R1 to ext
> >     +ovn-nbctl lrp-add R1 R1_ext 00:00:04:01:02:03 20.0.0.1/24
> >     +ovn-nbctl lsp-add ext r1-ext -- set Logical_Switch_Port r1-ext \
> >     +    type=router options:router-port=R1_ext
> >     addresses='"00:00:04:01:02:03"'
> >     +
> >     +# Connect R2 to ext
> >     +ovn-nbctl lrp-add R2 R2_ext 00:00:04:01:02:04 20.0.0.2/24
> >     +ovn-nbctl lsp-add ext r2-ext -- set Logical_Switch_Port r2-ext \
> >     +    type=router options:router-port=R2_ext
> >     addresses='"00:00:04:01:02:04"'
> >     +
> >     +# Connect R3 to ext
> >     +ovn-nbctl lrp-add R3 R3_ext 00:00:04:01:02:05 20.0.0.3/24
> >     +ovn-nbctl lsp-add ext r3-ext -- set Logical_Switch_Port r3-ext \
> >     +    type=router options:router-port=R3_ext
> >     addresses='"00:00:04:01:02:05"'
> >     +
> >     +# Install ECMP routes for alice.
> >     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1 10.0.0.0/24 20.0.0.2
> >     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1 10.0.0.0/24 20.0.0.3
> >     +
> >     +# Static Routes
> >     +ovn-nbctl lr-route-add R2 10.0.0.0/24 20.0.0.1
> >     +ovn-nbctl lr-route-add R3 10.0.0.0/24 20.0.0.1
> >     +
> >     +# Logical port 'alice1' in switch 'alice'.
> >     +ADD_NAMESPACES(alice1)
> >     +ADD_VETH(alice1, alice1, br-int, "10.0.0.2/24", "f0:00:00:01:02:04", \
> >     +         "10.0.0.1")
> >     +ovn-nbctl lsp-add alice alice1 \
> >     +-- lsp-set-addresses alice1 "f0:00:00:01:02:04 10.0.0.2"
> >     +
> >     +# Logical port 'bob1' in switch 'bob'.
> >     +ADD_NAMESPACES(bob1)
> >     +ADD_VETH(bob1, bob1, br-int, "172.16.0.1/16", "f0:00:00:01:02:06", \
> >     +         "172.16.0.2")
> >     +ovn-nbctl lsp-add bob bob1 \
> >     +-- lsp-set-addresses bob1 "f0:00:00:01:02:06 172.16.0.1"
> >     +
> >     +# Ensure ovn-controller is caught up
> >     +ovn-nbctl --wait=hv sync
> >     +
> >     +on_exit 'ovs-ofctl dump-flows br-int'
> >     +
> >     +# 'bob1' should be able to ping 'alice1' directly.
> >     +NS_CHECK_EXEC([bob1], [ping -q -c 20 -i 0.3 -w 15 10.0.0.2 |
> >     FORMAT_PING], \
> >     +[0], [dnl
> >     +20 packets transmitted, 20 received, 0% packet loss, time 0ms
> >     +])
> >     +
> >     +# Ensure conntrack entry is present. We should not try to predict
> >     +# the tunnel key for the output port, so we strip it from the labels
> >     +# and just ensure that the known ethernet address is present.
> >     +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.0.1) |
> \
> >     +sed -e 's/zone=[[0-9]]*/zone=<cleared>/' |
> >     +sed -e
> >
>  's/labels=0x[[0-9a-f]]*00000401020400000000/labels=0x00000401020400000000/'],
> >     [0], [dnl
> >
>  +icmp,orig=(src=172.16.0.1,dst=10.0.0.2,id=<cleared>,type=8,code=0),reply=(src=10.0.0.2,dst=172.16.0.1,id=<cleared>,type=0,code=0),zone=<cleared>,labels=0x00000401020400000000
> >     +])
> >     +
> >     +# Ensure datapaths show conntrack states as expected
> >     +# Like with conntrack entries, we shouldn't try to predict
> >     +# port binding tunnel keys. So omit them from expected labels.
> >     +AT_CHECK([ovs-appctl dpctl/dump-flows | grep
> >
>  'ct_state(+new-est-rpl+trk).*ct(.*label=0x.*00000401020400000000/0xffffffffffffffff00000000)'
> >     -c], [0], [dnl
> >     +1
> >     +])
> >     +AT_CHECK([ovs-appctl dpctl/dump-flows | grep
> >
>  'ct_state(-new+est+rpl+trk).*ct_label(0x.*00000401020400000000/0xffffffffffffffff00000000)'
> >     -c], [0], [dnl
> >     +1
> >     +])
> >     +
> >     +ovs-ofctl dump-flows br-int
> >     +
> >     +OVS_APP_EXIT_AND_WAIT([ovn-controller])
> >     +
> >     +as ovn-sb
> >     +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >     +
> >     +as ovn-nb
> >     +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >     +
> >     +as northd
> >     +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> >     +
> >     +as
> >     +OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
> >     +/connection dropped.*/d"])
> >     +
> >     +AT_CLEANUP
> >     diff --git a/utilities/ovn-nbctl.8.xml b/utilities/ovn-nbctl.8.xml
> >     index de86b70e6..18bf90e08 100644
> >     --- a/utilities/ovn-nbctl.8.xml
> >     +++ b/utilities/ovn-nbctl.8.xml
> >     @@ -658,7 +658,8 @@
> >
> >           <dl>
> >             <dt>[<code>--may-exist</code>]
> >     [<code>--policy</code>=<var>POLICY</var>]
> >     -        [<code>--ecmp</code>] <code>lr-route-add</code>
> >     <var>router</var>
> >     +        [<code>--ecmp</code>] [<code>--ecmp-symmetric-reply</code>]
> >     +        <code>lr-route-add</code> <var>router</var>
> >               <var>prefix</var> <var>nexthop</var> [<var>port</var>]</dt>
> >             <dd>
> >               <p>
> >     @@ -680,15 +681,31 @@
> >                 specified, the default is "dst-ip".
> >               </p>
> >
> >     +        <p>
> >     +          The <code>--ecmp</code> option allows for multiple routes
> >     with the
> >     +          same <var>prefix</var> <var>POLICY</var> but different
> >     +          <var>nexthop</var> and <var>port</var> to be added.
> >     +        </p>
> >     +
> >     +        <p>
> >     +          The <code>--ecmp-symmetric-reply</code> option makes it
> >     so that
> >     +          traffic that arrives over an ECMP route will have its
> >     reply traffic
> >     +          sent out over that same route. Setting
> >     +          <code>--ecmp-symmetric-reply</code> implies
> >     <code>--ecmp</code> so
> >     +          it is not necessary to set both.
> >     +        </p>
> >     +
> >               <p>
> >                 It is an error if a route with <var>prefix</var> and
> >     -          <var>POLICY</var> already exists, unless
> >     <code>--may-exist</code> or
> >     -          <code>--ecmp</code> is specified.  If
> >     <code>--may-exist</code> is
> >     -          specified but not <code>--ecmp</code>, the existed route
> >     will be
> >     -          updated with the new nexthop and port.  If
> >     <code>--ecmp</code> is
> >     +          <var>POLICY</var> already exists, unless
> >     <code>--may-exist</code>,
> >     +          <code>--ecmp</code>, or
> >     <code>--ecmp-symmetric-reply</code> is
> >     +          specified.  If <code>--may-exist</code> is specified but
> not
> >     +          <code>--ecmp</code> or
> >     <code>--ecmp-symmetric-reply</code>, the
> >     +          existed route will be updated with the new nexthop and
> >     port.  If
> >     +          <code>--ecmp</code> or
> <code>--ecmp-symmetric-reply</code> is
> >                 specified, a new route will be added, regardless of the
> >     existed
> >     -          route, which is useful when adding ECMP routes, i.e.
> >     routes with same
> >     -          <var>POLICY</var> and <var>prefix</var> but different
> >     +          route, which is useful when adding ECMP routes, i.e.
> >     routes with
> >     +          same <var>POLICY</var> and <var>prefix</var> but different
> >                 <var>nexthop</var> and <var>port</var>.
> >               </p>
> >             </dd>
> >     diff --git a/utilities/ovn-nbctl.c b/utilities/ovn-nbctl.c
> >     index 0079ad5a6..e6d8dbe63 100644
> >     --- a/utilities/ovn-nbctl.c
> >     +++ b/utilities/ovn-nbctl.c
> >     @@ -687,7 +687,8 @@ Logical router port commands:\n\
> >                                   ('overlay' or 'bridged')\n\
> >       \n\
> >       Route commands:\n\
> >     -  [--policy=POLICY] [--ecmp] lr-route-add ROUTER PREFIX NEXTHOP
> >     [PORT]\n\
> >     +  [--policy=POLICY] [--ecmp] [--ecmp-symmetric-reply] lr-route-add
> >     ROUTER \n\
> >     +                            PREFIX NEXTHOP [PORT]\n\
> >                                   add a route to ROUTER\n\
> >         [--policy=POLICY] lr-route-del ROUTER [PREFIX [NEXTHOP
> [PORT]]]\n\
> >                                   remove routes from ROUTER\n\
> >     @@ -3855,7 +3856,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
> >           }
> >
> >           bool may_exist = shash_find(&ctx->options, "--may-exist") !=
> NULL;
> >     -    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL;
> >     +    bool ecmp_symmetric_reply = shash_find(&ctx->options,
> >     +
> >       "--ecmp-symmetric-reply") != NULL;
> >     +    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL ||
> >     +                ecmp_symmetric_reply;
> >           if (!ecmp) {
> >               for (int i = 0; i < lr->n_static_routes; i++) {
> >                   const struct nbrec_logical_router_static_route *route
> >     @@ -3920,6 +3924,13 @@ nbctl_lr_route_add(struct ctl_context *ctx)
> >               nbrec_logical_router_static_route_set_policy(route,
> policy);
> >           }
> >
> >     +    if (ecmp_symmetric_reply) {
> >     +        const struct smap options = SMAP_CONST1(&options,
> >     +
> "ecmp_symmetric_reply",
> >     +                                                "true");
> >     +        nbrec_logical_router_static_route_set_options(route,
> &options);
> >     +    }
> >     +
> >           nbrec_logical_router_verify_static_routes(lr);
> >           struct nbrec_logical_router_static_route **new_routes
> >               = xmalloc(sizeof *new_routes * (lr->n_static_routes + 1));
> >     @@ -6361,7 +6372,8 @@ static const struct ctl_command_syntax
> >     nbctl_commands[] = {
> >
> >           /* logical router route commands. */
> >           { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
> >     -      nbctl_lr_route_add, NULL, "--may-exist,--ecmp,--policy=", RW
> },
> >     +      nbctl_lr_route_add, NULL,
> >     "--may-exist,--ecmp,--ecmp-symmetric-reply,"
> >     +      "--policy=", RW },
> >           { "lr-route-del", 1, 4, "ROUTER [PREFIX [NEXTHOP [PORT]]]",
> NULL,
> >             nbctl_lr_route_del, NULL, "--if-exists,--policy=", RW },
> >           { "lr-route-list", 1, 1, "ROUTER", NULL, nbctl_lr_route_list,
> >     NULL,
> >     --
> >     2.25.4
> >
> >
>
>

Patch

diff --git a/lib/logical-fields.c b/lib/logical-fields.c
index fde53a47e..15342dded 100644
--- a/lib/logical-fields.c
+++ b/lib/logical-fields.c
@@ -130,6 +130,10 @@  ovn_init_symtab(struct shash *symtab)
                                  WR_CT_COMMIT);
     expr_symtab_add_subfield_scoped(symtab, "ct_label.blocked", NULL,
                                     "ct_label[0]", WR_CT_COMMIT);
+    expr_symtab_add_subfield_scoped(symtab, "ct_label.ecmp_reply_eth", NULL,
+                                    "ct_label[32..79]", WR_CT_COMMIT);
+    expr_symtab_add_subfield_scoped(symtab, "ct_label.ecmp_reply_port", NULL,
+                                    "ct_label[80..95]", WR_CT_COMMIT);
 
     expr_symtab_add_field(symtab, "ct_state", MFF_CT_STATE, NULL, false);
 
diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index eb2514f15..cf251e02a 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -2120,15 +2120,31 @@  icmp6 {
     <p>
       This is to send packets to connection tracker for tracking and
       defragmentation.  It contains a priority-0 flow that simply moves traffic
-      to the next table.  If load balancing rules with virtual IP addresses
-      (and ports) are configured in <code>OVN_Northbound</code> database for a
-      Gateway router, a priority-100 flow is added for each configured virtual
-      IP address <var>VIP</var>. For IPv4 <var>VIPs</var> the flow matches
-      <code>ip &amp;&amp; ip4.dst == <var>VIP</var></code>.  For IPv6
-      <var>VIPs</var>, the flow matches <code>ip &amp;&amp; ip6.dst ==
-      <var>VIP</var></code>.  The flow uses the action <code>ct_next;</code>
-      to send IP packets to the connection tracker for packet de-fragmentation
-      and tracking before sending it to the next table.
+      to the next table.
+    </p>
+
+    <p>
+      If load balancing rules with virtual IP addresses (and ports) are
+      configured in <code>OVN_Northbound</code> database for a Gateway router,
+      a priority-100 flow is added for each configured virtual IP address
+      <var>VIP</var>. For IPv4 <var>VIPs</var> the flow matches <code>ip
+      &amp;&amp; ip4.dst == <var>VIP</var></code>.  For IPv6 <var>VIPs</var>,
+      the flow matches <code>ip &amp;&amp; ip6.dst == <var>VIP</var></code>.
+      The flow uses the action <code>ct_next;</code> to send IP packets to the
+      connection tracker for packet de-fragmentation and tracking before
+      sending it to the next table.
+    </p>
+
+    <p>
+      If ECMP routes with symmetric reply are configured in the
+      <code>OVN_Northbound</code> database for a gateway router, a priority-100
+      flow is added for each router port on which symmetric replies are
+      configured. The matching logic for these ports essentially reverses the
+      configured logic of the ECMP route. So for instance, a route with a
+      destination routing policy will instead match if the source IP address
+      matches the static route's prefix. The flow uses the action
+      <code>ct_next</code> to send IP packets to the connection tracker for
+      packet de-fragmentation and tracking before sending it to the next table.
     </p>
 
     <h3>Ingress Table 5: UNSNAT</h3>
@@ -2489,7 +2505,15 @@  output;
       table.  This table, instead, is responsible for determine the ECMP
       group id and select a member id within the group based on 5-tuple
       hashing.  It stores group id in <code>reg8[0..15]</code> and member id in
-      <code>reg8[16..31]</code>.
+      <code>reg8[16..31]</code>. This step is skipped if the traffic going
+      out the ECMP route is reply traffic, and the ECMP route was configured
+      to use symmetric replies. Instead, the stored <code>ct_label</code> value
+      is used to choose the destination. Bits 32 through 79 of the
+      <code>ct_label</code> hold the destination MAC address to which the
+      packet should be sent. Bits 80 through 95 hold the logical router port
+      on which the packet should be sent. These values in the
+      <code>ct_label</code> are set when the initial ingress traffic is
+      received over the ECMP route.
     </p>
 
     <p>
@@ -2639,6 +2663,11 @@  select(reg8[16..31], <var>MID1</var>, <var>MID2</var>, ...);
       address and <code>reg1</code> as the source protocol address).
     </p>
 
+    <p>
+      This processing is skipped for reply traffic being sent out of an ECMP
+      route if the route was configured to use symmetric replies.
+    </p>
+
     <p>
       This table contains the following logical flows:
     </p>
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index d10e5ee5d..85f04ccde 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -172,16 +172,17 @@  enum ovn_stage {
     PIPELINE_STAGE(ROUTER, IN,  DEFRAG,          4, "lr_in_defrag")       \
     PIPELINE_STAGE(ROUTER, IN,  UNSNAT,          5, "lr_in_unsnat")       \
     PIPELINE_STAGE(ROUTER, IN,  DNAT,            6, "lr_in_dnat")         \
-    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   7, "lr_in_nd_ra_options") \
-    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  8, "lr_in_nd_ra_response") \
-    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9, "lr_in_ip_routing")   \
-    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10, "lr_in_ip_routing_ecmp") \
-    PIPELINE_STAGE(ROUTER, IN,  POLICY,          11, "lr_in_policy")       \
-    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12, "lr_in_arp_resolve")  \
-    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13, "lr_in_chk_pkt_len")   \
-    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     14,"lr_in_larger_pkts")   \
-    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15, "lr_in_gw_redirect")  \
-    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16, "lr_in_arp_request")  \
+    PIPELINE_STAGE(ROUTER, IN,  ECMP_STATEFUL,   7, "lr_in_ecmp_stateful") \
+    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   8, "lr_in_nd_ra_options") \
+    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  9, "lr_in_nd_ra_response") \
+    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      10, "lr_in_ip_routing")   \
+    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 11, "lr_in_ip_routing_ecmp") \
+    PIPELINE_STAGE(ROUTER, IN,  POLICY,          12, "lr_in_policy")       \
+    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13, "lr_in_arp_resolve")  \
+    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14, "lr_in_chk_pkt_len")   \
+    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     15,"lr_in_larger_pkts")   \
+    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16, "lr_in_gw_redirect")  \
+    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17, "lr_in_arp_request")  \
                                                                       \
     /* Logical router egress stages. */                               \
     PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
@@ -7312,6 +7313,7 @@  struct parsed_route {
     bool is_src_route;
     uint32_t hash;
     const struct nbrec_logical_router_static_route *route;
+    bool ecmp_symmetric_reply;
 };
 
 static uint32_t
@@ -7373,6 +7375,8 @@  parsed_routes_add(struct ovs_list *routes,
                                                  "src-ip"));
     pr->hash = route_hash(pr);
     pr->route = route;
+    pr->ecmp_symmetric_reply = smap_get_bool(&route->options,
+                                             "ecmp_symmetric_reply", false);
     ovs_list_insert(routes, &pr->list_node);
     return pr;
 }
@@ -7621,18 +7625,95 @@  find_static_route_outport(struct ovn_datapath *od, struct hmap *ports,
     return true;
 }
 
+static void
+add_ecmp_symmetric_reply_flows(struct hmap *lflows,
+                               struct ovn_datapath *od,
+                               const char *port_ip,
+                               struct ovn_port *out_port,
+                               const struct parsed_route *route,
+                               struct ds *route_match)
+{
+    const struct nbrec_logical_router_static_route *st_route = route->route;
+    struct ds match = DS_EMPTY_INITIALIZER;
+    struct ds actions = DS_EMPTY_INITIALIZER;
+    struct ds ecmp_reply = DS_EMPTY_INITIALIZER;
+    char *cidr = normalize_v46_prefix(&route->prefix, route->plen);
+
+    /* If symmetric ECMP replies are enabled, then packets that arrive over
+     * an ECMP route need to go through conntrack.
+     */
+    ds_put_format(&match, "inport == %s && ip%s.%s == %s",
+                  out_port->json_key,
+                  route->prefix.family == AF_INET ? "4" : "6",
+                  route->is_src_route ? "dst" : "src",
+                  cidr);
+    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
+                            ds_cstr(&match), "ct_next;",
+                            &st_route->header_);
+
+    /* And packets that go out over an ECMP route need conntrack. */
+    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
+                            ds_cstr(route_match), "ct_next;",
+                            &st_route->header_);
+
+    /* Save src eth and inport in ct_label for packets that arrive over
+     * an ECMP route.
+     *
+     * NOTE: we purposely are not clearing match before this
+     * ds_put_cstr() call. The previous contents are needed.
+     */
+    ds_put_cstr(&match, " && (ct.new && !ct.est)");
+
+    ds_put_format(&actions, "ct_commit { ct_label.ecmp_reply_eth = eth.src;"
+                  " ct_label.ecmp_reply_port = %" PRId64 ";}; next;",
+                  out_port->sb->tunnel_key);
+    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ECMP_STATEFUL, 100,
+                            ds_cstr(&match), ds_cstr(&actions),
+                            &st_route->header_);
+
+    /* Bypass ECMP selection if we already have ct_label information
+     * for where to route the packet.
+     */
+    ds_put_format(&ecmp_reply, "ct.rpl && ct_label.ecmp_reply_port == %"
+                  PRId64, out_port->sb->tunnel_key);
+    ds_clear(&match);
+    ds_put_format(&match, "%s && %s", ds_cstr(&ecmp_reply),
+                  ds_cstr(route_match));
+    ds_clear(&actions);
+    ds_put_format(&actions, "ip.ttl--; flags.loopback = 1; "
+                  "eth.src = %s; %sreg1 = %s; outport = %s; next;",
+                  out_port->lrp_networks.ea_s,
+                  route->prefix.family == AF_INET ? "" : "xx",
+                  port_ip, out_port->json_key);
+    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_IP_ROUTING, 100,
+                            ds_cstr(&match), ds_cstr(&actions),
+                            &st_route->header_);
+
+    /* Egress reply traffic for symmetric ECMP routes skips router policies. */
+    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_POLICY, 65535,
+                            ds_cstr(&ecmp_reply), "next;",
+                            &st_route->header_);
+
+    ds_clear(&actions);
+    ds_put_cstr(&actions, "eth.dst = ct_label.ecmp_reply_eth; next;");
+    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ARP_RESOLVE,
+                            200, ds_cstr(&ecmp_reply),
+                            ds_cstr(&actions), &st_route->header_);
+}
+
 static void
 build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
                       struct hmap *ports, struct ecmp_groups_node *eg)
 
 {
     bool is_ipv4 = (eg->prefix.family == AF_INET);
-    struct ds match = DS_EMPTY_INITIALIZER;
     uint16_t priority;
+    struct ecmp_route_list_node *er;
+    struct ds route_match = DS_EMPTY_INITIALIZER;
 
     char *prefix_s = build_route_prefix_s(&eg->prefix, eg->plen);
     build_route_match(NULL, prefix_s, eg->plen, eg->is_src_route, is_ipv4,
-                      &match, &priority);
+                      &route_match, &priority);
     free(prefix_s);
 
     struct ds actions = DS_EMPTY_INITIALIZER;
@@ -7640,7 +7721,6 @@  build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
                   "; %s = select(", REG_ECMP_GROUP_ID, eg->id,
                   REG_ECMP_MEMBER_ID);
 
-    struct ecmp_route_list_node *er;
     bool is_first = true;
     LIST_FOR_EACH (er, list_node, &eg->route_list) {
         if (is_first) {
@@ -7654,11 +7734,12 @@  build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
     ds_put_cstr(&actions, ");");
 
     ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
-                  ds_cstr(&match), ds_cstr(&actions));
+                  ds_cstr(&route_match), ds_cstr(&actions));
 
     /* Add per member flow */
+    struct ds match = DS_EMPTY_INITIALIZER;
+    struct sset visited_ports = SSET_INITIALIZER(&visited_ports);
     LIST_FOR_EACH (er, list_node, &eg->route_list) {
-
         const struct parsed_route *route_ = er->route;
         const struct nbrec_logical_router_static_route *route = route_->route;
         /* Find the outgoing port. */
@@ -7668,6 +7749,15 @@  build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
                                        &out_port)) {
             continue;
         }
+        /* Symmetric ECMP reply is only usable on gateway routers.
+         * It is NOT usable on distributed routers with a gateway port.
+         */
+        if (smap_get(&od->nbr->options, "chassis") &&
+            route_->ecmp_symmetric_reply && sset_add(&visited_ports,
+                                                     out_port->key)) {
+            add_ecmp_symmetric_reply_flows(lflows, od, lrp_addr_s, out_port,
+                                           route_, &route_match);
+        }
         ds_clear(&match);
         ds_put_format(&match, REG_ECMP_GROUP_ID" == %"PRIu16" && "
                       REG_ECMP_MEMBER_ID" == %"PRIu16,
@@ -7688,7 +7778,9 @@  build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
                                 ds_cstr(&match), ds_cstr(&actions),
                                 &route->header_);
     }
+    sset_destroy(&visited_ports);
     ds_destroy(&match);
+    ds_destroy(&route_match);
     ds_destroy(&actions);
 }
 
@@ -8972,6 +9064,7 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
         ovn_lflow_add(lflows, od, S_ROUTER_IN_DNAT, 0, "1", "next;");
         ovn_lflow_add(lflows, od, S_ROUTER_OUT_UNDNAT, 0, "1", "next;");
         ovn_lflow_add(lflows, od, S_ROUTER_OUT_EGR_LOOP, 0, "1", "next;");
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_ECMP_STATEFUL, 0, "1", "next;");
 
         /* Send the IPv6 NS packets to next table. When ovn-controller
          * generates IPv6 NS (for the action - nd_ns{}), the injected
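The ingress match built in add_ecmp_symmetric_reply_flows() reverses the route's own policy: a src-ip route matches on ip.dst for arriving traffic, and a dst-ip route matches on ip.src. A minimal standalone sketch of that string construction follows, using plain snprintf() instead of the OVS dynamic-string API. The function name `ecmp_ingress_match` and its parameters are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>   /* strcmp(), handy when checking the result */

/* Hypothetical helper, not part of ovn-northd.  Writes the reversed
 * match for traffic arriving over an ECMP route into 'buf' and returns
 * true if the whole string fit. */
static bool
ecmp_ingress_match(char *buf, size_t len, const char *inport_json,
                   bool is_ipv4, bool is_src_route, const char *cidr)
{
    assert(buf && len > 0);
    int n = snprintf(buf, len, "inport == %s && ip%s.%s == %s",
                     inport_json,
                     is_ipv4 ? "4" : "6",
                     /* Reverse the route policy for arriving traffic. */
                     is_src_route ? "dst" : "src",
                     cidr);
    return n >= 0 && (size_t) n < len;
}
```

For the src-ip routes that the tests in this patch add on GW (prefix 10.0.0.0/24, port gw-ext), this sketch yields `inport == "gw-ext" && ip4.dst == 10.0.0.0/24`.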
diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
index 246cebc19..b1a462933 100644
--- a/ovn-architecture.7.xml
+++ b/ovn-architecture.7.xml
@@ -1210,11 +1210,12 @@ 
     <dd>
       Fields that denote the connection tracking zones for routers.  These
       values only have local significance and are not meaningful between
-      chassis.  OVN stores the zone information for DNATting in Open vSwitch
+      chassis.  OVN stores the zone information for north to south traffic
+      (for DNATting or ECMP symmetric replies) in Open vSwitch
         <!-- Keep the following in sync with MFF_LOG_DNAT_ZONE and
         MFF_LOG_SNAT_ZONE in ovn/lib/logical-fields.h. -->
-      extension register number 11 and zone information for SNATing in
-      Open vSwitch extension register number 12.
+      extension register number 11 and zone information for south to north
+      traffic (for SNATing) in Open vSwitch extension register number 12.
     </dd>
 
     <dt>logical flow flags</dt>
diff --git a/ovn-nb.ovsschema b/ovn-nb.ovsschema
index da9af7157..16f7794f2 100644
--- a/ovn-nb.ovsschema
+++ b/ovn-nb.ovsschema
@@ -1,7 +1,7 @@ 
 {
     "name": "OVN_Northbound",
     "version": "5.24.0",
-    "cksum": "1092394564 25961",
+    "cksum": "679745602 26116",
     "tables": {
         "NB_Global": {
             "columns": {
@@ -365,6 +365,9 @@ 
                                     "min": 0, "max": 1}},
                 "nexthop": {"type": "string"},
                 "output_port": {"type": {"key": "string", "min": 0, "max": 1}},
+                "options": {
+                    "type": {"key": "string", "value": "string",
+                             "min": 0, "max": "unlimited"}},
                 "external_ids": {
                     "type": {"key": "string", "value": "string",
                              "min": 0, "max": "unlimited"}}},
diff --git a/ovn-nb.xml b/ovn-nb.xml
index db5908cd5..5e434d257 100644
--- a/ovn-nb.xml
+++ b/ovn-nb.xml
@@ -2481,6 +2481,22 @@ 
       </column>
     </group>
 
+    <group title="Common options">
+      <column name="options">
+        This column provides general key/value settings. The supported
+        options are described individually below.
+      </column>
+
+      <column name="options" key="ecmp_symmetric_reply">
+        If true, then new traffic that arrives over this route will have
+        its reply traffic bypass ECMP route selection and will be sent out
+        this route instead. Note that this option overrides any rules set
+        in the <ref table="Logical_Router_Policy" /> table. This option
+        only works on gateway routers (routers that have
+        <ref column="options" key="chassis" table="Logical_Router" /> set).
+      </column>
+    </group>
+
   </table>
 
   <table name="Logical_Router_Policy" title="Logical router policies">
diff --git a/tests/ovn.at b/tests/ovn.at
index f8dde14c2..c1ab6b85f 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -195,6 +195,8 @@  ct.snat = ct_state[6]
 ct.trk = ct_state[5]
 ct_label = NXM_NX_CT_LABEL
 ct_label.blocked = ct_label[0]
+ct_label.ecmp_reply_eth = ct_label[32..79]
+ct_label.ecmp_reply_port = ct_label[80..95]
 ct_mark = NXM_NX_CT_MARK
 ct_state = NXM_NX_CT_STATE
 ]])
@@ -16065,7 +16067,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 # Since the sw0-vir is not claimed by any chassis, eth.dst should be set to
 # zero if the ip4.dst is the virtual ip in the router pipeline.
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
 ])
 
 ip_to_hex() {
@@ -16116,7 +16118,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 # There should be an arp resolve flow to resolve the virtual_ip with the
 # sw0-p1's MAC.
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
 ])
 
 # Forcibly clear virtual_parent. ovn-controller should release the binding
@@ -16157,7 +16159,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 # There should be an arp resolve flow to resolve the virtual_ip with the
 # sw0-p2's MAC.
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
 ])
 
 # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
@@ -16180,7 +16182,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 # There should be an arp resolve flow to resolve the virtual_ip with the
 # sw0-p3's MAC.
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
 ])
 
 # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
@@ -16201,7 +16203,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 > lflows.txt
 
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
 ])
 
 # Delete hv1-vif1 port. hv1 should release sw0-vir
@@ -16219,7 +16221,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 > lflows.txt
 
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
 ])
 
 # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
@@ -16240,7 +16242,7 @@  ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
 > lflows.txt
 
 AT_CHECK([cat lflows.txt], [0], [dnl
-  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
+  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport == "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
 ])
 
 # Delete sw0-p2 logical port
@@ -20274,22 +20276,22 @@  ovn-nbctl set logical_router_policy $pol5 options:pkt_mark=5
 ovn-nbctl --wait=hv sync
 
 OVS_WAIT_UNTIL([
-    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
+    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
     grep "load:0x64->NXM_NX_PKT_MARK" -c)
 ])
 
 OVS_WAIT_UNTIL([
-    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
+    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
     grep "load:0x3->NXM_NX_PKT_MARK" -c)
 ])
 
 OVS_WAIT_UNTIL([
-    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
+    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
     grep "load:0x4->NXM_NX_PKT_MARK" -c)
 ])
 
 OVS_WAIT_UNTIL([
-    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
+    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
     grep "load:0x5->NXM_NX_PKT_MARK" -c)
 ])
 
@@ -20380,12 +20382,12 @@  send_ipv4_pkt hv1 hv1-vif1 505400000003 00000000ff01 \
     $(ip_to_hex 10 0 0 3) $(ip_to_hex 172 168 0 120)
 
 OVS_WAIT_UNTIL([
-    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
+    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
     grep "load:0x2->NXM_NX_PKT_MARK" -c)
 ])
 
 AT_CHECK([
-    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
+    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
     grep "load:0x64->NXM_NX_PKT_MARK" -c)
 ])
 
@@ -20741,3 +20743,126 @@  AT_CHECK([test "$hv2_offlows" = "$hv2_offlows_mon"])
 
 OVN_CLEANUP([hv1], [hv2])
 AT_CLEANUP
+
+AT_SETUP([ovn -- Symmetric ECMP reply flows])
+ovn_start
+
+net_add n1
+sim_add hv1
+as hv1
+ovs-vsctl add-br br-phys
+ovn_attach n1 br-phys 192.168.0.1
+
+sim_add hv2
+as hv2
+ovs-vsctl add-br br-phys
+ovn_attach n1 br-phys 192.168.0.2
+
+# Logical network
+#
+#   ls1 \
+#        \
+#         DR -- join -- GW -- ext
+#        /
+#   ls2 /
+#
+#  ls1 and ls2 are internal switches connected to distributed router
+#  DR. DR is then connected via a join switch to gateway router GW.
+#  GW is then connected to external switch ext. In real life, this
+#  would likely have a localnet port, but for the purposes of this test
+#  it is unnecessary.
+
+ovn-nbctl create Logical_Router name=DR
+gw_uuid=$(ovn-nbctl create Logical_Router name=GW)
+
+ovn-nbctl ls-add ls1
+ovn-nbctl ls-add ls2
+ovn-nbctl ls-add join
+ovn-nbctl ls-add ext
+
+# Connect ls1 to DR
+ovn-nbctl lrp-add DR dr-ls1 00:00:01:01:02:03 10.0.0.1/24
+ovn-nbctl lsp-add ls1 ls1-dr -- set Logical_Switch_Port ls1-dr \
+    type=router options:router-port=dr-ls1 addresses='"00:00:01:01:02:03"'
+
+# Connect ls2 to DR
+ovn-nbctl lrp-add DR dr-ls2 00:00:01:01:02:04 10.0.0.2/24
+ovn-nbctl lsp-add ls2 ls2-dr -- set Logical_Switch_Port ls2-dr \
+    type=router options:router-port=dr-ls2 addresses='"00:00:01:01:02:04"'
+
+# Connect join to DR
+ovn-nbctl lrp-add DR dr-join 00:00:02:01:02:03 20.0.0.1/24
+ovn-nbctl lsp-add join join-dr -- set Logical_Switch_Port join-dr \
+    type=router options:router-port=dr-join addresses='"00:00:02:01:02:03"'
+
+# Connect join to GW
+ovn-nbctl lrp-add GW gw-join 00:00:02:01:02:04 20.0.0.2/24
+ovn-nbctl lsp-add join join-gw -- set Logical_Switch_Port join-gw \
+    type=router options:router-port=gw-join addresses='"00:00:02:01:02:04"'
+
+# Connect ext to GW
+ovn-nbctl lrp-add GW gw-ext 00:00:03:01:02:03 172.16.0.1/16
+ovn-nbctl lsp-add ext ext-gw -- set Logical_Switch_Port ext-gw \
+    type=router options:router-port=gw-ext addresses='"00:00:03:01:02:03"'
+
+ovn-nbctl lr-route-add GW 10.0.0.0/24 20.0.0.1
+ovn-nbctl --policy="src-ip" lr-route-add DR 10.0.0.0/24 20.0.0.2
+
+# Now add some ECMP routes to the GW router.
+ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add GW 10.0.0.0/24 172.16.0.2
+ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add GW 10.0.0.0/24 172.16.0.3
+
+ovn-nbctl --wait=hv sync
+
+# Ensure ECMP symmetric reply flows are not present on any hypervisor.
+AT_CHECK([
+    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=15 | \
+    grep "priority=100" | \
+    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
+])
+AT_CHECK([
+    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=21 | \
+    grep "priority=200" | \
+    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
+])
+
+AT_CHECK([
+    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=15 | \
+    grep "priority=100" | \
+    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
+])
+AT_CHECK([
+    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=21 | \
+    grep "priority=200" | \
+    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
+])
+
+# Now make GW a gateway router on hv1
+ovn-nbctl set Logical_Router $gw_uuid options:chassis=hv1
+ovn-nbctl --wait=hv sync
+
+# And ensure that ECMP symmetric reply flows are present only on hv1
+AT_CHECK([
+    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=15 | \
+    grep "priority=100" | \
+    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
+])
+AT_CHECK([
+    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=21 | \
+    grep "priority=200" | \
+    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
+])
+
+AT_CHECK([
+    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=15 | \
+    grep "priority=100" | \
+    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
+])
+AT_CHECK([
+    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=21 | \
+    grep "priority=200" | \
+    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
+])
+
+OVN_CLEANUP([hv1], [hv2])
+AT_CLEANUP
diff --git a/tests/system-ovn.at b/tests/system-ovn.at
index eddc530f9..e239b7394 100644
--- a/tests/system-ovn.at
+++ b/tests/system-ovn.at
@@ -4483,3 +4483,147 @@  OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
 /connection dropped.*/d"])
 
 AT_CLEANUP
+
+AT_SETUP([ovn -- ECMP symmetric reply])
+AT_KEYWORDS([ecmp])
+
+CHECK_CONNTRACK()
+ovn_start
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-int])
+
+# Set external-ids in br-int needed for ovn-controller
+ovs-vsctl \
+        -- set Open_vSwitch . external-ids:system-id=hv1 \
+        -- set Open_vSwitch . external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
+        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
+        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
+        -- set bridge br-int fail-mode=secure other-config:disable-in-band=true
+
+# Start ovn-controller
+start_daemon ovn-controller
+
+# Logical network:
+# Alice is connected to gateway router R1. R1 is connected to two "external"
+# routers, R2 and R3 via an "ext" switch.
+# Bob is connected to both R2 and R3. R1 contains two ECMP routes, one through R2
+# and one through R3, to Bob.
+#
+#     alice -- R1 -- ext ---- R2
+#                     |         \
+#                     |           bob
+#                     |         /
+#                     + ----- R3
+#
+# For this test, Bob sends request traffic through R2 to Alice. We want to ensure that
+# all response traffic from Alice is routed through R2 as well.
+
+ovn-nbctl create Logical_Router name=R1 options:chassis=hv1
+ovn-nbctl create Logical_Router name=R2
+ovn-nbctl create Logical_Router name=R3
+
+ovn-nbctl ls-add alice
+ovn-nbctl ls-add bob
+ovn-nbctl ls-add ext
+
+# connect alice to R1
+ovn-nbctl lrp-add R1 alice 00:00:01:01:02:03 10.0.0.1/24
+ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
+    type=router options:router-port=alice addresses='"00:00:01:01:02:03"'
+
+# connect bob to R2
+ovn-nbctl lrp-add R2 R2_bob 00:00:02:01:02:03 172.16.0.2/16
+ovn-nbctl lsp-add bob rp2-bob -- set Logical_Switch_Port rp2-bob \
+    type=router options:router-port=R2_bob addresses='"00:00:02:01:02:03"'
+
+# connect bob to R3
+ovn-nbctl lrp-add R3 R3_bob 00:00:02:01:02:04 172.16.0.3/16
+ovn-nbctl lsp-add bob rp3-bob -- set Logical_Switch_Port rp3-bob \
+    type=router options:router-port=R3_bob addresses='"00:00:02:01:02:04"'
+
+# Connect R1 to ext
+ovn-nbctl lrp-add R1 R1_ext 00:00:04:01:02:03 20.0.0.1/24
+ovn-nbctl lsp-add ext r1-ext -- set Logical_Switch_Port r1-ext \
+    type=router options:router-port=R1_ext addresses='"00:00:04:01:02:03"'
+
+# Connect R2 to ext
+ovn-nbctl lrp-add R2 R2_ext 00:00:04:01:02:04 20.0.0.2/24
+ovn-nbctl lsp-add ext r2-ext -- set Logical_Switch_Port r2-ext \
+    type=router options:router-port=R2_ext addresses='"00:00:04:01:02:04"'
+
+# Connect R3 to ext
+ovn-nbctl lrp-add R3 R3_ext 00:00:04:01:02:05 20.0.0.3/24
+ovn-nbctl lsp-add ext r3-ext -- set Logical_Switch_Port r3-ext \
+    type=router options:router-port=R3_ext addresses='"00:00:04:01:02:05"'
+
+# Install ECMP routes for alice.
+ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1 10.0.0.0/24 20.0.0.2
+ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1 10.0.0.0/24 20.0.0.3
+
+# Static Routes
+ovn-nbctl lr-route-add R2 10.0.0.0/24 20.0.0.1
+ovn-nbctl lr-route-add R3 10.0.0.0/24 20.0.0.1
+
+# Logical port 'alice1' in switch 'alice'.
+ADD_NAMESPACES(alice1)
+ADD_VETH(alice1, alice1, br-int, "10.0.0.2/24", "f0:00:00:01:02:04", \
+         "10.0.0.1")
+ovn-nbctl lsp-add alice alice1 \
+-- lsp-set-addresses alice1 "f0:00:00:01:02:04 10.0.0.2"
+
+# Logical port 'bob1' in switch 'bob'.
+ADD_NAMESPACES(bob1)
+ADD_VETH(bob1, bob1, br-int, "172.16.0.1/16", "f0:00:00:01:02:06", \
+         "172.16.0.2")
+ovn-nbctl lsp-add bob bob1 \
+-- lsp-set-addresses bob1 "f0:00:00:01:02:06 172.16.0.1"
+
+# Ensure ovn-controller is caught up
+ovn-nbctl --wait=hv sync
+
+on_exit 'ovs-ofctl dump-flows br-int'
+
+# 'bob1' should be able to ping 'alice1' directly.
+NS_CHECK_EXEC([bob1], [ping -q -c 20 -i 0.3 -w 15 10.0.0.2 | FORMAT_PING], \
+[0], [dnl
+20 packets transmitted, 20 received, 0% packet loss, time 0ms
+])
+
+# Ensure conntrack entry is present. We should not try to predict
+# the tunnel key for the output port, so we strip it from the labels
+# and just ensure that the known ethernet address is present.
+AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.0.1) | \
+sed -e 's/zone=[[0-9]]*/zone=<cleared>/' |
+sed -e 's/labels=0x[[0-9a-f]]*00000401020400000000/labels=0x00000401020400000000/'], [0], [dnl
+icmp,orig=(src=172.16.0.1,dst=10.0.0.2,id=<cleared>,type=8,code=0),reply=(src=10.0.0.2,dst=172.16.0.1,id=<cleared>,type=0,code=0),zone=<cleared>,labels=0x00000401020400000000
+])
+
+# Ensure datapaths show conntrack states as expected
+# Like with conntrack entries, we shouldn't try to predict
+# port binding tunnel keys. So omit them from expected labels.
+AT_CHECK([ovs-appctl dpctl/dump-flows | grep 'ct_state(+new-est-rpl+trk).*ct(.*label=0x.*00000401020400000000/0xffffffffffffffff00000000)' -c], [0], [dnl
+1
+])
+AT_CHECK([ovs-appctl dpctl/dump-flows | grep 'ct_state(-new+est+rpl+trk).*ct_label(0x.*00000401020400000000/0xffffffffffffffff00000000)' -c], [0], [dnl
+1
+])
+
+ovs-ofctl dump-flows br-int
+
+OVS_APP_EXIT_AND_WAIT([ovn-controller])
+
+as ovn-sb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as ovn-nb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as northd
+OVS_APP_EXIT_AND_WAIT([ovn-northd])
+
+as
+OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
+/connection dropped.*/d"])
+
+AT_CLEANUP
diff --git a/utilities/ovn-nbctl.8.xml b/utilities/ovn-nbctl.8.xml
index de86b70e6..18bf90e08 100644
--- a/utilities/ovn-nbctl.8.xml
+++ b/utilities/ovn-nbctl.8.xml
@@ -658,7 +658,8 @@ 
 
     <dl>
       <dt>[<code>--may-exist</code>] [<code>--policy</code>=<var>POLICY</var>]
-        [<code>--ecmp</code>] <code>lr-route-add</code> <var>router</var>
+        [<code>--ecmp</code>] [<code>--ecmp-symmetric-reply</code>]
+        <code>lr-route-add</code> <var>router</var>
         <var>prefix</var> <var>nexthop</var> [<var>port</var>]</dt>
       <dd>
         <p>
@@ -680,15 +681,31 @@ 
           specified, the default is "dst-ip".
         </p>
 
+        <p>
+          The <code>--ecmp</code> option allows for multiple routes with the
+          same <var>prefix</var> and <var>POLICY</var> but different
+          <var>nexthop</var> and <var>port</var> to be added.
+        </p>
+
+        <p>
+          The <code>--ecmp-symmetric-reply</code> option makes it so that
+          traffic that arrives over an ECMP route will have its reply traffic
+          sent out over that same route. Setting
+          <code>--ecmp-symmetric-reply</code> implies <code>--ecmp</code> so
+          it is not necessary to set both.
+        </p>
+
         <p>
           It is an error if a route with <var>prefix</var> and
-          <var>POLICY</var> already exists, unless <code>--may-exist</code> or
-          <code>--ecmp</code> is specified.  If <code>--may-exist</code> is
-          specified but not <code>--ecmp</code>, the existed route will be
-          updated with the new nexthop and port.  If <code>--ecmp</code> is
+          <var>POLICY</var> already exists, unless <code>--may-exist</code>,
+          <code>--ecmp</code>, or <code>--ecmp-symmetric-reply</code> is
+          specified.  If <code>--may-exist</code> is specified but not
+          <code>--ecmp</code> or <code>--ecmp-symmetric-reply</code>, the
+          existed route will be updated with the new nexthop and port.  If
+          <code>--ecmp</code> or <code>--ecmp-symmetric-reply</code> is
           specified, a new route will be added, regardless of the existed
-          route, which is useful when adding ECMP routes, i.e. routes with same
-          <var>POLICY</var> and <var>prefix</var> but different
+          route, which is useful when adding ECMP routes, i.e. routes with
+          same <var>POLICY</var> and <var>prefix</var> but different
           <var>nexthop</var> and <var>port</var>.
         </p>
       </dd>
diff --git a/utilities/ovn-nbctl.c b/utilities/ovn-nbctl.c
index 0079ad5a6..e6d8dbe63 100644
--- a/utilities/ovn-nbctl.c
+++ b/utilities/ovn-nbctl.c
@@ -687,7 +687,8 @@  Logical router port commands:\n\
                             ('overlay' or 'bridged')\n\
 \n\
 Route commands:\n\
-  [--policy=POLICY] [--ecmp] lr-route-add ROUTER PREFIX NEXTHOP [PORT]\n\
+  [--policy=POLICY] [--ecmp] [--ecmp-symmetric-reply] lr-route-add ROUTER\n\
+                            PREFIX NEXTHOP [PORT]\n\
                             add a route to ROUTER\n\
   [--policy=POLICY] lr-route-del ROUTER [PREFIX [NEXTHOP [PORT]]]\n\
                             remove routes from ROUTER\n\
@@ -3855,7 +3856,10 @@  nbctl_lr_route_add(struct ctl_context *ctx)
     }
 
     bool may_exist = shash_find(&ctx->options, "--may-exist") != NULL;
-    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL;
+    bool ecmp_symmetric_reply = shash_find(&ctx->options,
+                                           "--ecmp-symmetric-reply") != NULL;
+    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL ||
+                ecmp_symmetric_reply;
     if (!ecmp) {
         for (int i = 0; i < lr->n_static_routes; i++) {
             const struct nbrec_logical_router_static_route *route
@@ -3920,6 +3924,13 @@  nbctl_lr_route_add(struct ctl_context *ctx)
         nbrec_logical_router_static_route_set_policy(route, policy);
     }
 
+    if (ecmp_symmetric_reply) {
+        const struct smap options = SMAP_CONST1(&options,
+                                                "ecmp_symmetric_reply",
+                                                "true");
+        nbrec_logical_router_static_route_set_options(route, &options);
+    }
+
     nbrec_logical_router_verify_static_routes(lr);
     struct nbrec_logical_router_static_route **new_routes
         = xmalloc(sizeof *new_routes * (lr->n_static_routes + 1));
@@ -6361,7 +6372,8 @@  static const struct ctl_command_syntax nbctl_commands[] = {
 
     /* logical router route commands. */
     { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
-      nbctl_lr_route_add, NULL, "--may-exist,--ecmp,--policy=", RW },
+      nbctl_lr_route_add, NULL, "--may-exist,--ecmp,--ecmp-symmetric-reply,"
+      "--policy=", RW },
     { "lr-route-del", 1, 4, "ROUTER [PREFIX [NEXTHOP [PORT]]]", NULL,
       nbctl_lr_route_del, NULL, "--if-exists,--policy=", RW },
     { "lr-route-list", 1, 1, "ROUTER", NULL, nbctl_lr_route_list, NULL,