diff mbox series

[ovs-dev,5/6] Implement MTU Path Discovery for multichassis ports

Message ID 20230503011239.2100488-6-ihrachys@redhat.com
State New, archived
Headers show
Series Implement MTU Path Discovery for multichassis ports | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_ovn-kubernetes success github build: passed
ovsrobot/github-robot-_Build_and_Test fail github build: failed

Commit Message

Ihar Hrachyshka May 3, 2023, 1:12 a.m. UTC
When a multichassis port belongs to a switch with a localnet port,
packets originating or directed to the multichassis port are NOT sent
thorugh the localnet port. Instead, tunneling is enforced in-cluster to
guarantee delivery of all packets to all chassis of the port.

This behavior has an unfortunate side effect, where - because of
additional tunnel header added to each packet - the effective MTU of the
path for multichassis ports changes from what's set as mtu_request. This
effectively makes OVN to black hole all packets for the port that use
full capacity of the interface MTU. This breaks usual TCP / UDP
services, among other things (SSH, iperf sessions etc.)

This patch adds flows so that
- (in table 38) detect too-big packets (table 38), and then
- (in table 39) icmp fragmentation needed / too big errors are sent
  back to offending port.

Once the error is received, the sender is expected to adjust the route
MTU accordingly, sending the next packets with the new path MTU. After a
multichassis port is re-assigned to a single chassis, the effective path
MTU is restored to "usual". Peers will eventually see their "learned"
path MTU cache expire, which will make them switch back to the "usual"
MTU.

Among other scenarios, this patch helps to maintain existing services
working during live migration of a VM, if multichassis ports are used.
(E.g. in OpenStack Nueutron.)

Fixes: 7084cf437421 ("Always funnel multichassis port traffic through tunnels")

Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
---
 controller/ovn-controller.c |   3 +
 controller/physical.c       | 297 +++++++++++++++++++++++++++++++++++-
 controller/physical.h       |   1 +
 lib/ovn-util.h              |  11 ++
 tests/ovn.at                | 262 +++++++++++++++++++++++++++++++
 5 files changed, 567 insertions(+), 7 deletions(-)

Comments

Mark Michelson May 3, 2023, 8:54 p.m. UTC | #1
On 5/2/23 21:12, Ihar Hrachyshka wrote:
> When a multichassis port belongs to a switch with a localnet port,
> packets originating or directed to the multichassis port are NOT sent
> thorugh the localnet port. Instead, tunneling is enforced in-cluster to
> guarantee delivery of all packets to all chassis of the port.
> 
> This behavior has an unfortunate side effect, where - because of
> additional tunnel header added to each packet - the effective MTU of the
> path for multichassis ports changes from what's set as mtu_request. This
> effectively makes OVN to black hole all packets for the port that use
> full capacity of the interface MTU. This breaks usual TCP / UDP
> services, among other things (SSH, iperf sessions etc.)
> 
> This patch adds flows so that
> - (in table 38) detect too-big packets (table 38), and then
> - (in table 39) icmp fragmentation needed / too big errors are sent
>    back to offending port.
> 
> Once the error is received, the sender is expected to adjust the route
> MTU accordingly, sending the next packets with the new path MTU. After a
> multichassis port is re-assigned to a single chassis, the effective path
> MTU is restored to "usual". Peers will eventually see their "learned"
> path MTU cache expire, which will make them switch back to the "usual"
> MTU.
> 
> Among other scenarios, this patch helps to maintain existing services
> working during live migration of a VM, if multichassis ports are used.
> (E.g. in OpenStack Nueutron.)
> 
> Fixes: 7084cf437421 ("Always funnel multichassis port traffic through tunnels")
> 
> Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
> ---
>   controller/ovn-controller.c |   3 +
>   controller/physical.c       | 297 +++++++++++++++++++++++++++++++++++-
>   controller/physical.h       |   1 +
>   lib/ovn-util.h              |  11 ++
>   tests/ovn.at                | 262 +++++++++++++++++++++++++++++++
>   5 files changed, 567 insertions(+), 7 deletions(-)
> 
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index c094cb74d..9359925fa 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -4083,6 +4083,9 @@ static void init_physical_ctx(struct engine_node *node,
>       p_ctx->patch_ofports = &non_vif_data->patch_ofports;
>       p_ctx->chassis_tunnels = &non_vif_data->chassis_tunnels;
>   
> +    struct controller_engine_ctx *ctrl_ctx = engine_get_context()->client_ctx;
> +    p_ctx->if_mgr = ctrl_ctx->if_mgr;
> +
>       pflow_output_get_debug(node, &p_ctx->debug);
>   }
>   
> diff --git a/controller/physical.c b/controller/physical.c
> index 1b0482e3b..1c1018616 100644
> --- a/controller/physical.c
> +++ b/controller/physical.c
> @@ -41,6 +41,7 @@
>   #include "lib/ovn-sb-idl.h"
>   #include "lib/ovn-util.h"
>   #include "ovn/actions.h"
> +#include "if-status.h"
>   #include "physical.h"
>   #include "pinctrl.h"
>   #include "openvswitch/shash.h"
> @@ -91,6 +92,7 @@ physical_register_ovs_idl(struct ovsdb_idl *ovs_idl)
>   
>       ovsdb_idl_add_table(ovs_idl, &ovsrec_table_interface);
>       ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_name);
> +    ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_mtu);
>       ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_ofport);
>       ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_external_ids);
>   }
> @@ -1104,6 +1106,273 @@ setup_activation_strategy(const struct sbrec_port_binding *binding,
>       }
>   }
>   
> +static size_t
> +encode_start_controller_op(enum action_opcode opcode, bool pause,
> +                           uint32_t meter_id, struct ofpbuf *ofpacts)
> +{
> +    size_t ofs = ofpacts->size;
> +
> +    struct ofpact_controller *oc = ofpact_put_CONTROLLER(ofpacts);
> +    oc->max_len = UINT16_MAX;
> +    oc->reason = OFPR_ACTION;
> +    oc->pause = pause;
> +    if (!ovs_feature_is_supported(OVS_DP_METER_SUPPORT)) {
> +        meter_id = NX_CTLR_NO_METER;
> +    }
> +    oc->meter_id = meter_id;
> +
> +    struct action_header ah = { .opcode = htonl(opcode) };
> +    ofpbuf_put(ofpacts, &ah, sizeof ah);
> +
> +    return ofs;
> +}
> +
> +static void
> +encode_finish_controller_op(size_t ofs, struct ofpbuf *ofpacts)
> +{
> +    struct ofpact_controller *oc = ofpbuf_at_assert(ofpacts, ofs, sizeof *oc);
> +    ofpacts->header = oc;
> +    oc->userdata_len = ofpacts->size - (ofs + sizeof *oc);
> +    ofpact_finish_CONTROLLER(ofpacts, &oc);
> +}

ecnode_start_controller_op() and encode_finish_controller_op() appear to 
be copied from lib/actions.c . I think it's reasonable to expose these 
functions in lib/actions.h and call them instead of redefining them.

> +
> +/*
> + * Insert a flow to determine if an IP packet is too big for the corresponding
> + * egress interface.
> + */
> +static void
> +determine_if_pkt_too_big(struct ovn_desired_flow_table *flow_table,
> +                         const struct sbrec_port_binding *binding,
> +                         const struct sbrec_port_binding *mcp,
> +                         uint16_t mtu, bool is_ipv6, int direction)
> +{
> +    struct ofpbuf ofpacts;
> +    ofpbuf_init(&ofpacts, 0);
> +
> +    /* Store packet too large flag in reg9[1]. */
> +    struct match match;
> +    match_init_catchall(&match);
> +    match_set_dl_type(&match, htons(is_ipv6 ? ETH_TYPE_IPV6 : ETH_TYPE_IP));
> +    match_set_metadata(&match, htonll(binding->datapath->tunnel_key));
> +    match_set_reg(&match, direction - MFF_REG0, mcp->tunnel_key);
> +
> +    /* reg9[1] is REGBIT_PKT_LARGER as defined by northd */
> +    struct ofpact_check_pkt_larger *pkt_larger =
> +        ofpact_put_CHECK_PKT_LARGER(&ofpacts);
> +    pkt_larger->pkt_len = mtu;
> +    pkt_larger->dst.field = mf_from_id(MFF_REG9);
> +    pkt_larger->dst.ofs = 1;
> +
> +    put_resubmit(OFTABLE_OUTPUT_LARGE_PKT_PROCESS, &ofpacts);
> +    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_DETECT, 100,
> +                    binding->header_.uuid.parts[0], &match, &ofpacts,
> +                    &binding->header_.uuid);
> +    ofpbuf_uninit(&ofpacts);
> +}
> +
> +/*
> + * Insert a flow to reply with ICMP error for IP packets that are too big for
> + * the corresponding egress interface.
> + */
> +/*
> + * NOTE(ihrachys) This reimplements icmp_error as found in
> + * build_icmperr_pkt_big_flows. We may look into reusing the existing OVN
> + * action for this flow in the future.
> + */
> +static void
> +reply_imcp_error_if_pkt_too_big(struct ovn_desired_flow_table *flow_table,
> +                                const struct sbrec_port_binding *binding,
> +                                const struct sbrec_port_binding *mcp,
> +                                uint16_t mtu, bool is_ipv6, int direction)
> +{
> +    struct match match;
> +    match_init_catchall(&match);
> +    match_set_dl_type(&match, htons(is_ipv6 ? ETH_TYPE_IPV6 : ETH_TYPE_IP));
> +    match_set_metadata(&match, htonll(binding->datapath->tunnel_key));
> +    match_set_reg(&match, direction - MFF_REG0, mcp->tunnel_key);
> +    match_set_reg_masked(&match, MFF_REG9 - MFF_REG0, 1 << 1, 1 << 1);
> +
> +    /* Return ICMP error with a part of the original IP packet included. */
> +    struct ofpbuf ofpacts;
> +    ofpbuf_init(&ofpacts, 0);
> +    size_t oc_offset = encode_start_controller_op(
> +        ACTION_OPCODE_ICMP, true, NX_CTLR_NO_METER, &ofpacts);
> +
> +    struct ofpbuf inner_ofpacts;
> +    ofpbuf_init(&inner_ofpacts, 0);
> +
> +    /* The error packet is no longer too large, set REGBIT_PKT_LARGER = 0 */
> +    /* reg9[1] is REGBIT_PKT_LARGER as defined by northd */
> +    ovs_be32 value = htonl(0);
> +    ovs_be32 mask = htonl(1 << 1);
> +    ofpact_put_set_field(
> +        &inner_ofpacts, mf_from_id(MFF_REG9), &value, &mask);
> +
> +    /* The new error packet is delivered locally */
> +    /* REGBIT_EGRESS_LOOPBACK = 1 */
> +    value = htonl(1 << MLF_ALLOW_LOOPBACK_BIT);
> +    mask = htonl(1 << MLF_ALLOW_LOOPBACK_BIT);
> +    ofpact_put_set_field(
> +        &inner_ofpacts, mf_from_id(MFF_LOG_FLAGS), &value, &mask);
> +
> +    /* eth.src <-> eth.dst */
> +    put_stack(MFF_ETH_DST, ofpact_put_STACK_PUSH(&inner_ofpacts));
> +    put_stack(MFF_ETH_SRC, ofpact_put_STACK_PUSH(&inner_ofpacts));
> +    put_stack(MFF_ETH_DST, ofpact_put_STACK_POP(&inner_ofpacts));
> +    put_stack(MFF_ETH_SRC, ofpact_put_STACK_POP(&inner_ofpacts));
> +
> +    /* ip.src <-> ip.dst */
> +    put_stack(is_ipv6 ? MFF_IPV6_DST : MFF_IPV4_DST,
> +        ofpact_put_STACK_PUSH(&inner_ofpacts));
> +    put_stack(is_ipv6 ? MFF_IPV6_SRC : MFF_IPV4_SRC,
> +        ofpact_put_STACK_PUSH(&inner_ofpacts));
> +    put_stack(is_ipv6 ? MFF_IPV6_DST : MFF_IPV4_DST,
> +        ofpact_put_STACK_POP(&inner_ofpacts));
> +    put_stack(is_ipv6 ? MFF_IPV6_SRC : MFF_IPV4_SRC,
> +        ofpact_put_STACK_POP(&inner_ofpacts));
> +
> +    /* ip.ttl = 255 */
> +    struct ofpact_ip_ttl *ip_ttl = ofpact_put_SET_IP_TTL(&inner_ofpacts);
> +    ip_ttl->ttl = 255;
> +
> +    uint16_t frag_mtu = mtu - ETHERNET_OVERHEAD;
> +    size_t frag_mtu_oc_offset;
> +    if (is_ipv6) {
> +        /* icmp6.type = 2 (Packet Too Big) */
> +        /* icmp6.code = 0 */
> +        uint8_t icmp_type = 2;
> +        uint8_t icmp_code = 0;
> +        ofpact_put_set_field(
> +            &inner_ofpacts, mf_from_id(MFF_ICMPV6_TYPE), &icmp_type, NULL);
> +        ofpact_put_set_field(
> +            &inner_ofpacts, mf_from_id(MFF_ICMPV6_CODE), &icmp_code, NULL);
> +
> +        /* icmp6.frag_mtu */
> +        frag_mtu_oc_offset = encode_start_controller_op(
> +            ACTION_OPCODE_PUT_ICMP6_FRAG_MTU, true, NX_CTLR_NO_METER,
> +            &inner_ofpacts);
> +        ovs_be32 frag_mtu_ovs = htonl(frag_mtu);
> +        ofpbuf_put(&inner_ofpacts, &frag_mtu_ovs, sizeof(frag_mtu_ovs));
> +    } else {
> +        /* icmp4.type = 3 (Destination Unreachable) */
> +        /* icmp4.code = 4 (Fragmentation Needed) */
> +        uint8_t icmp_type = 3;
> +        uint8_t icmp_code = 4;
> +        ofpact_put_set_field(
> +            &inner_ofpacts, mf_from_id(MFF_ICMPV4_TYPE), &icmp_type, NULL);
> +        ofpact_put_set_field(
> +            &inner_ofpacts, mf_from_id(MFF_ICMPV4_CODE), &icmp_code, NULL);
> +
> +        /* icmp4.frag_mtu = */
> +        frag_mtu_oc_offset = encode_start_controller_op(
> +            ACTION_OPCODE_PUT_ICMP4_FRAG_MTU, true, NX_CTLR_NO_METER,
> +            &inner_ofpacts);
> +        ovs_be16 frag_mtu_ovs = htons(frag_mtu);
> +        ofpbuf_put(&inner_ofpacts, &frag_mtu_ovs, sizeof(frag_mtu_ovs));
> +    }
> +    encode_finish_controller_op(frag_mtu_oc_offset, &inner_ofpacts);
> +
> +    /* Finally, submit the ICMP error back to the ingress pipeline */
> +    put_resubmit(OFTABLE_LOG_INGRESS_PIPELINE, &inner_ofpacts);
> +
> +    /* Attach nested actions to ICMP error controller handler */
> +    ofpacts_put_openflow_actions(inner_ofpacts.data, inner_ofpacts.size,
> +                                 &ofpacts, OFP15_VERSION);
> +
> +    /* Finalize the ICMP error controller handler */
> +    encode_finish_controller_op(oc_offset, &ofpacts);
> +
> +    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 100,
> +                    binding->header_.uuid.parts[0], &match, &ofpacts,
> +                    &binding->header_.uuid);
> +
> +    ofpbuf_uninit(&inner_ofpacts);
> +    ofpbuf_uninit(&ofpacts);
> +}
> +
> +static uint16_t
> +get_tunnel_overhead(struct chassis_tunnel const *tun)
> +{
> +    uint16_t overhead = 0;
> +    enum chassis_tunnel_type type = tun->type;
> +    if (type == GENEVE) {
> +        overhead += GENEVE_TUNNEL_OVERHEAD;
> +    } else if (type == STT) {
> +        overhead += STT_TUNNEL_OVERHEAD;
> +    } else if (type == VXLAN) {
> +        overhead += VXLAN_TUNNEL_OVERHEAD;
> +    } else {
> +        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
> +        VLOG_WARN_RL(&rl, "Unknown tunnel type %d, can't determine overhead "
> +                          "size for Path MTU Discovery", type);
> +        return 0;
> +    }
> +    overhead += tun->is_ipv6? IPV6_HEADER_LEN : IPV4_HEADER_LEN;
> +    return overhead;
> +}
> +
> +static uint16_t
> +get_effective_mtu(const struct sbrec_port_binding *mcp,
> +                  struct ovs_list *remote_tunnels,
> +                  const struct if_status_mgr *if_mgr)
> +{
> +    /* Use interface MTU as a base for calculation */
> +    uint16_t iface_mtu = if_status_mgr_iface_get_mtu(if_mgr,
> +                                                     mcp->logical_port);
> +    if (!iface_mtu) {
> +        return 0;
> +    }
> +
> +    /* Iterate over all peer tunnels and find the biggest tunnel overhead */
> +    uint16_t overhead = 0;
> +    struct tunnel *tun;
> +    LIST_FOR_EACH (tun, list_node, remote_tunnels) {
> +        uint16_t tunnel_overhead = get_tunnel_overhead(tun->tun);
> +        if (tunnel_overhead > overhead) {
> +            overhead = tunnel_overhead;
> +        }
> +    }
> +    if (!overhead) {
> +        return 0;
> +    }
> +
> +    return iface_mtu - overhead;
> +}
> +
> +static void
> +handle_pkt_too_big_for_ip_version(struct ovn_desired_flow_table *flow_table,
> +                                  const struct sbrec_port_binding *binding,
> +                                  const struct sbrec_port_binding *mcp,
> +                                  uint16_t mtu, bool is_ipv6)
> +{
> +    /* ingress */
> +    determine_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
> +                             MFF_LOG_INPORT);
> +    reply_imcp_error_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
> +                                    MFF_LOG_INPORT);
> +
> +    /* egress */
> +    determine_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
> +                             MFF_LOG_OUTPORT);
> +    reply_imcp_error_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
> +                                    MFF_LOG_OUTPORT);
> +}
> +
> +static void
> +handle_pkt_too_big(struct ovn_desired_flow_table *flow_table,
> +                   struct ovs_list *remote_tunnels,
> +                   const struct sbrec_port_binding *binding,
> +                   const struct sbrec_port_binding *mcp,
> +                   const struct if_status_mgr *if_mgr)
> +{
> +    uint16_t mtu = get_effective_mtu(mcp, remote_tunnels, if_mgr);
> +    if (!mtu) {
> +        return;
> +    }
> +    handle_pkt_too_big_for_ip_version(flow_table, binding, mcp, mtu, false);
> +    handle_pkt_too_big_for_ip_version(flow_table, binding, mcp, mtu, true);
> +}
> +
>   static void
>   enforce_tunneling_for_multichassis_ports(
>       struct local_datapath *ld,
> @@ -1111,7 +1380,8 @@ enforce_tunneling_for_multichassis_ports(
>       const struct sbrec_chassis *chassis,
>       const struct hmap *chassis_tunnels,
>       enum mf_field_id mff_ovn_geneve,
> -    struct ovn_desired_flow_table *flow_table)
> +    struct ovn_desired_flow_table *flow_table,
> +    const struct if_status_mgr *if_mgr)
>   {
>       if (shash_is_empty(&ld->multichassis_ports)) {
>           return;
> @@ -1156,6 +1426,8 @@ enforce_tunneling_for_multichassis_ports(
>                           binding->header_.uuid.parts[0], &match, &ofpacts,
>                           &binding->header_.uuid);
>           ofpbuf_uninit(&ofpacts);
> +
> +        handle_pkt_too_big(flow_table, tuns, binding, mcp, if_mgr);
>       }
>   
>       struct tunnel *tun_elem;
> @@ -1177,6 +1449,7 @@ consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
>                         const struct sbrec_port_binding *binding,
>                         const struct sbrec_chassis *chassis,
>                         const struct physical_debug *debug,
> +                      const struct if_status_mgr *if_mgr,
>                         struct ovn_desired_flow_table *flow_table,
>                         struct ofpbuf *ofpacts_p)
>   {
> @@ -1602,8 +1875,10 @@ consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
>                           binding->header_.uuid.parts[0],
>                           &match, ofpacts_p, &binding->header_.uuid);
>   
> -        enforce_tunneling_for_multichassis_ports(
> -            ld, binding, chassis, chassis_tunnels, mff_ovn_geneve, flow_table);
> +        enforce_tunneling_for_multichassis_ports(ld, binding, chassis,
> +                                                 chassis_tunnels,
> +                                                 mff_ovn_geneve, flow_table,
> +                                                 if_mgr);
>   
>           /* No more tunneling to set up. */
>           goto out;
> @@ -1908,7 +2183,7 @@ physical_eval_port_binding(struct physical_ctx *p_ctx,
>                             p_ctx->patch_ofports,
>                             p_ctx->chassis_tunnels,
>                             pb, p_ctx->chassis, &p_ctx->debug,
> -                          flow_table, &ofpacts);
> +                          p_ctx->if_mgr, flow_table, &ofpacts);
>       ofpbuf_uninit(&ofpacts);
>   }
>   
> @@ -2032,7 +2307,7 @@ physical_run(struct physical_ctx *p_ctx,
>                                 p_ctx->patch_ofports,
>                                 p_ctx->chassis_tunnels, binding,
>                                 p_ctx->chassis, &p_ctx->debug,
> -                              flow_table, &ofpacts);
> +                              p_ctx->if_mgr, flow_table, &ofpacts);
>       }
>   
>       /* Handle output to multicast groups, in tables 40 and 41. */
> @@ -2176,11 +2451,19 @@ physical_run(struct physical_ctx *p_ctx,
>       ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_DETECT, 0, 0, &match,
>                       &ofpacts, hc_uuid);
>   
> +    match_init_catchall(&match);
> +    match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0,
> +                         MLF_ALLOW_LOOPBACK, MLF_ALLOW_LOOPBACK);
> +    ofpbuf_clear(&ofpacts);
> +    put_resubmit(OFTABLE_LOCAL_OUTPUT, &ofpacts);
> +    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 10, 0,
> +                    &match, &ofpacts, hc_uuid);
> +
>       match_init_catchall(&match);
>       ofpbuf_clear(&ofpacts);
>       put_resubmit(OFTABLE_REMOTE_OUTPUT, &ofpacts);
> -    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 0, 0, &match,
> -                    &ofpacts, hc_uuid);
> +    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 0, 0,
> +                    &match, &ofpacts, hc_uuid);
>   
>       /* Table 40, priority 150.
>        * =======================
> diff --git a/controller/physical.h b/controller/physical.h
> index f450dca94..396bcb138 100644
> --- a/controller/physical.h
> +++ b/controller/physical.h
> @@ -57,6 +57,7 @@ struct physical_ctx {
>       const struct sbrec_chassis_table *chassis_table;
>       const struct sbrec_chassis *chassis;
>       const struct sset *active_tunnels;
> +    const struct if_status_mgr *if_mgr;
>       struct hmap *local_datapaths;
>       struct sset *local_lports;
>       const struct simap *ct_zones;
> diff --git a/lib/ovn-util.h b/lib/ovn-util.h
> index 7cf861dbc..7ec2bca48 100644
> --- a/lib/ovn-util.h
> +++ b/lib/ovn-util.h
> @@ -29,6 +29,17 @@
>   #define ROUTE_ORIGIN_CONNECTED "connected"
>   #define ROUTE_ORIGIN_STATIC "static"
>   
> +#define ETH_HEADER_LENGTH 14
> +#define ETH_CRC_LENGTH 4
> +#define ETHERNET_OVERHEAD (ETH_HEADER_LENGTH + ETH_CRC_LENGTH)
> +
> +#define IPV4_HEADER_LEN 20
> +#define IPV6_HEADER_LEN 40

ETH_HEADER_LENGTH, IPV4_HEADER_LEN, and IPV6_HEADER_LEN are already 
defined in OVS in lib/packets.h. You can reuse those values instead of 
redefining them.

> +
> +#define GENEVE_TUNNEL_OVERHEAD 38
> +#define STT_TUNNEL_OVERHEAD 18
> +#define VXLAN_TUNNEL_OVERHEAD 30
> +
>   struct eth_addr;
>   struct nbrec_logical_router_port;
>   struct ovsrec_flow_sample_collector_set_table;
> diff --git a/tests/ovn.at b/tests/ovn.at
> index b0439d99e..99ce3dd90 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -15194,6 +15194,268 @@ OVN_CLEANUP([hv1],[hv2],[hv3])
>   AT_CLEANUP
>   ])
>   
> +m4_define([MULTICHASSIS_PATH_MTU_DISCOVERY_TEST],
> +  [OVN_FOR_EACH_NORTHD([
> +   AT_SETUP([localnet connectivity with multiple requested-chassis, path mtu discovery (ip=$1, tunnel=$2, mtu=$3)])
> +   AT_KEYWORDS([multi-chassis])
> +
> +   ovn_start
> +
> +   net_add n1
> +   for i in 1 2; do
> +       sim_add hv$i
> +       as hv$i
> +       check ovs-vsctl add-br br-phys
> +       if test "x$1" = "xipv6"; then
> +           ovn_attach n1 br-phys fd00::$i 64 $2
> +       else
> +           ovn_attach n1 br-phys 192.168.0.$i 24 $2
> +       fi
> +       check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
> +   done
> +
> +   first_mac=00:00:00:00:00:01
> +   second_mac=00:00:00:00:00:02
> +   multi1_mac=00:00:00:00:00:f0
> +   multi2_mac=00:00:00:00:00:f1
> +   first_ip=10.0.0.1
> +   second_ip=10.0.0.2
> +   multi1_ip=10.0.0.10
> +   multi2_ip=10.0.0.20
> +   first_ip6=abcd::1
> +   second_ip6=abcd::2
> +   multi1_ip6=abcd::f0
> +   multi2_ip6=abcd::f1
> +
> +   check ovn-nbctl ls-add ls0
> +   check ovn-nbctl lsp-add ls0 first
> +   check ovn-nbctl lsp-add ls0 second
> +   check ovn-nbctl lsp-add ls0 multi1
> +   check ovn-nbctl lsp-add ls0 multi2
> +   check ovn-nbctl lsp-set-addresses first "${first_mac} ${first_ip} ${first_ip6}"
> +   check ovn-nbctl lsp-set-addresses second "${second_mac} ${second_ip} ${second_ip6}"
> +   check ovn-nbctl lsp-set-addresses multi1 "${multi1_mac} ${multi1_ip} ${multi1_ip6}"
> +   check ovn-nbctl lsp-set-addresses multi2 "${multi2_mac} ${multi2_ip} ${multi2_ip6}"
> +
> +   check ovn-nbctl lsp-add ls0 public
> +   check ovn-nbctl lsp-set-type public localnet
> +   check ovn-nbctl lsp-set-addresses public unknown
> +   check ovn-nbctl lsp-set-options public network_name=phys
> +
> +   check ovn-nbctl lsp-set-options first requested-chassis=hv1
> +   check ovn-nbctl lsp-set-options second requested-chassis=hv2
> +   check ovn-nbctl lsp-set-options multi1 requested-chassis=hv1,hv2
> +   check ovn-nbctl lsp-set-options multi2 requested-chassis=hv1,hv2
> +
> +   as hv1 check ovs-vsctl -- add-port br-int first -- \
> +       set Interface first external-ids:iface-id=first \
> +       options:tx_pcap=hv1/first-tx.pcap \
> +       options:rxq_pcap=hv1/first-rx.pcap \
> +       ofport-request=1
> +   as hv2 check ovs-vsctl -- add-port br-int second -- \
> +       set Interface second external-ids:iface-id=second \
> +       options:tx_pcap=hv2/second-tx.pcap \
> +       options:rxq_pcap=hv2/second-rx.pcap \
> +       ofport-request=2
> +
> +   # Create Migrator interfaces on both hv1 and hv2
> +   for hv in hv1 hv2; do
> +       for i in 1 2; do
> +           as $hv check ovs-vsctl -- add-port br-int multi${i} -- \
> +               set Interface multi${i} external-ids:iface-id=multi${i} \
> +               options:tx_pcap=$hv/multi${i}-tx.pcap \
> +               options:rxq_pcap=$hv/multi${i}-rx.pcap \
> +               ofport-request=${i}00
> +       done
> +   done
> +
> +   send_ip_packet() {
> +       local inport=${1} hv=${2} eth_src=${3} eth_dst=${4} ipv4_src=${5} ipv4_dst=${6} data=${7} fail=${8}
> +       packet=$(fmt_pkt "
> +           Ether(dst='${eth_dst}', src='${eth_src}') /
> +           IP(src='${ipv4_src}', dst='${ipv4_dst}') /
> +           ICMP(type=8) / bytes.fromhex('${data}')
> +       ")
> +       as hv${hv} ovs-appctl netdev-dummy/receive ${inport} ${packet}
> +       if [[ x"${fail}" != x0 ]]; then
> +         original_ip_frame=$(fmt_pkt "
> +           IP(src='${ipv4_src}', dst='${ipv4_dst}') /
> +           ICMP(type=8) / bytes.fromhex('${data}')
> +         ")
> +         # IP(flags=2) means DF (Don't Fragment) = 1
> +         # ICMP(type=3, code=4) means Destination Unreachable, Fragmentation Needed
> +         packet=$(fmt_pkt "
> +             Ether(dst='${eth_src}', src='${eth_dst}') /
> +             IP(src='${ipv4_dst}', dst='${ipv4_src}', ttl=255, flags=2, id=0) /
> +             ICMP(type=3, code=4, nexthopmtu=$3) /
> +             bytes.fromhex('${original_ip_frame:0:$((534 * 2))}')
> +         ")
> +       fi
> +       echo ${packet}
> +   }
> +
> +   send_ip6_packet() {
> +       local inport=${1} hv=${2} eth_src=${3} eth_dst=${4} ipv6_src=${5} ipv6_dst=${6} data=${7} fail=${8}
> +       packet=$(fmt_pkt "
> +           Ether(dst='${eth_dst}', src='${eth_src}') /
> +           IPv6(src='${ipv6_src}', dst='${ipv6_dst}') /
> +           ICMPv6EchoRequest() / bytes.fromhex('${data}')
> +       ")
> +       as hv${hv} ovs-appctl netdev-dummy/receive ${inport} ${packet}
> +       if [[ x"${fail}" != x0 ]]; then
> +         original_ip_frame=$(fmt_pkt "
> +           IPv6(src='${ipv6_src}', dst='${ipv6_dst}') /
> +           ICMPv6EchoRequest() / bytes.fromhex('${data}')
> +         ")
> +         packet=$(fmt_pkt "
> +             Ether(dst='${eth_src}', src='${eth_dst}') /
> +             IPv6(src='${ipv6_dst}', dst='${ipv6_src}', hlim=255) /
> +             ICMPv6PacketTooBig(mtu=$3) /
> +             bytes.fromhex('${original_ip_frame:0:$((1218 * 2))}')
> +         ")
> +       fi
> +       echo ${packet}
> +   }
> +
> +   reset_env() {
> +       for port in first multi1 multi2; do
> +           as hv1 reset_pcap_file $port hv1/$port
> +       done
> +       for port in second multi1 multi2; do
> +           as hv2 reset_pcap_file $port hv2/$port
> +       done
> +       for port in hv1/multi1 hv2/multi1 hv1/multi2 hv2/multi2 hv1/first hv2/second; do
> +           : > $port.expected
> +       done
> +   }
> +
> +   check_pkts() {
> +       for port in hv1/multi1 hv2/multi1 hv1/multi2 hv2/multi2 hv1/first hv2/second; do
> +           OVN_CHECK_PACKETS_REMOVE_BROADCAST([${port}-tx.pcap], [${port}.expected])
> +       done
> +   }
> +
> +   payload() {
> +       echo $(xxd -l ${1} -c ${1} -p < /dev/urandom)
> +   }
> +
> +   wait_for_ports_up
> +   OVN_POPULATE_ARP
> +
> +   reset_env
> +
> +   AS_BOX([Packets of proper size are delivered from multichassis to regular ports])
> +
> +   len=1000
> +   packet=$(send_ip_packet multi1 1 $multi1_mac $first_mac $multi1_ip $first_ip $(payload $len) 0)
> +   echo $packet >> hv1/first.expected
> +
> +   packet=$(send_ip_packet multi1 1 $multi1_mac $second_mac $multi1_ip $second_ip $(payload $len) 0)
> +   echo $packet >> hv2/second.expected
> +
> +   packet=$(send_ip6_packet multi1 1 $multi1_mac $first_mac $multi1_ip6 $first_ip6 $(payload $len) 0)
> +   echo $packet >> hv1/first.expected
> +
> +   packet=$(send_ip6_packet multi1 1 $multi1_mac $second_mac $multi1_ip6 $second_ip6 $(payload $len) 0)
> +   echo $packet >> hv2/second.expected
> +
> +   check_pkts
> +   reset_env
> +
> +   AS_BOX([Oversized packets are not delivered from multichassis to regular ports])
> +
> +   len=3000
> +   packet=$(send_ip_packet multi1 1 $multi1_mac $first_mac $multi1_ip $first_ip $(payload $len) 1)
> +   echo $packet >> hv1/multi1.expected
> +
> +   packet=$(send_ip_packet multi1 1 $multi1_mac $second_mac $multi1_ip $second_ip $(payload $len) 1)
> +   echo $packet >> hv1/multi1.expected
> +
> +   packet=$(send_ip6_packet multi1 1 $multi1_mac $first_mac $multi1_ip6 $first_ip6 $(payload $len) 1)
> +   echo $packet >> hv1/multi1.expected
> +
> +   packet=$(send_ip6_packet multi1 1 $multi1_mac $second_mac $multi1_ip6 $second_ip6 $(payload $len) 1)
> +   echo $packet >> hv1/multi1.expected
> +
> +   check_pkts
> +   reset_env
> +
> +   AS_BOX([Packets of proper size are delivered from regular to multichassis ports])
> +
> +   len=1000
> +   packet=$(send_ip_packet first 1 $first_mac $multi1_mac $first_ip $multi1_ip $(payload $len) 0)
> +   echo $packet >> hv1/multi1.expected
> +   echo $packet >> hv2/multi1.expected
> +
> +   packet=$(send_ip_packet second 2 $second_mac $multi1_mac $second_ip $multi1_ip $(payload $len) 0)
> +   echo $packet >> hv1/multi1.expected
> +   echo $packet >> hv2/multi1.expected
> +
> +   packet=$(send_ip6_packet first 1 $first_mac $multi1_mac $first_ip6 $multi1_ip6 $(payload $len) 0)
> +   echo $packet >> hv1/multi1.expected
> +   echo $packet >> hv2/multi1.expected
> +
> +   packet=$(send_ip6_packet second 2 $second_mac $multi1_mac $second_ip6 $multi1_ip6 $(payload $len) 0)
> +   echo $packet >> hv1/multi1.expected
> +   echo $packet >> hv2/multi1.expected
> +
> +   check_pkts
> +   reset_env
> +
> +   AS_BOX([Oversized packets are not delivered from regular to multichassis ports])
> +
> +   len=3000
> +   packet=$(send_ip_packet first 1 $first_mac $multi1_mac $first_ip $multi1_ip $(payload $len) 1)
> +   echo $packet >> hv1/first.expected
> +
> +   packet=$(send_ip_packet second 2 $second_mac $multi1_mac $second_ip $multi1_ip $(payload $len) 1)
> +   echo $packet >> hv2/second.expected
> +
> +   packet=$(send_ip6_packet first 1 $first_mac $multi1_mac $first_ip6 $multi1_ip6 $(payload $len) 1)
> +   echo $packet >> hv1/first.expected
> +
> +   packet=$(send_ip6_packet second 2 $second_mac $multi1_mac $second_ip6 $multi1_ip6 $(payload $len) 1)
> +   echo $packet >> hv2/second.expected
> +
> +   check_pkts
> +   reset_env
> +
> +   AS_BOX([Packets of proper size are delivered from multichassis to multichassis ports])
> +
> +   len=1000
> +   packet=$(send_ip_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip $multi2_ip $(payload $len) 0)
> +   echo $packet >> hv1/multi2.expected
> +   echo $packet >> hv2/multi2.expected
> +
> +   packet=$(send_ip6_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip6 $multi2_ip6 $(payload $len) 0)
> +   echo $packet >> hv1/multi2.expected
> +   echo $packet >> hv2/multi2.expected
> +
> +   check_pkts
> +   reset_env
> +
> +   AS_BOX([Oversized packets are not delivered from multichassis to multichassis ports])
> +
> +   len=3000
> +   packet=$(send_ip_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip $multi2_ip $(payload $len) 1)
> +   echo $packet >> hv1/multi1.expected
> +
> +   packet=$(send_ip6_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip6 $multi2_ip6 $(payload $len) 1)
> +   echo $packet >> hv1/multi1.expected
> +
> +   check_pkts
> +
> +   OVN_CLEANUP([hv1],[hv2])
> +
> +   AT_CLEANUP
> +   ])])
> +
> +# NOTE(ihar) no STT variants because it's not supported by upstream kernels
> +MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv4], [geneve], [1424])
> +MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv6], [geneve], [1404])
> +MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv4], [vxlan], [1432])
> +MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv6], [vxlan], [1412])
> +
>   OVN_FOR_EACH_NORTHD([
>   AT_SETUP([options:activation-strategy for logical port])
>   AT_KEYWORDS([multi-chassis])
Ihar Hrachyshka May 4, 2023, 2:46 p.m. UTC | #2
Thanks Mark! I will update the series according to your suggestions
after more reviews land to the series.

On Wed, May 3, 2023 at 4:55 PM Mark Michelson <mmichels@redhat.com> wrote:
>
> On 5/2/23 21:12, Ihar Hrachyshka wrote:
> > When a multichassis port belongs to a switch with a localnet port,
> > packets originating or directed to the multichassis port are NOT sent
> > thorugh the localnet port. Instead, tunneling is enforced in-cluster to
> > guarantee delivery of all packets to all chassis of the port.
> >
> > This behavior has an unfortunate side effect, where - because of
> > additional tunnel header added to each packet - the effective MTU of the
> > path for multichassis ports changes from what's set as mtu_request. This
> > effectively makes OVN to black hole all packets for the port that use
> > full capacity of the interface MTU. This breaks usual TCP / UDP
> > services, among other things (SSH, iperf sessions etc.)
> >
> > This patch adds flows so that
> > - (in table 38) detect too-big packets (table 38), and then
> > - (in table 39) icmp fragmentation needed / too big errors are sent
> >    back to offending port.
> >
> > Once the error is received, the sender is expected to adjust the route
> > MTU accordingly, sending the next packets with the new path MTU. After a
> > multichassis port is re-assigned to a single chassis, the effective path
> > MTU is restored to "usual". Peers will eventually see their "learned"
> > path MTU cache expire, which will make them switch back to the "usual"
> > MTU.
> >
> > Among other scenarios, this patch helps to maintain existing services
> > working during live migration of a VM, if multichassis ports are used.
> > (E.g. in OpenStack Nueutron.)
> >
> > Fixes: 7084cf437421 ("Always funnel multichassis port traffic through tunnels")
> >
> > Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
> > ---
> >   controller/ovn-controller.c |   3 +
> >   controller/physical.c       | 297 +++++++++++++++++++++++++++++++++++-
> >   controller/physical.h       |   1 +
> >   lib/ovn-util.h              |  11 ++
> >   tests/ovn.at                | 262 +++++++++++++++++++++++++++++++
> >   5 files changed, 567 insertions(+), 7 deletions(-)
> >
> > diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> > index c094cb74d..9359925fa 100644
> > --- a/controller/ovn-controller.c
> > +++ b/controller/ovn-controller.c
> > @@ -4083,6 +4083,9 @@ static void init_physical_ctx(struct engine_node *node,
> >       p_ctx->patch_ofports = &non_vif_data->patch_ofports;
> >       p_ctx->chassis_tunnels = &non_vif_data->chassis_tunnels;
> >
> > +    struct controller_engine_ctx *ctrl_ctx = engine_get_context()->client_ctx;
> > +    p_ctx->if_mgr = ctrl_ctx->if_mgr;
> > +
> >       pflow_output_get_debug(node, &p_ctx->debug);
> >   }
> >
> > diff --git a/controller/physical.c b/controller/physical.c
> > index 1b0482e3b..1c1018616 100644
> > --- a/controller/physical.c
> > +++ b/controller/physical.c
> > @@ -41,6 +41,7 @@
> >   #include "lib/ovn-sb-idl.h"
> >   #include "lib/ovn-util.h"
> >   #include "ovn/actions.h"
> > +#include "if-status.h"
> >   #include "physical.h"
> >   #include "pinctrl.h"
> >   #include "openvswitch/shash.h"
> > @@ -91,6 +92,7 @@ physical_register_ovs_idl(struct ovsdb_idl *ovs_idl)
> >
> >       ovsdb_idl_add_table(ovs_idl, &ovsrec_table_interface);
> >       ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_name);
> > +    ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_mtu);
> >       ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_ofport);
> >       ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_external_ids);
> >   }
> > @@ -1104,6 +1106,273 @@ setup_activation_strategy(const struct sbrec_port_binding *binding,
> >       }
> >   }
> >
> > +static size_t
> > +encode_start_controller_op(enum action_opcode opcode, bool pause,
> > +                           uint32_t meter_id, struct ofpbuf *ofpacts)
> > +{
> > +    size_t ofs = ofpacts->size;
> > +
> > +    struct ofpact_controller *oc = ofpact_put_CONTROLLER(ofpacts);
> > +    oc->max_len = UINT16_MAX;
> > +    oc->reason = OFPR_ACTION;
> > +    oc->pause = pause;
> > +    if (!ovs_feature_is_supported(OVS_DP_METER_SUPPORT)) {
> > +        meter_id = NX_CTLR_NO_METER;
> > +    }
> > +    oc->meter_id = meter_id;
> > +
> > +    struct action_header ah = { .opcode = htonl(opcode) };
> > +    ofpbuf_put(ofpacts, &ah, sizeof ah);
> > +
> > +    return ofs;
> > +}
> > +
> > +static void
> > +encode_finish_controller_op(size_t ofs, struct ofpbuf *ofpacts)
> > +{
> > +    struct ofpact_controller *oc = ofpbuf_at_assert(ofpacts, ofs, sizeof *oc);
> > +    ofpacts->header = oc;
> > +    oc->userdata_len = ofpacts->size - (ofs + sizeof *oc);
> > +    ofpact_finish_CONTROLLER(ofpacts, &oc);
> > +}
>
> ecnode_start_controller_op() and encode_finish_controller_op() appear to
> be copied from lib/actions.c . I think it's reasonable to expose these
> functions in lib/actions.h and call them instead of redefining them.
>

You are right, and I already forgot I did copy them in my initial PoC
(back in Oct 2022?) I will make them public and reuse, thanks for
spotting it.

> > +
> > +/*
> > + * Insert a flow to determine if an IP packet is too big for the corresponding
> > + * egress interface.
> > + */
> > +static void
> > +determine_if_pkt_too_big(struct ovn_desired_flow_table *flow_table,
> > +                         const struct sbrec_port_binding *binding,
> > +                         const struct sbrec_port_binding *mcp,
> > +                         uint16_t mtu, bool is_ipv6, int direction)
> > +{
> > +    struct ofpbuf ofpacts;
> > +    ofpbuf_init(&ofpacts, 0);
> > +
> > +    /* Store packet too large flag in reg9[1]. */
> > +    struct match match;
> > +    match_init_catchall(&match);
> > +    match_set_dl_type(&match, htons(is_ipv6 ? ETH_TYPE_IPV6 : ETH_TYPE_IP));
> > +    match_set_metadata(&match, htonll(binding->datapath->tunnel_key));
> > +    match_set_reg(&match, direction - MFF_REG0, mcp->tunnel_key);
> > +
> > +    /* reg9[1] is REGBIT_PKT_LARGER as defined by northd */
> > +    struct ofpact_check_pkt_larger *pkt_larger =
> > +        ofpact_put_CHECK_PKT_LARGER(&ofpacts);
> > +    pkt_larger->pkt_len = mtu;
> > +    pkt_larger->dst.field = mf_from_id(MFF_REG9);
> > +    pkt_larger->dst.ofs = 1;
> > +
> > +    put_resubmit(OFTABLE_OUTPUT_LARGE_PKT_PROCESS, &ofpacts);
> > +    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_DETECT, 100,
> > +                    binding->header_.uuid.parts[0], &match, &ofpacts,
> > +                    &binding->header_.uuid);
> > +    ofpbuf_uninit(&ofpacts);
> > +}
> > +
> > +/*
> > + * Insert a flow to reply with ICMP error for IP packets that are too big for
> > + * the corresponding egress interface.
> > + */
> > +/*
> > + * NOTE(ihrachys) This reimplements icmp_error as found in
> > + * build_icmperr_pkt_big_flows. We may look into reusing the existing OVN
> > + * action for this flow in the future.
> > + */
> > +static void
> > +reply_imcp_error_if_pkt_too_big(struct ovn_desired_flow_table *flow_table,
> > +                                const struct sbrec_port_binding *binding,
> > +                                const struct sbrec_port_binding *mcp,
> > +                                uint16_t mtu, bool is_ipv6, int direction)
> > +{
> > +    struct match match;
> > +    match_init_catchall(&match);
> > +    match_set_dl_type(&match, htons(is_ipv6 ? ETH_TYPE_IPV6 : ETH_TYPE_IP));
> > +    match_set_metadata(&match, htonll(binding->datapath->tunnel_key));
> > +    match_set_reg(&match, direction - MFF_REG0, mcp->tunnel_key);
> > +    match_set_reg_masked(&match, MFF_REG9 - MFF_REG0, 1 << 1, 1 << 1);
> > +
> > +    /* Return ICMP error with a part of the original IP packet included. */
> > +    struct ofpbuf ofpacts;
> > +    ofpbuf_init(&ofpacts, 0);
> > +    size_t oc_offset = encode_start_controller_op(
> > +        ACTION_OPCODE_ICMP, true, NX_CTLR_NO_METER, &ofpacts);
> > +
> > +    struct ofpbuf inner_ofpacts;
> > +    ofpbuf_init(&inner_ofpacts, 0);
> > +
> > +    /* The error packet is no longer too large, set REGBIT_PKT_LARGER = 0 */
> > +    /* reg9[1] is REGBIT_PKT_LARGER as defined by northd */
> > +    ovs_be32 value = htonl(0);
> > +    ovs_be32 mask = htonl(1 << 1);
> > +    ofpact_put_set_field(
> > +        &inner_ofpacts, mf_from_id(MFF_REG9), &value, &mask);
> > +
> > +    /* The new error packet is delivered locally */
> > +    /* REGBIT_EGRESS_LOOPBACK = 1 */
> > +    value = htonl(1 << MLF_ALLOW_LOOPBACK_BIT);
> > +    mask = htonl(1 << MLF_ALLOW_LOOPBACK_BIT);
> > +    ofpact_put_set_field(
> > +        &inner_ofpacts, mf_from_id(MFF_LOG_FLAGS), &value, &mask);
> > +
> > +    /* eth.src <-> eth.dst */
> > +    put_stack(MFF_ETH_DST, ofpact_put_STACK_PUSH(&inner_ofpacts));
> > +    put_stack(MFF_ETH_SRC, ofpact_put_STACK_PUSH(&inner_ofpacts));
> > +    put_stack(MFF_ETH_DST, ofpact_put_STACK_POP(&inner_ofpacts));
> > +    put_stack(MFF_ETH_SRC, ofpact_put_STACK_POP(&inner_ofpacts));
> > +
> > +    /* ip.src <-> ip.dst */
> > +    put_stack(is_ipv6 ? MFF_IPV6_DST : MFF_IPV4_DST,
> > +        ofpact_put_STACK_PUSH(&inner_ofpacts));
> > +    put_stack(is_ipv6 ? MFF_IPV6_SRC : MFF_IPV4_SRC,
> > +        ofpact_put_STACK_PUSH(&inner_ofpacts));
> > +    put_stack(is_ipv6 ? MFF_IPV6_DST : MFF_IPV4_DST,
> > +        ofpact_put_STACK_POP(&inner_ofpacts));
> > +    put_stack(is_ipv6 ? MFF_IPV6_SRC : MFF_IPV4_SRC,
> > +        ofpact_put_STACK_POP(&inner_ofpacts));
> > +
> > +    /* ip.ttl = 255 */
> > +    struct ofpact_ip_ttl *ip_ttl = ofpact_put_SET_IP_TTL(&inner_ofpacts);
> > +    ip_ttl->ttl = 255;
> > +
> > +    uint16_t frag_mtu = mtu - ETHERNET_OVERHEAD;
> > +    size_t frag_mtu_oc_offset;
> > +    if (is_ipv6) {
> > +        /* icmp6.type = 2 (Packet Too Big) */
> > +        /* icmp6.code = 0 */
> > +        uint8_t icmp_type = 2;
> > +        uint8_t icmp_code = 0;
> > +        ofpact_put_set_field(
> > +            &inner_ofpacts, mf_from_id(MFF_ICMPV6_TYPE), &icmp_type, NULL);
> > +        ofpact_put_set_field(
> > +            &inner_ofpacts, mf_from_id(MFF_ICMPV6_CODE), &icmp_code, NULL);
> > +
> > +        /* icmp6.frag_mtu */
> > +        frag_mtu_oc_offset = encode_start_controller_op(
> > +            ACTION_OPCODE_PUT_ICMP6_FRAG_MTU, true, NX_CTLR_NO_METER,
> > +            &inner_ofpacts);
> > +        ovs_be32 frag_mtu_ovs = htonl(frag_mtu);
> > +        ofpbuf_put(&inner_ofpacts, &frag_mtu_ovs, sizeof(frag_mtu_ovs));
> > +    } else {
> > +        /* icmp4.type = 3 (Destination Unreachable) */
> > +        /* icmp4.code = 4 (Fragmentation Needed) */
> > +        uint8_t icmp_type = 3;
> > +        uint8_t icmp_code = 4;
> > +        ofpact_put_set_field(
> > +            &inner_ofpacts, mf_from_id(MFF_ICMPV4_TYPE), &icmp_type, NULL);
> > +        ofpact_put_set_field(
> > +            &inner_ofpacts, mf_from_id(MFF_ICMPV4_CODE), &icmp_code, NULL);
> > +
> > +        /* icmp4.frag_mtu = */
> > +        frag_mtu_oc_offset = encode_start_controller_op(
> > +            ACTION_OPCODE_PUT_ICMP4_FRAG_MTU, true, NX_CTLR_NO_METER,
> > +            &inner_ofpacts);
> > +        ovs_be16 frag_mtu_ovs = htons(frag_mtu);
> > +        ofpbuf_put(&inner_ofpacts, &frag_mtu_ovs, sizeof(frag_mtu_ovs));
> > +    }
> > +    encode_finish_controller_op(frag_mtu_oc_offset, &inner_ofpacts);
> > +
> > +    /* Finally, submit the ICMP error back to the ingress pipeline */
> > +    put_resubmit(OFTABLE_LOG_INGRESS_PIPELINE, &inner_ofpacts);
> > +
> > +    /* Attach nested actions to ICMP error controller handler */
> > +    ofpacts_put_openflow_actions(inner_ofpacts.data, inner_ofpacts.size,
> > +                                 &ofpacts, OFP15_VERSION);
> > +
> > +    /* Finalize the ICMP error controller handler */
> > +    encode_finish_controller_op(oc_offset, &ofpacts);
> > +
> > +    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 100,
> > +                    binding->header_.uuid.parts[0], &match, &ofpacts,
> > +                    &binding->header_.uuid);
> > +
> > +    ofpbuf_uninit(&inner_ofpacts);
> > +    ofpbuf_uninit(&ofpacts);
> > +}
> > +
> > +static uint16_t
> > +get_tunnel_overhead(struct chassis_tunnel const *tun)
> > +{
> > +    uint16_t overhead = 0;
> > +    enum chassis_tunnel_type type = tun->type;
> > +    if (type == GENEVE) {
> > +        overhead += GENEVE_TUNNEL_OVERHEAD;
> > +    } else if (type == STT) {
> > +        overhead += STT_TUNNEL_OVERHEAD;
> > +    } else if (type == VXLAN) {
> > +        overhead += VXLAN_TUNNEL_OVERHEAD;
> > +    } else {
> > +        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
> > +        VLOG_WARN_RL(&rl, "Unknown tunnel type %d, can't determine overhead "
> > +                          "size for Path MTU Discovery", type);
> > +        return 0;
> > +    }
> > +    overhead += tun->is_ipv6? IPV6_HEADER_LEN : IPV4_HEADER_LEN;
> > +    return overhead;
> > +}
> > +
> > +static uint16_t
> > +get_effective_mtu(const struct sbrec_port_binding *mcp,
> > +                  struct ovs_list *remote_tunnels,
> > +                  const struct if_status_mgr *if_mgr)
> > +{
> > +    /* Use interface MTU as a base for calculation */
> > +    uint16_t iface_mtu = if_status_mgr_iface_get_mtu(if_mgr,
> > +                                                     mcp->logical_port);
> > +    if (!iface_mtu) {
> > +        return 0;
> > +    }
> > +
> > +    /* Iterate over all peer tunnels and find the biggest tunnel overhead */
> > +    uint16_t overhead = 0;
> > +    struct tunnel *tun;
> > +    LIST_FOR_EACH (tun, list_node, remote_tunnels) {
> > +        uint16_t tunnel_overhead = get_tunnel_overhead(tun->tun);
> > +        if (tunnel_overhead > overhead) {
> > +            overhead = tunnel_overhead;
> > +        }
> > +    }
> > +    if (!overhead) {
> > +        return 0;
> > +    }
> > +
> > +    return iface_mtu - overhead;
> > +}
> > +
> > +static void
> > +handle_pkt_too_big_for_ip_version(struct ovn_desired_flow_table *flow_table,
> > +                                  const struct sbrec_port_binding *binding,
> > +                                  const struct sbrec_port_binding *mcp,
> > +                                  uint16_t mtu, bool is_ipv6)
> > +{
> > +    /* ingress */
> > +    determine_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
> > +                             MFF_LOG_INPORT);
> > +    reply_imcp_error_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
> > +                                    MFF_LOG_INPORT);
> > +
> > +    /* egress */
> > +    determine_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
> > +                             MFF_LOG_OUTPORT);
> > +    reply_imcp_error_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
> > +                                    MFF_LOG_OUTPORT);
> > +}
> > +
> > +static void
> > +handle_pkt_too_big(struct ovn_desired_flow_table *flow_table,
> > +                   struct ovs_list *remote_tunnels,
> > +                   const struct sbrec_port_binding *binding,
> > +                   const struct sbrec_port_binding *mcp,
> > +                   const struct if_status_mgr *if_mgr)
> > +{
> > +    uint16_t mtu = get_effective_mtu(mcp, remote_tunnels, if_mgr);
> > +    if (!mtu) {
> > +        return;
> > +    }
> > +    handle_pkt_too_big_for_ip_version(flow_table, binding, mcp, mtu, false);
> > +    handle_pkt_too_big_for_ip_version(flow_table, binding, mcp, mtu, true);
> > +}
> > +
> >   static void
> >   enforce_tunneling_for_multichassis_ports(
> >       struct local_datapath *ld,
> > @@ -1111,7 +1380,8 @@ enforce_tunneling_for_multichassis_ports(
> >       const struct sbrec_chassis *chassis,
> >       const struct hmap *chassis_tunnels,
> >       enum mf_field_id mff_ovn_geneve,
> > -    struct ovn_desired_flow_table *flow_table)
> > +    struct ovn_desired_flow_table *flow_table,
> > +    const struct if_status_mgr *if_mgr)
> >   {
> >       if (shash_is_empty(&ld->multichassis_ports)) {
> >           return;
> > @@ -1156,6 +1426,8 @@ enforce_tunneling_for_multichassis_ports(
> >                           binding->header_.uuid.parts[0], &match, &ofpacts,
> >                           &binding->header_.uuid);
> >           ofpbuf_uninit(&ofpacts);
> > +
> > +        handle_pkt_too_big(flow_table, tuns, binding, mcp, if_mgr);
> >       }
> >
> >       struct tunnel *tun_elem;
> > @@ -1177,6 +1449,7 @@ consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
> >                         const struct sbrec_port_binding *binding,
> >                         const struct sbrec_chassis *chassis,
> >                         const struct physical_debug *debug,
> > +                      const struct if_status_mgr *if_mgr,
> >                         struct ovn_desired_flow_table *flow_table,
> >                         struct ofpbuf *ofpacts_p)
> >   {
> > @@ -1602,8 +1875,10 @@ consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
> >                           binding->header_.uuid.parts[0],
> >                           &match, ofpacts_p, &binding->header_.uuid);
> >
> > -        enforce_tunneling_for_multichassis_ports(
> > -            ld, binding, chassis, chassis_tunnels, mff_ovn_geneve, flow_table);
> > +        enforce_tunneling_for_multichassis_ports(ld, binding, chassis,
> > +                                                 chassis_tunnels,
> > +                                                 mff_ovn_geneve, flow_table,
> > +                                                 if_mgr);
> >
> >           /* No more tunneling to set up. */
> >           goto out;
> > @@ -1908,7 +2183,7 @@ physical_eval_port_binding(struct physical_ctx *p_ctx,
> >                             p_ctx->patch_ofports,
> >                             p_ctx->chassis_tunnels,
> >                             pb, p_ctx->chassis, &p_ctx->debug,
> > -                          flow_table, &ofpacts);
> > +                          p_ctx->if_mgr, flow_table, &ofpacts);
> >       ofpbuf_uninit(&ofpacts);
> >   }
> >
> > @@ -2032,7 +2307,7 @@ physical_run(struct physical_ctx *p_ctx,
> >                                 p_ctx->patch_ofports,
> >                                 p_ctx->chassis_tunnels, binding,
> >                                 p_ctx->chassis, &p_ctx->debug,
> > -                              flow_table, &ofpacts);
> > +                              p_ctx->if_mgr, flow_table, &ofpacts);
> >       }
> >
> >       /* Handle output to multicast groups, in tables 40 and 41. */
> > @@ -2176,11 +2451,19 @@ physical_run(struct physical_ctx *p_ctx,
> >       ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_DETECT, 0, 0, &match,
> >                       &ofpacts, hc_uuid);
> >
> > +    match_init_catchall(&match);
> > +    match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0,
> > +                         MLF_ALLOW_LOOPBACK, MLF_ALLOW_LOOPBACK);
> > +    ofpbuf_clear(&ofpacts);
> > +    put_resubmit(OFTABLE_LOCAL_OUTPUT, &ofpacts);
> > +    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 10, 0,
> > +                    &match, &ofpacts, hc_uuid);
> > +
> >       match_init_catchall(&match);
> >       ofpbuf_clear(&ofpacts);
> >       put_resubmit(OFTABLE_REMOTE_OUTPUT, &ofpacts);
> > -    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 0, 0, &match,
> > -                    &ofpacts, hc_uuid);
> > +    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 0, 0,
> > +                    &match, &ofpacts, hc_uuid);
> >
> >       /* Table 40, priority 150.
> >        * =======================
> > diff --git a/controller/physical.h b/controller/physical.h
> > index f450dca94..396bcb138 100644
> > --- a/controller/physical.h
> > +++ b/controller/physical.h
> > @@ -57,6 +57,7 @@ struct physical_ctx {
> >       const struct sbrec_chassis_table *chassis_table;
> >       const struct sbrec_chassis *chassis;
> >       const struct sset *active_tunnels;
> > +    const struct if_status_mgr *if_mgr;
> >       struct hmap *local_datapaths;
> >       struct sset *local_lports;
> >       const struct simap *ct_zones;
> > diff --git a/lib/ovn-util.h b/lib/ovn-util.h
> > index 7cf861dbc..7ec2bca48 100644
> > --- a/lib/ovn-util.h
> > +++ b/lib/ovn-util.h
> > @@ -29,6 +29,17 @@
> >   #define ROUTE_ORIGIN_CONNECTED "connected"
> >   #define ROUTE_ORIGIN_STATIC "static"
> >
> > +#define ETH_HEADER_LENGTH 14
> > +#define ETH_CRC_LENGTH 4
> > +#define ETHERNET_OVERHEAD (ETH_HEADER_LENGTH + ETH_CRC_LENGTH)
> > +
> > +#define IPV4_HEADER_LEN 20
> > +#define IPV6_HEADER_LEN 40
>
> ETH_HEADER_LENGTH, IPV4_HEADER_LEN, and IPV6_HEADER_LEN are already
> defined in OVS in lib/packets.h. You can reuse those values instead of
> redefining them.
>

Thanks! I looked for them before adding here, but clearly my search
skills are not as great as yours. :)

> > +
> > +#define GENEVE_TUNNEL_OVERHEAD 38
> > +#define STT_TUNNEL_OVERHEAD 18
> > +#define VXLAN_TUNNEL_OVERHEAD 30
> > +
> >   struct eth_addr;
> >   struct nbrec_logical_router_port;
> >   struct ovsrec_flow_sample_collector_set_table;
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index b0439d99e..99ce3dd90 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -15194,6 +15194,268 @@ OVN_CLEANUP([hv1],[hv2],[hv3])
> >   AT_CLEANUP
> >   ])
> >
> > +m4_define([MULTICHASSIS_PATH_MTU_DISCOVERY_TEST],
> > +  [OVN_FOR_EACH_NORTHD([
> > +   AT_SETUP([localnet connectivity with multiple requested-chassis, path mtu discovery (ip=$1, tunnel=$2, mtu=$3)])
> > +   AT_KEYWORDS([multi-chassis])
> > +
> > +   ovn_start
> > +
> > +   net_add n1
> > +   for i in 1 2; do
> > +       sim_add hv$i
> > +       as hv$i
> > +       check ovs-vsctl add-br br-phys
> > +       if test "x$1" = "xipv6"; then
> > +           ovn_attach n1 br-phys fd00::$i 64 $2
> > +       else
> > +           ovn_attach n1 br-phys 192.168.0.$i 24 $2
> > +       fi
> > +       check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
> > +   done
> > +
> > +   first_mac=00:00:00:00:00:01
> > +   second_mac=00:00:00:00:00:02
> > +   multi1_mac=00:00:00:00:00:f0
> > +   multi2_mac=00:00:00:00:00:f1
> > +   first_ip=10.0.0.1
> > +   second_ip=10.0.0.2
> > +   multi1_ip=10.0.0.10
> > +   multi2_ip=10.0.0.20
> > +   first_ip6=abcd::1
> > +   second_ip6=abcd::2
> > +   multi1_ip6=abcd::f0
> > +   multi2_ip6=abcd::f1
> > +
> > +   check ovn-nbctl ls-add ls0
> > +   check ovn-nbctl lsp-add ls0 first
> > +   check ovn-nbctl lsp-add ls0 second
> > +   check ovn-nbctl lsp-add ls0 multi1
> > +   check ovn-nbctl lsp-add ls0 multi2
> > +   check ovn-nbctl lsp-set-addresses first "${first_mac} ${first_ip} ${first_ip6}"
> > +   check ovn-nbctl lsp-set-addresses second "${second_mac} ${second_ip} ${second_ip6}"
> > +   check ovn-nbctl lsp-set-addresses multi1 "${multi1_mac} ${multi1_ip} ${multi1_ip6}"
> > +   check ovn-nbctl lsp-set-addresses multi2 "${multi2_mac} ${multi2_ip} ${multi2_ip6}"
> > +
> > +   check ovn-nbctl lsp-add ls0 public
> > +   check ovn-nbctl lsp-set-type public localnet
> > +   check ovn-nbctl lsp-set-addresses public unknown
> > +   check ovn-nbctl lsp-set-options public network_name=phys
> > +
> > +   check ovn-nbctl lsp-set-options first requested-chassis=hv1
> > +   check ovn-nbctl lsp-set-options second requested-chassis=hv2
> > +   check ovn-nbctl lsp-set-options multi1 requested-chassis=hv1,hv2
> > +   check ovn-nbctl lsp-set-options multi2 requested-chassis=hv1,hv2
> > +
> > +   as hv1 check ovs-vsctl -- add-port br-int first -- \
> > +       set Interface first external-ids:iface-id=first \
> > +       options:tx_pcap=hv1/first-tx.pcap \
> > +       options:rxq_pcap=hv1/first-rx.pcap \
> > +       ofport-request=1
> > +   as hv2 check ovs-vsctl -- add-port br-int second -- \
> > +       set Interface second external-ids:iface-id=second \
> > +       options:tx_pcap=hv2/second-tx.pcap \
> > +       options:rxq_pcap=hv2/second-rx.pcap \
> > +       ofport-request=2
> > +
> > +   # Create Migrator interfaces on both hv1 and hv2
> > +   for hv in hv1 hv2; do
> > +       for i in 1 2; do
> > +           as $hv check ovs-vsctl -- add-port br-int multi${i} -- \
> > +               set Interface multi${i} external-ids:iface-id=multi${i} \
> > +               options:tx_pcap=$hv/multi${i}-tx.pcap \
> > +               options:rxq_pcap=$hv/multi${i}-rx.pcap \
> > +               ofport-request=${i}00
> > +       done
> > +   done
> > +
> > +   send_ip_packet() {
> > +       local inport=${1} hv=${2} eth_src=${3} eth_dst=${4} ipv4_src=${5} ipv4_dst=${6} data=${7} fail=${8}
> > +       packet=$(fmt_pkt "
> > +           Ether(dst='${eth_dst}', src='${eth_src}') /
> > +           IP(src='${ipv4_src}', dst='${ipv4_dst}') /
> > +           ICMP(type=8) / bytes.fromhex('${data}')
> > +       ")
> > +       as hv${hv} ovs-appctl netdev-dummy/receive ${inport} ${packet}
> > +       if [[ x"${fail}" != x0 ]]; then
> > +         original_ip_frame=$(fmt_pkt "
> > +           IP(src='${ipv4_src}', dst='${ipv4_dst}') /
> > +           ICMP(type=8) / bytes.fromhex('${data}')
> > +         ")
> > +         # IP(flags=2) means DF (Don't Fragment) = 1
> > +         # ICMP(type=3, code=4) means Destination Unreachable, Fragmentation Needed
> > +         packet=$(fmt_pkt "
> > +             Ether(dst='${eth_src}', src='${eth_dst}') /
> > +             IP(src='${ipv4_dst}', dst='${ipv4_src}', ttl=255, flags=2, id=0) /
> > +             ICMP(type=3, code=4, nexthopmtu=$3) /
> > +             bytes.fromhex('${original_ip_frame:0:$((534 * 2))}')
> > +         ")
> > +       fi
> > +       echo ${packet}
> > +   }
> > +
> > +   send_ip6_packet() {
> > +       local inport=${1} hv=${2} eth_src=${3} eth_dst=${4} ipv6_src=${5} ipv6_dst=${6} data=${7} fail=${8}
> > +       packet=$(fmt_pkt "
> > +           Ether(dst='${eth_dst}', src='${eth_src}') /
> > +           IPv6(src='${ipv6_src}', dst='${ipv6_dst}') /
> > +           ICMPv6EchoRequest() / bytes.fromhex('${data}')
> > +       ")
> > +       as hv${hv} ovs-appctl netdev-dummy/receive ${inport} ${packet}
> > +       if [[ x"${fail}" != x0 ]]; then
> > +         original_ip_frame=$(fmt_pkt "
> > +           IPv6(src='${ipv6_src}', dst='${ipv6_dst}') /
> > +           ICMPv6EchoRequest() / bytes.fromhex('${data}')
> > +         ")
> > +         packet=$(fmt_pkt "
> > +             Ether(dst='${eth_src}', src='${eth_dst}') /
> > +             IPv6(src='${ipv6_dst}', dst='${ipv6_src}', hlim=255) /
> > +             ICMPv6PacketTooBig(mtu=$3) /
> > +             bytes.fromhex('${original_ip_frame:0:$((1218 * 2))}')
> > +         ")
> > +       fi
> > +       echo ${packet}
> > +   }
> > +
> > +   reset_env() {
> > +       for port in first multi1 multi2; do
> > +           as hv1 reset_pcap_file $port hv1/$port
> > +       done
> > +       for port in second multi1 multi2; do
> > +           as hv2 reset_pcap_file $port hv2/$port
> > +       done
> > +       for port in hv1/multi1 hv2/multi1 hv1/multi2 hv2/multi2 hv1/first hv2/second; do
> > +           : > $port.expected
> > +       done
> > +   }
> > +
> > +   check_pkts() {
> > +       for port in hv1/multi1 hv2/multi1 hv1/multi2 hv2/multi2 hv1/first hv2/second; do
> > +           OVN_CHECK_PACKETS_REMOVE_BROADCAST([${port}-tx.pcap], [${port}.expected])
> > +       done
> > +   }
> > +
> > +   payload() {
> > +       echo $(xxd -l ${1} -c ${1} -p < /dev/urandom)
> > +   }
> > +
> > +   wait_for_ports_up
> > +   OVN_POPULATE_ARP
> > +
> > +   reset_env
> > +
> > +   AS_BOX([Packets of proper size are delivered from multichassis to regular ports])
> > +
> > +   len=1000
> > +   packet=$(send_ip_packet multi1 1 $multi1_mac $first_mac $multi1_ip $first_ip $(payload $len) 0)
> > +   echo $packet >> hv1/first.expected
> > +
> > +   packet=$(send_ip_packet multi1 1 $multi1_mac $second_mac $multi1_ip $second_ip $(payload $len) 0)
> > +   echo $packet >> hv2/second.expected
> > +
> > +   packet=$(send_ip6_packet multi1 1 $multi1_mac $first_mac $multi1_ip6 $first_ip6 $(payload $len) 0)
> > +   echo $packet >> hv1/first.expected
> > +
> > +   packet=$(send_ip6_packet multi1 1 $multi1_mac $second_mac $multi1_ip6 $second_ip6 $(payload $len) 0)
> > +   echo $packet >> hv2/second.expected
> > +
> > +   check_pkts
> > +   reset_env
> > +
> > +   AS_BOX([Oversized packets are not delivered from multichassis to regular ports])
> > +
> > +   len=3000
> > +   packet=$(send_ip_packet multi1 1 $multi1_mac $first_mac $multi1_ip $first_ip $(payload $len) 1)
> > +   echo $packet >> hv1/multi1.expected
> > +
> > +   packet=$(send_ip_packet multi1 1 $multi1_mac $second_mac $multi1_ip $second_ip $(payload $len) 1)
> > +   echo $packet >> hv1/multi1.expected
> > +
> > +   packet=$(send_ip6_packet multi1 1 $multi1_mac $first_mac $multi1_ip6 $first_ip6 $(payload $len) 1)
> > +   echo $packet >> hv1/multi1.expected
> > +
> > +   packet=$(send_ip6_packet multi1 1 $multi1_mac $second_mac $multi1_ip6 $second_ip6 $(payload $len) 1)
> > +   echo $packet >> hv1/multi1.expected
> > +
> > +   check_pkts
> > +   reset_env
> > +
> > +   AS_BOX([Packets of proper size are delivered from regular to multichassis ports])
> > +
> > +   len=1000
> > +   packet=$(send_ip_packet first 1 $first_mac $multi1_mac $first_ip $multi1_ip $(payload $len) 0)
> > +   echo $packet >> hv1/multi1.expected
> > +   echo $packet >> hv2/multi1.expected
> > +
> > +   packet=$(send_ip_packet second 2 $second_mac $multi1_mac $second_ip $multi1_ip $(payload $len) 0)
> > +   echo $packet >> hv1/multi1.expected
> > +   echo $packet >> hv2/multi1.expected
> > +
> > +   packet=$(send_ip6_packet first 1 $first_mac $multi1_mac $first_ip6 $multi1_ip6 $(payload $len) 0)
> > +   echo $packet >> hv1/multi1.expected
> > +   echo $packet >> hv2/multi1.expected
> > +
> > +   packet=$(send_ip6_packet second 2 $second_mac $multi1_mac $second_ip6 $multi1_ip6 $(payload $len) 0)
> > +   echo $packet >> hv1/multi1.expected
> > +   echo $packet >> hv2/multi1.expected
> > +
> > +   check_pkts
> > +   reset_env
> > +
> > +   AS_BOX([Oversized packets are not delivered from regular to multichassis ports])
> > +
> > +   len=3000
> > +   packet=$(send_ip_packet first 1 $first_mac $multi1_mac $first_ip $multi1_ip $(payload $len) 1)
> > +   echo $packet >> hv1/first.expected
> > +
> > +   packet=$(send_ip_packet second 2 $second_mac $multi1_mac $second_ip $multi1_ip $(payload $len) 1)
> > +   echo $packet >> hv2/second.expected
> > +
> > +   packet=$(send_ip6_packet first 1 $first_mac $multi1_mac $first_ip6 $multi1_ip6 $(payload $len) 1)
> > +   echo $packet >> hv1/first.expected
> > +
> > +   packet=$(send_ip6_packet second 2 $second_mac $multi1_mac $second_ip6 $multi1_ip6 $(payload $len) 1)
> > +   echo $packet >> hv2/second.expected
> > +
> > +   check_pkts
> > +   reset_env
> > +
> > +   AS_BOX([Packets of proper size are delivered from multichassis to multichassis ports])
> > +
> > +   len=1000
> > +   packet=$(send_ip_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip $multi2_ip $(payload $len) 0)
> > +   echo $packet >> hv1/multi2.expected
> > +   echo $packet >> hv2/multi2.expected
> > +
> > +   packet=$(send_ip6_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip6 $multi2_ip6 $(payload $len) 0)
> > +   echo $packet >> hv1/multi2.expected
> > +   echo $packet >> hv2/multi2.expected
> > +
> > +   check_pkts
> > +   reset_env
> > +
> > +   AS_BOX([Oversized packets are not delivered from multichassis to multichassis ports])
> > +
> > +   len=3000
> > +   packet=$(send_ip_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip $multi2_ip $(payload $len) 1)
> > +   echo $packet >> hv1/multi1.expected
> > +
> > +   packet=$(send_ip6_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip6 $multi2_ip6 $(payload $len) 1)
> > +   echo $packet >> hv1/multi1.expected
> > +
> > +   check_pkts
> > +
> > +   OVN_CLEANUP([hv1],[hv2])
> > +
> > +   AT_CLEANUP
> > +   ])])
> > +
> > +# NOTE(ihar) no STT variants because it's not supported by upstream kernels
> > +MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv4], [geneve], [1424])
> > +MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv6], [geneve], [1404])
> > +MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv4], [vxlan], [1432])
> > +MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv6], [vxlan], [1412])
> > +
> >   OVN_FOR_EACH_NORTHD([
> >   AT_SETUP([options:activation-strategy for logical port])
> >   AT_KEYWORDS([multi-chassis])
>
diff mbox series

Patch

diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index c094cb74d..9359925fa 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -4083,6 +4083,9 @@  static void init_physical_ctx(struct engine_node *node,
     p_ctx->patch_ofports = &non_vif_data->patch_ofports;
     p_ctx->chassis_tunnels = &non_vif_data->chassis_tunnels;
 
+    struct controller_engine_ctx *ctrl_ctx = engine_get_context()->client_ctx;
+    p_ctx->if_mgr = ctrl_ctx->if_mgr;
+
     pflow_output_get_debug(node, &p_ctx->debug);
 }
 
diff --git a/controller/physical.c b/controller/physical.c
index 1b0482e3b..1c1018616 100644
--- a/controller/physical.c
+++ b/controller/physical.c
@@ -41,6 +41,7 @@ 
 #include "lib/ovn-sb-idl.h"
 #include "lib/ovn-util.h"
 #include "ovn/actions.h"
+#include "if-status.h"
 #include "physical.h"
 #include "pinctrl.h"
 #include "openvswitch/shash.h"
@@ -91,6 +92,7 @@  physical_register_ovs_idl(struct ovsdb_idl *ovs_idl)
 
     ovsdb_idl_add_table(ovs_idl, &ovsrec_table_interface);
     ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_name);
+    ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_mtu);
     ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_ofport);
     ovsdb_idl_track_add_column(ovs_idl, &ovsrec_interface_col_external_ids);
 }
@@ -1104,6 +1106,273 @@  setup_activation_strategy(const struct sbrec_port_binding *binding,
     }
 }
 
+static size_t
+encode_start_controller_op(enum action_opcode opcode, bool pause,
+                           uint32_t meter_id, struct ofpbuf *ofpacts)
+{
+    size_t ofs = ofpacts->size;
+
+    struct ofpact_controller *oc = ofpact_put_CONTROLLER(ofpacts);
+    oc->max_len = UINT16_MAX;
+    oc->reason = OFPR_ACTION;
+    oc->pause = pause;
+    if (!ovs_feature_is_supported(OVS_DP_METER_SUPPORT)) {
+        meter_id = NX_CTLR_NO_METER;
+    }
+    oc->meter_id = meter_id;
+
+    struct action_header ah = { .opcode = htonl(opcode) };
+    ofpbuf_put(ofpacts, &ah, sizeof ah);
+
+    return ofs;
+}
+
+static void
+encode_finish_controller_op(size_t ofs, struct ofpbuf *ofpacts)
+{
+    struct ofpact_controller *oc = ofpbuf_at_assert(ofpacts, ofs, sizeof *oc);
+    ofpacts->header = oc;
+    oc->userdata_len = ofpacts->size - (ofs + sizeof *oc);
+    ofpact_finish_CONTROLLER(ofpacts, &oc);
+}
+
+/*
+ * Insert a flow to determine if an IP packet is too big for the corresponding
+ * egress interface.
+ */
+static void
+determine_if_pkt_too_big(struct ovn_desired_flow_table *flow_table,
+                         const struct sbrec_port_binding *binding,
+                         const struct sbrec_port_binding *mcp,
+                         uint16_t mtu, bool is_ipv6, int direction)
+{
+    struct ofpbuf ofpacts;
+    ofpbuf_init(&ofpacts, 0);
+
+    /* Store packet too large flag in reg9[1]. */
+    struct match match;
+    match_init_catchall(&match);
+    match_set_dl_type(&match, htons(is_ipv6 ? ETH_TYPE_IPV6 : ETH_TYPE_IP));
+    match_set_metadata(&match, htonll(binding->datapath->tunnel_key));
+    match_set_reg(&match, direction - MFF_REG0, mcp->tunnel_key);
+
+    /* reg9[1] is REGBIT_PKT_LARGER as defined by northd */
+    struct ofpact_check_pkt_larger *pkt_larger =
+        ofpact_put_CHECK_PKT_LARGER(&ofpacts);
+    pkt_larger->pkt_len = mtu;
+    pkt_larger->dst.field = mf_from_id(MFF_REG9);
+    pkt_larger->dst.ofs = 1;
+
+    put_resubmit(OFTABLE_OUTPUT_LARGE_PKT_PROCESS, &ofpacts);
+    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_DETECT, 100,
+                    binding->header_.uuid.parts[0], &match, &ofpacts,
+                    &binding->header_.uuid);
+    ofpbuf_uninit(&ofpacts);
+}
+
+/*
+ * Insert a flow to reply with ICMP error for IP packets that are too big for
+ * the corresponding egress interface.
+ */
+/*
+ * NOTE(ihrachys) This reimplements icmp_error as found in
+ * build_icmperr_pkt_big_flows. We may look into reusing the existing OVN
+ * action for this flow in the future.
+ */
+static void
+reply_imcp_error_if_pkt_too_big(struct ovn_desired_flow_table *flow_table,
+                                const struct sbrec_port_binding *binding,
+                                const struct sbrec_port_binding *mcp,
+                                uint16_t mtu, bool is_ipv6, int direction)
+{
+    struct match match;
+    match_init_catchall(&match);
+    match_set_dl_type(&match, htons(is_ipv6 ? ETH_TYPE_IPV6 : ETH_TYPE_IP));
+    match_set_metadata(&match, htonll(binding->datapath->tunnel_key));
+    match_set_reg(&match, direction - MFF_REG0, mcp->tunnel_key);
+    match_set_reg_masked(&match, MFF_REG9 - MFF_REG0, 1 << 1, 1 << 1);
+
+    /* Return ICMP error with a part of the original IP packet included. */
+    struct ofpbuf ofpacts;
+    ofpbuf_init(&ofpacts, 0);
+    size_t oc_offset = encode_start_controller_op(
+        ACTION_OPCODE_ICMP, true, NX_CTLR_NO_METER, &ofpacts);
+
+    struct ofpbuf inner_ofpacts;
+    ofpbuf_init(&inner_ofpacts, 0);
+
+    /* The error packet is no longer too large, set REGBIT_PKT_LARGER = 0 */
+    /* reg9[1] is REGBIT_PKT_LARGER as defined by northd */
+    ovs_be32 value = htonl(0);
+    ovs_be32 mask = htonl(1 << 1);
+    ofpact_put_set_field(
+        &inner_ofpacts, mf_from_id(MFF_REG9), &value, &mask);
+
+    /* The new error packet is delivered locally */
+    /* REGBIT_EGRESS_LOOPBACK = 1 */
+    value = htonl(1 << MLF_ALLOW_LOOPBACK_BIT);
+    mask = htonl(1 << MLF_ALLOW_LOOPBACK_BIT);
+    ofpact_put_set_field(
+        &inner_ofpacts, mf_from_id(MFF_LOG_FLAGS), &value, &mask);
+
+    /* eth.src <-> eth.dst */
+    put_stack(MFF_ETH_DST, ofpact_put_STACK_PUSH(&inner_ofpacts));
+    put_stack(MFF_ETH_SRC, ofpact_put_STACK_PUSH(&inner_ofpacts));
+    put_stack(MFF_ETH_DST, ofpact_put_STACK_POP(&inner_ofpacts));
+    put_stack(MFF_ETH_SRC, ofpact_put_STACK_POP(&inner_ofpacts));
+
+    /* ip.src <-> ip.dst */
+    put_stack(is_ipv6 ? MFF_IPV6_DST : MFF_IPV4_DST,
+        ofpact_put_STACK_PUSH(&inner_ofpacts));
+    put_stack(is_ipv6 ? MFF_IPV6_SRC : MFF_IPV4_SRC,
+        ofpact_put_STACK_PUSH(&inner_ofpacts));
+    put_stack(is_ipv6 ? MFF_IPV6_DST : MFF_IPV4_DST,
+        ofpact_put_STACK_POP(&inner_ofpacts));
+    put_stack(is_ipv6 ? MFF_IPV6_SRC : MFF_IPV4_SRC,
+        ofpact_put_STACK_POP(&inner_ofpacts));
+
+    /* ip.ttl = 255 */
+    struct ofpact_ip_ttl *ip_ttl = ofpact_put_SET_IP_TTL(&inner_ofpacts);
+    ip_ttl->ttl = 255;
+
+    uint16_t frag_mtu = mtu - ETHERNET_OVERHEAD;
+    size_t frag_mtu_oc_offset;
+    if (is_ipv6) {
+        /* icmp6.type = 2 (Packet Too Big) */
+        /* icmp6.code = 0 */
+        uint8_t icmp_type = 2;
+        uint8_t icmp_code = 0;
+        ofpact_put_set_field(
+            &inner_ofpacts, mf_from_id(MFF_ICMPV6_TYPE), &icmp_type, NULL);
+        ofpact_put_set_field(
+            &inner_ofpacts, mf_from_id(MFF_ICMPV6_CODE), &icmp_code, NULL);
+
+        /* icmp6.frag_mtu */
+        frag_mtu_oc_offset = encode_start_controller_op(
+            ACTION_OPCODE_PUT_ICMP6_FRAG_MTU, true, NX_CTLR_NO_METER,
+            &inner_ofpacts);
+        ovs_be32 frag_mtu_ovs = htonl(frag_mtu);
+        ofpbuf_put(&inner_ofpacts, &frag_mtu_ovs, sizeof(frag_mtu_ovs));
+    } else {
+        /* icmp4.type = 3 (Destination Unreachable) */
+        /* icmp4.code = 4 (Fragmentation Needed) */
+        uint8_t icmp_type = 3;
+        uint8_t icmp_code = 4;
+        ofpact_put_set_field(
+            &inner_ofpacts, mf_from_id(MFF_ICMPV4_TYPE), &icmp_type, NULL);
+        ofpact_put_set_field(
+            &inner_ofpacts, mf_from_id(MFF_ICMPV4_CODE), &icmp_code, NULL);
+
+        /* icmp4.frag_mtu = */
+        frag_mtu_oc_offset = encode_start_controller_op(
+            ACTION_OPCODE_PUT_ICMP4_FRAG_MTU, true, NX_CTLR_NO_METER,
+            &inner_ofpacts);
+        ovs_be16 frag_mtu_ovs = htons(frag_mtu);
+        ofpbuf_put(&inner_ofpacts, &frag_mtu_ovs, sizeof(frag_mtu_ovs));
+    }
+    encode_finish_controller_op(frag_mtu_oc_offset, &inner_ofpacts);
+
+    /* Finally, submit the ICMP error back to the ingress pipeline */
+    put_resubmit(OFTABLE_LOG_INGRESS_PIPELINE, &inner_ofpacts);
+
+    /* Attach nested actions to ICMP error controller handler */
+    ofpacts_put_openflow_actions(inner_ofpacts.data, inner_ofpacts.size,
+                                 &ofpacts, OFP15_VERSION);
+
+    /* Finalize the ICMP error controller handler */
+    encode_finish_controller_op(oc_offset, &ofpacts);
+
+    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 100,
+                    binding->header_.uuid.parts[0], &match, &ofpacts,
+                    &binding->header_.uuid);
+
+    ofpbuf_uninit(&inner_ofpacts);
+    ofpbuf_uninit(&ofpacts);
+}
+
+static uint16_t
+get_tunnel_overhead(struct chassis_tunnel const *tun)
+{
+    uint16_t overhead = 0;
+    enum chassis_tunnel_type type = tun->type;
+    if (type == GENEVE) {
+        overhead += GENEVE_TUNNEL_OVERHEAD;
+    } else if (type == STT) {
+        overhead += STT_TUNNEL_OVERHEAD;
+    } else if (type == VXLAN) {
+        overhead += VXLAN_TUNNEL_OVERHEAD;
+    } else {
+        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
+        VLOG_WARN_RL(&rl, "Unknown tunnel type %d, can't determine overhead "
+                          "size for Path MTU Discovery", type);
+        return 0;
+    }
+    overhead += tun->is_ipv6? IPV6_HEADER_LEN : IPV4_HEADER_LEN;
+    return overhead;
+}
+
+static uint16_t
+get_effective_mtu(const struct sbrec_port_binding *mcp,
+                  struct ovs_list *remote_tunnels,
+                  const struct if_status_mgr *if_mgr)
+{
+    /* Use interface MTU as a base for calculation */
+    uint16_t iface_mtu = if_status_mgr_iface_get_mtu(if_mgr,
+                                                     mcp->logical_port);
+    if (!iface_mtu) {
+        return 0;
+    }
+
+    /* Iterate over all peer tunnels and find the biggest tunnel overhead */
+    uint16_t overhead = 0;
+    struct tunnel *tun;
+    LIST_FOR_EACH (tun, list_node, remote_tunnels) {
+        uint16_t tunnel_overhead = get_tunnel_overhead(tun->tun);
+        if (tunnel_overhead > overhead) {
+            overhead = tunnel_overhead;
+        }
+    }
+    if (!overhead) {
+        return 0;
+    }
+
+    return iface_mtu - overhead;
+}
+
+static void
+handle_pkt_too_big_for_ip_version(struct ovn_desired_flow_table *flow_table,
+                                  const struct sbrec_port_binding *binding,
+                                  const struct sbrec_port_binding *mcp,
+                                  uint16_t mtu, bool is_ipv6)
+{
+    /* ingress */
+    determine_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
+                             MFF_LOG_INPORT);
+    reply_imcp_error_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
+                                    MFF_LOG_INPORT);
+
+    /* egress */
+    determine_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
+                             MFF_LOG_OUTPORT);
+    reply_imcp_error_if_pkt_too_big(flow_table, binding, mcp, mtu, is_ipv6,
+                                    MFF_LOG_OUTPORT);
+}
+
+static void
+handle_pkt_too_big(struct ovn_desired_flow_table *flow_table,
+                   struct ovs_list *remote_tunnels,
+                   const struct sbrec_port_binding *binding,
+                   const struct sbrec_port_binding *mcp,
+                   const struct if_status_mgr *if_mgr)
+{
+    uint16_t mtu = get_effective_mtu(mcp, remote_tunnels, if_mgr);
+    if (!mtu) {
+        return;
+    }
+    handle_pkt_too_big_for_ip_version(flow_table, binding, mcp, mtu, false);
+    handle_pkt_too_big_for_ip_version(flow_table, binding, mcp, mtu, true);
+}
+
 static void
 enforce_tunneling_for_multichassis_ports(
     struct local_datapath *ld,
@@ -1111,7 +1380,8 @@  enforce_tunneling_for_multichassis_ports(
     const struct sbrec_chassis *chassis,
     const struct hmap *chassis_tunnels,
     enum mf_field_id mff_ovn_geneve,
-    struct ovn_desired_flow_table *flow_table)
+    struct ovn_desired_flow_table *flow_table,
+    const struct if_status_mgr *if_mgr)
 {
     if (shash_is_empty(&ld->multichassis_ports)) {
         return;
@@ -1156,6 +1426,8 @@  enforce_tunneling_for_multichassis_ports(
                         binding->header_.uuid.parts[0], &match, &ofpacts,
                         &binding->header_.uuid);
         ofpbuf_uninit(&ofpacts);
+
+        handle_pkt_too_big(flow_table, tuns, binding, mcp, if_mgr);
     }
 
     struct tunnel *tun_elem;
@@ -1177,6 +1449,7 @@  consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
                       const struct sbrec_port_binding *binding,
                       const struct sbrec_chassis *chassis,
                       const struct physical_debug *debug,
+                      const struct if_status_mgr *if_mgr,
                       struct ovn_desired_flow_table *flow_table,
                       struct ofpbuf *ofpacts_p)
 {
@@ -1602,8 +1875,10 @@  consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
                         binding->header_.uuid.parts[0],
                         &match, ofpacts_p, &binding->header_.uuid);
 
-        enforce_tunneling_for_multichassis_ports(
-            ld, binding, chassis, chassis_tunnels, mff_ovn_geneve, flow_table);
+        enforce_tunneling_for_multichassis_ports(ld, binding, chassis,
+                                                 chassis_tunnels,
+                                                 mff_ovn_geneve, flow_table,
+                                                 if_mgr);
 
         /* No more tunneling to set up. */
         goto out;
@@ -1908,7 +2183,7 @@  physical_eval_port_binding(struct physical_ctx *p_ctx,
                           p_ctx->patch_ofports,
                           p_ctx->chassis_tunnels,
                           pb, p_ctx->chassis, &p_ctx->debug,
-                          flow_table, &ofpacts);
+                          p_ctx->if_mgr, flow_table, &ofpacts);
     ofpbuf_uninit(&ofpacts);
 }
 
@@ -2032,7 +2307,7 @@  physical_run(struct physical_ctx *p_ctx,
                               p_ctx->patch_ofports,
                               p_ctx->chassis_tunnels, binding,
                               p_ctx->chassis, &p_ctx->debug,
-                              flow_table, &ofpacts);
+                              p_ctx->if_mgr, flow_table, &ofpacts);
     }
 
     /* Handle output to multicast groups, in tables 40 and 41. */
@@ -2176,11 +2451,19 @@  physical_run(struct physical_ctx *p_ctx,
     ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_DETECT, 0, 0, &match,
                     &ofpacts, hc_uuid);
 
+    match_init_catchall(&match);
+    match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0,
+                         MLF_ALLOW_LOOPBACK, MLF_ALLOW_LOOPBACK);
+    ofpbuf_clear(&ofpacts);
+    put_resubmit(OFTABLE_LOCAL_OUTPUT, &ofpacts);
+    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 10, 0,
+                    &match, &ofpacts, hc_uuid);
+
     match_init_catchall(&match);
     ofpbuf_clear(&ofpacts);
     put_resubmit(OFTABLE_REMOTE_OUTPUT, &ofpacts);
-    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 0, 0, &match,
-                    &ofpacts, hc_uuid);
+    ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 0, 0,
+                    &match, &ofpacts, hc_uuid);
 
     /* Table 40, priority 150.
      * =======================
diff --git a/controller/physical.h b/controller/physical.h
index f450dca94..396bcb138 100644
--- a/controller/physical.h
+++ b/controller/physical.h
@@ -57,6 +57,7 @@  struct physical_ctx {
     const struct sbrec_chassis_table *chassis_table;
     const struct sbrec_chassis *chassis;
     const struct sset *active_tunnels;
+    const struct if_status_mgr *if_mgr;
     struct hmap *local_datapaths;
     struct sset *local_lports;
     const struct simap *ct_zones;
diff --git a/lib/ovn-util.h b/lib/ovn-util.h
index 7cf861dbc..7ec2bca48 100644
--- a/lib/ovn-util.h
+++ b/lib/ovn-util.h
@@ -29,6 +29,17 @@ 
 #define ROUTE_ORIGIN_CONNECTED "connected"
 #define ROUTE_ORIGIN_STATIC "static"
 
+#define ETH_HEADER_LENGTH 14
+#define ETH_CRC_LENGTH 4
+#define ETHERNET_OVERHEAD (ETH_HEADER_LENGTH + ETH_CRC_LENGTH)
+
+#define IPV4_HEADER_LEN 20
+#define IPV6_HEADER_LEN 40
+
+#define GENEVE_TUNNEL_OVERHEAD 38
+#define STT_TUNNEL_OVERHEAD 18
+#define VXLAN_TUNNEL_OVERHEAD 30
+
 struct eth_addr;
 struct nbrec_logical_router_port;
 struct ovsrec_flow_sample_collector_set_table;
diff --git a/tests/ovn.at b/tests/ovn.at
index b0439d99e..99ce3dd90 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -15194,6 +15194,268 @@  OVN_CLEANUP([hv1],[hv2],[hv3])
 AT_CLEANUP
 ])
 
+m4_define([MULTICHASSIS_PATH_MTU_DISCOVERY_TEST],
+  [OVN_FOR_EACH_NORTHD([
+   AT_SETUP([localnet connectivity with multiple requested-chassis, path mtu discovery (ip=$1, tunnel=$2, mtu=$3)])
+   AT_KEYWORDS([multi-chassis])
+
+   ovn_start
+
+   net_add n1
+   for i in 1 2; do
+       sim_add hv$i
+       as hv$i
+       check ovs-vsctl add-br br-phys
+       if test "x$1" = "xipv6"; then
+           ovn_attach n1 br-phys fd00::$i 64 $2
+       else
+           ovn_attach n1 br-phys 192.168.0.$i 24 $2
+       fi
+       check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
+   done
+
+   first_mac=00:00:00:00:00:01
+   second_mac=00:00:00:00:00:02
+   multi1_mac=00:00:00:00:00:f0
+   multi2_mac=00:00:00:00:00:f1
+   first_ip=10.0.0.1
+   second_ip=10.0.0.2
+   multi1_ip=10.0.0.10
+   multi2_ip=10.0.0.20
+   first_ip6=abcd::1
+   second_ip6=abcd::2
+   multi1_ip6=abcd::f0
+   multi2_ip6=abcd::f1
+
+   check ovn-nbctl ls-add ls0
+   check ovn-nbctl lsp-add ls0 first
+   check ovn-nbctl lsp-add ls0 second
+   check ovn-nbctl lsp-add ls0 multi1
+   check ovn-nbctl lsp-add ls0 multi2
+   check ovn-nbctl lsp-set-addresses first "${first_mac} ${first_ip} ${first_ip6}"
+   check ovn-nbctl lsp-set-addresses second "${second_mac} ${second_ip} ${second_ip6}"
+   check ovn-nbctl lsp-set-addresses multi1 "${multi1_mac} ${multi1_ip} ${multi1_ip6}"
+   check ovn-nbctl lsp-set-addresses multi2 "${multi2_mac} ${multi2_ip} ${multi2_ip6}"
+
+   check ovn-nbctl lsp-add ls0 public
+   check ovn-nbctl lsp-set-type public localnet
+   check ovn-nbctl lsp-set-addresses public unknown
+   check ovn-nbctl lsp-set-options public network_name=phys
+
+   check ovn-nbctl lsp-set-options first requested-chassis=hv1
+   check ovn-nbctl lsp-set-options second requested-chassis=hv2
+   check ovn-nbctl lsp-set-options multi1 requested-chassis=hv1,hv2
+   check ovn-nbctl lsp-set-options multi2 requested-chassis=hv1,hv2
+
+   as hv1 check ovs-vsctl -- add-port br-int first -- \
+       set Interface first external-ids:iface-id=first \
+       options:tx_pcap=hv1/first-tx.pcap \
+       options:rxq_pcap=hv1/first-rx.pcap \
+       ofport-request=1
+   as hv2 check ovs-vsctl -- add-port br-int second -- \
+       set Interface second external-ids:iface-id=second \
+       options:tx_pcap=hv2/second-tx.pcap \
+       options:rxq_pcap=hv2/second-rx.pcap \
+       ofport-request=2
+
+   # Create Migrator interfaces on both hv1 and hv2
+   for hv in hv1 hv2; do
+       for i in 1 2; do
+           as $hv check ovs-vsctl -- add-port br-int multi${i} -- \
+               set Interface multi${i} external-ids:iface-id=multi${i} \
+               options:tx_pcap=$hv/multi${i}-tx.pcap \
+               options:rxq_pcap=$hv/multi${i}-rx.pcap \
+               ofport-request=${i}00
+       done
+   done
+
+   send_ip_packet() {
+       local inport=${1} hv=${2} eth_src=${3} eth_dst=${4} ipv4_src=${5} ipv4_dst=${6} data=${7} fail=${8}
+       packet=$(fmt_pkt "
+           Ether(dst='${eth_dst}', src='${eth_src}') /
+           IP(src='${ipv4_src}', dst='${ipv4_dst}') /
+           ICMP(type=8) / bytes.fromhex('${data}')
+       ")
+       as hv${hv} ovs-appctl netdev-dummy/receive ${inport} ${packet}
+       if [[ x"${fail}" != x0 ]]; then
+         original_ip_frame=$(fmt_pkt "
+           IP(src='${ipv4_src}', dst='${ipv4_dst}') /
+           ICMP(type=8) / bytes.fromhex('${data}')
+         ")
+         # IP(flags=2) means DF (Don't Fragment) = 1
+         # ICMP(type=3, code=4) means Destination Unreachable, Fragmentation Needed
+         packet=$(fmt_pkt "
+             Ether(dst='${eth_src}', src='${eth_dst}') /
+             IP(src='${ipv4_dst}', dst='${ipv4_src}', ttl=255, flags=2, id=0) /
+             ICMP(type=3, code=4, nexthopmtu=$3) /
+             bytes.fromhex('${original_ip_frame:0:$((534 * 2))}')
+         ")
+       fi
+       echo ${packet}
+   }
+
+   send_ip6_packet() {
+       local inport=${1} hv=${2} eth_src=${3} eth_dst=${4} ipv6_src=${5} ipv6_dst=${6} data=${7} fail=${8}
+       packet=$(fmt_pkt "
+           Ether(dst='${eth_dst}', src='${eth_src}') /
+           IPv6(src='${ipv6_src}', dst='${ipv6_dst}') /
+           ICMPv6EchoRequest() / bytes.fromhex('${data}')
+       ")
+       as hv${hv} ovs-appctl netdev-dummy/receive ${inport} ${packet}
+       if [[ x"${fail}" != x0 ]]; then
+         original_ip_frame=$(fmt_pkt "
+           IPv6(src='${ipv6_src}', dst='${ipv6_dst}') /
+           ICMPv6EchoRequest() / bytes.fromhex('${data}')
+         ")
+         packet=$(fmt_pkt "
+             Ether(dst='${eth_src}', src='${eth_dst}') /
+             IPv6(src='${ipv6_dst}', dst='${ipv6_src}', hlim=255) /
+             ICMPv6PacketTooBig(mtu=$3) /
+             bytes.fromhex('${original_ip_frame:0:$((1218 * 2))}')
+         ")
+       fi
+       echo ${packet}
+   }
+
+   reset_env() {
+       for port in first multi1 multi2; do
+           as hv1 reset_pcap_file $port hv1/$port
+       done
+       for port in second multi1 multi2; do
+           as hv2 reset_pcap_file $port hv2/$port
+       done
+       for port in hv1/multi1 hv2/multi1 hv1/multi2 hv2/multi2 hv1/first hv2/second; do
+           : > $port.expected
+       done
+   }
+
+   check_pkts() {
+       for port in hv1/multi1 hv2/multi1 hv1/multi2 hv2/multi2 hv1/first hv2/second; do
+           OVN_CHECK_PACKETS_REMOVE_BROADCAST([${port}-tx.pcap], [${port}.expected])
+       done
+   }
+
+   payload() {
+       echo $(xxd -l ${1} -c ${1} -p < /dev/urandom)
+   }
+
+   wait_for_ports_up
+   OVN_POPULATE_ARP
+
+   reset_env
+
+   AS_BOX([Packets of proper size are delivered from multichassis to regular ports])
+
+   len=1000
+   packet=$(send_ip_packet multi1 1 $multi1_mac $first_mac $multi1_ip $first_ip $(payload $len) 0)
+   echo $packet >> hv1/first.expected
+
+   packet=$(send_ip_packet multi1 1 $multi1_mac $second_mac $multi1_ip $second_ip $(payload $len) 0)
+   echo $packet >> hv2/second.expected
+
+   packet=$(send_ip6_packet multi1 1 $multi1_mac $first_mac $multi1_ip6 $first_ip6 $(payload $len) 0)
+   echo $packet >> hv1/first.expected
+
+   packet=$(send_ip6_packet multi1 1 $multi1_mac $second_mac $multi1_ip6 $second_ip6 $(payload $len) 0)
+   echo $packet >> hv2/second.expected
+
+   check_pkts
+   reset_env
+
+   AS_BOX([Oversized packets are not delivered from multichassis to regular ports])
+
+   len=3000
+   packet=$(send_ip_packet multi1 1 $multi1_mac $first_mac $multi1_ip $first_ip $(payload $len) 1)
+   echo $packet >> hv1/multi1.expected
+
+   packet=$(send_ip_packet multi1 1 $multi1_mac $second_mac $multi1_ip $second_ip $(payload $len) 1)
+   echo $packet >> hv1/multi1.expected
+
+   packet=$(send_ip6_packet multi1 1 $multi1_mac $first_mac $multi1_ip6 $first_ip6 $(payload $len) 1)
+   echo $packet >> hv1/multi1.expected
+
+   packet=$(send_ip6_packet multi1 1 $multi1_mac $second_mac $multi1_ip6 $second_ip6 $(payload $len) 1)
+   echo $packet >> hv1/multi1.expected
+
+   check_pkts
+   reset_env
+
+   AS_BOX([Packets of proper size are delivered from regular to multichassis ports])
+
+   len=1000
+   packet=$(send_ip_packet first 1 $first_mac $multi1_mac $first_ip $multi1_ip $(payload $len) 0)
+   echo $packet >> hv1/multi1.expected
+   echo $packet >> hv2/multi1.expected
+
+   packet=$(send_ip_packet second 2 $second_mac $multi1_mac $second_ip $multi1_ip $(payload $len) 0)
+   echo $packet >> hv1/multi1.expected
+   echo $packet >> hv2/multi1.expected
+
+   packet=$(send_ip6_packet first 1 $first_mac $multi1_mac $first_ip6 $multi1_ip6 $(payload $len) 0)
+   echo $packet >> hv1/multi1.expected
+   echo $packet >> hv2/multi1.expected
+
+   packet=$(send_ip6_packet second 2 $second_mac $multi1_mac $second_ip6 $multi1_ip6 $(payload $len) 0)
+   echo $packet >> hv1/multi1.expected
+   echo $packet >> hv2/multi1.expected
+
+   check_pkts
+   reset_env
+
+   AS_BOX([Oversized packets are not delivered from regular to multichassis ports])
+
+   len=3000
+   packet=$(send_ip_packet first 1 $first_mac $multi1_mac $first_ip $multi1_ip $(payload $len) 1)
+   echo $packet >> hv1/first.expected
+
+   packet=$(send_ip_packet second 2 $second_mac $multi1_mac $second_ip $multi1_ip $(payload $len) 1)
+   echo $packet >> hv2/second.expected
+
+   packet=$(send_ip6_packet first 1 $first_mac $multi1_mac $first_ip6 $multi1_ip6 $(payload $len) 1)
+   echo $packet >> hv1/first.expected
+
+   packet=$(send_ip6_packet second 2 $second_mac $multi1_mac $second_ip6 $multi1_ip6 $(payload $len) 1)
+   echo $packet >> hv2/second.expected
+
+   check_pkts
+   reset_env
+
+   AS_BOX([Packets of proper size are delivered from multichassis to multichassis ports])
+
+   len=1000
+   packet=$(send_ip_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip $multi2_ip $(payload $len) 0)
+   echo $packet >> hv1/multi2.expected
+   echo $packet >> hv2/multi2.expected
+
+   packet=$(send_ip6_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip6 $multi2_ip6 $(payload $len) 0)
+   echo $packet >> hv1/multi2.expected
+   echo $packet >> hv2/multi2.expected
+
+   check_pkts
+   reset_env
+
+   AS_BOX([Oversized packets are not delivered from multichassis to multichassis ports])
+
+   len=3000
+   packet=$(send_ip_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip $multi2_ip $(payload $len) 1)
+   echo $packet >> hv1/multi1.expected
+
+   packet=$(send_ip6_packet multi1 1 $multi1_mac $multi2_mac $multi1_ip6 $multi2_ip6 $(payload $len) 1)
+   echo $packet >> hv1/multi1.expected
+
+   check_pkts
+
+   OVN_CLEANUP([hv1],[hv2])
+
+   AT_CLEANUP
+   ])])
+
+# NOTE(ihar) no STT variants because it's not supported by upstream kernels
+MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv4], [geneve], [1424])
+MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv6], [geneve], [1404])
+MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv4], [vxlan], [1432])
+MULTICHASSIS_PATH_MTU_DISCOVERY_TEST([ipv6], [vxlan], [1412])
+
 OVN_FOR_EACH_NORTHD([
 AT_SETUP([options:activation-strategy for logical port])
 AT_KEYWORDS([multi-chassis])