
[ovs-dev,v9,06/11] Userspace datapath: Add fragmentation handling.

Message ID 1542654570-63181-7-git-send-email-dlu998@gmail.com
State Changes Requested
Series Userspace datapath: Add fragmentation support.

Commit Message

Darrell Ball Nov. 19, 2018, 7:09 p.m. UTC
Fragmentation handling is added for supporting conntrack.
Both v4 and v6 are supported.

After discussion with several people, I decided not to store
configuration state in the database, to be more consistent with
the kernel in the future, to match other conntrack configuration,
which will also not be in the database, and for overall simplicity.
Accordingly, fragmentation handling is enabled by default.

This patch enables fragmentation tests for the userspace datapath.

Signed-off-by: Darrell Ball <dlu998@gmail.com>
---
 Documentation/faq/releases.rst   |   49 +-
 NEWS                             |    2 +
 include/sparse/netinet/ip6.h     |    1 +
 lib/automake.mk                  |    4 +-
 lib/conntrack.c                  |   13 +-
 lib/ipf.c                        | 1363 ++++++++++++++++++++++++++++++++++++++
 lib/ipf.h                        |   33 +
 tests/system-kmod-macros.at      |   10 +-
 tests/system-traffic.at          |   30 +-
 tests/system-userspace-macros.at |   23 +-
 10 files changed, 1460 insertions(+), 68 deletions(-)
 create mode 100644 lib/ipf.c
 create mode 100644 lib/ipf.h

Comments

Darrell Ball Nov. 19, 2018, 7:51 p.m. UTC | #1
On Mon, Nov 19, 2018 at 11:11 AM Darrell Ball <dlu998@gmail.com> wrote:

> Fragmentation handling is added for supporting conntrack.
> Both v4 and v6 are supported.
>
> After discussion with several people, I decided not to store
> configuration state in the database, to be more consistent with
> the kernel in the future, to match other conntrack configuration,
> which will also not be in the database, and for overall simplicity.
> Accordingly, fragmentation handling is enabled by default.
>
> This patch enables fragmentation tests for the userspace datapath.
>
> Signed-off-by: Darrell Ball <dlu998@gmail.com>
> ---
>  Documentation/faq/releases.rst   |   49 +-
>  NEWS                             |    2 +
>  include/sparse/netinet/ip6.h     |    1 +
>  lib/automake.mk                  |    4 +-
>  lib/conntrack.c                  |   13 +-
>  lib/ipf.c                        | 1363 ++++++++++++++++++++++++++++++++++++++
>  lib/ipf.h                        |   33 +
>  tests/system-kmod-macros.at      |   10 +-
>  tests/system-traffic.at          |   30 +-
>  tests/system-userspace-macros.at |   23 +-
>  10 files changed, 1460 insertions(+), 68 deletions(-)
>  create mode 100644 lib/ipf.c
>  create mode 100644 lib/ipf.h
>
> diff --git a/Documentation/faq/releases.rst b/Documentation/faq/releases.rst
> index 96da23c..d281c97 100644
> --- a/Documentation/faq/releases.rst
> +++ b/Documentation/faq/releases.rst
> @@ -104,31 +104,30 @@ Q: Are all features available with all datapaths?
>      The following table lists the datapath supported features from an Open
>      vSwitch user's perspective.
>
> -    ===================== ============== ============== ========= =======
> -    Feature               Linux upstream Linux OVS tree Userspace Hyper-V
> -    ===================== ============== ============== ========= =======
> -    NAT                   4.6            YES            Yes       NO
> -    Connection tracking   4.3            YES            PARTIAL   PARTIAL
> -    Tunnel - LISP         NO             YES            NO        NO
> -    Tunnel - STT          NO             YES            NO        YES
> -    Tunnel - GRE          3.11           YES            YES       YES
> -    Tunnel - VXLAN        3.12           YES            YES       YES
> -    Tunnel - Geneve       3.18           YES            YES       YES
> -    Tunnel - GRE-IPv6     4.18           YES            YES       NO
> -    Tunnel - VXLAN-IPv6   4.3            YES            YES       NO
> -    Tunnel - Geneve-IPv6  4.4            YES            YES       NO
> -    Tunnel - ERSPAN       4.18           YES            YES       NO
> -    Tunnel - ERSPAN-IPv6  4.18           YES            YES       NO
> -    QoS - Policing        YES            YES            YES       NO
> -    QoS - Shaping         YES            YES            NO        NO
> -    sFlow                 YES            YES            YES       NO
> -    IPFIX                 3.10           YES            YES       NO
> -    Set action            YES            YES            YES       PARTIAL
> -    NIC Bonding           YES            YES            YES       YES
> -    Multiple VTEPs        YES            YES            YES       YES
> -    Meters                4.15           YES            YES       NO
> -    Conntrack zone limit  4.18           YES            NO        NO
> -    ===================== ============== ============== ========= =======
> +    ========================== ============== ============== ========= =======
> +    Feature                    Linux upstream Linux OVS tree Userspace Hyper-V
> +    ========================== ============== ============== ========= =======
> +    Connection tracking             4.3            YES          Yes      YES
> +    Conntrack Fragment Reass.       4.3            Yes          Yes      YES
> +    NAT                             4.6            YES          Yes      NO
>


There is a capitalization inconsistency in the existing 'NAT' entry that
got copied to a few new entries.



> +    Conntrack zone limit            4.18           YES          NO       NO
> +    Tunnel - LISP                   NO             YES          NO       NO
> +    Tunnel - STT                    NO             YES          NO       YES
> +    Tunnel - GRE                    3.11           YES          YES      YES
> +    Tunnel - VXLAN                  3.12           YES          YES      YES
> +    Tunnel - Geneve                 3.18           YES          YES      YES
> +    Tunnel - GRE-IPv6               NO             NO           YES      NO
> +    Tunnel - VXLAN-IPv6             4.3            YES          YES      NO
> +    Tunnel - Geneve-IPv6            4.4            YES          YES      NO
> +    QoS - Policing                  YES            YES          YES      NO
> +    QoS - Shaping                   YES            YES          NO       NO
> +    sFlow                           YES            YES          YES      NO
> +    IPFIX                           3.10           YES          YES      NO
> +    Set action                      YES            YES          YES      PARTIAL
> +    NIC Bonding                     YES            YES          YES      YES
> +    Multiple VTEPs                  YES            YES          YES      YES
> +    Meters                          4.15           YES          YES      NO
> +    ========================== ============== ============== ========= =======
>
>      Do note, however:
>
> diff --git a/NEWS b/NEWS
> index 02402d1..d2e8724 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -9,6 +9,8 @@ Post-v2.10.0
>     - ovn:
>       * New support for IPSEC encrypted tunnels between hypervisors.
>       * ovn-ctl: allow passing user:group ids to the OVN daemons.
> +   - Userspace datapath:
> +     * Add v4/v6 fragmentation support for conntrack.
>     - DPDK:
>       * Add option for simple round-robin based Rxq to PMD assignment.
>         It can be set with pmd-rxq-assign.
> diff --git a/include/sparse/netinet/ip6.h b/include/sparse/netinet/ip6.h
> index d2a54de..bfa637a 100644
> --- a/include/sparse/netinet/ip6.h
> +++ b/include/sparse/netinet/ip6.h
> @@ -64,5 +64,6 @@ struct ip6_frag {
>  };
>
>  #define IP6F_OFF_MASK ((OVS_FORCE ovs_be16) 0xfff8)
> +#define IP6F_MORE_FRAG ((OVS_FORCE ovs_be16) 0x0001)
>
>  #endif /* netinet/ip6.h sparse */
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 63e9d72..b24b028 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -1,4 +1,4 @@
> -# Copyright (C) 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017 Nicira, Inc.
> +# Copyright (C) 2009-2018 Nicira, Inc.
>  #
>  # Copying and distribution of this file, with or without modification,
>  # are permitted in any medium without royalty provided the copyright
> @@ -107,6 +107,8 @@ lib_libopenvswitch_la_SOURCES = \
>         lib/hmapx.h \
>         lib/id-pool.c \
>         lib/id-pool.h \
> +       lib/ipf.c \
> +       lib/ipf.h \
>         lib/jhash.c \
>         lib/jhash.h \
>         lib/json.c \
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 3f50fc8..be8debb 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -30,6 +30,7 @@
>  #include "ct-dpif.h"
>  #include "dp-packet.h"
>  #include "flow.h"
> +#include "ipf.h"
>  #include "netdev.h"
>  #include "odp-netlink.h"
>  #include "openvswitch/hmap.h"
> @@ -339,6 +340,7 @@ conntrack_init(struct conntrack *ct)
>      atomic_init(&ct->n_conn_limit, DEFAULT_N_CONN_LIMIT);
>      latch_init(&ct->clean_thread_exit);
>      ct->clean_thread = ovs_thread_create("ct_clean", clean_thread_main, ct);
> +    ipf_init();
>  }
>
>  /* Destroys the connection tracker 'ct' and frees all the allocated memory. */
> @@ -381,6 +383,7 @@ conntrack_destroy(struct conntrack *ct)
>      hindex_destroy(&ct->alg_expectation_refs);
>      ct_rwlock_unlock(&ct->resources_lock);
>      ct_rwlock_destroy(&ct->resources_lock);
> +    ipf_destroy();
>  }
>
>  static unsigned hash_to_bucket(uint32_t hash)
> @@ -1295,7 +1298,8 @@ process_one(struct conntrack *ct, struct dp_packet *pkt,
>
>  /* Sends the packets in '*pkt_batch' through the connection tracker 'ct'.  All
>   * the packets must have the same 'dl_type' (IPv4 or IPv6) and should have
> - * the l3 and and l4 offset properly set.
> + * the l3 and l4 offset properly set.  Performs fragment reassembly with
> + * the help of ipf_preprocess_conntrack().
>   *
>   * If 'commit' is true, the packets are allowed to create new entries in the
>   * connection tables.  'setmark', if not NULL, should point to a two
> @@ -1310,11 +1314,14 @@ conntrack_execute(struct conntrack *ct, struct dp_packet_batch *pkt_batch,
>                    const struct nat_action_info_t *nat_action_info,
>                    long long now)
>  {
> +    ipf_preprocess_conntrack(pkt_batch, now, dl_type, zone, ct->hash_basis);
> +
>      struct dp_packet *packet;
>      struct conn_lookup_ctx ctx;
>
>      DP_PACKET_BATCH_FOR_EACH (i, packet, pkt_batch) {
> -        if (!conn_key_extract(ct, packet, dl_type, &ctx, zone)) {
> +        if (packet->md.ct_state == CS_INVALID
> +            || !conn_key_extract(ct, packet, dl_type, &ctx, zone)) {
>              packet->md.ct_state = CS_INVALID;
>              write_ct_md(packet, zone, NULL, NULL, NULL);
>              continue;
> @@ -1323,6 +1330,8 @@ conntrack_execute(struct conntrack *ct, struct dp_packet_batch *pkt_batch,
>                      setlabel, nat_action_info, tp_src, tp_dst, helper);
>      }
>
> +    ipf_postprocess_conntrack(pkt_batch, now, dl_type);
> +
>      return 0;
>  }
>
> diff --git a/lib/ipf.c b/lib/ipf.c
> new file mode 100644
> index 0000000..cc8427e
> --- /dev/null
> +++ b/lib/ipf.c
> @@ -0,0 +1,1363 @@
> +/*
> + * Copyright (c) 2018 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +#include <ctype.h>
> +#include <errno.h>
> +#include <sys/types.h>
> +#include <netinet/in.h>
> +#include <netinet/ip6.h>
> +#include <netinet/icmp6.h>
> +#include <string.h>
> +
> +#include "coverage.h"
> +#include "csum.h"
> +#include "ipf.h"
> +#include "latch.h"
> +#include "openvswitch/hmap.h"
> +#include "openvswitch/poll-loop.h"
> +#include "openvswitch/vlog.h"
> +#include "ovs-atomic.h"
> +#include "packets.h"
> +#include "util.h"
> +
> +VLOG_DEFINE_THIS_MODULE(ipf);
> +COVERAGE_DEFINE(ipf_stuck_frag_list_purged);
> +
> +enum {
> +    IPV4_PACKET_MAX_HDR_SIZE = 60,
> +    IPV4_PACKET_MAX_SIZE = 65535,
> +    IPV6_PACKET_MAX_DATA = 65535,
> +};
> +
> +enum ipf_list_state {
> +    IPF_LIST_STATE_UNUSED,
> +    IPF_LIST_STATE_REASS_FAIL,
> +    IPF_LIST_STATE_OTHER_SEEN,
> +    IPF_LIST_STATE_FIRST_SEEN,
> +    IPF_LIST_STATE_LAST_SEEN,
> +    IPF_LIST_STATE_FIRST_LAST_SEEN,
> +    IPF_LIST_STATE_COMPLETED,
> +    IPF_LIST_STATE_NUM,
> +};
> +
> +enum ipf_list_type {
> +    IPF_FRAG_COMPLETED_LIST,
> +    IPF_FRAG_EXPIRY_LIST,
> +};
> +
> +enum {
> +    IPF_INVALID_IDX = -1,
> +    IPF_V4_FRAG_SIZE_LBOUND = 400,
> +    IPF_V4_FRAG_SIZE_MIN_DEF = 1200,
> +    IPF_V6_FRAG_SIZE_LBOUND = 400, /* Useful for testing. */
> +    IPF_V6_FRAG_SIZE_MIN_DEF = 1280,
> +    IPF_MAX_FRAGS_DEFAULT = 1000,
> +    IPF_NFRAG_UBOUND = 5000,
> +};
> +
> +enum ipf_counter_type {
> +    IPF_COUNTER_NFRAGS,
> +    IPF_COUNTER_NFRAGS_ACCEPTED,
> +    IPF_COUNTER_NFRAGS_COMPL_SENT,
> +    IPF_COUNTER_NFRAGS_EXPD_SENT,
> +    IPF_COUNTER_NFRAGS_TOO_SMALL,
> +    IPF_COUNTER_NFRAGS_OVERLAP,
> +    IPF_COUNTER_NFRAGS_PURGED,
> +};
> +
> +struct ipf_addr {
> +    union {
> +        ovs_16aligned_be32 ipv4;
> +        union ovs_16aligned_in6_addr ipv6;
> +        ovs_be32 ipv4_aligned;
> +        struct in6_addr ipv6_aligned;
> +    };
> +};
> +
> +struct ipf_frag {
> +    struct dp_packet *pkt;
> +    uint16_t start_data_byte;
> +    uint16_t end_data_byte;
> +};
> +
> +struct ipf_list_key {
> +    struct ipf_addr src_addr;
> +    struct ipf_addr dst_addr;
> +    uint32_t recirc_id;
> +    ovs_be32 ip_id;   /* V6 is 32 bits. */
> +    ovs_be16 dl_type;
> +    uint16_t zone;
> +    uint8_t nw_proto;
> +};
> +
> +struct ipf_list {
> +    struct hmap_node node;
> +    struct ovs_list list_node;
> +    struct ipf_frag *frag_list;
> +    struct ipf_list_key key;
> +    struct dp_packet *reass_execute_ctx; /* Reassembled packet. */
> +    long long expiration;          /* In milliseconds. */
> +    int last_sent_idx;             /* Last sent fragment idx. */
> +    int last_inuse_idx;            /* Last inuse fragment idx. */
> +    int size;                      /* Fragment list size. */
> +    uint8_t state;                 /* Frag list state; see ipf_list_state. */
> +};
> +
> +struct reassembled_pkt {
> +    struct ovs_list rp_list_node;
> +    struct dp_packet *pkt;
> +    struct ipf_list *list;
> +};
> +
> +struct OVS_LOCKABLE ipf_lock {
> +    struct ovs_mutex lock;
> +};
> +
> +static struct ipf_lock ipf_lock;
> +
> +static int max_v4_frag_list_size;
> +
> +static pthread_t ipf_clean_thread;
> +static struct latch ipf_clean_thread_exit;
> +
> +static struct hmap frag_lists OVS_GUARDED_BY(ipf_lock);
> +static struct ovs_list frag_exp_list OVS_GUARDED_BY(ipf_lock);
> +static struct ovs_list frag_complete_list OVS_GUARDED_BY(ipf_lock);
> +static struct ovs_list reassembled_pkt_list OVS_GUARDED_BY(ipf_lock);
> +
> +static atomic_bool ifp_v4_enabled;
> +static atomic_bool ifp_v6_enabled;
> +static atomic_uint nfrag_max;
> +/* Will be clamped above 400 bytes; the value chosen should handle
> + * alg control packets of interest that use string encoding of mutable
> + * IP fields; meaning, the control packets should not be fragmented. */
> +static atomic_uint min_v4_frag_size;
> +static atomic_uint min_v6_frag_size;
> +
> +static atomic_count nfrag;
> +
> +static atomic_uint64_t n4frag_accepted;
> +static atomic_uint64_t n4frag_completed_sent;
> +static atomic_uint64_t n4frag_expired_sent;
> +static atomic_uint64_t n4frag_too_small;
> +static atomic_uint64_t n4frag_overlap;
> +static atomic_uint64_t n4frag_purged;
> +static atomic_uint64_t n6frag_accepted;
> +static atomic_uint64_t n6frag_completed_sent;
> +static atomic_uint64_t n6frag_expired_sent;
> +static atomic_uint64_t n6frag_too_small;
> +static atomic_uint64_t n6frag_overlap;
> +static atomic_uint64_t n6frag_purged;
> +
> +static void
> +ipf_print_reass_packet(char *es, void *pkt)
> +{
> +    const unsigned char *b = (const unsigned char *) pkt;
> +    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(10, 10);
> +    VLOG_WARN_RL(&rl, "%s 91 bytes from specified part of packet "
> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X",
> +                 es, b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7], b[8],
> +                 b[9], b[10], b[11], b[12], b[13], b[14], b[15], b[16],
> +                 b[17], b[18], b[19], b[20], b[21], b[22], b[23], b[24],
> +                 b[25], b[26], b[27], b[28], b[29], b[30], b[31], b[32],
> +                 b[33], b[34], b[35], b[36], b[37], b[38], b[39], b[40],
> +                 b[41], b[42], b[43], b[44], b[45], b[46], b[47], b[48],
> +                 b[49], b[50], b[51], b[52], b[53], b[54], b[55], b[56],
> +                 b[57], b[58], b[59], b[60], b[61], b[62], b[63], b[64],
> +                 b[65], b[66], b[67], b[68], b[69], b[70], b[71], b[72],
> +                 b[73], b[74], b[75], b[76], b[77], b[78], b[79], b[80],
> +                 b[81], b[82], b[83], b[84], b[85], b[86], b[87], b[88],
> +                 b[89], b[90]);
> +}
> +
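
A long hand-written argument list like the one in ipf_print_reass_packet()
is easy to mistype. One possible alternative is to format the bytes in a
loop and log the resulting string. The sketch below is self-contained and
only illustrative: hex_dump() is a hypothetical helper, not an existing OVS
function, and it uses plain snprintf() rather than any OVS string API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not part of the patch: writes 'n' bytes of 'data'
 * as uppercase hex into 'out' (capacity 'out_size') and NUL-terminates it.
 * Returns the number of characters written, excluding the terminator. */
static size_t
hex_dump(const void *data, size_t n, char *out, size_t out_size)
{
    const unsigned char *b = data;
    size_t used = 0;

    /* Two hex digits per byte plus the terminating NUL. */
    assert(out_size >= 2 * n + 1);

    for (size_t i = 0; i < n; i++) {
        used += (size_t) snprintf(out + used, out_size - used,
                                  "%02X", b[i]);
    }
    out[used] = '\0';
    return used;
}
```

With such a helper the caller would pass the number of bytes that are
actually available, so the dump length could follow the packet size instead
of a fixed 91.
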
> +static void ipf_lock_init(struct ipf_lock *lock)
> +{
> +    ovs_mutex_init_adaptive(&lock->lock);
> +}
> +
> +static void ipf_lock_lock(struct ipf_lock *lock)
> +    OVS_ACQUIRES(lock)
> +    OVS_NO_THREAD_SAFETY_ANALYSIS
> +{
> +    ovs_mutex_lock(&lock->lock);
> +}
> +
> +static void ipf_lock_unlock(struct ipf_lock *lock)
> +    OVS_RELEASES(lock)
> +    OVS_NO_THREAD_SAFETY_ANALYSIS
> +{
> +    ovs_mutex_unlock(&lock->lock);
> +}
> +
> +static void ipf_lock_destroy(struct ipf_lock *lock)
> +{
> +    ovs_mutex_destroy(&lock->lock);
> +}
> +
> +static void
> +ipf_count(bool v6, enum ipf_counter_type cntr)
> +{
> +    switch (cntr) {
> +    case IPF_COUNTER_NFRAGS_ACCEPTED:
> +        atomic_count_inc64(v6 ? &n6frag_accepted : &n4frag_accepted);
> +        break;
> +    case IPF_COUNTER_NFRAGS_COMPL_SENT:
> +        atomic_count_inc64(v6 ? &n6frag_completed_sent
> +                         : &n4frag_completed_sent);
> +        break;
> +    case IPF_COUNTER_NFRAGS_EXPD_SENT:
> +        atomic_count_inc64(v6 ? &n6frag_expired_sent
> +                         : &n4frag_expired_sent);
> +        break;
> +    case IPF_COUNTER_NFRAGS_TOO_SMALL:
> +        atomic_count_inc64(v6 ? &n6frag_too_small : &n4frag_too_small);
> +        break;
> +    case IPF_COUNTER_NFRAGS_OVERLAP:
> +        atomic_count_inc64(v6 ? &n6frag_overlap : &n4frag_overlap);
> +        break;
> +    case IPF_COUNTER_NFRAGS_PURGED:
> +        atomic_count_inc64(v6 ? &n6frag_purged : &n4frag_purged);
> +        break;
> +    case IPF_COUNTER_NFRAGS:
> +    default:
> +        OVS_NOT_REACHED();
> +    }
> +}
> +
> +static bool
> +ipf_get_v4_enabled(void)
> +{
> +    bool ifp_v4_enabled_;
> +    atomic_read_relaxed(&ifp_v4_enabled, &ifp_v4_enabled_);
> +    return ifp_v4_enabled_;
> +}
> +
> +static bool
> +ipf_get_v6_enabled(void)
> +{
> +    bool ifp_v6_enabled_;
> +    atomic_read_relaxed(&ifp_v6_enabled, &ifp_v6_enabled_);
> +    return ifp_v6_enabled_;
> +}
> +
> +static bool
> +ipf_get_enabled(void)
> +{
> +    return ipf_get_v4_enabled() || ipf_get_v6_enabled();
> +}
> +
> +static uint32_t
> +ipf_addr_hash_add(uint32_t hash, const struct ipf_addr *addr)
> +{
> +    BUILD_ASSERT_DECL(sizeof *addr % 4 == 0);
> +    return hash_add_bytes32(hash, (const uint32_t *) addr, sizeof *addr);
> +}
> +
> +static void
> +ipf_expiry_list_add(struct ipf_list *ipf_list, long long now)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    enum {
> +        IPF_FRAG_LIST_TIMEOUT = 15000,
> +    };
> +
> +    ipf_list->expiration = now + IPF_FRAG_LIST_TIMEOUT;
> +    ovs_list_push_back(&frag_exp_list, &ipf_list->list_node);
> +}
> +
> +static void
> +ipf_completed_list_add(struct ipf_list *ipf_list)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    ovs_list_push_back(&frag_complete_list, &ipf_list->list_node);
> +}
> +
> +static void
> +ipf_reassembled_list_add(struct reassembled_pkt *rp)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    ovs_list_push_back(&reassembled_pkt_list, &rp->rp_list_node);
> +}
> +
> +static void
> +ipf_list_clean(struct ipf_list *ipf_list)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    ovs_list_remove(&ipf_list->list_node);
> +    hmap_remove(&frag_lists, &ipf_list->node);
> +    free(ipf_list->frag_list);
> +    free(ipf_list);
> +}
> +
> +static void
> +ipf_expiry_list_clean(struct ipf_list *ipf_list)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    ipf_list_clean(ipf_list);
> +}
> +
> +static void
> +ipf_completed_list_clean(struct ipf_list *ipf_list)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    ipf_list_clean(ipf_list);
> +}
> +
> +static void
> +ipf_expiry_list_remove(struct ipf_list *ipf_list)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    ovs_list_remove(&ipf_list->list_node);
> +}
> +
> +static void
> +ipf_reassembled_list_remove(struct reassembled_pkt *rp)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    ovs_list_remove(&rp->rp_list_node);
> +}
> +
> +/* Symmetric */
> +static uint32_t
> +ipf_list_key_hash(const struct ipf_list_key *key, uint32_t basis)
> +{
> +    uint32_t hsrc, hdst, hash;
> +    hsrc = hdst = basis;
> +    hsrc = ipf_addr_hash_add(hsrc, &key->src_addr);
> +    hdst = ipf_addr_hash_add(hdst, &key->dst_addr);
> +    hash = hsrc ^ hdst;
> +
> +    /* Hash the rest of the key. */
> +    hash = hash_words((uint32_t *) (&key->dst_addr + 1),
> +                      (uint32_t *) (key + 1) -
> +                          (uint32_t *) (&key->dst_addr + 1),
> +                      hash);
> +
> +    return hash_finish(hash, 0);
> +}
> +
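
The "Symmetric" comment on ipf_list_key_hash() refers to the source and
destination addresses being hashed separately from the same basis and then
combined with XOR, so swapping them gives the same result. A minimal sketch
of that property follows. toy_mix() is an assumed stand-in for
hash_add_bytes32() and addresses are reduced to one 32-bit word, purely to
keep the example self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 32-bit mixer standing in for hash_add_bytes32(); an assumption made
 * only so that this example compiles on its own. */
static uint32_t
toy_mix(uint32_t hash, uint32_t word)
{
    hash ^= word;
    hash *= 0x9E3779B1u;
    return hash ^ (hash >> 15);
}

/* Hashes each address from the same basis and XORs the results.  XOR is
 * commutative, so (src, dst) and (dst, src) hash identically. */
static uint32_t
symmetric_addr_hash(uint32_t src, uint32_t dst, uint32_t basis)
{
    uint32_t hsrc = toy_mix(basis, src);
    uint32_t hdst = toy_mix(basis, dst);

    return hsrc ^ hdst;
}

/* Usage: swapping the endpoints must not change the hash. */
static void
symmetric_addr_hash_selfcheck(void)
{
    assert(symmetric_addr_hash(0x0a000001, 0x0a000002, 42)
           == symmetric_addr_hash(0x0a000002, 0x0a000001, 42));
}
```

One consequence of this construction is that the address contribution is
zero whenever src equals dst, so such keys are distinguished only by the
remaining key fields.
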
> +static bool
> +ipf_is_first_v4_frag(const struct dp_packet *pkt)
> +{
> +    const struct ip_header *l3 = dp_packet_l3(pkt);
> +    if (!(l3->ip_frag_off & htons(IP_FRAG_OFF_MASK)) &&
> +        l3->ip_frag_off & htons(IP_MORE_FRAGMENTS)) {
> +        return true;
> +    }
> +    return false;
> +}
> +
> +static bool
> +ipf_is_last_v4_frag(const struct dp_packet *pkt)
> +{
> +    const struct ip_header *l3 = dp_packet_l3(pkt);
> +    if (l3->ip_frag_off & htons(IP_FRAG_OFF_MASK) &&
> +        !(l3->ip_frag_off & htons(IP_MORE_FRAGMENTS))) {
> +        return true;
> +    }
> +    return false;
> +}
> +
> +static bool
> +ipf_is_v6_frag(ovs_be16 ip6f_offlg)
> +{
> +    if (ip6f_offlg & (IP6F_OFF_MASK | IP6F_MORE_FRAG)) {
> +        return true;
> +    }
> +    return false;
> +}
> +
> +static bool
> +ipf_is_first_v6_frag(ovs_be16 ip6f_offlg)
> +{
> +    if (!(ip6f_offlg & IP6F_OFF_MASK) &&
> +        ip6f_offlg & IP6F_MORE_FRAG) {
> +        return true;
> +    }
> +    return false;
> +}
> +
> +static bool
> +ipf_is_last_v6_frag(ovs_be16 ip6f_offlg)
> +{
> +    if ((ip6f_offlg & IP6F_OFF_MASK) &&
> +        !(ip6f_offlg & IP6F_MORE_FRAG)) {
> +        return true;
> +    }
> +    return false;
> +}
> +
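
For reference, the IPv6 fragment header keeps the offset in the upper 13
bits of the offset/flags field, counted in 8-octet units, and the M flag in
the lowest bit. The sketch below decodes a host-byte-order value and mirrors
the first/last predicates above. The macro and struct names are
illustrative, and the patch itself works on network-order ovs_be16 values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define V6_OFF_MASK  0xfff8u  /* Upper 13 bits: offset in 8-octet units. */
#define V6_MORE_FRAG 0x0001u  /* Lowest bit: M (more fragments) flag. */

struct v6_frag_info {
    uint16_t offset_bytes;    /* Byte offset of this fragment's data. */
    bool more;                /* M flag set. */
    bool is_frag;             /* Nonzero offset or M flag set. */
    bool first;               /* Offset 0 and M set. */
    bool last;                /* Nonzero offset and M clear. */
};

/* 'offlg' is the offset/flags field already converted to host order. */
static struct v6_frag_info
v6_frag_decode(uint16_t offlg)
{
    struct v6_frag_info fi;

    /* Masking (without shifting) leaves offset * 8, i.e. a byte offset. */
    fi.offset_bytes = offlg & V6_OFF_MASK;
    assert(fi.offset_bytes % 8 == 0);

    fi.more = (offlg & V6_MORE_FRAG) != 0;
    fi.is_frag = fi.offset_bytes != 0 || fi.more;
    fi.first = fi.offset_bytes == 0 && fi.more;
    fi.last = fi.offset_bytes != 0 && !fi.more;
    return fi;
}
```
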
> +static bool
> +ipf_list_complete(const struct ipf_list *ipf_list)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    for (int i = 0; i < ipf_list->last_inuse_idx; i++) {
> +        if (ipf_list->frag_list[i].end_data_byte + 1
> +            != ipf_list->frag_list[i + 1].start_data_byte) {
> +            return false;
> +        }
> +    }
> +    return true;
> +}
> +
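
The completeness test in ipf_list_complete() relies on the list already
being sorted: the fragments are complete when each one starts exactly one
byte after the previous one ends. A standalone sketch of the same check,
using a simplified stand-in for struct ipf_frag (the names here are
illustrative only):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct byte_range {          /* Simplified stand-in for struct ipf_frag. */
    uint16_t start;          /* First data byte covered. */
    uint16_t end;            /* Last data byte covered, inclusive. */
};

/* Returns true if the 'n' ranges, sorted by 'start', cover one contiguous
 * span with no gaps or overlaps. */
static bool
ranges_contiguous(const struct byte_range *r, size_t n)
{
    assert(n > 0);

    for (size_t i = 0; i + 1 < n; i++) {
        /* Widen before adding 1 so that end == 65535 cannot wrap. */
        if ((uint32_t) r[i].end + 1 != r[i + 1].start) {
            return false;
        }
    }
    return true;
}
```

For example, the ranges 0-1199, 1200-2399 and 2400-2999 pass, while 0-1199
followed by 1208-2399 does not.
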
> +/* Runs O(n) for a sorted or almost sorted list. */
> +static void
> +ipf_sort(struct ipf_frag *frag_list, size_t last_idx)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    int running_last_idx = 1;
> +    struct ipf_frag ipf_frag;
> +    while (running_last_idx <= last_idx) {
> +        ipf_frag = frag_list[running_last_idx];
> +        int frag_list_idx = running_last_idx - 1;
> +        while (frag_list_idx >= 0 &&
> +               frag_list[frag_list_idx].start_data_byte >
> +                   ipf_frag.start_data_byte) {
> +            frag_list[frag_list_idx + 1] = frag_list[frag_list_idx];
> +            frag_list_idx -= 1;
> +        }
> +        frag_list[frag_list_idx + 1] = ipf_frag;
> +        running_last_idx++;
> +    }
> +}
> +
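
ipf_sort() is an insertion sort keyed on start_data_byte, which is why it is
linear for lists that arrive in order or nearly so. The sketch below shows
the same algorithm on a simplified element type. It takes an element count
and uses size_t indices throughout, which is one way to avoid mixing signed
and unsigned indices. The type and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct frag_span {           /* Simplified stand-in for struct ipf_frag. */
    uint16_t start;
    uint16_t end;
};

/* Insertion sort by 'start'.  Takes the element count 'n' rather than the
 * last index, so an empty or single-element array is a no-op. */
static void
sort_frags_by_start(struct frag_span *frags, size_t n)
{
    assert(frags != NULL || n == 0);

    for (size_t i = 1; i < n; i++) {
        struct frag_span cur = frags[i];
        size_t j = i;

        /* Shift larger elements one slot right to make room for 'cur'. */
        while (j > 0 && frags[j - 1].start > cur.start) {
            frags[j] = frags[j - 1];
            j--;
        }
        frags[j] = cur;
    }
}
```
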
> +/* Called on a sorted complete list of fragments. */
> +static struct dp_packet *
> +ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    struct ipf_frag *frag_list = ipf_list->frag_list;
> +    struct dp_packet *pkt = dp_packet_clone(frag_list[0].pkt);
> +    struct ip_header *l3 = dp_packet_l3(pkt);
> +    int len = ntohs(l3->ip_tot_len);
> +    size_t add_len;
> +    size_t ip_hdr_len = IP_IHL(l3->ip_ihl_ver) * 4;
> +
> +    for (int i = 1; i <= ipf_list->last_inuse_idx; i++) {
> +        add_len = frag_list[i].end_data_byte -
> +                         frag_list[i].start_data_byte + 1;
> +        len += add_len;
> +        if (len > IPV4_PACKET_MAX_SIZE) {
> +            ipf_print_reass_packet(
> +                "Unsupported big reassembled v4 packet; v4 hdr:", l3);
> +            dp_packet_delete(pkt);
> +            return NULL;
> +        }
> +        l3 = dp_packet_l3(frag_list[i].pkt);
> +        dp_packet_put(pkt, (char *)l3 + ip_hdr_len, add_len);
> +    }
> +    l3 = dp_packet_l3(pkt);
> +    ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
> +    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> +                                new_ip_frag_off);
> +    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> +    l3->ip_tot_len = htons(len);
> +    l3->ip_frag_off = new_ip_frag_off;
> +
> +    return pkt;
> +}
> +
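
The two recalc_csum16() calls above update the IPv4 header checksum
incrementally for the changed ip_frag_off and ip_tot_len fields instead of
recomputing it over the whole header. The sketch below illustrates the
general technique from RFC 1624 on host-order 16-bit words and checks it
against a full recomputation. It is an independent illustration with
made-up function names, not the OVS implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full ones'-complement checksum over 'n' 16-bit words (host order). */
static uint16_t
csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    assert(words != NULL);
    for (size_t i = 0; i < n; i++) {
        sum += words[i];
    }
    while (sum >> 16) {                  /* Fold carries back in. */
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t) ~sum;
}

/* Incremental update per RFC 1624, eqn. 3: HC' = ~(~HC + ~m + m'),
 * where 'm' is the old field value and "m'" the new one. */
static uint16_t
csum_update16(uint16_t old_csum, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t) ~old_csum;

    sum += (uint16_t) ~old_word;
    sum += new_word;
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t) ~sum;
}
```

Changing the total-length word of a sample header and applying
csum_update16() gives the same value as zeroing the checksum field and
running csum_full() over the modified header.
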
> +/* Called on a sorted complete list of fragments. */
> +static struct dp_packet *
> +ipf_reassemble_v6_frags(struct ipf_list *ipf_list)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    struct ipf_frag *frag_list = ipf_list->frag_list;
> +    struct dp_packet *pkt = dp_packet_clone(frag_list[0].pkt);
> +    struct  ovs_16aligned_ip6_hdr *l3 = dp_packet_l3(pkt);
> +    int pl = ntohs(l3->ip6_plen) - sizeof(struct ovs_16aligned_ip6_frag);
> +    const char *tail = dp_packet_tail(pkt);
> +    uint8_t pad = dp_packet_l2_pad_size(pkt);
> +    const char *l4 = dp_packet_l4(pkt);
> +    size_t l3_size = tail - (char *)l3 - pad;
> +    size_t l4_size = tail - (char *)l4 - pad;
> +    size_t l3_hlen = l3_size - l4_size;
> +    size_t add_len;
> +
> +    for (int i = 1; i <= ipf_list->last_inuse_idx; i++) {
> +        add_len = frag_list[i].end_data_byte -
> +                          frag_list[i].start_data_byte + 1;
> +        pl += add_len;
> +        if (pl > IPV6_PACKET_MAX_DATA) {
> +            ipf_print_reass_packet(
> +                "Unsupported big reassembled v6 packet; v6 hdr:", l3);
> +            dp_packet_delete(pkt);
> +            return NULL;
> +        }
> +        l3 = dp_packet_l3(frag_list[i].pkt);
> +        dp_packet_put(pkt, (char *)l3 + l3_hlen, add_len);
> +    }
> +    l3 = dp_packet_l3(pkt);
> +    l4 = dp_packet_l4(pkt);
> +    tail = dp_packet_tail(pkt);
> +    pad = dp_packet_l2_pad_size(pkt);
> +    l3_size = tail - (char *)l3 - pad;
> +
> +    uint8_t nw_proto = l3->ip6_nxt;
> +    uint8_t nw_frag = 0;
> +    const void *data = l3 + 1;
> +    size_t datasize = l3_size - sizeof *l3;
> +
> +    const struct ovs_16aligned_ip6_frag *frag_hdr = NULL;
> +    if (!parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag, &frag_hdr)
> +        || !nw_frag || !frag_hdr) {
> +
> +        ipf_print_reass_packet("Unparsed reassembled v6 packet; v6 hdr:", l3);
> +        dp_packet_delete(pkt);
> +        return NULL;
> +    }
> +
> +    struct ovs_16aligned_ip6_frag *fh =
> +        CONST_CAST(struct ovs_16aligned_ip6_frag *, frag_hdr);
> +    fh->ip6f_offlg = 0;
> +    l3->ip6_plen = htons(pl);
> +    l3->ip6_ctlun.ip6_un1.ip6_un1_nxt = nw_proto;
> +    return pkt;
> +}
> +
> +/* Called when a valid fragment is added. */
> +static void
> +ipf_list_state_transition(struct ipf_list *ipf_list, bool ff, bool lf,
> +                          bool v6)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    enum ipf_list_state curr_state = ipf_list->state;
> +    enum ipf_list_state next_state;
> +    switch (curr_state) {
> +    case IPF_LIST_STATE_UNUSED:
> +    case IPF_LIST_STATE_OTHER_SEEN:
> +        if (ff) {
> +            next_state = IPF_LIST_STATE_FIRST_SEEN;
> +        } else if (lf) {
> +            next_state = IPF_LIST_STATE_LAST_SEEN;
> +        } else {
> +            next_state = IPF_LIST_STATE_OTHER_SEEN;
> +        }
> +        break;
> +    case IPF_LIST_STATE_FIRST_SEEN:
> +        if (ff) {
> +            next_state = IPF_LIST_STATE_FIRST_SEEN;
> +        } else if (lf) {
> +            next_state = IPF_LIST_STATE_FIRST_LAST_SEEN;
> +        } else {
> +            next_state = IPF_LIST_STATE_FIRST_SEEN;
> +        }
> +        break;
> +    case IPF_LIST_STATE_LAST_SEEN:
> +        if (ff) {
> +            next_state = IPF_LIST_STATE_FIRST_LAST_SEEN;
> +        } else if (lf) {
> +            next_state = IPF_LIST_STATE_LAST_SEEN;
> +        } else {
> +            next_state = IPF_LIST_STATE_LAST_SEEN;
> +        }
> +        break;
> +    case IPF_LIST_STATE_FIRST_LAST_SEEN:
> +        next_state = IPF_LIST_STATE_FIRST_LAST_SEEN;
> +        break;
> +    case IPF_LIST_STATE_COMPLETED:
> +    case IPF_LIST_STATE_REASS_FAIL:
> +    case IPF_LIST_STATE_NUM:
> +    default:
> +        OVS_NOT_REACHED();
> +    }
> +
> +    if (next_state == IPF_LIST_STATE_FIRST_LAST_SEEN) {
> +        ipf_sort(ipf_list->frag_list, ipf_list->last_inuse_idx);
> +        if (ipf_list_complete(ipf_list)) {
> +            struct dp_packet *reass_pkt = v6
> +                ? ipf_reassemble_v6_frags(ipf_list)
> +                : ipf_reassemble_v4_frags(ipf_list);
> +            if (reass_pkt) {
> +                struct reassembled_pkt *rp = xzalloc(sizeof *rp);
> +                rp->pkt = reass_pkt;
> +                rp->list = ipf_list;
> +                ipf_reassembled_list_add(rp);
> +                ipf_expiry_list_remove(ipf_list);
> +                next_state = IPF_LIST_STATE_COMPLETED;
> +            } else {
> +                next_state = IPF_LIST_STATE_REASS_FAIL;
> +            }
> +        }
> +    }
> +    ipf_list->state = next_state;
> +}
> +
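
For the four "seen" states, the switch in ipf_list_state_transition()
behaves like OR-ing two flags: OTHER_SEEN is neither flag, FIRST_SEEN and
LAST_SEEN are one flag each, and FIRST_LAST_SEEN is both. The sketch below
shows that equivalent formulation. It is only an alternative way to read
the state machine, with illustrative names, and it leaves out the UNUSED,
COMPLETED and REASS_FAIL states.

```c
#include <assert.h>
#include <stdbool.h>

enum seen_flags {
    SEEN_NONE  = 0,                      /* Like OTHER_SEEN. */
    SEEN_FIRST = 1 << 0,                 /* Like FIRST_SEEN. */
    SEEN_LAST  = 1 << 1,                 /* Like LAST_SEEN. */
    SEEN_BOTH  = SEEN_FIRST | SEEN_LAST  /* Like FIRST_LAST_SEEN. */
};

/* Returns the flags after adding one fragment.  'ff' and 'lf' say whether
 * the new fragment is the first or the last fragment, respectively. */
static unsigned int
seen_after_frag(unsigned int seen, bool ff, bool lf)
{
    /* First needs "more fragments" set, last needs it clear. */
    assert(!(ff && lf));
    assert(seen <= (unsigned int) SEEN_BOTH);

    if (ff) {
        seen |= SEEN_FIRST;
    }
    if (lf) {
        seen |= SEEN_LAST;
    }
    return seen;
}
```

Reaching SEEN_BOTH corresponds to the point where the patch sorts the list
and checks it for completeness.
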
> +static bool
> +ipf_is_valid_v4_frag(struct dp_packet *pkt)
> +{
> +    if (OVS_UNLIKELY(dp_packet_ip_checksum_bad(pkt))) {
> +        goto invalid_pkt;
> +    }
> +
> +    const struct eth_header *l2 = dp_packet_eth(pkt);
> +    const struct ip_header *l3 = dp_packet_l3(pkt);
> +
> +    if (OVS_UNLIKELY(!l2 || !l3)) {
> +        goto invalid_pkt;
> +    }
> +
> +    const char *tail = dp_packet_tail(pkt);
> +    uint8_t pad = dp_packet_l2_pad_size(pkt);
> +    size_t size = tail - (char *)l3 - pad;
> +    if (OVS_UNLIKELY(size < IP_HEADER_LEN)) {
> +        goto invalid_pkt;
> +    }
> +
> +    if (!(IP_IS_FRAGMENT(l3->ip_frag_off))) {
> +        return false;
> +    }
> +
> +    uint16_t ip_tot_len = ntohs(l3->ip_tot_len);
> +    if (OVS_UNLIKELY(ip_tot_len != size)) {
> +        goto invalid_pkt;
> +    }
> +
> +    size_t ip_hdr_len = IP_IHL(l3->ip_ihl_ver) * 4;
> +    if (OVS_UNLIKELY(ip_hdr_len < IP_HEADER_LEN)) {
> +        goto invalid_pkt;
> +    }
> +    if (OVS_UNLIKELY(size < ip_hdr_len)) {
> +        goto invalid_pkt;
> +    }
> +
> +    if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
> +                     && csum(l3, ip_hdr_len) != 0)) {
> +        goto invalid_pkt;
> +    }
> +
> +    uint32_t min_v4_frag_size_;
> +    atomic_read_relaxed(&min_v4_frag_size, &min_v4_frag_size_);
> +    bool lf = ipf_is_last_v4_frag(pkt);
> +    if (OVS_UNLIKELY(!lf && dp_packet_size(pkt) < min_v4_frag_size_)) {
> +        ipf_count(false, IPF_COUNTER_NFRAGS_TOO_SMALL);
> +        goto invalid_pkt;
> +    }
> +    return true;
> +
> +invalid_pkt:
> +    pkt->md.ct_state = CS_INVALID;
> +    return false;
> +
> +}
> +
> +static bool
> +ipf_v4_key_extract(struct dp_packet *pkt, ovs_be16 dl_type, uint16_t zone,
> +                   struct ipf_list_key *key, uint16_t *start_data_byte,
> +                   uint16_t *end_data_byte, bool *ff, bool *lf)
> +{
> +    const struct ip_header *l3 = dp_packet_l3(pkt);
> +    uint16_t ip_tot_len = ntohs(l3->ip_tot_len);
> +    size_t ip_hdr_len = IP_IHL(l3->ip_ihl_ver) * 4;
> +
> +    *start_data_byte = ntohs(l3->ip_frag_off & htons(IP_FRAG_OFF_MASK)) * 8;
> +    *end_data_byte = *start_data_byte + ip_tot_len - ip_hdr_len - 1;
> +    *ff = ipf_is_first_v4_frag(pkt);
> +    *lf = ipf_is_last_v4_frag(pkt);
> +    memset(key, 0, sizeof *key);
> +    key->ip_id = be16_to_be32(l3->ip_id);
> +    key->dl_type = dl_type;
> +    key->src_addr.ipv4 = l3->ip_src;
> +    key->dst_addr.ipv4 = l3->ip_dst;
> +    key->nw_proto = l3->ip_proto;
> +    key->zone = zone;
> +    key->recirc_id = pkt->md.recirc_id;
> +    return true;
> +}
> +
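
For reference, the offset arithmetic in ipf_v4_key_extract() can be sketched in isolation as below. This is only an illustration: the sketch_* names and the SKETCH_IP_FRAG_OFF_MASK constant are made up for this note and are not part of the patch, and the fragment field is assumed to already be in host byte order.

```c
#include <stdint.h>

/* Hypothetical stand-in for IP_FRAG_OFF_MASK: the low 13 bits of the
 * IPv4 flags/fragment-offset field hold the offset. */
#define SKETCH_IP_FRAG_OFF_MASK 0x1fff

/* Returns the index of the first payload byte carried by a fragment.
 * 'frag_off_host' is the flags/offset field in host byte order; the
 * offset is expressed in units of 8 bytes. */
static uint16_t
sketch_v4_start_byte(uint16_t frag_off_host)
{
    return (uint16_t) ((frag_off_host & SKETCH_IP_FRAG_OFF_MASK) * 8);
}

/* Returns the index of the last payload byte (inclusive): the payload
 * length is the total length minus the IP header length. */
static uint16_t
sketch_v4_end_byte(uint16_t start, uint16_t ip_tot_len, uint16_t ip_hdr_len)
{
    return (uint16_t) (start + ip_tot_len - ip_hdr_len - 1);
}
```

With a 1500-byte total length and a 20-byte header, a fragment at offset 185 starts at byte 1480 and ends at byte 2959, both inclusive.
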
> +static bool
> +ipf_is_valid_v6_frag(struct dp_packet *pkt OVS_UNUSED)
> +{
> +    const struct eth_header *l2 = dp_packet_eth(pkt);
> +    const struct  ovs_16aligned_ip6_hdr *l3 = dp_packet_l3(pkt);
> +    const char *l4 = dp_packet_l4(pkt);
> +
> +    if (OVS_UNLIKELY(!l2 || !l3 || !l4)) {
> +        goto invalid_pkt;
> +    }
> +
> +    const char *tail = dp_packet_tail(pkt);
> +    uint8_t pad = dp_packet_l2_pad_size(pkt);
> +    size_t l3_size = tail - (char *)l3 - pad;
> +    size_t l3_hdr_size = sizeof *l3;
> +
> +    if (OVS_UNLIKELY(l3_size < l3_hdr_size)) {
> +        goto invalid_pkt;
> +    }
> +
> +    uint8_t nw_frag = 0;
> +    uint8_t nw_proto = l3->ip6_nxt;
> +    const void *data = l3 + 1;
> +    size_t datasize = l3_size - l3_hdr_size;
> +    const struct ovs_16aligned_ip6_frag *frag_hdr = NULL;
> +    if (!parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag,
> +                             &frag_hdr) || !nw_frag || !frag_hdr) {
> +        return false;
> +    }
> +
> +    int pl = ntohs(l3->ip6_plen);
> +    if (OVS_UNLIKELY(pl + l3_hdr_size != l3_size)) {
> +        goto invalid_pkt;
> +    }
> +
> +    ovs_be16 ip6f_offlg = frag_hdr->ip6f_offlg;
> +    if (OVS_UNLIKELY(!ipf_is_v6_frag(ip6f_offlg))) {
> +        return false;
> +    }
> +
> +    uint32_t min_v6_frag_size_;
> +    atomic_read_relaxed(&min_v6_frag_size, &min_v6_frag_size_);
> +    bool lf = ipf_is_last_v6_frag(ip6f_offlg);
> +
> +    if (OVS_UNLIKELY(!lf && dp_packet_size(pkt) < min_v6_frag_size_)) {
> +        ipf_count(true, IPF_COUNTER_NFRAGS_TOO_SMALL);
> +        goto invalid_pkt;
> +    }
> +
> +    return true;
> +
> +invalid_pkt:
> +    pkt->md.ct_state = CS_INVALID;
> +    return false;
> +
> +}
> +
> +static void
> +ipf_v6_key_extract(struct dp_packet *pkt, ovs_be16 dl_type, uint16_t zone,
> +                   struct ipf_list_key *key, uint16_t *start_data_byte,
> +                   uint16_t *end_data_byte, bool *ff, bool *lf)
> +{
> +    const struct  ovs_16aligned_ip6_hdr *l3 = dp_packet_l3(pkt);
> +    const char *l4 = dp_packet_l4(pkt);
> +    const char *tail = dp_packet_tail(pkt);
> +    uint8_t pad = dp_packet_l2_pad_size(pkt);
> +    size_t l3_size = tail - (char *)l3 - pad;
> +    size_t l4_size = tail - (char *)l4 - pad;
> +    size_t l3_hdr_size = sizeof *l3;
> +    uint8_t nw_frag = 0;
> +    uint8_t nw_proto = l3->ip6_nxt;
> +    const void *data = l3 + 1;
> +    size_t datasize = l3_size - l3_hdr_size;
> +    const struct ovs_16aligned_ip6_frag *frag_hdr = NULL;
> +
> +    parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag, &frag_hdr);
> +    ovs_assert(nw_frag && frag_hdr);
> +    ovs_be16 ip6f_offlg = frag_hdr->ip6f_offlg;
> +    *start_data_byte = ntohs(ip6f_offlg & IP6F_OFF_MASK) +
> +        sizeof (struct ovs_16aligned_ip6_frag);
> +    *end_data_byte = *start_data_byte + l4_size - 1;
> +    *ff = ipf_is_first_v6_frag(ip6f_offlg);
> +    *lf = ipf_is_last_v6_frag(ip6f_offlg);
> +    memset(key, 0, sizeof *key);
> +    key->ip_id = get_16aligned_be32(&frag_hdr->ip6f_ident);
> +    key->dl_type = dl_type;
> +    key->src_addr.ipv6 = l3->ip6_src;
> +    /* We are not supporting parsing of the routing header to use as the
> +     * dst address part of the key. */
> +    key->dst_addr.ipv6 = l3->ip6_dst;
> +    key->nw_proto = 0;   /* Not used for key for V6. */
> +    key->zone = zone;
> +    key->recirc_id = pkt->md.recirc_id;
> +}
> +
> +static int
> +ipf_list_key_cmp(const struct ipf_list_key *key1,
> +                 const struct ipf_list_key *key2)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    if (!memcmp(&key1->src_addr, &key2->src_addr, sizeof key1->src_addr) &&
> +        !memcmp(&key1->dst_addr, &key2->dst_addr, sizeof key1->dst_addr) &&
> +        (key1->dl_type == key2->dl_type) &&
> +        (key1->ip_id == key2->ip_id) &&
> +        (key1->zone == key2->zone) &&
> +        (key1->nw_proto == key2->nw_proto) &&
> +        (key1->recirc_id == key2->recirc_id)) {
> +        return 0;
> +    }
> +    return 1;
> +}
> +
> +static struct ipf_list *
> +ipf_list_key_lookup(const struct ipf_list_key *key,
> +                    uint32_t hash)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    struct ipf_list *ipf_list;
> +    HMAP_FOR_EACH_WITH_HASH (ipf_list, node, hash, &frag_lists) {
> +        if (!ipf_list_key_cmp(&ipf_list->key, key)) {
> +            return ipf_list;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +static bool
> +ipf_is_frag_duped(const struct ipf_frag *frag_list, int last_inuse_idx,
> +                  size_t start_data_byte, size_t end_data_byte)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    for (int i = 0; i <= last_inuse_idx; i++) {
> +        if (((start_data_byte >= frag_list[i].start_data_byte) &&
> +            (start_data_byte <= frag_list[i].end_data_byte)) ||
> +            ((end_data_byte >= frag_list[i].start_data_byte) &&
> +             (end_data_byte <= frag_list[i].end_data_byte))) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
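
The check in ipf_is_frag_duped() tests whether either endpoint of the new fragment falls inside an existing one. As far as can be told from the code above, that does not flag a new fragment that strictly encloses an existing fragment. The sketch below compares that endpoint test with the usual inclusive-range overlap test; the sketch_* names are hypothetical and nothing here is taken from the patch beyond the shape of the comparison.

```c
#include <stdbool.h>
#include <stddef.h>

/* Endpoint-only test, mirroring the comparison in ipf_is_frag_duped():
 * true if the new start or the new end lies inside the old range. */
static bool
sketch_endpoint_check(size_t new_start, size_t new_end,
                      size_t old_start, size_t old_end)
{
    return (new_start >= old_start && new_start <= old_end)
           || (new_end >= old_start && new_end <= old_end);
}

/* General test: two inclusive ranges overlap if and only if each one
 * starts no later than the other one ends. */
static bool
sketch_ranges_overlap(size_t a_start, size_t a_end,
                      size_t b_start, size_t b_end)
{
    return a_start <= b_end && b_start <= a_end;
}
```

For a new fragment covering bytes 0..100 and an existing one covering 10..20, the endpoint test reports no overlap while the general test does.
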
> +static bool
> +ipf_process_frag(struct ipf_list *ipf_list, struct dp_packet *pkt,
> +                 uint16_t start_data_byte, uint16_t end_data_byte,
> +                 bool ff, bool lf, bool v6)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    bool duped_frag = ipf_is_frag_duped(ipf_list->frag_list,
> +        ipf_list->last_inuse_idx, start_data_byte, end_data_byte);
> +    int last_inuse_idx = ipf_list->last_inuse_idx;
> +
> +    if (!duped_frag) {
> +        if (last_inuse_idx < ipf_list->size - 1) {
> +            /* In the case of dpdk, it would be unfortunate if we had
> +             * to create a clone fragment outside the dpdk mp due to the
> +             * mempool size being too limited. We will otherwise need to
> +             * recommend not setting the mempool number of buffers too low
> +             * and also clamp the number of fragments. */
> +            ipf_list->frag_list[last_inuse_idx + 1].pkt = pkt;
> +            ipf_list->frag_list[last_inuse_idx + 1].start_data_byte =
> +                start_data_byte;
> +            ipf_list->frag_list[last_inuse_idx + 1].end_data_byte =
> +                end_data_byte;
> +            ipf_list->last_inuse_idx++;
> +            atomic_count_inc(&nfrag);
> +            ipf_count(v6, IPF_COUNTER_NFRAGS_ACCEPTED);
> +            ipf_list_state_transition(ipf_list, ff, lf, v6);
> +        } else {
> +            OVS_NOT_REACHED();
> +        }
> +    } else {
> +        ipf_count(v6, IPF_COUNTER_NFRAGS_OVERLAP);
> +        pkt->md.ct_state = CS_INVALID;
> +        return false;
> +    }
> +    return true;
> +}
> +
> +static bool
> +ipf_handle_frag(struct dp_packet *pkt, ovs_be16 dl_type, uint16_t zone,
> +                long long now, uint32_t hash_basis)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    struct ipf_list_key key;
> +    /* Initialize 4 variables for some versions of GCC. */
> +    uint16_t start_data_byte = 0;
> +    uint16_t end_data_byte = 0;
> +    bool ff = false;
> +    bool lf = false;
> +    bool v6 = dl_type == htons(ETH_TYPE_IPV6);
> +
> +    if (v6 && ipf_get_v6_enabled()) {
> +        ipf_v6_key_extract(pkt, dl_type, zone, &key, &start_data_byte,
> +                           &end_data_byte, &ff, &lf);
> +    } else if (!v6 && ipf_get_v4_enabled()) {
> +        ipf_v4_key_extract(pkt, dl_type, zone, &key, &start_data_byte,
> +                           &end_data_byte, &ff, &lf);
> +    } else {
> +        OVS_NOT_REACHED();
> +    }
> +
> +    unsigned int nfrag_max_;
> +    atomic_read_relaxed(&nfrag_max, &nfrag_max_);
> +    if (atomic_count_get(&nfrag) >= nfrag_max_) {
> +        return false;
> +    }
> +
> +    uint32_t hash = ipf_list_key_hash(&key, hash_basis);
> +    struct ipf_list *ipf_list = ipf_list_key_lookup(&key, hash);
> +    enum {
> +        IPF_FRAG_LIST_MIN_INCREMENT = 4,
> +        IPF_IPV6_MAX_FRAG_LIST_SIZE = 65535,
> +    };
> +
> +    int max_frag_list_size;
> +    if (v6) {
> +        /* Because the calculation with extension headers is variable,
> +         * we don't calculate a hard maximum fragment list size upfront.  The
> +         * fragment list size is practically limited by the code, however. */
> +        max_frag_list_size = IPF_IPV6_MAX_FRAG_LIST_SIZE;
> +    } else {
> +        max_frag_list_size = max_v4_frag_list_size;
> +    }
> +
> +    if (!ipf_list) {
> +        ipf_list = xzalloc(sizeof *ipf_list);
> +        ipf_list->key = key;
> +        ipf_list->last_inuse_idx = IPF_INVALID_IDX;
> +        ipf_list->last_sent_idx = IPF_INVALID_IDX;
> +        ipf_list->size =
> +            MIN(max_frag_list_size, IPF_FRAG_LIST_MIN_INCREMENT);
> +        ipf_list->frag_list =
> +            xzalloc(ipf_list->size * sizeof *ipf_list->frag_list);
> +        hmap_insert(&frag_lists, &ipf_list->node, hash);
> +        ipf_expiry_list_add(ipf_list, now);
> +    } else if (ipf_list->state == IPF_LIST_STATE_REASS_FAIL) {
> +        /* Bail out as early as possible. */
> +        return false;
> +    } else if (ipf_list->last_inuse_idx + 1 >= ipf_list->size) {
> +        int increment = MIN(IPF_FRAG_LIST_MIN_INCREMENT,
> +                            max_frag_list_size - ipf_list->size);
> +        /* Enforce limit. */
> +        if (increment > 0) {
> +            ipf_list->frag_list =
> +                xrealloc(ipf_list->frag_list, (ipf_list->size + increment) *
> +                  sizeof *ipf_list->frag_list);
> +            ipf_list->size += increment;
> +        } else {
> +            return false;
> +        }
> +    }
> +
> +    return ipf_process_frag(ipf_list, pkt, start_data_byte, end_data_byte, ff,
> +                            lf, v6);
> +}
> +
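
The growth policy for frag_list in ipf_handle_frag() (grow by a fixed increment, clamped to the per-list maximum, and refuse the fragment once the maximum is reached) can be sketched as follows. The function name and the SKETCH_MIN_INCREMENT constant are hypothetical; the value 4 matches IPF_FRAG_LIST_MIN_INCREMENT in the code above.

```c
/* Hypothetical mirror of IPF_FRAG_LIST_MIN_INCREMENT. */
#define SKETCH_MIN_INCREMENT 4

/* Returns the capacity to use after one growth step.  The result is
 * unchanged when 'size' has already reached 'max_size', which is the
 * case where the caller would reject the fragment. */
static int
sketch_next_list_size(int size, int max_size)
{
    int room = max_size - size;
    int increment = room < SKETCH_MIN_INCREMENT ? room : SKETCH_MIN_INCREMENT;

    return increment > 0 ? size + increment : size;
}
```

Starting from a capacity of 4 with a maximum of 10, successive steps give 8, then 10, and then no further growth.
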
> +static void
> +ipf_extract_frags_from_batch(struct dp_packet_batch *pb, ovs_be16 dl_type,
> +                             uint16_t zone, long long now, uint32_t hash_basis)
> +{
> +    const size_t pb_cnt = dp_packet_batch_size(pb);
> +    int pb_idx; /* Index in a packet batch. */
> +    struct dp_packet *pkt;
> +
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (pb_idx, pb_cnt, pkt, pb) {
> +
> +        if (OVS_UNLIKELY((dl_type == htons(ETH_TYPE_IP) &&
> +                          ipf_is_valid_v4_frag(pkt)) ||
> +                         (dl_type == htons(ETH_TYPE_IPV6) &&
> +                          ipf_is_valid_v6_frag(pkt)))) {
> +
> +            ipf_lock_lock(&ipf_lock);
> +            if (!ipf_handle_frag(pkt, dl_type, zone, now, hash_basis)) {
> +                dp_packet_batch_refill(pb, pkt, pb_idx);
> +            }
> +            ipf_lock_unlock(&ipf_lock);
> +        } else {
> +            dp_packet_batch_refill(pb, pkt, pb_idx);
> +        }
> +
> +    }
> +}
> +
> +/* In case of DPDK, a memory source check is done, as DPDK memory pool
> + * management has trouble dealing with multiple source types.  The
> + * check_source parameter is used to indicate when this check is needed. */
> +static bool
> +ipf_dp_packet_batch_add(struct dp_packet_batch *pb, struct dp_packet *pkt,
> +                        bool check_source OVS_UNUSED)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +#ifdef DPDK_NETDEV
> +    if ((pb->count >= NETDEV_MAX_BURST) ||
> +        /* DPDK cannot handle multiple sources in a batch. */
> +        (check_source && pb->count && pb->packets[0]->source != pkt->source)) {
> +#else
> +    if (pb->count >= NETDEV_MAX_BURST) {
> +#endif
> +        return false;
> +    }
> +
> +    dp_packet_batch_add(pb, pkt);
> +    return true;
> +}
> +
> +/* This would be used in rare cases where a list cannot be sent. One rare
> + * reason known right now is a mempool source check, which exists due to DPDK
> + * support, where packets are no longer being received on any port with a
> + * source matching the fragment.  Another reason is a race where all
> + * conntrack rules are unconfigured when some fragments are yet to be
> + * flushed.
> + *
> + * Returns true if the list was purged. */
> +static bool
> +ipf_purge_list_check(struct ipf_list *ipf_list, long long now)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    enum {
> +        IPF_FRAG_LIST_PURGE_TIME_ADJ = 10000
> +    };
> +
> +    if (now < ipf_list->expiration + IPF_FRAG_LIST_PURGE_TIME_ADJ) {
> +        return false;
> +    }
> +
> +    struct dp_packet *pkt;
> +    while (ipf_list->last_sent_idx < ipf_list->last_inuse_idx) {
> +        pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
> +        dp_packet_delete(pkt);
> +        atomic_count_dec(&nfrag);
> +        ipf_list->last_sent_idx++;
> +    }
> +
> +    COVERAGE_INC(ipf_stuck_frag_list_purged);
> +    ipf_count(ipf_list->key.dl_type == htons(ETH_TYPE_IPV6),
> +              IPF_COUNTER_NFRAGS_PURGED);
> +    return true;
> +}
> +
> +static bool
> +ipf_send_frags_in_list(struct ipf_list *ipf_list, struct dp_packet_batch *pb,
> +                       enum ipf_list_type list_type, bool v6, long long now)
> +    OVS_REQUIRES(ipf_lock)
> +{
> +    if (ipf_purge_list_check(ipf_list, now)) {
> +        return true;
> +    }
> +
> +    struct dp_packet *pkt;
> +    while (ipf_list->last_sent_idx < ipf_list->last_inuse_idx) {
> +        pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
> +        if (ipf_dp_packet_batch_add(pb, pkt, true)) {
> +
> +            ipf_list->last_sent_idx++;
> +            atomic_count_dec(&nfrag);
> +
> +            if (list_type == IPF_FRAG_COMPLETED_LIST) {
> +                ipf_count(v6, IPF_COUNTER_NFRAGS_COMPL_SENT);
> +            } else {
> +                ipf_count(v6, IPF_COUNTER_NFRAGS_EXPD_SENT);
> +                pkt->md.ct_state = CS_INVALID;
> +            }
> +
> +            if (ipf_list->last_sent_idx == ipf_list->last_inuse_idx) {
> +                return true;
> +            }
> +        } else {
> +            return false;
> +        }
> +    }
> +    OVS_NOT_REACHED();
> +}
> +
> +static void
> +ipf_send_completed_frags(struct dp_packet_batch *pb, long long now, bool v6)
> +{
> +    if (ovs_list_is_empty(&frag_complete_list)) {
> +        return;
> +    }
> +
> +    ipf_lock_lock(&ipf_lock);
> +    struct ipf_list *ipf_list, *next;
> +
> +    LIST_FOR_EACH_SAFE (ipf_list, next, list_node, &frag_complete_list) {
> +        if (ipf_send_frags_in_list(ipf_list, pb, IPF_FRAG_COMPLETED_LIST,
> +                                   v6, now)) {
> +            ipf_completed_list_clean(ipf_list);
> +        } else {
> +            break;
> +        }
> +    }
> +    ipf_lock_unlock(&ipf_lock);
> +}
> +
> +static void
> +ipf_send_expired_frags(struct dp_packet_batch *pb, long long now, bool v6)
> +{
> +    enum {
> +        /* Very conservative, due to DOS probability. */
> +        IPF_FRAG_LIST_MAX_EXPIRED = 1,
> +    };
> +
> +
> +    if (ovs_list_is_empty(&frag_exp_list)) {
> +        return;
> +    }
> +
> +    ipf_lock_lock(&ipf_lock);
> +    struct ipf_list *ipf_list, *next;
> +    size_t lists_removed = 0;
> +
> +    LIST_FOR_EACH_SAFE (ipf_list, next, list_node, &frag_exp_list) {
> +        if (!(now > ipf_list->expiration) ||
> +            lists_removed >= IPF_FRAG_LIST_MAX_EXPIRED) {
> +            break;
> +        }
> +
> +        if (ipf_send_frags_in_list(ipf_list, pb, IPF_FRAG_EXPIRY_LIST, v6,
> +                                   now)) {
> +            ipf_expiry_list_clean(ipf_list);
> +            lists_removed++;
> +        } else {
> +            break;
> +        }
> +    }
> +    ipf_lock_unlock(&ipf_lock);
> +}
> +
> +static void
> +ipf_execute_reass_pkts(struct dp_packet_batch *pb)
> +{
> +    if (ovs_list_is_empty(&reassembled_pkt_list)) {
> +        return;
> +    }
> +
> +    ipf_lock_lock(&ipf_lock);
> +    struct reassembled_pkt *rp, *next;
> +
> +    LIST_FOR_EACH_SAFE (rp, next, rp_list_node, &reassembled_pkt_list) {
> +        if (!rp->list->reass_execute_ctx &&
> +            ipf_dp_packet_batch_add(pb, rp->pkt, false)) {
> +            rp->list->reass_execute_ctx = rp->pkt;
> +        }
> +    }
> +    ipf_lock_unlock(&ipf_lock);
> +}
> +
> +static void
> +ipf_post_execute_reass_pkts(struct dp_packet_batch *pb, bool v6)
> +{
> +    if (ovs_list_is_empty(&reassembled_pkt_list)) {
> +        return;
> +    }
> +
> +    ipf_lock_lock(&ipf_lock);
> +    struct reassembled_pkt *rp, *next;
> +
> +    LIST_FOR_EACH_SAFE (rp, next, rp_list_node, &reassembled_pkt_list) {
> +        const size_t pb_cnt = dp_packet_batch_size(pb);
> +        int pb_idx;
> +        struct dp_packet *pkt;
> +        /* Inner batch loop is constant time since batch size is <=
> +         * NETDEV_MAX_BURST. */
> +        DP_PACKET_BATCH_REFILL_FOR_EACH (pb_idx, pb_cnt, pkt, pb) {
> +            if (pkt == rp->list->reass_execute_ctx) {
> +                for (int i = 0; i <= rp->list->last_inuse_idx; i++) {
> +                    rp->list->frag_list[i].pkt->md.ct_label = pkt->md.ct_label;
> +                    rp->list->frag_list[i].pkt->md.ct_mark = pkt->md.ct_mark;
> +                    rp->list->frag_list[i].pkt->md.ct_state = pkt->md.ct_state;
> +                    rp->list->frag_list[i].pkt->md.ct_zone = pkt->md.ct_zone;
> +                    rp->list->frag_list[i].pkt->md.ct_orig_tuple_ipv6 =
> +                        pkt->md.ct_orig_tuple_ipv6;
> +                    if (pkt->md.ct_orig_tuple_ipv6) {
> +                        rp->list->frag_list[i].pkt->md.ct_orig_tuple.ipv6 =
> +                            pkt->md.ct_orig_tuple.ipv6;
> +                    } else {
> +                        rp->list->frag_list[i].pkt->md.ct_orig_tuple.ipv4 =
> +                            pkt->md.ct_orig_tuple.ipv4;
> +                    }
> +                }
> +
> +                const char *tail_frag =
> +                    dp_packet_tail(rp->list->frag_list[0].pkt);
> +                uint8_t pad_frag =
> +                    dp_packet_l2_pad_size(rp->list->frag_list[0].pkt);
> +
> +                void *l4_frag = dp_packet_l4(rp->list->frag_list[0].pkt);
> +                void *l4_reass = dp_packet_l4(pkt);
> +                memcpy(l4_frag, l4_reass,
> +                       tail_frag - (char *) l4_frag - pad_frag);
> +
> +                if (v6) {
> +                    struct  ovs_16aligned_ip6_hdr *l3_frag =
> +                        dp_packet_l3(rp->list->frag_list[0].pkt);
> +                    struct  ovs_16aligned_ip6_hdr *l3_reass =
> +                        dp_packet_l3(pkt);
> +                    l3_frag->ip6_src = l3_reass->ip6_src;
> +                    l3_frag->ip6_dst = l3_reass->ip6_dst;
> +                } else {
> +                    struct ip_header *l3_frag =
> +                        dp_packet_l3(rp->list->frag_list[0].pkt);
> +                    struct ip_header *l3_reass = dp_packet_l3(pkt);
> +                    ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
> +                    ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
> +                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> +                                                     frag_ip, reass_ip);
> +                    l3_frag->ip_src = l3_reass->ip_src;
> +
> +                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> +                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> +                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> +                                                     frag_ip, reass_ip);
> +                    l3_frag->ip_dst = l3_reass->ip_dst;
> +                }
> +
> +                ipf_completed_list_add(rp->list);
> +                ipf_reassembled_list_remove(rp);
> +                dp_packet_delete(rp->pkt);
> +                free(rp);
> +            } else {
> +                dp_packet_batch_refill(pb, pkt, pb_idx);
> +            }
> +        }
> +    }
> +    ipf_lock_unlock(&ipf_lock);
> +}
> +
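
The IPv4 branch above rewrites the first fragment's addresses and adjusts ip_csum incrementally instead of recomputing it over the whole header. A minimal sketch of that idea for a single 16-bit word, following the update formula from RFC 1624 (HC' = ~(~HC + ~m + m')), is shown below. The sketch_* helpers are hypothetical, work on host-order words, and are not the OVS recalc_csum32() implementation; a 32-bit address change would presumably correspond to applying the update once per 16-bit half.

```c
#include <stddef.h>
#include <stdint.h>

/* Folds the carries of a 32-bit accumulator back into 16 bits
 * (one's complement addition). */
static uint16_t
sketch_fold(uint32_t sum)
{
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t) sum;
}

/* Full Internet checksum over 'n' 16-bit words. */
static uint16_t
sketch_csum(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        sum += words[i];
    }
    return (uint16_t) ~sketch_fold(sum);
}

/* Incremental update when one covered word changes from 'old_w' to
 * 'new_w': HC' = ~(~HC + ~m + m'). */
static uint16_t
sketch_csum_update16(uint16_t csum, uint16_t old_w, uint16_t new_w)
{
    uint32_t sum = (uint16_t) ~csum;

    sum += (uint16_t) ~old_w;
    sum += new_w;
    return (uint16_t) ~sketch_fold(sum);
}
```

Changing one word and calling sketch_csum_update16() with the old checksum should give the same value as running sketch_csum() over the modified words.
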
> +/* Extracts any fragments from the batch and reassembles them when a
> + * complete packet is received.  Completed packets are attempted to
> + * be added to the batch to be sent through conntrack. */
> +void
> +ipf_preprocess_conntrack(struct dp_packet_batch *pb, long long now,
> +                         ovs_be16 dl_type, uint16_t zone, uint32_t hash_basis)
> +{
> +    if (ipf_get_enabled()) {
> +        ipf_extract_frags_from_batch(pb, dl_type, zone, now, hash_basis);
> +    }
> +
> +    if (ipf_get_enabled() || atomic_count_get(&nfrag)) {
> +        ipf_execute_reass_pkts(pb);
> +    }
> +}
> +
> +/* Updates fragments based on the processing of the reassembled packet sent
> + * through conntrack and adds these fragments to any batches seen.  Expired
> + * fragments are marked as invalid and also added to the batches seen
> + * with low priority.  Reassembled packets are freed. */
> +void
> +ipf_postprocess_conntrack(struct dp_packet_batch *pb, long long now,
> +                          ovs_be16 dl_type)
> +{
> +    if (ipf_get_enabled() || atomic_count_get(&nfrag)) {
> +        bool v6 = dl_type == htons(ETH_TYPE_IPV6);
> +        ipf_post_execute_reass_pkts(pb, v6);
> +        ipf_send_completed_frags(pb, now, v6);
> +        ipf_send_expired_frags(pb, now, v6);
> +    }
> +}
> +
> +static void *
> +ipf_clean_thread_main(void *f OVS_UNUSED)
> +{
> +    enum {
> +        IPF_FRAG_LIST_CLEAN_TIMEOUT = 60000,
> +    };
> +
> +    while (!latch_is_set(&ipf_clean_thread_exit)) {
> +
> +        long long now = time_msec();
> +
> +        if (!ovs_list_is_empty(&frag_exp_list) ||
> +            !ovs_list_is_empty(&frag_complete_list)) {
> +
> +            ipf_lock_lock(&ipf_lock);
> +
> +            struct ipf_list *ipf_list, *next;
> +            LIST_FOR_EACH_SAFE (ipf_list, next, list_node, &frag_exp_list) {
> +                if (ipf_purge_list_check(ipf_list, now)) {
> +                    ipf_expiry_list_clean(ipf_list);
> +                }
> +            }
> +
> +            LIST_FOR_EACH_SAFE (ipf_list, next, list_node,
> +                                &frag_complete_list) {
> +                if (ipf_purge_list_check(ipf_list, now)) {
> +                    ipf_completed_list_clean(ipf_list);
> +                }
> +            }
> +
> +            ipf_lock_unlock(&ipf_lock);
> +        }
> +
> +        poll_timer_wait_until(now + IPF_FRAG_LIST_CLEAN_TIMEOUT);
> +        latch_wait(&ipf_clean_thread_exit);
> +        poll_block();
> +    }
> +
> +    return NULL;
> +}
> +
> +void
> +ipf_init(void)
> +{
> +    ipf_lock_init(&ipf_lock);
> +    ipf_lock_lock(&ipf_lock);
> +    hmap_init(&frag_lists);
> +    ovs_list_init(&frag_exp_list);
> +    ovs_list_init(&frag_complete_list);
> +    ovs_list_init(&reassembled_pkt_list);
> +    atomic_init(&min_v4_frag_size, IPF_V4_FRAG_SIZE_MIN_DEF);
> +    atomic_init(&min_v6_frag_size, IPF_V6_FRAG_SIZE_MIN_DEF);
> +    max_v4_frag_list_size = DIV_ROUND_UP(
> +        IPV4_PACKET_MAX_SIZE - IPV4_PACKET_MAX_HDR_SIZE,
> +        min_v4_frag_size - IPV4_PACKET_MAX_HDR_SIZE);
> +    ipf_lock_unlock(&ipf_lock);
> +    atomic_count_init(&nfrag, 0);
> +    atomic_init(&n4frag_accepted, 0);
> +    atomic_init(&n4frag_completed_sent, 0);
> +    atomic_init(&n4frag_expired_sent, 0);
> +    atomic_init(&n4frag_too_small, 0);
> +    atomic_init(&n4frag_overlap, 0);
> +    atomic_init(&n4frag_purged, 0);
> +    atomic_init(&n6frag_accepted, 0);
> +    atomic_init(&n6frag_completed_sent, 0);
> +    atomic_init(&n6frag_expired_sent, 0);
> +    atomic_init(&n6frag_too_small, 0);
> +    atomic_init(&n6frag_overlap, 0);
> +    atomic_init(&n6frag_purged, 0);
> +    atomic_init(&nfrag_max, IPF_MAX_FRAGS_DEFAULT);
> +    atomic_init(&ifp_v4_enabled, true);
> +    atomic_init(&ifp_v6_enabled, true);
> +    latch_init(&ipf_clean_thread_exit);
> +    ipf_clean_thread = ovs_thread_create("ipf_clean",
> +                                         ipf_clean_thread_main, NULL);
> +}
> +
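
The max_v4_frag_list_size computation in ipf_init() is a ceiling division of the largest possible IPv4 payload by the smallest payload a non-final fragment may carry. A standalone sketch follows. The constants are assumptions made for illustration: 65535 and 60 are the protocol maxima for IPv4 total length and header length, and are assumed here to match IPV4_PACKET_MAX_SIZE and IPV4_PACKET_MAX_HDR_SIZE; the minimum fragment size is left as a parameter because the value of IPF_V4_FRAG_SIZE_MIN_DEF is not visible in this part of the patch.

```c
#include <stdint.h>

/* Assumed values; see the note above. */
#define SKETCH_V4_PACKET_MAX_SIZE     65535
#define SKETCH_V4_PACKET_MAX_HDR_SIZE 60

/* Ceiling division, in the spirit of DIV_ROUND_UP. */
static uint32_t
sketch_div_round_up(uint32_t x, uint32_t y)
{
    return (x + y - 1) / y;
}

/* Upper bound on the number of fragments in one IPv4 datagram when every
 * non-final fragment is at least 'min_frag_size' bytes long.  The caller
 * must pass a value larger than the maximum header size. */
static uint32_t
sketch_max_v4_frag_list_size(uint32_t min_frag_size)
{
    return sketch_div_round_up(
        SKETCH_V4_PACKET_MAX_SIZE - SKETCH_V4_PACKET_MAX_HDR_SIZE,
        min_frag_size - SKETCH_V4_PACKET_MAX_HDR_SIZE);
}
```

With a hypothetical minimum fragment size of 1200 bytes, this gives an upper bound of 58 fragments per list.
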
> +void
> +ipf_destroy(void)
> +{
> +    ipf_lock_lock(&ipf_lock);
> +
> +    latch_set(&ipf_clean_thread_exit);
> +    pthread_join(ipf_clean_thread, NULL);
> +    latch_destroy(&ipf_clean_thread_exit);
> +
> +    struct ipf_list *ipf_list;
> +    HMAP_FOR_EACH_POP (ipf_list, node, &frag_lists) {
> +        struct dp_packet *pkt;
> +        while (ipf_list->last_sent_idx < ipf_list->last_inuse_idx) {
> +            pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
> +            dp_packet_delete(pkt);
> +            atomic_count_dec(&nfrag);
> +            ipf_list->last_sent_idx++;
> +        }
> +        free(ipf_list->frag_list);
> +        free(ipf_list);
> +    }
> +
> +    if (atomic_count_get(&nfrag)) {
> +        VLOG_WARN("ipf destroy with non-zero fragment count. ");
> +    }
> +
> +    struct reassembled_pkt * rp;
> +    LIST_FOR_EACH_POP (rp, rp_list_node, &reassembled_pkt_list) {
> +        dp_packet_delete(rp->pkt);
> +        free(rp);
> +    }
> +
> +    hmap_destroy(&frag_lists);
> +    ovs_list_poison(&frag_exp_list);
> +    ovs_list_poison(&frag_complete_list);
> +    ovs_list_poison(&reassembled_pkt_list);
> +    ipf_lock_unlock(&ipf_lock);
> +    ipf_lock_destroy(&ipf_lock);
> +}
> diff --git a/lib/ipf.h b/lib/ipf.h
> new file mode 100644
> index 0000000..040031f
> --- /dev/null
> +++ b/lib/ipf.h
> @@ -0,0 +1,33 @@
> +/*
> + * Copyright (c) 2018 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef IPF_H
> +#define IPF_H 1
> +
> +#include "dp-packet.h"
> +#include "openvswitch/types.h"
> +
> +void ipf_preprocess_conntrack(struct dp_packet_batch *pb, long long now,
> +                              ovs_be16 dl_type, uint16_t zone,
> +                              uint32_t hash_basis);
> +
> +void ipf_postprocess_conntrack(struct dp_packet_batch *pb, long long now,
> +                               ovs_be16 dl_type);
> +
> +void ipf_init(void);
> +void ipf_destroy(void);
> +
> +#endif /* ipf.h */
> diff --git a/tests/system-kmod-macros.at b/tests/system-kmod-macros.at
> index 3296d64..3fbead8 100644
> --- a/tests/system-kmod-macros.at
> +++ b/tests/system-kmod-macros.at
> @@ -77,12 +77,6 @@ m4_define([CHECK_CONNTRACK],
>  #
>  m4_define([CHECK_CONNTRACK_ALG])
>
> -# CHECK_CONNTRACK_FRAG()
> -#
> -# Perform requirements checks for running conntrack fragmentations tests.
> -# The kernel always supports fragmentation, so no check is needed.
> -m4_define([CHECK_CONNTRACK_FRAG])
> -
>  # CHECK_CONNTRACK_LOCAL_STACK()
>  #
>  # Perform requirements checks for running conntrack tests with local stack.
> @@ -91,6 +85,10 @@ m4_define([CHECK_CONNTRACK_FRAG])
>  # needed.
>  m4_define([CHECK_CONNTRACK_LOCAL_STACK])
>
> +# CHECK_CONNTRACK_SMALL_FRAG()
> +#
> +m4_define([CHECK_CONNTRACK_SMALL_FRAG])
> +
>  # CHECK_CONNTRACK_FRAG_OVERLAP()
>  #
>  # The kernel does not support overlapping fragments checking.
> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> index 840fea9..c4f6e47 100644
> --- a/tests/system-traffic.at
> +++ b/tests/system-traffic.at
> @@ -2347,7 +2347,6 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv4 fragmentation])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START()
>
>  ADD_NAMESPACES(at_ns0, at_ns1)
> @@ -2381,7 +2380,6 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv4 fragmentation expiry])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START()
>
>  ADD_NAMESPACES(at_ns0, at_ns1)
> @@ -2412,7 +2410,6 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv4 fragmentation + vlan])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START()
>
>  ADD_NAMESPACES(at_ns0, at_ns1)
> @@ -2448,7 +2445,6 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv4 fragmentation + cvlan])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START([set Open_vSwitch . other_config:vlan-limit=0])
>  OVS_CHECK_8021AD()
>
> @@ -2523,7 +2519,7 @@ AT_CLEANUP
>  dnl Uses same first fragment as above 'incomplete reassembled packet' test.
>  AT_SETUP([conntrack - IPv4 fragmentation with fragments specified])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
> +CHECK_CONNTRACK_SMALL_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START()
>
>  ADD_NAMESPACES(at_ns0, at_ns1)
> @@ -2547,7 +2543,7 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv4 fragmentation out of order])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
> +CHECK_CONNTRACK_SMALL_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START()
>
>  ADD_NAMESPACES(at_ns0, at_ns1)
> @@ -2571,7 +2567,7 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv4 fragmentation overlapping fragments by 1 octet])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
> +CHECK_CONNTRACK_SMALL_FRAG()
>  CHECK_CONNTRACK_FRAG_OVERLAP()
>  OVS_TRAFFIC_VSWITCHD_START()
>
> @@ -2595,7 +2591,7 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv4 fragmentation overlapping fragments by 1 octet out of order])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
> +CHECK_CONNTRACK_SMALL_FRAG()
>  CHECK_CONNTRACK_FRAG_OVERLAP()
>  OVS_TRAFFIC_VSWITCHD_START()
>
> @@ -2619,7 +2615,6 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv6 fragmentation])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START()
>
>  ADD_NAMESPACES(at_ns0, at_ns1)
> @@ -2659,7 +2654,6 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv6 fragmentation expiry])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START()
>
>  ADD_NAMESPACES(at_ns0, at_ns1)
> @@ -2700,7 +2694,6 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv6 fragmentation + vlan])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START()
>
>  ADD_NAMESPACES(at_ns0, at_ns1)
> @@ -2743,7 +2736,6 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv6 fragmentation + cvlan])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START([set Open_vSwitch . other_config:vlan-limit=0])
>  OVS_CHECK_8021AD()
>
> @@ -2818,7 +2810,7 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv6 fragmentation with fragments specified])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
> +CHECK_CONNTRACK_SMALL_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START()
>
>  ADD_NAMESPACES(at_ns0, at_ns1)
> @@ -2842,7 +2834,7 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv6 fragmentation out of order])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
> +CHECK_CONNTRACK_SMALL_FRAG()
>  OVS_TRAFFIC_VSWITCHD_START()
>
>  ADD_NAMESPACES(at_ns0, at_ns1)
> @@ -2866,7 +2858,7 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
> +CHECK_CONNTRACK_SMALL_FRAG()
>  CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
>  OVS_TRAFFIC_VSWITCHD_START()
>
> @@ -2892,7 +2884,7 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers + out of order])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
> +CHECK_CONNTRACK_SMALL_FRAG()
>  CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
>  OVS_TRAFFIC_VSWITCHD_START()
>
> @@ -2918,7 +2910,7 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers 2])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
> +CHECK_CONNTRACK_SMALL_FRAG()
>  CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
>  OVS_TRAFFIC_VSWITCHD_START()
>
> @@ -2944,7 +2936,7 @@ AT_CLEANUP
>
>  AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers 2 + out of order])
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
> +CHECK_CONNTRACK_SMALL_FRAG()
>  CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
>  OVS_TRAFFIC_VSWITCHD_START()
>
> @@ -2971,7 +2963,6 @@ AT_CLEANUP
>  AT_SETUP([conntrack - Fragmentation over vxlan])
>  OVS_CHECK_VXLAN()
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
>  CHECK_CONNTRACK_LOCAL_STACK()
>
>  OVS_TRAFFIC_VSWITCHD_START()
> @@ -3024,7 +3015,6 @@ AT_CLEANUP
>  AT_SETUP([conntrack - IPv6 Fragmentation over vxlan])
>  OVS_CHECK_VXLAN()
>  CHECK_CONNTRACK()
> -CHECK_CONNTRACK_FRAG()
>  CHECK_CONNTRACK_LOCAL_STACK()
>
>  OVS_TRAFFIC_VSWITCHD_START()
> diff --git a/tests/system-userspace-macros.at b/tests/system-userspace-macros.at
> index 27bde8b..219c046 100644
> --- a/tests/system-userspace-macros.at
> +++ b/tests/system-userspace-macros.at
> @@ -73,15 +73,6 @@ m4_define([CHECK_CONNTRACK],
>  #
>  m4_define([CHECK_CONNTRACK_ALG])
>
> -# CHECK_CONNTRACK_FRAG()
> -#
> -# Perform requirements checks for running conntrack fragmentations tests.
> -# The userspace doesn't support fragmentation yet, so skip the tests.
> -m4_define([CHECK_CONNTRACK_FRAG],
> -[
> -    AT_SKIP_IF([:])
> -])
> -
>  # CHECK_CONNTRACK_LOCAL_STACK()
>  #
>  # Perform requirements checks for running conntrack tests with local stack.
> @@ -93,22 +84,26 @@ m4_define([CHECK_CONNTRACK_LOCAL_STACK],
>      AT_SKIP_IF([:])
>  ])
>
> -# CHECK_CONNTRACK_FRAG_OVERLAP()
> +# CHECK_CONNTRACK_SMALL_FRAG()
>  #
> -# The userspace datapath does not support fragments yet.
> -m4_define([CHECK_CONNTRACK_FRAG_OVERLAP],
> +m4_define([CHECK_CONNTRACK_SMALL_FRAG],
>  [
>      AT_SKIP_IF([:])
>  ])
>
> -# CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
> +# CHECK_CONNTRACK_FRAG_OVERLAP()
>  #
>  # The userspace datapath does not support fragments yet.
> -m4_define([CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN],
> +m4_define([CHECK_CONNTRACK_FRAG_OVERLAP],
>  [
>      AT_SKIP_IF([:])
>  ])
>
> +# CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN
> +#
> +# The userspace datapath supports fragments with multiple extension headers.
> +m4_define([CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN])
> +
>  # CHECK_CONNTRACK_NAT()
>  #
>  # Perform requirements checks for running conntrack NAT tests. The userspace
> --
> 1.9.1
>
>
Darrell Ball Nov. 19, 2018, 9:40 p.m. UTC | #2
I folded in the following incremental locally to address:

1/ The capitalization inconsistency in 'releases.rst' mentioned below.
2/ A misplaced counter increment.

Darrell

diff --git a/Documentation/faq/releases.rst b/Documentation/faq/releases.rst
index d281c97..1fb3b1c 100644
--- a/Documentation/faq/releases.rst
+++ b/Documentation/faq/releases.rst
@@ -107,9 +107,9 @@ Q: Are all features available with all datapaths?
     ========================== ============== ============== ========= =======
     Feature                    Linux upstream Linux OVS tree Userspace Hyper-V
     ========================== ============== ============== ========= =======
-    Connection tracking             4.3            YES          Yes      YES
-    Conntrack Fragment Reass.       4.3            Yes          Yes      YES
-    NAT                             4.6            YES          Yes      NO
+    Connection tracking             4.3            YES          YES      YES
+    Conntrack Fragment Reass.       4.3            YES          YES      YES
+    NAT                             4.6            YES          YES      NO
     Conntrack zone limit            4.18           YES          NO       NO
     Tunnel - LISP                   NO             YES          NO       NO
     Tunnel - STT                    NO             YES          NO       YES
diff --git a/lib/ipf.c b/lib/ipf.c
index cc8427e..c1a234f 100644
--- a/lib/ipf.c
+++ b/lib/ipf.c
@@ -1016,12 +1016,12 @@ ipf_purge_list_check(struct ipf_list *ipf_list, long long now)
         pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
         dp_packet_delete(pkt);
         atomic_count_dec(&nfrag);
+        COVERAGE_INC(ipf_stuck_frag_list_purged);
+        ipf_count(ipf_list->key.dl_type == htons(ETH_TYPE_IPV6),
+                  IPF_COUNTER_NFRAGS_PURGED);
         ipf_list->last_sent_idx++;
     }

-    COVERAGE_INC(ipf_stuck_frag_list_purged);
-    ipf_count(ipf_list->key.dl_type == htons(ETH_TYPE_IPV6),
-              IPF_COUNTER_NFRAGS_PURGED);
     return true;
 }


On Mon, Nov 19, 2018 at 11:51 AM Darrell Ball <dlu998@gmail.com> wrote:

>
>
> On Mon, Nov 19, 2018 at 11:11 AM Darrell Ball <dlu998@gmail.com> wrote:
>
>> Fragmentation handling is added for supporting conntrack.
>> Both v4 and v6 are supported.
>>
>> After discussion with several people, I decided to not store
>> configuration state in the database to be more consistent with
>> the kernel in future, similarity with other conntrack configuration
>> which will not be in the database as well and overall simplicity.
>> Accordingly, fragmentation handling is enabled by default.
>>
>> This patch enables fragmentation tests for the userspace datapath.
>>
>> Signed-off-by: Darrell Ball <dlu998@gmail.com>
>> ---
>>  Documentation/faq/releases.rst   |   49 +-
>>  NEWS                             |    2 +
>>  include/sparse/netinet/ip6.h     |    1 +
>>  lib/automake.mk                  |    4 +-
>>  lib/conntrack.c                  |   13 +-
>>  lib/ipf.c                        | 1363 ++++++++++++++++++++++++++++++++++++++
>>  lib/ipf.h                        |   33 +
>>  tests/system-kmod-macros.at      |   10 +-
>>  tests/system-traffic.at          |   30 +-
>>  tests/system-userspace-macros.at |   23 +-
>>  10 files changed, 1460 insertions(+), 68 deletions(-)
>>  create mode 100644 lib/ipf.c
>>  create mode 100644 lib/ipf.h
>>
>> diff --git a/Documentation/faq/releases.rst b/Documentation/faq/releases.rst
>> index 96da23c..d281c97 100644
>> --- a/Documentation/faq/releases.rst
>> +++ b/Documentation/faq/releases.rst
>> @@ -104,31 +104,30 @@ Q: Are all features available with all datapaths?
>>      The following table lists the datapath supported features from an Open
>>      vSwitch user's perspective.
>>
>> -    ===================== ============== ============== ========= =======
>> -    Feature               Linux upstream Linux OVS tree Userspace Hyper-V
>> -    ===================== ============== ============== ========= =======
>> -    NAT                   4.6            YES            Yes       NO
>> -    Connection tracking   4.3            YES            PARTIAL   PARTIAL
>> -    Tunnel - LISP         NO             YES            NO        NO
>> -    Tunnel - STT          NO             YES            NO        YES
>> -    Tunnel - GRE          3.11           YES            YES       YES
>> -    Tunnel - VXLAN        3.12           YES            YES       YES
>> -    Tunnel - Geneve       3.18           YES            YES       YES
>> -    Tunnel - GRE-IPv6     4.18           YES            YES       NO
>> -    Tunnel - VXLAN-IPv6   4.3            YES            YES       NO
>> -    Tunnel - Geneve-IPv6  4.4            YES            YES       NO
>> -    Tunnel - ERSPAN       4.18           YES            YES       NO
>> -    Tunnel - ERSPAN-IPv6  4.18           YES            YES       NO
>> -    QoS - Policing        YES            YES            YES       NO
>> -    QoS - Shaping         YES            YES            NO        NO
>> -    sFlow                 YES            YES            YES       NO
>> -    IPFIX                 3.10           YES            YES       NO
>> -    Set action            YES            YES            YES       PARTIAL
>> -    NIC Bonding           YES            YES            YES       YES
>> -    Multiple VTEPs        YES            YES            YES       YES
>> -    Meters                4.15           YES            YES       NO
>> -    Conntrack zone limit  4.18           YES            NO        NO
>> -    ===================== ============== ============== ========= =======
>> +    ========================== ============== ============== ========= =======
>> +    Feature                    Linux upstream Linux OVS tree Userspace Hyper-V
>> +    ========================== ============== ============== ========= =======
>> +    Connection tracking             4.3            YES          Yes      YES
>> +    Conntrack Fragment Reass.       4.3            Yes          Yes      YES
>> +    NAT                             4.6            YES          Yes      NO
>>
>
>
> There is a capitalization inconsistency in the existing 'NAT' entry that
> got copied to a few new entries.
>
>
>
>> +    Conntrack zone limit            4.18           YES          NO       NO
>> +    Tunnel - LISP                   NO             YES          NO       NO
>> +    Tunnel - STT                    NO             YES          NO       YES
>> +    Tunnel - GRE                    3.11           YES          YES      YES
>> +    Tunnel - VXLAN                  3.12           YES          YES      YES
>> +    Tunnel - Geneve                 3.18           YES          YES      YES
>> +    Tunnel - GRE-IPv6               NO             NO           YES      NO
>> +    Tunnel - VXLAN-IPv6             4.3            YES          YES      NO
>> +    Tunnel - Geneve-IPv6            4.4            YES          YES      NO
>> +    QoS - Policing                  YES            YES          YES      NO
>> +    QoS - Shaping                   YES            YES          NO       NO
>> +    sFlow                           YES            YES          YES      NO
>> +    IPFIX                           3.10           YES          YES      NO
>> +    Set action                      YES            YES          YES      PARTIAL
>> +    NIC Bonding                     YES            YES          YES      YES
>> +    Multiple VTEPs                  YES            YES          YES      YES
>> +    Meters                          4.15           YES          YES      NO
>> +    ========================== ============== ============== ========= =======
>>
>>      Do note, however:
>>
>> diff --git a/NEWS b/NEWS
>> index 02402d1..d2e8724 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -9,6 +9,8 @@ Post-v2.10.0
>>     - ovn:
>>       * New support for IPSEC encrypted tunnels between hypervisors.
>>       * ovn-ctl: allow passing user:group ids to the OVN daemons.
>> +   - Userspace datapath:
>> +     * Add v4/v6 fragmentation support for conntrack.
>>     - DPDK:
>>       * Add option for simple round-robin based Rxq to PMD assignment.
>>         It can be set with pmd-rxq-assign.
>> diff --git a/include/sparse/netinet/ip6.h b/include/sparse/netinet/ip6.h
>> index d2a54de..bfa637a 100644
>> --- a/include/sparse/netinet/ip6.h
>> +++ b/include/sparse/netinet/ip6.h
>> @@ -64,5 +64,6 @@ struct ip6_frag {
>>  };
>>
>>  #define IP6F_OFF_MASK ((OVS_FORCE ovs_be16) 0xfff8)
>> +#define IP6F_MORE_FRAG ((OVS_FORCE ovs_be16) 0x0001)
>>
>>  #endif /* netinet/ip6.h sparse */
>> diff --git a/lib/automake.mk b/lib/automake.mk
>> index 63e9d72..b24b028 100644
>> --- a/lib/automake.mk
>> +++ b/lib/automake.mk
>> @@ -1,4 +1,4 @@
>> -# Copyright (C) 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017
>> Nicira, Inc.
>> +# Copyright (C) 2009-2018 Nicira, Inc.
>>  #
>>  # Copying and distribution of this file, with or without modification,
>>  # are permitted in any medium without royalty provided the copyright
>> @@ -107,6 +107,8 @@ lib_libopenvswitch_la_SOURCES = \
>>         lib/hmapx.h \
>>         lib/id-pool.c \
>>         lib/id-pool.h \
>> +       lib/ipf.c \
>> +       lib/ipf.h \
>>         lib/jhash.c \
>>         lib/jhash.h \
>>         lib/json.c \
>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>> index 3f50fc8..be8debb 100644
>> --- a/lib/conntrack.c
>> +++ b/lib/conntrack.c
>> @@ -30,6 +30,7 @@
>>  #include "ct-dpif.h"
>>  #include "dp-packet.h"
>>  #include "flow.h"
>> +#include "ipf.h"
>>  #include "netdev.h"
>>  #include "odp-netlink.h"
>>  #include "openvswitch/hmap.h"
>> @@ -339,6 +340,7 @@ conntrack_init(struct conntrack *ct)
>>      atomic_init(&ct->n_conn_limit, DEFAULT_N_CONN_LIMIT);
>>      latch_init(&ct->clean_thread_exit);
>>      ct->clean_thread = ovs_thread_create("ct_clean", clean_thread_main,
>> ct);
>> +    ipf_init();
>>  }
>>
>>  /* Destroys the connection tracker 'ct' and frees all the allocated
>> memory. */
>> @@ -381,6 +383,7 @@ conntrack_destroy(struct conntrack *ct)
>>      hindex_destroy(&ct->alg_expectation_refs);
>>      ct_rwlock_unlock(&ct->resources_lock);
>>      ct_rwlock_destroy(&ct->resources_lock);
>> +    ipf_destroy();
>>  }
>>
>>  static unsigned hash_to_bucket(uint32_t hash)
>> @@ -1295,7 +1298,8 @@ process_one(struct conntrack *ct, struct dp_packet
>> *pkt,
>>
>>  /* Sends the packets in '*pkt_batch' through the connection tracker
>> 'ct'.  All
>>   * the packets must have the same 'dl_type' (IPv4 or IPv6) and should
>> have
>> - * the l3 and and l4 offset properly set.
>> + * the l3 and l4 offset properly set.  Performs fragment reassembly with
>> + * the help of ipf_preprocess_conntrack().
>>   *
>>   * If 'commit' is true, the packets are allowed to create new entries in
>> the
>>   * connection tables.  'setmark', if not NULL, should point to a two
>> @@ -1310,11 +1314,14 @@ conntrack_execute(struct conntrack *ct, struct
>> dp_packet_batch *pkt_batch,
>>                    const struct nat_action_info_t *nat_action_info,
>>                    long long now)
>>  {
>> +    ipf_preprocess_conntrack(pkt_batch, now, dl_type, zone,
>> ct->hash_basis);
>> +
>>      struct dp_packet *packet;
>>      struct conn_lookup_ctx ctx;
>>
>>      DP_PACKET_BATCH_FOR_EACH (i, packet, pkt_batch) {
>> -        if (!conn_key_extract(ct, packet, dl_type, &ctx, zone)) {
>> +        if (packet->md.ct_state == CS_INVALID
>> +            || !conn_key_extract(ct, packet, dl_type, &ctx, zone)) {
>>              packet->md.ct_state = CS_INVALID;
>>              write_ct_md(packet, zone, NULL, NULL, NULL);
>>              continue;
>> @@ -1323,6 +1330,8 @@ conntrack_execute(struct conntrack *ct, struct
>> dp_packet_batch *pkt_batch,
>>                      setlabel, nat_action_info, tp_src, tp_dst, helper);
>>      }
>>
>> +    ipf_postprocess_conntrack(pkt_batch, now, dl_type);
>> +
>>      return 0;
>>  }
>>
>> diff --git a/lib/ipf.c b/lib/ipf.c
>> new file mode 100644
>> index 0000000..cc8427e
>> --- /dev/null
>> +++ b/lib/ipf.c
>> @@ -0,0 +1,1363 @@
>> +/*
>> + * Copyright (c) 2018 Nicira, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>> implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#include <config.h>
>> +#include <ctype.h>
>> +#include <errno.h>
>> +#include <sys/types.h>
>> +#include <netinet/in.h>
>> +#include <netinet/ip6.h>
>> +#include <netinet/icmp6.h>
>> +#include <string.h>
>> +
>> +#include "coverage.h"
>> +#include "csum.h"
>> +#include "ipf.h"
>> +#include "latch.h"
>> +#include "openvswitch/hmap.h"
>> +#include "openvswitch/poll-loop.h"
>> +#include "openvswitch/vlog.h"
>> +#include "ovs-atomic.h"
>> +#include "packets.h"
>> +#include "util.h"
>> +
>> +VLOG_DEFINE_THIS_MODULE(ipf);
>> +COVERAGE_DEFINE(ipf_stuck_frag_list_purged);
>> +
>> +enum {
>> +    IPV4_PACKET_MAX_HDR_SIZE = 60,
>> +    IPV4_PACKET_MAX_SIZE = 65535,
>> +    IPV6_PACKET_MAX_DATA = 65535,
>> +};
>> +
>> +enum ipf_list_state {
>> +    IPF_LIST_STATE_UNUSED,
>> +    IPF_LIST_STATE_REASS_FAIL,
>> +    IPF_LIST_STATE_OTHER_SEEN,
>> +    IPF_LIST_STATE_FIRST_SEEN,
>> +    IPF_LIST_STATE_LAST_SEEN,
>> +    IPF_LIST_STATE_FIRST_LAST_SEEN,
>> +    IPF_LIST_STATE_COMPLETED,
>> +    IPF_LIST_STATE_NUM,
>> +};
>> +
>> +enum ipf_list_type {
>> +    IPF_FRAG_COMPLETED_LIST,
>> +    IPF_FRAG_EXPIRY_LIST,
>> +};
>> +
>> +enum {
>> +    IPF_INVALID_IDX = -1,
>> +    IPF_V4_FRAG_SIZE_LBOUND = 400,
>> +    IPF_V4_FRAG_SIZE_MIN_DEF = 1200,
>> +    IPF_V6_FRAG_SIZE_LBOUND = 400, /* Useful for testing. */
>> +    IPF_V6_FRAG_SIZE_MIN_DEF = 1280,
>> +    IPF_MAX_FRAGS_DEFAULT = 1000,
>> +    IPF_NFRAG_UBOUND = 5000,
>> +};
>> +
>> +enum ipf_counter_type {
>> +    IPF_COUNTER_NFRAGS,
>> +    IPF_COUNTER_NFRAGS_ACCEPTED,
>> +    IPF_COUNTER_NFRAGS_COMPL_SENT,
>> +    IPF_COUNTER_NFRAGS_EXPD_SENT,
>> +    IPF_COUNTER_NFRAGS_TOO_SMALL,
>> +    IPF_COUNTER_NFRAGS_OVERLAP,
>> +    IPF_COUNTER_NFRAGS_PURGED,
>> +};
>> +
>> +struct ipf_addr {
>> +    union {
>> +        ovs_16aligned_be32 ipv4;
>> +        union ovs_16aligned_in6_addr ipv6;
>> +        ovs_be32 ipv4_aligned;
>> +        struct in6_addr ipv6_aligned;
>> +    };
>> +};
>> +
>> +struct ipf_frag {
>> +    struct dp_packet *pkt;
>> +    uint16_t start_data_byte;
>> +    uint16_t end_data_byte;
>> +};
>> +
>> +struct ipf_list_key {
>> +    struct ipf_addr src_addr;
>> +    struct ipf_addr dst_addr;
>> +    uint32_t recirc_id;
>> +    ovs_be32 ip_id;   /* V6 is 32 bits. */
>> +    ovs_be16 dl_type;
>> +    uint16_t zone;
>> +    uint8_t nw_proto;
>> +};
>> +
>> +struct ipf_list {
>> +    struct hmap_node node;
>> +    struct ovs_list list_node;
>> +    struct ipf_frag *frag_list;
>> +    struct ipf_list_key key;
>> +    struct dp_packet *reass_execute_ctx; /* Reassembled packet. */
>> +    long long expiration;          /* In milliseconds. */
>> +    int last_sent_idx;             /* Last sent fragment idx. */
>> +    int last_inuse_idx;            /* Last inuse fragment idx. */
>> +    int size;                      /* Fragment list size. */
>> +    uint8_t state;                 /* Frag list state; see ipf_list_state. */
>> +};
>> +
>> +struct reassembled_pkt {
>> +    struct ovs_list rp_list_node;
>> +    struct dp_packet *pkt;
>> +    struct ipf_list *list;
>> +};
>> +
>> +struct OVS_LOCKABLE ipf_lock {
>> +    struct ovs_mutex lock;
>> +};
>> +
>> +static struct ipf_lock ipf_lock;
>> +
>> +static int max_v4_frag_list_size;
>> +
>> +static pthread_t ipf_clean_thread;
>> +static struct latch ipf_clean_thread_exit;
>> +
>> +static struct hmap frag_lists OVS_GUARDED_BY(ipf_lock);
>> +static struct ovs_list frag_exp_list OVS_GUARDED_BY(ipf_lock);
>> +static struct ovs_list frag_complete_list OVS_GUARDED_BY(ipf_lock);
>> +static struct ovs_list reassembled_pkt_list OVS_GUARDED_BY(ipf_lock);
>> +
>> +static atomic_bool ifp_v4_enabled;
>> +static atomic_bool ifp_v6_enabled;
>> +static atomic_uint nfrag_max;
>> +/* Will be clamped above 400 bytes; the value chosen should handle
>> + * alg control packets of interest that use string encoding of mutable
>> + * IP fields; meaning, the control packets should not be fragmented. */
>> +static atomic_uint min_v4_frag_size;
>> +static atomic_uint min_v6_frag_size;
>> +
>> +static atomic_count nfrag;
>> +
>> +static atomic_uint64_t n4frag_accepted;
>> +static atomic_uint64_t n4frag_completed_sent;
>> +static atomic_uint64_t n4frag_expired_sent;
>> +static atomic_uint64_t n4frag_too_small;
>> +static atomic_uint64_t n4frag_overlap;
>> +static atomic_uint64_t n4frag_purged;
>> +static atomic_uint64_t n6frag_accepted;
>> +static atomic_uint64_t n6frag_completed_sent;
>> +static atomic_uint64_t n6frag_expired_sent;
>> +static atomic_uint64_t n6frag_too_small;
>> +static atomic_uint64_t n6frag_overlap;
>> +static atomic_uint64_t n6frag_purged;
>> +
>> +static void
>> +ipf_print_reass_packet(char *es, void *pkt)
>> +{
>> +    const unsigned char *b = (const unsigned char *) pkt;
>> +    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(10, 10);
>> +    VLOG_WARN_RL(&rl, "%s 91 bytes from specified part of packet "
>> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
>> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
>> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
>> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
>> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
>> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
>> +                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X",
>> +                 es, b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7], b[8],
>> +                 b[9], b[10], b[11], b[12], b[13], b[14], b[15], b[16],
>> +                 b[17], b[18], b[19], b[20], b[21], b[22], b[23], b[24],
>> +                 b[25], b[26], b[27], b[28], b[29], b[30], b[31], b[32],
>> +                 b[33], b[34], b[35], b[36], b[37], b[38], b[39], b[40],
>> +                 b[41], b[42], b[43], b[44], b[45], b[46], b[47], b[48],
>> +                 b[49], b[50], b[51], b[52], b[53], b[54], b[55], b[56],
>> +                 b[57], b[58], b[59], b[60], b[61], b[62], b[63], b[64],
>> +                 b[65], b[66], b[67], b[68], b[69], b[70], b[71], b[72],
>> +                 b[73], b[74], b[75], b[76], b[77], b[78], b[79], b[80],
>> +                 b[81], b[82], b[83], b[84], b[85], b[86], b[87], b[88],
>> +                 b[89], b[90]);
>> +}
>> +
>> +static void ipf_lock_init(struct ipf_lock *lock)
>> +{
>> +    ovs_mutex_init_adaptive(&lock->lock);
>> +}
>> +
>> +static void ipf_lock_lock(struct ipf_lock *lock)
>> +    OVS_ACQUIRES(lock)
>> +    OVS_NO_THREAD_SAFETY_ANALYSIS
>> +{
>> +    ovs_mutex_lock(&lock->lock);
>> +}
>> +
>> +static void ipf_lock_unlock(struct ipf_lock *lock)
>> +    OVS_RELEASES(lock)
>> +    OVS_NO_THREAD_SAFETY_ANALYSIS
>> +{
>> +    ovs_mutex_unlock(&lock->lock);
>> +}
>> +
>> +static void ipf_lock_destroy(struct ipf_lock *lock)
>> +{
>> +    ovs_mutex_destroy(&lock->lock);
>> +}
>> +
>> +static void
>> +ipf_count(bool v6, enum ipf_counter_type cntr)
>> +{
>> +    switch (cntr) {
>> +    case IPF_COUNTER_NFRAGS_ACCEPTED:
>> +        atomic_count_inc64(v6 ? &n6frag_accepted : &n4frag_accepted);
>> +        break;
>> +    case IPF_COUNTER_NFRAGS_COMPL_SENT:
>> +        atomic_count_inc64(v6 ? &n6frag_completed_sent
>> +                         : &n4frag_completed_sent);
>> +        break;
>> +    case IPF_COUNTER_NFRAGS_EXPD_SENT:
>> +        atomic_count_inc64(v6 ? &n6frag_expired_sent
>> +                         : &n4frag_expired_sent);
>> +        break;
>> +    case IPF_COUNTER_NFRAGS_TOO_SMALL:
>> +        atomic_count_inc64(v6 ? &n6frag_too_small : &n4frag_too_small);
>> +        break;
>> +    case IPF_COUNTER_NFRAGS_OVERLAP:
>> +        atomic_count_inc64(v6 ? &n6frag_overlap : &n4frag_overlap);
>> +        break;
>> +    case IPF_COUNTER_NFRAGS_PURGED:
>> +        atomic_count_inc64(v6 ? &n6frag_purged : &n4frag_purged);
>> +        break;
>> +    case IPF_COUNTER_NFRAGS:
>> +    default:
>> +        OVS_NOT_REACHED();
>> +    }
>> +}
>> +
>> +static bool
>> +ipf_get_v4_enabled(void)
>> +{
>> +    bool ifp_v4_enabled_;
>> +    atomic_read_relaxed(&ifp_v4_enabled, &ifp_v4_enabled_);
>> +    return ifp_v4_enabled_;
>> +}
>> +
>> +static bool
>> +ipf_get_v6_enabled(void)
>> +{
>> +    bool ifp_v6_enabled_;
>> +    atomic_read_relaxed(&ifp_v6_enabled, &ifp_v6_enabled_);
>> +    return ifp_v6_enabled_;
>> +}
>> +
>> +static bool
>> +ipf_get_enabled(void)
>> +{
>> +    return ipf_get_v4_enabled() || ipf_get_v6_enabled();
>> +}
>> +
>> +static uint32_t
>> +ipf_addr_hash_add(uint32_t hash, const struct ipf_addr *addr)
>> +{
>> +    BUILD_ASSERT_DECL(sizeof *addr % 4 == 0);
>> +    return hash_add_bytes32(hash, (const uint32_t *) addr, sizeof *addr);
>> +}
>> +
>> +static void
>> +ipf_expiry_list_add(struct ipf_list *ipf_list, long long now)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    enum {
>> +        IPF_FRAG_LIST_TIMEOUT = 15000,
>> +    };
>> +
>> +    ipf_list->expiration = now + IPF_FRAG_LIST_TIMEOUT;
>> +    ovs_list_push_back(&frag_exp_list, &ipf_list->list_node);
>> +}
>> +
>> +static void
>> +ipf_completed_list_add(struct ipf_list *ipf_list)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    ovs_list_push_back(&frag_complete_list, &ipf_list->list_node);
>> +}
>> +
>> +static void
>> +ipf_reassembled_list_add(struct reassembled_pkt *rp)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    ovs_list_push_back(&reassembled_pkt_list, &rp->rp_list_node);
>> +}
>> +
>> +static void
>> +ipf_list_clean(struct ipf_list *ipf_list)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    ovs_list_remove(&ipf_list->list_node);
>> +    hmap_remove(&frag_lists, &ipf_list->node);
>> +    free(ipf_list->frag_list);
>> +    free(ipf_list);
>> +}
>> +
>> +static void
>> +ipf_expiry_list_clean(struct ipf_list *ipf_list)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    ipf_list_clean(ipf_list);
>> +}
>> +
>> +static void
>> +ipf_completed_list_clean(struct ipf_list *ipf_list)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    ipf_list_clean(ipf_list);
>> +}
>> +
>> +static void
>> +ipf_expiry_list_remove(struct ipf_list *ipf_list)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    ovs_list_remove(&ipf_list->list_node);
>> +}
>> +
>> +static void
>> +ipf_reassembled_list_remove(struct reassembled_pkt *rp)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    ovs_list_remove(&rp->rp_list_node);
>> +}
>> +
>> +/* Symmetric */
>> +static uint32_t
>> +ipf_list_key_hash(const struct ipf_list_key *key, uint32_t basis)
>> +{
>> +    uint32_t hsrc, hdst, hash;
>> +    hsrc = hdst = basis;
>> +    hsrc = ipf_addr_hash_add(hsrc, &key->src_addr);
>> +    hdst = ipf_addr_hash_add(hdst, &key->dst_addr);
>> +    hash = hsrc ^ hdst;
>> +
>> +    /* Hash the rest of the key. */
>> +    hash = hash_words((uint32_t *) (&key->dst_addr + 1),
>> +                      (uint32_t *) (key + 1) -
>> +                          (uint32_t *) (&key->dst_addr + 1),
>> +                      hash);
>> +
>> +    return hash_finish(hash, 0);
>> +}
>> +
>> +static bool
>> +ipf_is_first_v4_frag(const struct dp_packet *pkt)
>> +{
>> +    const struct ip_header *l3 = dp_packet_l3(pkt);
>> +    if (!(l3->ip_frag_off & htons(IP_FRAG_OFF_MASK)) &&
>> +        l3->ip_frag_off & htons(IP_MORE_FRAGMENTS)) {
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +
>> +static bool
>> +ipf_is_last_v4_frag(const struct dp_packet *pkt)
>> +{
>> +    const struct ip_header *l3 = dp_packet_l3(pkt);
>> +    if (l3->ip_frag_off & htons(IP_FRAG_OFF_MASK) &&
>> +        !(l3->ip_frag_off & htons(IP_MORE_FRAGMENTS))) {
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +
>> +static bool
>> +ipf_is_v6_frag(ovs_be16 ip6f_offlg)
>> +{
>> +    if (ip6f_offlg & (IP6F_OFF_MASK | IP6F_MORE_FRAG)) {
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +
>> +static bool
>> +ipf_is_first_v6_frag(ovs_be16 ip6f_offlg)
>> +{
>> +    if (!(ip6f_offlg & IP6F_OFF_MASK) &&
>> +        ip6f_offlg & IP6F_MORE_FRAG) {
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +
>> +static bool
>> +ipf_is_last_v6_frag(ovs_be16 ip6f_offlg)
>> +{
>> +    if ((ip6f_offlg & IP6F_OFF_MASK) &&
>> +        !(ip6f_offlg & IP6F_MORE_FRAG)) {
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +
>> +static bool
>> +ipf_list_complete(const struct ipf_list *ipf_list)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    for (int i = 0; i < ipf_list->last_inuse_idx; i++) {
>> +        if (ipf_list->frag_list[i].end_data_byte + 1
>> +            != ipf_list->frag_list[i + 1].start_data_byte) {
>> +            return false;
>> +        }
>> +    }
>> +    return true;
>> +}
>> +
>> +/* Runs O(n) for a sorted or almost sorted list. */
>> +static void
>> +ipf_sort(struct ipf_frag *frag_list, size_t last_idx)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    int running_last_idx = 1;
>> +    struct ipf_frag ipf_frag;
>> +    while (running_last_idx <= last_idx) {
>> +        ipf_frag = frag_list[running_last_idx];
>> +        int frag_list_idx = running_last_idx - 1;
>> +        while (frag_list_idx >= 0 &&
>> +               frag_list[frag_list_idx].start_data_byte >
>> +                   ipf_frag.start_data_byte) {
>> +            frag_list[frag_list_idx + 1] = frag_list[frag_list_idx];
>> +            frag_list_idx -= 1;
>> +        }
>> +        frag_list[frag_list_idx + 1] = ipf_frag;
>> +        running_last_idx++;
>> +    }
>> +}
>> +
>> +/* Called on a sorted complete list of fragments. */
>> +static struct dp_packet *
>> +ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    struct ipf_frag *frag_list = ipf_list->frag_list;
>> +    struct dp_packet *pkt = dp_packet_clone(frag_list[0].pkt);
>> +    struct ip_header *l3 = dp_packet_l3(pkt);
>> +    int len = ntohs(l3->ip_tot_len);
>> +    size_t add_len;
>> +    size_t ip_hdr_len = IP_IHL(l3->ip_ihl_ver) * 4;
>> +
>> +    for (int i = 1; i <= ipf_list->last_inuse_idx; i++) {
>> +        add_len = frag_list[i].end_data_byte -
>> +                         frag_list[i].start_data_byte + 1;
>> +        len += add_len;
>> +        if (len > IPV4_PACKET_MAX_SIZE) {
>> +            ipf_print_reass_packet(
>> +                "Unsupported big reassembled v4 packet; v4 hdr:", l3);
>> +            dp_packet_delete(pkt);
>> +            return NULL;
>> +        }
>> +        l3 = dp_packet_l3(frag_list[i].pkt);
>> +        dp_packet_put(pkt, (char *)l3 + ip_hdr_len, add_len);
>> +    }
>> +    l3 = dp_packet_l3(pkt);
>> +    ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
>> +    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
>> +                                new_ip_frag_off);
>> +    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
>> +    l3->ip_tot_len = htons(len);
>> +    l3->ip_frag_off = new_ip_frag_off;
>> +
>> +    return pkt;
>> +}
>> +
>> +/* Called on a sorted complete list of fragments. */
>> +static struct dp_packet *
>> +ipf_reassemble_v6_frags(struct ipf_list *ipf_list)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    struct ipf_frag *frag_list = ipf_list->frag_list;
>> +    struct dp_packet *pkt = dp_packet_clone(frag_list[0].pkt);
>> +    struct  ovs_16aligned_ip6_hdr *l3 = dp_packet_l3(pkt);
>> +    int pl = ntohs(l3->ip6_plen) - sizeof(struct ovs_16aligned_ip6_frag);
>> +    const char *tail = dp_packet_tail(pkt);
>> +    uint8_t pad = dp_packet_l2_pad_size(pkt);
>> +    const char *l4 = dp_packet_l4(pkt);
>> +    size_t l3_size = tail - (char *)l3 - pad;
>> +    size_t l4_size = tail - (char *)l4 - pad;
>> +    size_t l3_hlen = l3_size - l4_size;
>> +    size_t add_len;
>> +
>> +    for (int i = 1; i <= ipf_list->last_inuse_idx; i++) {
>> +        add_len = frag_list[i].end_data_byte -
>> +                          frag_list[i].start_data_byte + 1;
>> +        pl += add_len;
>> +        if (pl > IPV6_PACKET_MAX_DATA) {
>> +            ipf_print_reass_packet(
>> +                "Unsupported big reassembled v6 packet; v6 hdr:", l3);
>> +            dp_packet_delete(pkt);
>> +            return NULL;
>> +        }
>> +        l3 = dp_packet_l3(frag_list[i].pkt);
>> +        dp_packet_put(pkt, (char *)l3 + l3_hlen, add_len);
>> +    }
>> +    l3 = dp_packet_l3(pkt);
>> +    l4 = dp_packet_l4(pkt);
>> +    tail = dp_packet_tail(pkt);
>> +    pad = dp_packet_l2_pad_size(pkt);
>> +    l3_size = tail - (char *)l3 - pad;
>> +
>> +    uint8_t nw_proto = l3->ip6_nxt;
>> +    uint8_t nw_frag = 0;
>> +    const void *data = l3 + 1;
>> +    size_t datasize = l3_size - sizeof *l3;
>> +
>> +    const struct ovs_16aligned_ip6_frag *frag_hdr = NULL;
>> +    if (!parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag, &frag_hdr)
>> +        || !nw_frag || !frag_hdr) {
>> +
>> +        ipf_print_reass_packet("Unparsed reassembled v6 packet; v6 hdr:", l3);
>> +        dp_packet_delete(pkt);
>> +        return NULL;
>> +    }
>> +
>> +    struct ovs_16aligned_ip6_frag *fh =
>> +        CONST_CAST(struct ovs_16aligned_ip6_frag *, frag_hdr);
>> +    fh->ip6f_offlg = 0;
>> +    l3->ip6_plen = htons(pl);
>> +    l3->ip6_ctlun.ip6_un1.ip6_un1_nxt = nw_proto;
>> +    return pkt;
>> +}
>> +
>> +/* Called when a valid fragment is added. */
>> +static void
>> +ipf_list_state_transition(struct ipf_list *ipf_list, bool ff, bool lf,
>> +                          bool v6)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    enum ipf_list_state curr_state = ipf_list->state;
>> +    enum ipf_list_state next_state;
>> +    switch (curr_state) {
>> +    case IPF_LIST_STATE_UNUSED:
>> +    case IPF_LIST_STATE_OTHER_SEEN:
>> +        if (ff) {
>> +            next_state = IPF_LIST_STATE_FIRST_SEEN;
>> +        } else if (lf) {
>> +            next_state = IPF_LIST_STATE_LAST_SEEN;
>> +        } else {
>> +            next_state = IPF_LIST_STATE_OTHER_SEEN;
>> +        }
>> +        break;
>> +    case IPF_LIST_STATE_FIRST_SEEN:
>> +        if (ff) {
>> +            next_state = IPF_LIST_STATE_FIRST_SEEN;
>> +        } else if (lf) {
>> +            next_state = IPF_LIST_STATE_FIRST_LAST_SEEN;
>> +        } else {
>> +            next_state = IPF_LIST_STATE_FIRST_SEEN;
>> +        }
>> +        break;
>> +    case IPF_LIST_STATE_LAST_SEEN:
>> +        if (ff) {
>> +            next_state = IPF_LIST_STATE_FIRST_LAST_SEEN;
>> +        } else if (lf) {
>> +            next_state = IPF_LIST_STATE_LAST_SEEN;
>> +        } else {
>> +            next_state = IPF_LIST_STATE_LAST_SEEN;
>> +        }
>> +        break;
>> +    case IPF_LIST_STATE_FIRST_LAST_SEEN:
>> +        next_state = IPF_LIST_STATE_FIRST_LAST_SEEN;
>> +        break;
>> +    case IPF_LIST_STATE_COMPLETED:
>> +    case IPF_LIST_STATE_REASS_FAIL:
>> +    case IPF_LIST_STATE_NUM:
>> +    default:
>> +        OVS_NOT_REACHED();
>> +    }
>> +
>> +    if (next_state == IPF_LIST_STATE_FIRST_LAST_SEEN) {
>> +        ipf_sort(ipf_list->frag_list, ipf_list->last_inuse_idx);
>> +        if (ipf_list_complete(ipf_list)) {
>> +            struct dp_packet *reass_pkt = v6
>> +                ? ipf_reassemble_v6_frags(ipf_list)
>> +                : ipf_reassemble_v4_frags(ipf_list);
>> +            if (reass_pkt) {
>> +                struct reassembled_pkt *rp = xzalloc(sizeof *rp);
>> +                rp->pkt = reass_pkt;
>> +                rp->list = ipf_list;
>> +                ipf_reassembled_list_add(rp);
>> +                ipf_expiry_list_remove(ipf_list);
>> +                next_state = IPF_LIST_STATE_COMPLETED;
>> +            } else {
>> +                next_state = IPF_LIST_STATE_REASS_FAIL;
>> +            }
>> +        }
>> +    }
>> +    ipf_list->state = next_state;
>> +}
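
A reading aid for the switch above: as far as the first/last bookkeeping goes, the transitions behave like OR-ing two "seen" flags, on the assumption that a valid fragment is never both first and last. The sketch below illustrates that equivalence only; the names are hypothetical and it does not model the COMPLETED or REASS_FAIL states.

```c
#include <stdbool.h>

enum {
    SEEN_FIRST = 1 << 0,   /* A fragment with offset 0 has arrived. */
    SEEN_LAST  = 1 << 1    /* A fragment with MF clear has arrived. */
};

/* Accumulates which boundary fragments have been seen so far.
 * 'ff' and 'lf' mirror the first-fragment/last-fragment arguments. */
static unsigned int
seen_update(unsigned int seen, bool ff, bool lf)
{
    if (ff) {
        seen |= SEEN_FIRST;
    }
    if (lf) {
        seen |= SEEN_LAST;
    }
    return seen;
}

/* True once both ends are present, i.e. the point at which the list
 * is worth sorting and checking for completeness. */
static bool
seen_both(unsigned int seen)
{
    return (seen & (SEEN_FIRST | SEEN_LAST)) == (SEEN_FIRST | SEEN_LAST);
}
```

seen_both() corresponds to reaching IPF_LIST_STATE_FIRST_LAST_SEEN, the only state in which sorting and the completeness check are attempted.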
>> +
>> +static bool
>> +ipf_is_valid_v4_frag(struct dp_packet *pkt)
>> +{
>> +    if (OVS_UNLIKELY(dp_packet_ip_checksum_bad(pkt))) {
>> +        goto invalid_pkt;
>> +    }
>> +
>> +    const struct eth_header *l2 = dp_packet_eth(pkt);
>> +    const struct ip_header *l3 = dp_packet_l3(pkt);
>> +
>> +    if (OVS_UNLIKELY(!l2 || !l3)) {
>> +        goto invalid_pkt;
>> +    }
>> +
>> +    const char *tail = dp_packet_tail(pkt);
>> +    uint8_t pad = dp_packet_l2_pad_size(pkt);
>> +    size_t size = tail - (char *)l3 - pad;
>> +    if (OVS_UNLIKELY(size < IP_HEADER_LEN)) {
>> +        goto invalid_pkt;
>> +    }
>> +
>> +    if (!(IP_IS_FRAGMENT(l3->ip_frag_off))) {
>> +        return false;
>> +    }
>> +
>> +    uint16_t ip_tot_len = ntohs(l3->ip_tot_len);
>> +    if (OVS_UNLIKELY(ip_tot_len != size)) {
>> +        goto invalid_pkt;
>> +    }
>> +
>> +    size_t ip_hdr_len = IP_IHL(l3->ip_ihl_ver) * 4;
>> +    if (OVS_UNLIKELY(ip_hdr_len < IP_HEADER_LEN)) {
>> +        goto invalid_pkt;
>> +    }
>> +    if (OVS_UNLIKELY(size < ip_hdr_len)) {
>> +        goto invalid_pkt;
>> +    }
>> +
>> +    if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
>> +                     && csum(l3, ip_hdr_len) != 0)) {
>> +        goto invalid_pkt;
>> +    }
>> +
>> +    uint32_t min_v4_frag_size_;
>> +    atomic_read_relaxed(&min_v4_frag_size, &min_v4_frag_size_);
>> +    bool lf = ipf_is_last_v4_frag(pkt);
>> +    if (OVS_UNLIKELY(!lf && dp_packet_size(pkt) < min_v4_frag_size_)) {
>> +        ipf_count(false, IPF_COUNTER_NFRAGS_TOO_SMALL);
>> +        goto invalid_pkt;
>> +    }
>> +    return true;
>> +
>> +invalid_pkt:
>> +    pkt->md.ct_state = CS_INVALID;
>> +    return false;
>> +
>> +}
>> +
>> +static bool
>> +ipf_v4_key_extract(struct dp_packet *pkt, ovs_be16 dl_type, uint16_t zone,
>> +                   struct ipf_list_key *key, uint16_t *start_data_byte,
>> +                   uint16_t *end_data_byte, bool *ff, bool *lf)
>> +{
>> +    const struct ip_header *l3 = dp_packet_l3(pkt);
>> +    uint16_t ip_tot_len = ntohs(l3->ip_tot_len);
>> +    size_t ip_hdr_len = IP_IHL(l3->ip_ihl_ver) * 4;
>> +
>> +    *start_data_byte = ntohs(l3->ip_frag_off & htons(IP_FRAG_OFF_MASK)) * 8;
>> +    *end_data_byte = *start_data_byte + ip_tot_len - ip_hdr_len - 1;
>> +    *ff = ipf_is_first_v4_frag(pkt);
>> +    *lf = ipf_is_last_v4_frag(pkt);
>> +    memset(key, 0, sizeof *key);
>> +    key->ip_id = be16_to_be32(l3->ip_id);
>> +    key->dl_type = dl_type;
>> +    key->src_addr.ipv4 = l3->ip_src;
>> +    key->dst_addr.ipv4 = l3->ip_dst;
>> +    key->nw_proto = l3->ip_proto;
>> +    key->zone = zone;
>> +    key->recirc_id = pkt->md.recirc_id;
>> +    return true;
>> +}
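
To make the byte-range arithmetic in the v4 key extraction concrete, here is a small worked sketch. It assumes host-order inputs and uses hypothetical names (v4_frag_range(), FRAG_OFF_MASK); the 13-bit offset field counts 8-byte units, and the end byte is inclusive.

```c
#include <stdint.h>

#define FRAG_OFF_MASK 0x1fff   /* Low 13 bits of the flags/offset field. */

struct frag_range {
    uint16_t start;   /* First payload byte covered, inclusive. */
    uint16_t end;     /* Last payload byte covered, inclusive. */
};

/* Computes the payload byte range carried by one IPv4 fragment.
 * 'frag_off_field' is the 16-bit flags/offset field, 'tot_len' the IP
 * total length and 'hdr_len' the IP header length, all in host order. */
static struct frag_range
v4_frag_range(uint16_t frag_off_field, uint16_t tot_len, uint16_t hdr_len)
{
    struct frag_range r;

    r.start = (uint16_t) ((frag_off_field & FRAG_OFF_MASK) * 8);
    r.end = (uint16_t) (r.start + tot_len - hdr_len - 1);
    return r;
}
```

For example, a 1500-byte fragment with a 20-byte header and offset field 185 covers bytes 1480 through 2959 of the original payload.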
>> +
>> +static bool
>> +ipf_is_valid_v6_frag(struct dp_packet *pkt OVS_UNUSED)
>> +{
>> +    const struct eth_header *l2 = dp_packet_eth(pkt);
>> +    const struct  ovs_16aligned_ip6_hdr *l3 = dp_packet_l3(pkt);
>> +    const char *l4 = dp_packet_l4(pkt);
>> +
>> +    if (OVS_UNLIKELY(!l2 || !l3 || !l4)) {
>> +        goto invalid_pkt;
>> +    }
>> +
>> +    const char *tail = dp_packet_tail(pkt);
>> +    uint8_t pad = dp_packet_l2_pad_size(pkt);
>> +    size_t l3_size = tail - (char *)l3 - pad;
>> +    size_t l3_hdr_size = sizeof *l3;
>> +
>> +    if (OVS_UNLIKELY(l3_size < l3_hdr_size)) {
>> +        goto invalid_pkt;
>> +    }
>> +
>> +    uint8_t nw_frag = 0;
>> +    uint8_t nw_proto = l3->ip6_nxt;
>> +    const void *data = l3 + 1;
>> +    size_t datasize = l3_size - l3_hdr_size;
>> +    const struct ovs_16aligned_ip6_frag *frag_hdr = NULL;
>> +    if (!parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag,
>> +                             &frag_hdr) || !nw_frag || !frag_hdr) {
>> +        return false;
>> +    }
>> +
>> +    int pl = ntohs(l3->ip6_plen);
>> +    if (OVS_UNLIKELY(pl + l3_hdr_size != l3_size)) {
>> +        goto invalid_pkt;
>> +    }
>> +
>> +    ovs_be16 ip6f_offlg = frag_hdr->ip6f_offlg;
>> +    if (OVS_UNLIKELY(!ipf_is_v6_frag(ip6f_offlg))) {
>> +        return false;
>> +    }
>> +
>> +    uint32_t min_v6_frag_size_;
>> +    atomic_read_relaxed(&min_v6_frag_size, &min_v6_frag_size_);
>> +    bool lf = ipf_is_last_v6_frag(ip6f_offlg);
>> +
>> +    if (OVS_UNLIKELY(!lf && dp_packet_size(pkt) < min_v6_frag_size_)) {
>> +        ipf_count(true, IPF_COUNTER_NFRAGS_TOO_SMALL);
>> +        goto invalid_pkt;
>> +    }
>> +
>> +    return true;
>> +
>> +invalid_pkt:
>> +    pkt->md.ct_state = CS_INVALID;
>> +    return false;
>> +
>> +}
>> +
>> +static void
>> +ipf_v6_key_extract(struct dp_packet *pkt, ovs_be16 dl_type, uint16_t zone,
>> +                   struct ipf_list_key *key, uint16_t *start_data_byte,
>> +                   uint16_t *end_data_byte, bool *ff, bool *lf)
>> +{
>> +    const struct  ovs_16aligned_ip6_hdr *l3 = dp_packet_l3(pkt);
>> +    const char *l4 = dp_packet_l4(pkt);
>> +    const char *tail = dp_packet_tail(pkt);
>> +    uint8_t pad = dp_packet_l2_pad_size(pkt);
>> +    size_t l3_size = tail - (char *)l3 - pad;
>> +    size_t l4_size = tail - (char *)l4 - pad;
>> +    size_t l3_hdr_size = sizeof *l3;
>> +    uint8_t nw_frag = 0;
>> +    uint8_t nw_proto = l3->ip6_nxt;
>> +    const void *data = l3 + 1;
>> +    size_t datasize = l3_size - l3_hdr_size;
>> +    const struct ovs_16aligned_ip6_frag *frag_hdr = NULL;
>> +
>> +    parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag, &frag_hdr);
>> +    ovs_assert(nw_frag && frag_hdr);
>> +    ovs_be16 ip6f_offlg = frag_hdr->ip6f_offlg;
>> +    *start_data_byte = ntohs(ip6f_offlg & IP6F_OFF_MASK) +
>> +        sizeof (struct ovs_16aligned_ip6_frag);
>> +    *end_data_byte = *start_data_byte + l4_size - 1;
>> +    *ff = ipf_is_first_v6_frag(ip6f_offlg);
>> +    *lf = ipf_is_last_v6_frag(ip6f_offlg);
>> +    memset(key, 0, sizeof *key);
>> +    key->ip_id = get_16aligned_be32(&frag_hdr->ip6f_ident);
>> +    key->dl_type = dl_type;
>> +    key->src_addr.ipv6 = l3->ip6_src;
>> +    /* We are not supporting parsing of the routing header to use as the
>> +     * dst address part of the key. */
>> +    key->dst_addr.ipv6 = l3->ip6_dst;
>> +    key->nw_proto = 0;   /* Not used for key for V6. */
>> +    key->zone = zone;
>> +    key->recirc_id = pkt->md.recirc_id;
>> +}
>> +
>> +static int
>> +ipf_list_key_cmp(const struct ipf_list_key *key1,
>> +                 const struct ipf_list_key *key2)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    if (!memcmp(&key1->src_addr, &key2->src_addr, sizeof key1->src_addr) &&
>> +        !memcmp(&key1->dst_addr, &key2->dst_addr, sizeof key1->dst_addr) &&
>> +        (key1->dl_type == key2->dl_type) &&
>> +        (key1->ip_id == key2->ip_id) &&
>> +        (key1->zone == key2->zone) &&
>> +        (key1->nw_proto == key2->nw_proto) &&
>> +        (key1->recirc_id == key2->recirc_id)) {
>> +        return 0;
>> +    }
>> +    return 1;
>> +}
>> +
>> +static struct ipf_list *
>> +ipf_list_key_lookup(const struct ipf_list_key *key,
>> +                    uint32_t hash)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    struct ipf_list *ipf_list;
>> +    HMAP_FOR_EACH_WITH_HASH (ipf_list, node, hash, &frag_lists) {
>> +        if (!ipf_list_key_cmp(&ipf_list->key, key)) {
>> +            return ipf_list;
>> +        }
>> +    }
>> +    return NULL;
>> +}
>> +
>> +static bool
>> +ipf_is_frag_duped(const struct ipf_frag *frag_list, int last_inuse_idx,
>> +                  size_t start_data_byte, size_t end_data_byte)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    for (int i = 0; i <= last_inuse_idx; i++) {
>> +        if (((start_data_byte >= frag_list[i].start_data_byte) &&
>> +            (start_data_byte <= frag_list[i].end_data_byte)) ||
>> +            ((end_data_byte >= frag_list[i].start_data_byte) &&
>> +             (end_data_byte <= frag_list[i].end_data_byte))) {
>> +            return true;
>> +        }
>> +    }
>> +    return false;
>> +}
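
One case that may be worth a look in the check above: a new fragment that strictly encloses a stored one has neither of its endpoints inside the stored range, so, as far as I can tell, the endpoint test would not flag it. The sketch below contrasts an endpoint test mirroring the quoted logic with a symmetric interval test; both helpers and the struct are hypothetical, with inclusive byte ranges assumed.

```c
#include <stdbool.h>
#include <stddef.h>

struct byte_range {
    size_t start;   /* Inclusive. */
    size_t end;     /* Inclusive. */
};

/* Mirrors the quoted check: is either endpoint of 'incoming' inside
 * 'stored'? */
static bool
endpoint_inside(struct byte_range incoming, struct byte_range stored)
{
    return (incoming.start >= stored.start && incoming.start <= stored.end)
           || (incoming.end >= stored.start && incoming.end <= stored.end);
}

/* Symmetric test: two inclusive ranges overlap exactly when each one
 * starts no later than the other ends. */
static bool
ranges_overlap(struct byte_range a, struct byte_range b)
{
    return a.start <= b.end && b.start <= a.end;
}
```

With a stored range of 100..199 and an incoming range of 0..299, endpoint_inside() returns false while ranges_overlap() returns true; for partial overlaps and disjoint ranges the two agree.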
>> +
>> +static bool
>> +ipf_process_frag(struct ipf_list *ipf_list, struct dp_packet *pkt,
>> +                 uint16_t start_data_byte, uint16_t end_data_byte,
>> +                 bool ff, bool lf, bool v6)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    bool duped_frag = ipf_is_frag_duped(ipf_list->frag_list,
>> +        ipf_list->last_inuse_idx, start_data_byte, end_data_byte);
>> +    int last_inuse_idx = ipf_list->last_inuse_idx;
>> +
>> +    if (!duped_frag) {
>> +        if (last_inuse_idx < ipf_list->size - 1) {
>> +            /* In the case of dpdk, it would be unfortunate if we had
>> +             * to create a clone fragment outside the dpdk mp due to the
>> +             * mempool size being too limited. We will otherwise need to
>> +             * recommend not setting the mempool number of buffers too low
>> +             * and also clamp the number of fragments. */
>> +            ipf_list->frag_list[last_inuse_idx + 1].pkt = pkt;
>> +            ipf_list->frag_list[last_inuse_idx + 1].start_data_byte =
>> +                start_data_byte;
>> +            ipf_list->frag_list[last_inuse_idx + 1].end_data_byte =
>> +                end_data_byte;
>> +            ipf_list->last_inuse_idx++;
>> +            atomic_count_inc(&nfrag);
>> +            ipf_count(v6, IPF_COUNTER_NFRAGS_ACCEPTED);
>> +            ipf_list_state_transition(ipf_list, ff, lf, v6);
>> +        } else {
>> +            OVS_NOT_REACHED();
>> +        }
>> +    } else {
>> +        ipf_count(v6, IPF_COUNTER_NFRAGS_OVERLAP);
>> +        pkt->md.ct_state = CS_INVALID;
>> +        return false;
>> +    }
>> +    return true;
>> +}
>> +
>> +static bool
>> +ipf_handle_frag(struct dp_packet *pkt, ovs_be16 dl_type, uint16_t zone,
>> +                long long now, uint32_t hash_basis)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    struct ipf_list_key key;
>> +    /* Initialize 4 variables for some versions of GCC. */
>> +    uint16_t start_data_byte = 0;
>> +    uint16_t end_data_byte = 0;
>> +    bool ff = false;
>> +    bool lf = false;
>> +    bool v6 = dl_type == htons(ETH_TYPE_IPV6);
>> +
>> +    if (v6 && ipf_get_v6_enabled()) {
>> +        ipf_v6_key_extract(pkt, dl_type, zone, &key, &start_data_byte,
>> +                           &end_data_byte, &ff, &lf);
>> +    } else if (!v6 && ipf_get_v4_enabled()) {
>> +        ipf_v4_key_extract(pkt, dl_type, zone, &key, &start_data_byte,
>> +                           &end_data_byte, &ff, &lf);
>> +    } else {
>> +        OVS_NOT_REACHED();
>> +    }
>> +
>> +    unsigned int nfrag_max_;
>> +    atomic_read_relaxed(&nfrag_max, &nfrag_max_);
>> +    if (atomic_count_get(&nfrag) >= nfrag_max_) {
>> +        return false;
>> +    }
>> +
>> +    uint32_t hash = ipf_list_key_hash(&key, hash_basis);
>> +    struct ipf_list *ipf_list = ipf_list_key_lookup(&key, hash);
>> +    enum {
>> +        IPF_FRAG_LIST_MIN_INCREMENT = 4,
>> +        IPF_IPV6_MAX_FRAG_LIST_SIZE = 65535,
>> +    };
>> +
>> +    int max_frag_list_size;
>> +    if (v6) {
>> +        /* Because the calculation with extension headers is variable,
>> +         * we don't calculate a hard maximum fragment list size upfront.  The
>> +         * fragment list size is practically limited by the code, however. */
>> +        max_frag_list_size = IPF_IPV6_MAX_FRAG_LIST_SIZE;
>> +    } else {
>> +        max_frag_list_size = max_v4_frag_list_size;
>> +    }
>> +
>> +    if (!ipf_list) {
>> +        ipf_list = xzalloc(sizeof *ipf_list);
>> +        ipf_list->key = key;
>> +        ipf_list->last_inuse_idx = IPF_INVALID_IDX;
>> +        ipf_list->last_sent_idx = IPF_INVALID_IDX;
>> +        ipf_list->size =
>> +            MIN(max_frag_list_size, IPF_FRAG_LIST_MIN_INCREMENT);
>> +        ipf_list->frag_list =
>> +            xzalloc(ipf_list->size * sizeof *ipf_list->frag_list);
>> +        hmap_insert(&frag_lists, &ipf_list->node, hash);
>> +        ipf_expiry_list_add(ipf_list, now);
>> +    } else if (ipf_list->state == IPF_LIST_STATE_REASS_FAIL) {
>> +        /* Bail out as early as possible. */
>> +        return false;
>> +    } else if (ipf_list->last_inuse_idx + 1 >= ipf_list->size) {
>> +        int increment = MIN(IPF_FRAG_LIST_MIN_INCREMENT,
>> +                            max_frag_list_size - ipf_list->size);
>> +        /* Enforce limit. */
>> +        if (increment > 0) {
>> +            ipf_list->frag_list =
>> +                xrealloc(ipf_list->frag_list, (ipf_list->size + increment) *
>> +                  sizeof *ipf_list->frag_list);
>> +            ipf_list->size += increment;
>> +        } else {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    return ipf_process_frag(ipf_list, pkt, start_data_byte, end_data_byte, ff,
>> +                            lf, v6);
>> +}
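
The list sizing in ipf_handle_frag() grows the fragment array in fixed steps and refuses to grow past a per-protocol maximum. A minimal sketch of that policy, with a hypothetical helper name and the step of 4 taken from IPF_FRAG_LIST_MIN_INCREMENT in the quoted code:

```c
enum { GROW_STEP = 4 };   /* Same value as IPF_FRAG_LIST_MIN_INCREMENT. */

/* Returns the capacity to reallocate to when a list of capacity 'cur'
 * is full, or 0 when 'max' has been reached and the fragment should be
 * refused instead. */
static int
next_capacity(int cur, int max)
{
    int room = max - cur;
    int increment = room < GROW_STEP ? room : GROW_STEP;

    return increment > 0 ? cur + increment : 0;
}
```

With a maximum of 10, a full list of 4 entries grows to 8, then to 10, and a further request is refused, matching the "Enforce limit" branch that returns false.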
>> +
>> +static void
>> +ipf_extract_frags_from_batch(struct dp_packet_batch *pb, ovs_be16 dl_type,
>> +                             uint16_t zone, long long now, uint32_t hash_basis)
>> +{
>> +    const size_t pb_cnt = dp_packet_batch_size(pb);
>> +    int pb_idx; /* Index in a packet batch. */
>> +    struct dp_packet *pkt;
>> +
>> +    DP_PACKET_BATCH_REFILL_FOR_EACH (pb_idx, pb_cnt, pkt, pb) {
>> +
>> +        if (OVS_UNLIKELY((dl_type == htons(ETH_TYPE_IP) &&
>> +                          ipf_is_valid_v4_frag(pkt)) ||
>> +                         (dl_type == htons(ETH_TYPE_IPV6) &&
>> +                          ipf_is_valid_v6_frag(pkt)))) {
>> +
>> +            ipf_lock_lock(&ipf_lock);
>> +            if (!ipf_handle_frag(pkt, dl_type, zone, now, hash_basis)) {
>> +                dp_packet_batch_refill(pb, pkt, pb_idx);
>> +            }
>> +            ipf_lock_unlock(&ipf_lock);
>> +        } else {
>> +            dp_packet_batch_refill(pb, pkt, pb_idx);
>> +        }
>> +
>> +    }
>> +}
>> +
>> +/* In case of DPDK, a memory source check is done, as DPDK memory pool
>> + * management has trouble dealing with multiple source types.  The
>> + * check_source parameter is used to indicate when this check is needed. */
>> +static bool
>> +ipf_dp_packet_batch_add(struct dp_packet_batch *pb, struct dp_packet *pkt,
>> +                        bool check_source OVS_UNUSED)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +#ifdef DPDK_NETDEV
>> +    if ((pb->count >= NETDEV_MAX_BURST) ||
>> +        /* DPDK cannot handle multiple sources in a batch. */
>> +        (check_source && pb->count && pb->packets[0]->source != pkt->source)) {
>> +#else
>> +    if (pb->count >= NETDEV_MAX_BURST) {
>> +#endif
>> +        return false;
>> +    }
>> +
>> +    dp_packet_batch_add(pb, pkt);
>> +    return true;
>> +}
>> +
>> +/* This would be used in rare cases where a list cannot be sent. One rare
>> + * reason known right now is a mempool source check, which exists due to DPDK
>> + * support, where packets are no longer being received on any port with a
>> + * source matching the fragment.  Another reason is a race where all
>> + * conntrack rules are unconfigured when some fragments are yet to be
>> + * flushed.
>> + *
>> + * Returns true if the list was purged. */
>> +static bool
>> +ipf_purge_list_check(struct ipf_list *ipf_list, long long now)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    enum {
>> +        IPF_FRAG_LIST_PURGE_TIME_ADJ = 10000
>> +    };
>> +
>> +    if (now < ipf_list->expiration + IPF_FRAG_LIST_PURGE_TIME_ADJ) {
>> +        return false;
>> +    }
>> +
>> +    struct dp_packet *pkt;
>> +    while (ipf_list->last_sent_idx < ipf_list->last_inuse_idx) {
>> +        pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
>> +        dp_packet_delete(pkt);
>> +        atomic_count_dec(&nfrag);
>> +        ipf_list->last_sent_idx++;
>> +    }
>> +
>> +    COVERAGE_INC(ipf_stuck_frag_list_purged);
>> +    ipf_count(ipf_list->key.dl_type == htons(ETH_TYPE_IPV6),
>> +              IPF_COUNTER_NFRAGS_PURGED);
>> +    return true;
>> +}
>> +
>> +static bool
>> +ipf_send_frags_in_list(struct ipf_list *ipf_list, struct dp_packet_batch *pb,
>> +                       enum ipf_list_type list_type, bool v6, long long now)
>> +    OVS_REQUIRES(ipf_lock)
>> +{
>> +    if (ipf_purge_list_check(ipf_list, now)) {
>> +        return true;
>> +    }
>> +
>> +    struct dp_packet *pkt;
>> +    while (ipf_list->last_sent_idx < ipf_list->last_inuse_idx) {
>> +        pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
>> +        if (ipf_dp_packet_batch_add(pb, pkt, true)) {
>> +
>> +            ipf_list->last_sent_idx++;
>> +            atomic_count_dec(&nfrag);
>> +
>> +            if (list_type == IPF_FRAG_COMPLETED_LIST) {
>> +                ipf_count(v6, IPF_COUNTER_NFRAGS_COMPL_SENT);
>> +            } else {
>> +                ipf_count(v6, IPF_COUNTER_NFRAGS_EXPD_SENT);
>> +                pkt->md.ct_state = CS_INVALID;
>> +            }
>> +
>> +            if (ipf_list->last_sent_idx == ipf_list->last_inuse_idx) {
>> +                return true;
>> +            }
>> +        } else {
>> +            return false;
>> +        }
>> +    }
>> +    OVS_NOT_REACHED();
>> +}
>> +
>> +static void
>> +ipf_send_completed_frags(struct dp_packet_batch *pb, long long now, bool v6)
>> +{
>> +    if (ovs_list_is_empty(&frag_complete_list)) {
>> +        return;
>> +    }
>> +
>> +    ipf_lock_lock(&ipf_lock);
>> +    struct ipf_list *ipf_list, *next;
>> +
>> +    LIST_FOR_EACH_SAFE (ipf_list, next, list_node, &frag_complete_list) {
>> +        if (ipf_send_frags_in_list(ipf_list, pb, IPF_FRAG_COMPLETED_LIST,
>> +                                   v6, now)) {
>> +            ipf_completed_list_clean(ipf_list);
>> +        } else {
>> +            break;
>> +        }
>> +    }
>> +    ipf_lock_unlock(&ipf_lock);
>> +}
>> +
>> +static void
>> +ipf_send_expired_frags(struct dp_packet_batch *pb, long long now, bool v6)
>> +{
>> +    enum {
>> +        /* Very conservative, due to DOS probability. */
>> +        IPF_FRAG_LIST_MAX_EXPIRED = 1,
>> +    };
>> +
>> +
>> +    if (ovs_list_is_empty(&frag_exp_list)) {
>> +        return;
>> +    }
>> +
>> +    ipf_lock_lock(&ipf_lock);
>> +    struct ipf_list *ipf_list, *next;
>> +    size_t lists_removed = 0;
>> +
>> +    LIST_FOR_EACH_SAFE (ipf_list, next, list_node, &frag_exp_list) {
>> +        if (!(now > ipf_list->expiration) ||
>> +            lists_removed >= IPF_FRAG_LIST_MAX_EXPIRED) {
>> +            break;
>> +        }
>> +
>> +        if (ipf_send_frags_in_list(ipf_list, pb, IPF_FRAG_EXPIRY_LIST, v6,
>> +                                   now)) {
>> +            ipf_expiry_list_clean(ipf_list);
>> +            lists_removed++;
>> +        } else {
>> +            break;
>> +        }
>> +    }
>> +    ipf_lock_unlock(&ipf_lock);
>> +}
>> +
>> +static void
>> +ipf_execute_reass_pkts(struct dp_packet_batch *pb)
>> +{
>> +    if (ovs_list_is_empty(&reassembled_pkt_list)) {
>> +        return;
>> +    }
>> +
>> +    ipf_lock_lock(&ipf_lock);
>> +    struct reassembled_pkt *rp, *next;
>> +
>> +    LIST_FOR_EACH_SAFE (rp, next, rp_list_node, &reassembled_pkt_list) {
>> +        if (!rp->list->reass_execute_ctx &&
>> +            ipf_dp_packet_batch_add(pb, rp->pkt, false)) {
>> +            rp->list->reass_execute_ctx = rp->pkt;
>> +        }
>> +    }
>> +    ipf_lock_unlock(&ipf_lock);
>> +}
>> +
>> +static void
>> +ipf_post_execute_reass_pkts(struct dp_packet_batch *pb, bool v6)
>> +{
>> +    if (ovs_list_is_empty(&reassembled_pkt_list)) {
>> +        return;
>> +    }
>> +
>> +    ipf_lock_lock(&ipf_lock);
>> +    struct reassembled_pkt *rp, *next;
>> +
>> +    LIST_FOR_EACH_SAFE (rp, next, rp_list_node, &reassembled_pkt_list) {
>> +        const size_t pb_cnt = dp_packet_batch_size(pb);
>> +        int pb_idx;
>> +        struct dp_packet *pkt;
>> +        /* Inner batch loop is constant time since batch size is <=
>> +         * NETDEV_MAX_BURST. */
>> +        DP_PACKET_BATCH_REFILL_FOR_EACH (pb_idx, pb_cnt, pkt, pb) {
>> +            if (pkt == rp->list->reass_execute_ctx) {
>> +                for (int i = 0; i <= rp->list->last_inuse_idx; i++) {
>> +                    rp->list->frag_list[i].pkt->md.ct_label = pkt->md.ct_label;
>> +                    rp->list->frag_list[i].pkt->md.ct_mark = pkt->md.ct_mark;
>> +                    rp->list->frag_list[i].pkt->md.ct_state = pkt->md.ct_state;
>> +                    rp->list->frag_list[i].pkt->md.ct_zone = pkt->md.ct_zone;
>> +                    rp->list->frag_list[i].pkt->md.ct_orig_tuple_ipv6 =
>> +                        pkt->md.ct_orig_tuple_ipv6;
>> +                    if (pkt->md.ct_orig_tuple_ipv6) {
>> +                        rp->list->frag_list[i].pkt->md.ct_orig_tuple.ipv6 =
>> +                            pkt->md.ct_orig_tuple.ipv6;
>> +                    } else {
>> +                        rp->list->frag_list[i].pkt->md.ct_orig_tuple.ipv4 =
>> +                            pkt->md.ct_orig_tuple.ipv4;
>> +                    }
>> +                }
>> +
>> +                const char *tail_frag =
>> +                    dp_packet_tail(rp->list->frag_list[0].pkt);
>> +                uint8_t pad_frag =
>> +                    dp_packet_l2_pad_size(rp->list->frag_list[0].pkt);
>> +
>> +                void *l4_frag = dp_packet_l4(rp->list->frag_list[0].pkt);
>> +                void *l4_reass = dp_packet_l4(pkt);
>> +                memcpy(l4_frag, l4_reass,
>> +                       tail_frag - (char *) l4_frag - pad_frag);
>> +
>> +                if (v6) {
>> +                    struct  ovs_16aligned_ip6_hdr *l3_frag =
>> +                        dp_packet_l3(rp->list->frag_list[0].pkt);
>> +                    struct  ovs_16aligned_ip6_hdr *l3_reass =
>> +                        dp_packet_l3(pkt);
>> +                    l3_frag->ip6_src = l3_reass->ip6_src;
>> +                    l3_frag->ip6_dst = l3_reass->ip6_dst;
>> +                } else {
>> +                    struct ip_header *l3_frag =
>> +                        dp_packet_l3(rp->list->frag_list[0].pkt);
>> +                    struct ip_header *l3_reass = dp_packet_l3(pkt);
>> +                    ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
>> +                    ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
>> +                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
>> +                                                     frag_ip, reass_ip);
>> +                    l3_frag->ip_src = l3_reass->ip_src;
>> +
>> +                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
>> +                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
>> +                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
>> +                                                     frag_ip, reass_ip);
>> +                    l3_frag->ip_dst = l3_reass->ip_dst;
>> +                }
>> +
>> +                ipf_completed_list_add(rp->list);
>> +                ipf_reassembled_list_remove(rp);
>> +                dp_packet_delete(rp->pkt);
>> +                free(rp);
>> +            } else {
>> +                dp_packet_batch_refill(pb, pkt, pb_idx);
>> +            }
>> +        }
>> +    }
>> +    ipf_lock_unlock(&ipf_lock);
>> +}
>> +
>> +/* Extracts any fragments from the batch and reassembles them when a
>> + * complete packet is received.  Completed packets are attempted to
>> + * be added to the batch to be sent through conntrack. */
>> +void
>> +ipf_preprocess_conntrack(struct dp_packet_batch *pb, long long now,
>> +                         ovs_be16 dl_type, uint16_t zone, uint32_t hash_basis)
>> +{
>> +    if (ipf_get_enabled()) {
>> +        ipf_extract_frags_from_batch(pb, dl_type, zone, now, hash_basis);
>> +    }
>> +
>> +    if (ipf_get_enabled() || atomic_count_get(&nfrag)) {
>> +        ipf_execute_reass_pkts(pb);
>> +    }
>> +}
>> +
>> +/* Updates fragments based on the processing of the reassembled packet sent
>> + * through conntrack and adds these fragments to any batches seen.  Expired
>> + * fragments are marked as invalid and also added to the batches seen
>> + * with low priority.  Reassembled packets are freed. */
>> +void
>> +ipf_postprocess_conntrack(struct dp_packet_batch *pb, long long now,
>> +                          ovs_be16 dl_type)
>> +{
>> +    if (ipf_get_enabled() || atomic_count_get(&nfrag)) {
>> +        bool v6 = dl_type == htons(ETH_TYPE_IPV6);
>> +        ipf_post_execute_reass_pkts(pb, v6);
>> +        ipf_send_completed_frags(pb, now, v6);
>> +        ipf_send_expired_frags(pb, now, v6);
>> +    }
>> +}
>> +
>> +static void *
>> +ipf_clean_thread_main(void *f OVS_UNUSED)
>> +{
>> +    enum {
>> +        IPF_FRAG_LIST_CLEAN_TIMEOUT = 60000,
>> +    };
>> +
>> +    while (!latch_is_set(&ipf_clean_thread_exit)) {
>> +
>> +        long long now = time_msec();
>> +
>> +        if (!ovs_list_is_empty(&frag_exp_list) ||
>> +            !ovs_list_is_empty(&frag_complete_list)) {
>> +
>> +            ipf_lock_lock(&ipf_lock);
>> +
>> +            struct ipf_list *ipf_list, *next;
>> +            LIST_FOR_EACH_SAFE (ipf_list, next, list_node, &frag_exp_list) {
>> +                if (ipf_purge_list_check(ipf_list, now)) {
>> +                    ipf_expiry_list_clean(ipf_list);
>> +                }
>> +            }
>> +
>> +            LIST_FOR_EACH_SAFE (ipf_list, next, list_node,
>> +                                &frag_complete_list) {
>> +                if (ipf_purge_list_check(ipf_list, now)) {
>> +                    ipf_completed_list_clean(ipf_list);
>> +                }
>> +            }
>> +
>> +            ipf_lock_unlock(&ipf_lock);
>> +        }
>> +
>> +        poll_timer_wait_until(now + IPF_FRAG_LIST_CLEAN_TIMEOUT);
>> +        latch_wait(&ipf_clean_thread_exit);
>> +        poll_block();
>> +    }
>> +
>> +    return NULL;
>> +}
>> +
>> +void
>> +ipf_init(void)
>> +{
>> +    ipf_lock_init(&ipf_lock);
>> +    ipf_lock_lock(&ipf_lock);
>> +    hmap_init(&frag_lists);
>> +    ovs_list_init(&frag_exp_list);
>> +    ovs_list_init(&frag_complete_list);
>> +    ovs_list_init(&reassembled_pkt_list);
>> +    atomic_init(&min_v4_frag_size, IPF_V4_FRAG_SIZE_MIN_DEF);
>> +    atomic_init(&min_v6_frag_size, IPF_V6_FRAG_SIZE_MIN_DEF);
>> +    max_v4_frag_list_size = DIV_ROUND_UP(
>> +        IPV4_PACKET_MAX_SIZE - IPV4_PACKET_MAX_HDR_SIZE,
>> +        min_v4_frag_size - IPV4_PACKET_MAX_HDR_SIZE);
>> +    ipf_lock_unlock(&ipf_lock);
>> +    atomic_count_init(&nfrag, 0);
>> +    atomic_init(&n4frag_accepted, 0);
>> +    atomic_init(&n4frag_completed_sent, 0);
>> +    atomic_init(&n4frag_expired_sent, 0);
>> +    atomic_init(&n4frag_too_small, 0);
>> +    atomic_init(&n4frag_overlap, 0);
>> +    atomic_init(&n4frag_purged, 0);
>> +    atomic_init(&n6frag_accepted, 0);
>> +    atomic_init(&n6frag_completed_sent, 0);
>> +    atomic_init(&n6frag_expired_sent, 0);
>> +    atomic_init(&n6frag_too_small, 0);
>> +    atomic_init(&n6frag_overlap, 0);
>> +    atomic_init(&n6frag_purged, 0);
>> +    atomic_init(&nfrag_max, IPF_MAX_FRAGS_DEFAULT);
>> +    atomic_init(&ifp_v4_enabled, true);
>> +    atomic_init(&ifp_v6_enabled, true);
>> +    latch_init(&ipf_clean_thread_exit);
>> +    ipf_clean_thread = ovs_thread_create("ipf_clean",
>> +                                         ipf_clean_thread_main, NULL);
>> +}
>> +
>> +void
>> +ipf_destroy(void)
>> +{
>> +    ipf_lock_lock(&ipf_lock);
>> +
>> +    latch_set(&ipf_clean_thread_exit);
>> +    pthread_join(ipf_clean_thread, NULL);
>> +    latch_destroy(&ipf_clean_thread_exit);
>> +
>> +    struct ipf_list *ipf_list;
>> +    HMAP_FOR_EACH_POP (ipf_list, node, &frag_lists) {
>> +        struct dp_packet *pkt;
>> +        while (ipf_list->last_sent_idx < ipf_list->last_inuse_idx) {
>> +            pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
>> +            dp_packet_delete(pkt);
>> +            atomic_count_dec(&nfrag);
>> +            ipf_list->last_sent_idx++;
>> +        }
>> +        free(ipf_list->frag_list);
>> +        free(ipf_list);
>> +    }
>> +
>> +    if (atomic_count_get(&nfrag)) {
>> +        VLOG_WARN("ipf destroy with non-zero fragment count. ");
>> +    }
>> +
>> +    struct reassembled_pkt * rp;
>> +    LIST_FOR_EACH_POP (rp, rp_list_node, &reassembled_pkt_list) {
>> +        dp_packet_delete(rp->pkt);
>> +        free(rp);
>> +    }
>> +
>> +    hmap_destroy(&frag_lists);
>> +    ovs_list_poison(&frag_exp_list);
>> +    ovs_list_poison(&frag_complete_list);
>> +    ovs_list_poison(&reassembled_pkt_list);
>> +    ipf_lock_unlock(&ipf_lock);
>> +    ipf_lock_destroy(&ipf_lock);
>> +}
>> diff --git a/lib/ipf.h b/lib/ipf.h
>> new file mode 100644
>> index 0000000..040031f
>> --- /dev/null
>> +++ b/lib/ipf.h
>> @@ -0,0 +1,33 @@
>> +/*
>> + * Copyright (c) 2018 Nicira, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#ifndef IPF_H
>> +#define IPF_H 1
>> +
>> +#include "dp-packet.h"
>> +#include "openvswitch/types.h"
>> +
>> +void ipf_preprocess_conntrack(struct dp_packet_batch *pb, long long now,
>> +                              ovs_be16 dl_type, uint16_t zone,
>> +                              uint32_t hash_basis);
>> +
>> +void ipf_postprocess_conntrack(struct dp_packet_batch *pb, long long now,
>> +                               ovs_be16 dl_type);
>> +
>> +void ipf_init(void);
>> +void ipf_destroy(void);
>> +
>> +#endif /* ipf.h */
>> diff --git a/tests/system-kmod-macros.at b/tests/system-kmod-macros.at
>> index 3296d64..3fbead8 100644
>> --- a/tests/system-kmod-macros.at
>> +++ b/tests/system-kmod-macros.at
>> @@ -77,12 +77,6 @@ m4_define([CHECK_CONNTRACK],
>>  #
>>  m4_define([CHECK_CONNTRACK_ALG])
>>
>> -# CHECK_CONNTRACK_FRAG()
>> -#
>> -# Perform requirements checks for running conntrack fragmentations tests.
>> -# The kernel always supports fragmentation, so no check is needed.
>> -m4_define([CHECK_CONNTRACK_FRAG])
>> -
>>  # CHECK_CONNTRACK_LOCAL_STACK()
>>  #
>>  # Perform requirements checks for running conntrack tests with local stack.
>> @@ -91,6 +85,10 @@ m4_define([CHECK_CONNTRACK_FRAG])
>>  # needed.
>>  m4_define([CHECK_CONNTRACK_LOCAL_STACK])
>>
>> +# CHECK_CONNTRACK_SMALL_FRAG()
>> +#
>> +m4_define([CHECK_CONNTRACK_SMALL_FRAG])
>> +
>>  # CHECK_CONNTRACK_FRAG_OVERLAP()
>>  #
>>  # The kernel does not support overlapping fragments checking.
>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>> index 840fea9..c4f6e47 100644
>> --- a/tests/system-traffic.at
>> +++ b/tests/system-traffic.at
>> @@ -2347,7 +2347,6 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv4 fragmentation])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>>  ADD_NAMESPACES(at_ns0, at_ns1)
>> @@ -2381,7 +2380,6 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv4 fragmentation expiry])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>>  ADD_NAMESPACES(at_ns0, at_ns1)
>> @@ -2412,7 +2410,6 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv4 fragmentation + vlan])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>>  ADD_NAMESPACES(at_ns0, at_ns1)
>> @@ -2448,7 +2445,6 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv4 fragmentation + cvlan])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START([set Open_vSwitch . other_config:vlan-limit=0])
>>  OVS_CHECK_8021AD()
>>
>> @@ -2523,7 +2519,7 @@ AT_CLEANUP
>>  dnl Uses same first fragment as above 'incomplete reassembled packet' test.
>>  AT_SETUP([conntrack - IPv4 fragmentation with fragments specified])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>> +CHECK_CONNTRACK_SMALL_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>>  ADD_NAMESPACES(at_ns0, at_ns1)
>> @@ -2547,7 +2543,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv4 fragmentation out of order])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>> +CHECK_CONNTRACK_SMALL_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>>  ADD_NAMESPACES(at_ns0, at_ns1)
>> @@ -2571,7 +2567,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv4 fragmentation overlapping fragments by 1 octet])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>> +CHECK_CONNTRACK_SMALL_FRAG()
>>  CHECK_CONNTRACK_FRAG_OVERLAP()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>> @@ -2595,7 +2591,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv4 fragmentation overlapping fragments by 1 octet out of order])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>> +CHECK_CONNTRACK_SMALL_FRAG()
>>  CHECK_CONNTRACK_FRAG_OVERLAP()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>> @@ -2619,7 +2615,6 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv6 fragmentation])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>>  ADD_NAMESPACES(at_ns0, at_ns1)
>> @@ -2659,7 +2654,6 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv6 fragmentation expiry])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>>  ADD_NAMESPACES(at_ns0, at_ns1)
>> @@ -2700,7 +2694,6 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv6 fragmentation + vlan])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>>  ADD_NAMESPACES(at_ns0, at_ns1)
>> @@ -2743,7 +2736,6 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv6 fragmentation + cvlan])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START([set Open_vSwitch . other_config:vlan-limit=0])
>>  OVS_CHECK_8021AD()
>>
>> @@ -2818,7 +2810,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv6 fragmentation with fragments specified])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>> +CHECK_CONNTRACK_SMALL_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>>  ADD_NAMESPACES(at_ns0, at_ns1)
>> @@ -2842,7 +2834,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv6 fragmentation out of order])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>> +CHECK_CONNTRACK_SMALL_FRAG()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>>  ADD_NAMESPACES(at_ns0, at_ns1)
>> @@ -2866,7 +2858,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>> +CHECK_CONNTRACK_SMALL_FRAG()
>>  CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>> @@ -2892,7 +2884,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers + out of order])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>> +CHECK_CONNTRACK_SMALL_FRAG()
>>  CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>> @@ -2918,7 +2910,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers 2])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>> +CHECK_CONNTRACK_SMALL_FRAG()
>>  CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>> @@ -2944,7 +2936,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers 2 + out of order])
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>> +CHECK_CONNTRACK_SMALL_FRAG()
>>  CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
>>  OVS_TRAFFIC_VSWITCHD_START()
>>
>> @@ -2971,7 +2963,6 @@ AT_CLEANUP
>>  AT_SETUP([conntrack - Fragmentation over vxlan])
>>  OVS_CHECK_VXLAN()
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>>  CHECK_CONNTRACK_LOCAL_STACK()
>>
>>  OVS_TRAFFIC_VSWITCHD_START()
>> @@ -3024,7 +3015,6 @@ AT_CLEANUP
>>  AT_SETUP([conntrack - IPv6 Fragmentation over vxlan])
>>  OVS_CHECK_VXLAN()
>>  CHECK_CONNTRACK()
>> -CHECK_CONNTRACK_FRAG()
>>  CHECK_CONNTRACK_LOCAL_STACK()
>>
>>  OVS_TRAFFIC_VSWITCHD_START()
>> diff --git a/tests/system-userspace-macros.at b/tests/system-userspace-macros.at
>> index 27bde8b..219c046 100644
>> --- a/tests/system-userspace-macros.at
>> +++ b/tests/system-userspace-macros.at
>> @@ -73,15 +73,6 @@ m4_define([CHECK_CONNTRACK],
>>  #
>>  m4_define([CHECK_CONNTRACK_ALG])
>>
>> -# CHECK_CONNTRACK_FRAG()
>> -#
>> -# Perform requirements checks for running conntrack fragmentations tests.
>> -# The userspace doesn't support fragmentation yet, so skip the tests.
>> -m4_define([CHECK_CONNTRACK_FRAG],
>> -[
>> -    AT_SKIP_IF([:])
>> -])
>> -
>>  # CHECK_CONNTRACK_LOCAL_STACK()
>>  #
>>  # Perform requirements checks for running conntrack tests with local stack.
>> @@ -93,22 +84,26 @@ m4_define([CHECK_CONNTRACK_LOCAL_STACK],
>>      AT_SKIP_IF([:])
>>  ])
>>
>> -# CHECK_CONNTRACK_FRAG_OVERLAP()
>> +# CHECK_CONNTRACK_SMALL_FRAG()
>>  #
>> -# The userspace datapath does not support fragments yet.
>> -m4_define([CHECK_CONNTRACK_FRAG_OVERLAP],
>> +m4_define([CHECK_CONNTRACK_SMALL_FRAG],
>>  [
>>      AT_SKIP_IF([:])
>>  ])
>>
>> -# CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
>> +# CHECK_CONNTRACK_FRAG_OVERLAP()
>>  #
>>  # The userspace datapath does not support fragments yet.
>> -m4_define([CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN],
>> +m4_define([CHECK_CONNTRACK_FRAG_OVERLAP],
>>  [
>>      AT_SKIP_IF([:])
>>  ])
>>
>> +# CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN
>> +#
>> +# The userspace datapath supports fragments with multiple extension headers.
>> +m4_define([CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN])
>> +
>>  # CHECK_CONNTRACK_NAT()
>>  #
>>  # Perform requirements checks for running conntrack NAT tests. The userspace
>> --
>> 1.9.1
>>
>>
Ben Pfaff Dec. 11, 2018, 4:14 p.m. UTC | #3
On Mon, Nov 19, 2018 at 11:09:25AM -0800, Darrell Ball wrote:
> Fragmentation handling is added for supporting conntrack.
> Both v4 and v6 are supported.
> 
> After discussion with several people, I decided to not store
> configuration state in the database to be more consistent with
> the kernel in future, similarity with other conntrack configuration
> which will not be in the database as well and overall simplicity.
> Accordingly, fragmentation handling is enabled by default.
> 
> This patch enables fragmentation tests for the userspace datapath.
> 
> Signed-off-by: Darrell Ball <dlu998@gmail.com>

Thanks for implementing this.

This could use more comments, especially on the data structures and
file-level variables and at the level of an individual function.

Please don't invent yet another mutex.

ipf_print_reass_packet() seems like it could use ds_put_hex_dump() and
thereby be more readable (I don't really want to read 182 hex digits in
a row without any white space).

Why 91 bytes in ipf_print_reass_packet()?

All the counters seem to be write-only.

struct ipf_addr seems odd to me.  It has both aligned and unaligned
versions of addresses, which means that the overall struct needs to be
aligned, and it's a union nested in a struct instead of just a union.
ipf_addr_hash_add() implies that all the bytes in the struct need to be
initialized even if only some of them are used.

ipf_list_key_hash() seems to be at risk of hashing trailing padding at
the end of struct ipf_list_key.

Does ipf_list_complete() run one past the end of the array?  Naively, it
seems like it might.

I found ipf_sort() a little hard to read mostly due to the long variable
names.  Here's an alternate form that you can accept or reject as you
like (I have not tested it):

static void
ipf_sort(struct ipf_frag *frags, size_t last_idx)
    OVS_REQUIRES(ipf_lock)
{
    for (int i = 1; i <= last_idx; i++) {
        struct ipf_frag ipf_frag = frags[i];
        int j = i - 1;
        while (j >= 0 && frags[j].start_data_byte > ipf_frag.start_data_byte) {
            frags[j + 1] = frags[j];
            j--;
        }
        frags[j + 1] = ipf_frag;
    }
}

ipf_reassemble_v4_frags() and ipf_reassemble_v6_frags() know the final
length of the packet they're assembling, but it doesn't look to me like
they preallocate the space for it.

ipf_reassemble_v4_frags() and ipf_reassemble_v6_frags() calculate the
length of the L3 header manually and skip past it.  Couldn't they just
use the L4 header pointer?

The ipf_list_key_cmp() interface is a little confusing because the
return value convention and the name is a little like a strcmp()ish
function, but it doesn't use a -1/+1 return value.  I'd rename it to
ipf_list_key_equal(), change the return type to "bool", and make true
mean equal, false mean unequal.

Processing a fragment is O(n) in the number of existing fragments due to
the dup check.  I don't know whether this is important.

It looks like packet duplication can cause a fraglist to be marked
invalid.  I wonder whether this is good behavior.

It seems like we could estimate the number of fragments needed by
dividing the total size by the size of the first fragment received.

Do we need xzalloc() for frag_list?  It seems like we're going to
initialize each element as needed anyway.

I think there's a race if either v4 or v6 is ever disabled, because the
code checks whether fragment reassembly is enabled twice, once at a high
level and once again in ipf_handle_frag().  The latter function
assert-fails if reassembly is disabled, which seems like it could be a
problem.

At the same time, I don't see any way to ever disable fragment
reassembly.  The code appears to enable it by default and not provide
any way to disable it.

Is it common practice to enforce a minimum size for fragments?

Maybe it would be a little easier to make the counters a two-dimensional
array: atomic_uint64_t ipf_counters[2][N_COUNTERS], where one dimension
is IPv4/IPv6 and the other is the particular counter.  Then ipf_count()
would become trivial.  Just a thought though.

Thanks, and I'll look forward to the next version.
Darrell Ball Dec. 12, 2018, 8:45 a.m. UTC | #4
I'll respond to the individual comments soon, although some comments are
answered by subsequent patches.
I did notice a problem that I will fix and make adjustments for, in that
the ipf context should have been per datapath.

Thanks Darrell

Darrell Ball Feb. 7, 2019, 3:31 a.m. UTC | #5
Thanks very much for the thorough review, and sorry for the delay; I made the
changes last year and then ran into vacation and internal priorities.

On Tue, Dec 11, 2018 at 8:15 AM Ben Pfaff <blp@ovn.org> wrote:

> On Mon, Nov 19, 2018 at 11:09:25AM -0800, Darrell Ball wrote:
> > Fragmentation handling is added for supporting conntrack.
> > Both v4 and v6 are supported.
> >
> > After discussion with several people, I decided to not store
> > configuration state in the database to be more consistent with
> > the kernel in future, similarity with other conntrack configuration
> > which will not be in the database as well and overall simplicity.
> > Accordingly, fragmentation handling is enabled by default.
> >
> > This patch enables fragmentation tests for the userspace datapath.
> >
> > Signed-off-by: Darrell Ball <dlu998@gmail.com>
>
> Thanks for implementing this.
>
> This could use more comments, especially on the data structures and
> file-level variables and at the level of an individual function.
>

Added more comments.


>
> Please don't invent yet another mutex.
>

yep


>
> ipf_print_reass_packet() seems like it could use ds_put_hex_dump() and
> thereby be more readable (I don't really want to read 182 hex digits in
> a row without any white space).
>

ds_put_hex_dump() is better and I will use it – thanks.


>
> Why 91 bytes in ipf_print_reass_packet()?
>

Just a number with more than enough context


>
> All the counters seem to be write-only.
>

Subsequent patches provided that support; they are folded in now.


>
> struct ipf_addr seems odd to me.  It has both aligned and unaligned
> versions of addresses, which means that the overall struct needs to be
> aligned, and it's a union nested in a struct instead of just a union.
>

The struct was never extended; it is trivially converted to a plain union.


> ipf_addr_hash_add() implies that all the bytes in the struct need to be
> initialized even if only some of them are used.
>

The whole key that gets hashed is initialized with a memset.

The performance aspect is a rounding error.


>
> ipf_list_key_hash() seems to be at risk of hashing trailing padding at
> the end of struct ipf_list_key.
>

ipf_list_key_hash() always operates on a memset-zeroed key whose fields are
set afterwards.

The performance aspect is a rounding error.
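
The zero-then-fill idiom described above can be sketched as follows. This is
only an illustration: struct frag_key, frag_key_init() and frag_key_hash() are
hypothetical names, the fields are not those of the real struct ipf_list_key,
and FNV-1a stands in for whatever hash the real code uses.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical key; the real struct ipf_list_key has different fields.
 * On common ABIs there are 2 bytes of trailing padding after 'zone'. */
struct frag_key {
    uint32_t ip_id;
    uint16_t zone;
};

/* Zero the whole object first so that padding bytes start out as 0, then
 * fill in the fields.  Strictly speaking C leaves padding unspecified after
 * a member store; in practice common compilers leave it untouched. */
static void
frag_key_init(struct frag_key *key, uint32_t ip_id, uint16_t zone)
{
    memset(key, 0, sizeof *key);
    key->ip_id = ip_id;
    key->zone = zone;
}

/* FNV-1a over every byte of the key, padding included. */
static uint32_t
frag_key_hash(const struct frag_key *key)
{
    const unsigned char *p = (const unsigned char *) key;
    uint32_t hash = 2166136261u;

    for (size_t i = 0; i < sizeof *key; i++) {
        hash ^= p[i];
        hash *= 16777619u;
    }
    return hash;
}
```

With this shape, two keys built from the same field values hash identically
regardless of what was on the stack before frag_key_init() ran.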


>
> Does ipf_list_complete() run one past the end of the array?  Naively, it
> seems like it might.
>

No; the loop stops at the penultimate index and adds 1 to it on the last
iteration.

I made it more obvious though.
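
A minimal sketch of that kind of loop, assuming fragments are stored as
inclusive byte ranges and are already sorted; struct frag and
frags_contiguous() are hypothetical names, and the real ipf_list_complete()
may check more than contiguity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down fragment record: inclusive byte range. */
struct frag {
    uint16_t start_data_byte;
    uint16_t end_data_byte;
};

/* Returns true if frags[0..last_idx], already sorted by start offset, cover
 * a contiguous byte range.  The loop stops at last_idx - 1, so frags[i + 1]
 * never reads past frags[last_idx]. */
static bool
frags_contiguous(const struct frag *frags, size_t last_idx)
{
    for (size_t i = 0; i < last_idx; i++) {
        if (frags[i].end_data_byte + 1 != frags[i + 1].start_data_byte) {
            return false;
        }
    }
    return true;
}
```

Note that 'last_idx' here is the index of the last element in use, not the
element count, which is why the bound is 'i < last_idx'.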


>
> I found ipf_sort() a little hard to read mostly due to the long variable
> names.  Here's an alternate form that you can accept or reject as you
> like (I have not tested it):
>
> static void
> ipf_sort(struct ipf_frag *frags, size_t last_idx)
>     OVS_REQUIRES(ipf_lock)
> {
>     for (int i = 1; i <= last_idx; i++) {
>         struct ipf_frag ipf_frag = frags[i];
>         int j = i - 1;
>         while (j >= 0 && frags[j].start_data_byte > ipf_frag.start_data_byte) {
>             frags[j + 1] = frags[j];
>             j--;
>         }
>         frags[j + 1] = ipf_frag;
>     }
> }
>

One day I decided to write only while loops to see if you would notice and
you did.
It is trivially equivalent and a for loop where applicable is always more
compact and easier to understand. Also, the longer names were not for my
benefit, so I shortened them.


>
> ipf_reassemble_v4_frags() and ipf_reassemble_v6_frags() know the final
> length of the packet they're assembling, but it doesn't look to me like
> they preallocate the space for it.
>

Sure; the performance aspect is a rounding error, but it is just a couple of
extra lines of code and looks nicer.


>
> ipf_reassemble_v4_frags() and ipf_reassemble_v6_frags() calculate the
> length of the L3 header manually and skip past it.  Couldn't they just
> use the L4 header pointer?
>

I am using the L4 pointer elsewhere; it is weird that I didn’t just use it
here as well.

Converted now.


>
> The ipf_list_key_cmp() interface is a little confusing because the
> return value convention and the name is a little like a strcmp()ish
> function, but it doesn't use a -1/+1 return value.  I'd rename it to
> ipf_list_key_equal(), change the return type to "bool", and make true
> mean equal, false mean unequal.
>

yep; ‘eq’ is simpler semantics for the API, so I used it.
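
For illustration, a bool-returning equality helper along those lines might
look like the sketch below. The struct and its fields are hypothetical
stand-ins, not the actual struct ipf_list_key.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key with a few representative fields. */
struct frag_key {
    uint32_t src;
    uint32_t dst;
    uint32_t ip_id;
    uint16_t zone;
    uint8_t nw_proto;
};

/* Returns true if the two keys are equal, false otherwise.  Comparing field
 * by field avoids depending on the contents of padding bytes. */
static bool
frag_key_eq(const struct frag_key *a, const struct frag_key *b)
{
    return a->src == b->src
           && a->dst == b->dst
           && a->ip_id == b->ip_id
           && a->zone == b->zone
           && a->nw_proto == b->nw_proto;
}
```

Callers can then write 'if (frag_key_eq(&a, &b))' without having to remember
whether zero means a match.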


>
> Processing a fragment is O(n) in the number of existing fragments due to
> the dup check.  I don't know whether this is important.
>

This processing is fast and a fraction of the total per fragment.

It is also practically a constant because of bounding.


>
> It looks like packet duplication can cause a fraglist to be marked
> invalid.  I wonder whether this is good behavior.
>

Only the packet is marked invalid, and that is just an optimization: conntrack
would otherwise reach the same result after doing unnecessary work.


>
> It seems like we could estimate the number of fragments needed by
> dividing the total size by the size of the first fragment received.
>

We don’t know the total size; furthermore, DoS is more likely than real
usage for fragmentation.

I could estimate but there is not much advantage and more complexity.


>
> Do we need xzalloc() for frag_list?  It seems like we're going to
> initialize each element as needed anyway.
>

Maintainability was more important than the insignificant performance gain
here. However, I split out a separate ‘init’ function to handle the
initializations and used xmalloc() instead.



>
> I think there's a race if either v4 or v6 is ever disabled, because the
> code checks whether fragment reassembly is enabled twice, once at a high
> level and once again in ipf_handle_frag().  The latter function
> assert-fails if reassembly is disabled, which seems like it could be a
> problem.
>

Yep, it is a bug – thanks

When I converted to NOT_REACHED, I tried to remember the reason I did not
do it originally, and now I remember :-)
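
One way to avoid this kind of double check, sketched here under assumptions
and not necessarily what the next version does, is to read the flag once and
pass the snapshot down, so the inner function never sees a different value
than its caller. All names below are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-family enable flag. */
static atomic_bool v4_enabled = true;

/* Takes the caller's snapshot instead of re-reading the flag, so it cannot
 * observe a different value than the caller did. */
static int
handle_frag(bool enabled_snapshot)
{
    if (!enabled_snapshot) {
        return 0;           /* Leave the fragment untouched. */
    }
    return 1;               /* Fragment would be queued for reassembly. */
}

/* Reads the flag exactly once per fragment. */
static int
process_frag(void)
{
    bool enabled = atomic_load_explicit(&v4_enabled, memory_order_relaxed);
    return handle_frag(enabled);
}
```

The inner function also returns gracefully when disabled instead of
asserting, so a concurrent toggle cannot crash the datapath.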


>
> At the same time, I don't see any way to ever disable fragment
> reassembly.  The code appears to enable it by default and not provide
> any way to disable it.
>

A subsequent patch in the series allows disabling; I folded that patch into
this one now.


>
> Is it common practice to enforce a minimum size for fragments?
>

A subsequent patch provided comments about this.

Folded that patch into this one.


>
> Maybe it would be a little easier to make the counters a two-dimensional
> array: atomic_uint64_t ipf_counters[2][N_COUNTERS], where one dimension
> is IPv4/IPv6 and the other is the particular counter.  Then ipf_count()
> would become trivial.  Just a thought though.
>

These should have been converted to arrays before.

The intention was to use two arrays, since V6 counters can/will diverge from
V4, which I did now; ipf_count() still becomes a 1 line function.
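
A rough sketch of the two-array layout, using C11 atomics in place of the OVS
atomic wrappers; the enum names, array names and frag_count() are hypothetical
and only illustrate why the counting function collapses to one statement.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counter indices; the two families get separate enums so that
 * they can diverge later. */
enum v4_counter { V4_ACCEPTED, V4_OVERLAP, V4_N_COUNTERS };
enum v6_counter { V6_ACCEPTED, V6_OVERLAP, V6_N_COUNTERS };

static _Atomic uint64_t v4_counters[V4_N_COUNTERS];
static _Atomic uint64_t v6_counters[V6_N_COUNTERS];

/* Increments counter 'idx' for the chosen address family. */
static void
frag_count(bool v6, unsigned int idx)
{
    atomic_fetch_add_explicit(v6 ? &v6_counters[idx] : &v4_counters[idx],
                              1, memory_order_relaxed);
}

/* Reads counter 'idx' for the chosen address family. */
static uint64_t
frag_count_get(bool v6, unsigned int idx)
{
    return atomic_load_explicit(v6 ? &v6_counters[idx] : &v4_counters[idx],
                                memory_order_relaxed);
}
```

A read accessor like frag_count_get() is also what keeps the counters from
being write-only.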



>
> Thanks, and I'll look forward to the next version.
>

Patch

diff --git a/Documentation/faq/releases.rst b/Documentation/faq/releases.rst
index 96da23c..d281c97 100644
--- a/Documentation/faq/releases.rst
+++ b/Documentation/faq/releases.rst
@@ -104,31 +104,30 @@  Q: Are all features available with all datapaths?
     The following table lists the datapath supported features from an Open
     vSwitch user's perspective.
 
-    ===================== ============== ============== ========= =======
-    Feature               Linux upstream Linux OVS tree Userspace Hyper-V
-    ===================== ============== ============== ========= =======
-    NAT                   4.6            YES            Yes       NO
-    Connection tracking   4.3            YES            PARTIAL   PARTIAL
-    Tunnel - LISP         NO             YES            NO        NO
-    Tunnel - STT          NO             YES            NO        YES
-    Tunnel - GRE          3.11           YES            YES       YES
-    Tunnel - VXLAN        3.12           YES            YES       YES
-    Tunnel - Geneve       3.18           YES            YES       YES
-    Tunnel - GRE-IPv6     4.18           YES            YES       NO
-    Tunnel - VXLAN-IPv6   4.3            YES            YES       NO
-    Tunnel - Geneve-IPv6  4.4            YES            YES       NO
-    Tunnel - ERSPAN       4.18           YES            YES       NO
-    Tunnel - ERSPAN-IPv6  4.18           YES            YES       NO
-    QoS - Policing        YES            YES            YES       NO
-    QoS - Shaping         YES            YES            NO        NO
-    sFlow                 YES            YES            YES       NO
-    IPFIX                 3.10           YES            YES       NO
-    Set action            YES            YES            YES       PARTIAL
-    NIC Bonding           YES            YES            YES       YES
-    Multiple VTEPs        YES            YES            YES       YES
-    Meters                4.15           YES            YES       NO
-    Conntrack zone limit  4.18           YES            NO        NO
-    ===================== ============== ============== ========= =======
+    ========================== ============== ============== ========= =======
+    Feature                    Linux upstream Linux OVS tree Userspace Hyper-V
+    ========================== ============== ============== ========= =======
+    Connection tracking             4.3            YES          Yes      YES
+    Conntrack Fragment Reass.       4.3            Yes          Yes      YES
+    NAT                             4.6            YES          Yes      NO
+    Conntrack zone limit            4.18           YES          NO       NO
+    Tunnel - LISP                   NO             YES          NO       NO
+    Tunnel - STT                    NO             YES          NO       YES
+    Tunnel - GRE                    3.11           YES          YES      YES
+    Tunnel - VXLAN                  3.12           YES          YES      YES
+    Tunnel - Geneve                 3.18           YES          YES      YES
+    Tunnel - GRE-IPv6               NO             NO           YES      NO
+    Tunnel - VXLAN-IPv6             4.3            YES          YES      NO
+    Tunnel - Geneve-IPv6            4.4            YES          YES      NO
+    QoS - Policing                  YES            YES          YES      NO
+    QoS - Shaping                   YES            YES          NO       NO
+    sFlow                           YES            YES          YES      NO
+    IPFIX                           3.10           YES          YES      NO
+    Set action                      YES            YES          YES    PARTIAL
+    NIC Bonding                     YES            YES          YES      YES
+    Multiple VTEPs                  YES            YES          YES      YES
+    Meters                          4.15           YES          YES      NO
+    ========================== ============== ============== ========= =======
 
     Do note, however:
 
diff --git a/NEWS b/NEWS
index 02402d1..d2e8724 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,8 @@  Post-v2.10.0
    - ovn:
      * New support for IPSEC encrypted tunnels between hypervisors.
      * ovn-ctl: allow passing user:group ids to the OVN daemons.
+   - Userspace datapath:
+     * Add v4/v6 fragmentation support for conntrack.
    - DPDK:
      * Add option for simple round-robin based Rxq to PMD assignment.
        It can be set with pmd-rxq-assign.
diff --git a/include/sparse/netinet/ip6.h b/include/sparse/netinet/ip6.h
index d2a54de..bfa637a 100644
--- a/include/sparse/netinet/ip6.h
+++ b/include/sparse/netinet/ip6.h
@@ -64,5 +64,6 @@  struct ip6_frag {
 };
 
 #define IP6F_OFF_MASK ((OVS_FORCE ovs_be16) 0xfff8)
+#define IP6F_MORE_FRAG ((OVS_FORCE ovs_be16) 0x0001)
 
 #endif /* netinet/ip6.h sparse */
diff --git a/lib/automake.mk b/lib/automake.mk
index 63e9d72..b24b028 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -1,4 +1,4 @@ 
-# Copyright (C) 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017 Nicira, Inc.
+# Copyright (C) 2009-2018 Nicira, Inc.
 #
 # Copying and distribution of this file, with or without modification,
 # are permitted in any medium without royalty provided the copyright
@@ -107,6 +107,8 @@  lib_libopenvswitch_la_SOURCES = \
 	lib/hmapx.h \
 	lib/id-pool.c \
 	lib/id-pool.h \
+	lib/ipf.c \
+	lib/ipf.h \
 	lib/jhash.c \
 	lib/jhash.h \
 	lib/json.c \
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 3f50fc8..be8debb 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -30,6 +30,7 @@ 
 #include "ct-dpif.h"
 #include "dp-packet.h"
 #include "flow.h"
+#include "ipf.h"
 #include "netdev.h"
 #include "odp-netlink.h"
 #include "openvswitch/hmap.h"
@@ -339,6 +340,7 @@  conntrack_init(struct conntrack *ct)
     atomic_init(&ct->n_conn_limit, DEFAULT_N_CONN_LIMIT);
     latch_init(&ct->clean_thread_exit);
     ct->clean_thread = ovs_thread_create("ct_clean", clean_thread_main, ct);
+    ipf_init();
 }
 
 /* Destroys the connection tracker 'ct' and frees all the allocated memory. */
@@ -381,6 +383,7 @@  conntrack_destroy(struct conntrack *ct)
     hindex_destroy(&ct->alg_expectation_refs);
     ct_rwlock_unlock(&ct->resources_lock);
     ct_rwlock_destroy(&ct->resources_lock);
+    ipf_destroy();
 }
 
 static unsigned hash_to_bucket(uint32_t hash)
@@ -1295,7 +1298,8 @@  process_one(struct conntrack *ct, struct dp_packet *pkt,
 
 /* Sends the packets in '*pkt_batch' through the connection tracker 'ct'.  All
  * the packets must have the same 'dl_type' (IPv4 or IPv6) and should have
- * the l3 and and l4 offset properly set.
+ * the l3 and and l4 offset properly set.  Performs fragment reassembly with
+ * the help of ipf_preprocess_conntrack().
  *
  * If 'commit' is true, the packets are allowed to create new entries in the
  * connection tables.  'setmark', if not NULL, should point to a two
@@ -1310,11 +1314,14 @@  conntrack_execute(struct conntrack *ct, struct dp_packet_batch *pkt_batch,
                   const struct nat_action_info_t *nat_action_info,
                   long long now)
 {
+    ipf_preprocess_conntrack(pkt_batch, now, dl_type, zone, ct->hash_basis);
+
     struct dp_packet *packet;
     struct conn_lookup_ctx ctx;
 
     DP_PACKET_BATCH_FOR_EACH (i, packet, pkt_batch) {
-        if (!conn_key_extract(ct, packet, dl_type, &ctx, zone)) {
+        if (packet->md.ct_state == CS_INVALID
+            || !conn_key_extract(ct, packet, dl_type, &ctx, zone)) {
             packet->md.ct_state = CS_INVALID;
             write_ct_md(packet, zone, NULL, NULL, NULL);
             continue;
@@ -1323,6 +1330,8 @@  conntrack_execute(struct conntrack *ct, struct dp_packet_batch *pkt_batch,
                     setlabel, nat_action_info, tp_src, tp_dst, helper);
     }
 
+    ipf_postprocess_conntrack(pkt_batch, now, dl_type);
+
     return 0;
 }
 
diff --git a/lib/ipf.c b/lib/ipf.c
new file mode 100644
index 0000000..cc8427e
--- /dev/null
+++ b/lib/ipf.c
@@ -0,0 +1,1363 @@ 
+/*
+ * Copyright (c) 2018 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+#include <ctype.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <netinet/in.h>
+#include <netinet/ip6.h>
+#include <netinet/icmp6.h>
+#include <string.h>
+
+#include "coverage.h"
+#include "csum.h"
+#include "ipf.h"
+#include "latch.h"
+#include "openvswitch/hmap.h"
+#include "openvswitch/poll-loop.h"
+#include "openvswitch/vlog.h"
+#include "ovs-atomic.h"
+#include "packets.h"
+#include "util.h"
+
+VLOG_DEFINE_THIS_MODULE(ipf);
+COVERAGE_DEFINE(ipf_stuck_frag_list_purged);
+
+enum {
+    IPV4_PACKET_MAX_HDR_SIZE = 60,
+    IPV4_PACKET_MAX_SIZE = 65535,
+    IPV6_PACKET_MAX_DATA = 65535,
+};
+
+enum ipf_list_state {
+    IPF_LIST_STATE_UNUSED,
+    IPF_LIST_STATE_REASS_FAIL,
+    IPF_LIST_STATE_OTHER_SEEN,
+    IPF_LIST_STATE_FIRST_SEEN,
+    IPF_LIST_STATE_LAST_SEEN,
+    IPF_LIST_STATE_FIRST_LAST_SEEN,
+    IPF_LIST_STATE_COMPLETED,
+    IPF_LIST_STATE_NUM,
+};
+
+enum ipf_list_type {
+    IPF_FRAG_COMPLETED_LIST,
+    IPF_FRAG_EXPIRY_LIST,
+};
+
+enum {
+    IPF_INVALID_IDX = -1,
+    IPF_V4_FRAG_SIZE_LBOUND = 400,
+    IPF_V4_FRAG_SIZE_MIN_DEF = 1200,
+    IPF_V6_FRAG_SIZE_LBOUND = 400, /* Useful for testing. */
+    IPF_V6_FRAG_SIZE_MIN_DEF = 1280,
+    IPF_MAX_FRAGS_DEFAULT = 1000,
+    IPF_NFRAG_UBOUND = 5000,
+};
+
+enum ipf_counter_type {
+    IPF_COUNTER_NFRAGS,
+    IPF_COUNTER_NFRAGS_ACCEPTED,
+    IPF_COUNTER_NFRAGS_COMPL_SENT,
+    IPF_COUNTER_NFRAGS_EXPD_SENT,
+    IPF_COUNTER_NFRAGS_TOO_SMALL,
+    IPF_COUNTER_NFRAGS_OVERLAP,
+    IPF_COUNTER_NFRAGS_PURGED,
+};
+
+struct ipf_addr {
+    union {
+        ovs_16aligned_be32 ipv4;
+        union ovs_16aligned_in6_addr ipv6;
+        ovs_be32 ipv4_aligned;
+        struct in6_addr ipv6_aligned;
+    };
+};
+
+struct ipf_frag {
+    struct dp_packet *pkt;
+    uint16_t start_data_byte;
+    uint16_t end_data_byte;
+};
+
+struct ipf_list_key {
+    struct ipf_addr src_addr;
+    struct ipf_addr dst_addr;
+    uint32_t recirc_id;
+    ovs_be32 ip_id;   /* V6 is 32 bits. */
+    ovs_be16 dl_type;
+    uint16_t zone;
+    uint8_t nw_proto;
+};
+
+struct ipf_list {
+    struct hmap_node node;
+    struct ovs_list list_node;
+    struct ipf_frag *frag_list;
+    struct ipf_list_key key;
+    struct dp_packet *reass_execute_ctx; /* Reassembled packet. */
+    long long expiration;          /* In milliseconds. */
+    int last_sent_idx;             /* Last sent fragment idx. */
+    int last_inuse_idx;            /* Last inuse fragment idx. */
+    int size;                      /* Fragment list size. */
+    uint8_t state;                 /* Frag list state; see ipf_list_state. */
+};
+
+struct reassembled_pkt {
+    struct ovs_list rp_list_node;
+    struct dp_packet *pkt;
+    struct ipf_list *list;
+};
+
+struct OVS_LOCKABLE ipf_lock {
+    struct ovs_mutex lock;
+};
+
+static struct ipf_lock ipf_lock;
+
+static int max_v4_frag_list_size;
+
+static pthread_t ipf_clean_thread;
+static struct latch ipf_clean_thread_exit;
+
+static struct hmap frag_lists OVS_GUARDED_BY(ipf_lock);
+static struct ovs_list frag_exp_list OVS_GUARDED_BY(ipf_lock);
+static struct ovs_list frag_complete_list OVS_GUARDED_BY(ipf_lock);
+static struct ovs_list reassembled_pkt_list OVS_GUARDED_BY(ipf_lock);
+
+static atomic_bool ifp_v4_enabled;
+static atomic_bool ifp_v6_enabled;
+static atomic_uint nfrag_max;
+/* Clamped to a lower bound of 400 bytes.  The value chosen should be large
+ * enough that ALG control packets of interest, which use string encoding of
+ * mutable IP fields, are not fragmented. */
+static atomic_uint min_v4_frag_size;
+static atomic_uint min_v6_frag_size;
+
+static atomic_count nfrag;
+
+static atomic_uint64_t n4frag_accepted;
+static atomic_uint64_t n4frag_completed_sent;
+static atomic_uint64_t n4frag_expired_sent;
+static atomic_uint64_t n4frag_too_small;
+static atomic_uint64_t n4frag_overlap;
+static atomic_uint64_t n4frag_purged;
+static atomic_uint64_t n6frag_accepted;
+static atomic_uint64_t n6frag_completed_sent;
+static atomic_uint64_t n6frag_expired_sent;
+static atomic_uint64_t n6frag_too_small;
+static atomic_uint64_t n6frag_overlap;
+static atomic_uint64_t n6frag_purged;
+
+static void
+ipf_print_reass_packet(char *es, void *pkt)
+{
+    const unsigned char *b = (const unsigned char *) pkt;
+    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(10, 10);
+    VLOG_WARN_RL(&rl, "%s 91 bytes from specified part of packet "
+                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
+                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
+                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
+                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
+                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
+                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
+                 "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X",
+                 es, b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7], b[8],
+                 b[9], b[10], b[11], b[12], b[13], b[14], b[15], b[16],
+                 b[17], b[18], b[19], b[20], b[21], b[22], b[23], b[24],
+                 b[25], b[26], b[27], b[28], b[29], b[30], b[31], b[32],
+                 b[33], b[34], b[35], b[36], b[37], b[38], b[39], b[40],
+                 b[41], b[42], b[43], b[44], b[45], b[46], b[47], b[48],
+                 b[49], b[50], b[51], b[52], b[53], b[54], b[55], b[56],
+                 b[57], b[58], b[59], b[60], b[61], b[62], b[63], b[64],
+                 b[65], b[66], b[67], b[68], b[69], b[70], b[71], b[72],
+                 b[73], b[74], b[75], b[76], b[77], b[78], b[79], b[80],
+                 b[81], b[82], b[83], b[84], b[85], b[86], b[87], b[88],
+                 b[89], b[90]);
+}
+
+static void ipf_lock_init(struct ipf_lock *lock)
+{
+    ovs_mutex_init_adaptive(&lock->lock);
+}
+
+static void ipf_lock_lock(struct ipf_lock *lock)
+    OVS_ACQUIRES(lock)
+    OVS_NO_THREAD_SAFETY_ANALYSIS
+{
+    ovs_mutex_lock(&lock->lock);
+}
+
+static void ipf_lock_unlock(struct ipf_lock *lock)
+    OVS_RELEASES(lock)
+    OVS_NO_THREAD_SAFETY_ANALYSIS
+{
+    ovs_mutex_unlock(&lock->lock);
+}
+
+static void ipf_lock_destroy(struct ipf_lock *lock)
+{
+    ovs_mutex_destroy(&lock->lock);
+}
+
+static void
+ipf_count(bool v6, enum ipf_counter_type cntr)
+{
+    switch (cntr) {
+    case IPF_COUNTER_NFRAGS_ACCEPTED:
+        atomic_count_inc64(v6 ? &n6frag_accepted : &n4frag_accepted);
+        break;
+    case IPF_COUNTER_NFRAGS_COMPL_SENT:
+        atomic_count_inc64(v6 ? &n6frag_completed_sent
+                         : &n4frag_completed_sent);
+        break;
+    case IPF_COUNTER_NFRAGS_EXPD_SENT:
+        atomic_count_inc64(v6 ? &n6frag_expired_sent
+                         : &n4frag_expired_sent);
+        break;
+    case IPF_COUNTER_NFRAGS_TOO_SMALL:
+        atomic_count_inc64(v6 ? &n6frag_too_small : &n4frag_too_small);
+        break;
+    case IPF_COUNTER_NFRAGS_OVERLAP:
+        atomic_count_inc64(v6 ? &n6frag_overlap : &n4frag_overlap);
+        break;
+    case IPF_COUNTER_NFRAGS_PURGED:
+        atomic_count_inc64(v6 ? &n6frag_purged : &n4frag_purged);
+        break;
+    case IPF_COUNTER_NFRAGS:
+    default:
+        OVS_NOT_REACHED();
+    }
+}
+
+static bool
+ipf_get_v4_enabled(void)
+{
+    bool ifp_v4_enabled_;
+    atomic_read_relaxed(&ifp_v4_enabled, &ifp_v4_enabled_);
+    return ifp_v4_enabled_;
+}
+
+static bool
+ipf_get_v6_enabled(void)
+{
+    bool ifp_v6_enabled_;
+    atomic_read_relaxed(&ifp_v6_enabled, &ifp_v6_enabled_);
+    return ifp_v6_enabled_;
+}
+
+static bool
+ipf_get_enabled(void)
+{
+    return ipf_get_v4_enabled() || ipf_get_v6_enabled();
+}
+
+static uint32_t
+ipf_addr_hash_add(uint32_t hash, const struct ipf_addr *addr)
+{
+    BUILD_ASSERT_DECL(sizeof *addr % 4 == 0);
+    return hash_add_bytes32(hash, (const uint32_t *) addr, sizeof *addr);
+}
+
+static void
+ipf_expiry_list_add(struct ipf_list *ipf_list, long long now)
+    OVS_REQUIRES(ipf_lock)
+{
+    enum {
+        IPF_FRAG_LIST_TIMEOUT = 15000,
+    };
+
+    ipf_list->expiration = now + IPF_FRAG_LIST_TIMEOUT;
+    ovs_list_push_back(&frag_exp_list, &ipf_list->list_node);
+}
+
+static void
+ipf_completed_list_add(struct ipf_list *ipf_list)
+    OVS_REQUIRES(ipf_lock)
+{
+    ovs_list_push_back(&frag_complete_list, &ipf_list->list_node);
+}
+
+static void
+ipf_reassembled_list_add(struct reassembled_pkt *rp)
+    OVS_REQUIRES(ipf_lock)
+{
+    ovs_list_push_back(&reassembled_pkt_list, &rp->rp_list_node);
+}
+
+static void
+ipf_list_clean(struct ipf_list *ipf_list)
+    OVS_REQUIRES(ipf_lock)
+{
+    ovs_list_remove(&ipf_list->list_node);
+    hmap_remove(&frag_lists, &ipf_list->node);
+    free(ipf_list->frag_list);
+    free(ipf_list);
+}
+
+static void
+ipf_expiry_list_clean(struct ipf_list *ipf_list)
+    OVS_REQUIRES(ipf_lock)
+{
+    ipf_list_clean(ipf_list);
+}
+
+static void
+ipf_completed_list_clean(struct ipf_list *ipf_list)
+    OVS_REQUIRES(ipf_lock)
+{
+    ipf_list_clean(ipf_list);
+}
+
+static void
+ipf_expiry_list_remove(struct ipf_list *ipf_list)
+    OVS_REQUIRES(ipf_lock)
+{
+    ovs_list_remove(&ipf_list->list_node);
+}
+
+static void
+ipf_reassembled_list_remove(struct reassembled_pkt *rp)
+    OVS_REQUIRES(ipf_lock)
+{
+    ovs_list_remove(&rp->rp_list_node);
+}
+
+/* Symmetric */
+static uint32_t
+ipf_list_key_hash(const struct ipf_list_key *key, uint32_t basis)
+{
+    uint32_t hsrc, hdst, hash;
+    hsrc = hdst = basis;
+    hsrc = ipf_addr_hash_add(hsrc, &key->src_addr);
+    hdst = ipf_addr_hash_add(hdst, &key->dst_addr);
+    hash = hsrc ^ hdst;
+
+    /* Hash the rest of the key. */
+    hash = hash_words((uint32_t *) (&key->dst_addr + 1),
+                      (uint32_t *) (key + 1) -
+                          (uint32_t *) (&key->dst_addr + 1),
+                      hash);
+
+    return hash_finish(hash, 0);
+}
+
+static bool
+ipf_is_first_v4_frag(const struct dp_packet *pkt)
+{
+    const struct ip_header *l3 = dp_packet_l3(pkt);
+    if (!(l3->ip_frag_off & htons(IP_FRAG_OFF_MASK)) &&
+        l3->ip_frag_off & htons(IP_MORE_FRAGMENTS)) {
+        return true;
+    }
+    return false;
+}
+
+static bool
+ipf_is_last_v4_frag(const struct dp_packet *pkt)
+{
+    const struct ip_header *l3 = dp_packet_l3(pkt);
+    if (l3->ip_frag_off & htons(IP_FRAG_OFF_MASK) &&
+        !(l3->ip_frag_off & htons(IP_MORE_FRAGMENTS))) {
+        return true;
+    }
+    return false;
+}
+
+static bool
+ipf_is_v6_frag(ovs_be16 ip6f_offlg)
+{
+    if (ip6f_offlg & (IP6F_OFF_MASK | IP6F_MORE_FRAG)) {
+        return true;
+    }
+    return false;
+}
+
+static bool
+ipf_is_first_v6_frag(ovs_be16 ip6f_offlg)
+{
+    if (!(ip6f_offlg & IP6F_OFF_MASK) &&
+        ip6f_offlg & IP6F_MORE_FRAG) {
+        return true;
+    }
+    return false;
+}
+
+static bool
+ipf_is_last_v6_frag(ovs_be16 ip6f_offlg)
+{
+    if ((ip6f_offlg & IP6F_OFF_MASK) &&
+        !(ip6f_offlg & IP6F_MORE_FRAG)) {
+        return true;
+    }
+    return false;
+}
+
+static bool
+ipf_list_complete(const struct ipf_list *ipf_list)
+    OVS_REQUIRES(ipf_lock)
+{
+    for (int i = 0; i < ipf_list->last_inuse_idx; i++) {
+        if (ipf_list->frag_list[i].end_data_byte + 1
+            != ipf_list->frag_list[i + 1].start_data_byte) {
+            return false;
+        }
+    }
+    return true;
+}
+
+/* Insertion sort; runs in O(n) for a sorted or almost sorted list. */
+static void
+ipf_sort(struct ipf_frag *frag_list, size_t last_idx)
+    OVS_REQUIRES(ipf_lock)
+{
+    int running_last_idx = 1;
+    struct ipf_frag ipf_frag;
+    while (running_last_idx <= last_idx) {
+        ipf_frag = frag_list[running_last_idx];
+        int frag_list_idx = running_last_idx - 1;
+        while (frag_list_idx >= 0 &&
+               frag_list[frag_list_idx].start_data_byte >
+                   ipf_frag.start_data_byte) {
+            frag_list[frag_list_idx + 1] = frag_list[frag_list_idx];
+            frag_list_idx -= 1;
+        }
+        frag_list[frag_list_idx + 1] = ipf_frag;
+        running_last_idx++;
+    }
+}
+
+/* Called on a sorted complete list of fragments. */
+static struct dp_packet *
+ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
+    OVS_REQUIRES(ipf_lock)
+{
+    struct ipf_frag *frag_list = ipf_list->frag_list;
+    struct dp_packet *pkt = dp_packet_clone(frag_list[0].pkt);
+    struct ip_header *l3 = dp_packet_l3(pkt);
+    int len = ntohs(l3->ip_tot_len);
+    size_t add_len;
+    size_t ip_hdr_len = IP_IHL(l3->ip_ihl_ver) * 4;
+
+    for (int i = 1; i <= ipf_list->last_inuse_idx; i++) {
+        add_len = frag_list[i].end_data_byte -
+                         frag_list[i].start_data_byte + 1;
+        len += add_len;
+        if (len > IPV4_PACKET_MAX_SIZE) {
+            ipf_print_reass_packet(
+                "Unsupported big reassembled v4 packet; v4 hdr:", l3);
+            dp_packet_delete(pkt);
+            return NULL;
+        }
+        l3 = dp_packet_l3(frag_list[i].pkt);
+        dp_packet_put(pkt, (char *)l3 + ip_hdr_len, add_len);
+    }
+    l3 = dp_packet_l3(pkt);
+    ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
+    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
+                                new_ip_frag_off);
+    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
+    l3->ip_tot_len = htons(len);
+    l3->ip_frag_off = new_ip_frag_off;
+
+    return pkt;
+}
+
+/* Called on a sorted complete list of fragments. */
+static struct dp_packet *
+ipf_reassemble_v6_frags(struct ipf_list *ipf_list)
+    OVS_REQUIRES(ipf_lock)
+{
+    struct ipf_frag *frag_list = ipf_list->frag_list;
+    struct dp_packet *pkt = dp_packet_clone(frag_list[0].pkt);
+    struct ovs_16aligned_ip6_hdr *l3 = dp_packet_l3(pkt);
+    int pl = ntohs(l3->ip6_plen) - sizeof(struct ovs_16aligned_ip6_frag);
+    const char *tail = dp_packet_tail(pkt);
+    uint8_t pad = dp_packet_l2_pad_size(pkt);
+    const char *l4 = dp_packet_l4(pkt);
+    size_t l3_size = tail - (char *)l3 - pad;
+    size_t l4_size = tail - (char *)l4 - pad;
+    size_t l3_hlen = l3_size - l4_size;
+    size_t add_len;
+
+    for (int i = 1; i <= ipf_list->last_inuse_idx; i++) {
+        add_len = frag_list[i].end_data_byte -
+                          frag_list[i].start_data_byte + 1;
+        pl += add_len;
+        if (pl > IPV6_PACKET_MAX_DATA) {
+            ipf_print_reass_packet(
+                "Unsupported big reassembled v6 packet; v6 hdr:", l3);
+            dp_packet_delete(pkt);
+            return NULL;
+        }
+        l3 = dp_packet_l3(frag_list[i].pkt);
+        dp_packet_put(pkt, (char *)l3 + l3_hlen, add_len);
+    }
+    l3 = dp_packet_l3(pkt);
+    l4 = dp_packet_l4(pkt);
+    tail = dp_packet_tail(pkt);
+    pad = dp_packet_l2_pad_size(pkt);
+    l3_size = tail - (char *)l3 - pad;
+
+    uint8_t nw_proto = l3->ip6_nxt;
+    uint8_t nw_frag = 0;
+    const void *data = l3 + 1;
+    size_t datasize = l3_size - sizeof *l3;
+
+    const struct ovs_16aligned_ip6_frag *frag_hdr = NULL;
+    if (!parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag, &frag_hdr)
+        || !nw_frag || !frag_hdr) {
+
+        ipf_print_reass_packet("Unparsed reassembled v6 packet; v6 hdr:", l3);
+        dp_packet_delete(pkt);
+        return NULL;
+    }
+
+    struct ovs_16aligned_ip6_frag *fh =
+        CONST_CAST(struct ovs_16aligned_ip6_frag *, frag_hdr);
+    fh->ip6f_offlg = 0;
+    l3->ip6_plen = htons(pl);
+    l3->ip6_ctlun.ip6_un1.ip6_un1_nxt = nw_proto;
+    return pkt;
+}
+
+/* Called when a valid fragment is added. */
+static void
+ipf_list_state_transition(struct ipf_list *ipf_list, bool ff, bool lf,
+                          bool v6)
+    OVS_REQUIRES(ipf_lock)
+{
+    enum ipf_list_state curr_state = ipf_list->state;
+    enum ipf_list_state next_state;
+    switch (curr_state) {
+    case IPF_LIST_STATE_UNUSED:
+    case IPF_LIST_STATE_OTHER_SEEN:
+        if (ff) {
+            next_state = IPF_LIST_STATE_FIRST_SEEN;
+        } else if (lf) {
+            next_state = IPF_LIST_STATE_LAST_SEEN;
+        } else {
+            next_state = IPF_LIST_STATE_OTHER_SEEN;
+        }
+        break;
+    case IPF_LIST_STATE_FIRST_SEEN:
+        if (ff) {
+            next_state = IPF_LIST_STATE_FIRST_SEEN;
+        } else if (lf) {
+            next_state = IPF_LIST_STATE_FIRST_LAST_SEEN;
+        } else {
+            next_state = IPF_LIST_STATE_FIRST_SEEN;
+        }
+        break;
+    case IPF_LIST_STATE_LAST_SEEN:
+        if (ff) {
+            next_state = IPF_LIST_STATE_FIRST_LAST_SEEN;
+        } else if (lf) {
+            next_state = IPF_LIST_STATE_LAST_SEEN;
+        } else {
+            next_state = IPF_LIST_STATE_LAST_SEEN;
+        }
+        break;
+    case IPF_LIST_STATE_FIRST_LAST_SEEN:
+        next_state = IPF_LIST_STATE_FIRST_LAST_SEEN;
+        break;
+    case IPF_LIST_STATE_COMPLETED:
+    case IPF_LIST_STATE_REASS_FAIL:
+    case IPF_LIST_STATE_NUM:
+    default:
+        OVS_NOT_REACHED();
+    }
+
+    if (next_state == IPF_LIST_STATE_FIRST_LAST_SEEN) {
+        ipf_sort(ipf_list->frag_list, ipf_list->last_inuse_idx);
+        if (ipf_list_complete(ipf_list)) {
+            struct dp_packet *reass_pkt = v6
+                ? ipf_reassemble_v6_frags(ipf_list)
+                : ipf_reassemble_v4_frags(ipf_list);
+            if (reass_pkt) {
+                struct reassembled_pkt *rp = xzalloc(sizeof *rp);
+                rp->pkt = reass_pkt;
+                rp->list = ipf_list;
+                ipf_reassembled_list_add(rp);
+                ipf_expiry_list_remove(ipf_list);
+                next_state = IPF_LIST_STATE_COMPLETED;
+            } else {
+                next_state = IPF_LIST_STATE_REASS_FAIL;
+            }
+        }
+    }
+    ipf_list->state = next_state;
+}
+
+static bool
+ipf_is_valid_v4_frag(struct dp_packet *pkt)
+{
+    if (OVS_UNLIKELY(dp_packet_ip_checksum_bad(pkt))) {
+        goto invalid_pkt;
+    }
+
+    const struct eth_header *l2 = dp_packet_eth(pkt);
+    const struct ip_header *l3 = dp_packet_l3(pkt);
+
+    if (OVS_UNLIKELY(!l2 || !l3)) {
+        goto invalid_pkt;
+    }
+
+    const char *tail = dp_packet_tail(pkt);
+    uint8_t pad = dp_packet_l2_pad_size(pkt);
+    size_t size = tail - (char *)l3 - pad;
+    if (OVS_UNLIKELY(size < IP_HEADER_LEN)) {
+        goto invalid_pkt;
+    }
+
+    if (!(IP_IS_FRAGMENT(l3->ip_frag_off))) {
+        return false;
+    }
+
+    uint16_t ip_tot_len = ntohs(l3->ip_tot_len);
+    if (OVS_UNLIKELY(ip_tot_len != size)) {
+        goto invalid_pkt;
+    }
+
+    size_t ip_hdr_len = IP_IHL(l3->ip_ihl_ver) * 4;
+    if (OVS_UNLIKELY(ip_hdr_len < IP_HEADER_LEN)) {
+        goto invalid_pkt;
+    }
+    if (OVS_UNLIKELY(size < ip_hdr_len)) {
+        goto invalid_pkt;
+    }
+
+    if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
+                     && csum(l3, ip_hdr_len) != 0)) {
+        goto invalid_pkt;
+    }
+
+    uint32_t min_v4_frag_size_;
+    atomic_read_relaxed(&min_v4_frag_size, &min_v4_frag_size_);
+    bool lf = ipf_is_last_v4_frag(pkt);
+    if (OVS_UNLIKELY(!lf && dp_packet_size(pkt) < min_v4_frag_size_)) {
+        ipf_count(false, IPF_COUNTER_NFRAGS_TOO_SMALL);
+        goto invalid_pkt;
+    }
+    return true;
+
+invalid_pkt:
+    pkt->md.ct_state = CS_INVALID;
+    return false;
+
+}
+
+static bool
+ipf_v4_key_extract(struct dp_packet *pkt, ovs_be16 dl_type, uint16_t zone,
+                   struct ipf_list_key *key, uint16_t *start_data_byte,
+                   uint16_t *end_data_byte, bool *ff, bool *lf)
+{
+    const struct ip_header *l3 = dp_packet_l3(pkt);
+    uint16_t ip_tot_len = ntohs(l3->ip_tot_len);
+    size_t ip_hdr_len = IP_IHL(l3->ip_ihl_ver) * 4;
+
+    *start_data_byte = ntohs(l3->ip_frag_off & htons(IP_FRAG_OFF_MASK)) * 8;
+    *end_data_byte = *start_data_byte + ip_tot_len - ip_hdr_len - 1;
+    *ff = ipf_is_first_v4_frag(pkt);
+    *lf = ipf_is_last_v4_frag(pkt);
+    memset(key, 0, sizeof *key);
+    key->ip_id = be16_to_be32(l3->ip_id);
+    key->dl_type = dl_type;
+    key->src_addr.ipv4 = l3->ip_src;
+    key->dst_addr.ipv4 = l3->ip_dst;
+    key->nw_proto = l3->ip_proto;
+    key->zone = zone;
+    key->recirc_id = pkt->md.recirc_id;
+    return true;
+}
+
+static bool
+ipf_is_valid_v6_frag(struct dp_packet *pkt)
+{
+    const struct eth_header *l2 = dp_packet_eth(pkt);
+    const struct ovs_16aligned_ip6_hdr *l3 = dp_packet_l3(pkt);
+    const char *l4 = dp_packet_l4(pkt);
+
+    if (OVS_UNLIKELY(!l2 || !l3 || !l4)) {
+        goto invalid_pkt;
+    }
+
+    const char *tail = dp_packet_tail(pkt);
+    uint8_t pad = dp_packet_l2_pad_size(pkt);
+    size_t l3_size = tail - (char *)l3 - pad;
+    size_t l3_hdr_size = sizeof *l3;
+
+    if (OVS_UNLIKELY(l3_size < l3_hdr_size)) {
+        goto invalid_pkt;
+    }
+
+    uint8_t nw_frag = 0;
+    uint8_t nw_proto = l3->ip6_nxt;
+    const void *data = l3 + 1;
+    size_t datasize = l3_size - l3_hdr_size;
+    const struct ovs_16aligned_ip6_frag *frag_hdr = NULL;
+    if (!parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag,
+                             &frag_hdr) || !nw_frag || !frag_hdr) {
+        return false;
+    }
+
+    int pl = ntohs(l3->ip6_plen);
+    if (OVS_UNLIKELY(pl + l3_hdr_size != l3_size)) {
+        goto invalid_pkt;
+    }
+
+    ovs_be16 ip6f_offlg = frag_hdr->ip6f_offlg;
+    if (OVS_UNLIKELY(!ipf_is_v6_frag(ip6f_offlg))) {
+        return false;
+    }
+
+    uint32_t min_v6_frag_size_;
+    atomic_read_relaxed(&min_v6_frag_size, &min_v6_frag_size_);
+    bool lf = ipf_is_last_v6_frag(ip6f_offlg);
+
+    if (OVS_UNLIKELY(!lf && dp_packet_size(pkt) < min_v6_frag_size_)) {
+        ipf_count(true, IPF_COUNTER_NFRAGS_TOO_SMALL);
+        goto invalid_pkt;
+    }
+
+    return true;
+
+invalid_pkt:
+    pkt->md.ct_state = CS_INVALID;
+    return false;
+
+}
+
+static void
+ipf_v6_key_extract(struct dp_packet *pkt, ovs_be16 dl_type, uint16_t zone,
+                   struct ipf_list_key *key, uint16_t *start_data_byte,
+                   uint16_t *end_data_byte, bool *ff, bool *lf)
+{
+    const struct ovs_16aligned_ip6_hdr *l3 = dp_packet_l3(pkt);
+    const char *l4 = dp_packet_l4(pkt);
+    const char *tail = dp_packet_tail(pkt);
+    uint8_t pad = dp_packet_l2_pad_size(pkt);
+    size_t l3_size = tail - (char *)l3 - pad;
+    size_t l4_size = tail - (char *)l4 - pad;
+    size_t l3_hdr_size = sizeof *l3;
+    uint8_t nw_frag = 0;
+    uint8_t nw_proto = l3->ip6_nxt;
+    const void *data = l3 + 1;
+    size_t datasize = l3_size - l3_hdr_size;
+    const struct ovs_16aligned_ip6_frag *frag_hdr = NULL;
+
+    parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag, &frag_hdr);
+    ovs_assert(nw_frag && frag_hdr);
+    ovs_be16 ip6f_offlg = frag_hdr->ip6f_offlg;
+    *start_data_byte = ntohs(ip6f_offlg & IP6F_OFF_MASK) +
+        sizeof (struct ovs_16aligned_ip6_frag);
+    *end_data_byte = *start_data_byte + l4_size - 1;
+    *ff = ipf_is_first_v6_frag(ip6f_offlg);
+    *lf = ipf_is_last_v6_frag(ip6f_offlg);
+    memset(key, 0, sizeof *key);
+    key->ip_id = get_16aligned_be32(&frag_hdr->ip6f_ident);
+    key->dl_type = dl_type;
+    key->src_addr.ipv6 = l3->ip6_src;
+    /* We are not supporting parsing of the routing header to use as the
+     * dst address part of the key. */
+    key->dst_addr.ipv6 = l3->ip6_dst;
+    key->nw_proto = 0;   /* Not used for key for V6. */
+    key->zone = zone;
+    key->recirc_id = pkt->md.recirc_id;
+}
+
+static int
+ipf_list_key_cmp(const struct ipf_list_key *key1,
+                 const struct ipf_list_key *key2)
+    OVS_REQUIRES(ipf_lock)
+{
+    if (!memcmp(&key1->src_addr, &key2->src_addr, sizeof key1->src_addr) &&
+        !memcmp(&key1->dst_addr, &key2->dst_addr, sizeof key1->dst_addr) &&
+        (key1->dl_type == key2->dl_type) &&
+        (key1->ip_id == key2->ip_id) &&
+        (key1->zone == key2->zone) &&
+        (key1->nw_proto == key2->nw_proto) &&
+        (key1->recirc_id == key2->recirc_id)) {
+        return 0;
+    }
+    return 1;
+}
+
+static struct ipf_list *
+ipf_list_key_lookup(const struct ipf_list_key *key,
+                    uint32_t hash)
+    OVS_REQUIRES(ipf_lock)
+{
+    struct ipf_list *ipf_list;
+    HMAP_FOR_EACH_WITH_HASH (ipf_list, node, hash, &frag_lists) {
+        if (!ipf_list_key_cmp(&ipf_list->key, key)) {
+            return ipf_list;
+        }
+    }
+    return NULL;
+}
+
+static bool
+ipf_is_frag_duped(const struct ipf_frag *frag_list, int last_inuse_idx,
+                  size_t start_data_byte, size_t end_data_byte)
+    OVS_REQUIRES(ipf_lock)
+{
+    for (int i = 0; i <= last_inuse_idx; i++) {
+        if (((start_data_byte >= frag_list[i].start_data_byte) &&
+            (start_data_byte <= frag_list[i].end_data_byte)) ||
+            ((end_data_byte >= frag_list[i].start_data_byte) &&
+             (end_data_byte <= frag_list[i].end_data_byte))) {
+            return true;
+        }
+    }
+    return false;
+}
+
+static bool
+ipf_process_frag(struct ipf_list *ipf_list, struct dp_packet *pkt,
+                 uint16_t start_data_byte, uint16_t end_data_byte,
+                 bool ff, bool lf, bool v6)
+    OVS_REQUIRES(ipf_lock)
+{
+    bool duped_frag = ipf_is_frag_duped(ipf_list->frag_list,
+        ipf_list->last_inuse_idx, start_data_byte, end_data_byte);
+    int last_inuse_idx = ipf_list->last_inuse_idx;
+
+    if (!duped_frag) {
+        if (last_inuse_idx < ipf_list->size - 1) {
+            /* In the case of dpdk, it would be unfortunate if we had
+             * to create a clone fragment outside the dpdk mp due to the
+             * mempool size being too limited. We will otherwise need to
+             * recommend not setting the mempool number of buffers too low
+             * and also clamp the number of fragments. */
+            ipf_list->frag_list[last_inuse_idx + 1].pkt = pkt;
+            ipf_list->frag_list[last_inuse_idx + 1].start_data_byte =
+                start_data_byte;
+            ipf_list->frag_list[last_inuse_idx + 1].end_data_byte =
+                end_data_byte;
+            ipf_list->last_inuse_idx++;
+            atomic_count_inc(&nfrag);
+            ipf_count(v6, IPF_COUNTER_NFRAGS_ACCEPTED);
+            ipf_list_state_transition(ipf_list, ff, lf, v6);
+        } else {
+            OVS_NOT_REACHED();
+        }
+    } else {
+        ipf_count(v6, IPF_COUNTER_NFRAGS_OVERLAP);
+        pkt->md.ct_state = CS_INVALID;
+        return false;
+    }
+    return true;
+}
+
+static bool
+ipf_handle_frag(struct dp_packet *pkt, ovs_be16 dl_type, uint16_t zone,
+                long long now, uint32_t hash_basis)
+    OVS_REQUIRES(ipf_lock)
+{
+    struct ipf_list_key key;
+    /* Initialize 4 variables for some versions of GCC. */
+    uint16_t start_data_byte = 0;
+    uint16_t end_data_byte = 0;
+    bool ff = false;
+    bool lf = false;
+    bool v6 = dl_type == htons(ETH_TYPE_IPV6);
+
+    if (v6 && ipf_get_v6_enabled()) {
+        ipf_v6_key_extract(pkt, dl_type, zone, &key, &start_data_byte,
+                           &end_data_byte, &ff, &lf);
+    } else if (!v6 && ipf_get_v4_enabled()) {
+        ipf_v4_key_extract(pkt, dl_type, zone, &key, &start_data_byte,
+                           &end_data_byte, &ff, &lf);
+    } else {
+        OVS_NOT_REACHED();
+    }
+
+    unsigned int nfrag_max_;
+    atomic_read_relaxed(&nfrag_max, &nfrag_max_);
+    if (atomic_count_get(&nfrag) >= nfrag_max_) {
+        return false;
+    }
+
+    uint32_t hash = ipf_list_key_hash(&key, hash_basis);
+    struct ipf_list *ipf_list = ipf_list_key_lookup(&key, hash);
+    enum {
+        IPF_FRAG_LIST_MIN_INCREMENT = 4,
+        IPF_IPV6_MAX_FRAG_LIST_SIZE = 65535,
+    };
+
+    int max_frag_list_size;
+    if (v6) {
+        /* Because the calculation with extension headers is variable,
+         * we don't calculate a hard maximum fragment list size upfront.  The
+         * fragment list size is practically limited by the code, however. */
+        max_frag_list_size = IPF_IPV6_MAX_FRAG_LIST_SIZE;
+    } else {
+        max_frag_list_size = max_v4_frag_list_size;
+    }
+
+    if (!ipf_list) {
+        ipf_list = xzalloc(sizeof *ipf_list);
+        ipf_list->key = key;
+        ipf_list->last_inuse_idx = IPF_INVALID_IDX;
+        ipf_list->last_sent_idx = IPF_INVALID_IDX;
+        ipf_list->size =
+            MIN(max_frag_list_size, IPF_FRAG_LIST_MIN_INCREMENT);
+        ipf_list->frag_list =
+            xzalloc(ipf_list->size * sizeof *ipf_list->frag_list);
+        hmap_insert(&frag_lists, &ipf_list->node, hash);
+        ipf_expiry_list_add(ipf_list, now);
+    } else if (ipf_list->state == IPF_LIST_STATE_REASS_FAIL) {
+        /* Bail out as early as possible. */
+        return false;
+    } else if (ipf_list->last_inuse_idx + 1 >= ipf_list->size) {
+        int increment = MIN(IPF_FRAG_LIST_MIN_INCREMENT,
+                            max_frag_list_size - ipf_list->size);
+        /* Enforce limit. */
+        if (increment > 0) {
+            ipf_list->frag_list =
+                xrealloc(ipf_list->frag_list, (ipf_list->size + increment) *
+                  sizeof *ipf_list->frag_list);
+            ipf_list->size += increment;
+        } else {
+            return false;
+        }
+    }
+
+    return ipf_process_frag(ipf_list, pkt, start_data_byte, end_data_byte, ff,
+                            lf, v6);
+}
+
+static void
+ipf_extract_frags_from_batch(struct dp_packet_batch *pb, ovs_be16 dl_type,
+                             uint16_t zone, long long now, uint32_t hash_basis)
+{
+    const size_t pb_cnt = dp_packet_batch_size(pb);
+    int pb_idx; /* Index in a packet batch. */
+    struct dp_packet *pkt;
+
+    DP_PACKET_BATCH_REFILL_FOR_EACH (pb_idx, pb_cnt, pkt, pb) {
+
+        if (OVS_UNLIKELY((dl_type == htons(ETH_TYPE_IP) &&
+                          ipf_is_valid_v4_frag(pkt)) ||
+                         (dl_type == htons(ETH_TYPE_IPV6) &&
+                          ipf_is_valid_v6_frag(pkt)))) {
+
+            ipf_lock_lock(&ipf_lock);
+            if (!ipf_handle_frag(pkt, dl_type, zone, now, hash_basis)) {
+                dp_packet_batch_refill(pb, pkt, pb_idx);
+            }
+            ipf_lock_unlock(&ipf_lock);
+        } else {
+            dp_packet_batch_refill(pb, pkt, pb_idx);
+        }
+
+    }
+}
+
+/* In case of DPDK, a memory source check is done, as DPDK memory pool
+ * management has trouble dealing with multiple source types.  The
+ * check_source parameter is used to indicate when this check is needed. */
+static bool
+ipf_dp_packet_batch_add(struct dp_packet_batch *pb, struct dp_packet *pkt,
+                        bool check_source OVS_UNUSED)
+    OVS_REQUIRES(ipf_lock)
+{
+#ifdef DPDK_NETDEV
+    if ((pb->count >= NETDEV_MAX_BURST) ||
+        /* DPDK cannot handle multiple sources in a batch. */
+        (check_source && pb->count && pb->packets[0]->source != pkt->source)) {
+#else
+    if (pb->count >= NETDEV_MAX_BURST) {
+#endif
+        return false;
+    }
+
+    dp_packet_batch_add(pb, pkt);
+    return true;
+}
+
+/* Purges a list that cannot be sent, which should be rare.  One known
+ * reason is the mempool source check, which exists due to DPDK support,
+ * when packets are no longer being received on any port with a source
+ * matching the fragment.  Another reason is a race where all conntrack
+ * rules are unconfigured while some fragments are yet to be flushed.
+ * A list is purged only IPF_FRAG_LIST_PURGE_TIME_ADJ after it expires.
+ *
+ * Returns true if the list was purged. */
+static bool
+ipf_purge_list_check(struct ipf_list *ipf_list, long long now)
+    OVS_REQUIRES(ipf_lock)
+{
+    enum {
+        IPF_FRAG_LIST_PURGE_TIME_ADJ = 10000
+    };
+
+    if (now < ipf_list->expiration + IPF_FRAG_LIST_PURGE_TIME_ADJ) {
+        return false;
+    }
+
+    struct dp_packet *pkt;
+    while (ipf_list->last_sent_idx < ipf_list->last_inuse_idx) {
+        pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
+        dp_packet_delete(pkt);
+        atomic_count_dec(&nfrag);
+        ipf_list->last_sent_idx++;
+    }
+
+    COVERAGE_INC(ipf_stuck_frag_list_purged);
+    ipf_count(ipf_list->key.dl_type == htons(ETH_TYPE_IPV6),
+              IPF_COUNTER_NFRAGS_PURGED);
+    return true;
+}
+
+static bool
+ipf_send_frags_in_list(struct ipf_list *ipf_list, struct dp_packet_batch *pb,
+                       enum ipf_list_type list_type, bool v6, long long now)
+    OVS_REQUIRES(ipf_lock)
+{
+    if (ipf_purge_list_check(ipf_list, now)) {
+        return true;
+    }
+
+    struct dp_packet *pkt;
+    while (ipf_list->last_sent_idx < ipf_list->last_inuse_idx) {
+        pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
+        if (ipf_dp_packet_batch_add(pb, pkt, true)) {
+
+            ipf_list->last_sent_idx++;
+            atomic_count_dec(&nfrag);
+
+            if (list_type == IPF_FRAG_COMPLETED_LIST) {
+                ipf_count(v6, IPF_COUNTER_NFRAGS_COMPL_SENT);
+            } else {
+                ipf_count(v6, IPF_COUNTER_NFRAGS_EXPD_SENT);
+                pkt->md.ct_state = CS_INVALID;
+            }
+
+            if (ipf_list->last_sent_idx == ipf_list->last_inuse_idx) {
+                return true;
+            }
+        } else {
+            return false;
+        }
+    }
+    OVS_NOT_REACHED();
+}
+
+static void
+ipf_send_completed_frags(struct dp_packet_batch *pb, long long now, bool v6)
+{
+    if (ovs_list_is_empty(&frag_complete_list)) {
+        return;
+    }
+
+    ipf_lock_lock(&ipf_lock);
+    struct ipf_list *ipf_list, *next;
+
+    LIST_FOR_EACH_SAFE (ipf_list, next, list_node, &frag_complete_list) {
+        if (ipf_send_frags_in_list(ipf_list, pb, IPF_FRAG_COMPLETED_LIST,
+                                   v6, now)) {
+            ipf_completed_list_clean(ipf_list);
+        } else {
+            break;
+        }
+    }
+    ipf_lock_unlock(&ipf_lock);
+}
+
+static void
+ipf_send_expired_frags(struct dp_packet_batch *pb, long long now, bool v6)
+{
+    enum {
+        /* Very conservative, due to DoS risk. */
+        IPF_FRAG_LIST_MAX_EXPIRED = 1,
+    };
+
+
+    if (ovs_list_is_empty(&frag_exp_list)) {
+        return;
+    }
+
+    ipf_lock_lock(&ipf_lock);
+    struct ipf_list *ipf_list, *next;
+    size_t lists_removed = 0;
+
+    LIST_FOR_EACH_SAFE (ipf_list, next, list_node, &frag_exp_list) {
+        if (!(now > ipf_list->expiration) ||
+            lists_removed >= IPF_FRAG_LIST_MAX_EXPIRED) {
+            break;
+        }
+
+        if (ipf_send_frags_in_list(ipf_list, pb, IPF_FRAG_EXPIRY_LIST, v6,
+                                   now)) {
+            ipf_expiry_list_clean(ipf_list);
+            lists_removed++;
+        } else {
+            break;
+        }
+    }
+    ipf_lock_unlock(&ipf_lock);
+}
+
+static void
+ipf_execute_reass_pkts(struct dp_packet_batch *pb)
+{
+    if (ovs_list_is_empty(&reassembled_pkt_list)) {
+        return;
+    }
+
+    ipf_lock_lock(&ipf_lock);
+    struct reassembled_pkt *rp, *next;
+
+    LIST_FOR_EACH_SAFE (rp, next, rp_list_node, &reassembled_pkt_list) {
+        if (!rp->list->reass_execute_ctx &&
+            ipf_dp_packet_batch_add(pb, rp->pkt, false)) {
+            rp->list->reass_execute_ctx = rp->pkt;
+        }
+    }
+    ipf_lock_unlock(&ipf_lock);
+}
+
+static void
+ipf_post_execute_reass_pkts(struct dp_packet_batch *pb, bool v6)
+{
+    if (ovs_list_is_empty(&reassembled_pkt_list)) {
+        return;
+    }
+
+    ipf_lock_lock(&ipf_lock);
+    struct reassembled_pkt *rp, *next;
+
+    LIST_FOR_EACH_SAFE (rp, next, rp_list_node, &reassembled_pkt_list) {
+        const size_t pb_cnt = dp_packet_batch_size(pb);
+        int pb_idx;
+        struct dp_packet *pkt;
+        /* Inner batch loop is constant time since batch size is <=
+         * NETDEV_MAX_BURST. */
+        DP_PACKET_BATCH_REFILL_FOR_EACH (pb_idx, pb_cnt, pkt, pb) {
+            if (pkt == rp->list->reass_execute_ctx) {
+                for (int i = 0; i <= rp->list->last_inuse_idx; i++) {
+                    rp->list->frag_list[i].pkt->md.ct_label = pkt->md.ct_label;
+                    rp->list->frag_list[i].pkt->md.ct_mark = pkt->md.ct_mark;
+                    rp->list->frag_list[i].pkt->md.ct_state = pkt->md.ct_state;
+                    rp->list->frag_list[i].pkt->md.ct_zone = pkt->md.ct_zone;
+                    rp->list->frag_list[i].pkt->md.ct_orig_tuple_ipv6 =
+                        pkt->md.ct_orig_tuple_ipv6;
+                    if (pkt->md.ct_orig_tuple_ipv6) {
+                        rp->list->frag_list[i].pkt->md.ct_orig_tuple.ipv6 =
+                            pkt->md.ct_orig_tuple.ipv6;
+                    } else {
+                        rp->list->frag_list[i].pkt->md.ct_orig_tuple.ipv4 =
+                            pkt->md.ct_orig_tuple.ipv4;
+                    }
+                }
+
+                const char *tail_frag =
+                    dp_packet_tail(rp->list->frag_list[0].pkt);
+                uint8_t pad_frag =
+                    dp_packet_l2_pad_size(rp->list->frag_list[0].pkt);
+
+                void *l4_frag = dp_packet_l4(rp->list->frag_list[0].pkt);
+                void *l4_reass = dp_packet_l4(pkt);
+                memcpy(l4_frag, l4_reass,
+                       tail_frag - (char *) l4_frag - pad_frag);
+
+                if (v6) {
+                    struct ovs_16aligned_ip6_hdr *l3_frag =
+                        dp_packet_l3(rp->list->frag_list[0].pkt);
+                    struct ovs_16aligned_ip6_hdr *l3_reass =
+                        dp_packet_l3(pkt);
+                    l3_frag->ip6_src = l3_reass->ip6_src;
+                    l3_frag->ip6_dst = l3_reass->ip6_dst;
+                } else {
+                    struct ip_header *l3_frag =
+                        dp_packet_l3(rp->list->frag_list[0].pkt);
+                    struct ip_header *l3_reass = dp_packet_l3(pkt);
+                    ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
+                    ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
+                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
+                                                     frag_ip, reass_ip);
+                    l3_frag->ip_src = l3_reass->ip_src;
+
+                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
+                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
+                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
+                                                     frag_ip, reass_ip);
+                    l3_frag->ip_dst = l3_reass->ip_dst;
+                }
+
+                ipf_completed_list_add(rp->list);
+                ipf_reassembled_list_remove(rp);
+                dp_packet_delete(rp->pkt);
+                free(rp);
+            } else {
+                dp_packet_batch_refill(pb, pkt, pb_idx);
+            }
+        }
+    }
+    ipf_lock_unlock(&ipf_lock);
+}
+
+/* Extracts any fragments from the batch and reassembles them when a
+ * complete packet is received.  An attempt is made to add each completed
+ * packet to the batch so that it is sent through conntrack. */
+void
+ipf_preprocess_conntrack(struct dp_packet_batch *pb, long long now,
+                         ovs_be16 dl_type, uint16_t zone, uint32_t hash_basis)
+{
+    if (ipf_get_enabled()) {
+        ipf_extract_frags_from_batch(pb, dl_type, zone, now, hash_basis);
+    }
+
+    if (ipf_get_enabled() || atomic_count_get(&nfrag)) {
+        ipf_execute_reass_pkts(pb);
+    }
+}
+
+/* Updates fragments based on the processing of the reassembled packet sent
+ * through conntrack and adds these fragments to any batches seen.  Expired
+ * fragments are marked as invalid and also added to the batches seen
+ * with low priority.  Reassembled packets are freed. */
+void
+ipf_postprocess_conntrack(struct dp_packet_batch *pb, long long now,
+                          ovs_be16 dl_type)
+{
+    if (ipf_get_enabled() || atomic_count_get(&nfrag)) {
+        bool v6 = dl_type == htons(ETH_TYPE_IPV6);
+        ipf_post_execute_reass_pkts(pb, v6);
+        ipf_send_completed_frags(pb, now, v6);
+        ipf_send_expired_frags(pb, now, v6);
+    }
+}
+
+static void *
+ipf_clean_thread_main(void *f OVS_UNUSED)
+{
+    enum {
+        IPF_FRAG_LIST_CLEAN_TIMEOUT = 60000,
+    };
+
+    while (!latch_is_set(&ipf_clean_thread_exit)) {
+
+        long long now = time_msec();
+
+        if (!ovs_list_is_empty(&frag_exp_list) ||
+            !ovs_list_is_empty(&frag_complete_list)) {
+
+            ipf_lock_lock(&ipf_lock);
+
+            struct ipf_list *ipf_list, *next;
+            LIST_FOR_EACH_SAFE (ipf_list, next, list_node, &frag_exp_list) {
+                if (ipf_purge_list_check(ipf_list, now)) {
+                    ipf_expiry_list_clean(ipf_list);
+                }
+            }
+
+            LIST_FOR_EACH_SAFE (ipf_list, next, list_node,
+                                &frag_complete_list) {
+                if (ipf_purge_list_check(ipf_list, now)) {
+                    ipf_completed_list_clean(ipf_list);
+                }
+            }
+
+            ipf_lock_unlock(&ipf_lock);
+        }
+
+        poll_timer_wait_until(now + IPF_FRAG_LIST_CLEAN_TIMEOUT);
+        latch_wait(&ipf_clean_thread_exit);
+        poll_block();
+    }
+
+    return NULL;
+}
+
+void
+ipf_init(void)
+{
+    ipf_lock_init(&ipf_lock);
+    ipf_lock_lock(&ipf_lock);
+    hmap_init(&frag_lists);
+    ovs_list_init(&frag_exp_list);
+    ovs_list_init(&frag_complete_list);
+    ovs_list_init(&reassembled_pkt_list);
+    atomic_init(&min_v4_frag_size, IPF_V4_FRAG_SIZE_MIN_DEF);
+    atomic_init(&min_v6_frag_size, IPF_V6_FRAG_SIZE_MIN_DEF);
+    max_v4_frag_list_size = DIV_ROUND_UP(
+        IPV4_PACKET_MAX_SIZE - IPV4_PACKET_MAX_HDR_SIZE,
+        min_v4_frag_size - IPV4_PACKET_MAX_HDR_SIZE);
+    ipf_lock_unlock(&ipf_lock);
+    atomic_count_init(&nfrag, 0);
+    atomic_init(&n4frag_accepted, 0);
+    atomic_init(&n4frag_completed_sent, 0);
+    atomic_init(&n4frag_expired_sent, 0);
+    atomic_init(&n4frag_too_small, 0);
+    atomic_init(&n4frag_overlap, 0);
+    atomic_init(&n4frag_purged, 0);
+    atomic_init(&n6frag_accepted, 0);
+    atomic_init(&n6frag_completed_sent, 0);
+    atomic_init(&n6frag_expired_sent, 0);
+    atomic_init(&n6frag_too_small, 0);
+    atomic_init(&n6frag_overlap, 0);
+    atomic_init(&n6frag_purged, 0);
+    atomic_init(&nfrag_max, IPF_MAX_FRAGS_DEFAULT);
+    atomic_init(&ifp_v4_enabled, true);
+    atomic_init(&ifp_v6_enabled, true);
+    latch_init(&ipf_clean_thread_exit);
+    ipf_clean_thread = ovs_thread_create("ipf_clean",
+                                         ipf_clean_thread_main, NULL);
+}
+
+void
+ipf_destroy(void)
+{
+    latch_set(&ipf_clean_thread_exit);
+    pthread_join(ipf_clean_thread, NULL);
+    latch_destroy(&ipf_clean_thread_exit);
+
+    ipf_lock_lock(&ipf_lock);
+
+    struct ipf_list *ipf_list;
+    HMAP_FOR_EACH_POP (ipf_list, node, &frag_lists) {
+        struct dp_packet *pkt;
+        while (ipf_list->last_sent_idx < ipf_list->last_inuse_idx) {
+            pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
+            dp_packet_delete(pkt);
+            atomic_count_dec(&nfrag);
+            ipf_list->last_sent_idx++;
+        }
+        free(ipf_list->frag_list);
+        free(ipf_list);
+    }
+
+    if (atomic_count_get(&nfrag)) {
+        VLOG_WARN("ipf destroy with non-zero fragment count");
+    }
+
+    struct reassembled_pkt *rp;
+    LIST_FOR_EACH_POP (rp, rp_list_node, &reassembled_pkt_list) {
+        dp_packet_delete(rp->pkt);
+        free(rp);
+    }
+
+    hmap_destroy(&frag_lists);
+    ovs_list_poison(&frag_exp_list);
+    ovs_list_poison(&frag_complete_list);
+    ovs_list_poison(&reassembled_pkt_list);
+    ipf_lock_unlock(&ipf_lock);
+    ipf_lock_destroy(&ipf_lock);
+}
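
Editorial note on the IPv4 branch of ipf_post_execute_reass_pkts() above: when the source and destination addresses of the reassembled packet are copied back into the first fragment, the fragment's header checksum is adjusted incrementally with recalc_csum32() rather than recomputed over the whole header. The following is a minimal standalone sketch of that technique, assuming the RFC 1624 (equation 3) update rule. The names csum_finish16, csum_full, csum_update16 and csum_update32 are hypothetical and exist only for this illustration; this is not the OVS implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Folds a 32-bit accumulator into 16 bits and returns its complement. */
static uint16_t
csum_finish16(uint32_t sum)
{
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t) ~sum;
}

/* Computes the one's complement checksum over 'n' 16-bit words from
 * scratch.  The checksum field itself must be zero on entry. */
static uint16_t
csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        sum += words[i];
    }
    return csum_finish16(sum);
}

/* RFC 1624, equation 3: HC' = ~(~HC + ~m + m'), where 'm' is the old
 * 16-bit word and m' is the word that replaces it. */
static uint16_t
csum_update16(uint16_t old_csum, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t) ~old_csum;

    sum += (uint16_t) ~old_word;
    sum += new_word;
    return csum_finish16(sum);
}

/* Applies the 16-bit update to both halves of a 32-bit address
 * (hypothetical helper for this sketch). */
static uint16_t
csum_update32(uint16_t old_csum, uint32_t old_addr, uint32_t new_addr)
{
    old_csum = csum_update16(old_csum, old_addr & 0xffff, new_addr & 0xffff);
    return csum_update16(old_csum, old_addr >> 16, new_addr >> 16);
}
```

The update is applied once per rewritten address, so changing both ip_src and ip_dst takes two calls, as in the patch. The incremental form only touches the words that changed, which is why the rest of the fragment's header does not need to be re-read.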
diff --git a/lib/ipf.h b/lib/ipf.h
new file mode 100644
index 0000000..040031f
--- /dev/null
+++ b/lib/ipf.h
@@ -0,0 +1,33 @@ 
+/*
+ * Copyright (c) 2018 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef IPF_H
+#define IPF_H 1
+
+#include "dp-packet.h"
+#include "openvswitch/types.h"
+
+void ipf_preprocess_conntrack(struct dp_packet_batch *pb, long long now,
+                              ovs_be16 dl_type, uint16_t zone,
+                              uint32_t hash_basis);
+
+void ipf_postprocess_conntrack(struct dp_packet_batch *pb, long long now,
+                               ovs_be16 dl_type);
+
+void ipf_init(void);
+void ipf_destroy(void);
+
+#endif /* ipf.h */
diff --git a/tests/system-kmod-macros.at b/tests/system-kmod-macros.at
index 3296d64..3fbead8 100644
--- a/tests/system-kmod-macros.at
+++ b/tests/system-kmod-macros.at
@@ -77,12 +77,6 @@  m4_define([CHECK_CONNTRACK],
 #
 m4_define([CHECK_CONNTRACK_ALG])
 
-# CHECK_CONNTRACK_FRAG()
-#
-# Perform requirements checks for running conntrack fragmentations tests.
-# The kernel always supports fragmentation, so no check is needed.
-m4_define([CHECK_CONNTRACK_FRAG])
-
 # CHECK_CONNTRACK_LOCAL_STACK()
 #
 # Perform requirements checks for running conntrack tests with local stack.
@@ -91,6 +85,10 @@  m4_define([CHECK_CONNTRACK_FRAG])
 # needed.
 m4_define([CHECK_CONNTRACK_LOCAL_STACK])
 
+# CHECK_CONNTRACK_SMALL_FRAG()
+#
+m4_define([CHECK_CONNTRACK_SMALL_FRAG])
+
 # CHECK_CONNTRACK_FRAG_OVERLAP()
 #
 # The kernel does not support overlapping fragments checking.
diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index 840fea9..c4f6e47 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -2347,7 +2347,6 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv4 fragmentation])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
@@ -2381,7 +2380,6 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv4 fragmentation expiry])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
@@ -2412,7 +2410,6 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv4 fragmentation + vlan])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
@@ -2448,7 +2445,6 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv4 fragmentation + cvlan])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
 OVS_TRAFFIC_VSWITCHD_START([set Open_vSwitch . other_config:vlan-limit=0])
 OVS_CHECK_8021AD()
 
@@ -2523,7 +2519,7 @@  AT_CLEANUP
 dnl Uses same first fragment as above 'incomplete reassembled packet' test.
 AT_SETUP([conntrack - IPv4 fragmentation with fragments specified])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
+CHECK_CONNTRACK_SMALL_FRAG()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
@@ -2547,7 +2543,7 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv4 fragmentation out of order])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
+CHECK_CONNTRACK_SMALL_FRAG()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
@@ -2571,7 +2567,7 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv4 fragmentation overlapping fragments by 1 octet])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
+CHECK_CONNTRACK_SMALL_FRAG()
 CHECK_CONNTRACK_FRAG_OVERLAP()
 OVS_TRAFFIC_VSWITCHD_START()
 
@@ -2595,7 +2591,7 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv4 fragmentation overlapping fragments by 1 octet out of order])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
+CHECK_CONNTRACK_SMALL_FRAG()
 CHECK_CONNTRACK_FRAG_OVERLAP()
 OVS_TRAFFIC_VSWITCHD_START()
 
@@ -2619,7 +2615,6 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv6 fragmentation])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
@@ -2659,7 +2654,6 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv6 fragmentation expiry])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
@@ -2700,7 +2694,6 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv6 fragmentation + vlan])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
@@ -2743,7 +2736,6 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv6 fragmentation + cvlan])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
 OVS_TRAFFIC_VSWITCHD_START([set Open_vSwitch . other_config:vlan-limit=0])
 OVS_CHECK_8021AD()
 
@@ -2818,7 +2810,7 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv6 fragmentation with fragments specified])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
+CHECK_CONNTRACK_SMALL_FRAG()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
@@ -2842,7 +2834,7 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv6 fragmentation out of order])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
+CHECK_CONNTRACK_SMALL_FRAG()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
@@ -2866,7 +2858,7 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
+CHECK_CONNTRACK_SMALL_FRAG()
 CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
 OVS_TRAFFIC_VSWITCHD_START()
 
@@ -2892,7 +2884,7 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers + out of order])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
+CHECK_CONNTRACK_SMALL_FRAG()
 CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
 OVS_TRAFFIC_VSWITCHD_START()
 
@@ -2918,7 +2910,7 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers 2])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
+CHECK_CONNTRACK_SMALL_FRAG()
 CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
 OVS_TRAFFIC_VSWITCHD_START()
 
@@ -2944,7 +2936,7 @@  AT_CLEANUP
 
 AT_SETUP([conntrack - IPv6 fragmentation, multiple extension headers 2 + out of order])
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
+CHECK_CONNTRACK_SMALL_FRAG()
 CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
 OVS_TRAFFIC_VSWITCHD_START()
 
@@ -2971,7 +2963,6 @@  AT_CLEANUP
 AT_SETUP([conntrack - Fragmentation over vxlan])
 OVS_CHECK_VXLAN()
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
 CHECK_CONNTRACK_LOCAL_STACK()
 
 OVS_TRAFFIC_VSWITCHD_START()
@@ -3024,7 +3015,6 @@  AT_CLEANUP
 AT_SETUP([conntrack - IPv6 Fragmentation over vxlan])
 OVS_CHECK_VXLAN()
 CHECK_CONNTRACK()
-CHECK_CONNTRACK_FRAG()
 CHECK_CONNTRACK_LOCAL_STACK()
 
 OVS_TRAFFIC_VSWITCHD_START()
diff --git a/tests/system-userspace-macros.at b/tests/system-userspace-macros.at
index 27bde8b..219c046 100644
--- a/tests/system-userspace-macros.at
+++ b/tests/system-userspace-macros.at
@@ -73,15 +73,6 @@  m4_define([CHECK_CONNTRACK],
 #
 m4_define([CHECK_CONNTRACK_ALG])
 
-# CHECK_CONNTRACK_FRAG()
-#
-# Perform requirements checks for running conntrack fragmentations tests.
-# The userspace doesn't support fragmentation yet, so skip the tests.
-m4_define([CHECK_CONNTRACK_FRAG],
-[
-    AT_SKIP_IF([:])
-])
-
 # CHECK_CONNTRACK_LOCAL_STACK()
 #
 # Perform requirements checks for running conntrack tests with local stack.
@@ -93,22 +84,26 @@  m4_define([CHECK_CONNTRACK_LOCAL_STACK],
     AT_SKIP_IF([:])
 ])
 
-# CHECK_CONNTRACK_FRAG_OVERLAP()
+# CHECK_CONNTRACK_SMALL_FRAG()
 #
-# The userspace datapath does not support fragments yet.
-m4_define([CHECK_CONNTRACK_FRAG_OVERLAP],
+m4_define([CHECK_CONNTRACK_SMALL_FRAG],
 [
     AT_SKIP_IF([:])
 ])
 
-# CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
+# CHECK_CONNTRACK_FRAG_OVERLAP()
 #
 # The userspace datapath does not support fragments yet.
-m4_define([CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN],
+m4_define([CHECK_CONNTRACK_FRAG_OVERLAP],
 [
     AT_SKIP_IF([:])
 ])
 
+# CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN()
+#
+# The userspace datapath supports fragments with multiple extension headers.
+m4_define([CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN])
+
 # CHECK_CONNTRACK_NAT()
 #
 # Perform requirements checks for running conntrack NAT tests. The userspace