diff mbox

[ovs-dev,3/8] Add support for connection tracking.

Message ID 1441850422-33750-4-git-send-email-joestringer@nicira.com
State Superseded
Headers show

Commit Message

Joe Stringer Sept. 10, 2015, 2 a.m. UTC
This patch adds a new action and fields to OVS that allow connection
tracking to be performed. This support works in conjunction with the
Linux kernel support merged into the Linux-4.3 development cycle.

The new 'ct' action sends this flow through the connection tracker,
which accepts the following parameters initally:

- "commit": Connections that execute ct(commit) should be remembered,
  allowing future packets in the connection to be recognized as part of
  an "established" (est) connection, packets in the reply (rpl)
  direction, or related to an existing connection (rel).
- "zone=[u16|NXM]": Perform connection tracking in the zone specified.
  Each zone is an independent connection tracking context. If the
  "commit" parameter is used, the connection will only be remembered in
  the specified zone, and not in other zones. This is 0 by default.
- "table=NUMBER": This makes the conntrack action act like the "output"
  action, with a copy of the packet sent to the connection tracker.
  When the connection tracker has finished processing, it may return
  and resume pipeline processing in the specified table, with connection
  tracking fields populated (initially ct_state and ct_zone). This table
  is strongly recommended to be greater than the current table, to
  prevent loops.

When the "table" option is used, the packet that continues processing in
the specified table will have the ct_state populated. The ct_state may
have any of the following flags set:

- Tracked (trk): Connection tracking has occurred.
- Reply (rpl): The flow is in the reply direction.
- Invalid (inv): The connection tracker couldn't identify the connection.
- New (new): This is the beginning of a new connection.
- Established (est): This is part of an already existing connection.
- Related (rel): This connection is related to an existing connection.

For more information, consult the ovs-ofctl(8) man pages.

Below is a simple example flow table to allow outbound TCP traffic from
port 1 and drop traffic from port 2 that was not initiated by port 1:

    ovs-ofctl del-flows br0
    ovs-ofctl add-flow br0 "arp,action=normal"
    ovs-ofctl add-flow br0 \
        "in_port=1,conn_state=-trk,tcp,action=ct(commit,zone=9),2"
    ovs-ofctl add-flow br0 \
        "in_port=2,conn_state=-trk,tcp,action=ct(table=1,zone=9)"
    ovs-ofctl add-flow br0 \
        "table=1,in_port=2,conn_state=+trk+est-new,tcp,action=1"
    ovs-ofctl add-flow br0 \
        "table=1,in_port=2,conn_state=+trk-est+new,tcp,action=drop"

Based on original design by Justin Pettit, contributions from Thomas
Graf and Daniele Di Proietto.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
---
Changes since RFCv4:
- OpenFlow and Netlink wire formats have been rearranged.
- Zone may be specified from any NXM field.
- NXM field values for ct_state, etc are different.
- Extend the testsuite.
- Rebase.
---
 NEWS                                              |   3 +
 build-aux/extract-ofp-fields                      |   1 +
 datapath/flow_netlink.c                           |   2 +-
 datapath/linux/compat/include/linux/openvswitch.h |  36 ++
 lib/dpif-netdev.c                                 |  12 +
 lib/dpif.c                                        |   1 +
 lib/flow.c                                        |  42 ++
 lib/flow.h                                        |   6 +-
 lib/match.c                                       |  55 +++
 lib/match.h                                       |   3 +
 lib/meta-flow.c                                   |  77 ++++
 lib/meta-flow.h                                   |  46 +++
 lib/nx-match.c                                    |  10 +
 lib/odp-execute.c                                 |   6 +
 lib/odp-util.c                                    | 213 ++++++++++
 lib/odp-util.h                                    |   8 +-
 lib/ofp-actions.c                                 | 260 ++++++++++++
 lib/ofp-actions.h                                 |  47 +++
 lib/packets.c                                     |  21 +
 lib/packets.h                                     |  31 +-
 ofproto/ofproto-dpif-rid.c                        |   2 +
 ofproto/ofproto-dpif-rid.h                        |   1 +
 ofproto/ofproto-dpif-sflow.c                      |   3 +
 ofproto/ofproto-dpif-xlate.c                      |  90 ++++-
 ofproto/ofproto-dpif.c                            |  70 ++++
 ofproto/ofproto-unixctl.man                       |   4 +
 tests/atlocal.in                                  |   7 +
 tests/automake.mk                                 |   1 +
 tests/dpif-netdev.at                              |   2 +-
 tests/odp.at                                      |  12 +
 tests/ofproto-dpif.at                             |   4 +-
 tests/ofproto.at                                  |   2 +
 tests/system-common-macros.at                     |  18 +
 tests/system-kmod-macros.at                       |  16 +
 tests/system-traffic.at                           | 469 ++++++++++++++++++++++
 tests/system-userspace-macros.at                  |   9 +
 tests/test-l7.py                                  |  72 ++++
 tests/test-odp.c                                  |   3 +
 utilities/ovs-ofctl.8.in                          |  85 ++++
 39 files changed, 1735 insertions(+), 15 deletions(-)
 create mode 100755 tests/test-l7.py

Comments

Ben Pfaff Sept. 10, 2015, 6:05 p.m. UTC | #1
Who do you think should review this?
Joe Stringer Sept. 10, 2015, 10:23 p.m. UTC | #2
On 10 September 2015 at 11:05, Ben Pfaff <blp@nicira.com> wrote:
> Who do you think should review this?

I think that either Justin or Jarno should probably review this.
Jarno Rajahalme Sept. 10, 2015, 11:13 p.m. UTC | #3
> On Sep 10, 2015, at 3:23 PM, Joe Stringer <joestringer@nicira.com> wrote:
> 
> On 10 September 2015 at 11:05, Ben Pfaff <blp@nicira.com> wrote:
>> Who do you think should review this?
> 
> I think that either Justin or Jarno should probably review this.

I’ll review these, but Justin can chime in as well :-)

  Jarno
Jarno Rajahalme Sept. 11, 2015, 2:03 a.m. UTC | #4
> On Sep 9, 2015, at 7:00 PM, Joe Stringer <joestringer@nicira.com> wrote:
> 
> 

(snip)

> diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
> index 578cd88..69bdf32 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -343,6 +343,8 @@ enum ovs_key_attr {
> 	OVS_KEY_ATTR_MPLS,      /* array of struct ovs_key_mpls.
> 				 * The implementation may restrict
> 				 * the accepted length of the array. */
> +	OVS_KEY_ATTR_CT_STATE,	/* u8 bitmask of OVS_CS_F_* */
> +	OVS_KEY_ATTR_CT_ZONE,	/* u16 connection tracking zone. */
> 
> #ifdef __KERNEL__
> 	/* Only used within kernel data path. */
> @@ -456,6 +458,15 @@ struct ovs_key_nd {
> 	__u8	nd_tll[ETH_ALEN];
> };
> 
> +/* OVS_KEY_ATTR_CT_STATE flags */
> +#define OVS_CS_F_NEW               0x01 /* Beginning of a new connection. */
> +#define OVS_CS_F_ESTABLISHED       0x02 /* Part of an existing connection. */
> +#define OVS_CS_F_RELATED           0x04 /* Related to an established
> +					 * connection. */
> +#define OVS_CS_F_INVALID           0x20 /* Could not track connection. */
> +#define OVS_CS_F_REPLY_DIR         0x40 /* Flow is in the reply direction. */
> +#define OVS_CS_F_TRACKED           0x80 /* Conntrack has occurred. */
> +
> /**
>  * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
>  * @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
> @@ -642,6 +653,28 @@ struct ovs_action_push_tnl {
> #endif
> 
> /**
> + * enum ovs_ct_attr - Attributes for %OVS_ACTION_ATTR_CT action.
> + * @OVS_CT_ATTR_FLAGS: u32 connection tracking flags.
> + * @OVS_CT_ATTR_ZONE: u16 connection tracking zone.
> + */
> +enum ovs_ct_attr {
> +	OVS_CT_ATTR_UNSPEC,
> +	OVS_CT_ATTR_FLAGS,      /* u32 bitmask of OVS_CT_F_*. */
> +	OVS_CT_ATTR_ZONE,       /* u16 zone id. */
> +	__OVS_CT_ATTR_MAX
> +};
> +
> +#define OVS_CT_ATTR_MAX (__OVS_CT_ATTR_MAX - 1)
> +
> +/*
> + * OVS_CT_ATTR_FLAGS flags - bitmask of %OVS_CT_F_*
> + * @OVS_CT_F_COMMIT: Commits the flow to the conntrack table. This allows
> + * future packets for the same connection to be identified as 'established'
> + * or 'related'.
> + */
> +#define OVS_CT_F_COMMIT		0x01
> +
> +/**
>  * enum ovs_action_attr - Action types.
>  *
>  * @OVS_ACTION_ATTR_OUTPUT: Output packet to port.
> @@ -672,6 +705,8 @@ struct ovs_action_push_tnl {
>  * indicate the new packet contents. This could potentially still be
>  * %ETH_P_MPLS if the resulting MPLS label stack is not empty.  If there
>  * is no MPLS label stack, as determined by ethertype, no action is taken.
> + * @OVS_ACTION_ATTR_CT: Track the connection. Populate the conntrack-related
> + * entries in the flow key.
>  *
>  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
>  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
> @@ -702,6 +737,7 @@ enum ovs_action_attr {
> 				       * data immediately followed by a mask.
> 				       * The data must be zero for the unmasked
> 				       * bits. */
> +	OVS_ACTION_ATTR_CT,           /* One nested OVS_CT_ATTR_* . */
> 

I guess this should be “Nested OVS_CT_ATTR_*”, as there can be multiple nested attributes?

> #ifndef __KERNEL__
> 	OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
> 

(snip)

> diff --git a/lib/flow.h b/lib/flow.h
> index d8632ff..2d6d3ee 100644
> --- a/lib/flow.h
> +++ b/lib/flow.h
> @@ -103,8 +103,10 @@ struct flow {
>     union flow_in_port in_port; /* Input port.*/
>     uint32_t recirc_id;         /* Must be exact match. */
>     uint32_t conj_id;           /* Conjunction ID. */
> +    uint16_t ct_zone;           /* Connection Zone. */
> +    uint8_t ct_state;           /* Connection state. */
> +    uint8_t pad1[3];            /* Pad to 64 bits. */
>     ofp_port_t actset_output;   /* Output port in action set. */
> -    uint8_t pad1[6];            /* Pad to 64 bits. */
> 

It would be slightly better to put ct_zone and ct_state right after recirc_id. This would make miniflows smaller, as the recirc_id, ct_zone, and ct_state would all be in the same miniflow data unit (conj_id and actset_output are userspace-only fields, so they would be zeroes for datapath miniflows).

We should probably make ct_state a 16 bit field (uint16_t) in struct flow already, for future-proofness, even if the kernel interface is still 8 bits.

Also, you should bump FLOW_WC_SEQ to make sure all the relevant places are updated for the new fields.

>     /* L2, Order the same as in the Ethernet header! (64-bit aligned) */
>     struct eth_addr dl_dst;     /* Ethernet destination address. */
> @@ -980,6 +982,8 @@ pkt_metadata_from_flow(struct pkt_metadata *md, const struct flow *flow)
>     md->skb_priority = flow->skb_priority;
>     md->pkt_mark = flow->pkt_mark;
>     md->in_port = flow->in_port;
> +    md->ct_state = flow->ct_state;
> +    md->ct_zone = flow->ct_zone;
> }
> 
> static inline bool is_ip_any(const struct flow *flow)
> 

(snip)

> diff --git a/lib/meta-flow.h b/lib/meta-flow.h
> index 02272ef..44a29e9 100644
> --- a/lib/meta-flow.h
> +++ b/lib/meta-flow.h
> @@ -703,6 +703,51 @@ enum OVS_PACKED_ENUM mf_field_id {
>      */
>     MFF_PKT_MARK,
> 
> +    /* "ct_state".
> +     *
> +     * Connection tracking state.  The field is populated by the NXAST_CT
> +     * action. The following bit values describe the state of the connection:
> +     *
> +     *   - New (0x01): This is the beginning of a new connection.
> +     *   - Established (0x02): This is part of an already existing connection.
> +     *   - Related (0x04): This is a separate connection that is related to an
> +     *                     existing connection.
> +     *   - Invalid (0x20): This flow could not be associated with a connection.
> +     *                     This could be set for a variety of reasons,
> +     *                     including (but not limited to):
> +     *                     - L3/L4 protocol handler is not loaded/unavailable.
> +     *                     - L3/L4 protocol handler determines that the packet
> +     *                       is malformed or invalid for the current FSM stage.
> +     *                     - Packets are unexpected length for protocol.
> +     *   - Reply (0x40): This flow is in the reply direction, ie it did not
> +     *                   initiate the connection.
> +     *   - Tracked (0x80): Connection tracking has occurred.
> +     *

Have we documented anywhere which of the bits are mutually exclusive and how. E.g.,
- If -trk, all the other bits are also zeroes (also invalid?)
- new and established are mutually exclusive (+new+est is not possible)
- new and reply are mutually exclusive (reply direction is defined as opposite of the initial +new)
- related and new can be set at the same time (but does related remain set for the duration of the connection, or is it also mutually exclusive with established?)
- apparently related and reply are not mutually exclusive, but when are they set at the same time?
- are invalid and tracked mutually exclusive?

Also, what are the possible transitions from one state to another?

IMO this information is invaluable for anyone crafting the flows using these bits.

> +     * Type: u8.
> +     * Maskable: bitwise.
> +     * Formatting: conn state.
> +     * Prerequisites: none.
> +     * Access: read-only.
> +     * NXM: NXM_NX_CT_STATE(105) since v2.5.
> +     * OXM: none.
> +     */
> +    MFF_CT_STATE,
> +
> +    /* "ct_zone".
> +     *
> +     * Connection tracking zone.  The field is populated by the
> +     * NXAST_CT action.
> +     *
> +     * Type: be16.
> +     * Maskable: no.
> +     * Formatting: hexadecimal.
> +     * Prerequisites: none.
> +     * Access: read-only.
> +     * NXM: NXM_NX_CT_ZONE(106) since v2.5.
> +     * OXM: none.
> +     */
> +    MFF_CT_ZONE,
> +

(snip)

> diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c
> index c1e46b5..6ea421f 100644
> --- a/lib/ofp-actions.c
> +++ b/lib/ofp-actions.c
> @@ -286,6 +286,9 @@ enum ofp_raw_action_type {
>     /* NX1.0+(34): struct nx_action_conjunction. */
>     NXAST_RAW_CONJUNCTION,
> 
> +    /* NX1.0+(35): struct nx_action_conntrack, ... */
> +    NXAST_RAW_CT,
> +
> /* ## ------------------ ## */
> /* ## Debugging actions. ## */
> /* ## ------------------ ## */
> @@ -346,6 +349,10 @@ static void *ofpact_put_raw(struct ofpbuf *, enum ofp_version,
> static char *OVS_WARN_UNUSED_RESULT ofpacts_parse(
>     char *str, struct ofpbuf *ofpacts, enum ofputil_protocol *usable_protocols,
>     bool allow_instructions, enum ofpact_type outer_action);
> +static enum ofperr ofpacts_pull_openflow_actions__(
> +    struct ofpbuf *openflow, unsigned int actions_len,
> +    enum ofp_version version, uint32_t allowed_ovsinsts,
> +    struct ofpbuf *ofpacts, enum ofpact_type outer_action);
> 
> /* Pull off existing actions or instructions. Used by nesting actions to keep
>  * ofpacts_parse() oblivious of actions nesting. */
> @@ -4448,6 +4455,245 @@ format_DEBUG_RECIRC(const struct ofpact_null *a OVS_UNUSED, struct ds *s)
> {
>     ds_put_cstr(s, "debug_recirc");
> }
> +
> +/* Action structure for NXAST_CT.
> + *
> + * Pass traffic to the connection tracker.
> + *
> + * The "zone" specifies a context within which the tracking is done:
> + *
> + *      If 'zone_src' is nonzero, this specifies that the zone should be
> + *      sourced from a field zone_src[ofs:ofs+nbits]. The format and semantics
> + *      of 'zone_src' and 'zone_ofs_nbits' are similar to those for the
> + *      NXAST_REG_LOAD action. The acceptable nxm_header values for 'zone_src'
> + *      are the same as the acceptable nxm_header values for the 'src' field of
> + *      NXAST_REG_MOVE.
> + *
> + *      If 'zone_src' is zero, then the value of 'zone_imm' will be used as the
> + *      connection tracking zone.
> + *
> + * If "recirc_table" has a value other than NX_CT_RECIRC_NONE, then this field
> + * specifies that the packet should be logically cloned after the packet is

The packet is logically cloned *before* it is sent through the connection tracker,
as only the cloned packet gets the ct_state and ct_zone, right?

> + * sent through the connection tracker, and pipeline processing for the cloned
> + * packet will continue in the OpenFlow table specified by this value. The
> + * NXM_NX_CT_* fields will be populated for the subsequent processing. It is
> + * strongly recommended that this table is later than the current table, to
> + * prevent loops.
> + *

‘alg’ and ofpacts are introduced in later patches, and I guess it is too much work to not deal with them here yet?

Anyway, a thought about the nested actions just popped in my mind: The fact that the actions are nested means that they all must be produced by a single OpenFlow rule. I.e., the same rule that does conntrack must also set the labels, etc. This means that either there will be a potentially huge number of rules with conntrack actions, or the values used must be passed through registers (which is kind of redundant). Also, a register may not be wide enough for the data in question (e.g., labels).

> + * A standard "resubmit" action is not sufficient to populate NXM_NX_CT_*
> + * fields, since connection tracking occurs outside of the classifier.
> + */
> +struct nx_action_conntrack {
> +    ovs_be16 type;              /* OFPAT_VENDOR. */
> +    ovs_be16 len;               /* At least 24. */
> +    ovs_be32 vendor;            /* NX_VENDOR_ID. */
> +    ovs_be16 subtype;           /* NXAST_CT. */
> +    ovs_be16 flags;             /* Zero or more NX_CT_F_* flags.
> +                                 * Unspecified flag bits must be zero. */
> +    ovs_be32 zone_src;          /* Connection tracking context. */
> +    union {
> +        ovs_be16 zone_ofs_nbits;/* Range to use from source field. */
> +        ovs_be16 zone_imm;      /* Immediate value for zone. */
> +    };
> +    uint8_t recirc_table;       /* Recirculate to a specific table, or
> +                                   NX_CT_RECIRC_NONE for no recirculation. */
> +    uint8_t pad[3];             /* Zeroes */
> +    ovs_be16 alg;               /* Well-known port number for the protocol.
> +                                 * 0 indicates no ALG is required. */
> +    /* Followed by a sequence of ofpact elements, until the end of the action
> +     * is reached. */
> +};
> +OFP_ASSERT(sizeof(struct nx_action_conntrack) == 24);
> +
> +static enum ofperr
> +decode_ct_zone(const struct nx_action_conntrack *nac,
> +               struct ofpact_conntrack *out)
> +{
> +    if (nac->zone_src) {
> +        enum ofperr error;
> +
> +        out->zone_src.field = mf_from_nxm_header(ntohl(nac->zone_src));
> +        out->zone_src.ofs = nxm_decode_ofs(nac->zone_ofs_nbits);
> +        out->zone_src.n_bits = nxm_decode_n_bits(nac->zone_ofs_nbits);
> +        error = mf_check_src(&out->zone_src, NULL);
> +        if (error) {
> +            return error;
> +        }
> +
> +        if (out->zone_src.n_bits != 16) {
> +            VLOG_WARN_RL(&rl, "zone n_bits %d not within valid range [16..16]",
> +                         out->zone_src.n_bits);
> +            return OFPERR_OFPBAC_BAD_SET_LEN;
> +        }
> +    } else {
> +        out->zone_src.field = NULL;
> +        out->zone_imm = ntohs(nac->zone_imm);
> +    }
> +
> +    return 0;
> +}
> +
> +static enum ofperr
> +decode_NXAST_RAW_CT(const struct nx_action_conntrack *nac, struct ofpbuf *out)
> +{
> +    const size_t ct_offset = ofpacts_pull(out);
> +    struct ofpact_conntrack *conntrack;
> +    struct ofpbuf openflow;
> +    int error = 0;
> +
> +    conntrack = ofpact_put_CT(out);
> +    conntrack->flags = ntohs(nac->flags);
> +    error = decode_ct_zone(nac, conntrack);
> +    if (error) {
> +        goto out;
> +    }
> +    conntrack->recirc_table = nac->recirc_table;
> +    conntrack->alg = ntohs(nac->alg);
> +
> +    ofpbuf_pull(out, sizeof(*conntrack));
> +
> +    /* XXX: version */

What about the version?

> +    ofpbuf_use_const(&openflow, nac + 1, ntohs(nac->len) - sizeof(*nac));
> +    error = ofpacts_pull_openflow_actions__(&openflow, openflow.size,
> +                                            OFP10_VERSION,
> +                                            1u << OVSINST_OFPIT11_APPLY_ACTIONS,
> +                                            out, OFPACT_CT);
> +    if (error) {
> +        goto out;
> +    }
> +
> +    conntrack = ofpbuf_push_uninit(out, sizeof(*conntrack));
> +    out->header = &conntrack->ofpact;
> +    ofpact_update_len(out, &conntrack->ofpact);
> +
> +out:
> +    ofpbuf_push_uninit(out, ct_offset);
> +    return error;
> +}
> +
> +static void
> +encode_CT(const struct ofpact_conntrack *conntrack,
> +          enum ofp_version ofp_version, struct ofpbuf *out)
> +{
> +    struct nx_action_conntrack *nac;
> +    const size_t ofs = out->size;
> +    size_t len;
> +
> +    nac = put_NXAST_CT(out);
> +    nac->flags = htons(conntrack->flags);
> +    if (conntrack->zone_src.field) {
> +        nac->zone_src = htonl(mf_nxm_header(conntrack->zone_src.field->id));
> +        nac->zone_ofs_nbits = nxm_encode_ofs_nbits(conntrack->zone_src.ofs,
> +                                                   conntrack->zone_src.n_bits);
> +    } else {
> +        nac->zone_src = htonl(0);
> +        nac->zone_imm = htons(conntrack->zone_imm);
> +    }
> +    nac->recirc_table = conntrack->recirc_table;
> +    nac->alg = htons(conntrack->alg);
> +
> +    len = ofpacts_put_openflow_actions(conntrack->actions,
> +                                       ofpact_ct_get_action_len(conntrack),
> +                                       out, ofp_version);
> +    len += sizeof(*nac);
> +    nac = ofpbuf_at(out, ofs, sizeof(*nac));
> +    nac->len = htons(len);

nac->len should include the conntrack action itself, too, not just the length of the actions?
 
> +}
> +
> +/* Parses 'arg' as the argument to a "ct" action, and appends such an
> + * action to 'ofpacts'.
> + *

(snip)

> diff --git a/lib/ofp-actions.h b/lib/ofp-actions.h
> index 818f5c8..38a9547 100644
> --- a/lib/ofp-actions.h
> +++ b/lib/ofp-actions.h
> @@ -106,6 +106,7 @@
>     OFPACT(EXIT,            ofpact_null,        ofpact, "exit")         \
>     OFPACT(SAMPLE,          ofpact_sample,      ofpact, "sample")       \
>     OFPACT(UNROLL_XLATE,    ofpact_unroll_xlate, ofpact, "unroll_xlate") \
> +    OFPACT(CT,              ofpact_conntrack,   ofpact, "ct")           \
>                                                                         \
>     /* Debugging actions.                                               \
>      *                                                                  \
> @@ -471,6 +472,52 @@ BUILD_ASSERT_DECL(offsetof(struct ofpact_nest, actions) % OFPACT_ALIGNTO == 0);
> BUILD_ASSERT_DECL(offsetof(struct ofpact_nest, actions)
>                   == sizeof(struct ofpact_nest));
> 
> +/* Bits for 'flags' in struct nx_action_conntrack.
> + *
> + * If NX_CT_F_COMMIT is set, then the connection entry is moved from the
> + * unconfirmed to confirmed list in the tracker. */
> +enum nx_conntrack_flags {
> +    NX_CT_F_COMMIT = 1 << 0,
> +};
> +
> +/* Magic value for struct nx_action_conntrack 'recirc_table' field, to specify
> + * that the packet should not be recirculated. This value should commonly be
> + * used in conjunction with the NX_CT_F_COMMIT flag above. */
> +#define NX_CT_RECIRC_NONE 0xFF
> +

Maybe this should be defined as OFPTT_ALL, as that is the only invalid table number (0xff), i.e., NX_CT_RECIRC_NONE may not be defined as anything else, right?

> +#define CT_MEMBERS                      \
> +    uint16_t flags;                     \
> +    uint16_t zone_imm;                  \
> +    struct mf_subfield zone_src;        \
> +    uint8_t recirc_table;               \
> +    uint16_t alg
> +

Maybe put the struct member first to avoid internal padding (mf_subfield has a pointer, so it will be 64-bit aligned on 64-bit systems)?


> +/* Used to perform build-time size checking for ofpact_conntrack. */
> +struct ofpact_ct_size {
> +    CT_MEMBERS;
> +};
> +
> +/* OFPACT_CT.
> + *
> + * Used for NXAST_CT. */
> +struct ofpact_conntrack {
> +    struct ofpact ofpact;
> +    CT_MEMBERS;

This is slightly ugly, seems to be necessary.

> +    uint8_t pad[PAD_SIZE(sizeof(struct ofpact) + sizeof(struct ofpact_ct_size),
> +                         OFPACT_ALIGNTO)];
> +    struct ofpact actions[];
> +};
> +BUILD_ASSERT_DECL(offsetof(struct ofpact_conntrack, actions)
> +                  % OFPACT_ALIGNTO == 0);
> +BUILD_ASSERT_DECL(offsetof(struct ofpact_conntrack, actions)
> +                  == sizeof(struct ofpact_conntrack));
> +
> 

(snip)

> bool dpid_from_string(const char *s, uint64_t *dpidp);
> @@ -710,6 +717,18 @@ struct tcp_header {
> };
> BUILD_ASSERT_DECL(TCP_HEADER_LEN == sizeof(struct tcp_header));
> 
> +/* Connection states */
> +#define CS_NEW               0x01
> +#define CS_ESTABLISHED       0x02
> +#define CS_RELATED           0x04
> +#define CS_INVALID           0x20
> +#define CS_REPLY_DIR         0x40
> +#define CS_TRACKED           0x80
> +

These seem to match the datapath interface flags. Is that an invariant? Which ones we use in struct flow?

> +/* Undefined connection state bits. */
> +#define CS_UNSUPPORTED_MASK  0x18
> +#define CS_SUPPORTED_MASK    (~CS_UNSUPPORTED_MASK & 0xFF)
> +
> 

(snip)

> diff --git a/ofproto/ofproto-dpif-rid.h b/ofproto/ofproto-dpif-rid.h
> index e7d95bf..7754e68 100644
> --- a/ofproto/ofproto-dpif-rid.h
> +++ b/ofproto/ofproto-dpif-rid.h
> @@ -142,6 +142,7 @@ struct recirc_state {
>     struct recirc_metadata metadata; /* Flow metadata. */
>     struct ofpbuf *stack;         /* Stack if any. */
>     mirror_mask_t mirrors;        /* Mirrors already output. */
> +    bool conntrack;               /* Conntrack occurred prior to recirc. */
> 

Should this be also called ‘conntracked’?

>     /* Actions to be translated on recirculation. */
>     uint32_t action_set_len;      /* How much of 'ofpacts' consists of an

(snip)

> static void
> check_support(struct dpif_backer *backer)
> {
> @@ -1239,6 +1278,9 @@ check_support(struct dpif_backer *backer)
>     backer->support.masked_set_action = check_masked_set_action(backer);
>     backer->support.ufid = check_ufid(backer);
>     backer->support.tnl_push_pop = dpif_supports_tnl_push_pop(backer->dpif);
> +
> +    backer->support.odp.ct_state = check_ct_state(backer);
> +    backer->support.odp.ct_zone = check_ct_zone(backer);
> }
> 
> static int
> @@ -3935,10 +3977,38 @@ rule_dealloc(struct rule *rule_)
> }
> 
> static enum ofperr
> +rule_check(struct rule *rule)
> +{
> +    struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
> +    const struct odp_support *support;
> +    struct match match;
> +
> +    minimatch_expand(&rule->cr.match, &match);
> +    support = &ofproto_dpif_get_support(ofproto)->odp;
> +
> +    if ((match.wc.masks.ct_state && !support->ct_state)
> +        || (match.wc.masks.ct_zone && !support->ct_zone)) {
> +        return OFPERR_OFPBMC_BAD_FIELD;
> +    }
> +    if (match.wc.masks.ct_state & CS_UNSUPPORTED_MASK) {
> +        return OFPERR_OFPBMC_BAD_MASK;
> +    }
> +
> +    return 0;
> +}

This check seems somewhat expensive. Maybe get the masks from the minimatch without expanding it (using MINIFLOW_GET_U8() and MINIFLOW_GET_U16()), like this:

static enum ofperr
rule_check(struct rule *rule)
{
    ct_state_mask = MINIFLOW_GET_U8(&rule->cr.match.mask.masks, ct_state);
    ct_zone_mask = MINIFLOW_GET_U16(&rule->cr.match.mask.masks, ct_zone);

    if (ct_state_mask || ct_zone_mask) {
        struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
        const struct odp_support *support = &ofproto_dpif_get_support(ofproto)->odp;

        if ((ct_state_mask && !support->ct_state)
            || (ct_zone_mask && !support->ct_zone)) {
            return OFPERR_OFPBMC_BAD_FIELD;
        }
        if (ct_state_mask & CS_UNSUPPORTED_MASK) {
            return OFPERR_OFPBMC_BAD_MASK;
        }
    }
    return 0;
}


> +
> +static enum ofperr
> rule_construct(struct rule *rule_)
>     OVS_NO_THREAD_SAFETY_ANALYSIS
> {
>     struct rule_dpif *rule = rule_dpif_cast(rule_);
> +    int error;
> +
> +    error = rule_check(rule_);
> +    if (error) {
> +        return error;
> +    }
> +
>     ovs_mutex_init_adaptive(&rule->stats_mutex);
>     rule->stats.n_packets = 0;
>     rule->stats.n_bytes = 0;

(snip)

> --- a/tests/system-traffic.at
> +++ b/tests/system-traffic.at
> @@ -139,3 +139,472 @@ NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PI
> 
> OVS_TRAFFIC_VSWITCHD_STOP
> AT_CLEANUP
> +
> +AT_SETUP([conntrack - controller])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +in_port=1,udp,action=ct(commit),controller
> +in_port=2,ct_state=-trk,udp,action=ct(table=0)
> +in_port=2,ct_state=+trk+est-new,udp,action=controller

I forgot the default priority value, would be useful to specify the priority of all the flows to make this more readable.

I’ll check the rest later. However, I was not able to run any of the tests with the net-next openvswitch kernel module, so I need to look at that before going on.

  Jarno
Joe Stringer Sept. 11, 2015, 8:07 p.m. UTC | #5
On 10 September 2015 at 19:03, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>
>> On Sep 9, 2015, at 7:00 PM, Joe Stringer <joestringer@nicira.com> wrote:
>> @@ -672,6 +705,8 @@ struct ovs_action_push_tnl {
>>  * indicate the new packet contents. This could potentially still be
>>  * %ETH_P_MPLS if the resulting MPLS label stack is not empty.  If there
>>  * is no MPLS label stack, as determined by ethertype, no action is taken.
>> + * @OVS_ACTION_ATTR_CT: Track the connection. Populate the conntrack-related
>> + * entries in the flow key.
>>  *
>>  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
>>  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
>> @@ -702,6 +737,7 @@ enum ovs_action_attr {
>>                                      * data immediately followed by a mask.
>>                                      * The data must be zero for the unmasked
>>                                      * bits. */
>> +     OVS_ACTION_ATTR_CT,           /* One nested OVS_CT_ATTR_* . */
>>
>
> I guess this should be “Nested OVS_CT_ATTR_*”, as there can be multiple nested attributes?

Right you are, I'll follow up with this on upstream too.


>> diff --git a/lib/flow.h b/lib/flow.h
>> index d8632ff..2d6d3ee 100644
>> --- a/lib/flow.h
>> +++ b/lib/flow.h
>> @@ -103,8 +103,10 @@ struct flow {
>>     union flow_in_port in_port; /* Input port.*/
>>     uint32_t recirc_id;         /* Must be exact match. */
>>     uint32_t conj_id;           /* Conjunction ID. */
>> +    uint16_t ct_zone;           /* Connection Zone. */
>> +    uint8_t ct_state;           /* Connection state. */
>> +    uint8_t pad1[3];            /* Pad to 64 bits. */
>>     ofp_port_t actset_output;   /* Output port in action set. */
>> -    uint8_t pad1[6];            /* Pad to 64 bits. */
>>
>
> It would be slightly better to put ct_zone and ct_state right after recirc_id. This would make miniflows smaller, as the recirc_id, ct_zone, and ct_state would all be in the same miniflow data unit (conj_id and actset_output are userspace-only fields, so they would be zeroes for datapath miniflows).
>
> We should probably make ct_state a 16 bit field (uint16_t) in struct flow already, for future-proofness, even if the kernel interface is still 8 bits.
>
> Also, you should bump FLOW_WC_SEQ to make sure all the relevant places are updated for the new fields.

OK, I'll adjust these.

>> --- a/lib/meta-flow.h
>> +++ b/lib/meta-flow.h
>> @@ -703,6 +703,51 @@ enum OVS_PACKED_ENUM mf_field_id {
>>      */
>>     MFF_PKT_MARK,
>>
>> +    /* "ct_state".
>> +     *
>> +     * Connection tracking state.  The field is populated by the NXAST_CT
>> +     * action. The following bit values describe the state of the connection:
>> +     *
>> +     *   - New (0x01): This is the beginning of a new connection.
>> +     *   - Established (0x02): This is part of an already existing connection.
>> +     *   - Related (0x04): This is a separate connection that is related to an
>> +     *                     existing connection.
>> +     *   - Invalid (0x20): This flow could not be associated with a connection.
>> +     *                     This could be set for a variety of reasons,
>> +     *                     including (but not limited to):
>> +     *                     - L3/L4 protocol handler is not loaded/unavailable.
>> +     *                     - L3/L4 protocol handler determines that the packet
>> +     *                       is malformed or invalid for the current FSM stage.
>> +     *                     - Packets are unexpected length for protocol.
>> +     *   - Reply (0x40): This flow is in the reply direction, ie it did not
>> +     *                   initiate the connection.
>> +     *   - Tracked (0x80): Connection tracking has occurred.
>> +     *

This should also mention that "related" catches related packets such
as ICMP destination unreachable messages (this is "related" to the
connection which initiated the request to the unreachable
destination). I'll fix that up, here and in the manual.

> Have we documented anywhere which of the bits are mutually exclusive and how. E.g.,
> - If -trk, all the other bits are also zeroes (also invalid?)
> - new and established are mutually exclusive (+new+est is not possible)

Right.

> - new and reply are mutually exclusive (reply direction is defined as opposite of the initial +new)

Strictly speaking I don't think this is the case. I don't have an
example, but it is possible for this to be set in the conntrack info
specified in the Linux interface, see
linux/netfilter/nf_conntrack_common.h. ct_state is translated roughly
from this and potentially other pieces of conntrack information.

> - related and new can be set at the same time (but does related remain set for the duration of the connection, or is it also mutually exclusive with established?)

Related can be present with any other bit. I believe that it remains
set for the duration of the connection. Looks like a good candidate
for another test in the testsuite.

> - apparently related and reply are not mutually exclusive, but when are they set at the same time?

I see "related" as another bit that is independent of all of the
others; whether the connection is related or not, the packets can
still be part of a new connection, established, reply, etc. Perhaps
the more pressing question is exactly what "reply" means in this
context: Is it referring to the direction of the original connection,
or only the related connection? This is something I can further
clarify.

> - are invalid and tracked mutually exclusive?

This question suggests that the description of "tracked" is
insufficient, so I'll amend it. See below for a more comprehensive
definition.

> Also, what are the possible transitions from one state to another?

From the perspective of the packet, there are two states: untracked
and tracked. Untracked means that the packet has not yet been sent
through the connection tracker. If the packet is sent to the
connection tracker via "ct(table=X)", the packet becomes tracked. It
remains tracked until the packet egresses the switch.

From the perspective of connections, connection state can only be
identified once the packet is tracked. There are two connection
states: uncommitted and committed. If a packet has the "new" ct_state
flag, then its corresponding connection is uncommitted, meaning that
the connection tracker cannot identify any other packets as belonging
to the same or related connections. If an uncommitted connection is
committed via "ct(commit)", then subsequent packets for the connection
or related connections may be correlated. In general, any connection
may transition back to uncommitted if no packets are seen for some
time. For specific connection types, such as TCP, there may be
particular packet types that also cause committed connecitons to
transition back to the uncommitted state. The exact operation of these
transitions back to uncommitted are outside the scope of the OpenFlow
definition.

The "new" and "invalid" flags map to the uncommitted connection state.
The "established" and "reply" flags map to the committed connection
state. The "related" flag may be present for connections in either
state.

>> @@ -4448,6 +4455,245 @@ format_DEBUG_RECIRC(const struct ofpact_null *a OVS_UNUSED, struct ds *s)
>> {
>>     ds_put_cstr(s, "debug_recirc");
>> }
>> +
>> +/* Action structure for NXAST_CT.
>> + *
>> + * Pass traffic to the connection tracker.
>> + *
>> + * The "zone" specifies a context within which the tracking is done:
>> + *
>> + *      If 'zone_src' is nonzero, this specifies that the zone should be
>> + *      sourced from a field zone_src[ofs:ofs+nbits]. The format and semantics
>> + *      of 'zone_src' and 'zone_ofs_nbits' are similar to those for the
>> + *      NXAST_REG_LOAD action. The acceptable nxm_header values for 'zone_src'
>> + *      are the same as the acceptable nxm_header values for the 'src' field of
>> + *      NXAST_REG_MOVE.
>> + *
>> + *      If 'zone_src' is zero, then the value of 'zone_imm' will be used as the
>> + *      connection tracking zone.
>> + *
>> + * If "recirc_table" has a value other than NX_CT_RECIRC_NONE, then this field
>> + * specifies that the packet should be logically cloned after the packet is
>
> The packet is logically cloned *before* it is sent through the connection tracker,
> as only the cloned packet gets the ct_state and ct_zone, right?

True. Next time I'll go through and try to ensure that each of the
commit messages, ofp-actions.c and the manpages are consistent on
their descriptions of this.

Strictly speaking, if the table is specified, then it is like an
output action. The pipeline splits prior to the action; one copy goes
through the action and continues to the logical end of that
pipeline(in output this means send the packet and terminate, in ct
this means send through the connection tracker and return to the
OpenFlow pipeline at the table specified). The original packet does
not go through that action, and instead skips straight to the next
action in the list. If it is the last action, then the pipeline
terminates.

>> + * sent through the connection tracker, and pipeline processing for the cloned
>> + * packet will continue in the OpenFlow table specified by this value. The
>> + * NXM_NX_CT_* fields will be populated for the subsequent processing. It is
>> + * strongly recommended that this table is later than the current table, to
>> + * prevent loops.
>> + *
>
> ‘alg’ and ofpacts are introduced in later patches, and I guess it is too much work to not deal with them here yet?

I could go back and adjust the definition to place padding there and
drop the ofpacts instead.

> Anyway, a thought about the nested actions just popped in my mind: The fact that the actions are nested means that they all must be produced by a single OpenFlow rule. I.e., the same rule that does conntrack must also set the labels, etc. This means that either there will be a potentially huge number of rules with conntrack actions, or the values used must be passed through registers (which is kind of redundant). Also, a register may not be wide enough for the data in question (e.g., labels).

Correct. The idea behind the nested actions is to make the connection
tracking and the modification of the external metadata atomic. Two
alternative approaches had been considered:
- One, set the metadata (ie ct_mark) prior to performing ct() action.
In this case, what we are saying is that the actual execution of this
is deferred until the next ct() action, which could be never. Cases
like "actions=set_field(1->ct_mark),ct(commit,zone=1),ct(commit,zone=2)"
would need to be defined: does the mark modification apply to both
connections, or only the first?
- Another option, set the metadata after performing ct(). This assumes
that the metadata modification applies to the most recent connection
tracking entry, which applies until the end of the pipeline. Either we
defer the ct() action execution until we are sure that we have
gathered all relevant set_field(...ct_*) actions, or we retain a
reference to the connection tracking information just in case a later
action applies the modification. Some interesting cases include, what
happens if "actions=ct(table=1),set_field(1->ct_mark)". It's difficult
to guarantee whether the "ct_mark" for the current connection is
modified before the forked packet re-enters table 1 or afterwards.

While it may require the use of additional registers, I think that
it's clearer which connections the ct_mark applies to in this case,
and dependencies don't need to be tracked along the pipeline.

>> +static enum ofperr
>> +decode_NXAST_RAW_CT(const struct nx_action_conntrack *nac, struct ofpbuf *out)
>> +{
>> +    const size_t ct_offset = ofpacts_pull(out);
>> +    struct ofpact_conntrack *conntrack;
>> +    struct ofpbuf openflow;
>> +    int error = 0;
>> +
>> +    conntrack = ofpact_put_CT(out);
>> +    conntrack->flags = ntohs(nac->flags);
>> +    error = decode_ct_zone(nac, conntrack);
>> +    if (error) {
>> +        goto out;
>> +    }
>> +    conntrack->recirc_table = nac->recirc_table;
>> +    conntrack->alg = ntohs(nac->alg);
>> +
>> +    ofpbuf_pull(out, sizeof(*conntrack));
>> +
>> +    /* XXX: version */
>
> What about the version?

When decoding the CT action, we have lost the context of which
OpenFlow version we are decoding. So we can't determine whether the
nested actions are valid for the current OpenFlow version. Having
version=OFP10_VERSION has worked in all of the cases I have tried so
far, perhaps because the nicira extensions extend all openflow
versions. I'm not sure if there's a more correct way to apply this, or
if passing OFP10_VERSION is sufficient here. Thoughts?

>> +    nac->recirc_table = conntrack->recirc_table;
>> +    nac->alg = htons(conntrack->alg);
>> +
>> +    len = ofpacts_put_openflow_actions(conntrack->actions,
>> +                                       ofpact_ct_get_action_len(conntrack),
>> +                                       out, ofp_version);
>> +    len += sizeof(*nac);
>> +    nac = ofpbuf_at(out, ofs, sizeof(*nac));
>> +    nac->len = htons(len);
>
> nac->len should include the conntrack action itself, too, not just the length of the actions?

You mean like, "len += sizeof(*nac)"? Am I missing something here?

>> +/* Bits for 'flags' in struct nx_action_conntrack.
>> + *
>> + * If NX_CT_F_COMMIT is set, then the connection entry is moved from the
>> + * unconfirmed to confirmed list in the tracker. */
>> +enum nx_conntrack_flags {
>> +    NX_CT_F_COMMIT = 1 << 0,
>> +};
>> +
>> +/* Magic value for struct nx_action_conntrack 'recirc_table' field, to specify
>> + * that the packet should not be recirculated. This value should commonly be
>> + * used in conjunction with the NX_CT_F_COMMIT flag above. */
>> +#define NX_CT_RECIRC_NONE 0xFF
>> +
>
> Maybe this should be defined as OFPTT_ALL, as that is the only invalid table number (0xff), i.e., NX_CT_RECIRC_NONE may not be defined as anything else, right?

Agree, I'll change this.

>> +#define CT_MEMBERS                      \
>> +    uint16_t flags;                     \
>> +    uint16_t zone_imm;                  \
>> +    struct mf_subfield zone_src;        \
>> +    uint8_t recirc_table;               \
>> +    uint16_t alg
>> +
>
> Maybe put the struct member first to avoid internal padding (mf_subfield has a pointer, so it will be 64-bit aligned on 64-bit systems)?

Sure.

>> +/* Used to perform build-time size checking for ofpact_conntrack. */
>> +struct ofpact_ct_size {
>> +    CT_MEMBERS;
>> +};
>> +
>> +/* OFPACT_CT.
>> + *
>> + * Used for NXAST_CT. */
>> +struct ofpact_conntrack {
>> +    struct ofpact ofpact;
>> +    CT_MEMBERS;
>
> This is slightly ugly, seems to be necessary.

Right, I'm not sure that we can calculate the correct padding size
without a trick like this.

>> bool dpid_from_string(const char *s, uint64_t *dpidp);
>> @@ -710,6 +717,18 @@ struct tcp_header {
>> };
>> BUILD_ASSERT_DECL(TCP_HEADER_LEN == sizeof(struct tcp_header));
>>
>> +/* Connection states */
>> +#define CS_NEW               0x01
>> +#define CS_ESTABLISHED       0x02
>> +#define CS_RELATED           0x04
>> +#define CS_INVALID           0x20
>> +#define CS_REPLY_DIR         0x40
>> +#define CS_TRACKED           0x80
>> +
>
> These seem to match the datapath interface flags. Is that an invariant? Which ones we use in struct flow?

This is "coincidental". Strictly speaking the OpenFlow wire format for
these flags should be translated into an internal representation,
which should be subsequently translated into the equivalent datapath
representation, but in practice they're all the same. Perhaps this
would be more explicit to define them using the netlink interface
definitions:

#define CS_NEW    OVS_CS_F_NEW
....

>> diff --git a/ofproto/ofproto-dpif-rid.h b/ofproto/ofproto-dpif-rid.h
>> index e7d95bf..7754e68 100644
>> --- a/ofproto/ofproto-dpif-rid.h
>> +++ b/ofproto/ofproto-dpif-rid.h
>> @@ -142,6 +142,7 @@ struct recirc_state {
>>     struct recirc_metadata metadata; /* Flow metadata. */
>>     struct ofpbuf *stack;         /* Stack if any. */
>>     mirror_mask_t mirrors;        /* Mirrors already output. */
>> +    bool conntrack;               /* Conntrack occurred prior to recirc. */
>>
>
> Should this be also called ‘conntracked’?

OK.

>> static enum ofperr
>> +rule_check(struct rule *rule)
>> +{
>> +    struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
>> +    const struct odp_support *support;
>> +    struct match match;
>> +
>> +    minimatch_expand(&rule->cr.match, &match);
>> +    support = &ofproto_dpif_get_support(ofproto)->odp;
>> +
>> +    if ((match.wc.masks.ct_state && !support->ct_state)
>> +        || (match.wc.masks.ct_zone && !support->ct_zone)) {
>> +        return OFPERR_OFPBMC_BAD_FIELD;
>> +    }
>> +    if (match.wc.masks.ct_state & CS_UNSUPPORTED_MASK) {
>> +        return OFPERR_OFPBMC_BAD_MASK;
>> +    }
>> +
>> +    return 0;
>> +}
>
> This check seems somewhat expensive. Maybe get the masks from the minimatch without expanding it (using MINIFLOW_GET_U8() and MINIFLOW_GET_U16()), like this:
>
> static enum ofperr
> rule_check(struct rule *rule)
> {
>     ct_state_mask = MINIFLOW_GET_U8(&rule->cr.match.mask.masks, ct_state);
>     ct_zone_mask = MINIFLOW_GET_U16(&rule->cr.match.mask.masks, ct_zone);
>
>     if (ct_state_mask || ct_zone_mask) {
>         struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
>         const struct odp_support *support = &ofproto_dpif_get_support(ofproto)->odp;
>
>         if ((ct_state_mask && !support->ct_state)
>             || (ct_zone_mask && !support->ct_zone)) {
>             return OFPERR_OFPBMC_BAD_FIELD;
>         }
>         if (ct_state_mask & CS_UNSUPPORTED_MASK) {
>             return OFPERR_OFPBMC_BAD_MASK;
>         }
>     }
>     return 0;
> }

Thanks for the suggestion, looks reasonable.

>> --- a/tests/system-traffic.at
>> +++ b/tests/system-traffic.at
>> @@ -139,3 +139,472 @@ NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PI
>>
>> OVS_TRAFFIC_VSWITCHD_STOP
>> AT_CLEANUP
>> +
>> +AT_SETUP([conntrack - controller])
>> +CHECK_CONNTRACK()
>> +OVS_TRAFFIC_VSWITCHD_START(
>> +   [set-fail-mode br0 standalone -- ])
>> +
>> +ADD_NAMESPACES(at_ns0, at_ns1)
>> +
>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>> +
>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>> +AT_DATA([flows.txt], [dnl
>> +priority=1,action=drop
>> +priority=10,arp,action=normal
>> +in_port=1,udp,action=ct(commit),controller
>> +in_port=2,ct_state=-trk,udp,action=ct(table=0)
>> +in_port=2,ct_state=+trk+est-new,udp,action=controller
>
> I forgot the default priority value, would be useful to specify the priority of all the flows to make this more readable.

OK.
Joe Stringer Sept. 11, 2015, 8:35 p.m. UTC | #6
On 10 September 2015 at 11:05, Ben Pfaff <blp@nicira.com> wrote:
> Who do you think should review this?

Jarno and I discussed the OpenFlow interface a little at lunch, would
you mind looking at this particular aspect from an API cleanliness
point of view?
Jarno Rajahalme Sept. 11, 2015, 9:02 p.m. UTC | #7
> On Sep 11, 2015, at 1:07 PM, Joe Stringer <joestringer@nicira.com> wrote:
> 
> On 10 September 2015 at 19:03, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>> 
>>> On Sep 9, 2015, at 7:00 PM, Joe Stringer <joestringer@nicira.com> wrote:
>>> @@ -672,6 +705,8 @@ struct ovs_action_push_tnl {
>>> * indicate the new packet contents. This could potentially still be
>>> * %ETH_P_MPLS if the resulting MPLS label stack is not empty.  If there
>>> * is no MPLS label stack, as determined by ethertype, no action is taken.
>>> + * @OVS_ACTION_ATTR_CT: Track the connection. Populate the conntrack-related
>>> + * entries in the flow key.
>>> *
>>> * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
>>> * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
>>> @@ -702,6 +737,7 @@ enum ovs_action_attr {
>>>                                     * data immediately followed by a mask.
>>>                                     * The data must be zero for the unmasked
>>>                                     * bits. */
>>> +     OVS_ACTION_ATTR_CT,           /* One nested OVS_CT_ATTR_* . */
>>> 
>> 
>> I guess this should be “Nested OVS_CT_ATTR_*”, as there can be multiple nested attributes?
> 
> Right you are, I'll follow up with this on upstream too.
> 
> 
>>> diff --git a/lib/flow.h b/lib/flow.h
>>> index d8632ff..2d6d3ee 100644
>>> --- a/lib/flow.h
>>> +++ b/lib/flow.h
>>> @@ -103,8 +103,10 @@ struct flow {
>>>    union flow_in_port in_port; /* Input port.*/
>>>    uint32_t recirc_id;         /* Must be exact match. */
>>>    uint32_t conj_id;           /* Conjunction ID. */
>>> +    uint16_t ct_zone;           /* Connection Zone. */
>>> +    uint8_t ct_state;           /* Connection state. */
>>> +    uint8_t pad1[3];            /* Pad to 64 bits. */
>>>    ofp_port_t actset_output;   /* Output port in action set. */
>>> -    uint8_t pad1[6];            /* Pad to 64 bits. */
>>> 
>> 
>> It would be slightly better to put ct_zone and ct_state right after recirc_id. This would make miniflows smaller, as the recirc_id, ct_zone, and ct_state would all be in the same miniflow data unit (conj_id and actset_output are userspace-only fields, so they would be zeroes for datapath miniflows).
>> 
>> We should probably make ct_state a 16 bit field (uint16_t) in struct flow already, for future-proofness, even if the kernel interface is still 8 bits.
>> 
>> Also, you should bump FLOW_WC_SEQ to make sure all the relevant places are updated for the new fields.
> 
> OK, I'll adjust these.
> 
>>> --- a/lib/meta-flow.h
>>> +++ b/lib/meta-flow.h
>>> @@ -703,6 +703,51 @@ enum OVS_PACKED_ENUM mf_field_id {
>>>     */
>>>    MFF_PKT_MARK,
>>> 
>>> +    /* "ct_state".
>>> +     *
>>> +     * Connection tracking state.  The field is populated by the NXAST_CT
>>> +     * action. The following bit values describe the state of the connection:
>>> +     *
>>> +     *   - New (0x01): This is the beginning of a new connection.
>>> +     *   - Established (0x02): This is part of an already existing connection.
>>> +     *   - Related (0x04): This is a separate connection that is related to an
>>> +     *                     existing connection.
>>> +     *   - Invalid (0x20): This flow could not be associated with a connection.
>>> +     *                     This could be set for a variety of reasons,
>>> +     *                     including (but not limited to):
>>> +     *                     - L3/L4 protocol handler is not loaded/unavailable.
>>> +     *                     - L3/L4 protocol handler determines that the packet
>>> +     *                       is malformed or invalid for the current FSM stage.
>>> +     *                     - Packets are unexpected length for protocol.
>>> +     *   - Reply (0x40): This flow is in the reply direction, ie it did not
>>> +     *                   initiate the connection.
>>> +     *   - Tracked (0x80): Connection tracking has occurred.
>>> +     *
> 
> This should also mention that "related" catches related packets such
> as ICMP destination unreachable messages (this is "related" to the
> connection which initiated the request to the unreachable
> destination). I'll fix that up, here and in the manual.
> 
>> Have we documented anywhere which of the bits are mutually exclusive and how. E.g.,
>> - If -trk, all the other bits are also zeroes (also invalid?)
>> - new and established are mutually exclusive (+new+est is not possible)
> 
> Right.
> 
>> - new and reply are mutually exclusive (reply direction is defined as opposite of the initial +new)
> 
> Strictly speaking I don't think this is the case. I don't have an
> example, but it is possible for this to be set in the conntrack info
> specified in the Linux interface, see
> linux/netfilter/nf_conntrack_common.h. ct_state is translated roughly
> from this and potentially other pieces of conntrack information.
> 

While IP_CT_NEW_REPLY is defined, it is only used by openvswitch/connntrack.c, and the next line says “(no NEW in reply dirn.)”

So it looks like IP_CT_NEW_REPLY should not even be defined. I have prepared a patch to remove the IP_CT_NEW_REPLY definition, but have no immediate plans to send it out.

As said, it seems to me that reply direction is defined as the opposing direction to the packet that initially created the conntrack entry. I’m still a bit confused about related vs. related_reply, though.

>> - related and new can be set at the same time (but does related remain set for the duration of the connection, or is it also mutually exclusive with established?)
> 
> Related can be present with any other bit. I believe that it remains
> set for the duration of the connection. Looks like a good candidate
> for another test in the testsuite.
> 

ctinfo enum can not hold related and established at the same time, though.

>> - apparently related and reply are not mutually exclusive, but when are they set at the same time?
> 
> I see "related" as another bit that is independent of all of the
> others; whether the connection is related or not, the packets can
> still be part of a new connection, established, reply, etc. Perhaps
> the more pressing question is exactly what "reply" means in this
> context: Is it referring to the direction of the original connection,
> or only the related connection? This is something I can further
> clarify.
> 

in ctinfo enum related, new, and established are all mutually exclusive. In OVS ct_state this is not always the case, though, as we explicitly set the new bit for related, too. I agree that it would be nice to be able to persist the related bit also when the connection state is “established”.

>> - are invalid and tracked mutually exclusive?
> 
> This question suggests that the description of "tracked" is
> insufficient, so I'll amend it. See below for a more comprehensive
> definition.
> 
>> Also, what are the possible transitions from one state to another?
> 
> From the perspective of the packet, there are two states: untracked
> and tracked. Untracked means that the packet has not yet been sent
> through the connection tracker. If the packet is sent to the
> connection tracker via "ct(table=X)", the packet becomes tracked. It
> remains tracked until the packet egresses the switch.
> 
> From the perspective of connections, connection state can only be
> identified once the packet is tracked. There are two connection
> states: uncommitted and committed. If a packet has the "new" ct_state
> flag, then its corresponding connection is uncommitted, meaning that
> the connection tracker cannot identify any other packets as belonging
> to the same or related connections. If an uncommitted connection is
> committed via "ct(commit)", then subsequent packets for the connection
> or related connections may be correlated. In general, any connection
> may transition back to uncommitted if no packets are seen for some
> time. For specific connection types, such as TCP, there may be
> particular packet types that also cause committed connecitons to
> transition back to the uncommitted state. The exact operation of these
> transitions back to uncommitted are outside the scope of the OpenFlow
> definition.
> 
> The "new" and "invalid" flags map to the uncommitted connection state.
> The "established" and "reply" flags map to the committed connection
> state. The "related" flag may be present for connections in either
> state.
> 
>>> @@ -4448,6 +4455,245 @@ format_DEBUG_RECIRC(const struct ofpact_null *a OVS_UNUSED, struct ds *s)
>>> {
>>>    ds_put_cstr(s, "debug_recirc");
>>> }
>>> +
>>> +/* Action structure for NXAST_CT.
>>> + *
>>> + * Pass traffic to the connection tracker.
>>> + *
>>> + * The "zone" specifies a context within which the tracking is done:
>>> + *
>>> + *      If 'zone_src' is nonzero, this specifies that the zone should be
>>> + *      sourced from a field zone_src[ofs:ofs+nbits]. The format and semantics
>>> + *      of 'zone_src' and 'zone_ofs_nbits' are similar to those for the
>>> + *      NXAST_REG_LOAD action. The acceptable nxm_header values for 'zone_src'
>>> + *      are the same as the acceptable nxm_header values for the 'src' field of
>>> + *      NXAST_REG_MOVE.
>>> + *
>>> + *      If 'zone_src' is zero, then the value of 'zone_imm' will be used as the
>>> + *      connection tracking zone.
>>> + *
>>> + * If "recirc_table" has a value other than NX_CT_RECIRC_NONE, then this field
>>> + * specifies that the packet should be logically cloned after the packet is
>> 
>> The packet is logically cloned *before* it is sent through the connection tracker,
>> as only the cloned packet gets the ct_state and ct_zone, right?
> 
> True. Next time I'll go through and try to ensure that each of the
> commit messages, ofp-actions.c and the manpages are consistent on
> their descriptions of this.
> 
> Strictly speaking, if the table is specified, then it is like an
> output action. The pipeline splits prior to the action; one copy goes
> through the action and continues to the logical end of that
> pipeline(in output this means send the packet and terminate, in ct
> this means send through the connection tracker and return to the
> OpenFlow pipeline at the table specified). The original packet does
> not go through that action, and instead skips straight to the next
> action in the list. If it is the last action, then the pipeline
> terminates.
> 
>>> + * sent through the connection tracker, and pipeline processing for the cloned
>>> + * packet will continue in the OpenFlow table specified by this value. The
>>> + * NXM_NX_CT_* fields will be populated for the subsequent processing. It is
>>> + * strongly recommended that this table is later than the current table, to
>>> + * prevent loops.
>>> + *
>> 
>> ‘alg’ and ofpacts are introduced in later patches, and I guess it is too much work to not deal with them here yet?
> 
> I could go back and adjust the definition to place padding there and
> drop the ofpacts instead.

Seems like a lot of work, though, essentially only for reviewing purposes. Up to you :-)

> 
>> Anyway, a thought about the nested actions just popped in my mind: The fact that the actions are nested means that they all must be produced by a single OpenFlow rule. I.e., the same rule that does conntrack must also set the labels, etc. This means that either there will be a potentially huge number of rules with conntrack actions, or the values used must be passed through registers (which is kind of redundant). Also, a register may not be wide enough for the data in question (e.g., labels).
> 
> Correct. The idea behind the nested actions is to make the connection
> tracking and the modification of the external metadata atomic. Two
> alternative approaches had been considered:
> - One, set the metadata (ie ct_mark) prior to performing ct() action.
> In this case, what we are saying is that the actual execution of this
> is deferred until the next ct() action, which could be never. Cases
> like "actions=set_field(1->ct_mark),ct(commit,zone=1),ct(commit,zone=2)"
> would need to be defined: does the mark modification apply to both
> connections, or only the first?
> - Another option, set the metadata after performing ct(). This assumes
> that the metadata modification applies to the most recent connection
> tracking entry, which applies until the end of the pipeline. Either we
> defer the ct() action execution until we are sure that we have
> gathered all relevant set_field(...ct_*) actions, or we retain a
> reference to the connection tracking information just in case a later
> action applies the modification. Some interesting cases include, what
> happens if "actions=ct(table=1),set_field(1->ct_mark)". It's difficult
> to guarantee whether the "ct_mark" for the current connection is
> modified before the forked packet re-enters table 1 or afterwards.
> 
> While it may require the use of additional registers, I think that
> it's clearer which connections the ct_mark applies to in this case,
> and dependencies don't need to be tracked along the pipeline.
> 
>>> +static enum ofperr
>>> +decode_NXAST_RAW_CT(const struct nx_action_conntrack *nac, struct ofpbuf *out)
>>> +{
>>> +    const size_t ct_offset = ofpacts_pull(out);
>>> +    struct ofpact_conntrack *conntrack;
>>> +    struct ofpbuf openflow;
>>> +    int error = 0;
>>> +
>>> +    conntrack = ofpact_put_CT(out);
>>> +    conntrack->flags = ntohs(nac->flags);
>>> +    error = decode_ct_zone(nac, conntrack);
>>> +    if (error) {
>>> +        goto out;
>>> +    }
>>> +    conntrack->recirc_table = nac->recirc_table;
>>> +    conntrack->alg = ntohs(nac->alg);
>>> +
>>> +    ofpbuf_pull(out, sizeof(*conntrack));
>>> +
>>> +    /* XXX: version */
>> 
>> What about the version?
> 
> When decoding the CT action, we have lost the context of which
> OpenFlow version we are decoding. So we can't determine whether the
> nested actions are valid for the current OpenFlow version. Having
> version=OFP10_VERSION has worked in all of the cases I have tried so
> far, perhaps because the nicira extensions extend all openflow
> versions. I'm not sure if there's a more correct way to apply this, or
> if passing OFP10_VERSION is sufficient here. Thoughts?
> 

Not sure. Maybe test with each OpenFlow version and see what happens?

>>> +    nac->recirc_table = conntrack->recirc_table;
>>> +    nac->alg = htons(conntrack->alg);
>>> +
>>> +    len = ofpacts_put_openflow_actions(conntrack->actions,
>>> +                                       ofpact_ct_get_action_len(conntrack),
>>> +                                       out, ofp_version);
>>> +    len += sizeof(*nac);
>>> +    nac = ofpbuf_at(out, ofs, sizeof(*nac));
>>> +    nac->len = htons(len);
>> 
>> nac->len should include the conntrack action itself, too, not just the length of the actions?
> 
> You mean like, "len += sizeof(*nac)"? Am I missing something here?
> 

I simply missed the line you refer to, sorry.

>>> +/* Bits for 'flags' in struct nx_action_conntrack.
>>> + *
>>> + * If NX_CT_F_COMMIT is set, then the connection entry is moved from the
>>> + * unconfirmed to confirmed list in the tracker. */
>>> +enum nx_conntrack_flags {
>>> +    NX_CT_F_COMMIT = 1 << 0,
>>> +};
>>> +
>>> +/* Magic value for struct nx_action_conntrack 'recirc_table' field, to specify
>>> + * that the packet should not be recirculated. This value should commonly be
>>> + * used in conjunction with the NX_CT_F_COMMIT flag above. */
>>> +#define NX_CT_RECIRC_NONE 0xFF
>>> +
>> 
>> Maybe this should be defined as OFPTT_ALL, as that is the only invalid table number (0xff), i.e., NX_CT_RECIRC_NONE may not be defined as anything else, right?
> 
> Agree, I'll change this.
> 
>>> +#define CT_MEMBERS                      \
>>> +    uint16_t flags;                     \
>>> +    uint16_t zone_imm;                  \
>>> +    struct mf_subfield zone_src;        \
>>> +    uint8_t recirc_table;               \
>>> +    uint16_t alg
>>> +
>> 
>> Maybe put the struct member first to avoid internal padding (mf_subfield has a pointer, so it will be 64-bit aligned on 64-bit systems)?
> 
> Sure.
> 
>>> +/* Used to perform build-time size checking for ofpact_conntrack. */
>>> +struct ofpact_ct_size {
>>> +    CT_MEMBERS;
>>> +};
>>> +
>>> +/* OFPACT_CT.
>>> + *
>>> + * Used for NXAST_CT. */
>>> +struct ofpact_conntrack {
>>> +    struct ofpact ofpact;
>>> +    CT_MEMBERS;
>> 
>> This is slightly ugly, seems to be necessary.
> 
> Right, I'm not sure that we can calculate the correct padding size
> without a trick like this.
> 
>>> bool dpid_from_string(const char *s, uint64_t *dpidp);
>>> @@ -710,6 +717,18 @@ struct tcp_header {
>>> };
>>> BUILD_ASSERT_DECL(TCP_HEADER_LEN == sizeof(struct tcp_header));
>>> 
>>> +/* Connection states */
>>> +#define CS_NEW               0x01
>>> +#define CS_ESTABLISHED       0x02
>>> +#define CS_RELATED           0x04
>>> +#define CS_INVALID           0x20
>>> +#define CS_REPLY_DIR         0x40
>>> +#define CS_TRACKED           0x80
>>> +
>> 
>> These seem to match the datapath interface flags. Is that an invariant? Which ones we use in struct flow?
> 
> This is "coincidental". Strictly speaking the OpenFlow wire format for
> these flags should be translated into an internal representation,
> which should be subsequently translated into the equivalent datapath
> representation, but in practice they're all the same. Perhaps this
> would be more explicit to define them using the netlink interface
> definitions:
> 
> #define CS_NEW    OVS_CS_F_NEW
> ….

Would the current code work if the mapping was not 1:1? If yes, then it does not matter. If not, then we should at least assert that the mapping stays 1:1 also in the future.

>>> diff --git a/ofproto/ofproto-dpif-rid.h b/ofproto/ofproto-dpif-rid.h
>>> index e7d95bf..7754e68 100644
>>> --- a/ofproto/ofproto-dpif-rid.h
>>> +++ b/ofproto/ofproto-dpif-rid.h
>>> @@ -142,6 +142,7 @@ struct recirc_state {
>>>    struct recirc_metadata metadata; /* Flow metadata. */
>>>    struct ofpbuf *stack;         /* Stack if any. */
>>>    mirror_mask_t mirrors;        /* Mirrors already output. */
>>> +    bool conntrack;               /* Conntrack occurred prior to recirc. */
>>> 
>> 
>> Should this be also called ‘conntracked’?
> 
> OK.
> 
>>> static enum ofperr
>>> +rule_check(struct rule *rule)
>>> +{
>>> +    struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
>>> +    const struct odp_support *support;
>>> +    struct match match;
>>> +
>>> +    minimatch_expand(&rule->cr.match, &match);
>>> +    support = &ofproto_dpif_get_support(ofproto)->odp;
>>> +
>>> +    if ((match.wc.masks.ct_state && !support->ct_state)
>>> +        || (match.wc.masks.ct_zone && !support->ct_zone)) {
>>> +        return OFPERR_OFPBMC_BAD_FIELD;
>>> +    }
>>> +    if (match.wc.masks.ct_state & CS_UNSUPPORTED_MASK) {
>>> +        return OFPERR_OFPBMC_BAD_MASK;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>> 
>> This check seems somewhat expensive. Maybe get the masks from the minimatch without expanding it (using MINIFLOW_GET_U8() and MINIFLOW_GET_U16()), like this:
>> 
>> static enum ofperr
>> rule_check(struct rule *rule)
>> {
>>    ct_state_mask = MINIFLOW_GET_U8(&rule->cr.match.mask.masks, ct_state);
>>    ct_zone_mask = MINIFLOW_GET_U16(&rule->cr.match.mask.masks, ct_zone);
>> 
>>    if (ct_state_mask || ct_zone_mask) {
>>        struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
>>        const struct odp_support *support = &ofproto_dpif_get_support(ofproto)->odp;
>> 
>>        if ((ct_state_mask && !support->ct_state)
>>            || (ct_zone_mask && !support->ct_zone)) {
>>            return OFPERR_OFPBMC_BAD_FIELD;
>>        }
>>        if (ct_state_mask & CS_UNSUPPORTED_MASK) {
>>            return OFPERR_OFPBMC_BAD_MASK;
>>        }
>>    }
>>    return 0;
>> }
> 
> Thanks for the suggestion, looks reasonable.
> 
>>> --- a/tests/system-traffic.at
>>> +++ b/tests/system-traffic.at
>>> @@ -139,3 +139,472 @@ NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PI
>>> 
>>> OVS_TRAFFIC_VSWITCHD_STOP
>>> AT_CLEANUP
>>> +
>>> +AT_SETUP([conntrack - controller])
>>> +CHECK_CONNTRACK()
>>> +OVS_TRAFFIC_VSWITCHD_START(
>>> +   [set-fail-mode br0 standalone -- ])
>>> +
>>> +ADD_NAMESPACES(at_ns0, at_ns1)
>>> +
>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>>> +
>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>>> +AT_DATA([flows.txt], [dnl
>>> +priority=1,action=drop
>>> +priority=10,arp,action=normal
>>> +in_port=1,udp,action=ct(commit),controller
>>> +in_port=2,ct_state=-trk,udp,action=ct(table=0)
>>> +in_port=2,ct_state=+trk+est-new,udp,action=controller
>> 
>> I forgot the default priority value, would be useful to specify the priority of all the flows to make this more readable.
> 
> OK.

Btw. I got the “make check-kernel” run successfully for this patch :-)

  Jarno
Jarno Rajahalme Sept. 11, 2015, 9:42 p.m. UTC | #8
Joe,

Below some comments on the tests,

  Jarno

> --- a/tests/atlocal.in
> +++ b/tests/atlocal.in
> @@ -110,3 +110,10 @@ fi
> if test "$IS_WIN32" = "yes"; then
>     HAVE_PYTHON="no"
> fi
> +
> +# Conntrack test requirements
> +if test x`which conntrack` != x; then
> +    HAVE_CONNTRACK="yes"
> +else
> +    HAVE_CONNTRACK="no"
> +fi
> diff --git a/tests/automake.mk b/tests/automake.mk
> index 4198039..cef67c0 100644
> --- a/tests/automake.mk
> +++ b/tests/automake.mk
> @@ -337,6 +337,7 @@ CHECK_PYFILES = \
> 	tests/test-daemon.py \
> 	tests/test-json.py \
> 	tests/test-jsonrpc.py \
> +	tests/test-l7.py \
> 	tests/test-ovsdb.py \
> 	tests/test-reconnect.py \
> 	tests/MockXenAPI.py \
> diff --git a/tests/dpif-netdev.at b/tests/dpif-netdev.at
> index e9af63f..43108e7 100644
> --- a/tests/dpif-netdev.at
> +++ b/tests/dpif-netdev.at
> @@ -82,7 +82,7 @@ AT_CHECK([cat ovs-vswitchd.log | grep -A 1 'miss upcall' | tail -n 1], [0], [dnl
> skb_priority(0),skb_mark(0),recirc_id(0),dp_hash(0),in_port(1),eth(src=50:54:00:00:00:09,dst=50:54:00:00:00:0a),eth_type(0x0800),ipv4(src=10.0.0.2,dst=10.0.0.1,proto=1,tos=0,ttl=64,frag=no),icmp(type=8,code=0)
> ])
> AT_CHECK([cat ovs-vswitchd.log | FILTER_FLOW_INSTALL | STRIP_XOUT], [0], [dnl
> -pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.0.0.2,nw_dst=10.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions: <del>
> +pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,ct_state=0,ct_zone=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.0.0.2,nw_dst=10.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions: <del>
> recirc_id=0,ip,in_port=1,vlan_tci=0x0000/0x1fff,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_frag=no, actions: <del>
> ])
> 
> diff --git a/tests/odp.at b/tests/odp.at
> index 61edde3..74d44ff 100644
> --- a/tests/odp.at
> +++ b/tests/odp.at
> @@ -86,6 +86,11 @@ sed '/bos=0/{
> s/^/ODP_FIT_TOO_LITTLE: /
> }' < odp-in.txt > odp-out.txt
> 
> +dnl Some fields are always printed for this test, because wildcards aren't
> +dnl specified. We can skip these.
> +sed -i 's/\(skb_mark(0)\),\(ct\)/\1,ct_state(0),ct_zone(0),\2/' odp-out.txt
> +sed -i 's/\(skb_mark([[^)]]*)\),\(recirc\)/\1,ct_state(0),ct_zone(0),\2/' odp-out.txt
> +
> AT_CHECK_UNQUOTED([ovstest test-odp parse-keys < odp-in.txt], [0], [`cat odp-out.txt`
> ])
> AT_CLEANUP
> @@ -153,6 +158,10 @@ s/\(eth([[^)]]*)\),*/\1,eth_type(0x8100),vlan(vid=99,pcp=7),encap(/
> s/$/)/' odp-base.txt
> 
>  echo
> + echo '# Valid forms with conntrack fields.'
> + sed 's/\(eth([[^)]]*),?\)/\1,ct_state(+trk),/' odp-base.txt
> +
> + echo
>  echo '# Valid forms with IP first fragment.'
> sed -n 's/,frag=no),/,frag=first),/p' odp-base.txt
> 
> @@ -293,6 +302,9 @@ tnl_push(tnl_port(6),header(size=50,type=4,eth(dst=f8:bc:12:44:34:b6,src=f8:bc:1
> tnl_push(tnl_port(6),header(size=50,type=5,eth(dst=f8:bc:12:44:34:b6,src=f8:bc:12:46:58:e0,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x40),udp(src=0,dst=6081,csum=0x0),geneve(oam,vni=0x1c7)),out_port(1))
> tnl_push(tnl_port(6),header(size=58,type=5,eth(dst=f8:bc:12:44:34:b6,src=f8:bc:12:46:58:e0,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x40),udp(src=0,dst=6081,csum=0x0),geneve(crit,vni=0x1c7,options({class=0xffff,type=0x80,len=4,0xa}))),out_port(1))
> tnl_push(tnl_port(6),header(size=50,type=5,eth(dst=f8:bc:12:44:34:b6,src=f8:bc:12:46:58:e0,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x40),udp(src=0,dst=6081,csum=0xffff),geneve(vni=0x1c7)),out_port(1))
> +ct
> +ct(commit)
> +ct(commit,zone=5)
> ])
> AT_CHECK_UNQUOTED([ovstest test-odp parse-actions < actions.txt], [0],
>   [`cat actions.txt`
> diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
> index eb8647b..1e3013b 100644
> --- a/tests/ofproto-dpif.at
> +++ b/tests/ofproto-dpif.at
> @@ -6486,8 +6486,8 @@ for i in 1 2 3 4; do
> done
> sleep 1
> AT_CHECK([cat ovs-vswitchd.log | STRIP_UFID | FILTER_FLOW_INSTALL | STRIP_USED], [0], [dnl
> -pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.0.0.2,nw_dst=10.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions:2
> -pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:0b,dl_dst=50:54:00:00:00:0c,nw_src=10.0.0.4,nw_dst=10.0.0.3,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions:drop
> +pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,ct_state=0,ct_zone=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.0.0.2,nw_dst=10.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions:2
> +pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,ct_state=0,ct_zone=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:0b,dl_dst=50:54:00:00:00:0c,nw_src=10.0.0.4,nw_dst=10.0.0.3,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions:drop
> ])
> AT_CHECK([cat ovs-vswitchd.log | STRIP_UFID | FILTER_FLOW_DUMP | grep 'packets:3'], [0], [dnl
> skb_priority(0),skb_mark(0),recirc_id(0),dp_hash(0),in_port(1),eth(src=50:54:00:00:00:09,dst=50:54:00:00:00:0a),eth_type(0x0800),ipv4(src=10.0.0.2,dst=10.0.0.1,proto=1,tos=0,ttl=64,frag=no),icmp(type=8,code=0), packets:3, bytes:180, used:0.0s, actions:2
> diff --git a/tests/ofproto.at b/tests/ofproto.at
> index 7e80293..c522a85 100644
> --- a/tests/ofproto.at
> +++ b/tests/ofproto.at
> @@ -1609,6 +1609,8 @@ head_table () {
>       in_port_oxm: exact match or wildcard
>       actset_output: exact match or wildcard
>       pkt_mark: arbitrary mask
> +      ct_state: arbitrary mask
> +      ct_zone: exact match or wildcard
>       reg0: arbitrary mask
>       reg1: arbitrary mask
>       reg2: arbitrary mask
> diff --git a/tests/system-common-macros.at b/tests/system-common-macros.at
> index c5691e7..8f3b318 100644
> --- a/tests/system-common-macros.at
> +++ b/tests/system-common-macros.at
> @@ -118,3 +118,21 @@ m4_define([ADD_NATIVE_TUNNEL],
> # Strip variant pieces from ping output so the output can be reliably compared.
> #
> m4_define([FORMAT_PING], [grep "transmitted" | sed 's/time.*ms$/time 0ms/'])
> +
> +# FORMAT_CT()
> +#
> +# Strip content from the piped input which would differ from test to test.
> +#
> +m4_define([FORMAT_CT],
> +    [[grep "dst=$1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e 's/  */ /g' -e 's/secctx[^ ]* //' | cut -d' ' -f4- | sort | uniq]])
> +
> +# NETNS_DAEMONIZE([namespace], [command], [pidfile])
> +#
> +# Run 'command' as a background process within 'namespace' and record its pid
> +# to 'pidfile' to allow cleanup on exit.
> +#
> +m4_define([NETNS_DAEMONIZE],
> +   [ip netns exec $1 $2 & echo $! > $3
> +     echo "kill \`cat $3\`" >> cleanup
> +   ]
> +)
> diff --git a/tests/system-kmod-macros.at b/tests/system-kmod-macros.at
> index 1216db8..a48e8d9 100644
> --- a/tests/system-kmod-macros.at
> +++ b/tests/system-kmod-macros.at
> @@ -43,3 +43,19 @@ m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
>   [OVS_VSWITCHD_STOP([$1])
>    AT_CHECK([:; $2])
>   ])
> +
> +# CHECK_CONNTRACK()
> +#
> +# Perform requirements checks for running conntrack tests, and flush the
> +# kernel conntrack tables when the test is finished.
> +#
> +m4_define([CHECK_CONNTRACK],
> +    [AT_SKIP_IF([test $HAVE_CONNTRACK = no])
> +     AT_SKIP_IF([test $HAVE_PYTHON = no])
> +     m4_foreach([mod], [[nf_conntrack_ipv4], [nf_conntrack_ipv6]],
> +                [modprobe mod || echo "Module mod not loaded."
> +                 on_exit 'modprobe -r mod'
> +                ])
> +     on_exit 'conntrack -F'
> +    ]
> +)
> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> index 7dbed68..de6b016 100644
> --- a/tests/system-traffic.at
> +++ b/tests/system-traffic.at
> @@ -139,3 +139,472 @@ NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PI
> 
> OVS_TRAFFIC_VSWITCHD_STOP
> AT_CLEANUP
> +
> +AT_SETUP([conntrack - controller])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +in_port=1,udp,action=ct(commit),controller
> +in_port=2,ct_state=-trk,udp,action=ct(table=0)
> +in_port=2,ct_state=+trk+est-new,udp,action=controller

ct_state=+est should be sufficient here?

> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +AT_CAPTURE_FILE([ofctl_monitor.log])
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> +
> +dnl Send an unsolicited reply from port 2. This should be dropped.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c00000000001100000a0101020a0101010002000100080000'])
> +
> +dnl OK, now start a new connection from port 1.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 1 ct\(commit\),controller '50540000000a50540000000908004500001c00000000001100000a0101010a0101020001000200080000'])
> +
> +dnl Now try a reply from port 2.
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c00000000001100000a0101020a0101010002000100080000'])
> +
> +dnl Check this output. We only see the latter two packets, not the first.
> +AT_CHECK([cat ofctl_monitor.log], [0], [dnl
> +NXT_PACKET_IN (xid=0x0): total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
> +NXT_PACKET_IN (xid=0x0): cookie=0x0 total_len=42 ct_state=est|rpl|trk,in_port=2 (via action) data_len=42 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - IPv4 HTTP])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=10,icmp,action=normal
> +in_port=1,tcp,action=ct(commit),2
> +in_port=2,ct_state=-trk,tcp,action=ct(table=0)
> +in_port=2,ct_state=+trk+est-new,tcp,action=1

Having explicit priorities here as well would be helpful. The comment above about “+est” should apply here as well.

> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl Basic connectivity check.
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 >/dev/null])
> +
> +dnl HTTP requests from ns0->ns1 should work fine.
> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])


wget was trying to resolve the proxy I had configured in the “http_proxy” environment variable. The resolution from the netns failed, making the test case fail. I had to clear “http_proxy" to make the test work.

> +
> +AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
> +TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 use=1

Do you know how/if [ASSURED] maps to ct_state flags?

> +])
> +
> +dnl HTTP requests from ns1->ns0 should fail due to network failure.
> +dnl Try 3 times, in 1 second intervals.
> +NETNS_DAEMONIZE([at_ns0], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
> +NS_CHECK_EXEC([at_ns1], [wget 10.1.1.1 -t 3 -T 1 -v -o wget1.log], [4])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - IPv6 HTTP])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH(p0, at_ns0, br0, "fc00::1/96")
> +ADD_VETH(p1, at_ns1, br0, "fc00::2/96")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,icmp6,action=normal
> +in_port=1,tcp6,action=ct(commit),2
> +in_port=2,ct_state=-trk,tcp6,action=ct(table=0)
> +in_port=2,ct_state=+trk+est-new,tcp6,action=1

Ditto.

> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl Without this sleep, we get occasional failures due to the following error:
> +dnl "connect: Cannot assign requested address"
> +sleep 2;
> +
> +dnl HTTP requests from ns0->ns1 should work fine.
> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py http6]], [http0.pid])
> +
> +NS_CHECK_EXEC([at_ns0], [wget http://[[fc00::2]] -t 3 -T 1 --retry-connrefused -v -o wget0.log])
> +
> +dnl HTTP requests from ns1->ns0 should fail due to network failure.
> +dnl Try 3 times, in 1 second intervals.
> +NETNS_DAEMONIZE([at_ns0], [[$PYTHON $srcdir/test-l7.py http6]], [http1.pid])
> +NS_CHECK_EXEC([at_ns1], [wget http://[[fc00::1]] -t 3 -T 1 -v -o wget1.log], [4])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - commit, recirc])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
> +
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
> +
> +dnl Allow any traffic from ns0->ns1, ns2->ns3.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=10,icmp,action=normal
> +in_port=1,tcp,ct_state=-trk,action=ct(commit,table=0)

Maybe some of the test cases should use non-zero table?

> +in_port=1,tcp,ct_state=+trk,action=2
> +in_port=2,tcp,ct_state=-trk,action=ct(table=0)
> +in_port=2,tcp,ct_state=+trk,action=1
> +in_port=3,tcp,ct_state=-trk,action=set_field:0->metadata,ct(table=0)
> +in_port=3,tcp,ct_state=+trk,metadata=0,action=set_field:1->metadata,ct(commit,table=0)
> +in_port=3,tcp,ct_state=+trk,metadata=1,action=4
> +in_port=4,tcp,ct_state=-trk,action=ct(commit,table=0)
> +in_port=4,tcp,ct_state=+trk,action=3
> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl HTTP requests from p0->p1 should work fine.
> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
> +
> +dnl HTTP requests from p2->p3 should work fine.
> +NETNS_DAEMONIZE([at_ns3], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
> +NS_CHECK_EXEC([at_ns2], [wget 10.1.1.4 -t 3 -T 1 --retry-connrefused -v -o wget1.log])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - preserve registers])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
> +
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
> +
> +dnl Allow any traffic from ns0->ns1, ns2->ns3.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=10,icmp,action=normal
> +in_port=1,tcp,ct_state=-trk,action=ct(commit,table=0)
> +in_port=1,tcp,ct_state=+trk,action=2
> +in_port=2,tcp,ct_state=-trk,action=ct(table=0)
> +in_port=2,tcp,ct_state=+trk,action=1
> +in_port=3,tcp,ct_state=-trk,action=load:0->NXM_NX_REG0[[]],ct(table=0)
> +in_port=3,tcp,ct_state=+trk,reg0=0,action=load:1->NXM_NX_REG0[[]],ct(commit,table=0)
> +in_port=3,tcp,ct_state=+trk,reg0=1,action=4
> +in_port=4,tcp,ct_state=-trk,action=ct(commit,table=0)
> +in_port=4,tcp,ct_state=+trk,action=3
> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl HTTP requests from p0->p1 should work fine.
> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
> +
> +dnl HTTP requests from p2->p3 should work fine.
> +NETNS_DAEMONIZE([at_ns3], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
> +NS_CHECK_EXEC([at_ns2], [wget 10.1.1.4 -t 3 -T 1 --retry-connrefused -v -o wget1.log])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - invalid])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
> +
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.

Update comment?

> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=10,icmp,action=normal
> +in_port=1,tcp,action=ct(),2
> +in_port=2,ct_state=-trk,tcp,action=ct(table=0)
> +in_port=2,ct_state=+trk+new,tcp,action=1

Here “+new” should be sufficient, in general, “+trk” is not needed if any of the other bits are matched for being set, right?

> +in_port=3,tcp,action=ct(),4
> +in_port=4,ct_state=-trk,tcp,action=ct(table=0)
> +in_port=4,ct_state=+trk+inv,tcp,action=3
> +in_port=4,ct_state=+trk+new,tcp,action=3
> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl We set up our rules to allow the request without committing. The return
> +dnl traffic can't be identified, because the initial request wasn't committed.
> +dnl For the first pair of ports, this means that the connection fails.
> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log], [4])
> +
> +dnl For the second pair, we allow packets from invalid connections, so it works.
> +NETNS_DAEMONIZE([at_ns3], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
> +NS_CHECK_EXEC([at_ns2], [wget 10.1.1.4 -t 3 -T 1 --retry-connrefused -v -o wget1.log])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - zones])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
> +
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.

Does this comment need an update?

> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=10,icmp,action=normal
> +in_port=1,tcp,action=ct(commit,zone=1),2
> +in_port=2,ct_state=-trk,tcp,action=ct(table=0,zone=1)
> +in_port=2,ct_state=+trk,ct_zone=1,tcp,action=1
> +in_port=3,tcp,action=ct(commit,zone=2),4
> +in_port=4,ct_state=-trk,tcp,action=ct(table=0,zone=2)
> +in_port=4,ct_state=+trk,ct_zone=1,tcp,action=3
> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl HTTP requests from p0->p1 should work fine.
> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
> +
> +AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
> +TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 zone=1 use=1
> +])
> +
> +dnl HTTP requests from p2->p3 should fail due to network failure.
> +dnl Try 3 times, in 1 second intervals.
> +NETNS_DAEMONIZE([at_ns3], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
> +NS_CHECK_EXEC([at_ns2], [wget 10.1.1.4 -t 3 -T 1 -v -o wget1.log], [4])
> +
> +AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.4)], [0], [dnl
> +SYN_RECV src=10.1.1.3 dst=10.1.1.4 sport=<cleared> dport=<cleared> src=10.1.1.4 dst=10.1.1.3 sport=<cleared> dport=<cleared> mark=0 zone=2 use=1
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - zones from field])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
> +
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=10,icmp,action=normal
> +in_port=1,tcp,action=load:1->NXM_NX_REG0[[0..15]],ct(commit,zone=NXM_NX_REG0[[0..15]]),2
> +in_port=2,ct_state=-trk,tcp,action=load:1->NXM_NX_REG0[[0..15]],ct(table=0,zone=NXM_NX_REG0[[0..15]])
> +in_port=2,ct_state=+trk,ct_zone=1,tcp,action=1
> +in_port=3,tcp,action=load:2->NXM_NX_REG0[[0..15]],ct(commit,zone=NXM_NX_REG0[[0..15]]),4
> +in_port=4,ct_state=-trk,tcp,action=load:2->NXM_NX_REG0[[0..15]],ct(table=0,zone=NXM_NX_REG0[[0..15]])
> +in_port=4,ct_state=+trk,ct_zone=1,tcp,action=3
> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl HTTP requests from p0->p1 should work fine.
> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
> +
> +AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
> +TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 zone=1 use=1
> +])
> +
> +dnl HTTP requests from p2->p3 should fail due to network failure.
> +dnl Try 3 times, in 1 second intervals.
> +NETNS_DAEMONIZE([at_ns3], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
> +NS_CHECK_EXEC([at_ns2], [wget 10.1.1.4 -t 3 -T 1 -v -o wget1.log], [4])
> +
> +AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.4)], [0], [dnl
> +SYN_RECV src=10.1.1.3 dst=10.1.1.4 sport=<cleared> dport=<cleared> src=10.1.1.4 dst=10.1.1.3 sport=<cleared> dport=<cleared> mark=0 zone=2 use=1
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - multiple bridges])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone --\
> +    add-br br1 --\
> +    add-port br0 patch+ -- set int patch+ type=patch options:peer=patch- --\
> +    add-port br1 patch- -- set int patch- type=patch options:peer=patch+ --])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br1, "10.1.1.2/24")
> +
> +dnl Allow any traffic from ns0->ns1, allow established in reverse.
> +AT_DATA([flows-br0.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=10,icmp,action=normal
> +in_port=2,tcp,ct_state=-trk,action=ct(commit,zone=1),1
> +in_port=1,tcp,ct_state=-trk,action=ct(table=0,zone=1)
> +in_port=1,tcp,ct_state=+trk+est,ct_zone=1,action=2
> +])
> +
> +dnl Allow any traffic from ns0->ns1, allow established in reverse.

Should this comment be different from the one above??

> +AT_DATA([flows-br1.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=10,icmp,action=normal
> +in_port=1,tcp,ct_state=-trk,action=ct(table=0,zone=2)
> +in_port=1,tcp,ct_state=+trk+new,ct_zone=2,action=ct(commit,zone=2),2
> +in_port=1,tcp,ct_state=+trk+est,ct_zone=2,action=2
> +in_port=2,tcp,ct_state=-trk,action=ct(table=0,zone=2)
> +in_port=2,tcp,ct_state=+trk+est,ct_zone=2,action=ct(commit,zone=2),1
> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows-br0.txt])
> +AT_CHECK([ovs-ofctl add-flows br1 flows-br1.txt])
> +
> +dnl HTTP requests from p0->p1 should work fine.
> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - multiple zones])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
> +
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.

How about ns2 and ns3?

> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +priority=10,icmp,action=normal
> +in_port=1,tcp,action=ct(commit,zone=1),ct(commit,zone=2),2
> +in_port=2,ct_state=-trk,tcp,action=ct(table=0,zone=2)
> +in_port=2,ct_state=+trk,ct_zone=2,tcp,action=1
> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +dnl HTTP requests from p0->p1 should work fine.
> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
> +
> +dnl (again) HTTP requests from p0->p1 should work fine.
> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
> +
> +AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
> +SYN_SENT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> [[UNREPLIED]] src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> mark=0 zone=1 use=1
> +TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 zone=2 use=1
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
> +AT_SETUP([conntrack - ICMP related 2])
> +CHECK_CONNTRACK()
> +OVS_TRAFFIC_VSWITCHD_START(
> +   [set-fail-mode br0 standalone -- ])
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +
> +ADD_VETH(p0, at_ns0, br0, "172.16.0.1/24")
> +ADD_VETH(p1, at_ns1, br0, "172.16.0.2/24")
> +
> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
> +AT_DATA([flows.txt], [dnl
> +priority=1,action=drop
> +priority=10,arp,action=normal
> +in_port=1,ct_state=-trk,udp,action=ct(commit,table=0)
> +in_port=1,ct_state=+trk,actions=controller
> +in_port=2,ct_state=-trk,action=ct(table=0)
> +in_port=2,ct_state=+trk-inv-new,action=controller

Could there be a match on +rel here?

> +])
> +
> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> +
> +AT_CAPTURE_FILE([ofctl_monitor.log])
> +AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
> +
> +dnl 1. Send an ICMP port unreach reply for port 8738, without any previous request
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) 'f64c473528c9c6f54ecb72db080045c0003d2e8700004001f355ac100004ac1000030303553f0000000045000021317040004011b138ac100003ac10000411112222000d20966369616f0a'])
> +
> +dnl 2. Send and UDP packet to port 5555
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 1 ct\(commit,table=0\) 'c6f94ecb72dbe64c473528c9080045000021317040004011b138ac100001ac100002a28e15b3000d20966369616f0a'])
> +
> +dnl 3. Send an ICMP port unreach reply for port 5555, related to the first packet
> +AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) 'e64c473528c9c6f94ecb72db080045c0003d2e8700004001f355ac100002ac1000010303553f0000000045000021317040004011b138ac100001ac100002a28e15b3000d20966369616f0a'])
> +
> +dnl Check this output. We only see the latter two packets, not the first.
> +AT_CHECK([cat ofctl_monitor.log], [0], [dnl
> +NXT_PACKET_IN (xid=0x0): cookie=0x0 total_len=47 ct_state=new|trk,in_port=1 (via action) data_len=47 (unbuffered)
> +udp,vlan_tci=0x0000,dl_src=e6:4c:47:35:28:c9,dl_dst=c6:f9:4e:cb:72:db,nw_src=172.16.0.1,nw_dst=172.16.0.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=41614,tp_dst=5555 udp_csum:2096
> +NXT_PACKET_IN (xid=0x0): cookie=0x0 total_len=75 ct_state=rel|rpl|trk,in_port=2 (via action) data_len=75 (unbuffered)
> +icmp,vlan_tci=0x0000,dl_src=c6:f9:4e:cb:72:db,dl_dst=e6:4c:47:35:28:c9,nw_src=172.16.0.2,nw_dst=172.16.0.1,nw_tos=192,nw_ecn=0,nw_ttl=64,icmp_type=3,icmp_code=3 icmp_csum:553f
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> diff --git a/tests/system-userspace-macros.at b/tests/system-userspace-macros.at
> index fca26f7..16256cb 100644
> --- a/tests/system-userspace-macros.at
> +++ b/tests/system-userspace-macros.at
> @@ -37,3 +37,12 @@ m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
> /dpif_netlink.*Generic Netlink family 'ovs_datapath' does not exist. The Open vSwitch kernel module is probably not loaded./d"])
>    AT_CHECK([:; $2])
>   ])
> +
> +# CHECK_CONNTRACK()
> +#
> +# Perform requirements checks for running conntrack tests, and flush the
> +# kernel conntrack tables when the test is finished.
> +#
> +m4_define([CHECK_CONNTRACK],
> +    [AT_SKIP_IF(true)]
> +)
> diff --git a/tests/test-l7.py b/tests/test-l7.py
> new file mode 100755
> index 0000000..65c6c2a
> --- /dev/null
> +++ b/tests/test-l7.py
> @@ -0,0 +1,72 @@
> +# Copyright (c) 2015 Nicira, Inc.
> +#
> +# Licensed under the Apache License, Version 2.0 (the "License");
> +# you may not use this file except in compliance with the License.
> +# You may obtain a copy of the License at:
> +#
> +#     http://www.apache.org/licenses/LICENSE-2.0
> +#
> +# Unless required by applicable law or agreed to in writing, software
> +# distributed under the License is distributed on an "AS IS" BASIS,
> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> +# See the License for the specific language governing permissions and
> +# limitations under the License.
> +
> +import argparse
> +import socket
> +
> +from BaseHTTPServer import HTTPServer
> +from SimpleHTTPServer import SimpleHTTPRequestHandler
> +from SocketServer import TCPServer
> +
> +
> +class TCPServerV6(HTTPServer):
> +    address_family = socket.AF_INET6
> +
> +
> +def get_ftpd():
> +    try:
> +        from pyftpdlib.authorizers import DummyAuthorizer
> +        from pyftpdlib.handlers import FTPHandler
> +        from pyftpdlib.servers import FTPServer
> +
> +        class OVSFTPHandler(FTPHandler):
> +            authorizer = DummyAuthorizer()
> +            authorizer.add_anonymous("/tmp")
> +        server = [FTPServer, OVSFTPHandler, 21]
> +    except ImportError:
> +        server = None
> +        pass
> +    return server
> +
> +
> +def main():
> +    SERVERS = {
> +        'http':  [TCPServer,   SimpleHTTPRequestHandler, 80],
> +        'http6': [TCPServerV6, SimpleHTTPRequestHandler, 80],
> +    }
> +
> +    ftpd = get_ftpd()
> +    if ftpd is not None:
> +        SERVERS['ftp'] = ftpd
> +
> +    protocols = [srv for srv in SERVERS]
> +    parser = argparse.ArgumentParser(
> +            description='Run basic application servers.')
> +    parser.add_argument('proto', default='http', nargs='?',
> +            help='protocol to serve (%s)' % protocols)
> +    args = parser.parse_args()
> +
> +    if args.proto not in SERVERS:
> +        parser.print_help()
> +        exit(1)
> +
> +    constructor = SERVERS[args.proto][0]
> +    handler = SERVERS[args.proto][1]
> +    port = SERVERS[args.proto][2]
> +    srv = constructor(('', port), handler)
> +    srv.serve_forever()
> +
> +
> +if __name__ == '__main__':
> +    main()
> diff --git a/tests/test-odp.c b/tests/test-odp.c
> index 3e7939e..cb245d6 100644
> --- a/tests/test-odp.c
> +++ b/tests/test-odp.c
> @@ -57,7 +57,10 @@ parse_keys(bool wc_keys)
>             struct odp_flow_key_parms odp_parms = {
>                 .flow = &flow,
>                 .support = {
> +                    .max_mpls_depth = SIZE_MAX,

Why this?

>                     .recirc = true,
> +                    .ct_state = true,
> +                    .ct_zone = true,
>                 },
>             };
> 
> diff --git a/utilities/ovs-ofctl.8.in b/utilities/ovs-ofctl.8.in
> index 8bb3715..7e5fcaa 100644
> --- a/utilities/ovs-ofctl.8.in
> +++ b/utilities/ovs-ofctl.8.in
> @@ -1305,6 +1305,65 @@ is used only with the \fBconjunction\fR action (see below).
> .IP
> This field was introduced in Open vSwitch 2.4.
> .
> +.IP \fBct_state=\fIflags\fB/\fImask\fR
> +.IQ \fBct_state=\fR[\fB+\fIflag\fR...][\fB-\fIflag\fR...]
> +Bitwise match on connection state flags.  The flags are only available
> +after a call to the \fBct\fR action with the \fBtable\fR specified.
> +
> +.IP
> +The \fIflags\fR and \fImask\fR are 8-bit numbers written in decimal or
> +in hexadecimal prefixed by \fB0x\fR.  Each 1-bit in \fImask\fR requires
> +that the corresponding bit in \fIflags\fR must match.  Each 0-bit in
> +\fImask\fR causes the corresponding bit to be ignored.
> +.IP
> +Alternatively, the flags can be specified by their symbolic names
> +(listed below), each preceded by either \fB+\fR for a flag that must
> +be set, or \fB\-\fR for a flag that must be unset, without any other
> +delimiters between the flags.  Flags not mentioned are wildcarded.  For
> +example, \fBtcp,ct_state=+trk\-new\fR matches TCP packets that
> +have been run through the connection tracker and do not establish a new
> +flow.
> +.IP
> +The following flags describe the state of the tracking:
> +.RS
> +.IP "\fB0x80: trk\fR"
> +Connection tracking has occurred.
> +.IP "\fB0x40: rpl\fR"
> +The flow is in the reply direction, meaning it did not initiate the
> +connection.
> +.IP "\fB0x20: inv\fR"
> +The flow is invalid, meaning that the connection tracker couldn't identify the
> +connection.
> +.RS
> +.PP
> +This flag may be set for the following reasons:
> +.RS
> +L3/L4 protocol handler is not loaded/unavailable. With the Linux kernel
> +datapath, this may mean that the "nf_conntrack_ipv4" or "nf_conntrack_ipv6"
> +modules are not loaded.
> +.PP
> +L3/L4 protocol handler determines that the packet is malformed or invalid.
> +.PP
> +Packets are unexpected length for protocol.
> +.RE
> +.RE
> +.IP "\fB0x01: new\fR"
> +This is the beginning of a new connection.
> +.IP "\fB0x02: est\fR"
> +This is part of an already existing connection.
> +.IP "\fB0x04: rel\fR"
> +This is a connection that is related to an existing connection, for
> +instance ICMP "destination unreachable" messages or FTP data connections.
> +.RE
> +.
> +.PP
> +The following fields are data associated with the connection tracker and
> +can only be matched or set after running through the connection tracker
> +by using the \fBct\fR action.
> +.
> +.IP \fBct_zone=\fIvalue
> +Matches connection zone \fIvalue\fR exactly.
> +.
> .PP
> Defining IPv6 flows (those with \fBdl_type\fR equal to 0x86dd) requires
> support for NXM.  The following shorthand notations are available for
> @@ -1542,6 +1601,32 @@ OpenFlow implementations do not support queuing at all.
> Restores the queue to the value it was before any \fBset_queue\fR
> actions were applied.
> .
> +.IP \fBct\fR
> +.IQ \fBct\fB(\fR[\fIargument\fR][\fB,\fIargument\fR...]\fB)
> +Send the packet through the connection tracker.  The following arguments
> +are supported:
> +
> +.RS
> +.IP \fBcommit\fR
> +Commit the flow to the connection tracking module.
> +.IP \fBtable=\fInumber\fR
> +After ct processing, the packet should be re-inserted into OpenFlow pipeline
> +processing in table \fInumber\fR, with the \fBct_state\fR and other ct match
> +fields set. It is strongly recommended to use a table later than the current
> +table to prevent loops.
> +.IP \fBzone=\fIvalue\fR
> +.IQ \fBzone=\fIsrc\fB[\fIstart\fB..\fIend\fB]\fR
> +A 16-bit context id that can be used to isolate connections into separate
> +domains, allowing overlapping network addresses in different zones. If a zone
> +is not provided, then the default is to use zone zero. The \fBzone\fR may be
> +specified either as an immediate 16-bit \fIvalue\fR, or may be provided from an
> +NXM field \fIsrc\fR. The \fIstart\fR and \fIend\fR pair are inclusive, and must
> +specify a 16-bit range within the field.
> +.RE
> +.IP
> +Currently, connection tracking is only available on Linux kernels with the
> +conntrack module loaded.
> +.
> .IP \fBdec_ttl\fR
> .IQ \fBdec_ttl\fB[\fR(\fIid1,id2\fI)\fR]\fR
> Decrement TTL of IPv4 packet or hop limit of IPv6 packet.  If the
> -- 
> 2.1.4
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
Joe Stringer Sept. 11, 2015, 9:44 p.m. UTC | #9
On 11 September 2015 at 14:02, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>>> - new and reply are mutually exclusive (reply direction is defined as opposite of the initial +new)
>>
>> Strictly speaking I don't think this is the case. I don't have an
>> example, but it is possible for this to be set in the conntrack info
>> specified in the Linux interface, see
>> linux/netfilter/nf_conntrack_common.h. ct_state is translated roughly
>> from this and potentially other pieces of conntrack information.
>>
>
> While IP_CT_NEW_REPLY is defined, it is only used by openvswitch/connntrack.c, and the next line says “(no NEW in reply dirn.)”
>
> So it looks like IP_CT_NEW_REPLY should not even be defined. I have prepared a patch to remove the IP_CT_NEW_REPLY definition, but have no immediate plans to send it out.
>
> As said, it seems to me that reply direction is defined as the opposing direction to the packet that initially created the conntrack entry. I’m still a bit confused about related vs. related_reply, though.

Fair enough. I can describe this constraint in the documentation.

I don't have an answer for related_reply right now, but I can
investigate and tell you whether the "reply" refers to the original
connection or the related connection.

>>> - related and new can be set at the same time (but does related remain set for the duration of the connection, or is it also mutually exclusive with established?)
>>
>> Related can be present with any other bit. I believe that it remains
>> set for the duration of the connection. Looks like a good candidate
>> for another test in the testsuite.
>>
>
> ctinfo enum can not hold related and established at the same time, though.

True, but the ovs key can (and does). Each time we perform the ct
lookup, we translate the state, then, if there is a related
(ct->master) connection, set the related bit.

>>> - apparently related and reply are not mutually exclusive, but when are they set at the same time?
>>
>> I see "related" as another bit that is independent of all of the
>> others; whether the connection is related or not, the packets can
>> still be part of a new connection, established, reply, etc. Perhaps
>> the more pressing question is exactly what "reply" means in this
>> context: Is it referring to the direction of the original connection,
>> or only the related connection? This is something I can further
>> clarify.
>>
>
> in ctinfo enum related, new, and established are all mutually exclusive. In OVS ct_state this is not always the case, though, as we explicitly set the new bit for related, too. I agree that it would be nice to be able to persist the related bit also when the connection state is “established”.

I agree, and I'm pretty sure this is what we do already (but I don't
have a test to explicitly check that this is the behaviour).

>>>> + * sent through the connection tracker, and pipeline processing for the cloned
>>>> + * packet will continue in the OpenFlow table specified by this value. The
>>>> + * NXM_NX_CT_* fields will be populated for the subsequent processing. It is
>>>> + * strongly recommended that this table is later than the current table, to
>>>> + * prevent loops.
>>>> + *
>>>
>>> ‘alg’ and ofpacts are introduced in later patches, and I guess it is too much work to not deal with them here yet?
>>
>> I could go back and adjust the definition to place padding there and
>> drop the ofpacts instead.
>
> Seems like a lot of work, though, essentially only for reviewing purposes. Up to you :-)

Guess why I didn't do it in the first place :-)

>>> What about the version?
>>
>> When decoding the CT action, we have lost the context of which
>> OpenFlow version we are decoding. So we can't determine whether the
>> nested actions are valid for the current OpenFlow version. Having
>> version=OFP10_VERSION has worked in all of the cases I have tried so
>> far, perhaps because the nicira extensions extend all openflow
>> versions. I'm not sure if there's a more correct way to apply this, or
>> if passing OFP10_VERSION is sufficient here. Thoughts?
>>
>
> Not sure. Maybe test with each OpenFlow version and see what happens?

I also tried OFP11_VERSION, and for the tests I have, there is no
difference. Maybe it's entirely incidental. I haven't followed the
code closely enough to discover any potential pitfalls.

Perhaps this could just be highlighted, eg at the top of the function,
something like:

int version = OFP10_VERSION; /* As this is an nx-action, all features
are available so OFP10_VERSION is sufficient to parse all possible
nested actions. */

>> This is "coincidental". Strictly speaking the OpenFlow wire format for
>> these flags should be translated into an internal representation,
>> which should be subsequently translated into the equivalent datapath
>> representation, but in practice they're all the same. Perhaps this
>> would be more explicit to define them using the netlink interface
>> definitions:
>>
>> #define CS_NEW    OVS_CS_F_NEW
>> ….
>
> Would the current code work if the mapping was not 1:1? If yes, then it does not matter. If not, then we should at least assert that the mapping stays 1:1 also in the future.

No; the state is copied directly from the nx_action_conntrack ->
ofpact_conntrack -> nla_put_*(). Adding an assert sounds prudent, I'll
do this.

> Btw. I got the “make check-kernel” run successfully for this patch :-)

Great, hoping to spread understanding of this feature so people can
build more datapath tests :D
Joe Stringer Sept. 11, 2015, 11:15 p.m. UTC | #10
On 11 September 2015 at 14:42, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>> index 7dbed68..de6b016 100644
>> --- a/tests/system-traffic.at
>> +++ b/tests/system-traffic.at
>> @@ -139,3 +139,472 @@ NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PI
>>
>> OVS_TRAFFIC_VSWITCHD_STOP
>> AT_CLEANUP
>> +
>> +AT_SETUP([conntrack - controller])
>> +CHECK_CONNTRACK()
>> +OVS_TRAFFIC_VSWITCHD_START(
>> +   [set-fail-mode br0 standalone -- ])
>> +
>> +ADD_NAMESPACES(at_ns0, at_ns1)
>> +
>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>> +
>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>> +AT_DATA([flows.txt], [dnl
>> +priority=1,action=drop
>> +priority=10,arp,action=normal
>> +in_port=1,udp,action=ct(commit),controller
>> +in_port=2,ct_state=-trk,udp,action=ct(table=0)
>> +in_port=2,ct_state=+trk+est-new,udp,action=controller
>
> ct_state=+est should be sufficient here?

True, +est-new is a little redundant.

>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>> +AT_DATA([flows.txt], [dnl
>> +priority=1,action=drop
>> +priority=10,arp,action=normal
>> +priority=10,icmp,action=normal
>> +in_port=1,tcp,action=ct(commit),2
>> +in_port=2,ct_state=-trk,tcp,action=ct(table=0)
>> +in_port=2,ct_state=+trk+est-new,tcp,action=1
>
> Having explicit priorities here as well would be helpful. The comment above about “+est” should apply here as well.

I can update any/all tests that have priorities to either all provide
specific priorities, or use no priorities.

>> +])
>> +
>> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
>> +
>> +dnl Basic connectivity check.
>> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 >/dev/null])
>> +
>> +dnl HTTP requests from ns0->ns1 should work fine.
>> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
>> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
>
>
> wget was trying to resolve the proxy I had configured in the “http_proxy” environment variable. The resolution from the netns failed, making the test case fail. I had to clear “http_proxy" to make the test work.

Hmm, I don't have a good solution for this. Personally I always push
code to a remote test VM that I use specifically for the purpose of
running these tests, so any configuration I might have in my local
desktop environment wouldn't cause issues such as this. (This is
equivalent to how the Vagrant tests run)

>> +
>> +AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
>> +TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 use=1
>
> Do you know how/if [ASSURED] maps to ct_state flags?

I can look into it. My understanding is that it's equivalent to
"established", although it may be more aligned with our concept of
"committed".

>> +AT_SETUP([conntrack - commit, recirc])
>> +CHECK_CONNTRACK()
>> +OVS_TRAFFIC_VSWITCHD_START(
>> +   [set-fail-mode br0 standalone -- ])
>> +
>> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
>> +
>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
>> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
>> +
>> +dnl Allow any traffic from ns0->ns1, ns2->ns3.
>> +AT_DATA([flows.txt], [dnl
>> +priority=1,action=drop
>> +priority=10,arp,action=normal
>> +priority=10,icmp,action=normal
>> +in_port=1,tcp,ct_state=-trk,action=ct(commit,table=0)
>
> Maybe some of the test cases should use non-zero table?

Agreed, there's several directions that we can increase the coverage
in the testsuite. I'll add this to the list.

>> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
>> +
>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
>> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
>> +
>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>
> Update comment?

For ns2/ns3 as well? Sure.

>> +AT_DATA([flows.txt], [dnl
>> +priority=1,action=drop
>> +priority=10,arp,action=normal
>> +priority=10,icmp,action=normal
>> +in_port=1,tcp,action=ct(),2
>> +in_port=2,ct_state=-trk,tcp,action=ct(table=0)
>> +in_port=2,ct_state=+trk+new,tcp,action=1
>
> Here “+new” should be sufficient, in general, “+trk” is not needed if any of the other bits are matched for being set, right?

For several of these I have made the rule more explicit than strictly
necessary, so that if there is an unexpected combination of bits, then
those flows will hit the "drop" rule rather than silently matching the
accept rule.

>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
>> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
>> +
>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>
> Does this comment need an update?

Yes, I'll update it. I can see the fallacy of copy-paste test crafting
here: The comments are initially added to improve readability, but
when they're copied several times it's easy to miss the fact that
they're out of date. It wouldn't be a problem if there were no
comments, but then it would be more difficult to grok what's going on.

>> +dnl Allow any traffic from ns0->ns1, allow established in reverse.
>> +AT_DATA([flows-br0.txt], [dnl
>> +priority=1,action=drop
>> +priority=10,arp,action=normal
>> +priority=10,icmp,action=normal
>> +in_port=2,tcp,ct_state=-trk,action=ct(commit,zone=1),1
>> +in_port=1,tcp,ct_state=-trk,action=ct(table=0,zone=1)
>> +in_port=1,tcp,ct_state=+trk+est,ct_zone=1,action=2
>> +])
>> +
>> +dnl Allow any traffic from ns0->ns1, allow established in reverse.
>
> Should this comment be different from the one above??

I think it's saying the same thing, it just wasn't copy/pasted from
the earlier one. I'll review each of these comments.

>> +AT_SETUP([conntrack - ICMP related 2])
>> +CHECK_CONNTRACK()
>> +OVS_TRAFFIC_VSWITCHD_START(
>> +   [set-fail-mode br0 standalone -- ])
>> +
>> +ADD_NAMESPACES(at_ns0, at_ns1)
>> +
>> +ADD_VETH(p0, at_ns0, br0, "172.16.0.1/24")
>> +ADD_VETH(p1, at_ns1, br0, "172.16.0.2/24")
>> +
>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>> +AT_DATA([flows.txt], [dnl
>> +priority=1,action=drop
>> +priority=10,arp,action=normal
>> +in_port=1,ct_state=-trk,udp,action=ct(commit,table=0)
>> +in_port=1,ct_state=+trk,actions=controller
>> +in_port=2,ct_state=-trk,action=ct(table=0)
>> +in_port=2,ct_state=+trk-inv-new,action=controller
>
> Could there be a match on +rel here?

Yes, we could expand the flows to be more explicit about the exact
packets seen (looks like there's a "new" and a "related, reply"
packets)

Thanks for the thorough review,

>> diff --git a/tests/test-odp.c b/tests/test-odp.c
>> index 3e7939e..cb245d6 100644
>> --- a/tests/test-odp.c
>> +++ b/tests/test-odp.c
>> @@ -57,7 +57,10 @@ parse_keys(bool wc_keys)
>>             struct odp_flow_key_parms odp_parms = {
>>                 .flow = &flow,
>>                 .support = {
>> +                    .max_mpls_depth = SIZE_MAX,
>
> Why this?

I think this was intended to be in a separate patch. From memory, if
any of the .support fields are false, then we don't test ODP parsing
of that feature.

Cheers,
Joe
Jarno Rajahalme Sept. 11, 2015, 11:37 p.m. UTC | #11
Here is an provisional ACK, I trust you to address the comments to my satisfaction :-)

Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

  Jarno

> On Sep 11, 2015, at 4:15 PM, Joe Stringer <joestringer@nicira.com> wrote:
> 
> On 11 September 2015 at 14:42, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>>> index 7dbed68..de6b016 100644
>>> --- a/tests/system-traffic.at
>>> +++ b/tests/system-traffic.at
>>> @@ -139,3 +139,472 @@ NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PI
>>> 
>>> OVS_TRAFFIC_VSWITCHD_STOP
>>> AT_CLEANUP
>>> +
>>> +AT_SETUP([conntrack - controller])
>>> +CHECK_CONNTRACK()
>>> +OVS_TRAFFIC_VSWITCHD_START(
>>> +   [set-fail-mode br0 standalone -- ])
>>> +
>>> +ADD_NAMESPACES(at_ns0, at_ns1)
>>> +
>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>>> +
>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>>> +AT_DATA([flows.txt], [dnl
>>> +priority=1,action=drop
>>> +priority=10,arp,action=normal
>>> +in_port=1,udp,action=ct(commit),controller
>>> +in_port=2,ct_state=-trk,udp,action=ct(table=0)
>>> +in_port=2,ct_state=+trk+est-new,udp,action=controller
>> 
>> ct_state=+est should be sufficient here?
> 
> True, +est-new is a little redundant.
> 
>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>>> +AT_DATA([flows.txt], [dnl
>>> +priority=1,action=drop
>>> +priority=10,arp,action=normal
>>> +priority=10,icmp,action=normal
>>> +in_port=1,tcp,action=ct(commit),2
>>> +in_port=2,ct_state=-trk,tcp,action=ct(table=0)
>>> +in_port=2,ct_state=+trk+est-new,tcp,action=1
>> 
>> Having explicit priorities here as well would be helpful. The comment above about “+est” should apply here as well.
> 
> I can update any/all tests that have priorities to either all provide
> specific priorities, or use no priorities.
> 
>>> +])
>>> +
>>> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
>>> +
>>> +dnl Basic connectivity check.
>>> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 >/dev/null])
>>> +
>>> +dnl HTTP requests from ns0->ns1 should work fine.
>>> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
>>> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
>> 
>> 
>> wget was trying to resolve the proxy I had configured in the “http_proxy” environment variable. The resolution from the netns failed, making the test case fail. I had to clear “http_proxy" to make the test work.
> 
> Hmm, I don't have a good solution for this. Personally I always push
> code to a remote test VM that I use specifically for the purpose of
> running these tests, so any configuration I might have in my local
> desktop environment wouldn't cause issues such as this. (This is
> equivalent to how the Vagrant tests run)

Maybe a comment in some doc describing how to run check-kernel?



> 
>>> +
>>> +AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
>>> +TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 use=1
>> 
>> Do you know how/if [ASSURED] maps to ct_state flags?
> 
> I can look into it. My understanding is that it's equivalent to
> "established", although it may be more aligned with our concept of
> "committed".
> 
>>> +AT_SETUP([conntrack - commit, recirc])
>>> +CHECK_CONNTRACK()
>>> +OVS_TRAFFIC_VSWITCHD_START(
>>> +   [set-fail-mode br0 standalone -- ])
>>> +
>>> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
>>> +
>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>>> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
>>> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
>>> +
>>> +dnl Allow any traffic from ns0->ns1, ns2->ns3.
>>> +AT_DATA([flows.txt], [dnl
>>> +priority=1,action=drop
>>> +priority=10,arp,action=normal
>>> +priority=10,icmp,action=normal
>>> +in_port=1,tcp,ct_state=-trk,action=ct(commit,table=0)
>> 
>> Maybe some of the test cases should use non-zero table?
> 
> Agreed, there's several directions that we can increase the coverage
> in the testsuite. I'll add this to the list.
> 
>>> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
>>> +
>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>>> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
>>> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
>>> +
>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>> 
>> Update comment?
> 
> For ns2/ns3 as well? Sure.
> 
>>> +AT_DATA([flows.txt], [dnl
>>> +priority=1,action=drop
>>> +priority=10,arp,action=normal
>>> +priority=10,icmp,action=normal
>>> +in_port=1,tcp,action=ct(),2
>>> +in_port=2,ct_state=-trk,tcp,action=ct(table=0)
>>> +in_port=2,ct_state=+trk+new,tcp,action=1
>> 
>> Here “+new” should be sufficient, in general, “+trk” is not needed if any of the other bits are matched for being set, right?
> 
> For several of these I have made the rule more explicit than strictly
> necessary, so that if there is an unexpected combination of bits, then
> those flows will hit the "drop" rule rather than silently matching the
> accept rule.
> 
>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>>> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
>>> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
>>> +
>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>> 
>> Does this comment need an update?
> 
> Yes, I'll update it. I can see the fallacy of copy-paste test crafting
> here: The comments are initially added to improve readability, but
> when they're copied several times it's easy to miss the fact that
> they're out of date. It wouldn't be a problem if there were no
> comments, but then it would be more difficult to grok what's going on.
> 
>>> +dnl Allow any traffic from ns0->ns1, allow established in reverse.
>>> +AT_DATA([flows-br0.txt], [dnl
>>> +priority=1,action=drop
>>> +priority=10,arp,action=normal
>>> +priority=10,icmp,action=normal
>>> +in_port=2,tcp,ct_state=-trk,action=ct(commit,zone=1),1
>>> +in_port=1,tcp,ct_state=-trk,action=ct(table=0,zone=1)
>>> +in_port=1,tcp,ct_state=+trk+est,ct_zone=1,action=2
>>> +])
>>> +
>>> +dnl Allow any traffic from ns0->ns1, allow established in reverse.
>> 
>> Should this comment be different from the one above??
> 
> I think it's saying the same thing, it just wasn't copy/pasted from
> the earlier one. I'll review each of these comments.
> 
>>> +AT_SETUP([conntrack - ICMP related 2])
>>> +CHECK_CONNTRACK()
>>> +OVS_TRAFFIC_VSWITCHD_START(
>>> +   [set-fail-mode br0 standalone -- ])
>>> +
>>> +ADD_NAMESPACES(at_ns0, at_ns1)
>>> +
>>> +ADD_VETH(p0, at_ns0, br0, "172.16.0.1/24")
>>> +ADD_VETH(p1, at_ns1, br0, "172.16.0.2/24")
>>> +
>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>>> +AT_DATA([flows.txt], [dnl
>>> +priority=1,action=drop
>>> +priority=10,arp,action=normal
>>> +in_port=1,ct_state=-trk,udp,action=ct(commit,table=0)
>>> +in_port=1,ct_state=+trk,actions=controller
>>> +in_port=2,ct_state=-trk,action=ct(table=0)
>>> +in_port=2,ct_state=+trk-inv-new,action=controller
>> 
>> Could there be a match on +rel here?
> 
> Yes, we could expand the flows to be more explicit about the exact
> packets seen (looks like there's a "new" and a "related, reply"
> packets)
> 
> Thanks for the thorough review,
> 
>>> diff --git a/tests/test-odp.c b/tests/test-odp.c
>>> index 3e7939e..cb245d6 100644
>>> --- a/tests/test-odp.c
>>> +++ b/tests/test-odp.c
>>> @@ -57,7 +57,10 @@ parse_keys(bool wc_keys)
>>>            struct odp_flow_key_parms odp_parms = {
>>>                .flow = &flow,
>>>                .support = {
>>> +                    .max_mpls_depth = SIZE_MAX,
>> 
>> Why this?
> 
> I think this was intended to be in a separate patch. From memory, if
> any of the .support fields are false, then we don't test ODP parsing
> of that feature.
> 
> Cheers,
> Joe
Jarno Rajahalme Sept. 11, 2015, 11:42 p.m. UTC | #12
Conditional to Ben’s approval of the nested actions within the OF ct action, of course :-)

  Jarno

> On Sep 11, 2015, at 4:37 PM, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
> 
> Here is an provisional ACK, I trust you to address the comments to my satisfaction :-)
> 
> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
> 
>  Jarno
> 
>> On Sep 11, 2015, at 4:15 PM, Joe Stringer <joestringer@nicira.com> wrote:
>> 
>> On 11 September 2015 at 14:42, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>>>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>>>> index 7dbed68..de6b016 100644
>>>> --- a/tests/system-traffic.at
>>>> +++ b/tests/system-traffic.at
>>>> @@ -139,3 +139,472 @@ NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PI
>>>> 
>>>> OVS_TRAFFIC_VSWITCHD_STOP
>>>> AT_CLEANUP
>>>> +
>>>> +AT_SETUP([conntrack - controller])
>>>> +CHECK_CONNTRACK()
>>>> +OVS_TRAFFIC_VSWITCHD_START(
>>>> +   [set-fail-mode br0 standalone -- ])
>>>> +
>>>> +ADD_NAMESPACES(at_ns0, at_ns1)
>>>> +
>>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>>>> +
>>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>>>> +AT_DATA([flows.txt], [dnl
>>>> +priority=1,action=drop
>>>> +priority=10,arp,action=normal
>>>> +in_port=1,udp,action=ct(commit),controller
>>>> +in_port=2,ct_state=-trk,udp,action=ct(table=0)
>>>> +in_port=2,ct_state=+trk+est-new,udp,action=controller
>>> 
>>> ct_state=+est should be sufficient here?
>> 
>> True, +est-new is a little redundant.
>> 
>>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>>>> +AT_DATA([flows.txt], [dnl
>>>> +priority=1,action=drop
>>>> +priority=10,arp,action=normal
>>>> +priority=10,icmp,action=normal
>>>> +in_port=1,tcp,action=ct(commit),2
>>>> +in_port=2,ct_state=-trk,tcp,action=ct(table=0)
>>>> +in_port=2,ct_state=+trk+est-new,tcp,action=1
>>> 
>>> Having explicit priorities here as well would be helpful. The comment above about “+est” should apply here as well.
>> 
>> I can update any/all tests that have priorities to either all provide
>> specific priorities, or use no priorities.
>> 
>>>> +])
>>>> +
>>>> +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
>>>> +
>>>> +dnl Basic connectivity check.
>>>> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 >/dev/null])
>>>> +
>>>> +dnl HTTP requests from ns0->ns1 should work fine.
>>>> +NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
>>>> +NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
>>> 
>>> 
>>> wget was trying to resolve the proxy I had configured in the “http_proxy” environment variable. The resolution from the netns failed, making the test case fail. I had to clear “http_proxy" to make the test work.
>> 
>> Hmm, I don't have a good solution for this. Personally I always push
>> code to a remote test VM that I use specifically for the purpose of
>> running these tests, so any configuration I might have in my local
>> desktop environment wouldn't cause issues such as this. (This is
>> equivalent to how the Vagrant tests run)
> 
> Maybe a comment in some doc describing how to run check-kernel?
> 
> 
> 
>> 
>>>> +
>>>> +AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
>>>> +TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 use=1
>>> 
>>> Do you know how/if [ASSURED] maps to ct_state flags?
>> 
>> I can look into it. My understanding is that it's equivalent to
>> "established", although it may be more aligned with our concept of
>> "committed".
>> 
>>>> +AT_SETUP([conntrack - commit, recirc])
>>>> +CHECK_CONNTRACK()
>>>> +OVS_TRAFFIC_VSWITCHD_START(
>>>> +   [set-fail-mode br0 standalone -- ])
>>>> +
>>>> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
>>>> +
>>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>>>> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
>>>> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
>>>> +
>>>> +dnl Allow any traffic from ns0->ns1, ns2->ns3.
>>>> +AT_DATA([flows.txt], [dnl
>>>> +priority=1,action=drop
>>>> +priority=10,arp,action=normal
>>>> +priority=10,icmp,action=normal
>>>> +in_port=1,tcp,ct_state=-trk,action=ct(commit,table=0)
>>> 
>>> Maybe some of the test cases should use non-zero table?
>> 
>> Agreed, there's several directions that we can increase the coverage
>> in the testsuite. I'll add this to the list.
>> 
>>>> +ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
>>>> +
>>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>>>> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
>>>> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
>>>> +
>>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>>> 
>>> Update comment?
>> 
>> For ns2/ns3 as well? Sure.
>> 
>>>> +AT_DATA([flows.txt], [dnl
>>>> +priority=1,action=drop
>>>> +priority=10,arp,action=normal
>>>> +priority=10,icmp,action=normal
>>>> +in_port=1,tcp,action=ct(),2
>>>> +in_port=2,ct_state=-trk,tcp,action=ct(table=0)
>>>> +in_port=2,ct_state=+trk+new,tcp,action=1
>>> 
>>> Here “+new” should be sufficient, in general, “+trk” is not needed if any of the other bits are matched for being set, right?
>> 
>> For several of these I have made the rule more explicit than strictly
>> necessary, so that if there is an unexpected combination of bits, then
>> those flows will hit the "drop" rule rather than silently matching the
>> accept rule.
>> 
>>>> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
>>>> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
>>>> +ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
>>>> +ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
>>>> +
>>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>>> 
>>> Does this comment need an update?
>> 
>> Yes, I'll update it. I can see the fallacy of copy-paste test crafting
>> here: The comments are initially added to improve readability, but
>> when they're copied several times it's easy to miss the fact that
>> they're out of date. It wouldn't be a problem if there were no
>> comments, but then it would be more difficult to grok what's going on.
>> 
>>>> +dnl Allow any traffic from ns0->ns1, allow established in reverse.
>>>> +AT_DATA([flows-br0.txt], [dnl
>>>> +priority=1,action=drop
>>>> +priority=10,arp,action=normal
>>>> +priority=10,icmp,action=normal
>>>> +in_port=2,tcp,ct_state=-trk,action=ct(commit,zone=1),1
>>>> +in_port=1,tcp,ct_state=-trk,action=ct(table=0,zone=1)
>>>> +in_port=1,tcp,ct_state=+trk+est,ct_zone=1,action=2
>>>> +])
>>>> +
>>>> +dnl Allow any traffic from ns0->ns1, allow established in reverse.
>>> 
>>> Should this comment be different from the one above??
>> 
>> I think it's saying the same thing, it just wasn't copy/pasted from
>> the earlier one. I'll review each of these comments.
>> 
>>>> +AT_SETUP([conntrack - ICMP related 2])
>>>> +CHECK_CONNTRACK()
>>>> +OVS_TRAFFIC_VSWITCHD_START(
>>>> +   [set-fail-mode br0 standalone -- ])
>>>> +
>>>> +ADD_NAMESPACES(at_ns0, at_ns1)
>>>> +
>>>> +ADD_VETH(p0, at_ns0, br0, "172.16.0.1/24")
>>>> +ADD_VETH(p1, at_ns1, br0, "172.16.0.2/24")
>>>> +
>>>> +dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
>>>> +AT_DATA([flows.txt], [dnl
>>>> +priority=1,action=drop
>>>> +priority=10,arp,action=normal
>>>> +in_port=1,ct_state=-trk,udp,action=ct(commit,table=0)
>>>> +in_port=1,ct_state=+trk,actions=controller
>>>> +in_port=2,ct_state=-trk,action=ct(table=0)
>>>> +in_port=2,ct_state=+trk-inv-new,action=controller
>>> 
>>> Could there be a match on +rel here?
>> 
>> Yes, we could expand the flows to be more explicit about the exact
>> packets seen (looks like there's a "new" and a "related, reply"
>> packets)
>> 
>> Thanks for the thorough review,
>> 
>>>> diff --git a/tests/test-odp.c b/tests/test-odp.c
>>>> index 3e7939e..cb245d6 100644
>>>> --- a/tests/test-odp.c
>>>> +++ b/tests/test-odp.c
>>>> @@ -57,7 +57,10 @@ parse_keys(bool wc_keys)
>>>>           struct odp_flow_key_parms odp_parms = {
>>>>               .flow = &flow,
>>>>               .support = {
>>>> +                    .max_mpls_depth = SIZE_MAX,
>>> 
>>> Why this?
>> 
>> I think this was intended to be in a separate patch. From memory, if
>> any of the .support fields are false, then we don't test ODP parsing
>> of that feature.
>> 
>> Cheers,
>> Joe
>
Joe Stringer Sept. 12, 2015, 12:02 a.m. UTC | #13
On 11 September 2015 at 16:37, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
> Here is an provisional ACK, I trust you to address the comments to my satisfaction :-)
>
> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
>
>   Jarno

Thanks, but I may yet send a round 2; there's a lot of scattered
changes, so I wouldn't mind getting some extra viewing on the patches.

<snip>

>> Hmm, I don't have a good solution for this. Personally I always push
>> code to a remote test VM that I use specifically for the purpose of
>> running these tests, so any configuration I might have in my local
>> desktop environment wouldn't cause issues such as this. (This is
>> equivalent to how the Vagrant tests run)
>
> Maybe a comment in some doc describing how to run check-kernel?

I could've sworn that we had some description like this, but it seems
not. It's all hidden under the Vagrant how-to in INSTALL.md. Sounds
like a good idea.
Jarno Rajahalme Sept. 12, 2015, 12:09 a.m. UTC | #14
> On Sep 11, 2015, at 5:02 PM, Joe Stringer <joestringer@nicira.com> wrote:
> 
> On 11 September 2015 at 16:37, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>> Here is an provisional ACK, I trust you to address the comments to my satisfaction :-)
>> 
>> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
>> 
>>  Jarno
> 
> Thanks, but I may yet send a round 2; there's a lot of scattered
> changes, so I wouldn't mind getting


> some extra viewing on the patches.
> 

Exactly what I was trying to avoid ;-)

  Jarno

> <snip>
> 
>>> Hmm, I don't have a good solution for this. Personally I always push
>>> code to a remote test VM that I use specifically for the purpose of
>>> running these tests, so any configuration I might have in my local
>>> desktop environment wouldn't cause issues such as this. (This is
>>> equivalent to how the Vagrant tests run)
>> 
>> Maybe a comment in some doc describing how to run check-kernel?
> 
> I could've sworn that we had some description like this, but it seems
> not. It's all hidden under the Vagrant how-to in INSTALL.md. Sounds
> like a good idea.
diff mbox

Patch

diff --git a/NEWS b/NEWS
index ca22c8e..45ae059 100644
--- a/NEWS
+++ b/NEWS
@@ -21,6 +21,9 @@  Post-v2.4.0
      targets to run a new system testsuite.  These tests can be run inside
      a Vagrant box.  See INSTALL.md for details
    - Dropped support for GRE64 tunnel.
+   - Add support for connection tracking through the new "ct" action
+     and "ct_state"/"ct_zone" match fields.  Only available on Linux kernels
+     with the connection tracking module loaded.
 
 
 v2.4.0 - 20 Aug 2015
diff --git a/build-aux/extract-ofp-fields b/build-aux/extract-ofp-fields
index e0284f9..b0d9681 100755
--- a/build-aux/extract-ofp-fields
+++ b/build-aux/extract-ofp-fields
@@ -24,6 +24,7 @@  TYPES = {"u8":       (1,   False),
 
 FORMATTING = {"decimal":            ("MFS_DECIMAL",      1,   8),
               "hexadecimal":        ("MFS_HEXADECIMAL",  1, 127),
+              "conn state":         ("MFS_CT_STATE",     1,   1),
               "Ethernet":           ("MFS_ETHERNET",     6,   6),
               "IPv4":               ("MFS_IPV4",         4,   4),
               "IPv6":               ("MFS_IPV6",        16,  16),
diff --git a/datapath/flow_netlink.c b/datapath/flow_netlink.c
index 9115d15..8527fe0 100644
--- a/datapath/flow_netlink.c
+++ b/datapath/flow_netlink.c
@@ -281,7 +281,7 @@  size_t ovs_key_attr_size(void)
 	/* Whenever adding new OVS_KEY_ FIELDS, we should consider
 	 * updating this function.
 	 */
-	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 22);
+	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 24);
 
 	return    nla_total_size(4)   /* OVS_KEY_ATTR_PRIORITY */
 		+ nla_total_size(0)   /* OVS_KEY_ATTR_TUNNEL */
diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
index 578cd88..69bdf32 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -343,6 +343,8 @@  enum ovs_key_attr {
 	OVS_KEY_ATTR_MPLS,      /* array of struct ovs_key_mpls.
 				 * The implementation may restrict
 				 * the accepted length of the array. */
+	OVS_KEY_ATTR_CT_STATE,	/* u8 bitmask of OVS_CS_F_* */
+	OVS_KEY_ATTR_CT_ZONE,	/* u16 connection tracking zone. */
 
 #ifdef __KERNEL__
 	/* Only used within kernel data path. */
@@ -456,6 +458,15 @@  struct ovs_key_nd {
 	__u8	nd_tll[ETH_ALEN];
 };
 
+/* OVS_KEY_ATTR_CT_STATE flags */
+#define OVS_CS_F_NEW               0x01 /* Beginning of a new connection. */
+#define OVS_CS_F_ESTABLISHED       0x02 /* Part of an existing connection. */
+#define OVS_CS_F_RELATED           0x04 /* Related to an established
+					 * connection. */
+#define OVS_CS_F_INVALID           0x20 /* Could not track connection. */
+#define OVS_CS_F_REPLY_DIR         0x40 /* Flow is in the reply direction. */
+#define OVS_CS_F_TRACKED           0x80 /* Conntrack has occurred. */
+
 /**
  * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
  * @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
@@ -642,6 +653,28 @@  struct ovs_action_push_tnl {
 #endif
 
 /**
+ * enum ovs_ct_attr - Attributes for %OVS_ACTION_ATTR_CT action.
+ * @OVS_CT_ATTR_FLAGS: u32 connection tracking flags.
+ * @OVS_CT_ATTR_ZONE: u16 connection tracking zone.
+ */
+enum ovs_ct_attr {
+	OVS_CT_ATTR_UNSPEC,
+	OVS_CT_ATTR_FLAGS,      /* u32 bitmask of OVS_CT_F_*. */
+	OVS_CT_ATTR_ZONE,       /* u16 zone id. */
+	__OVS_CT_ATTR_MAX
+};
+
+#define OVS_CT_ATTR_MAX (__OVS_CT_ATTR_MAX - 1)
+
+/*
+ * OVS_CT_ATTR_FLAGS flags - bitmask of %OVS_CT_F_*
+ * @OVS_CT_F_COMMIT: Commits the flow to the conntrack table. This allows
+ * future packets for the same connection to be identified as 'established'
+ * or 'related'.
+ */
+#define OVS_CT_F_COMMIT		0x01
+
+/**
  * enum ovs_action_attr - Action types.
  *
  * @OVS_ACTION_ATTR_OUTPUT: Output packet to port.
@@ -672,6 +705,8 @@  struct ovs_action_push_tnl {
  * indicate the new packet contents. This could potentially still be
  * %ETH_P_MPLS if the resulting MPLS label stack is not empty.  If there
  * is no MPLS label stack, as determined by ethertype, no action is taken.
+ * @OVS_ACTION_ATTR_CT: Track the connection. Populate the conntrack-related
+ * entries in the flow key.
  *
  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -702,6 +737,7 @@  enum ovs_action_attr {
 				       * data immediately followed by a mask.
 				       * The data must be zero for the unmasked
 				       * bits. */
+	OVS_ACTION_ATTR_CT,           /* One nested OVS_CT_ATTR_* . */
 
 #ifndef __KERNEL__
 	OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index db76290..0fc5f8c 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -1920,6 +1920,11 @@  dpif_netdev_flow_from_nlattrs(const struct nlattr *key, uint32_t key_len,
         return EINVAL;
     }
 
+    /* Userspace datapath doesn't support conntrack. */
+    if (flow->ct_state || flow->ct_zone) {
+        return EINVAL;
+    }
+
     return 0;
 }
 
@@ -3595,6 +3600,13 @@  dp_execute_cb(void *aux_, struct dp_packet **packets, int cnt,
         VLOG_WARN("Packet dropped. Max recirculation depth exceeded.");
         break;
 
+    case OVS_ACTION_ATTR_CT:
+        /* If a flow with this action is slow-pathed, datapath assistance is
+         * required to implement it. However, we don't support this action
+         * in the userspace datapath. */
+        VLOG_WARN("Cannot execute conntrack action in userspace.");
+        break;
+
     case OVS_ACTION_ATTR_PUSH_VLAN:
     case OVS_ACTION_ATTR_POP_VLAN:
     case OVS_ACTION_ATTR_PUSH_MPLS:
diff --git a/lib/dpif.c b/lib/dpif.c
index 9a67a24..c03aa1d 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -1097,6 +1097,7 @@  dpif_execute_helper_cb(void *aux_, struct dp_packet **packets, int cnt,
     ovs_assert(cnt == 1);
 
     switch ((enum ovs_action_attr)type) {
+    case OVS_ACTION_ATTR_CT:
     case OVS_ACTION_ATTR_OUTPUT:
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
     case OVS_ACTION_ATTR_TUNNEL_POP:
diff --git a/lib/flow.c b/lib/flow.c
index 84048e8..032df14 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -199,6 +199,23 @@  BUILD_MESSAGE("FLOW_WC_SEQ changed: miniflow_extract() will have runtime "
     }                                           \
 }
 
+#define miniflow_push_uint8_(MF, OFS, VALUE)                    \
+{                                                               \
+    MINIFLOW_ASSERT(MF.data < MF.end)                           \
+                                                                \
+    if ((OFS) % 8 == 0) {                                       \
+        miniflow_set_map(MF, OFS / 8);                          \
+        *(uint8_t *)MF.data = VALUE;                            \
+    } else if ((OFS) % 8 < 7) {                                 \
+        miniflow_assert_in_map(MF, OFS / 8);                    \
+        *((uint8_t *)MF.data + (OFS % 8)) = VALUE;              \
+    } else {                                                    \
+        miniflow_assert_in_map(MF, OFS / 8);                    \
+        *((uint8_t *)MF.data + 7) = VALUE;                      \
+        MF.data++;                                              \
+    }                                                           \
+}
+
 #define miniflow_pad_to_64_(MF, OFS)                            \
 {                                                               \
     MINIFLOW_ASSERT((OFS) % 8 != 0);                            \
@@ -262,6 +279,9 @@  BUILD_MESSAGE("FLOW_WC_SEQ changed: miniflow_extract() will have runtime "
 #define miniflow_push_be16(MF, FIELD, VALUE)                        \
     miniflow_push_be16_(MF, offsetof(struct flow, FIELD), VALUE)
 
+#define miniflow_push_uint8(MF, FIELD, VALUE)                      \
+    miniflow_push_uint8_(MF, offsetof(struct flow, FIELD), VALUE)
+
 #define miniflow_pad_to_64(MF, FIELD)                       \
     miniflow_pad_to_64_(MF, offsetof(struct flow, FIELD))
 
@@ -482,6 +502,12 @@  miniflow_extract(struct dp_packet *packet, struct miniflow *dst)
         miniflow_pad_to_64(mf, conj_id);
     }
 
+    if (md->ct_state) {
+        miniflow_push_uint16(mf, ct_zone, md->ct_zone);
+        miniflow_push_uint8(mf, ct_state, md->ct_state);
+        miniflow_pad_to_64(mf, pad1);
+    }
+
     /* Initialize packet's layer pointer and offsets. */
     l2 = data;
     dp_packet_reset_offsets(packet);
@@ -832,6 +858,12 @@  flow_get_metadata(const struct flow *flow, struct match *flow_metadata)
     }
 
     match_set_in_port(flow_metadata, flow->in_port.ofp_port);
+    if (flow->ct_state != 0) {
+        match_set_ct_state(flow_metadata, flow->ct_state);
+    }
+    if (flow->ct_zone != 0) {
+        match_set_ct_zone(flow_metadata, flow->ct_zone);
+    }
 }
 
 char *
@@ -1107,6 +1139,12 @@  flow_format(struct ds *ds, const struct flow *flow)
     if (!flow->dp_hash) {
         WC_UNMASK_FIELD(wc, dp_hash);
     }
+    if (!flow->ct_state) {
+        WC_UNMASK_FIELD(wc, ct_state);
+    }
+    if (!flow->ct_zone) {
+        WC_UNMASK_FIELD(wc, ct_zone);
+    }
     for (int i = 0; i < FLOW_N_REGS; i++) {
         if (!flow->regs[i]) {
             WC_UNMASK_FIELD(wc, regs[i]);
@@ -1181,6 +1219,8 @@  void flow_wildcards_init_for_packet(struct flow_wildcards *wc,
 
     WC_MASK_FIELD(wc, skb_priority);
     WC_MASK_FIELD(wc, pkt_mark);
+    WC_MASK_FIELD(wc, ct_state);
+    WC_MASK_FIELD(wc, ct_zone);
     WC_MASK_FIELD(wc, recirc_id);
     WC_MASK_FIELD(wc, dp_hash);
     WC_MASK_FIELD(wc, in_port);
@@ -1284,6 +1324,8 @@  flow_wc_map(const struct flow *flow, struct flowmap *map)
     FLOWMAP_SET(map, dl_src);
     FLOWMAP_SET(map, dl_type);
     FLOWMAP_SET(map, vlan_tci);
+    FLOWMAP_SET(map, ct_state);
+    FLOWMAP_SET(map, ct_zone);
 
     /* Ethertype-dependent fields. */
     if (OVS_LIKELY(flow->dl_type == htons(ETH_TYPE_IP))) {
diff --git a/lib/flow.h b/lib/flow.h
index d8632ff..2d6d3ee 100644
--- a/lib/flow.h
+++ b/lib/flow.h
@@ -103,8 +103,10 @@  struct flow {
     union flow_in_port in_port; /* Input port.*/
     uint32_t recirc_id;         /* Must be exact match. */
     uint32_t conj_id;           /* Conjunction ID. */
+    uint16_t ct_zone;           /* Connection Zone. */
+    uint8_t ct_state;           /* Connection state. */
+    uint8_t pad1[3];            /* Pad to 64 bits. */
     ofp_port_t actset_output;   /* Output port in action set. */
-    uint8_t pad1[6];            /* Pad to 64 bits. */
 
     /* L2, Order the same as in the Ethernet header! (64-bit aligned) */
     struct eth_addr dl_dst;     /* Ethernet destination address. */
@@ -980,6 +982,8 @@  pkt_metadata_from_flow(struct pkt_metadata *md, const struct flow *flow)
     md->skb_priority = flow->skb_priority;
     md->pkt_mark = flow->pkt_mark;
     md->in_port = flow->in_port;
+    md->ct_state = flow->ct_state;
+    md->ct_zone = flow->ct_zone;
 }
 
 static inline bool is_ip_any(const struct flow *flow)
diff --git a/lib/match.c b/lib/match.c
index 9e465d8..87f940c 100644
--- a/lib/match.c
+++ b/lib/match.c
@@ -285,6 +285,26 @@  match_set_pkt_mark_masked(struct match *match, uint32_t pkt_mark, uint32_t mask)
 }
 
 void
+match_set_ct_state(struct match *match, uint8_t ct_state)
+{
+    match_set_ct_state_masked(match, ct_state, UINT8_MAX);
+}
+
+void
+match_set_ct_state_masked(struct match *match, uint8_t ct_state, uint8_t mask)
+{
+    match->flow.ct_state = ct_state & mask;
+    match->wc.masks.ct_state = mask;
+}
+
+void
+match_set_ct_zone(struct match *match, uint16_t ct_zone)
+{
+    match->flow.ct_zone = ct_zone;
+    match->wc.masks.ct_zone = UINT16_MAX;
+}
+
+void
 match_set_dl_type(struct match *match, ovs_be16 dl_type)
 {
     match->wc.masks.dl_type = OVS_BE16_MAX;
@@ -816,6 +836,21 @@  format_ipv6_netmask(struct ds *s, const char *name,
 }
 
 static void
+format_uint16_masked(struct ds *s, const char *name,
+                   uint16_t value, uint16_t mask)
+{
+    if (mask != 0) {
+        ds_put_format(s, "%s=", name);
+        if (mask == UINT16_MAX) {
+            ds_put_format(s, "%"PRIu16, value);
+        } else {
+            ds_put_format(s, "0x%"PRIx16"/0x%"PRIx16, value, mask);
+        }
+        ds_put_char(s, ',');
+    }
+}
+
+static void
 format_be16_masked(struct ds *s, const char *name,
                    ovs_be16 value, ovs_be16 mask)
 {
@@ -953,6 +988,26 @@  match_format(const struct match *match, struct ds *s, int priority)
         ds_put_char(s, ',');
     }
 
+    if (wc->masks.ct_state) {
+        if (wc->masks.ct_state == UINT8_MAX) {
+            ds_put_cstr(s, "ct_state=");
+            if (f->ct_state) {
+                format_flags(s, packet_ct_state_to_string, f->ct_state,
+                             '|');
+            } else {
+                ds_put_cstr(s, "0"); /* No state. */
+            }
+        } else {
+            format_flags_masked(s, "ct_state", packet_ct_state_to_string,
+                                f->ct_state, wc->masks.ct_state, UINT8_MAX);
+        }
+        ds_put_char(s, ',');
+    }
+
+    if (wc->masks.ct_zone) {
+        format_uint16_masked(s, "ct_zone", f->ct_zone, wc->masks.ct_zone);
+    }
+
     if (wc->masks.dl_type) {
         skip_type = true;
         if (f->dl_type == htons(ETH_TYPE_IP)) {
diff --git a/lib/match.h b/lib/match.h
index 3e133e5..f264551 100644
--- a/lib/match.h
+++ b/lib/match.h
@@ -83,6 +83,9 @@  void match_set_tun_gbp_flags(struct match *match, uint8_t flags);
 void match_set_in_port(struct match *, ofp_port_t ofp_port);
 void match_set_pkt_mark(struct match *, uint32_t pkt_mark);
 void match_set_pkt_mark_masked(struct match *, uint32_t pkt_mark, uint32_t mask);
+void match_set_ct_state(struct match *, uint8_t ct_state);
+void match_set_ct_state_masked(struct match *, uint8_t ct_state, uint8_t mask);
+void match_set_ct_zone(struct match *, uint16_t ct_zone);
 void match_set_skb_priority(struct match *, uint32_t skb_priority);
 void match_set_dl_type(struct match *, ovs_be16);
 void match_set_dl_src(struct match *, const struct eth_addr );
diff --git a/lib/meta-flow.c b/lib/meta-flow.c
index 224ba53..4a7ec7f 100644
--- a/lib/meta-flow.c
+++ b/lib/meta-flow.c
@@ -214,6 +214,10 @@  mf_is_all_wild(const struct mf_field *mf, const struct flow_wildcards *wc)
         return !wc->masks.skb_priority;
     case MFF_PKT_MARK:
         return !wc->masks.pkt_mark;
+    case MFF_CT_STATE:
+        return !wc->masks.ct_state;
+    case MFF_CT_ZONE:
+        return !wc->masks.ct_zone;
     CASE_MFF_REGS:
         return !wc->masks.regs[mf->id - MFF_REG0];
     CASE_MFF_XREGS:
@@ -497,6 +501,8 @@  mf_is_value_valid(const struct mf_field *mf, const union mf_value *value)
     case MFF_IN_PORT:
     case MFF_SKB_PRIORITY:
     case MFF_PKT_MARK:
+    case MFF_CT_STATE:
+    case MFF_CT_ZONE:
     CASE_MFF_REGS:
     CASE_MFF_XREGS:
     case MFF_ETH_SRC:
@@ -644,6 +650,14 @@  mf_get_value(const struct mf_field *mf, const struct flow *flow,
         value->be32 = htonl(flow->pkt_mark);
         break;
 
+    case MFF_CT_STATE:
+        value->u8 = flow->ct_state;
+        break;
+
+    case MFF_CT_ZONE:
+        value->be16 = htons(flow->ct_zone);
+        break;
+
     CASE_MFF_REGS:
         value->be32 = htonl(flow->regs[mf->id - MFF_REG0]);
         break;
@@ -876,6 +890,14 @@  mf_set_value(const struct mf_field *mf,
         match_set_pkt_mark(match, ntohl(value->be32));
         break;
 
+    case MFF_CT_STATE:
+        match_set_ct_state(match, value->u8);
+        break;
+
+    case MFF_CT_ZONE:
+        match_set_ct_zone(match, ntohs(value->be16));
+        break;
+
     CASE_MFF_REGS:
         match_set_reg(match, mf->id - MFF_REG0, ntohl(value->be32));
         break;
@@ -1160,6 +1182,14 @@  mf_set_flow_value(const struct mf_field *mf,
         flow->pkt_mark = ntohl(value->be32);
         break;
 
+    case MFF_CT_STATE:
+        flow->ct_state = value->u8;
+        break;
+
+    case MFF_CT_ZONE:
+        flow->ct_zone = ntohs(value->be16);
+        break;
+
     CASE_MFF_REGS:
         flow->regs[mf->id - MFF_REG0] = ntohl(value->be32);
         break;
@@ -1449,6 +1479,16 @@  mf_set_wild(const struct mf_field *mf, struct match *match, char **err_str)
         match->wc.masks.pkt_mark = 0;
         break;
 
+    case MFF_CT_STATE:
+        match->flow.ct_state = 0;
+        match->wc.masks.ct_state = 0;
+        break;
+
+    case MFF_CT_ZONE:
+        match->flow.ct_zone = 0;
+        match->wc.masks.ct_zone = 0;
+        break;
+
     CASE_MFF_REGS:
         match_set_reg_masked(match, mf->id - MFF_REG0, 0, 0);
         break;
@@ -1636,6 +1676,7 @@  mf_set(const struct mf_field *mf,
     }
 
     switch (mf->id) {
+    case MFF_CT_ZONE:
     case MFF_RECIRC_ID:
     case MFF_CONJ_ID:
     case MFF_IN_PORT:
@@ -1711,6 +1752,10 @@  mf_set(const struct mf_field *mf,
                                   ntohl(mask->be32));
         break;
 
+    case MFF_CT_STATE:
+        match_set_ct_state_masked(match, value->u8, mask->u8);
+        break;
+
     case MFF_ETH_DST:
         match_set_dl_dst_masked(match, value->mac, mask->mac);
         break;
@@ -2103,6 +2148,22 @@  mf_from_tun_flags_string(const char *s, ovs_be16 *flagsp, ovs_be16 *maskp)
                           htons(FLOW_TNL_PUB_F_MASK), maskp);
 }
 
+static char *
+mf_from_ct_state_string(const char *s, uint8_t *flagsp, uint8_t *maskp)
+{
+    ovs_be16 flags, mask;
+    char *error;
+
+    error = parse_mf_flags(s, packet_ct_state_to_string, "ct_state", &flags,
+                           htons(CS_SUPPORTED_MASK), &mask);
+    if (!error) {
+        *flagsp = ntohs(flags);
+        *maskp = ntohs(mask);
+    }
+
+    return error;
+}
+
 /* Parses 's', a string value for field 'mf', into 'value' and 'mask'.  Returns
  * NULL if successful, otherwise a malloc()'d string describing the error. */
 char *
@@ -2124,6 +2185,11 @@  mf_parse(const struct mf_field *mf, const char *s,
                                        (uint8_t *) value, (uint8_t *) mask);
         break;
 
+    case MFS_CT_STATE:
+        ovs_assert(mf->n_bytes == sizeof(uint8_t));
+        error = mf_from_ct_state_string(s, &value->u8, &mask->u8);
+        break;
+
     case MFS_ETHERNET:
         error = mf_from_ethernet_string(mf, s, &value->mac, &mask->mac);
         break;
@@ -2244,6 +2310,13 @@  mf_format_tcp_flags_string(ovs_be16 value, ovs_be16 mask, struct ds *s)
                         TCP_FLAGS(mask), TCP_FLAGS(OVS_BE16_MAX));
 }
 
+static void
+mf_format_ct_state_string(uint8_t value, uint8_t mask, struct ds *s)
+{
+    format_flags_masked(s, NULL, packet_ct_state_to_string, value, mask,
+                        UINT8_MAX);
+}
+
 /* Appends to 's' a string representation of field 'mf' whose value is in
  * 'value' and 'mask'.  'mask' may be NULL to indicate an exact match. */
 void
@@ -2280,6 +2353,10 @@  mf_format(const struct mf_field *mf,
         mf_format_integer_string(mf, (uint8_t *) value, (uint8_t *) mask, s);
         break;
 
+    case MFS_CT_STATE:
+        mf_format_ct_state_string(value->u8, mask ? mask->u8 : UINT8_MAX, s);
+        break;
+
     case MFS_ETHERNET:
         eth_format_masked(value->mac, mask ? &mask->mac : NULL, s);
         break;
diff --git a/lib/meta-flow.h b/lib/meta-flow.h
index 02272ef..44a29e9 100644
--- a/lib/meta-flow.h
+++ b/lib/meta-flow.h
@@ -703,6 +703,51 @@  enum OVS_PACKED_ENUM mf_field_id {
      */
     MFF_PKT_MARK,
 
+    /* "ct_state".
+     *
+     * Connection tracking state.  The field is populated by the NXAST_CT
+     * action. The following bit values describe the state of the connection:
+     *
+     *   - New (0x01): This is the beginning of a new connection.
+     *   - Established (0x02): This is part of an already existing connection.
+     *   - Related (0x04): This is a separate connection that is related to an
+     *                     existing connection.
+     *   - Invalid (0x20): This flow could not be associated with a connection.
+     *                     This could be set for a variety of reasons,
+     *                     including (but not limited to):
+     *                     - L3/L4 protocol handler is not loaded/unavailable.
+     *                     - L3/L4 protocol handler determines that the packet
+     *                       is malformed or invalid for the current FSM stage.
+     *                     - Packets are unexpected length for protocol.
+     *   - Reply (0x40): This flow is in the reply direction, ie it did not
+     *                   initiate the connection.
+     *   - Tracked (0x80): Connection tracking has occurred.
+     *
+     * Type: u8.
+     * Maskable: bitwise.
+     * Formatting: conn state.
+     * Prerequisites: none.
+     * Access: read-only.
+     * NXM: NXM_NX_CT_STATE(105) since v2.5.
+     * OXM: none.
+     */
+    MFF_CT_STATE,
+
+    /* "ct_zone".
+     *
+     * Connection tracking zone.  The field is populated by the
+     * NXAST_CT action.
+     *
+     * Type: be16.
+     * Maskable: no.
+     * Formatting: hexadecimal.
+     * Prerequisites: none.
+     * Access: read-only.
+     * NXM: NXM_NX_CT_ZONE(106) since v2.5.
+     * OXM: none.
+     */
+    MFF_CT_ZONE,
+
 #if FLOW_N_REGS == 8
     /* "reg<N>".
      *
@@ -1679,6 +1724,7 @@  enum OVS_PACKED_ENUM mf_string {
     MFS_HEXADECIMAL,
 
     /* Other formats. */
+    MFS_CT_STATE,               /* Connection tracking state */
     MFS_ETHERNET,
     MFS_IPV4,
     MFS_IPV6,
diff --git a/lib/nx-match.c b/lib/nx-match.c
index eef2c54..4268b42 100644
--- a/lib/nx-match.c
+++ b/lib/nx-match.c
@@ -1036,6 +1036,16 @@  nx_put_raw(struct ofpbuf *b, enum ofp_version oxm, const struct match *match,
     nxm_put_32m(b, MFF_PKT_MARK, oxm, htonl(flow->pkt_mark),
                 htonl(match->wc.masks.pkt_mark));
 
+    /* Connection tracking. */
+    if (match->wc.masks.ct_state) {
+        nxm_put_8m(b, MFF_CT_STATE, oxm, flow->ct_state,
+                   match->wc.masks.ct_state);
+    }
+    if (match->wc.masks.ct_zone) {
+        nxm_put_16m(b, MFF_CT_ZONE, oxm, htons(flow->ct_zone),
+                    htons(match->wc.masks.ct_zone));
+    }
+
     /* OpenFlow 1.1+ Metadata. */
     nxm_put_64m(b, MFF_METADATA, oxm,
                 flow->metadata, match->wc.masks.metadata);
diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 54a43cd..575631c 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -326,6 +326,8 @@  odp_execute_set_action(struct dp_packet *packet, const struct nlattr *a)
     case OVS_KEY_ATTR_ICMP:
     case OVS_KEY_ATTR_ICMPV6:
     case OVS_KEY_ATTR_TCP_FLAGS:
+    case OVS_KEY_ATTR_CT_STATE:
+    case OVS_KEY_ATTR_CT_ZONE:
     case __OVS_KEY_ATTR_MAX:
     default:
         OVS_NOT_REACHED();
@@ -414,6 +416,8 @@  odp_execute_masked_set_action(struct dp_packet *packet,
 
     case OVS_KEY_ATTR_TUNNEL:    /* Masked data not supported for tunnel. */
     case OVS_KEY_ATTR_UNSPEC:
+    case OVS_KEY_ATTR_CT_STATE:
+    case OVS_KEY_ATTR_CT_ZONE:
     case OVS_KEY_ATTR_ENCAP:
     case OVS_KEY_ATTR_ETHERTYPE:
     case OVS_KEY_ATTR_IN_PORT:
@@ -476,6 +480,7 @@  requires_datapath_assistance(const struct nlattr *a)
     case OVS_ACTION_ATTR_TUNNEL_POP:
     case OVS_ACTION_ATTR_USERSPACE:
     case OVS_ACTION_ATTR_RECIRC:
+    case OVS_ACTION_ATTR_CT:
         return true;
 
     case OVS_ACTION_ATTR_SET:
@@ -611,6 +616,7 @@  odp_execute_actions(void *dp, struct dp_packet **packets, int cnt, bool steal,
         case OVS_ACTION_ATTR_TUNNEL_POP:
         case OVS_ACTION_ATTR_USERSPACE:
         case OVS_ACTION_ATTR_RECIRC:
+        case OVS_ACTION_ATTR_CT:
         case OVS_ACTION_ATTR_UNSPEC:
         case __OVS_ACTION_ATTR_MAX:
             OVS_NOT_REACHED();
diff --git a/lib/odp-util.c b/lib/odp-util.c
index d7696c9..1ce6258 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -113,6 +113,7 @@  odp_action_len(uint16_t type)
     case OVS_ACTION_ATTR_SET: return ATTR_LEN_VARIABLE;
     case OVS_ACTION_ATTR_SET_MASKED: return ATTR_LEN_VARIABLE;
     case OVS_ACTION_ATTR_SAMPLE: return ATTR_LEN_VARIABLE;
+    case OVS_ACTION_ATTR_CT: return ATTR_LEN_VARIABLE;
 
     case OVS_ACTION_ATTR_UNSPEC:
     case __OVS_ACTION_ATTR_MAX:
@@ -134,6 +135,8 @@  ovs_key_attr_to_string(enum ovs_key_attr attr, char *namebuf, size_t bufsize)
     case OVS_KEY_ATTR_ENCAP: return "encap";
     case OVS_KEY_ATTR_PRIORITY: return "skb_priority";
     case OVS_KEY_ATTR_SKB_MARK: return "skb_mark";
+    case OVS_KEY_ATTR_CT_STATE: return "ct_state";
+    case OVS_KEY_ATTR_CT_ZONE: return "ct_zone";
     case OVS_KEY_ATTR_TUNNEL: return "tunnel";
     case OVS_KEY_ATTR_IN_PORT: return "in_port";
     case OVS_KEY_ATTR_ETHERNET: return "eth";
@@ -532,6 +535,44 @@  format_odp_tnl_push_action(struct ds *ds, const struct nlattr *attr)
     ds_put_format(ds, ",out_port(%"PRIu32"))", data->out_port);
 }
 
+static const struct nl_policy ovs_conntrack_policy[] = {
+    [OVS_CT_ATTR_FLAGS] = { .type = NL_A_U32, .optional = true,
+                            .min_len = sizeof(uint32_t) },
+    [OVS_CT_ATTR_ZONE] = { .type = NL_A_U16, .optional = true,
+                            .min_len = sizeof(uint16_t)},
+};
+
+static void
+format_odp_conntrack_action(struct ds *ds, const struct nlattr *attr)
+{
+    struct nlattr *a[ARRAY_SIZE(ovs_conntrack_policy)];
+    uint32_t flags;
+    uint16_t zone;
+
+    if (!nl_parse_nested(attr, ovs_conntrack_policy, a, ARRAY_SIZE(a))) {
+        ds_put_cstr(ds, "ct(error)");
+        return;
+    }
+
+    flags = a[OVS_CT_ATTR_FLAGS] ? nl_attr_get_u32(a[OVS_CT_ATTR_FLAGS]) : 0;
+    zone = a[OVS_CT_ATTR_ZONE] ? nl_attr_get_u16(a[OVS_CT_ATTR_ZONE]) : 0;
+
+    ds_put_format(ds, "ct");
+    if (flags || zone) {
+        ds_put_cstr(ds, "(");
+        if (flags & OVS_CT_F_COMMIT) {
+            ds_put_format(ds, "commit");
+        }
+        if (zone) {
+            if (ds_last(ds) != '(') {
+                ds_put_char(ds, ',');
+            }
+            ds_put_format(ds, "zone=%"PRIu16, zone);
+        }
+        ds_put_cstr(ds, ")");
+    }
+}
+
 static void
 format_odp_action(struct ds *ds, const struct nlattr *a)
 {
@@ -622,6 +663,9 @@  format_odp_action(struct ds *ds, const struct nlattr *a)
     case OVS_ACTION_ATTR_SAMPLE:
         format_odp_sample_action(ds, a);
         break;
+    case OVS_ACTION_ATTR_CT:
+        format_odp_conntrack_action(ds, a);
+        break;
     case OVS_ACTION_ATTR_UNSPEC:
     case __OVS_ACTION_ATTR_MAX:
     default:
@@ -960,6 +1004,59 @@  ovs_parse_tnl_push(const char *s, struct ovs_action_push_tnl *data)
 }
 
 static int
+parse_conntrack_action(const char *s_, struct ofpbuf *actions)
+{
+    const char *s = s_;
+
+    if (ovs_scan(s, "ct")) {
+        uint32_t flags = 0;
+        uint16_t zone = 0;
+        size_t start;
+        char *end;
+
+        s += 2;
+        if (ovs_scan(s, "(")) {
+            s++;
+            end = strchr(s, ')');
+            if (!end) {
+                return -EINVAL;
+            }
+
+            while (s != end) {
+                int n = -1;
+
+                s += strspn(s, delimiters);
+                if (ovs_scan(s, "commit%n", &n)) {
+                    flags |= OVS_CT_F_COMMIT;
+                    s += n;
+                    continue;
+                }
+                if (ovs_scan(s, "zone=%"SCNu16"%n", &zone, &n)) {
+                    s += n;
+                    continue;
+                }
+
+                if (n < 0) {
+                    return -EINVAL;
+                }
+            }
+            s++;
+        }
+
+        start = nl_msg_start_nested(actions, OVS_ACTION_ATTR_CT);
+        if (flags) {
+            nl_msg_put_u32(actions, OVS_CT_ATTR_FLAGS, flags);
+        }
+        if (zone) {
+            nl_msg_put_u16(actions, OVS_CT_ATTR_ZONE, zone);
+        }
+        nl_msg_end_nested(actions, start);
+    }
+
+    return s - s_;
+}
+
+static int
 parse_odp_action(const char *s, const struct simap *port_names,
                  struct ofpbuf *actions)
 {
@@ -1117,6 +1214,15 @@  parse_odp_action(const char *s, const struct simap *port_names,
     }
 
     {
+        int retval;
+
+        retval = parse_conntrack_action(s, actions);
+        if (retval) {
+            return retval;
+        }
+    }
+
+    {
         struct ovs_action_push_tnl data;
         int n;
 
@@ -1211,6 +1317,8 @@  static const struct attr_len_tbl ovs_flow_key_attr_lens[OVS_KEY_ATTR_MAX + 1] =
     [OVS_KEY_ATTR_ICMPV6]    = { .len = sizeof(struct ovs_key_icmpv6) },
     [OVS_KEY_ATTR_ARP]       = { .len = sizeof(struct ovs_key_arp) },
     [OVS_KEY_ATTR_ND]        = { .len = sizeof(struct ovs_key_nd) },
+    [OVS_KEY_ATTR_CT_STATE]  = { .len = 1 },
+    [OVS_KEY_ATTR_CT_ZONE]   = { .len = 2 },
 };
 
 /* Returns the correct length of the payload for a flow key attribute of the
@@ -2016,6 +2124,21 @@  format_frag(struct ds *ds, const char *name, uint8_t key,
     }
 }
 
+static bool
+mask_empty(const struct nlattr *ma)
+{
+    const void *mask;
+    size_t n;
+
+    if (!ma) {
+        return true;
+    }
+    mask = nl_attr_get(ma);
+    n = nl_attr_get_size(ma);
+
+    return is_all_zeros(mask, n);
+}
+
 static void
 format_odp_key_attr(const struct nlattr *a, const struct nlattr *ma,
                     const struct hmap *portno_names, struct ds *ds,
@@ -2057,6 +2180,24 @@  format_odp_key_attr(const struct nlattr *a, const struct nlattr *ma,
         }
         break;
 
+    case OVS_KEY_ATTR_CT_STATE:
+        if (!is_exact) {
+            format_flags_masked(ds, NULL, packet_ct_state_to_string,
+                                nl_attr_get_u8(a), nl_attr_get_u8(ma),
+                                UINT8_MAX);
+        } else {
+            format_flags(ds, packet_ct_state_to_string,
+                         nl_attr_get_u8(a), ',');
+        }
+        break;
+
+    case OVS_KEY_ATTR_CT_ZONE:
+        if (verbose || !mask_empty(ma)) {
+            ds_put_format(ds, "%"PRIx16, nl_attr_get_u16(a));
+        }
+        break;
+
+
     case OVS_KEY_ATTR_TUNNEL:
         format_odp_tun_attr(a, ma, ds, verbose);
         break;
@@ -2503,6 +2644,26 @@  scan_u8(const char *s, uint8_t *key, uint8_t *mask)
 }
 
 static int
+scan_u16(const char *s, uint16_t *key, uint16_t *mask)
+{
+    int n;
+
+    if (ovs_scan(s, "%"SCNi16"%n", key, &n)) {
+        int len = n;
+
+        if (mask) {
+            if (ovs_scan(s + len, "/%"SCNi16"%n", mask, &n)) {
+                len += n;
+            } else {
+                *mask = UINT16_MAX;
+            }
+        }
+        return len;
+    }
+    return 0;
+}
+
+static int
 scan_u32(const char *s, uint32_t *key, uint32_t *mask)
 {
     int n;
@@ -2605,6 +2766,25 @@  scan_tcp_flags(const char *s, ovs_be16 *key, ovs_be16 *mask)
 }
 
 static int
+scan_ct_state(const char *s, uint8_t *key, uint8_t *mask)
+{
+    uint32_t flags, fmask;
+    int n;
+
+    n = parse_flags(s, packet_ct_state_to_string, ')', NULL, NULL, &flags,
+                    CS_SUPPORTED_MASK, mask ? &fmask : NULL);
+
+    if (n >= 0) {
+        *key = flags;
+        if (mask) {
+            *mask = fmask;
+        }
+        return n;
+    }
+    return 0;
+}
+
+static int
 scan_frag(const char *s, uint8_t *key, uint8_t *mask)
 {
     int n;
@@ -3127,6 +3307,9 @@  parse_odp_key_mask_attr(const char *s, const struct simap *port_names,
                              OVS_KEY_ATTR_RECIRC_ID);
     SCAN_SINGLE("dp_hash(", uint32_t, u32, OVS_KEY_ATTR_DP_HASH);
 
+    SCAN_SINGLE("ct_state(", uint8_t, ct_state, OVS_KEY_ATTR_CT_STATE);
+    SCAN_SINGLE("ct_zone(", uint16_t, u16, OVS_KEY_ATTR_CT_ZONE);
+
     SCAN_BEGIN_NESTED("tunnel(", OVS_KEY_ATTR_TUNNEL) {
         SCAN_FIELD_NESTED("tun_id=", ovs_be64, be64, OVS_TUNNEL_KEY_ATTR_ID);
         SCAN_FIELD_NESTED("src=", ovs_be32, ipv4, OVS_TUNNEL_KEY_ATTR_IPV4_SRC);
@@ -3361,6 +3544,12 @@  odp_flow_key_from_flow__(const struct odp_flow_key_parms *parms,
 
     nl_msg_put_u32(buf, OVS_KEY_ATTR_SKB_MARK, data->pkt_mark);
 
+    if (parms->support.ct_state) {
+        nl_msg_put_u8(buf, OVS_KEY_ATTR_CT_STATE, data->ct_state);
+    }
+    if (parms->support.ct_zone) {
+        nl_msg_put_u16(buf, OVS_KEY_ATTR_CT_ZONE, data->ct_zone);
+    }
     if (parms->support.recirc) {
         nl_msg_put_u32(buf, OVS_KEY_ATTR_RECIRC_ID, data->recirc_id);
         nl_msg_put_u32(buf, OVS_KEY_ATTR_DP_HASH, data->dp_hash);
@@ -3543,6 +3732,13 @@  odp_key_from_pkt_metadata(struct ofpbuf *buf, const struct pkt_metadata *md)
 
     nl_msg_put_u32(buf, OVS_KEY_ATTR_SKB_MARK, md->pkt_mark);
 
+    if (md->ct_state) {
+        nl_msg_put_u8(buf, OVS_KEY_ATTR_CT_STATE, md->ct_state);
+        if (md->ct_zone) {
+            nl_msg_put_u16(buf, OVS_KEY_ATTR_CT_ZONE, md->ct_zone);
+        }
+    }
+
     /* Add an ingress port attribute if 'odp_in_port' is not the magical
      * value "ODPP_NONE". */
     if (md->in_port.odp_port != ODPP_NONE) {
@@ -3590,6 +3786,14 @@  odp_key_to_pkt_metadata(const struct nlattr *key, size_t key_len,
             md->pkt_mark = nl_attr_get_u32(nla);
             wanted_attrs &= ~(1u << OVS_KEY_ATTR_SKB_MARK);
             break;
+        case OVS_KEY_ATTR_CT_STATE:
+            md->ct_state = nl_attr_get_u8(nla);
+            wanted_attrs &= ~(1u << OVS_KEY_ATTR_CT_STATE);
+            break;
+        case OVS_KEY_ATTR_CT_ZONE:
+            md->ct_zone = nl_attr_get_u16(nla);
+            wanted_attrs &= ~(1u << OVS_KEY_ATTR_CT_ZONE);
+            break;
         case OVS_KEY_ATTR_TUNNEL: {
             enum odp_key_fitness res;
 
@@ -4144,6 +4348,15 @@  odp_flow_key_to_flow__(const struct nlattr *key, size_t key_len,
         expected_attrs |= UINT64_C(1) << OVS_KEY_ATTR_SKB_MARK;
     }
 
+    if (present_attrs & (UINT64_C(1) << OVS_KEY_ATTR_CT_STATE)) {
+        flow->ct_state = nl_attr_get_u8(attrs[OVS_KEY_ATTR_CT_STATE]);
+        expected_attrs |= UINT64_C(1) << OVS_KEY_ATTR_CT_STATE;
+    }
+    if (present_attrs & (UINT64_C(1) << OVS_KEY_ATTR_CT_ZONE)) {
+        flow->ct_zone = nl_attr_get_u16(attrs[OVS_KEY_ATTR_CT_ZONE]);
+        expected_attrs |= UINT64_C(1) << OVS_KEY_ATTR_CT_ZONE;
+    }
+
     if (present_attrs & (UINT64_C(1) << OVS_KEY_ATTR_TUNNEL)) {
         enum odp_key_fitness res;
 
diff --git a/lib/odp-util.h b/lib/odp-util.h
index bc27794..5822777 100644
--- a/lib/odp-util.h
+++ b/lib/odp-util.h
@@ -120,6 +120,8 @@  void odp_portno_names_destroy(struct hmap *portno_names);
  *  OVS_KEY_ATTR_SKB_MARK                4    --     4      8
  *  OVS_KEY_ATTR_DP_HASH                 4    --     4      8
  *  OVS_KEY_ATTR_RECIRC_ID               4    --     4      8
+ *  OVS_KEY_ATTR_CONN_STATE              1     3     4      8
+ *  OVS_KEY_ATTR_CONN_ZONE               2     2     4      8
  *  OVS_KEY_ATTR_ETHERNET               12    --     4     16
  *  OVS_KEY_ATTR_ETHERTYPE               2     2     4      8  (outer VLAN ethertype)
  *  OVS_KEY_ATTR_VLAN                    2     2     4      8
@@ -129,7 +131,7 @@  void odp_portno_names_destroy(struct hmap *portno_names);
  *  OVS_KEY_ATTR_ICMPV6                  2     2     4      8
  *  OVS_KEY_ATTR_ND                     28    --     4     32
  *  ----------------------------------------------------------
- *  total                                                 488
+ *  total                                                 504
  *
  * We include some slack space in case the calculation isn't quite right or we
  * add another field and forget to adjust this value.
@@ -166,6 +168,10 @@  struct odp_support {
 
     /* If this is true, then recirculation fields will always be serialised. */
     bool recirc;
+
+    /* If true, serialise the corresponding OVS_KEY_ATTR_CONN_* field. */
+    bool ct_state;
+    bool ct_zone;
 };
 
 struct odp_flow_key_parms {
diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c
index c1e46b5..6ea421f 100644
--- a/lib/ofp-actions.c
+++ b/lib/ofp-actions.c
@@ -286,6 +286,9 @@  enum ofp_raw_action_type {
     /* NX1.0+(34): struct nx_action_conjunction. */
     NXAST_RAW_CONJUNCTION,
 
+    /* NX1.0+(35): struct nx_action_conntrack, ... */
+    NXAST_RAW_CT,
+
 /* ## ------------------ ## */
 /* ## Debugging actions. ## */
 /* ## ------------------ ## */
@@ -346,6 +349,10 @@  static void *ofpact_put_raw(struct ofpbuf *, enum ofp_version,
 static char *OVS_WARN_UNUSED_RESULT ofpacts_parse(
     char *str, struct ofpbuf *ofpacts, enum ofputil_protocol *usable_protocols,
     bool allow_instructions, enum ofpact_type outer_action);
+static enum ofperr ofpacts_pull_openflow_actions__(
+    struct ofpbuf *openflow, unsigned int actions_len,
+    enum ofp_version version, uint32_t allowed_ovsinsts,
+    struct ofpbuf *ofpacts, enum ofpact_type outer_action);
 
 /* Pull off existing actions or instructions. Used by nesting actions to keep
  * ofpacts_parse() oblivious of actions nesting. */
@@ -4448,6 +4455,245 @@  format_DEBUG_RECIRC(const struct ofpact_null *a OVS_UNUSED, struct ds *s)
 {
     ds_put_cstr(s, "debug_recirc");
 }
+
+/* Action structure for NXAST_CT.
+ *
+ * Pass traffic to the connection tracker.
+ *
+ * The "zone" specifies a context within which the tracking is done:
+ *
+ *      If 'zone_src' is nonzero, this specifies that the zone should be
+ *      sourced from a field zone_src[ofs:ofs+nbits]. The format and semantics
+ *      of 'zone_src' and 'zone_ofs_nbits' are similar to those for the
+ *      NXAST_REG_LOAD action. The acceptable nxm_header values for 'zone_src'
+ *      are the same as the acceptable nxm_header values for the 'src' field of
+ *      NXAST_REG_MOVE.
+ *
+ *      If 'zone_src' is zero, then the value of 'zone_imm' will be used as the
+ *      connection tracking zone.
+ *
+ * If "recirc_table" has a value other than NX_CT_RECIRC_NONE, then this field
+ * specifies that the packet should be logically cloned after the packet is
+ * sent through the connection tracker, and pipeline processing for the cloned
+ * packet will continue in the OpenFlow table specified by this value. The
+ * NXM_NX_CT_* fields will be populated for the subsequent processing. It is
+ * strongly recommended that this table is later than the current table, to
+ * prevent loops.
+ *
+ * A standard "resubmit" action is not sufficient to populate NXM_NX_CT_*
+ * fields, since connection tracking occurs outside of the classifier.
+ */
+struct nx_action_conntrack {
+    ovs_be16 type;              /* OFPAT_VENDOR. */
+    ovs_be16 len;               /* At least 24. */
+    ovs_be32 vendor;            /* NX_VENDOR_ID. */
+    ovs_be16 subtype;           /* NXAST_CT. */
+    ovs_be16 flags;             /* Zero or more NX_CT_F_* flags.
+                                 * Unspecified flag bits must be zero. */
+    ovs_be32 zone_src;          /* Connection tracking context. */
+    union {
+        ovs_be16 zone_ofs_nbits;/* Range to use from source field. */
+        ovs_be16 zone_imm;      /* Immediate value for zone. */
+    };
+    uint8_t recirc_table;       /* Recirculate to a specific table, or
+                                   NX_CT_RECIRC_NONE for no recirculation. */
+    uint8_t pad[3];             /* Zeroes */
+    ovs_be16 alg;               /* Well-known port number for the protocol.
+                                 * 0 indicates no ALG is required. */
+    /* Followed by a sequence of ofpact elements, until the end of the action
+     * is reached. */
+};
+OFP_ASSERT(sizeof(struct nx_action_conntrack) == 24);
+
+static enum ofperr
+decode_ct_zone(const struct nx_action_conntrack *nac,
+               struct ofpact_conntrack *out)
+{
+    if (nac->zone_src) {
+        enum ofperr error;
+
+        out->zone_src.field = mf_from_nxm_header(ntohl(nac->zone_src));
+        out->zone_src.ofs = nxm_decode_ofs(nac->zone_ofs_nbits);
+        out->zone_src.n_bits = nxm_decode_n_bits(nac->zone_ofs_nbits);
+        error = mf_check_src(&out->zone_src, NULL);
+        if (error) {
+            return error;
+        }
+
+        if (out->zone_src.n_bits != 16) {
+            VLOG_WARN_RL(&rl, "zone n_bits %d not within valid range [16..16]",
+                         out->zone_src.n_bits);
+            return OFPERR_OFPBAC_BAD_SET_LEN;
+        }
+    } else {
+        out->zone_src.field = NULL;
+        out->zone_imm = ntohs(nac->zone_imm);
+    }
+
+    return 0;
+}
+
+static enum ofperr
+decode_NXAST_RAW_CT(const struct nx_action_conntrack *nac, struct ofpbuf *out)
+{
+    const size_t ct_offset = ofpacts_pull(out);
+    struct ofpact_conntrack *conntrack;
+    struct ofpbuf openflow;
+    int error = 0;
+
+    conntrack = ofpact_put_CT(out);
+    conntrack->flags = ntohs(nac->flags);
+    error = decode_ct_zone(nac, conntrack);
+    if (error) {
+        goto out;
+    }
+    conntrack->recirc_table = nac->recirc_table;
+    conntrack->alg = ntohs(nac->alg);
+
+    ofpbuf_pull(out, sizeof(*conntrack));
+
+    /* XXX: version */
+    ofpbuf_use_const(&openflow, nac + 1, ntohs(nac->len) - sizeof(*nac));
+    error = ofpacts_pull_openflow_actions__(&openflow, openflow.size,
+                                            OFP10_VERSION,
+                                            1u << OVSINST_OFPIT11_APPLY_ACTIONS,
+                                            out, OFPACT_CT);
+    if (error) {
+        goto out;
+    }
+
+    conntrack = ofpbuf_push_uninit(out, sizeof(*conntrack));
+    out->header = &conntrack->ofpact;
+    ofpact_update_len(out, &conntrack->ofpact);
+
+out:
+    ofpbuf_push_uninit(out, ct_offset);
+    return error;
+}
+
+static void
+encode_CT(const struct ofpact_conntrack *conntrack,
+          enum ofp_version ofp_version, struct ofpbuf *out)
+{
+    struct nx_action_conntrack *nac;
+    const size_t ofs = out->size;
+    size_t len;
+
+    nac = put_NXAST_CT(out);
+    nac->flags = htons(conntrack->flags);
+    if (conntrack->zone_src.field) {
+        nac->zone_src = htonl(mf_nxm_header(conntrack->zone_src.field->id));
+        nac->zone_ofs_nbits = nxm_encode_ofs_nbits(conntrack->zone_src.ofs,
+                                                   conntrack->zone_src.n_bits);
+    } else {
+        nac->zone_src = htonl(0);
+        nac->zone_imm = htons(conntrack->zone_imm);
+    }
+    nac->recirc_table = conntrack->recirc_table;
+    nac->alg = htons(conntrack->alg);
+
+    len = ofpacts_put_openflow_actions(conntrack->actions,
+                                       ofpact_ct_get_action_len(conntrack),
+                                       out, ofp_version);
+    len += sizeof(*nac);
+    nac = ofpbuf_at(out, ofs, sizeof(*nac));
+    nac->len = htons(len);
+}
+
+/* Parses 'arg' as the argument to a "ct" action, and appends such an
+ * action to 'ofpacts'.
+ *
+ * Returns NULL if successful, otherwise a malloc()'d string describing the
+ * error.  The caller is responsible for freeing the returned string. */
+static char * OVS_WARN_UNUSED_RESULT
+parse_CT(char *arg, struct ofpbuf *ofpacts,
+         enum ofputil_protocol *usable_protocols)
+{
+    const size_t ct_offset = ofpacts_pull(ofpacts);
+    struct ofpact_conntrack *oc;
+    char *error = NULL;
+    char *key, *value;
+
+    oc = ofpact_put_CT(ofpacts);
+    oc->flags = 0;
+    oc->recirc_table = NX_CT_RECIRC_NONE;
+    while (ofputil_parse_key_value(&arg, &key, &value)) {
+        if (!strcmp(key, "commit")) {
+            oc->flags |= NX_CT_F_COMMIT;
+        } else if (!strcmp(key, "table")) {
+            error = str_to_u8(value, "recirc_table", &oc->recirc_table);
+        } else if (!strcmp(key, "zone")) {
+            error = str_to_u16(value, "zone", &oc->zone_imm);
+
+            if (error) {
+                free(error);
+                error = mf_parse_subfield(&oc->zone_src, value);
+                if (error) {
+                    return error;
+                }
+            }
+        } else if (!strcmp(key, "exec")) {
+            /* Hide existing actions from ofpacts_parse_actions(), so the
+             * nesting can be handled transparently. */
+            ofpbuf_pull(ofpacts, sizeof(*oc));
+            error = ofpacts_parse_actions(value, ofpacts, usable_protocols,
+                                          OFPACT_CT);
+
+            /* Put the CT action back on and update its length. */
+            oc = ofpbuf_push_uninit(ofpacts, sizeof(*oc));
+            oc->ofpact.len = ofpacts->size;
+        } else {
+            error = xasprintf("invalid argument to \"ct\" action: `%s'", key);
+        }
+        if (error) {
+            break;
+        }
+    }
+
+    ofpbuf_push_uninit(ofpacts, ct_offset);
+    return error;
+}
+
+static void
+append_comma(struct ds *s, bool *first)
+{
+    if (*first) {
+        *first = false;
+    } else {
+        ds_put_char(s, ',');
+    }
+}
+
+static void
+format_CT(const struct ofpact_conntrack *a, struct ds *s)
+{
+    bool first = true;
+
+    ds_put_cstr(s, "ct(");
+    if (a->flags & NX_CT_F_COMMIT) {
+        append_comma(s, &first);
+        ds_put_cstr(s, "commit");
+    }
+    if (a->recirc_table != NX_CT_RECIRC_NONE) {
+        append_comma(s, &first);
+        ds_put_format(s, "table=%"PRIu8, a->recirc_table);
+    }
+    if (a->zone_src.field) {
+        append_comma(s, &first);
+        ds_put_format(s, "zone=");
+        mf_format_subfield(&a->zone_src, s);
+    } else if (a->zone_imm) {
+        append_comma(s, &first);
+        ds_put_format(s, "zone=%"PRIu16, a->zone_imm);
+    }
+    if (ofpact_ct_get_action_len(a)) {
+        append_comma(s, &first);
+        ds_put_cstr(s, "exec(");
+        ofpacts_format(a->actions, ofpact_ct_get_action_len(a), s);
+        ds_put_char(s, ')');
+    }
+    ds_put_char(s, ')');
+}
 
 /* Meter instruction. */
 
@@ -4827,6 +5073,7 @@  ofpact_is_set_or_move_action(const struct ofpact *a)
         return true;
     case OFPACT_BUNDLE:
     case OFPACT_CLEAR_ACTIONS:
+    case OFPACT_CT:
     case OFPACT_CONTROLLER:
     case OFPACT_DEC_MPLS_TTL:
     case OFPACT_DEC_TTL:
@@ -4901,6 +5148,7 @@  ofpact_is_allowed_in_actions_set(const struct ofpact *a)
      * in the action set is undefined. */
     case OFPACT_BUNDLE:
     case OFPACT_CONTROLLER:
+    case OFPACT_CT:
     case OFPACT_ENQUEUE:
     case OFPACT_EXIT:
     case OFPACT_UNROLL_XLATE:
@@ -5130,6 +5378,7 @@  ovs_instruction_type_from_ofpact_type(enum ofpact_type type)
     case OFPACT_UNROLL_XLATE:
     case OFPACT_SAMPLE:
     case OFPACT_DEBUG_RECIRC:
+    case OFPACT_CT:
     default:
         return OVSINST_OFPIT11_APPLY_ACTIONS;
     }
@@ -5686,6 +5935,16 @@  ofpact_check__(enum ofputil_protocol *usable_protocols, struct ofpact *a,
     case OFPACT_SAMPLE:
         return 0;
 
+    case OFPACT_CT: {
+        struct ofpact_conntrack *oc = ofpact_get_CT(a);
+
+        if (!dl_type_is_ip_any(flow->dl_type)
+            || (flow->ct_state & CS_INVALID && oc->flags & NX_CT_F_COMMIT)) {
+            inconsistent_match(usable_protocols);
+        }
+        return 0;
+    }
+
     case OFPACT_CLEAR_ACTIONS:
         return 0;
 
@@ -6155,6 +6414,7 @@  ofpact_outputs_to_port(const struct ofpact *ofpact, ofp_port_t port)
     case OFPACT_METER:
     case OFPACT_GROUP:
     case OFPACT_DEBUG_RECIRC:
+    case OFPACT_CT:
     default:
         return false;
     }
diff --git a/lib/ofp-actions.h b/lib/ofp-actions.h
index 818f5c8..38a9547 100644
--- a/lib/ofp-actions.h
+++ b/lib/ofp-actions.h
@@ -106,6 +106,7 @@ 
     OFPACT(EXIT,            ofpact_null,        ofpact, "exit")         \
     OFPACT(SAMPLE,          ofpact_sample,      ofpact, "sample")       \
     OFPACT(UNROLL_XLATE,    ofpact_unroll_xlate, ofpact, "unroll_xlate") \
+    OFPACT(CT,              ofpact_conntrack,   ofpact, "ct")           \
                                                                         \
     /* Debugging actions.                                               \
      *                                                                  \
@@ -471,6 +472,52 @@  BUILD_ASSERT_DECL(offsetof(struct ofpact_nest, actions) % OFPACT_ALIGNTO == 0);
 BUILD_ASSERT_DECL(offsetof(struct ofpact_nest, actions)
                   == sizeof(struct ofpact_nest));
 
+/* Bits for 'flags' in struct nx_action_conntrack.
+ *
+ * If NX_CT_F_COMMIT is set, then the connection entry is moved from the
+ * unconfirmed to confirmed list in the tracker. */
+enum nx_conntrack_flags {
+    NX_CT_F_COMMIT = 1 << 0,
+};
+
+/* Magic value for struct nx_action_conntrack 'recirc_table' field, to specify
+ * that the packet should not be recirculated. This value should commonly be
+ * used in conjunction with the NX_CT_F_COMMIT flag above. */
+#define NX_CT_RECIRC_NONE 0xFF
+
+#define CT_MEMBERS                      \
+    uint16_t flags;                     \
+    uint16_t zone_imm;                  \
+    struct mf_subfield zone_src;        \
+    uint8_t recirc_table;               \
+    uint16_t alg
+
+/* Used to perform build-time size checking for ofpact_conntrack. */
+struct ofpact_ct_size {
+    CT_MEMBERS;
+};
+
+/* OFPACT_CT.
+ *
+ * Used for NXAST_CT. */
+struct ofpact_conntrack {
+    struct ofpact ofpact;
+    CT_MEMBERS;
+    uint8_t pad[PAD_SIZE(sizeof(struct ofpact) + sizeof(struct ofpact_ct_size),
+                         OFPACT_ALIGNTO)];
+    struct ofpact actions[];
+};
+BUILD_ASSERT_DECL(offsetof(struct ofpact_conntrack, actions)
+                  % OFPACT_ALIGNTO == 0);
+BUILD_ASSERT_DECL(offsetof(struct ofpact_conntrack, actions)
+                  == sizeof(struct ofpact_conntrack));
+
+static inline size_t
+ofpact_ct_get_action_len(const struct ofpact_conntrack *oc)
+{
+    return oc->ofpact.len - offsetof(struct ofpact_conntrack, actions);
+}
+
 static inline size_t
 ofpact_nest_get_action_len(const struct ofpact_nest *on)
 {
diff --git a/lib/packets.c b/lib/packets.c
index a4d7854..d80f0b2 100644
--- a/lib/packets.c
+++ b/lib/packets.c
@@ -1062,3 +1062,24 @@  packet_csum_pseudoheader(const struct ip_header *ip)
 
     return partial;
 }
+
+const char *
+packet_ct_state_to_string(uint32_t flag)
+{
+    switch (flag) {
+    case CS_REPLY_DIR:
+        return "rpl";
+    case CS_TRACKED:
+        return "trk";
+    case CS_NEW:
+        return "new";
+    case CS_ESTABLISHED:
+        return "est";
+    case CS_RELATED:
+        return "rel";
+    case CS_INVALID:
+        return "inv";
+    default:
+        return NULL;
+    }
+}
diff --git a/lib/packets.h b/lib/packets.h
index 4fb1427..ed4540c 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -126,6 +126,8 @@  struct pkt_metadata {
     uint32_t skb_priority;      /* Packet priority for QoS. */
     uint32_t pkt_mark;          /* Packet mark. */
     union flow_in_port in_port; /* Input port. */
+    uint8_t ct_state;           /* Connection state. */
+    uint16_t ct_zone;           /* Connection zone. */
     struct flow_tnl tunnel;     /* Encapsulating tunnel parameters. Note that
                                  * if 'ip_dst' == 0, the rest of the fields may
                                  * be uninitialized. */
@@ -134,13 +136,18 @@  struct pkt_metadata {
 static inline void
 pkt_metadata_init(struct pkt_metadata *md, odp_port_t port)
 {
-    /* It can be expensive to zero out all of the tunnel metadata. However,
-     * we can just zero out ip_dst and the rest of the data will never be
-     * looked at. */
-    memset(md, 0, offsetof(struct pkt_metadata, tunnel));
-    md->tunnel.ip_dst = 0;
+    /* It can be expensive to zero out all metadata. Therefore:
+     *
+     * - We can just zero out 'tunnel.ip_dst' and the rest of the 'tunnel'
+     *   field will never be looked at.
+     * - We can just zero out 'ct_state' and the rest of the 'ct_*' members
+     *   will never be looked at. */
+    memset(md, 0, offsetof(struct pkt_metadata, in_port));
 
     md->in_port.odp_port = port;
+
+    md->ct_state = 0;
+    md->tunnel.ip_dst = 0;
 }
 
 bool dpid_from_string(const char *s, uint64_t *dpidp);
@@ -710,6 +717,18 @@  struct tcp_header {
 };
 BUILD_ASSERT_DECL(TCP_HEADER_LEN == sizeof(struct tcp_header));
 
+/* Connection states */
+#define CS_NEW               0x01
+#define CS_ESTABLISHED       0x02
+#define CS_RELATED           0x04
+#define CS_INVALID           0x20
+#define CS_REPLY_DIR         0x40
+#define CS_TRACKED           0x80
+
+/* Undefined connection state bits. */
+#define CS_UNSUPPORTED_MASK  0x18
+#define CS_SUPPORTED_MASK    (~CS_UNSUPPORTED_MASK & 0xFF)
+
 #define ARP_HRD_ETHERNET 1
 #define ARP_PRO_IP 0x0800
 #define ARP_OP_REQUEST 1
@@ -937,4 +956,6 @@  void compose_arp(struct dp_packet *, uint16_t arp_op,
                  ovs_be32 arp_spa, ovs_be32 arp_tpa);
 uint32_t packet_csum_pseudoheader(const struct ip_header *);
 
+const char *packet_ct_state_to_string(uint32_t flag);
+
 #endif /* packets.h */
diff --git a/ofproto/ofproto-dpif-rid.c b/ofproto/ofproto-dpif-rid.c
index 5a02bb0..eb0b0d9 100644
--- a/ofproto/ofproto-dpif-rid.c
+++ b/ofproto/ofproto-dpif-rid.c
@@ -137,6 +137,7 @@  recirc_metadata_hash(const struct recirc_state *state)
                             flow_tnl_size(state->metadata.tunnel)
                             / sizeof(uint64_t), hash);
     }
+    hash = hash_boolean(state->conntrack, hash);
     hash = hash_words64((const uint64_t *) &state->metadata.metadata,
                         (sizeof state->metadata - sizeof state->metadata.tunnel)
                         / sizeof(uint64_t),
@@ -168,6 +169,7 @@  recirc_metadata_equal(const struct recirc_state *a,
                  (!b->stack || !b->stack->size))
                 || (a->stack && b->stack && ofpbuf_equal(a->stack, b->stack)))
             && a->mirrors == b->mirrors
+            && a->conntrack == b->conntrack
             && a->action_set_len == b->action_set_len
             && ofpacts_equal(a->ofpacts, a->ofpacts_len,
                              b->ofpacts, b->ofpacts_len));
diff --git a/ofproto/ofproto-dpif-rid.h b/ofproto/ofproto-dpif-rid.h
index e7d95bf..7754e68 100644
--- a/ofproto/ofproto-dpif-rid.h
+++ b/ofproto/ofproto-dpif-rid.h
@@ -142,6 +142,7 @@  struct recirc_state {
     struct recirc_metadata metadata; /* Flow metadata. */
     struct ofpbuf *stack;         /* Stack if any. */
     mirror_mask_t mirrors;        /* Mirrors already output. */
+    bool conntrack;               /* Conntrack occurred prior to recirc. */
 
     /* Actions to be translated on recirculation. */
     uint32_t action_set_len;      /* How much of 'ofpacts' consists of an
diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
index 260c01b..0f7d258 100644
--- a/ofproto/ofproto-dpif-sflow.c
+++ b/ofproto/ofproto-dpif-sflow.c
@@ -1029,6 +1029,8 @@  sflow_read_set_action(const struct nlattr *attr,
     case OVS_KEY_ATTR_ICMPV6:
     case OVS_KEY_ATTR_ARP:
     case OVS_KEY_ATTR_ND:
+    case OVS_KEY_ATTR_CT_STATE:
+    case OVS_KEY_ATTR_CT_ZONE:
     case OVS_KEY_ATTR_UNSPEC:
     case __OVS_KEY_ATTR_MAX:
     default:
@@ -1137,6 +1139,7 @@  dpif_sflow_read_actions(const struct flow *flow,
 	case OVS_ACTION_ATTR_USERSPACE:
 	case OVS_ACTION_ATTR_RECIRC:
 	case OVS_ACTION_ATTR_HASH:
+        case OVS_ACTION_ATTR_CT:
 	    break;
 
 	case OVS_ACTION_ATTR_SET_MASKED:
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 81838be..44064a6 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -300,6 +300,11 @@  struct xlate_ctx {
      * the MPLS label stack that was originally present. */
     bool was_mpls;
 
+    /* True if conntrack has been performed on this packet during processing
+     * on the current bridge. This is used to determine whether conntrack
+     * state from the datapath should be honored after recirculation. */
+    bool conntracked;
+
     /* OpenFlow 1.1+ action set.
      *
      * 'action_set' accumulates "struct ofpact"s added by OFPACT_WRITE_ACTIONS.
@@ -2798,6 +2803,13 @@  xlate_commit_actions(struct xlate_ctx *ctx)
 }
 
 static void
+clear_conntrack(struct flow *flow)
+{
+    flow->ct_state = 0;
+    flow->ct_zone = 0;
+}
+
+static void
 compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
                         const struct xlate_bond_recirc *xr, bool check_stp)
 {
@@ -2807,6 +2819,8 @@  compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
     struct flow_tnl flow_tnl;
     ovs_be16 flow_vlan_tci;
     uint32_t flow_pkt_mark;
+    uint8_t flow_ct_state;
+    uint16_t flow_ct_zone;
     uint8_t flow_nw_tos;
     odp_port_t out_port, odp_port;
     bool tnl_push_pop_send = false;
@@ -2852,6 +2866,7 @@  compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
     if (xport->peer) {
         const struct xport *peer = xport->peer;
         struct flow old_flow = ctx->xin->flow;
+        bool old_conntrack = ctx->conntracked;
         bool old_was_mpls = ctx->was_mpls;
         cls_version_t old_version = ctx->tables_version;
         struct ofpbuf old_stack = ctx->stack;
@@ -2867,6 +2882,8 @@  compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
         memset(&flow->tunnel, 0, sizeof flow->tunnel);
         memset(flow->regs, 0, sizeof flow->regs);
         flow->actset_output = OFPP_UNSET;
+        ctx->conntracked = false;
+        clear_conntrack(flow);
 
         /* The bridge is now known so obtain its table version. */
         ctx->tables_version
@@ -2921,6 +2938,10 @@  compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
          * bridge. */
         ctx->was_mpls = old_was_mpls;
 
+        /* The peer bridge's conntrack execution should have no effect on the
+         * original bridge. */
+        ctx->conntracked = old_conntrack;
+
         /* The fact that the peer bridge exits (for any reason) does not mean
          * that the original bridge should exit.  Specifically, if the peer
          * bridge recirculates (which typically modifies the packet), the
@@ -2948,6 +2969,8 @@  compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
 
     flow_vlan_tci = flow->vlan_tci;
     flow_pkt_mark = flow->pkt_mark;
+    flow_ct_state = flow->ct_state;
+    flow_ct_zone = flow->ct_zone;
     flow_nw_tos = flow->nw_tos;
 
     if (count_skb_priorities(xport)) {
@@ -3070,6 +3093,8 @@  compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
     /* Restore flow */
     flow->vlan_tci = flow_vlan_tci;
     flow->pkt_mark = flow_pkt_mark;
+    flow->ct_state = flow_ct_state;
+    flow->ct_zone = flow_ct_zone;
     flow->nw_tos = flow_nw_tos;
 }
 
@@ -3509,24 +3534,23 @@  execute_controller_action(struct xlate_ctx *ctx, int len,
     dp_packet_delete(packet);
 }
 
-/* Called only when ctx->recirc_action_offset is set. */
 static void
-compose_recirculate_action(struct xlate_ctx *ctx)
+compose_recirculate_action__(struct xlate_ctx *ctx, uint8_t table)
 {
     struct recirc_metadata md;
     uint32_t id;
 
-    xlate_commit_actions(ctx);
     recirc_metadata_from_flow(&md, &ctx->xin->flow);
 
     ovs_assert(ctx->recirc_action_offset >= 0);
 
     struct recirc_state state = {
-        .table_id = 0,
+        .table_id = table,
         .ofproto = ctx->xbridge->ofproto,
         .metadata = md,
         .stack = &ctx->stack,
         .mirrors = ctx->mirrors,
+        .conntrack = ctx->conntracked,
         .action_set_len = ctx->recirc_action_offset,
         .ofpacts_len = ctx->action_set.size,
         .ofpacts = ctx->action_set.data,
@@ -3563,6 +3587,14 @@  compose_recirculate_action(struct xlate_ctx *ctx)
     ctx->last_unroll_offset = -1;
 }
 
+/* Called only when ctx->recirc_action_offset is set. */
+static void
+compose_recirculate_action(struct xlate_ctx *ctx)
+{
+    xlate_commit_actions(ctx);
+    compose_recirculate_action__(ctx, 0);
+}
+
 static void
 compose_mpls_push_action(struct xlate_ctx *ctx, struct ofpact_push_mpls *mpls)
 {
@@ -4096,6 +4128,7 @@  recirc_unroll_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
         case OFPACT_METER:
         case OFPACT_SAMPLE:
         case OFPACT_DEBUG_RECIRC:
+        case OFPACT_CT:
             break;
 
             /* These need not be copied for restoration. */
@@ -4119,6 +4152,46 @@  recirc_unroll_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
     }
 
 static void
+compose_conntrack_action(struct xlate_ctx *ctx, struct ofpact_conntrack *ofc)
+{
+    uint32_t flags = 0;
+    size_t ct_offset;
+    uint16_t zone;
+
+    /* Ensure that any prior actions are applied before composing the new
+     * conntrack action. */
+    xlate_commit_actions(ctx);
+
+    /* Process nested actions first, to populate the key. */
+    do_xlate_actions(ofc->actions, ofpact_ct_get_action_len(ofc), ctx);
+
+    if (ofc->flags & NX_CT_F_COMMIT) {
+        flags |= OVS_CT_F_COMMIT;
+    }
+    if (ofc->zone_src.field) {
+        zone = mf_get_subfield(&ofc->zone_src, &ctx->xin->flow) & 0xFF;
+    } else {
+        zone = ofc->zone_imm;
+    }
+
+    ct_offset = nl_msg_start_nested(ctx->odp_actions, OVS_ACTION_ATTR_CT);
+    nl_msg_put_u32(ctx->odp_actions, OVS_CT_ATTR_FLAGS, flags);
+    nl_msg_put_u16(ctx->odp_actions, OVS_CT_ATTR_ZONE, zone);
+    nl_msg_end_nested(ctx->odp_actions, ct_offset);
+
+    if (ofc->recirc_table == NX_CT_RECIRC_NONE) {
+        /* If we do not recirculate as part of this action, hide the results of
+         * connection tracking from subsequent recirculations. */
+        ctx->conntracked = false;
+    } else {
+        /* Use ct_* fields from datapath during recirculation upcall. */
+        ctx->conntracked = true;
+        ctx_trigger_recirculation(ctx);
+        compose_recirculate_action__(ctx, ofc->recirc_table);
+    }
+}
+
+static void
 do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
                  struct xlate_ctx *ctx)
 {
@@ -4482,6 +4555,10 @@  do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
             xlate_sample_action(ctx, ofpact_get_SAMPLE(a));
             break;
 
+        case OFPACT_CT:
+            compose_conntrack_action(ctx, ofpact_get_CT(a));
+            break;
+
         case OFPACT_DEBUG_RECIRC:
             ctx_trigger_recirculation(ctx);
             a = ofpact_next(a);
@@ -4786,6 +4863,7 @@  xlate_actions(struct xlate_in *xin, struct xlate_out *xout)
         .last_unroll_offset = -1,
 
         .was_mpls = false,
+        .conntracked = false,
 
         .action_set_has_group = false,
         .action_set = OFPBUF_STUB_INITIALIZER(action_set_stub),
@@ -4855,6 +4933,10 @@  xlate_actions(struct xlate_in *xin, struct xlate_out *xout)
         ctx.table_id = state->table_id;
         xlate_report(&ctx, "- Resuming from table %"PRIu8, ctx.table_id);
 
+        if (!state->conntrack) {
+            clear_conntrack(flow);
+        }
+
         /* Restore pipeline metadata. May change flow's in_port and other
          * metadata to the values that existed when recirculation was
          * triggered. */
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index e39a959..51ab3f0 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -1228,6 +1228,45 @@  check_masked_set_action(struct dpif_backer *backer)
     return !error;
 }
 
+#define CHECK_FEATURE__(NAME, FIELD)                                        \
+static bool                                                                 \
+check_##NAME(struct dpif_backer *backer)                                    \
+{                                                                           \
+    struct flow flow;                                                       \
+    struct odputil_keybuf keybuf;                                           \
+    struct ofpbuf key;                                                      \
+    bool enable;                                                            \
+    struct odp_flow_key_parms odp_parms = {                                 \
+        .flow = &flow,                                                      \
+        .support = {                                                        \
+            .NAME = true,                                                   \
+        },                                                                  \
+    };                                                                      \
+                                                                            \
+    memset(&flow, 0, sizeof flow);                                          \
+    flow.FIELD = 1;                                                         \
+                                                                            \
+    ofpbuf_use_stack(&key, &keybuf, sizeof keybuf);                         \
+    odp_flow_key_from_flow(&odp_parms, &key);                               \
+    enable = dpif_probe_feature(backer->dpif, #NAME, &key, NULL);           \
+                                                                            \
+    if (enable) {                                                           \
+        VLOG_INFO("%s: Datapath supports "#NAME, dpif_name(backer->dpif));  \
+    } else {                                                                \
+        VLOG_INFO("%s: Datapath does not support "#NAME,                    \
+                  dpif_name(backer->dpif));                                 \
+    }                                                                       \
+                                                                            \
+    return enable;                                                          \
+}
+#define CHECK_FEATURE(FIELD) CHECK_FEATURE__(FIELD, FIELD)
+
+CHECK_FEATURE(ct_state)
+CHECK_FEATURE(ct_zone)
+
+#undef CHECK_FEATURE
+#undef CHECK_FEATURE__
+
 static void
 check_support(struct dpif_backer *backer)
 {
@@ -1239,6 +1278,9 @@  check_support(struct dpif_backer *backer)
     backer->support.masked_set_action = check_masked_set_action(backer);
     backer->support.ufid = check_ufid(backer);
     backer->support.tnl_push_pop = dpif_supports_tnl_push_pop(backer->dpif);
+
+    backer->support.odp.ct_state = check_ct_state(backer);
+    backer->support.odp.ct_zone = check_ct_zone(backer);
 }
 
 static int
@@ -3935,10 +3977,38 @@  rule_dealloc(struct rule *rule_)
 }
 
 static enum ofperr
+rule_check(struct rule *rule)
+{
+    struct ofproto_dpif *ofproto = ofproto_dpif_cast(rule->ofproto);
+    const struct odp_support *support;
+    struct match match;
+
+    minimatch_expand(&rule->cr.match, &match);
+    support = &ofproto_dpif_get_support(ofproto)->odp;
+
+    if ((match.wc.masks.ct_state && !support->ct_state)
+        || (match.wc.masks.ct_zone && !support->ct_zone)) {
+        return OFPERR_OFPBMC_BAD_FIELD;
+    }
+    if (match.wc.masks.ct_state & CS_UNSUPPORTED_MASK) {
+        return OFPERR_OFPBMC_BAD_MASK;
+    }
+
+    return 0;
+}
+
+static enum ofperr
 rule_construct(struct rule *rule_)
     OVS_NO_THREAD_SAFETY_ANALYSIS
 {
     struct rule_dpif *rule = rule_dpif_cast(rule_);
+    int error;
+
+    error = rule_check(rule_);
+    if (error) {
+        return error;
+    }
+
     ovs_mutex_init_adaptive(&rule->stats_mutex);
     rule->stats.n_packets = 0;
     rule->stats.n_bytes = 0;
diff --git a/ofproto/ofproto-unixctl.man b/ofproto/ofproto-unixctl.man
index 89013d9..892f6fc 100644
--- a/ofproto/ofproto-unixctl.man
+++ b/ofproto/ofproto-unixctl.man
@@ -103,6 +103,10 @@  only metadata. The metadata can be:
 Packet QoS priority.
 .IP \fIpkt_mark\fR
 Mark of the packet.
+.IP \fIct_state\fR
+Connection state of the packet.
+.IP \fIct_zone\fR
+Connection tracking zone for packet.
 .IP \fItun_id\fR
 The tunnel ID on which the packet arrived.
 .IP \fIin_port\fR
diff --git a/tests/atlocal.in b/tests/atlocal.in
index 5946a3c..8e9fd9b 100644
--- a/tests/atlocal.in
+++ b/tests/atlocal.in
@@ -110,3 +110,10 @@  fi
 if test "$IS_WIN32" = "yes"; then
     HAVE_PYTHON="no"
 fi
+
+# Conntrack test requirements
+if test x`which conntrack` != x; then
+    HAVE_CONNTRACK="yes"
+else
+    HAVE_CONNTRACK="no"
+fi
diff --git a/tests/automake.mk b/tests/automake.mk
index 4198039..cef67c0 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -337,6 +337,7 @@  CHECK_PYFILES = \
 	tests/test-daemon.py \
 	tests/test-json.py \
 	tests/test-jsonrpc.py \
+	tests/test-l7.py \
 	tests/test-ovsdb.py \
 	tests/test-reconnect.py \
 	tests/MockXenAPI.py \
diff --git a/tests/dpif-netdev.at b/tests/dpif-netdev.at
index e9af63f..43108e7 100644
--- a/tests/dpif-netdev.at
+++ b/tests/dpif-netdev.at
@@ -82,7 +82,7 @@  AT_CHECK([cat ovs-vswitchd.log | grep -A 1 'miss upcall' | tail -n 1], [0], [dnl
 skb_priority(0),skb_mark(0),recirc_id(0),dp_hash(0),in_port(1),eth(src=50:54:00:00:00:09,dst=50:54:00:00:00:0a),eth_type(0x0800),ipv4(src=10.0.0.2,dst=10.0.0.1,proto=1,tos=0,ttl=64,frag=no),icmp(type=8,code=0)
 ])
 AT_CHECK([cat ovs-vswitchd.log | FILTER_FLOW_INSTALL | STRIP_XOUT], [0], [dnl
-pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.0.0.2,nw_dst=10.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions: <del>
+pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,ct_state=0,ct_zone=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.0.0.2,nw_dst=10.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions: <del>
 recirc_id=0,ip,in_port=1,vlan_tci=0x0000/0x1fff,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_frag=no, actions: <del>
 ])
 
diff --git a/tests/odp.at b/tests/odp.at
index 61edde3..74d44ff 100644
--- a/tests/odp.at
+++ b/tests/odp.at
@@ -86,6 +86,11 @@  sed '/bos=0/{
 s/^/ODP_FIT_TOO_LITTLE: /
 }' < odp-in.txt > odp-out.txt
 
+dnl Some fields are always printed for this test, because wildcards aren't
+dnl specified. We can skip these.
+sed -i 's/\(skb_mark(0)\),\(ct\)/\1,ct_state(0),ct_zone(0),\2/' odp-out.txt
+sed -i 's/\(skb_mark([[^)]]*)\),\(recirc\)/\1,ct_state(0),ct_zone(0),\2/' odp-out.txt
+
 AT_CHECK_UNQUOTED([ovstest test-odp parse-keys < odp-in.txt], [0], [`cat odp-out.txt`
 ])
 AT_CLEANUP
@@ -153,6 +158,10 @@  s/\(eth([[^)]]*)\),*/\1,eth_type(0x8100),vlan(vid=99,pcp=7),encap(/
 s/$/)/' odp-base.txt
 
  echo
+ echo '# Valid forms with conntrack fields.'
+ sed 's/\(eth([[^)]]*),?\)/\1,ct_state(+trk),/' odp-base.txt
+
+ echo
  echo '# Valid forms with IP first fragment.'
 sed -n 's/,frag=no),/,frag=first),/p' odp-base.txt
 
@@ -293,6 +302,9 @@  tnl_push(tnl_port(6),header(size=50,type=4,eth(dst=f8:bc:12:44:34:b6,src=f8:bc:1
 tnl_push(tnl_port(6),header(size=50,type=5,eth(dst=f8:bc:12:44:34:b6,src=f8:bc:12:46:58:e0,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x40),udp(src=0,dst=6081,csum=0x0),geneve(oam,vni=0x1c7)),out_port(1))
 tnl_push(tnl_port(6),header(size=58,type=5,eth(dst=f8:bc:12:44:34:b6,src=f8:bc:12:46:58:e0,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x40),udp(src=0,dst=6081,csum=0x0),geneve(crit,vni=0x1c7,options({class=0xffff,type=0x80,len=4,0xa}))),out_port(1))
 tnl_push(tnl_port(6),header(size=50,type=5,eth(dst=f8:bc:12:44:34:b6,src=f8:bc:12:46:58:e0,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=17,tos=0,ttl=64,frag=0x40),udp(src=0,dst=6081,csum=0xffff),geneve(vni=0x1c7)),out_port(1))
+ct
+ct(commit)
+ct(commit,zone=5)
 ])
 AT_CHECK_UNQUOTED([ovstest test-odp parse-actions < actions.txt], [0],
   [`cat actions.txt`
diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
index eb8647b..1e3013b 100644
--- a/tests/ofproto-dpif.at
+++ b/tests/ofproto-dpif.at
@@ -6486,8 +6486,8 @@  for i in 1 2 3 4; do
 done
 sleep 1
 AT_CHECK([cat ovs-vswitchd.log | STRIP_UFID | FILTER_FLOW_INSTALL | STRIP_USED], [0], [dnl
-pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.0.0.2,nw_dst=10.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions:2
-pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:0b,dl_dst=50:54:00:00:00:0c,nw_src=10.0.0.4,nw_dst=10.0.0.3,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions:drop
+pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,ct_state=0,ct_zone=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.0.0.2,nw_dst=10.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions:2
+pkt_mark=0,recirc_id=0,dp_hash=0,skb_priority=0,ct_state=0,ct_zone=0,icmp,in_port=1,vlan_tci=0x0000,dl_src=50:54:00:00:00:0b,dl_dst=50:54:00:00:00:0c,nw_src=10.0.0.4,nw_dst=10.0.0.3,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0, actions:drop
 ])
 AT_CHECK([cat ovs-vswitchd.log | STRIP_UFID | FILTER_FLOW_DUMP | grep 'packets:3'], [0], [dnl
 skb_priority(0),skb_mark(0),recirc_id(0),dp_hash(0),in_port(1),eth(src=50:54:00:00:00:09,dst=50:54:00:00:00:0a),eth_type(0x0800),ipv4(src=10.0.0.2,dst=10.0.0.1,proto=1,tos=0,ttl=64,frag=no),icmp(type=8,code=0), packets:3, bytes:180, used:0.0s, actions:2
diff --git a/tests/ofproto.at b/tests/ofproto.at
index 7e80293..c522a85 100644
--- a/tests/ofproto.at
+++ b/tests/ofproto.at
@@ -1609,6 +1609,8 @@  head_table () {
       in_port_oxm: exact match or wildcard
       actset_output: exact match or wildcard
       pkt_mark: arbitrary mask
+      ct_state: arbitrary mask
+      ct_zone: exact match or wildcard
       reg0: arbitrary mask
       reg1: arbitrary mask
       reg2: arbitrary mask
diff --git a/tests/system-common-macros.at b/tests/system-common-macros.at
index c5691e7..8f3b318 100644
--- a/tests/system-common-macros.at
+++ b/tests/system-common-macros.at
@@ -118,3 +118,21 @@  m4_define([ADD_NATIVE_TUNNEL],
 # Strip variant pieces from ping output so the output can be reliably compared.
 #
 m4_define([FORMAT_PING], [grep "transmitted" | sed 's/time.*ms$/time 0ms/'])
+
+# FORMAT_CT()
+#
+# Strip content from the piped input which would differ from test to test.
+#
+m4_define([FORMAT_CT],
+    [[grep "dst=$1" | sed -e 's/port=[0-9]*/port=<cleared>/g' -e 's/  */ /g' -e 's/secctx[^ ]* //' | cut -d' ' -f4- | sort | uniq]])
+
+# NETNS_DAEMONIZE([namespace], [command], [pidfile])
+#
+# Run 'command' as a background process within 'namespace' and record its pid
+# to 'pidfile' to allow cleanup on exit.
+#
+m4_define([NETNS_DAEMONIZE],
+   [ip netns exec $1 $2 & echo $! > $3
+     echo "kill \`cat $3\`" >> cleanup
+   ]
+)
diff --git a/tests/system-kmod-macros.at b/tests/system-kmod-macros.at
index 1216db8..a48e8d9 100644
--- a/tests/system-kmod-macros.at
+++ b/tests/system-kmod-macros.at
@@ -43,3 +43,19 @@  m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
   [OVS_VSWITCHD_STOP([$1])
    AT_CHECK([:; $2])
   ])
+
+# CHECK_CONNTRACK()
+#
+# Perform requirements checks for running conntrack tests, and flush the
+# kernel conntrack tables when the test is finished.
+#
+m4_define([CHECK_CONNTRACK],
+    [AT_SKIP_IF([test $HAVE_CONNTRACK = no])
+     AT_SKIP_IF([test $HAVE_PYTHON = no])
+     m4_foreach([mod], [[nf_conntrack_ipv4], [nf_conntrack_ipv6]],
+                [modprobe mod || echo "Module mod not loaded."
+                 on_exit 'modprobe -r mod'
+                ])
+     on_exit 'conntrack -F'
+    ]
+)
diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index 7dbed68..de6b016 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -139,3 +139,472 @@  NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PI
 
 OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
+
+AT_SETUP([conntrack - controller])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+in_port=1,udp,action=ct(commit),controller
+in_port=2,ct_state=-trk,udp,action=ct(table=0)
+in_port=2,ct_state=+trk+est-new,udp,action=controller
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+AT_CAPTURE_FILE([ofctl_monitor.log])
+AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
+
+dnl Send an unsolicited reply from port 2. This should be dropped.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c00000000001100000a0101020a0101010002000100080000'])
+
+dnl OK, now start a new connection from port 1.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 1 ct\(commit\),controller '50540000000a50540000000908004500001c00000000001100000a0101010a0101020001000200080000'])
+
+dnl Now try a reply from port 2.
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) '50540000000a50540000000908004500001c00000000001100000a0101020a0101010002000100080000'])
+
+dnl Check this output. We only see the latter two packets, not the first.
+AT_CHECK([cat ofctl_monitor.log], [0], [dnl
+NXT_PACKET_IN (xid=0x0): total_len=42 in_port=1 (via action) data_len=42 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.1,nw_dst=10.1.1.2,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=1,tp_dst=2 udp_csum:0
+NXT_PACKET_IN (xid=0x0): cookie=0x0 total_len=42 ct_state=est|rpl|trk,in_port=2 (via action) data_len=42 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=50:54:00:00:00:09,dl_dst=50:54:00:00:00:0a,nw_src=10.1.1.2,nw_dst=10.1.1.1,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=2,tp_dst=1 udp_csum:0
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - IPv4 HTTP])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=10,icmp,action=normal
+in_port=1,tcp,action=ct(commit),2
+in_port=2,ct_state=-trk,tcp,action=ct(table=0)
+in_port=2,ct_state=+trk+est-new,tcp,action=1
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl Basic connectivity check.
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 >/dev/null])
+
+dnl HTTP requests from ns0->ns1 should work fine.
+NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
+NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
+
+AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
+TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 use=1
+])
+
+dnl HTTP requests from ns1->ns0 should fail due to network failure.
+dnl Try 3 times, in 1 second intervals.
+NETNS_DAEMONIZE([at_ns0], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
+NS_CHECK_EXEC([at_ns1], [wget 10.1.1.1 -t 3 -T 1 -v -o wget1.log], [4])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - IPv6 HTTP])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "fc00::1/96")
+ADD_VETH(p1, at_ns1, br0, "fc00::2/96")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,icmp6,action=normal
+in_port=1,tcp6,action=ct(commit),2
+in_port=2,ct_state=-trk,tcp6,action=ct(table=0)
+in_port=2,ct_state=+trk+est-new,tcp6,action=1
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl Without this sleep, we get occasional failures due to the following error:
+dnl "connect: Cannot assign requested address"
+sleep 2;
+
+dnl HTTP requests from ns0->ns1 should work fine.
+NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py http6]], [http0.pid])
+
+NS_CHECK_EXEC([at_ns0], [wget http://[[fc00::2]] -t 3 -T 1 --retry-connrefused -v -o wget0.log])
+
+dnl HTTP requests from ns1->ns0 should fail due to network failure.
+dnl Try 3 times, in 1 second intervals.
+NETNS_DAEMONIZE([at_ns0], [[$PYTHON $srcdir/test-l7.py http6]], [http1.pid])
+NS_CHECK_EXEC([at_ns1], [wget http://[[fc00::1]] -t 3 -T 1 -v -o wget1.log], [4])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - commit, recirc])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
+ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
+
+dnl Allow any traffic from ns0->ns1, ns2->ns3.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=10,icmp,action=normal
+in_port=1,tcp,ct_state=-trk,action=ct(commit,table=0)
+in_port=1,tcp,ct_state=+trk,action=2
+in_port=2,tcp,ct_state=-trk,action=ct(table=0)
+in_port=2,tcp,ct_state=+trk,action=1
+in_port=3,tcp,ct_state=-trk,action=set_field:0->metadata,ct(table=0)
+in_port=3,tcp,ct_state=+trk,metadata=0,action=set_field:1->metadata,ct(commit,table=0)
+in_port=3,tcp,ct_state=+trk,metadata=1,action=4
+in_port=4,tcp,ct_state=-trk,action=ct(commit,table=0)
+in_port=4,tcp,ct_state=+trk,action=3
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl HTTP requests from p0->p1 should work fine.
+NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
+NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
+
+dnl HTTP requests from p2->p3 should work fine.
+NETNS_DAEMONIZE([at_ns3], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
+NS_CHECK_EXEC([at_ns2], [wget 10.1.1.4 -t 3 -T 1 --retry-connrefused -v -o wget1.log])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - preserve registers])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
+ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
+
+dnl Allow any traffic from ns0->ns1, ns2->ns3.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=10,icmp,action=normal
+in_port=1,tcp,ct_state=-trk,action=ct(commit,table=0)
+in_port=1,tcp,ct_state=+trk,action=2
+in_port=2,tcp,ct_state=-trk,action=ct(table=0)
+in_port=2,tcp,ct_state=+trk,action=1
+in_port=3,tcp,ct_state=-trk,action=load:0->NXM_NX_REG0[[]],ct(table=0)
+in_port=3,tcp,ct_state=+trk,reg0=0,action=load:1->NXM_NX_REG0[[]],ct(commit,table=0)
+in_port=3,tcp,ct_state=+trk,reg0=1,action=4
+in_port=4,tcp,ct_state=-trk,action=ct(commit,table=0)
+in_port=4,tcp,ct_state=+trk,action=3
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl HTTP requests from p0->p1 should work fine.
+NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
+NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
+
+dnl HTTP requests from p2->p3 should work fine.
+NETNS_DAEMONIZE([at_ns3], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
+NS_CHECK_EXEC([at_ns2], [wget 10.1.1.4 -t 3 -T 1 --retry-connrefused -v -o wget1.log])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - invalid])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
+ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=10,icmp,action=normal
+in_port=1,tcp,action=ct(),2
+in_port=2,ct_state=-trk,tcp,action=ct(table=0)
+in_port=2,ct_state=+trk+new,tcp,action=1
+in_port=3,tcp,action=ct(),4
+in_port=4,ct_state=-trk,tcp,action=ct(table=0)
+in_port=4,ct_state=+trk+inv,tcp,action=3
+in_port=4,ct_state=+trk+new,tcp,action=3
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl We set up our rules to allow the request without committing. The return
+dnl traffic can't be identified, because the initial request wasn't committed.
+dnl For the first pair of ports, this means that the connection fails.
+NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
+NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log], [4])
+
+dnl For the second pair, we allow packets from invalid connections, so it works.
+NETNS_DAEMONIZE([at_ns3], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
+NS_CHECK_EXEC([at_ns2], [wget 10.1.1.4 -t 3 -T 1 --retry-connrefused -v -o wget1.log])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - zones])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
+ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=10,icmp,action=normal
+in_port=1,tcp,action=ct(commit,zone=1),2
+in_port=2,ct_state=-trk,tcp,action=ct(table=0,zone=1)
+in_port=2,ct_state=+trk,ct_zone=1,tcp,action=1
+in_port=3,tcp,action=ct(commit,zone=2),4
+in_port=4,ct_state=-trk,tcp,action=ct(table=0,zone=2)
+in_port=4,ct_state=+trk,ct_zone=1,tcp,action=3
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl HTTP requests from p0->p1 should work fine.
+NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
+NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
+
+AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
+TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 zone=1 use=1
+])
+
+dnl HTTP requests from p2->p3 should fail due to network failure.
+dnl Try 3 times, in 1 second intervals.
+NETNS_DAEMONIZE([at_ns3], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
+NS_CHECK_EXEC([at_ns2], [wget 10.1.1.4 -t 3 -T 1 -v -o wget1.log], [4])
+
+AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.4)], [0], [dnl
+SYN_RECV src=10.1.1.3 dst=10.1.1.4 sport=<cleared> dport=<cleared> src=10.1.1.4 dst=10.1.1.3 sport=<cleared> dport=<cleared> mark=0 zone=2 use=1
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - zones from field])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
+ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=10,icmp,action=normal
+in_port=1,tcp,action=load:1->NXM_NX_REG0[[0..15]],ct(commit,zone=NXM_NX_REG0[[0..15]]),2
+in_port=2,ct_state=-trk,tcp,action=load:1->NXM_NX_REG0[[0..15]],ct(table=0,zone=NXM_NX_REG0[[0..15]])
+in_port=2,ct_state=+trk,ct_zone=1,tcp,action=1
+in_port=3,tcp,action=load:2->NXM_NX_REG0[[0..15]],ct(commit,zone=NXM_NX_REG0[[0..15]]),4
+in_port=4,ct_state=-trk,tcp,action=load:2->NXM_NX_REG0[[0..15]],ct(table=0,zone=NXM_NX_REG0[[0..15]])
+in_port=4,ct_state=+trk,ct_zone=1,tcp,action=3
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl HTTP requests from p0->p1 should work fine.
+NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
+NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
+
+AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
+TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 zone=1 use=1
+])
+
+dnl HTTP requests from p2->p3 should fail due to network failure.
+dnl Try 3 times, in 1 second intervals.
+NETNS_DAEMONIZE([at_ns3], [[$PYTHON $srcdir/test-l7.py]], [http1.pid])
+NS_CHECK_EXEC([at_ns2], [wget 10.1.1.4 -t 3 -T 1 -v -o wget1.log], [4])
+
+AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.4)], [0], [dnl
+SYN_RECV src=10.1.1.3 dst=10.1.1.4 sport=<cleared> dport=<cleared> src=10.1.1.4 dst=10.1.1.3 sport=<cleared> dport=<cleared> mark=0 zone=2 use=1
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - multiple bridges])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone --\
+    add-br br1 --\
+    add-port br0 patch+ -- set int patch+ type=patch options:peer=patch- --\
+    add-port br1 patch- -- set int patch- type=patch options:peer=patch+ --])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br1, "10.1.1.2/24")
+
+dnl Allow any traffic from ns0->ns1, allow established in reverse.
+AT_DATA([flows-br0.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=10,icmp,action=normal
+in_port=2,tcp,ct_state=-trk,action=ct(commit,zone=1),1
+in_port=1,tcp,ct_state=-trk,action=ct(table=0,zone=1)
+in_port=1,tcp,ct_state=+trk+est,ct_zone=1,action=2
+])
+
+dnl Allow any traffic from ns0->ns1, allow established in reverse.
+AT_DATA([flows-br1.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=10,icmp,action=normal
+in_port=1,tcp,ct_state=-trk,action=ct(table=0,zone=2)
+in_port=1,tcp,ct_state=+trk+new,ct_zone=2,action=ct(commit,zone=2),2
+in_port=1,tcp,ct_state=+trk+est,ct_zone=2,action=2
+in_port=2,tcp,ct_state=-trk,action=ct(table=0,zone=2)
+in_port=2,tcp,ct_state=+trk+est,ct_zone=2,action=ct(commit,zone=2),1
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows-br0.txt])
+AT_CHECK([ovs-ofctl add-flows br1 flows-br1.txt])
+
+dnl HTTP requests from p0->p1 should work fine.
+NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
+NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - multiple zones])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1, at_ns2, at_ns3)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+ADD_VETH(p2, at_ns2, br0, "10.1.1.3/24")
+ADD_VETH(p3, at_ns3, br0, "10.1.1.4/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+priority=10,icmp,action=normal
+in_port=1,tcp,action=ct(commit,zone=1),ct(commit,zone=2),2
+in_port=2,ct_state=-trk,tcp,action=ct(table=0,zone=2)
+in_port=2,ct_state=+trk,ct_zone=2,tcp,action=1
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl HTTP requests from p0->p1 should work fine.
+NETNS_DAEMONIZE([at_ns1], [[$PYTHON $srcdir/test-l7.py]], [http0.pid])
+NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
+
+dnl (again) HTTP requests from p0->p1 should work fine.
+NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
+
+AT_CHECK([conntrack -L 2>&1 | FORMAT_CT(10.1.1.2)], [0], [dnl
+SYN_SENT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> [[UNREPLIED]] src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> mark=0 zone=1 use=1
+TIME_WAIT src=10.1.1.1 dst=10.1.1.2 sport=<cleared> dport=<cleared> src=10.1.1.2 dst=10.1.1.1 sport=<cleared> dport=<cleared> [[ASSURED]] mark=0 zone=2 use=1
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([conntrack - ICMP related 2])
+CHECK_CONNTRACK()
+OVS_TRAFFIC_VSWITCHD_START(
+   [set-fail-mode br0 standalone -- ])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "172.16.0.1/24")
+ADD_VETH(p1, at_ns1, br0, "172.16.0.2/24")
+
+dnl Allow any traffic from ns0->ns1. Only allow nd, return traffic from ns1->ns0.
+AT_DATA([flows.txt], [dnl
+priority=1,action=drop
+priority=10,arp,action=normal
+in_port=1,ct_state=-trk,udp,action=ct(commit,table=0)
+in_port=1,ct_state=+trk,actions=controller
+in_port=2,ct_state=-trk,action=ct(table=0)
+in_port=2,ct_state=+trk-inv-new,action=controller
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+AT_CAPTURE_FILE([ofctl_monitor.log])
+AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
+
+dnl 1. Send an ICMP port unreach reply for port 8738, without any previous request
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) 'f64c473528c9c6f54ecb72db080045c0003d2e8700004001f355ac100004ac1000030303553f0000000045000021317040004011b138ac100003ac10000411112222000d20966369616f0a'])
+
+dnl 2. Send and UDP packet to port 5555
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 1 ct\(commit,table=0\) 'c6f94ecb72dbe64c473528c9080045000021317040004011b138ac100001ac100002a28e15b3000d20966369616f0a'])
+
+dnl 3. Send an ICMP port unreach reply for port 5555, related to the first packet
+AT_CHECK([ovs-ofctl -O OpenFlow13 packet-out br0 2 ct\(table=0\) 'e64c473528c9c6f94ecb72db080045c0003d2e8700004001f355ac100002ac1000010303553f0000000045000021317040004011b138ac100001ac100002a28e15b3000d20966369616f0a'])
+
+dnl Check this output. We only see the latter two packets, not the first.
+AT_CHECK([cat ofctl_monitor.log], [0], [dnl
+NXT_PACKET_IN (xid=0x0): cookie=0x0 total_len=47 ct_state=new|trk,in_port=1 (via action) data_len=47 (unbuffered)
+udp,vlan_tci=0x0000,dl_src=e6:4c:47:35:28:c9,dl_dst=c6:f9:4e:cb:72:db,nw_src=172.16.0.1,nw_dst=172.16.0.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=41614,tp_dst=5555 udp_csum:2096
+NXT_PACKET_IN (xid=0x0): cookie=0x0 total_len=75 ct_state=rel|rpl|trk,in_port=2 (via action) data_len=75 (unbuffered)
+icmp,vlan_tci=0x0000,dl_src=c6:f9:4e:cb:72:db,dl_dst=e6:4c:47:35:28:c9,nw_src=172.16.0.2,nw_dst=172.16.0.1,nw_tos=192,nw_ecn=0,nw_ttl=64,icmp_type=3,icmp_code=3 icmp_csum:553f
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
diff --git a/tests/system-userspace-macros.at b/tests/system-userspace-macros.at
index fca26f7..16256cb 100644
--- a/tests/system-userspace-macros.at
+++ b/tests/system-userspace-macros.at
@@ -37,3 +37,12 @@  m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
 /dpif_netlink.*Generic Netlink family 'ovs_datapath' does not exist. The Open vSwitch kernel module is probably not loaded./d"])
    AT_CHECK([:; $2])
   ])
+
+# CHECK_CONNTRACK()
+#
+# Perform requirements checks for running conntrack tests, and flush the
+# kernel conntrack tables when the test is finished.
+#
+m4_define([CHECK_CONNTRACK],
+    [AT_SKIP_IF(true)]
+)
diff --git a/tests/test-l7.py b/tests/test-l7.py
new file mode 100755
index 0000000..65c6c2a
--- /dev/null
+++ b/tests/test-l7.py
@@ -0,0 +1,72 @@ 
+# Copyright (c) 2015 Nicira, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at:
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import socket
+
+from BaseHTTPServer import HTTPServer
+from SimpleHTTPServer import SimpleHTTPRequestHandler
+from SocketServer import TCPServer
+
+
+class TCPServerV6(HTTPServer):
+    address_family = socket.AF_INET6
+
+
+def get_ftpd():
+    try:
+        from pyftpdlib.authorizers import DummyAuthorizer
+        from pyftpdlib.handlers import FTPHandler
+        from pyftpdlib.servers import FTPServer
+
+        class OVSFTPHandler(FTPHandler):
+            authorizer = DummyAuthorizer()
+            authorizer.add_anonymous("/tmp")
+        server = [FTPServer, OVSFTPHandler, 21]
+    except ImportError:
+        server = None
+        pass
+    return server
+
+
+def main():
+    SERVERS = {
+        'http':  [TCPServer,   SimpleHTTPRequestHandler, 80],
+        'http6': [TCPServerV6, SimpleHTTPRequestHandler, 80],
+    }
+
+    ftpd = get_ftpd()
+    if ftpd is not None:
+        SERVERS['ftp'] = ftpd
+
+    protocols = [srv for srv in SERVERS]
+    parser = argparse.ArgumentParser(
+            description='Run basic application servers.')
+    parser.add_argument('proto', default='http', nargs='?',
+            help='protocol to serve (%s)' % protocols)
+    args = parser.parse_args()
+
+    if args.proto not in SERVERS:
+        parser.print_help()
+        exit(1)
+
+    constructor = SERVERS[args.proto][0]
+    handler = SERVERS[args.proto][1]
+    port = SERVERS[args.proto][2]
+    srv = constructor(('', port), handler)
+    srv.serve_forever()
+
+
+if __name__ == '__main__':
+    main()
diff --git a/tests/test-odp.c b/tests/test-odp.c
index 3e7939e..cb245d6 100644
--- a/tests/test-odp.c
+++ b/tests/test-odp.c
@@ -57,7 +57,10 @@  parse_keys(bool wc_keys)
             struct odp_flow_key_parms odp_parms = {
                 .flow = &flow,
                 .support = {
+                    .max_mpls_depth = SIZE_MAX,
                     .recirc = true,
+                    .ct_state = true,
+                    .ct_zone = true,
                 },
             };
 
diff --git a/utilities/ovs-ofctl.8.in b/utilities/ovs-ofctl.8.in
index 8bb3715..7e5fcaa 100644
--- a/utilities/ovs-ofctl.8.in
+++ b/utilities/ovs-ofctl.8.in
@@ -1305,6 +1305,65 @@  is used only with the \fBconjunction\fR action (see below).
 .IP
 This field was introduced in Open vSwitch 2.4.
 .
+.IP \fBct_state=\fIflags\fB/\fImask\fR
+.IQ \fBct_state=\fR[\fB+\fIflag\fR...][\fB-\fIflag\fR...]
+Bitwise match on connection state flags.  The flags are only available
+after a call to the \fBct\fR action with the \fBtable\fR specified.
+
+.IP
+The \fIflags\fR and \fImask\fR are 8-bit numbers written in decimal or
+in hexadecimal prefixed by \fB0x\fR.  Each 1-bit in \fImask\fR requires
+that the corresponding bit in \fIflags\fR must match.  Each 0-bit in
+\fImask\fR causes the corresponding bit to be ignored.
+.IP
+Alternatively, the flags can be specified by their symbolic names
+(listed below), each preceded by either \fB+\fR for a flag that must
+be set, or \fB\-\fR for a flag that must be unset, without any other
+delimiters between the flags.  Flags not mentioned are wildcarded.  For
+example, \fBtcp,ct_state=+trk\-new\fR matches TCP packets that
+have been run through the connection tracker and do not establish a new
+flow.
+.IP
+The following flags describe the state of the tracking:
+.RS
+.IP "\fB0x80: trk\fR"
+Connection tracking has occurred.
+.IP "\fB0x40: rpl\fR"
+The flow is in the reply direction, meaning it did not initiate the
+connection.
+.IP "\fB0x20: inv\fR"
+The flow is invalid, meaning that the connection tracker couldn't identify the
+connection.
+.RS
+.PP
+This flag may be set for the following reasons:
+.RS
+L3/L4 protocol handler is not loaded/unavailable. With the Linux kernel
+datapath, this may mean that the "nf_conntrack_ipv4" or "nf_conntrack_ipv6"
+modules are not loaded.
+.PP
+L3/L4 protocol handler determines that the packet is malformed or invalid.
+.PP
+Packets are unexpected length for protocol.
+.RE
+.RE
+.IP "\fB0x01: new\fR"
+This is the beginning of a new connection.
+.IP "\fB0x02: est\fR"
+This is part of an already existing connection.
+.IP "\fB0x04: rel\fR"
+This is a connection that is related to an existing connection, for
+instance ICMP "destination unreachable" messages or FTP data connections.
+.RE
+.
+.PP
+The following fields are data associated with the connection tracker and
+can only be matched or set after running through the connection tracker
+by using the \fBct\fR action.
+.
+.IP \fBct_zone=\fIvalue
+Matches connection zone \fIvalue\fR exactly.
+.
 .PP
 Defining IPv6 flows (those with \fBdl_type\fR equal to 0x86dd) requires
 support for NXM.  The following shorthand notations are available for
@@ -1542,6 +1601,32 @@  OpenFlow implementations do not support queuing at all.
 Restores the queue to the value it was before any \fBset_queue\fR
 actions were applied.
 .
+.IP \fBct\fR
+.IQ \fBct\fB(\fR[\fIargument\fR][\fB,\fIargument\fR...]\fB)
+Send the packet through the connection tracker.  The following arguments
+are supported:
+
+.RS
+.IP \fBcommit\fR
+Commit the flow to the connection tracking module.
+.IP \fBtable=\fInumber\fR
+After ct processing, the packet should be re-inserted into OpenFlow pipeline
+processing in table \fInumber\fR, with the \fBct_state\fR and other ct match
+fields set. It is strongly recommended to use a table later than the current
+table to prevent loops.
+.IP \fBzone=\fIvalue\fR
+.IQ \fBzone=\fIsrc\fB[\fIstart\fB..\fIend\fB]\fR
+A 16-bit context id that can be used to isolate connections into separate
+domains, allowing overlapping network addresses in different zones. If a zone
+is not provided, then the default is to use zone zero. The \fBzone\fR may be
+specified either as an immediate 16-bit \fIvalue\fR, or may be provided from an
+NXM field \fIsrc\fR. The \fIstart\fR and \fIend\fR pair are inclusive, and must
+specify a 16-bit range within the field.
+.RE
+.IP
+Currently, connection tracking is only available on Linux kernels with the
+conntrack module loaded.
+.
 .IP \fBdec_ttl\fR
 .IQ \fBdec_ttl\fB[\fR(\fIid1,id2\fI)\fR]\fR
 Decrement TTL of IPv4 packet or hop limit of IPv6 packet.  If the