
[ovs-dev,v4,1/5] dpif-netdev: associate flow with a mark id

Message ID 1511768584-19167-2-git-send-email-yliu@fridaylinux.org
State Superseded
Delegated to: Ian Stokes
Series OVS-DPDK flow offload with rte_flow

Commit Message

Yuanhan Liu Nov. 27, 2017, 7:43 a.m. UTC
Most modern NICs can bind a flow to a mark, so that every packet
matching that flow carries the mark in its descriptor.

The basic idea is that when we receive packets later, we can get the
flow directly from the mark. That avoids some very costly CPU
operations, including (but not limited to) miniflow_extract, emc
lookup and dpcls lookup. Thus, performance could be greatly improved.

Thus, the major work of this patch is to associate a flow with a mark
id (a uint32_t number). In the netdev datapath the association is kept
in a CMAP, while in hardware it is done by the rte_flow MARK action.
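
A minimal sketch of the lookup this association is meant to enable, not
code from the patch. The receive path itself is not part of this patch,
and every sketch_* name below is hypothetical. A direct-indexed array
stands in for the mark_to_flow CMAP.

```c
#include <stddef.h>
#include <stdint.h>

#define INVALID_FLOW_MARK UINT32_MAX
#define SKETCH_TABLE_SIZE 64    /* Hypothetical bound, for illustration. */

struct sketch_flow {
    uint32_t mark;              /* Mark bound to this flow. */
    const char *desc;           /* Human-readable match, for the example. */
};

struct sketch_pkt {
    uint32_t mark;              /* INVALID_FLOW_MARK when the NIC set none. */
};

/* Stand-in for the mark_to_flow CMAP: a direct-indexed array. */
static struct sketch_flow *sketch_mark_table[SKETCH_TABLE_SIZE];

/* Binds 'mark' to 'flow'.  Returns 0 on success, -1 if 'mark' is out of
 * range for this toy table. */
static int
sketch_mark_bind(uint32_t mark, struct sketch_flow *flow)
{
    if (mark >= SKETCH_TABLE_SIZE) {
        return -1;
    }
    sketch_mark_table[mark] = flow;
    flow->mark = mark;
    return 0;
}

/* Returns the flow for a marked packet, or NULL so that the caller falls
 * back to the normal software lookup. */
static struct sketch_flow *
sketch_flow_from_pkt(const struct sketch_pkt *pkt)
{
    if (pkt->mark >= SKETCH_TABLE_SIZE) {
        return NULL;            /* Also covers INVALID_FLOW_MARK. */
    }
    return sketch_mark_table[pkt->mark];
}
```

A packet whose descriptor carries a bound mark resolves to its flow in
one array access; any other packet gets NULL and takes the usual
miniflow_extract/emc/dpcls path.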

One tricky thing in OVS-DPDK is that the flow tables are per-PMD. When
there is only one physical port but it has 2 queues, there could be 2
PMDs. In other words, even for a single megaflow (e.g. udp,tp_src=1000),
there could be 2 different dp_netdev flows, one for each PMD. That could
result in the same megaflow being offloaded twice in the hardware;
worse, we may get 2 different marks, and only the last one will work.

To avoid that, a megaflow_to_mark CMAP is created. An entry is added by
the first PMD that wants to offload a flow. Later PMDs will see that
the megaflow is already offloaded, so the flow will not be offloaded to
HW twice.

Meanwhile, the mark to flow mapping becomes a 1:N mapping. That is
what the mark_to_flow CMAP is for. The first PMD that wants to offload
a flow allocates a new mark and does the flow offload by reusing the
->flow_put method. When that succeeds, a "mark to flow" entry is
added. Later PMDs get the corresponding mark from the megaflow_to_mark
CMAP above, and then another "mark to flow" entry is added.
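
The interplay of the two maps can be sketched as below. This is an
illustration under simplifying assumptions, not the patch's code: a
small array replaces both CMAPs, a 64-bit value stands in for the
128-bit mega_ufid, a counter replaces the per-flow "mark to flow"
entries, and marks are never reused. All sketch_* names are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_FLOW_MARK UINT32_MAX
#define SKETCH_MAX_MEGAFLOWS 8  /* Hypothetical bound, for illustration. */

struct sketch_megaflow {
    bool in_use;
    uint64_t mega_ufid;         /* Stand-in for the 128-bit mega_ufid. */
    uint32_t mark;
    unsigned int n_flows;       /* Per-PMD flows sharing 'mark' (the N). */
};

static struct sketch_megaflow sketch_table[SKETCH_MAX_MEGAFLOWS];
static uint32_t sketch_next_mark;       /* Toy allocator; ids never reused. */
static unsigned int sketch_hw_rules;    /* Rules currently "in hardware". */

static struct sketch_megaflow *
sketch_find(uint64_t mega_ufid)
{
    for (size_t i = 0; i < SKETCH_MAX_MEGAFLOWS; i++) {
        if (sketch_table[i].in_use
            && sketch_table[i].mega_ufid == mega_ufid) {
            return &sketch_table[i];
        }
    }
    return NULL;
}

/* Called once per PMD flow.  Only the first caller for a given megaflow
 * installs a hardware rule; later callers reuse its mark. */
static uint32_t
sketch_offload(uint64_t mega_ufid)
{
    struct sketch_megaflow *mf = sketch_find(mega_ufid);

    if (mf) {
        mf->n_flows++;
        return mf->mark;
    }
    for (size_t i = 0; i < SKETCH_MAX_MEGAFLOWS; i++) {
        mf = &sketch_table[i];
        if (!mf->in_use) {
            mf->in_use = true;
            mf->mega_ufid = mega_ufid;
            mf->mark = sketch_next_mark++;
            mf->n_flows = 1;
            sketch_hw_rules++;
            return mf->mark;
        }
    }
    return INVALID_FLOW_MARK;
}

/* Drops one per-PMD reference.  Returns true when it was the last one
 * and the hardware rule was removed. */
static bool
sketch_unoffload(uint64_t mega_ufid)
{
    struct sketch_megaflow *mf = sketch_find(mega_ufid);

    if (!mf || --mf->n_flows > 0) {
        return false;
    }
    mf->in_use = false;
    sketch_hw_rules--;
    return true;
}
```

Two PMDs offloading the same megaflow get the same mark and produce a
single hardware rule, which is removed only when the second of them
drops its reference.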

Another thing that might be worth mentioning is that the megaflow is
created by masking all the bytes from match->flow with match->wc. It
works well so far, but I have a feeling that it is not the best way.
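
The byte-wise masking can be illustrated in isolation. The sketch below
is hypothetical: it hashes generic byte arrays rather than struct flow,
and it uses 64-bit FNV-1a as a stand-in for dpif_flow_hash(), which
produces a 128-bit ufid in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hashes 'key' after masking it byte by byte with 'mask', so that keys
 * differing only in wildcarded bytes yield the same identifier. */
static uint64_t
sketch_masked_hash(const uint8_t *key, const uint8_t *mask, size_t len)
{
    uint64_t hash = 0xcbf29ce484222325ULL;   /* FNV-1a offset basis. */

    for (size_t i = 0; i < len; i++) {
        hash ^= (uint8_t) (key[i] & mask[i]);
        hash *= 0x100000001b3ULL;            /* FNV-1a prime. */
    }
    return hash;
}
```

This is the property the deduplication depends on: every PMD must
derive the same identifier from the same megaflow, whatever values the
wildcarded fields held in the packet that triggered the upcall.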

Co-authored-by: Finn Christensen <fc@napatech.com>
Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org>
Signed-off-by: Finn Christensen <fc@napatech.com>
---
 lib/dpif-netdev.c | 272 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/netdev.h      |   6 ++
 2 files changed, 278 insertions(+)

Comments

Finn Christensen Dec. 12, 2017, 2:47 p.m. UTC | #1
Hi Yuanhan,

Nice work. I have started testing it on our NICs.
A few comments below.

Regards,
Finn Christensen

    -----Original Message-----
    From: Yuanhan Liu [mailto:yliu@fridaylinux.org]
    Sent: 27 November 2017 08:43
    To: dev@openvswitch.org
    Cc: Finn Christensen <fc@napatech.com>; Darrell Ball
    <dball@vmware.com>; Chandran Sugesh <sugesh.chandran@intel.com>;
    Simon Horman <simon.horman@netronome.com>; Yuanhan Liu
    <yliu@fridaylinux.org>
    Subject: [PATCH v4 1/5] dpif-netdev: associate flow with a mark id
    
    +struct flow_mark {
    +    struct cmap megaflow_to_mark;
    +    struct cmap mark_to_flow;
    +    struct id_pool *pool;
    +    struct ovs_mutex mutex;

[Finn] Is this mutex needed? - the structure seems to be used by the offload thread only.

    +};
    +    info.flow_mark = mark;
    +    ret = netdev_flow_put(port->netdev, match,
    +                          CONST_CAST(struct nlattr *, actions),
    +                          actions_len, &flow->mega_ufid, &info, NULL);
    +    if (ret) {
    +        VLOG_ERR("failed to %s netdev flow with mark %u\n", op, mark);
    +        flow_mark_free(mark);

[Finn] If "modification" is true, then an existing flow has been deleted in netdev-dpdk, while calling netdev_flow_put, and the new flow could not be offloaded. But the old, deleted flow still has associations here. I think it should be disassociated here also.

    +        goto out;
    +    }
Yuanhan Liu Dec. 20, 2017, 2:34 p.m. UTC | #2
On Tue, Dec 12, 2017 at 02:47:37PM +0000, Finn Christensen wrote:
>     +struct flow_mark {
>     +    struct cmap megaflow_to_mark;
>     +    struct cmap mark_to_flow;
>     +    struct id_pool *pool;
>     +    struct ovs_mutex mutex;
> 
> [Finn] Is this mutex needed? - the structure seems to be used by the offload thread only.

Nice catch. Yes, it's not needed, and it has been fixed in v5 (which
will be sent out soon).

>     +    info.flow_mark = mark;
>     +    ret = netdev_flow_put(port->netdev, match,
>     +                          CONST_CAST(struct nlattr *, actions),
>     +                          actions_len, &flow->mega_ufid, &info, NULL);
>     +    if (ret) {
>     +        VLOG_ERR("failed to %s netdev flow with mark %u\n", op, mark);
>     +        flow_mark_free(mark);
> 
> [Finn] If "modification" is true, then an existing flow has been deleted in netdev-dpdk, while calling netdev_flow_put, and the new flow could not be offloaded. But the old, deleted flow still has associations here. I think it should be disassociated here also.
> 

I think you are right. I was thinking more about the "add" case, where
we should reclaim the mark id if the put fails. For "modification", I
will just do the disassociation and let it handle reclaiming the mark
id (if there are no more references to that mark id).

Thanks for your view, BTW!

	--yliu
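
The error handling agreed on in the exchange above can be sketched as
follows. This is a hypothetical model of the intended behavior, not the
v5 code: the struct and function names are invented for illustration,
and the mark pool is reduced to a single boolean.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_FLOW_MARK UINT32_MAX

struct sketch_state {
    uint32_t flow_mark;         /* Mark currently associated with the flow. */
    unsigned int mark_refs;     /* Flows still associated with that mark. */
    bool mark_allocated;        /* Whether the mark id is held in the pool. */
};

/* Cleans up after a failed hardware put. */
static void
sketch_handle_put_failure(struct sketch_state *s, bool modification)
{
    if (!modification) {
        /* "add": nothing is associated yet, so just return the fresh id. */
        s->mark_allocated = false;
        return;
    }

    /* "modify": the old hardware rule is gone, so drop this flow's
     * association and reclaim the id only if no flow references it. */
    s->flow_mark = INVALID_FLOW_MARK;
    if (s->mark_refs > 0 && --s->mark_refs == 0) {
        s->mark_allocated = false;
    }
}
```

In the "modify" case the mark survives as long as another PMD's flow
still references it, which matches the 1:N mapping described in the
commit message.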

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 0a62630..8579474 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -77,6 +77,7 @@ 
 #include "tnl-ports.h"
 #include "unixctl.h"
 #include "util.h"
+#include "uuid.h"
 
 VLOG_DEFINE_THIS_MODULE(dpif_netdev);
 
@@ -442,7 +443,9 @@  struct dp_netdev_flow {
     /* Hash table index by unmasked flow. */
     const struct cmap_node node; /* In owning dp_netdev_pmd_thread's */
                                  /* 'flow_table'. */
+    const struct cmap_node mark_node; /* In owning flow_mark's mark_to_flow */
     const ovs_u128 ufid;         /* Unique flow identifier. */
+    const ovs_u128 mega_ufid;
     const unsigned pmd_id;       /* The 'core_id' of pmd thread owning this */
                                  /* flow. */
 
@@ -453,6 +456,7 @@  struct dp_netdev_flow {
     struct ovs_refcount ref_cnt;
 
     bool dead;
+    uint32_t mark;               /* Unique flow mark assigned to a flow */
 
     /* Statistics. */
     struct dp_netdev_flow_stats stats;
@@ -1854,6 +1858,175 @@  dp_netdev_pmd_find_dpcls(struct dp_netdev_pmd_thread *pmd,
     return cls;
 }
 
+#define MAX_FLOW_MARK       (UINT32_MAX - 1)
+#define INVALID_FLOW_MARK   (UINT32_MAX)
+
+struct megaflow_to_mark_data {
+    const struct cmap_node node;
+    ovs_u128 mega_ufid;
+    uint32_t mark;
+};
+
+struct flow_mark {
+    struct cmap megaflow_to_mark;
+    struct cmap mark_to_flow;
+    struct id_pool *pool;
+    struct ovs_mutex mutex;
+};
+
+struct flow_mark flow_mark = {
+    .megaflow_to_mark = CMAP_INITIALIZER,
+    .mark_to_flow = CMAP_INITIALIZER,
+    .mutex = OVS_MUTEX_INITIALIZER,
+};
+
+static uint32_t
+flow_mark_alloc(void)
+{
+    uint32_t mark;
+
+    if (!flow_mark.pool) {
+        /* Not initialized yet; do it here. */
+        flow_mark.pool = id_pool_create(0, MAX_FLOW_MARK);
+    }
+
+    if (id_pool_alloc_id(flow_mark.pool, &mark)) {
+        return mark;
+    }
+
+    return INVALID_FLOW_MARK;
+}
+
+static void
+flow_mark_free(uint32_t mark)
+{
+    id_pool_free_id(flow_mark.pool, mark);
+}
+
+/* associate flow with a mark, which is a 1:1 mapping */
+static void
+megaflow_to_mark_associate(const ovs_u128 *mega_ufid, uint32_t mark)
+{
+    size_t hash = dp_netdev_flow_hash(mega_ufid);
+    struct megaflow_to_mark_data *data = xzalloc(sizeof(*data));
+
+    data->mega_ufid = *mega_ufid;
+    data->mark = mark;
+
+    cmap_insert(&flow_mark.megaflow_to_mark,
+                CONST_CAST(struct cmap_node *, &data->node), hash);
+}
+
+/* disassociate flow with a mark */
+static void
+megaflow_to_mark_disassociate(const ovs_u128 *mega_ufid)
+{
+    size_t hash = dp_netdev_flow_hash(mega_ufid);
+    struct megaflow_to_mark_data *data;
+
+    CMAP_FOR_EACH_WITH_HASH (data, node, hash, &flow_mark.megaflow_to_mark) {
+        if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
+            cmap_remove(&flow_mark.megaflow_to_mark,
+                        CONST_CAST(struct cmap_node *, &data->node), hash);
+            free(data);
+            return;
+        }
+    }
+
+    VLOG_WARN("masked ufid "UUID_FMT" is not associated with a mark?\n",
+              UUID_ARGS((struct uuid *)mega_ufid));
+}
+
+static inline uint32_t
+megaflow_to_mark_find(const ovs_u128 *mega_ufid)
+{
+    size_t hash = dp_netdev_flow_hash(mega_ufid);
+    struct megaflow_to_mark_data *data;
+
+    CMAP_FOR_EACH_WITH_HASH (data, node, hash, &flow_mark.megaflow_to_mark) {
+        if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
+            return data->mark;
+        }
+    }
+
+    return INVALID_FLOW_MARK;
+}
+
+/* associate mark with a flow, which is 1:N mapping */
+static void
+mark_to_flow_associate(const uint32_t mark, struct dp_netdev_flow *flow)
+{
+    dp_netdev_flow_ref(flow);
+
+    cmap_insert(&flow_mark.mark_to_flow,
+                CONST_CAST(struct cmap_node *, &flow->mark_node),
+                mark);
+    flow->mark = mark;
+
+    VLOG_INFO("associated dp_netdev flow %p with mark %u\n", flow, mark);
+}
+
+static bool
+is_last_flow_mark_reference(uint32_t mark)
+{
+    struct dp_netdev_flow *flow;
+
+    CMAP_FOR_EACH_WITH_HASH (flow, mark_node, mark,
+                             &flow_mark.mark_to_flow) {
+        return false;
+    }
+
+    return true;
+}
+
+static void
+mark_to_flow_disassociate(struct dp_netdev_pmd_thread *pmd,
+                          struct dp_netdev_flow *flow)
+{
+    uint32_t mark = flow->mark;
+    struct cmap_node *mark_node = CONST_CAST(struct cmap_node *,
+                                             &flow->mark_node);
+    VLOG_INFO(" ");
+    VLOG_INFO(":: about to REMOVE offload:\n");
+    VLOG_INFO("   ufid: "UUID_FMT"\n",
+              UUID_ARGS((struct uuid *)&flow->ufid));
+    VLOG_INFO("   mask: "UUID_FMT"\n",
+              UUID_ARGS((struct uuid *)&flow->mega_ufid));
+
+    cmap_remove(&flow_mark.mark_to_flow, mark_node, mark);
+    flow->mark = INVALID_FLOW_MARK;
+
+    if (is_last_flow_mark_reference(mark)) {
+        struct dp_netdev_port *port;
+        odp_port_t in_port = flow->flow.in_port.odp_port;
+
+        port = dp_netdev_lookup_port(pmd->dp, in_port);
+        if (port) {
+            netdev_flow_del(port->netdev, &flow->mega_ufid, NULL);
+        }
+
+        ovs_mutex_lock(&flow_mark.mutex);
+        flow_mark_free(mark);
+        ovs_mutex_unlock(&flow_mark.mutex);
+        VLOG_INFO("freed flow mark %u\n", mark);
+
+        megaflow_to_mark_disassociate(&flow->mega_ufid);
+    }
+    dp_netdev_flow_unref(flow);
+}
+
+static void
+flow_mark_flush(struct dp_netdev_pmd_thread *pmd)
+{
+    struct dp_netdev_flow *flow;
+
+    CMAP_FOR_EACH (flow, mark_node, &flow_mark.mark_to_flow) {
+        if (flow->pmd_id == pmd->core_id) {
+            mark_to_flow_disassociate(pmd, flow);
+        }
+    }
+}
+
 static void
 dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd,
                           struct dp_netdev_flow *flow)
@@ -1867,6 +2040,9 @@  dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd,
     ovs_assert(cls != NULL);
     dpcls_remove(cls, &flow->cr);
     cmap_remove(&pmd->flow_table, node, dp_netdev_flow_hash(&flow->ufid));
+    if (flow->mark != INVALID_FLOW_MARK) {
+        mark_to_flow_disassociate(pmd, flow);
+    }
     flow->dead = true;
 
     dp_netdev_flow_unref(flow);
@@ -2446,6 +2622,91 @@  out:
     return error;
 }
 
+static void
+try_netdev_flow_put(struct dp_netdev_pmd_thread *pmd, odp_port_t in_port,
+                    struct dp_netdev_flow *flow, struct match *match,
+                    const ovs_u128 *ufid, const struct nlattr *actions,
+                    size_t actions_len)
+{
+    struct offload_info info;
+    struct dp_netdev_port *port;
+    bool modification = flow->mark != INVALID_FLOW_MARK;
+    const char *op = modification ? "modify" : "add";
+    uint32_t mark;
+    int ret;
+
+    port = dp_netdev_lookup_port(pmd->dp, in_port);
+    if (!port) {
+        return;
+    }
+
+    ovs_mutex_lock(&flow_mark.mutex);
+
+    VLOG_INFO(" ");
+    VLOG_INFO(":: about to offload:\n");
+    VLOG_INFO("   ufid: "UUID_FMT"\n",
+              UUID_ARGS((struct uuid *)ufid));
+    VLOG_INFO("   mask: "UUID_FMT"\n",
+              UUID_ARGS((struct uuid *)&flow->mega_ufid));
+
+    if (modification) {
+        mark = flow->mark;
+    } else {
+        if (!netdev_is_flow_api_enabled()) {
+            goto out;
+        }
+
+        /*
+         * If a mega flow has already been offloaded (from other PMD
+         * instances), do not offload it again.
+         */
+        mark = megaflow_to_mark_find(&flow->mega_ufid);
+        if (mark != INVALID_FLOW_MARK) {
+            VLOG_INFO("## got a previously installed mark %u\n", mark);
+            mark_to_flow_associate(mark, flow);
+            goto out;
+        }
+
+        mark = flow_mark_alloc();
+        if (mark == INVALID_FLOW_MARK) {
+            VLOG_ERR("failed to allocate flow mark!\n");
+            goto out;
+        }
+    }
+
+    info.flow_mark = mark;
+    ret = netdev_flow_put(port->netdev, match,
+                          CONST_CAST(struct nlattr *, actions),
+                          actions_len, &flow->mega_ufid, &info, NULL);
+    if (ret) {
+        VLOG_ERR("failed to %s netdev flow with mark %u\n", op, mark);
+        flow_mark_free(mark);
+        goto out;
+    }
+
+    if (!modification) {
+        megaflow_to_mark_associate(&flow->mega_ufid, mark);
+        mark_to_flow_associate(mark, flow);
+    }
+    VLOG_INFO("succeed to %s netdev flow with mark %u\n", op, mark);
+
+out:
+    ovs_mutex_unlock(&flow_mark.mutex);
+}
+
+static void
+dp_netdev_get_mega_ufid(const struct match *match, ovs_u128 *mega_ufid)
+{
+    struct flow masked_flow;
+    size_t i;
+
+    for (i = 0; i < sizeof(struct flow); i++) {
+        ((uint8_t *)&masked_flow)[i] = ((uint8_t *)&match->flow)[i] &
+                                       ((uint8_t *)&match->wc)[i];
+    }
+    dpif_flow_hash(NULL, &masked_flow, sizeof(struct flow), mega_ufid);
+}
+
 static struct dp_netdev_flow *
 dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
                    struct match *match, const ovs_u128 *ufid,
@@ -2481,12 +2742,15 @@  dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
     memset(&flow->stats, 0, sizeof flow->stats);
     flow->dead = false;
     flow->batch = NULL;
+    flow->mark = INVALID_FLOW_MARK;
     *CONST_CAST(unsigned *, &flow->pmd_id) = pmd->core_id;
     *CONST_CAST(struct flow *, &flow->flow) = match->flow;
     *CONST_CAST(ovs_u128 *, &flow->ufid) = *ufid;
     ovs_refcount_init(&flow->ref_cnt);
     ovsrcu_set(&flow->actions, dp_netdev_actions_create(actions, actions_len));
 
+    dp_netdev_get_mega_ufid(match, CONST_CAST(ovs_u128 *, &flow->mega_ufid));
+
     netdev_flow_key_init_masked(&flow->cr.flow, &match->flow, &mask);
 
     /* Select dpcls for in_port. Relies on in_port to be exact match. */
@@ -2496,6 +2760,9 @@  dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
     cmap_insert(&pmd->flow_table, CONST_CAST(struct cmap_node *, &flow->node),
                 dp_netdev_flow_hash(&flow->ufid));
 
+    try_netdev_flow_put(pmd, in_port, flow, match, ufid,
+                        actions, actions_len);
+
     if (OVS_UNLIKELY(!VLOG_DROP_DBG((&upcall_rl)))) {
         struct ds ds = DS_EMPTY_INITIALIZER;
         struct ofpbuf key_buf, mask_buf;
@@ -2576,6 +2843,7 @@  flow_put_on_pmd(struct dp_netdev_pmd_thread *pmd,
         if (put->flags & DPIF_FP_MODIFY) {
             struct dp_netdev_actions *new_actions;
             struct dp_netdev_actions *old_actions;
+            odp_port_t in_port = netdev_flow->flow.in_port.odp_port;
 
             new_actions = dp_netdev_actions_create(put->actions,
                                                    put->actions_len);
@@ -2583,6 +2851,9 @@  flow_put_on_pmd(struct dp_netdev_pmd_thread *pmd,
             old_actions = dp_netdev_flow_get_actions(netdev_flow);
             ovsrcu_set(&netdev_flow->actions, new_actions);
 
+            try_netdev_flow_put(pmd, in_port, netdev_flow, match, ufid,
+                                put->actions, put->actions_len);
+
             if (stats) {
                 get_dpif_flow_stats(netdev_flow, stats);
             }
@@ -3576,6 +3847,7 @@  reload_affected_pmds(struct dp_netdev *dp)
 
     CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
         if (pmd->need_reload) {
+            flow_mark_flush(pmd);
             dp_netdev_reload_pmd__(pmd);
             pmd->need_reload = false;
         }
diff --git a/lib/netdev.h b/lib/netdev.h
index 3a545fe..0c1946a 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -188,6 +188,12 @@  void netdev_send_wait(struct netdev *, int qid);
 struct offload_info {
     const struct dpif_class *dpif_class;
     ovs_be16 tp_dst_port; /* Destination port for tunnel in SET action */
+
+    /*
+     * The flow mark id assigned to the flow. If any packets hit the flow,
+     * it will be in the packet metadata.
+     */
+    uint32_t flow_mark;
 };
 struct dpif_class;
 struct netdev_flow_dump;