diff mbox

[ovs-dev,RFC,v2] Conntrack: Avoid recirculation for established connections.

Message ID 1495728155-8614-1-git-send-email-antonio.fischetti@intel.com
State Changes Requested
Headers show

Commit Message

Fischetti, Antonio May 25, 2017, 4:02 p.m. UTC
From: Antonio Fischetti <antonio.fischetti@intel.com>

With conntrack enabled, packets get recirculated and this impacts
the performance with thousands of active concurrent connections.

This patch is aimed at avoiding recirculation for packets belonging to
established connections in steady state. This is achieved by
manipulating the megaflows and actions. In this case the actions of the
megaflow of recirculated packet is initially updated with new 'CT_SKIP'
action and later updated with the actions of the megaflow of the
original packet(to handle zones). There after the EMC entry
belonging to the original packet is updated to point to the 'megaflow' of
the recirculated packet there by avoiding recirculation.

Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Co-authored-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
---
- ~10% performance improvement is observed in our testing with UDP streams.
- Limited testing is done with RFC patch and the tests include hundreds of
  concurrent active TCP connections.
- This is very early implementation and we are posting here to get initial
  feedback.
- This RFC patch is implemented leveraging EMC, but optimization could be
  extended to dpcls as well to handle higher flows.
- No VXLAN testing is done with this patch.

 datapath/linux/compat/include/linux/openvswitch.h |   1 +
 lib/dpif-netdev.c                                 | 148 ++++++++++++++++++++--
 lib/packets.h                                     |   2 +
 3 files changed, 143 insertions(+), 8 deletions(-)

Comments

Fischetti, Antonio May 29, 2017, 3:24 p.m. UTC | #1
Thanks Joe for your feedback and the interesting insights in conntrack in your earlier communication.
We have added all the details that we considered for this first implementation. Also, some answers are inline. 

The purpose of this implementation is to avoid recirculation just for those packets that are part of established connections.

This shouldn't affect the packet recirculation for actions other than conntrack. For example in MPLS, after a pop_mpls action the packet will still be recirculated to follow the usual datapath.

Most importantly, the CT module isn't by-passed in this implementation. 

Besides the recirculation, all other action[s] shall be executed as-is on each packet. 
Any new CT change or action set by the controller will be managed as usual.

For our testing we set up a simple firewall, below are the flows.


 Flow Setup 
 ----------
table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)
table=0, priority=10,arp actions=NORMAL
table=0, priority=1 actions=drop
table=1, ct_state=+new+trk,ip,in_port=1 actions=ct(commit),output:2
table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2
table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1


 Basic idea
 ----------
With the above rules, all packets belonging to an established connection will first hit the flow 
"ct_state=-trk,ip actions=ct(table=1)"

then on recirculation, they will hit
"ct_state=+est+trk,ip,in_port=.. actions=output:X".

The basic idea is to do the following 2 steps.
1. Combine the two sets of actions by removing the recirculation.
   a) Original actions:
    - "ct(zone=N), recirc(id)" [ i.e ct(table=1) ]
    - "output:X"
   b) Combined Actions after Removing recirculation:   
    - "ct(zone=N), output:X".

2. All the subsequent packets shall hit a flow with the combined actions.


 Considerations
 --------------
On EMC entries: 
All packets hitting a connection in the same direction will have the same 5-tuple. 
They shall all hit the same EMC entry - here referred as 'EMC_ORIG' - and point to the flow 
"ct_state=-trk,ip actions=ct(table=1)", referred from here as 'FLOW_ORIG'.

On recirculation the packets shall hit another EMC entry - referred as 'EMC_RECIRC' - and point to the flow  
"ct_state=+est+trk,ip,in_port=.. actions=output:X", 
referred from here as 'FLOW_RECIRC'.

FLOW_ORIG is the flow hit by the ingress packet while FLOW_RECIRC is the flow hit
by the recirculated packet.


 Implementation Details
 ----------------------
We thought about different implementations, but for the purpose of this RFC we stick to the following approach.

On packet recirculation, a new flow has to be created with a list of appropriate actions. At this 
stage we prepend a new action called 'ct(Skip)' to the original list of actions. This 
new action is used as a temporary place-holder that will be over written later on.

   Original Actions: "Output:2"    ==>    New Actions: "ct(Skip), Output:2"

For all the packets of an established conn. (i.e marked with +trk,+est bits), do the 3 steps:
1. Retrieve Flows from EMC. 
    - FLOW_ORIG whose actions include:   ct(zone=N), recirc(id) 
    - FLOW_RECIRC whose actions include: ct(Skip), Output:X

2. Update the actions of FLOW_RECIRC by overwriting ct(skip) with the ct action from FLOW_ORIG, ie ct(zone=N). 
    ct(Skip), Output:X  =>  ct(zone=N), Output:X

3. Update the flow pointer in EMC_ORIG entry to 'FLOW_RECIRC' flow.

In this way, we avoid recirculating the packets belonging to established connections and 
this no ways skips sending the packets to the CT module. 


Antonio

> -----Original Message-----
> From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-
> bounces@openvswitch.org] On Behalf Of antonio.fischetti@intel.com
> Sent: Thursday, May 25, 2017 5:03 PM
> To: dev@openvswitch.org
> Subject: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for
> established connections.
> 
> From: Antonio Fischetti <antonio.fischetti@intel.com>
> 
> With conntrack enabled, packets get recirculated and this impacts
> the performance with thousands of active concurrent connections.
> 
> This patch is aimed at avoiding recirculation for packets belonging to
> established connections in steady state. This is achieved by
> manipulating the megaflows and actions. In this case the actions of the
> megaflow of recirculated packet is initially updated with new 'CT_SKIP'
> action and later updated with the actions of the megaflow of the
> original packet(to handle zones). There after the EMC entry
> belonging to the original packet is updated to point to the 'megaflow' of
> the recirculated packet there by avoiding recirculation.
> 
> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
> Co-authored-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
> ---
> - ~10% performance improvement is observed in our testing with UDP
> streams.
> - Limited testing is done with RFC patch and the tests include hundreds of
>   concurrent active TCP connections.
> - This is very early implementation and we are posting here to get initial
>   feedback.
> - This RFC patch is implemented leveraging EMC, but optimization could be
>   extended to dpcls as well to handle higher flows.
> - No VXLAN testing is done with this patch.
> 
>  datapath/linux/compat/include/linux/openvswitch.h |   1 +
>  lib/dpif-netdev.c                                 | 148
> ++++++++++++++++++++--
>  lib/packets.h                                     |   2 +
>  3 files changed, 143 insertions(+), 8 deletions(-)
> 
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h
> b/datapath/linux/compat/include/linux/openvswitch.h
> index d22102e..6dc5674 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -762,6 +762,7 @@ enum ovs_ct_attr {
>  	OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */
>  	OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */
>  	OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */
> +	OVS_CT_ATTR_SKIP,       /* No argument, does not invoke CT. */
>  	__OVS_CT_ATTR_MAX
>  };
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index b50164b..fbbb42e 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -2101,7 +2101,8 @@ emc_probabilistic_insert(struct dp_netdev_pmd_thread
> *pmd,
>  }
> 
>  static inline struct dp_netdev_flow *
> -emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
> +emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key,
> +        void **pfound_entry)
>  {
>      struct emc_entry *current_entry;
> 
> @@ -2110,11 +2111,13 @@ emc_lookup(struct emc_cache *cache, const struct
> netdev_flow_key *key)
>              && emc_entry_alive(current_entry)
>              && netdev_flow_key_equal_mf(&current_entry->key, &key->mf)) {
> 
> +            *pfound_entry = current_entry;
>              /* We found the entry with the 'key->mf' miniflow */
>              return current_entry->flow;
>          }
>      }
> 
> +    *pfound_entry = NULL;
>      return NULL;
>  }
> 
> @@ -4583,10 +4586,12 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>          key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
> 
>          /* If EMC is disabled skip emc_lookup */
> -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);
> +        flow = (cur_min == 0) ? NULL:
> +                emc_lookup(flow_cache, key, &packet->md.prev_emc_entry);
>          if (OVS_LIKELY(flow)) {
>              dp_netdev_queue_batches(packet, flow, &key->mf, batches,
>                                      n_batches);
> +            packet->md.prev_flow = flow;
>          } else {
>              /* Exact match cache missed. Group missed packets together at
>               * the beginning of the 'packets' array. */
> @@ -4595,6 +4600,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>               * must be returned to the caller. The next key should be
> extracted
>               * to 'keys[n_missed + 1]'. */
>              key = &keys[++n_missed];
> +            packet->md.prev_flow = NULL;
>          }
>      }
> 
> @@ -4604,6 +4610,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>      return dp_packet_batch_size(packets_);
>  }
> 
> +#define CT_PREPEND_ACTION_LEN 0x0C
> +#define CT_PREPEND_ACTION_SPARE_LEN 32
> +#define CT_TOT_ACTIONS_LEN (CT_PREPEND_ACTION_LEN +
> CT_PREPEND_ACTION_SPARE_LEN)
>  static inline void
>  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>                       struct dp_packet *packet,
> @@ -4663,12 +4672,44 @@ handle_packet_upcall(struct dp_netdev_pmd_thread
> *pmd,
>          ovs_mutex_lock(&pmd->flow_mutex);
>          netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key, NULL);
>          if (OVS_LIKELY(!netdev_flow)) {
> -            netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,
> -                                             add_actions->data,
> -                                             add_actions->size);
> +            /* Recirculated packets that belong to established
> connections
> +             * are treated differently.  A place-holder is prepended to
> the
> +             * list of actions. */
> +            if (!(packet->md.ct_state & CS_INVALID) &&
> +                 (packet->md.ct_state & CS_TRACKED) &&
> +                 (packet->md.ct_state & CS_ESTABLISHED) &&
> +                 (packet->md.recirc_id)) {
> +


[Joe] "Just because the value of ct_state in the packet metadata of this
particular packet happens to have those particular state flags doesn't
mean that the OpenFlow pipeline is enforcing this constraint for all
packets hitting this flow. You would have to ensure that the mask in
the flow is also ensuring that all future packets hitting the new flow
will also have that particular ct_state."

This part of the code should prepend a ct(Skip) action to the
FLOW_RECIRC only and just for packets that belong to an established conn.
The FLOW_ORIG shouldn't be altered, because it will be hit by all ingress
packets.

> +                /* Prepend an OVS_CT_ATTR_SKIP to the list of actions
> inside
> +                 * add_actions.  It does not really invoke the ConnTrack
> module,
> +                 * it's a Ct action that works as a place-holder.  It
> will be
> +                 * overwritten inside emc_patch_established_conn() with
> +                 * the proper ct(zone=X) action. */
> +                struct dp_netdev_actions *new_actions;
> +                uint8_t tmp_action[CT_TOT_ACTIONS_LEN] = {
> +                        0x0C, 0x00, 0x0C, 0x00,
> +                        0x06, 0x00, 0x09, 0x00,
> +                        0x00, 0x00, 0x00, 0x00,
> +                        /* will append here add_actions. */
> +                };
> +                memcpy(&tmp_action[CT_PREPEND_ACTION_LEN],
> +                        add_actions->data,
> +                        add_actions->size);
> +                new_actions = dp_netdev_actions_create(
> +                        (const struct nlattr *)tmp_action,
> +                        CT_PREPEND_ACTION_LEN + add_actions->size);

[Joe] "Hmm. In some ways I think this would be cleaner to be implemented in
the xlate code, around compose_conntrack_action().. but that code is
currently pretty much dataplane-oblivious and this suggestion is a
userspace-specific optimization."


> +                netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,
> +                                                 new_actions->actions,
> +                                                 new_actions->size);
> +            } else {
> +                netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,
> +                                                 add_actions->data,
> +                                                 add_actions->size);
> +            }
>          }
>          ovs_mutex_unlock(&pmd->flow_mutex);
> -        emc_probabilistic_insert(pmd, key, netdev_flow);
> +        /* For this RFC, probabilistic emc insertion is disabled. */
> +        emc_insert(&pmd->flow_cache, key, netdev_flow);
>      }
>  }
> 
> @@ -4761,7 +4802,8 @@ fast_path_processing(struct dp_netdev_pmd_thread
> *pmd,
> 
>          flow = dp_netdev_flow_cast(rules[i]);
> 
> -        emc_probabilistic_insert(pmd, &keys[i], flow);
> +        /* For testing purposes the emc_insert is restored here. */
> +        emc_insert(&pmd->flow_cache, &keys[i], flow);
>          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> n_batches);
>      }
> 
> @@ -4998,6 +5040,78 @@ dp_execute_userspace_action(struct
> dp_netdev_pmd_thread *pmd,
>      }
>  }
> 
> +/* Search into EMC to retrieve the entry related to the original packet
> + * and the entry related to the recirculated packet.
> + * If both EMC entries are alive, then:
> + *  - The flow actions of the recirculated packet are updated with
> + *    'ct(zone=N)' as retrieved from the actions of the original flow.
> + *  - The EMC orig entry flow is updated with the flow pointer of recirc
> entry.
> + */
> +static inline void
> +emc_patch_established_conn(struct dp_netdev_pmd_thread *pmd,
> +        struct dp_packet *packet, long long now OVS_UNUSED)
> +{
> +    struct emc_cache *cache = &pmd->flow_cache;
> +    struct netdev_flow_key key;
> +    struct emc_entry *orig_entry; /* EMC entry hit by the original
> packet. */
> +    struct emc_entry *recirc_entry; /* EMC entry for recirculated packet.
> */
> +    uint32_t old_hash;
> +
> +    if (!packet->md.prev_emc_entry) {
> +        return;
> +    }
> +
> +    orig_entry = packet->md.prev_emc_entry;
> +    if (!emc_entry_alive(orig_entry)) {
> +        return;
> +    }
> +
> +    /* Check that the original EMC entry was not overwritten. */
> +    struct dp_netdev_flow *orig_flow = orig_entry->flow;
> +    if (packet->md.prev_flow && (packet->md.prev_flow != orig_flow)) {
> +       return;
> +    }
> +
> +    /* TODO as we are calling miniflow_extract now, can we avoid to
> invoke
> +     * it again when we'll classify this recirculated packet? */
> +    miniflow_extract(packet, &key.mf);
> +    key.len = 0; /* Not computed yet. */
> +    old_hash = dp_packet_get_rss_hash(packet);
> +    key.hash = dpif_netdev_packet_get_rss_hash(packet, NULL);
> +    dp_packet_set_rss_hash(packet, old_hash);
> +
> +    EMC_FOR_EACH_POS_WITH_HASH(cache, recirc_entry, key.hash) {
> +        if (recirc_entry->key.hash == key.hash
> +            && emc_entry_alive(recirc_entry)
> +            && netdev_flow_key_equal_mf(&recirc_entry->key, &key.mf)) {
> +
> +            if (orig_entry->flow == recirc_entry->flow) {
> +                /* Nothing to do, already patched by a packet of this
> +                 * same batch. */
> +                return;
> +            }
> +            /* We found the entry related to the recirculated packet.
> +             * Retrieve the actions for orig and recirc entries. * */
> +            struct dp_netdev_actions * orig_actions =
> +                    dp_netdev_flow_get_actions(orig_entry->flow);
> +            struct dp_netdev_actions * recirc_actions =
> +                    dp_netdev_flow_get_actions(recirc_entry->flow);
> +
> +            /* The orig entry action contains a CT action with the zone
> info. */
> +            struct nlattr *p = &orig_actions->actions[0];
> +
> +            /* Overwrite the CT Skip action of recirc entry with
> ct(zone=N). */
> +            memcpy(recirc_actions->actions, p, p->nla_len);
> +
> +            /* Update orig EMC entry. */
> +            orig_entry->flow = recirc_entry->flow;
> +            dp_netdev_flow_ref(orig_entry->flow);
> +
> +            return;
> +        }
> +    }
> +}
> +
>  static void
>  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>                const struct nlattr *a, bool may_steal)
> @@ -5024,7 +5138,6 @@ dp_execute_cb(void *aux_, struct dp_packet_batch
> *packets_,
>              } else {
>                  tx_qid = pmd->static_tx_qid;
>              }
> -
>              netdev_send(p->port->netdev, tx_qid, packets_, may_steal,
>                          dynamic_txqs);
>              return;

[Joe] "Unrelated style change."

Will do that.

> @@ -5145,10 +5258,21 @@ dp_execute_cb(void *aux_, struct dp_packet_batch
> *packets_,
>              struct dp_packet *packet;
>              DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
>                  packet->md.recirc_id = nl_attr_get_u32(a);
> +
> +                if (!(packet->md.ct_state & CS_INVALID) &&
> +                     (packet->md.ct_state & CS_TRACKED) &&
> +                     (packet->md.ct_state & CS_ESTABLISHED)) {
> +                    (*depth)++;
> +                    emc_patch_established_conn(pmd, packet, now);
> +                    (*depth)--;
> +                }
>              }
> 
>              (*depth)++;
>              dp_netdev_recirculate(pmd, packets_);
> +            DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
> +                packet->md.recirc_id = 0;
> +            }
>              (*depth)--;
> 
>              return;

[Joe] "Can you give some more detail about why resetting the recirc_id is
correct here? Could/should this be applied as an independent change
regardless of the rest of this series?"

This addition is needed so that in handle_packet_upcall we add the ct(Skip)
just to the FLOW_RECIRC and not to the FLOW_ORIG.
I guess this could be also applied as an independent change?

> @@ -5166,6 +5290,7 @@ dp_execute_cb(void *aux_, struct dp_packet_batch
> *packets_,
>          const char *helper = NULL;
>          const uint32_t *setmark = NULL;
>          const struct ovs_key_ct_labels *setlabel = NULL;
> +        bool skip_ct = false;
> 
>          NL_ATTR_FOR_EACH_UNSAFE (b, left, nl_attr_get(a),
>                                   nl_attr_get_size(a)) {
> @@ -5194,6 +5319,9 @@ dp_execute_cb(void *aux_, struct dp_packet_batch
> *packets_,
>                  /* Silently ignored, as userspace datapath does not
> generate
>                   * netlink events. */
>                  break;
> +            case OVS_CT_ATTR_SKIP:
> +                skip_ct = true;
> +                break;
>              case OVS_CT_ATTR_NAT:
>              case OVS_CT_ATTR_UNSPEC:
>              case __OVS_CT_ATTR_MAX:
> @@ -5201,6 +5329,10 @@ dp_execute_cb(void *aux_, struct dp_packet_batch
> *packets_,
>              }
>          }
> 
> +        if (OVS_UNLIKELY(skip_ct)) {
> +            break;
> +        }
> +
>          conntrack_execute(&dp->conntrack, packets_, aux->flow->dl_type,
> force,
>                            commit, zone, setmark, setlabel, helper);
>          break;
> diff --git a/lib/packets.h b/lib/packets.h
> index 7dbf6dd..2c46a15 100644
> --- a/lib/packets.h
> +++ b/lib/packets.h
> @@ -112,6 +112,8 @@ struct pkt_metadata {
>      struct flow_tnl tunnel;     /* Encapsulating tunnel parameters. Note
> that
>                                   * if 'ip_dst' == 0, the rest of the
> fields may
>                                   * be uninitialized. */
> +    void *prev_emc_entry;       /* EMC entry that this packet matched. */
> +    void *prev_flow;            /* Flow pointed by the matching EMC
> entry. */
>  };
> 
>  static inline void
> --
> 2.4.11
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Darrell Ball June 1, 2017, 5:19 p.m. UTC | #2
Comments inline

On 5/29/17, 8:24 AM, "ovs-dev-bounces@openvswitch.org on behalf of Fischetti, Antonio" <ovs-dev-bounces@openvswitch.org on behalf of antonio.fischetti@intel.com> wrote:

    Thanks Joe for your feedback and the interesting insights in conntrack in your earlier communication.
    We have added all the details that we considered for this first implementation. Also, some answers are inline. 
    
    The purpose of this implementation is to avoid recirculation just for those packets that are part of established connections.
    
    This shouldn't affect the packet recirculation for actions other than conntrack. For example in MPLS, after a pop_mpls action the packet will still be recirculated to follow the usual datapath.
    
    Most importantly, the CT module isn't by-passed in this implementation. 
    
    Besides the recirculation, all other action[s] shall be executed as-is on each packet. 
    Any new CT change or action set by the controller will be managed as usual.
    
    For our testing we set up a simple firewall, below are the flows.
    
    
     Flow Setup 
     ----------
    table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)
    table=0, priority=10,arp actions=NORMAL
    table=0, priority=1 actions=drop
    table=1, ct_state=+new+trk,ip,in_port=1 actions=ct(commit),output:2
    table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
    table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2
    table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
    
    
     Basic idea
     ----------
    With the above rules, all packets belonging to an established connection will first hit the flow 
    "ct_state=-trk,ip actions=ct(table=1)"
    
    then on recirculation, they will hit
    "ct_state=+est+trk,ip,in_port=.. actions=output:X".
    
    The basic idea is to do the following 2 steps.
    1. Combine the two sets of actions by removing the recirculation.
       a) Original actions:
        - "ct(zone=N), recirc(id)" [ i.e ct(table=1) ]
        - "output:X"
       b) Combined Actions after Removing recirculation:   
        - "ct(zone=N), output:X".
    
    2. All the subsequent packets shall hit a flow with the combined actions.


[Darrell]

1) This would be constraining on how rules are written such that we can’t go back to
              the beginning of the pipeline, just because a packet is established.
              Meaning we can’t run stateless or stateful rules earlier in the pipeline.
              There are lots of possible combinations.
              One case is:
              We may run a bunch of firewall checks in table X, later on (table X+1) run the non-filtered packets through conntrack 
              and then want to go back to the firewall table.

2) This allows the dataplane to override what the ofproto layer specifies.
This looks like a layering violation.

I guess this also speaks to the datapath (userspace) specific design here.


I don’t think it matters what the performance advantage is, although it is small in this case.


    
    
     Considerations
     --------------
    On EMC entries: 
    All packets hitting a connection in the same direction will have the same 5-tuple. 
    They shall all hit the same EMC entry - here referred as 'EMC_ORIG' - and point to the flow 
    "ct_state=-trk,ip actions=ct(table=1)", referred from here as 'FLOW_ORIG'.
    
    On recirculation the packets shall hit another EMC entry - referred as 'EMC_RECIRC' - and point to the flow  
    "ct_state=+est+trk,ip,in_port=.. actions=output:X", 
    referred from here as 'FLOW_RECIRC'.
    
    FLOW_ORIG is the flow hit by the ingress packet while FLOW_RECIRC is the flow hit
    by the recirculated packet.
    
    
     Implementation Details
     ----------------------
    We thought about different implementations, but for the purpose of this RFC we stick to the following approach.
    
    On packet recirculation, a new flow has to be created with a list of appropriate actions. At this 
    stage we prepend a new action called 'ct(Skip)' to the original list of actions. This 
    new action is used as a temporary place-holder that will be over written later on.
    
       Original Actions: "Output:2"    ==>    New Actions: "ct(Skip), Output:2"
    
    For all the packets of an established conn. (i.e marked with +trk,+est bits), do the 3 steps:
    1. Retrieve Flows from EMC. 
        - FLOW_ORIG whose actions include:   ct(zone=N), recirc(id) 
        - FLOW_RECIRC whose actions include: ct(Skip), Output:X
    
    2. Update the actions of FLOW_RECIRC by overwriting ct(skip) with the ct action from FLOW_ORIG, ie ct(zone=N). 
        ct(Skip), Output:X  =>  ct(zone=N), Output:X
    
    3. Update the flow pointer in EMC_ORIG entry to 'FLOW_RECIRC' flow.
    
    In this way, we avoid recirculating the packets belonging to established connections and 
    this no ways skips sending the packets to the CT module. 
    
    
    Antonio
    
    > -----Original Message-----

    > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-

    > bounces@openvswitch.org] On Behalf Of antonio.fischetti@intel.com

    > Sent: Thursday, May 25, 2017 5:03 PM

    > To: dev@openvswitch.org

    > Subject: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for

    > established connections.

    > 

    > From: Antonio Fischetti <antonio.fischetti@intel.com>

    > 

    > With conntrack enabled, packets get recirculated and this impacts

    > the performance with thousands of active concurrent connections.

    > 

    > This patch is aimed at avoiding recirculation for packets belonging to

    > established connections in steady state. This is achieved by

    > manipulating the megaflows and actions. In this case the actions of the

    > megaflow of recirculated packet is initially updated with new 'CT_SKIP'

    > action and later updated with the actions of the megaflow of the

    > original packet(to handle zones). There after the EMC entry

    > belonging to the original packet is updated to point to the 'megaflow' of

    > the recirculated packet there by avoiding recirculation.

    > 

    > Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>

    > Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>

    > Co-authored-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>

    > ---

    > - ~10% performance improvement is observed in our testing with UDP

    > streams.

    > - Limited testing is done with RFC patch and the tests include hundreds of

    >   concurrent active TCP connections.

    > - This is very early implementation and we are posting here to get initial

    >   feedback.

    > - This RFC patch is implemented leveraging EMC, but optimization could be

    >   extended to dpcls as well to handle higher flows.

    > - No VXLAN testing is done with this patch.

    > 

    >  datapath/linux/compat/include/linux/openvswitch.h |   1 +

    >  lib/dpif-netdev.c                                 | 148

    > ++++++++++++++++++++--

    >  lib/packets.h                                     |   2 +

    >  3 files changed, 143 insertions(+), 8 deletions(-)

    > 

    > diff --git a/datapath/linux/compat/include/linux/openvswitch.h

    > b/datapath/linux/compat/include/linux/openvswitch.h

    > index d22102e..6dc5674 100644

    > --- a/datapath/linux/compat/include/linux/openvswitch.h

    > +++ b/datapath/linux/compat/include/linux/openvswitch.h

    > @@ -762,6 +762,7 @@ enum ovs_ct_attr {

    >  	OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */

    >  	OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */

    >  	OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */

    > +	OVS_CT_ATTR_SKIP,       /* No argument, does not invoke CT. */

    >  	__OVS_CT_ATTR_MAX

    >  };

    > 

    > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c

    > index b50164b..fbbb42e 100644

    > --- a/lib/dpif-netdev.c

    > +++ b/lib/dpif-netdev.c

    > @@ -2101,7 +2101,8 @@ emc_probabilistic_insert(struct dp_netdev_pmd_thread

    > *pmd,

    >  }

    > 

    >  static inline struct dp_netdev_flow *

    > -emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)

    > +emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key,

    > +        void **pfound_entry)

    >  {

    >      struct emc_entry *current_entry;

    > 

    > @@ -2110,11 +2111,13 @@ emc_lookup(struct emc_cache *cache, const struct

    > netdev_flow_key *key)

    >              && emc_entry_alive(current_entry)

    >              && netdev_flow_key_equal_mf(&current_entry->key, &key->mf)) {

    > 

    > +            *pfound_entry = current_entry;

    >              /* We found the entry with the 'key->mf' miniflow */

    >              return current_entry->flow;

    >          }

    >      }

    > 

    > +    *pfound_entry = NULL;

    >      return NULL;

    >  }

    > 

    > @@ -4583,10 +4586,12 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,

    >          key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);

    > 

    >          /* If EMC is disabled skip emc_lookup */

    > -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);

    > +        flow = (cur_min == 0) ? NULL:

    > +                emc_lookup(flow_cache, key, &packet->md.prev_emc_entry);

    >          if (OVS_LIKELY(flow)) {

    >              dp_netdev_queue_batches(packet, flow, &key->mf, batches,

    >                                      n_batches);

    > +            packet->md.prev_flow = flow;

    >          } else {

    >              /* Exact match cache missed. Group missed packets together at

    >               * the beginning of the 'packets' array. */

    > @@ -4595,6 +4600,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,

    >               * must be returned to the caller. The next key should be

    > extracted

    >               * to 'keys[n_missed + 1]'. */

    >              key = &keys[++n_missed];

    > +            packet->md.prev_flow = NULL;

    >          }

    >      }

    > 

    > @@ -4604,6 +4610,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,

    >      return dp_packet_batch_size(packets_);

    >  }

    > 

    > +#define CT_PREPEND_ACTION_LEN 0x0C

    > +#define CT_PREPEND_ACTION_SPARE_LEN 32

    > +#define CT_TOT_ACTIONS_LEN (CT_PREPEND_ACTION_LEN +

    > CT_PREPEND_ACTION_SPARE_LEN)

    >  static inline void

    >  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,

    >                       struct dp_packet *packet,

    > @@ -4663,12 +4672,44 @@ handle_packet_upcall(struct dp_netdev_pmd_thread

    > *pmd,

    >          ovs_mutex_lock(&pmd->flow_mutex);

    >          netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key, NULL);

    >          if (OVS_LIKELY(!netdev_flow)) {

    > -            netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,

    > -                                             add_actions->data,

    > -                                             add_actions->size);

    > +            /* Recirculated packets that belong to established

    > connections

    > +             * are treated differently.  A place-holder is prepended to

    > the

    > +             * list of actions. */

    > +            if (!(packet->md.ct_state & CS_INVALID) &&

    > +                 (packet->md.ct_state & CS_TRACKED) &&

    > +                 (packet->md.ct_state & CS_ESTABLISHED) &&

    > +                 (packet->md.recirc_id)) {

    > +

    
    
    [Joe] "Just because the value of ct_state in the packet metadata of this
    particular packet happens to have those particular state flags doesn't
    mean that the OpenFlow pipeline is enforcing this constraint for all
    packets hitting this flow. You would have to ensure that the mask in
    the flow is also ensuring that all future packets hitting the new flow
    will also have that particular ct_state."
    
    This part of the code should prepend a ct(Skip) action to the
    FLOW_RECIRC only and just for packets that belong to an established conn.
    The FLOW_ORIG shouldn't be altered, because it will be hit by all ingress
    packets.
    
    > +                /* Prepend an OVS_CT_ATTR_SKIP to the list of actions

    > inside

    > +                 * add_actions.  It does not really invoke the ConnTrack

    > module,

    > +                 * it's a Ct action that works as a place-holder.  It

    > will be

    > +                 * overwritten inside emc_patch_established_conn() with

    > +                 * the proper ct(zone=X) action. */

    > +                struct dp_netdev_actions *new_actions;

    > +                uint8_t tmp_action[CT_TOT_ACTIONS_LEN] = {

    > +                        0x0C, 0x00, 0x0C, 0x00,

    > +                        0x06, 0x00, 0x09, 0x00,

    > +                        0x00, 0x00, 0x00, 0x00,

    > +                        /* will append here add_actions. */

    > +                };

    > +                memcpy(&tmp_action[CT_PREPEND_ACTION_LEN],

    > +                        add_actions->data,

    > +                        add_actions->size);

    > +                new_actions = dp_netdev_actions_create(

    > +                        (const struct nlattr *)tmp_action,

    > +                        CT_PREPEND_ACTION_LEN + add_actions->size);

    
    [Joe] "Hmm. In some ways I think this would be cleaner to be implemented in
    the xlate code, around compose_conntrack_action().. but that code is
    currently pretty much dataplane-oblivious and this suggestion is a
    userspace-specific optimization."
    
    
    > +                netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,

    > +                                                 new_actions->actions,

    > +                                                 new_actions->size);

    > +            } else {

    > +                netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,

    > +                                                 add_actions->data,

    > +                                                 add_actions->size);

    > +            }

    >          }

    >          ovs_mutex_unlock(&pmd->flow_mutex);

    > -        emc_probabilistic_insert(pmd, key, netdev_flow);

    > +        /* For this RFC, probabilistic emc insertion is disabled. */

    > +        emc_insert(&pmd->flow_cache, key, netdev_flow);

    >      }

    >  }

    > 

    > @@ -4761,7 +4802,8 @@ fast_path_processing(struct dp_netdev_pmd_thread

    > *pmd,

    > 

    >          flow = dp_netdev_flow_cast(rules[i]);

    > 

    > -        emc_probabilistic_insert(pmd, &keys[i], flow);

    > +        /* For testing purposes the emc_insert is restored here. */

    > +        emc_insert(&pmd->flow_cache, &keys[i], flow);

    >          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,

    > n_batches);

    >      }

    > 

    > @@ -4998,6 +5040,78 @@ dp_execute_userspace_action(struct

    > dp_netdev_pmd_thread *pmd,

    >      }

    >  }

    > 

    > +/* Search into EMC to retrieve the entry related to the original packet

    > + * and the entry related to the recirculated packet.

    > + * If both EMC entries are alive, then:

    > + *  - The flow actions of the recirculated packet are updated with

    > + *    'ct(zone=N)' as retrieved from the actions of the original flow.

    > + *  - The EMC orig entry flow is updated with the flow pointer of recirc

    > entry.

    > + */

    > +static inline void

    > +emc_patch_established_conn(struct dp_netdev_pmd_thread *pmd,

    > +        struct dp_packet *packet, long long now OVS_UNUSED)

    > +{

    > +    struct emc_cache *cache = &pmd->flow_cache;

    > +    struct netdev_flow_key key;

    > +    struct emc_entry *orig_entry; /* EMC entry hit by the original

    > packet. */

    > +    struct emc_entry *recirc_entry; /* EMC entry for recirculated packet.

    > */

    > +    uint32_t old_hash;

    > +

    > +    if (!packet->md.prev_emc_entry) {

    > +        return;

    > +    }

    > +

    > +    orig_entry = packet->md.prev_emc_entry;

    > +    if (!emc_entry_alive(orig_entry)) {

    > +        return;

    > +    }

    > +

    > +    /* Check that the original EMC entry was not overwritten. */

    > +    struct dp_netdev_flow *orig_flow = orig_entry->flow;

    > +    if (packet->md.prev_flow && (packet->md.prev_flow != orig_flow)) {

    > +       return;

    > +    }

    > +

    > +    /* TODO as we are calling miniflow_extract now, can we avoid to

    > invoke

    > +     * it again when we'll classify this recirculated packet? */

    > +    miniflow_extract(packet, &key.mf);

    > +    key.len = 0; /* Not computed yet. */

    > +    old_hash = dp_packet_get_rss_hash(packet);

    > +    key.hash = dpif_netdev_packet_get_rss_hash(packet, NULL);

    > +    dp_packet_set_rss_hash(packet, old_hash);

    > +

    > +    EMC_FOR_EACH_POS_WITH_HASH(cache, recirc_entry, key.hash) {

    > +        if (recirc_entry->key.hash == key.hash

    > +            && emc_entry_alive(recirc_entry)

    > +            && netdev_flow_key_equal_mf(&recirc_entry->key, &key.mf)) {

    > +

    > +            if (orig_entry->flow == recirc_entry->flow) {

    > +                /* Nothing to do, already patched by a packet of this

    > +                 * same batch. */

    > +                return;

    > +            }

    > +            /* We found the entry related to the recirculated packet.

    > +             * Retrieve the actions for orig and recirc entries. * */

    > +            struct dp_netdev_actions * orig_actions =

    > +                    dp_netdev_flow_get_actions(orig_entry->flow);

    > +            struct dp_netdev_actions * recirc_actions =

    > +                    dp_netdev_flow_get_actions(recirc_entry->flow);

    > +

    > +            /* The orig entry action contains a CT action with the zone

    > info. */

    > +            struct nlattr *p = &orig_actions->actions[0];

    > +

    > +            /* Overwrite the CT Skip action of recirc entry with

    > ct(zone=N). */

    > +            memcpy(recirc_actions->actions, p, p->nla_len);

    > +

    > +            /* Update orig EMC entry. */

    > +            orig_entry->flow = recirc_entry->flow;

    > +            dp_netdev_flow_ref(orig_entry->flow);

    > +

    > +            return;

    > +        }

    > +    }

    > +}

    > +

    >  static void

    >  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,

    >                const struct nlattr *a, bool may_steal)

    > @@ -5024,7 +5138,6 @@ dp_execute_cb(void *aux_, struct dp_packet_batch

    > *packets_,

    >              } else {

    >                  tx_qid = pmd->static_tx_qid;

    >              }

    > -

    >              netdev_send(p->port->netdev, tx_qid, packets_, may_steal,

    >                          dynamic_txqs);

    >              return;

    
    [Joe] "Unrelated style change."
    
    Will do that.
    
    > @@ -5145,10 +5258,21 @@ dp_execute_cb(void *aux_, struct dp_packet_batch

    > *packets_,

    >              struct dp_packet *packet;

    >              DP_PACKET_BATCH_FOR_EACH (packet, packets_) {

    >                  packet->md.recirc_id = nl_attr_get_u32(a);

    > +

    > +                if (!(packet->md.ct_state & CS_INVALID) &&

    > +                     (packet->md.ct_state & CS_TRACKED) &&

    > +                     (packet->md.ct_state & CS_ESTABLISHED)) {

    > +                    (*depth)++;

    > +                    emc_patch_established_conn(pmd, packet, now);

    > +                    (*depth)--;

    > +                }

    >              }

    > 

    >              (*depth)++;

    >              dp_netdev_recirculate(pmd, packets_);

    > +            DP_PACKET_BATCH_FOR_EACH (packet, packets_) {

    > +                packet->md.recirc_id = 0;

    > +            }

    >              (*depth)--;

    > 

    >              return;

    
    [Joe] "Can you give some more detail about why resetting the recirc_id is
    correct here? Could/should this be applied as an independent change
    regardless of the rest of this series?"
    
    This addition is needed so that in handle_packet_upcall we add the ct(Skip)
    just to the FLOW_RECIRC and not to the FLOW_ORIG.
    I guess this could be also applied as an independent change?
    
    > @@ -5166,6 +5290,7 @@ dp_execute_cb(void *aux_, struct dp_packet_batch

    > *packets_,

    >          const char *helper = NULL;

    >          const uint32_t *setmark = NULL;

    >          const struct ovs_key_ct_labels *setlabel = NULL;

    > +        bool skip_ct = false;

    > 

    >          NL_ATTR_FOR_EACH_UNSAFE (b, left, nl_attr_get(a),

    >                                   nl_attr_get_size(a)) {

    > @@ -5194,6 +5319,9 @@ dp_execute_cb(void *aux_, struct dp_packet_batch

    > *packets_,

    >                  /* Silently ignored, as userspace datapath does not

    > generate

    >                   * netlink events. */

    >                  break;

    > +            case OVS_CT_ATTR_SKIP:

    > +                skip_ct = true;

    > +                break;

    >              case OVS_CT_ATTR_NAT:

    >              case OVS_CT_ATTR_UNSPEC:

    >              case __OVS_CT_ATTR_MAX:

    > @@ -5201,6 +5329,10 @@ dp_execute_cb(void *aux_, struct dp_packet_batch

    > *packets_,

    >              }

    >          }

    > 

    > +        if (OVS_UNLIKELY(skip_ct)) {

    > +            break;

    > +        }

    > +

    >          conntrack_execute(&dp->conntrack, packets_, aux->flow->dl_type,

    > force,

    >                            commit, zone, setmark, setlabel, helper);

    >          break;

    > diff --git a/lib/packets.h b/lib/packets.h

    > index 7dbf6dd..2c46a15 100644

    > --- a/lib/packets.h

    > +++ b/lib/packets.h

    > @@ -112,6 +112,8 @@ struct pkt_metadata {

    >      struct flow_tnl tunnel;     /* Encapsulating tunnel parameters. Note

    > that

    >                                   * if 'ip_dst' == 0, the rest of the

    > fields may

    >                                   * be uninitialized. */

    > +    void *prev_emc_entry;       /* EMC entry that this packet matched. */

    > +    void *prev_flow;            /* Flow pointed by the matching EMC

    > entry. */

    >  };

    > 

    >  static inline void

    > --

    > 2.4.11

    > 

    > _______________________________________________

    > dev mailing list

    > dev@openvswitch.org

    > https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.openvswitch.org_mailman_listinfo_ovs-2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e= 

    _______________________________________________
    dev mailing list
    dev@openvswitch.org
    https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.openvswitch.org_mailman_listinfo_ovs-2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=
Fischetti, Antonio June 6, 2017, 4:14 p.m. UTC | #3
Thanks Darrel for your useful comments, I've tried to replicate the
usecases you mentioned, please find inline some details.

Regards,
Antonio

> -----Original Message-----

> From: Darrell Ball [mailto:dball@vmware.com]

> Sent: Thursday, June 1, 2017 6:20 PM

> To: Fischetti, Antonio <antonio.fischetti@intel.com>; dev@openvswitch.org

> Subject: Re: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for

> established connections.

> 

> Comments inline

> 

> On 5/29/17, 8:24 AM, "ovs-dev-bounces@openvswitch.org on behalf of

> Fischetti, Antonio" <ovs-dev-bounces@openvswitch.org on behalf of

> antonio.fischetti@intel.com> wrote:

> 

>     Thanks Joe for your feedback and the interesting insights in conntrack

> in your earlier communication.

>     We have added all the details that we considered for this first

> implementation. Also, some answers are inline.

> 

>     The purpose of this implementation is to avoid recirculation just for

> those packets that are part of established connections.

> 

>     This shouldn't affect the packet recirculation for actions other than

> conntrack. For example in MPLS, after a pop_mpls action the packet will

> still be recirculated to follow the usual datapath.

> 

>     Most importantly, the CT module isn't by-passed in this

> implementation.

> 

>     Besides the recirculation, all other action[s] shall be executed as-is

> on each packet.

>     Any new CT change or action set by the controller will be managed as

> usual.

> 

>     For our testing we set up a simple firewall, below are the flows.

> 

> 

>      Flow Setup

>      ----------

>     table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)

>     table=0, priority=10,arp actions=NORMAL

>     table=0, priority=1 actions=drop

>     table=1, ct_state=+new+trk,ip,in_port=1 actions=ct(commit),output:2

>     table=1, ct_state=+new+trk,ip,in_port=2 actions=drop

>     table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2

>     table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1

> 

> 

>      Basic idea

>      ----------

>     With the above rules, all packets belonging to an established

> connection will first hit the flow

>     "ct_state=-trk,ip actions=ct(table=1)"

> 

>     then on recirculation, they will hit

>     "ct_state=+est+trk,ip,in_port=.. actions=output:X".

> 

>     The basic idea is to do the following 2 steps.

>     1. Combine the two sets of actions by removing the recirculation.

>        a) Original actions:

>         - "ct(zone=N), recirc(id)" [ i.e ct(table=1) ]

>         - "output:X"

>        b) Combined Actions after Removing recirculation:

>         - "ct(zone=N), output:X".

> 

>     2. All the subsequent packets shall hit a flow with the combined

> actions.

> 

> 

> [Darrell]

> 

> 1) This would be constraining on how rules are written such that we can’t

> go back to

>               the beginning of the pipeline, just because a packet is

> established.

>               Meaning we can’t run stateless or stateful rules earlier in

> the pipeline.

>               There are lots of possible combinations.

>               One case is:

>               We may run a bunch of firewall checks in table X, later on

> (table X+1) run the non-filtered packets through conntrack

>               and then want to go back to the firewall table.

> 


[Antonio]
I've tried to replicate the case you described. I've used the 
following set of rules. They may have no practical meaning,
the purpose is to have - at least for packets from port 1 - that
they traverse tables #0 => 1 => 2 and then are sent back to table#0.

NXST_FLOW reply (xid=0x4):
 table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)
 table=1, ct_state=+new+trk,ip,in_port=1,nw_src=10.10.10.0/24 actions=ct(commit),output:2
 table=1, ct_state=+est+trk,ip,in_port=1 actions=mod_nw_src:3.3.3.3,resubmit(,2)
 table=2, ct_state=+est+trk,tcp,in_port=1 actions=drop
 table=2, ct_state=+est+trk,udp,in_port=1 actions=resubmit(1,0)
 table=0, ct_state=+est+trk,udp,in_port=1,nw_src=3.3.3.3 actions=output:2
 table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
 table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
 table=0, priority=10,arp actions=NORMAL

For packets from port1:
 - table#0: untracked packets go through the CT, then are recirculated, ie go to table#1.
 - table#1: new conn are committed.
 - table#1: packets of est conn are modified in the IPsrc that becomes 3.3.3.3,
            then sent to table#2.
 - table#2: TCP packets are dropped - but I'm sending just UDP traffic.
 - table#2: UDP packets go back to table#0.
 - table#0: est conn packets with IPsrc=3.3.3.3 are sent to port2.

In this case we still see that the packets of est conn will skip one
recirculation.

In fact what happens is the following.
For packets belonging to est connections - and coming from
port1 - the actions 
   mod_nw_src:3.3.3.3,resubmit(,2)
become
   ct(),mod_nw_src:3.3.3.3,resubmit(,2)

and the EMC entry that was hit by the original packet will point to the megaflow
   table=1, ct_state=+est+trk,ip,in_port=1 actions=ct(),mod_nw_src:3.3.3.3,resubmit(,2)
So new packets of the same connection will skip one recirculation.

For packets from port2 the megaflow
   table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
becomes
   table=1, ct_state=+est+trk,ip,in_port=2 actions=ct(),output:1
and will be pointed by the EMC entry hit by the original packet.
So new packets of the same connection will skip one recirculation.

Is this usecase somewhat similar to the one you were mentioning? 
Or maybe you can suggest some other rule setup?


> 2) This allows the dataplane to override what the ofproto layer specifies.

> This looks like a layering violation.

> 

> I guess this also speaks to the datapath (userspace) specific design here.

> 

> 

> I don’t think it matters what the performance advantage is, although it is

> small in this case.

> 

> 

> 

> 

>      Considerations

>      --------------

>     On EMC entries:

>     All packets hitting a connection in the same direction will have the

> same 5-tuple.

>     They shall all hit the same EMC entry - here referred as 'EMC_ORIG' -

> and point to the flow

>     "ct_state=-trk,ip actions=ct(table=1)", referred from here as

> 'FLOW_ORIG'.

> 

>     On recirculation the packets shall hit another EMC entry - referred as

> 'EMC_RECIRC' - and point to the flow

>     "ct_state=+est+trk,ip,in_port=.. actions=output:X",

>     referred from here as 'FLOW_RECIRC'.

> 

>     FLOW_ORIG is the flow hit by the ingress packet while FLOW_RECIRC is

> the flow hit

>     by the recirculated packet.

> 

> 

>      Implementation Details

>      ----------------------

>     We thought about different implementations, but for the purpose of

> this RFC we stick to the following approach.

> 

>     On packet recirculation, a new flow has to be created with a list of

> appropriate actions. At this

>     stage we prepend a new action called 'ct(Skip)' to the original list

> of actions. This

>     new action is used as a temporary place-holder that will be over

> written later on.

> 

>        Original Actions: "Output:2"    ==>    New Actions: "ct(Skip),

> Output:2"

> 

>     For all the packets of an established conn. (i.e marked with +trk,+est

> bits), do the 3 steps:

>     1. Retrieve Flows from EMC.

>         - FLOW_ORIG whose actions include:   ct(zone=N), recirc(id)

>         - FLOW_RECIRC whose actions include: ct(Skip), Output:X

> 

>     2. Update the actions of FLOW_RECIRC by overwriting ct(skip) with the

> ct action from FLOW_ORIG, ie ct(zone=N).

>         ct(Skip), Output:X  =>  ct(zone=N), Output:X

> 

>     3. Update the flow pointer in EMC_ORIG entry to 'FLOW_RECIRC' flow.

> 

>     In this way, we avoid recirculating the packets belonging to

> established connections and

>     this no ways skips sending the packets to the CT module.

> 

> 

>     Antonio

> 

>     > -----Original Message-----

>     > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-

>     > bounces@openvswitch.org] On Behalf Of antonio.fischetti@intel.com

>     > Sent: Thursday, May 25, 2017 5:03 PM

>     > To: dev@openvswitch.org

>     > Subject: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for

>     > established connections.

>     >

>     > From: Antonio Fischetti <antonio.fischetti@intel.com>

>     >

>     > With conntrack enabled, packets get recirculated and this impacts

>     > the performance with thousands of active concurrent connections.

>     >

>     > This patch is aimed at avoiding recirculation for packets belonging

> to

>     > established connections in steady state. This is achieved by

>     > manipulating the megaflows and actions. In this case the actions of

> the

>     > megaflow of recirculated packet is initially updated with new

> 'CT_SKIP'

>     > action and later updated with the actions of the megaflow of the

>     > original packet(to handle zones). There after the EMC entry

>     > belonging to the original packet is updated to point to the

> 'megaflow' of

>     > the recirculated packet there by avoiding recirculation.

>     >

>     > Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>

>     > Signed-off-by: Bhanuprakash Bodireddy

> <bhanuprakash.bodireddy@intel.com>

>     > Co-authored-by: Bhanuprakash Bodireddy

> <bhanuprakash.bodireddy@intel.com>

>     > ---

>     > - ~10% performance improvement is observed in our testing with UDP

>     > streams.

>     > - Limited testing is done with RFC patch and the tests include

> hundreds of

>     >   concurrent active TCP connections.

>     > - This is very early implementation and we are posting here to get

> initial

>     >   feedback.

>     > - This RFC patch is implemented leveraging EMC, but optimization

> could be

>     >   extended to dpcls as well to handle higher flows.

>     > - No VXLAN testing is done with this patch.

>     >

>     >  datapath/linux/compat/include/linux/openvswitch.h |   1 +

>     >  lib/dpif-netdev.c                                 | 148

>     > ++++++++++++++++++++--

>     >  lib/packets.h                                     |   2 +

>     >  3 files changed, 143 insertions(+), 8 deletions(-)

>     >

>     > diff --git a/datapath/linux/compat/include/linux/openvswitch.h

>     > b/datapath/linux/compat/include/linux/openvswitch.h

>     > index d22102e..6dc5674 100644

>     > --- a/datapath/linux/compat/include/linux/openvswitch.h

>     > +++ b/datapath/linux/compat/include/linux/openvswitch.h

>     > @@ -762,6 +762,7 @@ enum ovs_ct_attr {

>     >  	OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */

>     >  	OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */

>     >  	OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */

>     > +	OVS_CT_ATTR_SKIP,       /* No argument, does not invoke CT.

> */

>     >  	__OVS_CT_ATTR_MAX

>     >  };

>     >

>     > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c

>     > index b50164b..fbbb42e 100644

>     > --- a/lib/dpif-netdev.c

>     > +++ b/lib/dpif-netdev.c

>     > @@ -2101,7 +2101,8 @@ emc_probabilistic_insert(struct

> dp_netdev_pmd_thread

>     > *pmd,

>     >  }

>     >

>     >  static inline struct dp_netdev_flow *

>     > -emc_lookup(struct emc_cache *cache, const struct netdev_flow_key

> *key)

>     > +emc_lookup(struct emc_cache *cache, const struct netdev_flow_key

> *key,

>     > +        void **pfound_entry)

>     >  {

>     >      struct emc_entry *current_entry;

>     >

>     > @@ -2110,11 +2111,13 @@ emc_lookup(struct emc_cache *cache, const

> struct

>     > netdev_flow_key *key)

>     >              && emc_entry_alive(current_entry)

>     >              && netdev_flow_key_equal_mf(&current_entry->key, &key-

> >mf)) {

>     >

>     > +            *pfound_entry = current_entry;

>     >              /* We found the entry with the 'key->mf' miniflow */

>     >              return current_entry->flow;

>     >          }

>     >      }

>     >

>     > +    *pfound_entry = NULL;

>     >      return NULL;

>     >  }

>     >

>     > @@ -4583,10 +4586,12 @@ emc_processing(struct dp_netdev_pmd_thread

> *pmd,

>     >          key->hash = dpif_netdev_packet_get_rss_hash(packet, &key-

> >mf);

>     >

>     >          /* If EMC is disabled skip emc_lookup */

>     > -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);

>     > +        flow = (cur_min == 0) ? NULL:

>     > +                emc_lookup(flow_cache, key, &packet-

> >md.prev_emc_entry);

>     >          if (OVS_LIKELY(flow)) {

>     >              dp_netdev_queue_batches(packet, flow, &key->mf,

> batches,

>     >                                      n_batches);

>     > +            packet->md.prev_flow = flow;

>     >          } else {

>     >              /* Exact match cache missed. Group missed packets

> together at

>     >               * the beginning of the 'packets' array. */

>     > @@ -4595,6 +4600,7 @@ emc_processing(struct dp_netdev_pmd_thread

> *pmd,

>     >               * must be returned to the caller. The next key should

> be

>     > extracted

>     >               * to 'keys[n_missed + 1]'. */

>     >              key = &keys[++n_missed];

>     > +            packet->md.prev_flow = NULL;

>     >          }

>     >      }

>     >

>     > @@ -4604,6 +4610,9 @@ emc_processing(struct dp_netdev_pmd_thread

> *pmd,

>     >      return dp_packet_batch_size(packets_);

>     >  }

>     >

>     > +#define CT_PREPEND_ACTION_LEN 0x0C

>     > +#define CT_PREPEND_ACTION_SPARE_LEN 32

>     > +#define CT_TOT_ACTIONS_LEN (CT_PREPEND_ACTION_LEN +

>     > CT_PREPEND_ACTION_SPARE_LEN)

>     >  static inline void

>     >  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,

>     >                       struct dp_packet *packet,

>     > @@ -4663,12 +4672,44 @@ handle_packet_upcall(struct

> dp_netdev_pmd_thread

>     > *pmd,

>     >          ovs_mutex_lock(&pmd->flow_mutex);

>     >          netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key, NULL);

>     >          if (OVS_LIKELY(!netdev_flow)) {

>     > -            netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,

>     > -                                             add_actions->data,

>     > -                                             add_actions->size);

>     > +            /* Recirculated packets that belong to established

>     > connections

>     > +             * are treated differently.  A place-holder is

> prepended to

>     > the

>     > +             * list of actions. */

>     > +            if (!(packet->md.ct_state & CS_INVALID) &&

>     > +                 (packet->md.ct_state & CS_TRACKED) &&

>     > +                 (packet->md.ct_state & CS_ESTABLISHED) &&

>     > +                 (packet->md.recirc_id)) {

>     > +

> 

> 

>     [Joe] "Just because the value of ct_state in the packet metadata of

> this

>     particular packet happens to have those particular state flags doesn't

>     mean that the OpenFlow pipeline is enforcing this constraint for all

>     packets hitting this flow. You would have to ensure that the mask in

>     the flow is also ensuring that all future packets hitting the new flow

>     will also have that particular ct_state."

> 

>     This part of the code should prepend a ct(Skip) action to the

>     FLOW_RECIRC only and just for packets that belong to an established

> conn.

>     The FLOW_ORIG shouldn't be altered, because it will be hit by all

> ingress

>     packets.

> 

>     > +                /* Prepend an OVS_CT_ATTR_SKIP to the list of

> actions

>     > inside

>     > +                 * add_actions.  It does not really invoke the

> ConnTrack

>     > module,

>     > +                 * it's a Ct action that works as a place-holder.

> It

>     > will be

>     > +                 * overwritten inside emc_patch_established_conn()

> with

>     > +                 * the proper ct(zone=X) action. */

>     > +                struct dp_netdev_actions *new_actions;

>     > +                uint8_t tmp_action[CT_TOT_ACTIONS_LEN] = {

>     > +                        0x0C, 0x00, 0x0C, 0x00,

>     > +                        0x06, 0x00, 0x09, 0x00,

>     > +                        0x00, 0x00, 0x00, 0x00,

>     > +                        /* will append here add_actions. */

>     > +                };

>     > +                memcpy(&tmp_action[CT_PREPEND_ACTION_LEN],

>     > +                        add_actions->data,

>     > +                        add_actions->size);

>     > +                new_actions = dp_netdev_actions_create(

>     > +                        (const struct nlattr *)tmp_action,

>     > +                        CT_PREPEND_ACTION_LEN + add_actions->size);

> 

>     [Joe] "Hmm. In some ways I think this would be cleaner to be

> implemented in

>     the xlate code, around compose_conntrack_action().. but that code is

>     currently pretty much dataplane-oblivious and this suggestion is a

>     userspace-specific optimization."

> 

> 

>     > +                netdev_flow = dp_netdev_flow_add(pmd, &match,

> &ufid,

>     > +                                                 new_actions-

> >actions,

>     > +                                                 new_actions-

> >size);

>     > +            } else {

>     > +                netdev_flow = dp_netdev_flow_add(pmd, &match,

> &ufid,

>     > +                                                 add_actions->data,

>     > +                                                 add_actions-

> >size);

>     > +            }

>     >          }

>     >          ovs_mutex_unlock(&pmd->flow_mutex);

>     > -        emc_probabilistic_insert(pmd, key, netdev_flow);

>     > +        /* For this RFC, probabilistic emc insertion is disabled.

> */

>     > +        emc_insert(&pmd->flow_cache, key, netdev_flow);

>     >      }

>     >  }

>     >

>     > @@ -4761,7 +4802,8 @@ fast_path_processing(struct

> dp_netdev_pmd_thread

>     > *pmd,

>     >

>     >          flow = dp_netdev_flow_cast(rules[i]);

>     >

>     > -        emc_probabilistic_insert(pmd, &keys[i], flow);

>     > +        /* For testing purposes the emc_insert is restored here. */

>     > +        emc_insert(&pmd->flow_cache, &keys[i], flow);

>     >          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,

>     > n_batches);

>     >      }

>     >

>     > @@ -4998,6 +5040,78 @@ dp_execute_userspace_action(struct

>     > dp_netdev_pmd_thread *pmd,

>     >      }

>     >  }

>     >

>     > +/* Search into EMC to retrieve the entry related to the original

> packet

>     > + * and the entry related to the recirculated packet.

>     > + * If both EMC entries are alive, then:

>     > + *  - The flow actions of the recirculated packet are updated with

>     > + *    'ct(zone=N)' as retrieved from the actions of the original

> flow.

>     > + *  - The EMC orig entry flow is updated with the flow pointer of

> recirc

>     > entry.

>     > + */

>     > +static inline void

>     > +emc_patch_established_conn(struct dp_netdev_pmd_thread *pmd,

>     > +        struct dp_packet *packet, long long now OVS_UNUSED)

>     > +{

>     > +    struct emc_cache *cache = &pmd->flow_cache;

>     > +    struct netdev_flow_key key;

>     > +    struct emc_entry *orig_entry; /* EMC entry hit by the original

>     > packet. */

>     > +    struct emc_entry *recirc_entry; /* EMC entry for recirculated

> packet.

>     > */

>     > +    uint32_t old_hash;

>     > +

>     > +    if (!packet->md.prev_emc_entry) {

>     > +        return;

>     > +    }

>     > +

>     > +    orig_entry = packet->md.prev_emc_entry;

>     > +    if (!emc_entry_alive(orig_entry)) {

>     > +        return;

>     > +    }

>     > +

>     > +    /* Check that the original EMC entry was not overwritten. */

>     > +    struct dp_netdev_flow *orig_flow = orig_entry->flow;

>     > +    if (packet->md.prev_flow && (packet->md.prev_flow !=

> orig_flow)) {

>     > +       return;

>     > +    }

>     > +

>     > +    /* TODO as we are calling miniflow_extract now, can we avoid to

>     > invoke

>     > +     * it again when we'll classify this recirculated packet? */

>     > +    miniflow_extract(packet, &key.mf);

>     > +    key.len = 0; /* Not computed yet. */

>     > +    old_hash = dp_packet_get_rss_hash(packet);

>     > +    key.hash = dpif_netdev_packet_get_rss_hash(packet, NULL);

>     > +    dp_packet_set_rss_hash(packet, old_hash);

>     > +

>     > +    EMC_FOR_EACH_POS_WITH_HASH(cache, recirc_entry, key.hash) {

>     > +        if (recirc_entry->key.hash == key.hash

>     > +            && emc_entry_alive(recirc_entry)

>     > +            && netdev_flow_key_equal_mf(&recirc_entry->key,

> &key.mf)) {

>     > +

>     > +            if (orig_entry->flow == recirc_entry->flow) {

>     > +                /* Nothing to do, already patched by a packet of

> this

>     > +                 * same batch. */

>     > +                return;

>     > +            }

>     > +            /* We found the entry related to the recirculated

> packet.

>     > +             * Retrieve the actions for orig and recirc entries. *

> */

>     > +            struct dp_netdev_actions * orig_actions =

>     > +                    dp_netdev_flow_get_actions(orig_entry->flow);

>     > +            struct dp_netdev_actions * recirc_actions =

>     > +                    dp_netdev_flow_get_actions(recirc_entry->flow);

>     > +

>     > +            /* The orig entry action contains a CT action with the

> zone

>     > info. */

>     > +            struct nlattr *p = &orig_actions->actions[0];

>     > +

>     > +            /* Overwrite the CT Skip action of recirc entry with

>     > ct(zone=N). */

>     > +            memcpy(recirc_actions->actions, p, p->nla_len);

>     > +

>     > +            /* Update orig EMC entry. */

>     > +            orig_entry->flow = recirc_entry->flow;

>     > +            dp_netdev_flow_ref(orig_entry->flow);

>     > +

>     > +            return;

>     > +        }

>     > +    }

>     > +}

>     > +

>     >  static void

>     >  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,

>     >                const struct nlattr *a, bool may_steal)

>     > @@ -5024,7 +5138,6 @@ dp_execute_cb(void *aux_, struct

> dp_packet_batch

>     > *packets_,

>     >              } else {

>     >                  tx_qid = pmd->static_tx_qid;

>     >              }

>     > -

>     >              netdev_send(p->port->netdev, tx_qid, packets_,

> may_steal,

>     >                          dynamic_txqs);

>     >              return;

> 

>     [Joe] "Unrelated style change."

> 

>     Will do that.

> 

>     > @@ -5145,10 +5258,21 @@ dp_execute_cb(void *aux_, struct

> dp_packet_batch

>     > *packets_,

>     >              struct dp_packet *packet;

>     >              DP_PACKET_BATCH_FOR_EACH (packet, packets_) {

>     >                  packet->md.recirc_id = nl_attr_get_u32(a);

>     > +

>     > +                if (!(packet->md.ct_state & CS_INVALID) &&

>     > +                     (packet->md.ct_state & CS_TRACKED) &&

>     > +                     (packet->md.ct_state & CS_ESTABLISHED)) {

>     > +                    (*depth)++;

>     > +                    emc_patch_established_conn(pmd, packet, now);

>     > +                    (*depth)--;

>     > +                }

>     >              }

>     >

>     >              (*depth)++;

>     >              dp_netdev_recirculate(pmd, packets_);

>     > +            DP_PACKET_BATCH_FOR_EACH (packet, packets_) {

>     > +                packet->md.recirc_id = 0;

>     > +            }

>     >              (*depth)--;

>     >

>     >              return;

> 

>     [Joe] "Can you give some more detail about why resetting the recirc_id

> is

>     correct here? Could/should this be applied as an independent change

>     regardless of the rest of this series?"

> 

>     This addition is needed so that in handle_packet_upcall we add the

> ct(Skip)

>     just to the FLOW_RECIRC and not to the FLOW_ORIG.

>     I guess this could be also applied as an independent change?

> 

>     > @@ -5166,6 +5290,7 @@ dp_execute_cb(void *aux_, struct

> dp_packet_batch

>     > *packets_,

>     >          const char *helper = NULL;

>     >          const uint32_t *setmark = NULL;

>     >          const struct ovs_key_ct_labels *setlabel = NULL;

>     > +        bool skip_ct = false;

>     >

>     >          NL_ATTR_FOR_EACH_UNSAFE (b, left, nl_attr_get(a),

>     >                                   nl_attr_get_size(a)) {

>     > @@ -5194,6 +5319,9 @@ dp_execute_cb(void *aux_, struct

> dp_packet_batch

>     > *packets_,

>     >                  /* Silently ignored, as userspace datapath does not

>     > generate

>     >                   * netlink events. */

>     >                  break;

>     > +            case OVS_CT_ATTR_SKIP:

>     > +                skip_ct = true;

>     > +                break;

>     >              case OVS_CT_ATTR_NAT:

>     >              case OVS_CT_ATTR_UNSPEC:

>     >              case __OVS_CT_ATTR_MAX:

>     > @@ -5201,6 +5329,10 @@ dp_execute_cb(void *aux_, struct

> dp_packet_batch

>     > *packets_,

>     >              }

>     >          }

>     >

>     > +        if (OVS_UNLIKELY(skip_ct)) {

>     > +            break;

>     > +        }

>     > +

>     >          conntrack_execute(&dp->conntrack, packets_, aux->flow-

> >dl_type,

>     > force,

>     >                            commit, zone, setmark, setlabel, helper);

>     >          break;

>     > diff --git a/lib/packets.h b/lib/packets.h

>     > index 7dbf6dd..2c46a15 100644

>     > --- a/lib/packets.h

>     > +++ b/lib/packets.h

>     > @@ -112,6 +112,8 @@ struct pkt_metadata {

>     >      struct flow_tnl tunnel;     /* Encapsulating tunnel parameters.

> Note

>     > that

>     >                                   * if 'ip_dst' == 0, the rest of

> the

>     > fields may

>     >                                   * be uninitialized. */

>     > +    void *prev_emc_entry;       /* EMC entry that this packet

> matched. */

>     > +    void *prev_flow;            /* Flow pointed by the matching EMC

>     > entry. */

>     >  };

>     >

>     >  static inline void

>     > --

>     > 2.4.11

>     >

>     > _______________________________________________

>     > dev mailing list

>     > dev@openvswitch.org

>     > https://urldefense.proofpoint.com/v2/url?u=https-

> 3A__mail.openvswitch.org_mailman_listinfo_ovs-

> 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-

> uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-

> Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=

>     _______________________________________________

>     dev mailing list

>     dev@openvswitch.org

>     https://urldefense.proofpoint.com/v2/url?u=https-

> 3A__mail.openvswitch.org_mailman_listinfo_ovs-

> 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-

> uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-

> Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=

>
Chandran, Sugesh June 23, 2017, 3:20 p.m. UTC | #4
Hi Antonio,

Thank you for the patches, 

Please find my comments below.



Regards
_Sugesh


> -----Original Message-----

> From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-

> bounces@openvswitch.org] On Behalf Of Fischetti, Antonio

> Sent: Tuesday, June 6, 2017 5:15 PM

> To: Darrell Ball <dball@vmware.com>; dev@openvswitch.org

> Subject: Re: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for

> established connections.

> 

> Thanks Darrel for your useful comments, I've tried to replicate the usecases

> you mentioned, please find inline some details.

> 

> Regards,

> Antonio

> 

> > -----Original Message-----

> > From: Darrell Ball [mailto:dball@vmware.com]

> > Sent: Thursday, June 1, 2017 6:20 PM

> > To: Fischetti, Antonio <antonio.fischetti@intel.com>;

> > dev@openvswitch.org

> > Subject: Re: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation

> > for established connections.

> >

> > Comments inline

> >

> > On 5/29/17, 8:24 AM, "ovs-dev-bounces@openvswitch.org on behalf of

> > Fischetti, Antonio" <ovs-dev-bounces@openvswitch.org on behalf of

> > antonio.fischetti@intel.com> wrote:

> >

> >     Thanks Joe for your feedback and the interesting insights in

> > conntrack in your earlier communication.

> >     We have added all the details that we considered for this first

> > implementation. Also, some answers are inline.

> >

> >     The purpose of this implementation is to avoid recirculation just

> > for those packets that are part of established connections.

> >

> >     This shouldn't affect the packet recirculation for actions other

> > than conntrack. For example in MPLS, after a pop_mpls action the

> > packet will still be recirculated to follow the usual datapath.

> >

> >     Most importantly, the CT module isn't by-passed in this

> > implementation.

> >

> >     Besides the recirculation, all other action[s] shall be executed

> > as-is on each packet.

> >     Any new CT change or action set by the controller will be managed

> > as usual.

> >

> >     For our testing we set up a simple firewall, below are the flows.

> >

> >

> >      Flow Setup

> >      ----------

> >     table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)

> >     table=0, priority=10,arp actions=NORMAL

> >     table=0, priority=1 actions=drop

> >     table=1, ct_state=+new+trk,ip,in_port=1 actions=ct(commit),output:2

> >     table=1, ct_state=+new+trk,ip,in_port=2 actions=drop

> >     table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2

> >     table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1

> >

> >

> >      Basic idea

> >      ----------

> >     With the above rules, all packets belonging to an established

> > connection will first hit the flow

> >     "ct_state=-trk,ip actions=ct(table=1)"

> >

> >     then on recirculation, they will hit

> >     "ct_state=+est+trk,ip,in_port=.. actions=output:X".

> >

> >     The basic idea is to do the following 2 steps.

> >     1. Combine the two sets of actions by removing the recirculation.

> >        a) Original actions:

> >         - "ct(zone=N), recirc(id)" [ i.e ct(table=1) ]

> >         - "output:X"

> >        b) Combined Actions after Removing recirculation:

> >         - "ct(zone=N), output:X".

> >

> >     2. All the subsequent packets shall hit a flow with the combined

> > actions.

> >

> >

> > [Darrell]

> >

> > 1) This would be constraining on how rules are written such that we

> > can’t go back to

> >               the beginning of the pipeline, just because a packet is

> > established.

> >               Meaning we can’t run stateless or stateful rules earlier

> > in the pipeline.

> >               There are lots of possible combinations.

> >               One case is:

> >               We may run a bunch of firewall checks in table X, later

> > on (table X+1) run the non-filtered packets through conntrack

> >               and then want to go back to the firewall table.

> >

> 

> [Antonio]

> I've tried to replicate the case you described. I've used the following set of

> rules. They may have no practical meaning, the purpose is to have - at least

> for packets from port 1 - that they traverse tables #0 => 1 => 2 and then are

> sent back to table#0.

> 

> NXST_FLOW reply (xid=0x4):

>  table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)  table=1,

> ct_state=+new+trk,ip,in_port=1,nw_src=10.10.10.0/24

> actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1

> actions=mod_nw_src:3.3.3.3,resubmit(,2)

>  table=2, ct_state=+est+trk,tcp,in_port=1 actions=drop  table=2,

> ct_state=+est+trk,udp,in_port=1 actions=resubmit(1,0)  table=0,

> ct_state=+est+trk,udp,in_port=1,nw_src=3.3.3.3 actions=output:2  table=1,

> ct_state=+new+trk,ip,in_port=2 actions=drop  table=1,

> ct_state=+est+trk,ip,in_port=2 actions=output:1  table=0, priority=10,arp

> actions=NORMAL

> 

> For packets from port1:

>  - table#0: untracked packets go through the CT, then are recirculated, ie go

> to table#1.

>  - table#1: new conn are committed.

>  - table#1: packets of est conn are modified in the IPsrc that becomes 3.3.3.3,

>             then sent to table#2.

>  - table#2: TCP packets are dropped - but I'm sending just UDP traffic.

>  - table#2: UDP packets go back to table#0.

>  - table#0: est conn packets with IPsrc=3.3.3.3 are sent to port2.

> 

> In this case we still see that the packets of est conn will skip one recirculation.

> 

> In fact what happens is the following.

> For packets belonging to est connections - and coming from

> port1 - the actions

>    mod_nw_src:3.3.3.3,resubmit(,2)

> become

>    ct(),mod_nw_src:3.3.3.3,resubmit(,2)

> 

> and the EMC entry that was hit by the original packet will point to the

> megaflow

>    table=1, ct_state=+est+trk,ip,in_port=1

> actions=ct(),mod_nw_src:3.3.3.3,resubmit(,2)

> So new packets of the same connection will skip one recirculation.

> 

> For packets from port2 the megaflow

>    table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1 becomes

>    table=1, ct_state=+est+trk,ip,in_port=2 actions=ct(),output:1 and will be

> pointed by the EMC entry hit by the original packet.

> So new packets of the same connection will skip one recirculation.

> 

> Is this usecase somewhat similar to the one you were mentioning?

> Or maybe you can suggest some other rule setup?

> 

> 

> > 2) This allows the dataplane to override what the ofproto layer specifies.

> > This looks like a layering violation.

> >

> > I guess this also speaks to the datapath (userspace) specific design here.

> >

> >

> > I don’t think it matters what the performance advantage is, although

> > it is small in this case.

> >

> >

> >

> >

> >      Considerations

> >      --------------

> >     On EMC entries:

> >     All packets hitting a connection in the same direction will have

> > the same 5-tuple.

> >     They shall all hit the same EMC entry - here referred as

> > 'EMC_ORIG' - and point to the flow

> >     "ct_state=-trk,ip actions=ct(table=1)", referred from here as

> > 'FLOW_ORIG'.

> >

> >     On recirculation the packets shall hit another EMC entry -

> > referred as 'EMC_RECIRC' - and point to the flow

> >     "ct_state=+est+trk,ip,in_port=.. actions=output:X",

> >     referred from here as 'FLOW_RECIRC'.

> >

> >     FLOW_ORIG is the flow hit by the ingress packet while FLOW_RECIRC

> > is the flow hit

> >     by the recirculated packet.

> >

> >

> >      Implementation Details

> >      ----------------------

> >     We thought about different implementations, but for the purpose of

> > this RFC we stick to the following approach.

> >

> >     On packet recirculation, a new flow has to be created with a list

> > of appropriate actions. At this

> >     stage we prepend a new action called 'ct(Skip)' to the original

> > list of actions. This

> >     new action is used as a temporary place-holder that will be over

> > written later on.

> >

> >        Original Actions: "Output:2"    ==>    New Actions: "ct(Skip),

> > Output:2"

> >

> >     For all the packets of an established conn. (i.e marked with

> > +trk,+est bits), do the 3 steps:

> >     1. Retrieve Flows from EMC.

> >         - FLOW_ORIG whose actions include:   ct(zone=N), recirc(id)

> >         - FLOW_RECIRC whose actions include: ct(Skip), Output:X

> >

> >     2. Update the actions of FLOW_RECIRC by overwriting ct(skip) with

> > the ct action from FLOW_ORIG, ie ct(zone=N).

> >         ct(Skip), Output:X  =>  ct(zone=N), Output:X

> >

> >     3. Update the flow pointer in EMC_ORIG entry to 'FLOW_RECIRC' flow.

> >

> >     In this way, we avoid recirculating the packets belonging to

> > established connections and

> >     this no ways skips sending the packets to the CT module.

> >

> >

> >     Antonio

> >

> >     > -----Original Message-----

> >     > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-

> >     > bounces@openvswitch.org] On Behalf Of antonio.fischetti@intel.com

> >     > Sent: Thursday, May 25, 2017 5:03 PM

> >     > To: dev@openvswitch.org

> >     > Subject: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for

> >     > established connections.

> >     >

> >     > From: Antonio Fischetti <antonio.fischetti@intel.com>

> >     >

> >     > With conntrack enabled, packets get recirculated and this impacts

> >     > the performance with thousands of active concurrent connections.

> >     >

> >     > This patch is aimed at avoiding recirculation for packets

> > belonging to

> >     > established connections in steady state. This is achieved by

> >     > manipulating the megaflows and actions. In this case the actions

> > of the

> >     > megaflow of recirculated packet is initially updated with new

> > 'CT_SKIP'

> >     > action and later updated with the actions of the megaflow of the

> >     > original packet(to handle zones). There after the EMC entry

> >     > belonging to the original packet is updated to point to the

> > 'megaflow' of

> >     > the recirculated packet there by avoiding recirculation.

> >     >

> >     > Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>

> >     > Signed-off-by: Bhanuprakash Bodireddy

> > <bhanuprakash.bodireddy@intel.com>

> >     > Co-authored-by: Bhanuprakash Bodireddy

> > <bhanuprakash.bodireddy@intel.com>

> >     > ---

> >     > - ~10% performance improvement is observed in our testing with UDP

> >     > streams.

> >     > - Limited testing is done with RFC patch and the tests include

> > hundreds of

> >     >   concurrent active TCP connections.

> >     > - This is very early implementation and we are posting here to

> > get initial

> >     >   feedback.

> >     > - This RFC patch is implemented leveraging EMC, but optimization

> > could be

> >     >   extended to dpcls as well to handle higher flows.

> >     > - No VXLAN testing is done with this patch.

> >     >

> >     >  datapath/linux/compat/include/linux/openvswitch.h |   1 +

> >     >  lib/dpif-netdev.c                                 | 148

> >     > ++++++++++++++++++++--

> >     >  lib/packets.h                                     |   2 +

> >     >  3 files changed, 143 insertions(+), 8 deletions(-)

> >     >

> >     > diff --git a/datapath/linux/compat/include/linux/openvswitch.h

> >     > b/datapath/linux/compat/include/linux/openvswitch.h

> >     > index d22102e..6dc5674 100644

> >     > --- a/datapath/linux/compat/include/linux/openvswitch.h

> >     > +++ b/datapath/linux/compat/include/linux/openvswitch.h

> >     > @@ -762,6 +762,7 @@ enum ovs_ct_attr {

> >     >  	OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */

> >     >  	OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */

> >     >  	OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */

> >     > +	OVS_CT_ATTR_SKIP,       /* No argument, does not invoke CT.

> > */

> >     >  	__OVS_CT_ATTR_MAX

> >     >  };

> >     >

> >     > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c

> >     > index b50164b..fbbb42e 100644

> >     > --- a/lib/dpif-netdev.c

> >     > +++ b/lib/dpif-netdev.c

> >     > @@ -2101,7 +2101,8 @@ emc_probabilistic_insert(struct

> > dp_netdev_pmd_thread

> >     > *pmd,

> >     >  }

> >     >

> >     >  static inline struct dp_netdev_flow *

> >     > -emc_lookup(struct emc_cache *cache, const struct

> > netdev_flow_key

> > *key)

> >     > +emc_lookup(struct emc_cache *cache, const struct

> > netdev_flow_key *key,

> >     > +        void **pfound_entry)

> >     >  {

> >     >      struct emc_entry *current_entry;

> >     >

> >     > @@ -2110,11 +2111,13 @@ emc_lookup(struct emc_cache *cache,

> > const struct

> >     > netdev_flow_key *key)

> >     >              && emc_entry_alive(current_entry)

> >     >              && netdev_flow_key_equal_mf(&current_entry->key, &key-

> > >mf)) {

> >     >

> >     > +            *pfound_entry = current_entry;

> >     >              /* We found the entry with the 'key->mf' miniflow */

> >     >              return current_entry->flow;

> >     >          }

> >     >      }

> >     >

> >     > +    *pfound_entry = NULL;

> >     >      return NULL;

> >     >  }

> >     >

> >     > @@ -4583,10 +4586,12 @@ emc_processing(struct

> > dp_netdev_pmd_thread *pmd,

> >     >          key->hash = dpif_netdev_packet_get_rss_hash(packet, &key-

> > >mf);

> >     >

> >     >          /* If EMC is disabled skip emc_lookup */

> >     > -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);

> >     > +        flow = (cur_min == 0) ? NULL:

> >     > +                emc_lookup(flow_cache, key, &packet-

> > >md.prev_emc_entry);

> >     >          if (OVS_LIKELY(flow)) {

> >     >              dp_netdev_queue_batches(packet, flow, &key->mf,

> > batches,

> >     >                                      n_batches);

> >     > +            packet->md.prev_flow = flow;

> >     >          } else {

> >     >              /* Exact match cache missed. Group missed packets

> > together at

> >     >               * the beginning of the 'packets' array. */

> >     > @@ -4595,6 +4600,7 @@ emc_processing(struct

> dp_netdev_pmd_thread

> > *pmd,

> >     >               * must be returned to the caller. The next key should

> > be

> >     > extracted

> >     >               * to 'keys[n_missed + 1]'. */

> >     >              key = &keys[++n_missed];

> >     > +            packet->md.prev_flow = NULL;

> >     >          }

> >     >      }

> >     >

> >     > @@ -4604,6 +4610,9 @@ emc_processing(struct

> dp_netdev_pmd_thread

> > *pmd,

> >     >      return dp_packet_batch_size(packets_);

> >     >  }

> >     >

> >     > +#define CT_PREPEND_ACTION_LEN 0x0C

> >     > +#define CT_PREPEND_ACTION_SPARE_LEN 32

> >     > +#define CT_TOT_ACTIONS_LEN (CT_PREPEND_ACTION_LEN +

> >     > CT_PREPEND_ACTION_SPARE_LEN)

> >     >  static inline void

> >     >  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,

> >     >                       struct dp_packet *packet,

> >     > @@ -4663,12 +4672,44 @@ handle_packet_upcall(struct

> > dp_netdev_pmd_thread

> >     > *pmd,

> >     >          ovs_mutex_lock(&pmd->flow_mutex);

> >     >          netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key, NULL);

> >     >          if (OVS_LIKELY(!netdev_flow)) {

> >     > -            netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,

> >     > -                                             add_actions->data,

> >     > -                                             add_actions->size);

> >     > +            /* Recirculated packets that belong to established

> >     > connections

> >     > +             * are treated differently.  A place-holder is

> > prepended to

> >     > the

> >     > +             * list of actions. */

> >     > +            if (!(packet->md.ct_state & CS_INVALID) &&

> >     > +                 (packet->md.ct_state & CS_TRACKED) &&

> >     > +                 (packet->md.ct_state & CS_ESTABLISHED) &&

> >     > +                 (packet->md.recirc_id)) {

> >     > +

> >

> >

> >     [Joe] "Just because the value of ct_state in the packet metadata

> > of this

> >     particular packet happens to have those particular state flags doesn't

> >     mean that the OpenFlow pipeline is enforcing this constraint for all

> >     packets hitting this flow. You would have to ensure that the mask in

> >     the flow is also ensuring that all future packets hitting the new flow

> >     will also have that particular ct_state."

> >

> >     This part of the code should prepend a ct(Skip) action to the

> >     FLOW_RECIRC only and just for packets that belong to an

> > established conn.

> >     The FLOW_ORIG shouldn't be altered, because it will be hit by all

> > ingress

> >     packets.

> >

> >     > +                /* Prepend an OVS_CT_ATTR_SKIP to the list of

> > actions

> >     > inside

> >     > +                 * add_actions.  It does not really invoke the

> > ConnTrack

> >     > module,

> >     > +                 * it's a Ct action that works as a place-holder.

> > It

> >     > will be

> >     > +                 * overwritten inside emc_patch_established_conn()

> > with

> >     > +                 * the proper ct(zone=X) action. */

> >     > +                struct dp_netdev_actions *new_actions;

> >     > +                uint8_t tmp_action[CT_TOT_ACTIONS_LEN] = {

> >     > +                        0x0C, 0x00, 0x0C, 0x00,

> >     > +                        0x06, 0x00, 0x09, 0x00,

> >     > +                        0x00, 0x00, 0x00, 0x00,

> >     > +                        /* will append here add_actions. */

> >     > +                };

> >     > +                memcpy(&tmp_action[CT_PREPEND_ACTION_LEN],

> >     > +                        add_actions->data,

> >     > +                        add_actions->size);

> >     > +                new_actions = dp_netdev_actions_create(

> >     > +                        (const struct nlattr *)tmp_action,

> >     > +                        CT_PREPEND_ACTION_LEN + add_actions->size);

> >

> >     [Joe] "Hmm. In some ways I think this would be cleaner to be

> > implemented in

> >     the xlate code, around compose_conntrack_action().. but that code is

> >     currently pretty much dataplane-oblivious and this suggestion is a

> >     userspace-specific optimization."

> >

> >

> >     > +                netdev_flow = dp_netdev_flow_add(pmd, &match,

> > &ufid,

> >     > +                                                 new_actions-

> > >actions,

> >     > +                                                 new_actions-

> > >size);

> >     > +            } else {

> >     > +                netdev_flow = dp_netdev_flow_add(pmd, &match,

> > &ufid,

> >     > +                                                 add_actions->data,

> >     > +                                                 add_actions-

> > >size);

> >     > +            }

> >     >          }

> >     >          ovs_mutex_unlock(&pmd->flow_mutex);

> >     > -        emc_probabilistic_insert(pmd, key, netdev_flow);

> >     > +        /* For this RFC, probabilistic emc insertion is disabled.

> > */

> >     > +        emc_insert(&pmd->flow_cache, key, netdev_flow);

> >     >      }

> >     >  }

> >     >

> >     > @@ -4761,7 +4802,8 @@ fast_path_processing(struct

> > dp_netdev_pmd_thread

> >     > *pmd,

> >     >

> >     >          flow = dp_netdev_flow_cast(rules[i]);

> >     >

> >     > -        emc_probabilistic_insert(pmd, &keys[i], flow);

> >     > +        /* For testing purposes the emc_insert is restored here. */

> >     > +        emc_insert(&pmd->flow_cache, &keys[i], flow);

> >     >          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,

> >     > n_batches);

> >     >      }

> >     >

> >     > @@ -4998,6 +5040,78 @@ dp_execute_userspace_action(struct

> >     > dp_netdev_pmd_thread *pmd,

> >     >      }

> >     >  }

> >     >

> >     > +/* Search into EMC to retrieve the entry related to the

> > original packet

> >     > + * and the entry related to the recirculated packet.

> >     > + * If both EMC entries are alive, then:

> >     > + *  - The flow actions of the recirculated packet are updated with

> >     > + *    'ct(zone=N)' as retrieved from the actions of the original

> > flow.

> >     > + *  - The EMC orig entry flow is updated with the flow pointer

> > of recirc

> >     > entry.

> >     > + */

> >     > +static inline void

> >     > +emc_patch_established_conn(struct dp_netdev_pmd_thread *pmd,

> >     > +        struct dp_packet *packet, long long now OVS_UNUSED)

> >     > +{

> >     > +    struct emc_cache *cache = &pmd->flow_cache;

> >     > +    struct netdev_flow_key key;

> >     > +    struct emc_entry *orig_entry; /* EMC entry hit by the original

> >     > packet. */

> >     > +    struct emc_entry *recirc_entry; /* EMC entry for recirculated

> > packet.

> >     > */

> >     > +    uint32_t old_hash;

> >     > +

> >     > +    if (!packet->md.prev_emc_entry) {

> >     > +        return;

> >     > +    }

> >     > +

> >     > +    orig_entry = packet->md.prev_emc_entry;

> >     > +    if (!emc_entry_alive(orig_entry)) {

> >     > +        return;

> >     > +    }

> >     > +

> >     > +    /* Check that the original EMC entry was not overwritten. */

> >     > +    struct dp_netdev_flow *orig_flow = orig_entry->flow;

> >     > +    if (packet->md.prev_flow && (packet->md.prev_flow !=

> > orig_flow)) {

> >     > +       return;

> >     > +    }

> >     > +

> >     > +    /* TODO as we are calling miniflow_extract now, can we avoid to

> >     > invoke

> >     > +     * it again when we'll classify this recirculated packet? */

> >     > +    miniflow_extract(packet, &key.mf);

> >     > +    key.len = 0; /* Not computed yet. */

> >     > +    old_hash = dp_packet_get_rss_hash(packet);

> >     > +    key.hash = dpif_netdev_packet_get_rss_hash(packet, NULL);

> >     > +    dp_packet_set_rss_hash(packet, old_hash);

> >     > +

> >     > +    EMC_FOR_EACH_POS_WITH_HASH(cache, recirc_entry, key.hash) {

> >     > +        if (recirc_entry->key.hash == key.hash

> >     > +            && emc_entry_alive(recirc_entry)

> >     > +            && netdev_flow_key_equal_mf(&recirc_entry->key,

> > &key.mf)) {

> >     > +

> >     > +            if (orig_entry->flow == recirc_entry->flow) {

> >     > +                /* Nothing to do, already patched by a packet of

> > this

> >     > +                 * same batch. */

> >     > +                return;

> >     > +            }

> >     > +            /* We found the entry related to the recirculated

> > packet.

> >     > +             * Retrieve the actions for orig and recirc entries. *

> > */

> >     > +            struct dp_netdev_actions * orig_actions =

> >     > +                    dp_netdev_flow_get_actions(orig_entry->flow);

> >     > +            struct dp_netdev_actions * recirc_actions =

> >     > +                    dp_netdev_flow_get_actions(recirc_entry->flow);

> >     > +

> >     > +            /* The orig entry action contains a CT action with the

> > zone

> >     > info. */

[Sugesh] is it safe to assume first action would be ct ??
> >     > +            struct nlattr *p = &orig_actions->actions[0];

> >     > +

> >     > +            /* Overwrite the CT Skip action of recirc entry with

> >     > ct(zone=N). */

> >     > +            memcpy(recirc_actions->actions, p, p->nla_len);

[Sugesh] What if recirc_action size is smaller than p? shouldn’t we make sure the size of skip action match with size of actions present in ct_table?

> >     > +

> >     > +            /* Update orig EMC entry. */

> >     > +            orig_entry->flow = recirc_entry->flow;

> >     > +            dp_netdev_flow_ref(orig_entry->flow);

> >     > +

> >     > +            return;

> >     > +        }

> >     > +    }

> >     > +}

> >     > +

> >     >  static void

> >     >  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,

> >     >                const struct nlattr *a, bool may_steal)

> >     > @@ -5024,7 +5138,6 @@ dp_execute_cb(void *aux_, struct

> > dp_packet_batch

> >     > *packets_,

> >     >              } else {

> >     >                  tx_qid = pmd->static_tx_qid;

> >     >              }

> >     > -

> >     >              netdev_send(p->port->netdev, tx_qid, packets_,

> > may_steal,

> >     >                          dynamic_txqs);

> >     >              return;

> >

> >     [Joe] "Unrelated style change."

> >

> >     Will do that.

> >

> >     > @@ -5145,10 +5258,21 @@ dp_execute_cb(void *aux_, struct

> > dp_packet_batch

> >     > *packets_,

> >     >              struct dp_packet *packet;

> >     >              DP_PACKET_BATCH_FOR_EACH (packet, packets_) {

> >     >                  packet->md.recirc_id = nl_attr_get_u32(a);

> >     > +

> >     > +                if (!(packet->md.ct_state & CS_INVALID) &&

> >     > +                     (packet->md.ct_state & CS_TRACKED) &&

> >     > +                     (packet->md.ct_state & CS_ESTABLISHED)) {

> >     > +                    (*depth)++;

> >     > +                    emc_patch_established_conn(pmd, packet, now);

> >     > +                    (*depth)--;

> >     > +                }

> >     >              }

> >     >

> >     >              (*depth)++;

> >     >              dp_netdev_recirculate(pmd, packets_);

> >     > +            DP_PACKET_BATCH_FOR_EACH (packet, packets_) {

> >     > +                packet->md.recirc_id = 0;

> >     > +            }

> >     >              (*depth)--;

> >     >

> >     >              return;

> >

> >     [Joe] "Can you give some more detail about why resetting the

> > recirc_id is

> >     correct here? Could/should this be applied as an independent change

> >     regardless of the rest of this series?"

> >

> >     This addition is needed so that in handle_packet_upcall we add the

> > ct(Skip)

> >     just to the FLOW_RECIRC and not to the FLOW_ORIG.

> >     I guess this could be also applied as an independent change?

> >

> >     > @@ -5166,6 +5290,7 @@ dp_execute_cb(void *aux_, struct

> > dp_packet_batch

> >     > *packets_,

> >     >          const char *helper = NULL;

> >     >          const uint32_t *setmark = NULL;

> >     >          const struct ovs_key_ct_labels *setlabel = NULL;

> >     > +        bool skip_ct = false;

> >     >

> >     >          NL_ATTR_FOR_EACH_UNSAFE (b, left, nl_attr_get(a),

> >     >                                   nl_attr_get_size(a)) {

> >     > @@ -5194,6 +5319,9 @@ dp_execute_cb(void *aux_, struct

> > dp_packet_batch

> >     > *packets_,

> >     >                  /* Silently ignored, as userspace datapath does not

> >     > generate

> >     >                   * netlink events. */

> >     >                  break;

> >     > +            case OVS_CT_ATTR_SKIP:

> >     > +                skip_ct = true;

> >     > +                break;

> >     >              case OVS_CT_ATTR_NAT:

> >     >              case OVS_CT_ATTR_UNSPEC:

> >     >              case __OVS_CT_ATTR_MAX:

> >     > @@ -5201,6 +5329,10 @@ dp_execute_cb(void *aux_, struct

> > dp_packet_batch

> >     > *packets_,

> >     >              }

> >     >          }

> >     >

> >     > +        if (OVS_UNLIKELY(skip_ct)) {

> >     > +            break;

> >     > +        }

> >     > +

> >     >          conntrack_execute(&dp->conntrack, packets_, aux->flow-

> > >dl_type,

> >     > force,

> >     >                            commit, zone, setmark, setlabel, helper);

> >     >          break;

> >     > diff --git a/lib/packets.h b/lib/packets.h

> >     > index 7dbf6dd..2c46a15 100644

> >     > --- a/lib/packets.h

> >     > +++ b/lib/packets.h

> >     > @@ -112,6 +112,8 @@ struct pkt_metadata {

> >     >      struct flow_tnl tunnel;     /* Encapsulating tunnel parameters.

> > Note

> >     > that

> >     >                                   * if 'ip_dst' == 0, the rest of

> > the

> >     > fields may

> >     >                                   * be uninitialized. */

[Sugesh] I don’t see the initiation of these fields. Have you verified the cost of these fields in pkt_metadata for normal traffic,?. I mean the performance impact.?
> >     > +    void *prev_emc_entry;       /* EMC entry that this packet

> > matched. */

> >     > +    void *prev_flow;            /* Flow pointed by the matching EMC

> >     > entry. */

> >     >  };

> >     >

> >     >  static inline void

> >     > --

> >     > 2.4.11

> >     >

[Sugesh] I cannot apply the patch on latest master due to some conflicts.
I am not sure , how does the stats get updated when you are combining the flows in datapath. Earlier there are two UUID for two flows and now only one UUID is being used. 
I don’t have a expertise in conntrack, However what would be the result when we have actions like

table=0, priority=100,ct_state=-trk,ip actions=clone(push_vlan(4), trunc(200), output:2), ct(table=1) 
table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2


Also another example would be 
table=0, priority=100,ct_state=-trk,ip actions=push_vlan(4), ct(table=1) ,dec_ttl
table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2

As Joe suggested before, I think the logic of combine will be good to be in the ofproto-xlate layer than here. Though its userspace specific, it could be the same way how the userspace_tunnel handled there.


Also another concern is on the actions that does fields modification along with connection tracking. Probably it’s a valid use case with NAT. In this scenario, if the flow fields are changed by action, how does it possible to track back the original flow?
Or am I missing anything ?



> >     > _______________________________________________

> >     > dev mailing list

> >     > dev@openvswitch.org

> >     > https://urldefense.proofpoint.com/v2/url?u=https-

> > 3A__mail.openvswitch.org_mailman_listinfo_ovs-

> > 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-

> > uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-

> > Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=

> >     _______________________________________________

> >     dev mailing list

> >     dev@openvswitch.org

> >     https://urldefense.proofpoint.com/v2/url?u=https-

> > 3A__mail.openvswitch.org_mailman_listinfo_ovs-

> > 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-

> > uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-

> > Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=

> >

> 

> _______________________________________________

> dev mailing list

> dev@openvswitch.org

> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Fischetti, Antonio July 20, 2017, 3:24 p.m. UTC | #5
Thanks Sugesh for your comments, pls see replies inline.
-Antonio

> -----Original Message-----

> From: Chandran, Sugesh

> Sent: Friday, June 23, 2017 4:20 PM

> To: Fischetti, Antonio <antonio.fischetti@intel.com>; dev@openvswitch.org

> Subject: RE: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for

> established connections.

> 

> Hi Antonio,

> 

> Thank you for the patches,

> 

> Please find my comments below.

> 

> 

> 

> Regards

> _Sugesh

> 

> 

> > -----Original Message-----

> > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-

> > bounces@openvswitch.org] On Behalf Of Fischetti, Antonio

> > Sent: Tuesday, June 6, 2017 5:15 PM

> > To: Darrell Ball <dball@vmware.com>; dev@openvswitch.org

> > Subject: Re: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for

> > established connections.

> >

> > Thanks Darrel for your useful comments, I've tried to replicate the

> usecases

> > you mentioned, please find inline some details.

> >

> > Regards,

> > Antonio

> >

> > > -----Original Message-----

> > > From: Darrell Ball [mailto:dball@vmware.com]

> > > Sent: Thursday, June 1, 2017 6:20 PM

> > > To: Fischetti, Antonio <antonio.fischetti@intel.com>;

> > > dev@openvswitch.org

> > > Subject: Re: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation

> > > for established connections.

> > >

> > > Comments inline

> > >

> > > On 5/29/17, 8:24 AM, "ovs-dev-bounces@openvswitch.org on behalf of

> > > Fischetti, Antonio" <ovs-dev-bounces@openvswitch.org on behalf of

> > > antonio.fischetti@intel.com> wrote:

> > >

> > >     Thanks Joe for your feedback and the interesting insights in

> > > conntrack in your earlier communication.

> > >     We have added all the details that we considered for this first

> > > implementation. Also, some answers are inline.

> > >

> > >     The purpose of this implementation is to avoid recirculation just

> > > for those packets that are part of established connections.

> > >

> > >     This shouldn't affect the packet recirculation for actions other

> > > than conntrack. For example in MPLS, after a pop_mpls action the

> > > packet will still be recirculated to follow the usual datapath.

> > >

> > >     Most importantly, the CT module isn't by-passed in this

> > > implementation.

> > >

> > >     Besides the recirculation, all other action[s] shall be executed

> > > as-is on each packet.

> > >     Any new CT change or action set by the controller will be managed

> > > as usual.

> > >

> > >     For our testing we set up a simple firewall, below are the flows.

> > >

> > >

> > >      Flow Setup

> > >      ----------

> > >     table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)

> > >     table=0, priority=10,arp actions=NORMAL

> > >     table=0, priority=1 actions=drop

> > >     table=1, ct_state=+new+trk,ip,in_port=1

> actions=ct(commit),output:2

> > >     table=1, ct_state=+new+trk,ip,in_port=2 actions=drop

> > >     table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2

> > >     table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1

> > >

> > >

> > >      Basic idea

> > >      ----------

> > >     With the above rules, all packets belonging to an established

> > > connection will first hit the flow

> > >     "ct_state=-trk,ip actions=ct(table=1)"

> > >

> > >     then on recirculation, they will hit

> > >     "ct_state=+est+trk,ip,in_port=.. actions=output:X".

> > >

> > >     The basic idea is to do the following 2 steps.

> > >     1. Combine the two sets of actions by removing the recirculation.

> > >        a) Original actions:

> > >         - "ct(zone=N), recirc(id)" [ i.e ct(table=1) ]

> > >         - "output:X"

> > >        b) Combined Actions after Removing recirculation:

> > >         - "ct(zone=N), output:X".

> > >

> > >     2. All the subsequent packets shall hit a flow with the combined

> > > actions.

> > >

> > >

> > > [Darrell]

> > >

> > > 1) This would be constraining on how rules are written such that we

> > > can’t go back to

> > >               the beginning of the pipeline, just because a packet is

> > > established.

> > >               Meaning we can’t run stateless or stateful rules earlier

> > > in the pipeline.

> > >               There are lots of possible combinations.

> > >               One case is:

> > >               We may run a bunch of firewall checks in table X, later

> > > on (table X+1) run the non-filtered packets through conntrack

> > >               and then want to go back to the firewall table.

> > >

> >

> > [Antonio]

> > I've tried to replicate the case you described. I've used the following

> set of

> > rules. They may have no practical meaning, the purpose is to have - at

> least

> > for packets from port 1 - that they traverse tables #0 => 1 => 2 and

> then are

> > sent back to table#0.

> >

> > NXST_FLOW reply (xid=0x4):

> >  table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)  table=1,

> > ct_state=+new+trk,ip,in_port=1,nw_src=10.10.10.0/24

> > actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1

> > actions=mod_nw_src:3.3.3.3,resubmit(,2)

> >  table=2, ct_state=+est+trk,tcp,in_port=1 actions=drop  table=2,

> > ct_state=+est+trk,udp,in_port=1 actions=resubmit(1,0)  table=0,

> > ct_state=+est+trk,udp,in_port=1,nw_src=3.3.3.3 actions=output:2

> table=1,

> > ct_state=+new+trk,ip,in_port=2 actions=drop  table=1,

> > ct_state=+est+trk,ip,in_port=2 actions=output:1  table=0,

> priority=10,arp

> > actions=NORMAL

> >

> > For packets from port1:

> >  - table#0: untracked packets go through the CT, then are recirculated,

> ie go

> > to table#1.

> >  - table#1: new conn are committed.

> >  - table#1: packets of est conn are modified in the IPsrc that becomes

> 3.3.3.3,

> >             then sent to table#2.

> >  - table#2: TCP packets are dropped - but I'm sending just UDP traffic.

> >  - table#2: UDP packets go back to table#0.

> >  - table#0: est conn packets with IPsrc=3.3.3.3 are sent to port2.

> >

> > In this case we still see that the packets of est conn will skip one

> recirculation.

> >

> > In fact what happens is the following.

> > For packets belonging to est connections - and coming from

> > port1 - the actions

> >    mod_nw_src:3.3.3.3,resubmit(,2)

> > become

> >    ct(),mod_nw_src:3.3.3.3,resubmit(,2)

> >

> > and the EMC entry that was hit by the original packet will point to the

> > megaflow

> >    table=1, ct_state=+est+trk,ip,in_port=1

> > actions=ct(),mod_nw_src:3.3.3.3,resubmit(,2)

> > So new packets of the same connection will skip one recirculation.

> >

> > For packets from port2 the megaflow

> >    table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1 becomes

> >    table=1, ct_state=+est+trk,ip,in_port=2 actions=ct(),output:1 and

> will be

> > pointed by the EMC entry hit by the original packet.

> > So new packets of the same connection will skip one recirculation.

> >

> > Is this usecase somewhat similar to the one you were mentioning?

> > Or maybe you can suggest some other rule setup?

> >

> >

> > > 2) This allows the dataplane to override what the ofproto layer

> specifies.

> > > This looks like a layering violation.

> > >

> > > I guess this also speaks to the datapath (userspace) specific design

> here.

> > >

> > >

> > > I don’t think it matters what the performance advantage is, although

> > > it is small in this case.

> > >

> > >

> > >

> > >

> > >      Considerations

> > >      --------------

> > >     On EMC entries:

> > >     All packets hitting a connection in the same direction will have

> > > the same 5-tuple.

> > >     They shall all hit the same EMC entry - here referred as

> > > 'EMC_ORIG' - and point to the flow

> > >     "ct_state=-trk,ip actions=ct(table=1)", referred from here as

> > > 'FLOW_ORIG'.

> > >

> > >     On recirculation the packets shall hit another EMC entry -

> > > referred as 'EMC_RECIRC' - and point to the flow

> > >     "ct_state=+est+trk,ip,in_port=.. actions=output:X",

> > >     referred from here as 'FLOW_RECIRC'.

> > >

> > >     FLOW_ORIG is the flow hit by the ingress packet while FLOW_RECIRC

> > > is the flow hit

> > >     by the recirculated packet.

> > >

> > >

> > >      Implementation Details

> > >      ----------------------

> > >     We thought about different implementations, but for the purpose of

> > > this RFC we stick to the following approach.

> > >

> > >     On packet recirculation, a new flow has to be created with a list

> > > of appropriate actions. At this

> > >     stage we prepend a new action called 'ct(Skip)' to the original

> > > list of actions. This

> > >     new action is used as a temporary place-holder that will be over

> > > written later on.

> > >

> > >        Original Actions: "Output:2"    ==>    New Actions: "ct(Skip),

> > > Output:2"

> > >

> > >     For all the packets of an established conn. (i.e marked with

> > > +trk,+est bits), do the 3 steps:

> > >     1. Retrieve Flows from EMC.

> > >         - FLOW_ORIG whose actions include:   ct(zone=N), recirc(id)

> > >         - FLOW_RECIRC whose actions include: ct(Skip), Output:X

> > >

> > >     2. Update the actions of FLOW_RECIRC by overwriting ct(skip) with

> > > the ct action from FLOW_ORIG, ie ct(zone=N).

> > >         ct(Skip), Output:X  =>  ct(zone=N), Output:X

> > >

> > >     3. Update the flow pointer in EMC_ORIG entry to 'FLOW_RECIRC'

> flow.

> > >

> > >     In this way, we avoid recirculating the packets belonging to

> > > established connections and

> > >     this no ways skips sending the packets to the CT module.

> > >

> > >

> > >     Antonio

> > >

> > >     > -----Original Message-----

> > >     > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-

> > >     > bounces@openvswitch.org] On Behalf Of

> antonio.fischetti@intel.com

> > >     > Sent: Thursday, May 25, 2017 5:03 PM

> > >     > To: dev@openvswitch.org

> > >     > Subject: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation

> for

> > >     > established connections.

> > >     >

> > >     > From: Antonio Fischetti <antonio.fischetti@intel.com>

> > >     >

> > >     > With conntrack enabled, packets get recirculated and this

> impacts

> > >     > the performance with thousands of active concurrent connections.

> > >     >

> > >     > This patch is aimed at avoiding recirculation for packets

> > > belonging to

> > >     > established connections in steady state. This is achieved by

> > >     > manipulating the megaflows and actions. In this case the actions

> > > of the

> > >     > megaflow of recirculated packet is initially updated with new

> > > 'CT_SKIP'

> > >     > action and later updated with the actions of the megaflow of the

> > >     > original packet(to handle zones). There after the EMC entry

> > >     > belonging to the original packet is updated to point to the

> > > 'megaflow' of

> > >     > the recirculated packet there by avoiding recirculation.

> > >     >

> > >     > Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>

> > >     > Signed-off-by: Bhanuprakash Bodireddy

> > > <bhanuprakash.bodireddy@intel.com>

> > >     > Co-authored-by: Bhanuprakash Bodireddy

> > > <bhanuprakash.bodireddy@intel.com>

> > >     > ---

> > >     > - ~10% performance improvement is observed in our testing with

> UDP

> > >     > streams.

> > >     > - Limited testing is done with RFC patch and the tests include

> > > hundreds of

> > >     >   concurrent active TCP connections.

> > >     > - This is very early implementation and we are posting here to

> > > get initial

> > >     >   feedback.

> > >     > - This RFC patch is implemented leveraging EMC, but optimization

> > > could be

> > >     >   extended to dpcls as well to handle higher flows.

> > >     > - No VXLAN testing is done with this patch.

> > >     >

> > >     >  datapath/linux/compat/include/linux/openvswitch.h |   1 +

> > >     >  lib/dpif-netdev.c                                 | 148

> > >     > ++++++++++++++++++++--

> > >     >  lib/packets.h                                     |   2 +

> > >     >  3 files changed, 143 insertions(+), 8 deletions(-)

> > >     >

> > >     > diff --git a/datapath/linux/compat/include/linux/openvswitch.h

> > >     > b/datapath/linux/compat/include/linux/openvswitch.h

> > >     > index d22102e..6dc5674 100644

> > >     > --- a/datapath/linux/compat/include/linux/openvswitch.h

> > >     > +++ b/datapath/linux/compat/include/linux/openvswitch.h

> > >     > @@ -762,6 +762,7 @@ enum ovs_ct_attr {

> > >     >  	OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */

> > >     >  	OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */

> > >     >  	OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */

> > >     > +	OVS_CT_ATTR_SKIP,       /* No argument, does not invoke CT.

> > > */

> > >     >  	__OVS_CT_ATTR_MAX

> > >     >  };

> > >     >

> > >     > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c

> > >     > index b50164b..fbbb42e 100644

> > >     > --- a/lib/dpif-netdev.c

> > >     > +++ b/lib/dpif-netdev.c

> > >     > @@ -2101,7 +2101,8 @@ emc_probabilistic_insert(struct

> > > dp_netdev_pmd_thread

> > >     > *pmd,

> > >     >  }

> > >     >

> > >     >  static inline struct dp_netdev_flow *

> > >     > -emc_lookup(struct emc_cache *cache, const struct

> > > netdev_flow_key

> > > *key)

> > >     > +emc_lookup(struct emc_cache *cache, const struct

> > > netdev_flow_key *key,

> > >     > +        void **pfound_entry)

> > >     >  {

> > >     >      struct emc_entry *current_entry;

> > >     >

> > >     > @@ -2110,11 +2111,13 @@ emc_lookup(struct emc_cache *cache,

> > > const struct

> > >     > netdev_flow_key *key)

> > >     >              && emc_entry_alive(current_entry)

> > >     >              && netdev_flow_key_equal_mf(&current_entry->key,

> &key-

> > > >mf)) {

> > >     >

> > >     > +            *pfound_entry = current_entry;

> > >     >              /* We found the entry with the 'key->mf' miniflow

> */

> > >     >              return current_entry->flow;

> > >     >          }

> > >     >      }

> > >     >

> > >     > +    *pfound_entry = NULL;

> > >     >      return NULL;

> > >     >  }

> > >     >

> > >     > @@ -4583,10 +4586,12 @@ emc_processing(struct

> > > dp_netdev_pmd_thread *pmd,

> > >     >          key->hash = dpif_netdev_packet_get_rss_hash(packet,

> &key-

> > > >mf);

> > >     >

> > >     >          /* If EMC is disabled skip emc_lookup */

> > >     > -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache,

> key);

> > >     > +        flow = (cur_min == 0) ? NULL:

> > >     > +                emc_lookup(flow_cache, key, &packet-

> > > >md.prev_emc_entry);

> > >     >          if (OVS_LIKELY(flow)) {

> > >     >              dp_netdev_queue_batches(packet, flow, &key->mf,

> > > batches,

> > >     >                                      n_batches);

> > >     > +            packet->md.prev_flow = flow;

> > >     >          } else {

> > >     >              /* Exact match cache missed. Group missed packets

> > > together at

> > >     >               * the beginning of the 'packets' array. */

> > >     > @@ -4595,6 +4600,7 @@ emc_processing(struct

> > dp_netdev_pmd_thread

> > > *pmd,

> > >     >               * must be returned to the caller. The next key

> should

> > > be

> > >     > extracted

> > >     >               * to 'keys[n_missed + 1]'. */

> > >     >              key = &keys[++n_missed];

> > >     > +            packet->md.prev_flow = NULL;

> > >     >          }

> > >     >      }

> > >     >

> > >     > @@ -4604,6 +4610,9 @@ emc_processing(struct

> > dp_netdev_pmd_thread

> > > *pmd,

> > >     >      return dp_packet_batch_size(packets_);

> > >     >  }

> > >     >

> > >     > +#define CT_PREPEND_ACTION_LEN 0x0C

> > >     > +#define CT_PREPEND_ACTION_SPARE_LEN 32

> > >     > +#define CT_TOT_ACTIONS_LEN (CT_PREPEND_ACTION_LEN +

> > >     > CT_PREPEND_ACTION_SPARE_LEN)

> > >     >  static inline void

> > >     >  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,

> > >     >                       struct dp_packet *packet,

> > >     > @@ -4663,12 +4672,44 @@ handle_packet_upcall(struct

> > > dp_netdev_pmd_thread

> > >     > *pmd,

> > >     >          ovs_mutex_lock(&pmd->flow_mutex);

> > >     >          netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key,

> NULL);

> > >     >          if (OVS_LIKELY(!netdev_flow)) {

> > >     > -            netdev_flow = dp_netdev_flow_add(pmd, &match,

> &ufid,

> > >     > -                                             add_actions->data,

> > >     > -                                             add_actions-

> >size);

> > >     > +            /* Recirculated packets that belong to established

> > >     > connections

> > >     > +             * are treated differently.  A place-holder is

> > > prepended to

> > >     > the

> > >     > +             * list of actions. */

> > >     > +            if (!(packet->md.ct_state & CS_INVALID) &&

> > >     > +                 (packet->md.ct_state & CS_TRACKED) &&

> > >     > +                 (packet->md.ct_state & CS_ESTABLISHED) &&

> > >     > +                 (packet->md.recirc_id)) {

> > >     > +

> > >

> > >

> > >     [Joe] "Just because the value of ct_state in the packet metadata

> > > of this

> > >     particular packet happens to have those particular state flags

> doesn't

> > >     mean that the OpenFlow pipeline is enforcing this constraint for

> all

> > >     packets hitting this flow. You would have to ensure that the mask

> in

> > >     the flow is also ensuring that all future packets hitting the new

> flow

> > >     will also have that particular ct_state."

> > >

> > >     This part of the code should prepend a ct(Skip) action to the

> > >     FLOW_RECIRC only and just for packets that belong to an

> > > established conn.

> > >     The FLOW_ORIG shouldn't be altered, because it will be hit by all

> > > ingress

> > >     packets.

> > >

> > >     > +                /* Prepend an OVS_CT_ATTR_SKIP to the list of

> > > actions

> > >     > inside

> > >     > +                 * add_actions.  It does not really invoke the

> > > ConnTrack

> > >     > module,

> > >     > +                 * it's a Ct action that works as a place-

> holder.

> > > It

> > >     > will be

> > >     > +                 * overwritten inside

> emc_patch_established_conn()

> > > with

> > >     > +                 * the proper ct(zone=X) action. */

> > >     > +                struct dp_netdev_actions *new_actions;

> > >     > +                uint8_t tmp_action[CT_TOT_ACTIONS_LEN] = {

> > >     > +                        0x0C, 0x00, 0x0C, 0x00,

> > >     > +                        0x06, 0x00, 0x09, 0x00,

> > >     > +                        0x00, 0x00, 0x00, 0x00,

> > >     > +                        /* will append here add_actions. */

> > >     > +                };

> > >     > +                memcpy(&tmp_action[CT_PREPEND_ACTION_LEN],

> > >     > +                        add_actions->data,

> > >     > +                        add_actions->size);

> > >     > +                new_actions = dp_netdev_actions_create(

> > >     > +                        (const struct nlattr *)tmp_action,

> > >     > +                        CT_PREPEND_ACTION_LEN + add_actions-

> >size);

> > >

> > >     [Joe] "Hmm. In some ways I think this would be cleaner to be

> > > implemented in

> > >     the xlate code, around compose_conntrack_action().. but that code

> is

> > >     currently pretty much dataplane-oblivious and this suggestion is a

> > >     userspace-specific optimization."

> > >

> > >

> > >     > +                netdev_flow = dp_netdev_flow_add(pmd, &match,

> > > &ufid,

> > >     > +                                                 new_actions-

> > > >actions,

> > >     > +                                                 new_actions-

> > > >size);

> > >     > +            } else {

> > >     > +                netdev_flow = dp_netdev_flow_add(pmd, &match,

> > > &ufid,

> > >     > +                                                 add_actions-

> >data,

> > >     > +                                                 add_actions-

> > > >size);

> > >     > +            }

> > >     >          }

> > >     >          ovs_mutex_unlock(&pmd->flow_mutex);

> > >     > -        emc_probabilistic_insert(pmd, key, netdev_flow);

> > >     > +        /* For this RFC, probabilistic emc insertion is

> disabled.

> > > */

> > >     > +        emc_insert(&pmd->flow_cache, key, netdev_flow);

> > >     >      }

> > >     >  }

> > >     >

> > >     > @@ -4761,7 +4802,8 @@ fast_path_processing(struct

> > > dp_netdev_pmd_thread

> > >     > *pmd,

> > >     >

> > >     >          flow = dp_netdev_flow_cast(rules[i]);

> > >     >

> > >     > -        emc_probabilistic_insert(pmd, &keys[i], flow);

> > >     > +        /* For testing purposes the emc_insert is restored

> here. */

> > >     > +        emc_insert(&pmd->flow_cache, &keys[i], flow);

> > >     >          dp_netdev_queue_batches(packet, flow, &keys[i].mf,

> batches,

> > >     > n_batches);

> > >     >      }

> > >     >

> > >     > @@ -4998,6 +5040,78 @@ dp_execute_userspace_action(struct

> > >     > dp_netdev_pmd_thread *pmd,

> > >     >      }

> > >     >  }

> > >     >

> > >     > +/* Search into EMC to retrieve the entry related to the

> > > original packet

> > >     > + * and the entry related to the recirculated packet.

> > >     > + * If both EMC entries are alive, then:

> > >     > + *  - The flow actions of the recirculated packet are updated

> with

> > >     > + *    'ct(zone=N)' as retrieved from the actions of the

> original

> > > flow.

> > >     > + *  - The EMC orig entry flow is updated with the flow pointer

> > > of recirc

> > >     > entry.

> > >     > + */

> > >     > +static inline void

> > >     > +emc_patch_established_conn(struct dp_netdev_pmd_thread *pmd,

> > >     > +        struct dp_packet *packet, long long now OVS_UNUSED)

> > >     > +{

> > >     > +    struct emc_cache *cache = &pmd->flow_cache;

> > >     > +    struct netdev_flow_key key;

> > >     > +    struct emc_entry *orig_entry; /* EMC entry hit by the

> original

> > >     > packet. */

> > >     > +    struct emc_entry *recirc_entry; /* EMC entry for

> recirculated

> > > packet.

> > >     > */

> > >     > +    uint32_t old_hash;

> > >     > +

> > >     > +    if (!packet->md.prev_emc_entry) {

> > >     > +        return;

> > >     > +    }

> > >     > +

> > >     > +    orig_entry = packet->md.prev_emc_entry;

> > >     > +    if (!emc_entry_alive(orig_entry)) {

> > >     > +        return;

> > >     > +    }

> > >     > +

> > >     > +    /* Check that the original EMC entry was not overwritten.

> */

> > >     > +    struct dp_netdev_flow *orig_flow = orig_entry->flow;

> > >     > +    if (packet->md.prev_flow && (packet->md.prev_flow !=

> > > orig_flow)) {

> > >     > +       return;

> > >     > +    }

> > >     > +

> > >     > +    /* TODO as we are calling miniflow_extract now, can we

> avoid to

> > >     > invoke

> > >     > +     * it again when we'll classify this recirculated packet?

> */

> > >     > +    miniflow_extract(packet, &key.mf);

> > >     > +    key.len = 0; /* Not computed yet. */

> > >     > +    old_hash = dp_packet_get_rss_hash(packet);

> > >     > +    key.hash = dpif_netdev_packet_get_rss_hash(packet, NULL);

> > >     > +    dp_packet_set_rss_hash(packet, old_hash);

> > >     > +

> > >     > +    EMC_FOR_EACH_POS_WITH_HASH(cache, recirc_entry, key.hash) {

> > >     > +        if (recirc_entry->key.hash == key.hash

> > >     > +            && emc_entry_alive(recirc_entry)

> > >     > +            && netdev_flow_key_equal_mf(&recirc_entry->key,

> > > &key.mf)) {

> > >     > +

> > >     > +            if (orig_entry->flow == recirc_entry->flow) {

> > >     > +                /* Nothing to do, already patched by a packet

> of

> > > this

> > >     > +                 * same batch. */

> > >     > +                return;

> > >     > +            }

> > >     > +            /* We found the entry related to the recirculated

> > > packet.

> > >     > +             * Retrieve the actions for orig and recirc

> entries. *

> > > */

> > >     > +            struct dp_netdev_actions * orig_actions =

> > >     > +                    dp_netdev_flow_get_actions(orig_entry-

> >flow);

> > >     > +            struct dp_netdev_actions * recirc_actions =

> > >     > +                    dp_netdev_flow_get_actions(recirc_entry-

> >flow);

> > >     > +

> > >     > +            /* The orig entry action contains a CT action with

> the

> > > zone

> > >     > info. */

> [Sugesh] is it safe to assume first action would be ct ??


[Antonio] Right, in general it could be something different than ct.
For this RFC I'm assuming a simplified case.


> > >     > +            struct nlattr *p = &orig_actions->actions[0];

> > >     > +

> > >     > +            /* Overwrite the CT Skip action of recirc entry

> with

> > >     > ct(zone=N). */

> > >     > +            memcpy(recirc_actions->actions, p, p->nla_len);

> [Sugesh] What if recirc_action size is smaller than p? shouldn’t we make

> sure the size of skip action match with size of actions present in

> ct_table?


[Antonio] The skip action is just a place-holder to accommodate the list of
ct actions from the orig flow. It should be sufficient that the skip action
is bigger. 
Worth remember that this solution is a quick implementation of the basic idea,
the purpose was to share the approach to better analyze all possible pros/cons
and corner cases.
A better implementation could be for example - instead of tampering the 
actions of the recirc flow - to create a brand new flow which
has the combined actions picked up from the orig and the recirc flow.


> 

> > >     > +

> > >     > +            /* Update orig EMC entry. */

> > >     > +            orig_entry->flow = recirc_entry->flow;

> > >     > +            dp_netdev_flow_ref(orig_entry->flow);

> > >     > +

> > >     > +            return;

> > >     > +        }

> > >     > +    }

> > >     > +}

> > >     > +

> > >     >  static void

> > >     >  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,

> > >     >                const struct nlattr *a, bool may_steal)

> > >     > @@ -5024,7 +5138,6 @@ dp_execute_cb(void *aux_, struct

> > > dp_packet_batch

> > >     > *packets_,

> > >     >              } else {

> > >     >                  tx_qid = pmd->static_tx_qid;

> > >     >              }

> > >     > -

> > >     >              netdev_send(p->port->netdev, tx_qid, packets_,

> > > may_steal,

> > >     >                          dynamic_txqs);

> > >     >              return;

> > >

> > >     [Joe] "Unrelated style change."

> > >

> > >     Will do that.

> > >

> > >     > @@ -5145,10 +5258,21 @@ dp_execute_cb(void *aux_, struct

> > > dp_packet_batch

> > >     > *packets_,

> > >     >              struct dp_packet *packet;

> > >     >              DP_PACKET_BATCH_FOR_EACH (packet, packets_) {

> > >     >                  packet->md.recirc_id = nl_attr_get_u32(a);

> > >     > +

> > >     > +                if (!(packet->md.ct_state & CS_INVALID) &&

> > >     > +                     (packet->md.ct_state & CS_TRACKED) &&

> > >     > +                     (packet->md.ct_state & CS_ESTABLISHED)) {

> > >     > +                    (*depth)++;

> > >     > +                    emc_patch_established_conn(pmd, packet,

> now);

> > >     > +                    (*depth)--;

> > >     > +                }

> > >     >              }

> > >     >

> > >     >              (*depth)++;

> > >     >              dp_netdev_recirculate(pmd, packets_);

> > >     > +            DP_PACKET_BATCH_FOR_EACH (packet, packets_) {

> > >     > +                packet->md.recirc_id = 0;

> > >     > +            }

> > >     >              (*depth)--;

> > >     >

> > >     >              return;

> > >

> > >     [Joe] "Can you give some more detail about why resetting the

> > > recirc_id is

> > >     correct here? Could/should this be applied as an independent

> change

> > >     regardless of the rest of this series?"

> > >

> > >     This addition is needed so that in handle_packet_upcall we add the

> > > ct(Skip)

> > >     just to the FLOW_RECIRC and not to the FLOW_ORIG.

> > >     I guess this could be also applied as an independent change?

> > >

> > >     > @@ -5166,6 +5290,7 @@ dp_execute_cb(void *aux_, struct

> > > dp_packet_batch

> > >     > *packets_,

> > >     >          const char *helper = NULL;

> > >     >          const uint32_t *setmark = NULL;

> > >     >          const struct ovs_key_ct_labels *setlabel = NULL;

> > >     > +        bool skip_ct = false;

> > >     >

> > >     >          NL_ATTR_FOR_EACH_UNSAFE (b, left, nl_attr_get(a),

> > >     >                                   nl_attr_get_size(a)) {

> > >     > @@ -5194,6 +5319,9 @@ dp_execute_cb(void *aux_, struct

> > > dp_packet_batch

> > >     > *packets_,

> > >     >                  /* Silently ignored, as userspace datapath does

> not

> > >     > generate

> > >     >                   * netlink events. */

> > >     >                  break;

> > >     > +            case OVS_CT_ATTR_SKIP:

> > >     > +                skip_ct = true;

> > >     > +                break;

> > >     >              case OVS_CT_ATTR_NAT:

> > >     >              case OVS_CT_ATTR_UNSPEC:

> > >     >              case __OVS_CT_ATTR_MAX:

> > >     > @@ -5201,6 +5329,10 @@ dp_execute_cb(void *aux_, struct

> > > dp_packet_batch

> > >     > *packets_,

> > >     >              }

> > >     >          }

> > >     >

> > >     > +        if (OVS_UNLIKELY(skip_ct)) {

> > >     > +            break;

> > >     > +        }

> > >     > +

> > >     >          conntrack_execute(&dp->conntrack, packets_, aux->flow-

> > > >dl_type,

> > >     > force,

> > >     >                            commit, zone, setmark, setlabel,

> helper);

> > >     >          break;

> > >     > diff --git a/lib/packets.h b/lib/packets.h

> > >     > index 7dbf6dd..2c46a15 100644

> > >     > --- a/lib/packets.h

> > >     > +++ b/lib/packets.h

> > >     > @@ -112,6 +112,8 @@ struct pkt_metadata {

> > >     >      struct flow_tnl tunnel;     /* Encapsulating tunnel

> parameters.

> > > Note

> > >     > that

> > >     >                                   * if 'ip_dst' == 0, the rest

> of

> > > the

> > >     > fields may

> > >     >                                   * be uninitialized. */

> [Sugesh] I don’t see the initiation of these fields. Have you verified the

> cost of these fields in pkt_metadata for normal traffic,?. I mean the

> performance impact.?


[Antonio]
For sure this is an overhead for normal traffic.


> > >     > +    void *prev_emc_entry;       /* EMC entry that this packet

> > > matched. */

> > >     > +    void *prev_flow;            /* Flow pointed by the matching

> EMC

> > >     > entry. */

> > >     >  };

> > >     >

> > >     >  static inline void

> > >     > --

> > >     > 2.4.11

> > >     >

> [Sugesh] I cannot apply the patch on latest master due to some conflicts.

> I am not sure , how does the stats get updated when you are combining the

> flows in datapath. Earlier there are two UUID for two flows and now only

> one UUID is being used.

> I don’t have a expertise in conntrack, However what would be the result

> when we have actions like

> 

> table=0, priority=100,ct_state=-trk,ip actions=clone(push_vlan(4),

> trunc(200), output:2), ct(table=1)

> table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2

> 

> 

> Also another example would be

> table=0, priority=100,ct_state=-trk,ip actions=push_vlan(4), ct(table=1)

> ,dec_ttl

> table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2

> 


[Antonio]
In general whenever you have some actions that imply recirculation 'before'
the ct action
   actions=<other recirc here>, ct(zone=N,recirc)
it should work by combining
   <other recirc here>
   + ct(zone=N)		<=== recirc gets removed
   <actions from recirc flow>


> As Joe suggested before, I think the logic of combine will be good to be

> in the ofproto-xlate layer than here. Though its userspace specific, it

> could be the same way how the userspace_tunnel handled there.

> 


[Antonio] yes, that's interesting


> 

> Also another concern is on the actions that does fields modification along

> with connection tracking. Probably it’s a valid use case with NAT. In this

> scenario, if the flow fields are changed by action, how does it possible

> to track back the original flow?

> Or am I missing anything ?

> 

> 

> 

> > >     > _______________________________________________

> > >     > dev mailing list

> > >     > dev@openvswitch.org

> > >     > https://urldefense.proofpoint.com/v2/url?u=https-

> > > 3A__mail.openvswitch.org_mailman_listinfo_ovs-

> > > 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-

> > > uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-

> > > Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=

> > >     _______________________________________________

> > >     dev mailing list

> > >     dev@openvswitch.org

> > >     https://urldefense.proofpoint.com/v2/url?u=https-

> > > 3A__mail.openvswitch.org_mailman_listinfo_ovs-

> > > 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-

> > > uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-

> > > Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=

> > >

> >

> > _______________________________________________

> > dev mailing list

> > dev@openvswitch.org

> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Darrell Ball July 21, 2017, 1:41 a.m. UTC | #6
I did lose track of this patch; sorry.
I see Sugesh has added some other comments, but I’ll respond to the previous
comments to speed things along.

-----Original Message-----
From: "Fischetti, Antonio" <antonio.fischetti@intel.com>

Date: Tuesday, June 6, 2017 at 9:14 AM
To: Darrell Ball <dball@vmware.com>, "dev@openvswitch.org" <dev@openvswitch.org>
Subject: RE: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for established connections.

    Thanks Darrel for your useful comments, I've tried to replicate the
    usecases you mentioned, please find inline some details.
    
    Regards,
    Antonio
    
    > -----Original Message-----

    > From: Darrell Ball [mailto:dball@vmware.com]

    > Sent: Thursday, June 1, 2017 6:20 PM

    > To: Fischetti, Antonio <antonio.fischetti@intel.com>; dev@openvswitch.org

    > Subject: Re: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for

    > established connections.

    > 

    > Comments inline

    > 

    > On 5/29/17, 8:24 AM, "ovs-dev-bounces@openvswitch.org on behalf of

    > Fischetti, Antonio" <ovs-dev-bounces@openvswitch.org on behalf of

    > antonio.fischetti@intel.com> wrote:

    > 

    >     Thanks Joe for your feedback and the interesting insights in conntrack

    > in your earlier communication.

    >     We have added all the details that we considered for this first

    > implementation. Also, some answers are inline.

    > 

    >     The purpose of this implementation is to avoid recirculation just for

    > those packets that are part of established connections.

    > 

    >     This shouldn't affect the packet recirculation for actions other than

    > conntrack. For example in MPLS, after a pop_mpls action the packet will

    > still be recirculated to follow the usual datapath.

    > 

    >     Most importantly, the CT module isn't by-passed in this

    > implementation.

    > 

    >     Besides the recirculation, all other action[s] shall be executed as-is

    > on each packet.

    >     Any new CT change or action set by the controller will be managed as

    > usual.

    > 

    >     For our testing we set up a simple firewall, below are the flows.

    > 

    > 

    >      Flow Setup

    >      ----------

    >     table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)

    >     table=0, priority=10,arp actions=NORMAL

    >     table=0, priority=1 actions=drop

    >     table=1, ct_state=+new+trk,ip,in_port=1 actions=ct(commit),output:2

    >     table=1, ct_state=+new+trk,ip,in_port=2 actions=drop

    >     table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2

    >     table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1

    > 

    > 

    >      Basic idea

    >      ----------

    >     With the above rules, all packets belonging to an established

    > connection will first hit the flow

    >     "ct_state=-trk,ip actions=ct(table=1)"

    > 

    >     then on recirculation, they will hit

    >     "ct_state=+est+trk,ip,in_port=.. actions=output:X".

    > 

    >     The basic idea is to do the following 2 steps.

    >     1. Combine the two sets of actions by removing the recirculation.

    >        a) Original actions:

    >         - "ct(zone=N), recirc(id)" [ i.e ct(table=1) ]

    >         - "output:X"

    >        b) Combined Actions after Removing recirculation:

    >         - "ct(zone=N), output:X".

    > 

    >     2. All the subsequent packets shall hit a flow with the combined

    > actions.

    > 

    > 

    > [Darrell]

    > 

    > 1) This would be constraining on how rules are written such that we can’t

    > go back to

    >               the beginning of the pipeline, just because a packet is

    > established.

    >               Meaning we can’t run stateless or stateful rules earlier in

    > the pipeline.

    >               There are lots of possible combinations.

    >               One case is:

    >               We may run a bunch of firewall checks in table X, later on

    > (table X+1) run the non-filtered packets through conntrack

    >               and then want to go back to the firewall table.

    > 

    
    [Antonio]
    I've tried to replicate the case you described. I've used the 
    following set of rules. They may have no practical meaning,
    the purpose is to have - at least for packets from port 1 - that
    they traverse tables #0 => 1 => 2 and then are sent back to table#0.
    
    NXST_FLOW reply (xid=0x4):
     table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)
     table=1, ct_state=+new+trk,ip,in_port=1,nw_src=10.10.10.0/24 actions=ct(commit),output:2
     table=1, ct_state=+est+trk,ip,in_port=1 actions=mod_nw_src:3.3.3.3,resubmit(,2)
     table=2, ct_state=+est+trk,tcp,in_port=1 actions=drop
     table=2, ct_state=+est+trk,udp,in_port=1 actions=resubmit(1,0)
     table=0, ct_state=+est+trk,udp,in_port=1,nw_src=3.3.3.3 actions=output:2
     table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
     table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
     table=0, priority=10,arp actions=NORMAL

[Darrell]
Please take a look at the label
m4_define([CHECK_FTP_NAT_ORIG_TUPLE] in system-traffic.at.
This gives a very simple more realistic example.
Meaning it gets more realistically complex.

Going back to an earlier table is valid for example when that table does both ingress and egress ACLs.
However, that is really not the key point here. Let us simplify the discussion and just assume we have 
multiple ACL tables and other tables and packets always go from x to >x table.
////////////

    For packets from port1:
     - table#0: untracked packets go through the CT, then are recirculated, ie go to table#1.
     - table#1: new conn are committed.
     - table#1: packets of est conn are modified in the IPsrc that becomes 3.3.3.3,
                then sent to table#2.
     - table#2: TCP packets are dropped - but I'm sending just UDP traffic.
     - table#2: UDP packets go back to table#0.
     - table#0: est conn packets with IPsrc=3.3.3.3 are sent to port2.
    
    In this case we still see that the packets of est conn will skip one
    recirculation.
    
    In fact what happens is the following.
    For packets belonging to est connections - and coming from
    port1 - the actions 
       mod_nw_src:3.3.3.3,resubmit(,2)
    become
       ct(),mod_nw_src:3.3.3.3,resubmit(,2)

[Darrell]
In general, when packets go thru the connection tracker, metadata and the packet
itself can be modified. One example is the mark or label can be modified.
After modification, we can use the new mark/label to do egress ACL for example.
In your example, we don’t even have the opportunity to match on the new mark/label.
Here is another case, when we go thru the connection tracker, we modify the dst IP.
We can have another rule that checks for a combination of src IP and dst IP and then sets a label.
The label can later be used to set a TOS.
The possibilities are endless.
Your optimization for EST packets is one small subset of what can happen and a very simple one.
It basically assumes that the ct action for EST does not change the metadata and/or packet in a way
that is of interest later in the pipeline. 

The staging of ct is where lots of power lies. Trying to compress even for a few cases is:
1) Not going to provide optimizations for realistically complex cases.
2) Is going to make the code hard to maintain.
3) Is going to break is various corner cases.
////////////
    
    and the EMC entry that was hit by the original packet will point to the megaflow
       table=1, ct_state=+est+trk,ip,in_port=1 actions=ct(),mod_nw_src:3.3.3.3,resubmit(,2)
    So new packets of the same connection will skip one recirculation.
    
    For packets from port2 the megaflow
       table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
    becomes
       table=1, ct_state=+est+trk,ip,in_port=2 actions=ct(),output:1
    and will be pointed by the EMC entry hit by the original packet.
    So new packets of the same connection will skip one recirculation.

[Darrell]
Similar comments to above.
////////////

    
    Is this usecase somewhat similar to the one you were mentioning? 
    Or maybe you can suggest some other rule setup?
    
    
    > 2) This allows the dataplane to override what the ofproto layer specifies.

    > This looks like a layering violation.

    > 

    > I guess this also speaks to the datapath (userspace) specific design here.

    > 

    > 

    > I don’t think it matters what the performance advantage is, although it is

    > small in this case.

    > 

    > 

    > 

    > 

    >      Considerations

    >      --------------

    >     On EMC entries:

    >     All packets hitting a connection in the same direction will have the

    > same 5-tuple.

    >     They shall all hit the same EMC entry - here referred as 'EMC_ORIG' -

    > and point to the flow

    >     "ct_state=-trk,ip actions=ct(table=1)", referred from here as

    > 'FLOW_ORIG'.

    > 

    >     On recirculation the packets shall hit another EMC entry - referred as

    > 'EMC_RECIRC' - and point to the flow

    >     "ct_state=+est+trk,ip,in_port=.. actions=output:X",

    >     referred from here as 'FLOW_RECIRC'.

    > 

    >     FLOW_ORIG is the flow hit by the ingress packet while FLOW_RECIRC is

    > the flow hit

    >     by the recirculated packet.

    > 

    > 

    >      Implementation Details

    >      ----------------------

    >     We thought about different implementations, but for the purpose of

    > this RFC we stick to the following approach.

    > 

    >     On packet recirculation, a new flow has to be created with a list of

    > appropriate actions. At this

    >     stage we prepend a new action called 'ct(Skip)' to the original list

    > of actions. This

    >     new action is used as a temporary place-holder that will be over

    > written later on.

    > 

    >        Original Actions: "Output:2"    ==>    New Actions: "ct(Skip),

    > Output:2"

    > 

    >     For all the packets of an established conn. (i.e marked with +trk,+est

    > bits), do the 3 steps:

    >     1. Retrieve Flows from EMC.

    >         - FLOW_ORIG whose actions include:   ct(zone=N), recirc(id)

    >         - FLOW_RECIRC whose actions include: ct(Skip), Output:X

    > 

    >     2. Update the actions of FLOW_RECIRC by overwriting ct(skip) with the

    > ct action from FLOW_ORIG, ie ct(zone=N).

    >         ct(Skip), Output:X  =>  ct(zone=N), Output:X

    > 

    >     3. Update the flow pointer in EMC_ORIG entry to 'FLOW_RECIRC' flow.

    > 

    >     In this way, we avoid recirculating the packets belonging to

    > established connections and

    >     this no ways skips sending the packets to the CT module.

    > 

    > 

    >     Antonio

    > 

    >     > -----Original Message-----

    >     > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-

    >     > bounces@openvswitch.org] On Behalf Of antonio.fischetti@intel.com

    >     > Sent: Thursday, May 25, 2017 5:03 PM

    >     > To: dev@openvswitch.org

    >     > Subject: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for

    >     > established connections.

    >     >

    >     > From: Antonio Fischetti <antonio.fischetti@intel.com>

    >     >

    >     > With conntrack enabled, packets get recirculated and this impacts

    >     > the performance with thousands of active concurrent connections.

    >     >

    >     > This patch is aimed at avoiding recirculation for packets belonging

    > to

    >     > established connections in steady state. This is achieved by

    >     > manipulating the megaflows and actions. In this case the actions of

    > the

    >     > megaflow of recirculated packet is initially updated with new

    > 'CT_SKIP'

    >     > action and later updated with the actions of the megaflow of the

    >     > original packet(to handle zones). There after the EMC entry

    >     > belonging to the original packet is updated to point to the

    > 'megaflow' of

    >     > the recirculated packet there by avoiding recirculation.

    >     >

    >     > Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>

    >     > Signed-off-by: Bhanuprakash Bodireddy

    > <bhanuprakash.bodireddy@intel.com>

    >     > Co-authored-by: Bhanuprakash Bodireddy

    > <bhanuprakash.bodireddy@intel.com>

    >     > ---

    >     > - ~10% performance improvement is observed in our testing with UDP

    >     > streams.

    >     > - Limited testing is done with RFC patch and the tests include

    > hundreds of

    >     >   concurrent active TCP connections.

    >     > - This is very early implementation and we are posting here to get

    > initial

    >     >   feedback.

    >     > - This RFC patch is implemented leveraging EMC, but optimization

    > could be

    >     >   extended to dpcls as well to handle higher flows.

    >     > - No VXLAN testing is done with this patch.

    >     >

    >     >  datapath/linux/compat/include/linux/openvswitch.h |   1 +

    >     >  lib/dpif-netdev.c                                 | 148

    >     > ++++++++++++++++++++--

    >     >  lib/packets.h                                     |   2 +

    >     >  3 files changed, 143 insertions(+), 8 deletions(-)

    >     >

    >     > diff --git a/datapath/linux/compat/include/linux/openvswitch.h

    >     > b/datapath/linux/compat/include/linux/openvswitch.h

    >     > index d22102e..6dc5674 100644

    >     > --- a/datapath/linux/compat/include/linux/openvswitch.h

    >     > +++ b/datapath/linux/compat/include/linux/openvswitch.h

    >     > @@ -762,6 +762,7 @@ enum ovs_ct_attr {

    >     >  	OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */

    >     >  	OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */

    >     >  	OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */

    >     > +	OVS_CT_ATTR_SKIP,       /* No argument, does not invoke CT.

    > */

    >     >  	__OVS_CT_ATTR_MAX

    >     >  };

    >     >

    >     > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c

    >     > index b50164b..fbbb42e 100644

    >     > --- a/lib/dpif-netdev.c

    >     > +++ b/lib/dpif-netdev.c

    >     > @@ -2101,7 +2101,8 @@ emc_probabilistic_insert(struct

    > dp_netdev_pmd_thread

    >     > *pmd,

    >     >  }

    >     >

    >     >  static inline struct dp_netdev_flow *

    >     > -emc_lookup(struct emc_cache *cache, const struct netdev_flow_key

    > *key)

    >     > +emc_lookup(struct emc_cache *cache, const struct netdev_flow_key

    > *key,

    >     > +        void **pfound_entry)

    >     >  {

    >     >      struct emc_entry *current_entry;

    >     >

    >     > @@ -2110,11 +2111,13 @@ emc_lookup(struct emc_cache *cache, const

    > struct

    >     > netdev_flow_key *key)

    >     >              && emc_entry_alive(current_entry)

    >     >              && netdev_flow_key_equal_mf(&current_entry->key, &key-

    > >mf)) {

    >     >

    >     > +            *pfound_entry = current_entry;

    >     >              /* We found the entry with the 'key->mf' miniflow */

    >     >              return current_entry->flow;

    >     >          }

    >     >      }

    >     >

    >     > +    *pfound_entry = NULL;

    >     >      return NULL;

    >     >  }

    >     >

    >     > @@ -4583,10 +4586,12 @@ emc_processing(struct dp_netdev_pmd_thread

    > *pmd,

    >     >          key->hash = dpif_netdev_packet_get_rss_hash(packet, &key-

    > >mf);

    >     >

    >     >          /* If EMC is disabled skip emc_lookup */

    >     > -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);

    >     > +        flow = (cur_min == 0) ? NULL:

    >     > +                emc_lookup(flow_cache, key, &packet-

    > >md.prev_emc_entry);

    >     >          if (OVS_LIKELY(flow)) {

    >     >              dp_netdev_queue_batches(packet, flow, &key->mf,

    > batches,

    >     >                                      n_batches);

    >     > +            packet->md.prev_flow = flow;

    >     >          } else {

    >     >              /* Exact match cache missed. Group missed packets

    > together at

    >     >               * the beginning of the 'packets' array. */

    >     > @@ -4595,6 +4600,7 @@ emc_processing(struct dp_netdev_pmd_thread

    > *pmd,

    >     >               * must be returned to the caller. The next key should

    > be

    >     > extracted

    >     >               * to 'keys[n_missed + 1]'. */

    >     >              key = &keys[++n_missed];

    >     > +            packet->md.prev_flow = NULL;

    >     >          }

    >     >      }

    >     >

    >     > @@ -4604,6 +4610,9 @@ emc_processing(struct dp_netdev_pmd_thread

    > *pmd,

    >     >      return dp_packet_batch_size(packets_);

    >     >  }

    >     >

    >     > +#define CT_PREPEND_ACTION_LEN 0x0C

    >     > +#define CT_PREPEND_ACTION_SPARE_LEN 32

    >     > +#define CT_TOT_ACTIONS_LEN (CT_PREPEND_ACTION_LEN +

    >     > CT_PREPEND_ACTION_SPARE_LEN)

    >     >  static inline void

    >     >  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,

    >     >                       struct dp_packet *packet,

    >     > @@ -4663,12 +4672,44 @@ handle_packet_upcall(struct

    > dp_netdev_pmd_thread

    >     > *pmd,

    >     >          ovs_mutex_lock(&pmd->flow_mutex);

    >     >          netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key, NULL);

    >     >          if (OVS_LIKELY(!netdev_flow)) {

    >     > -            netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,

    >     > -                                             add_actions->data,

    >     > -                                             add_actions->size);

    >     > +            /* Recirculated packets that belong to established

    >     > connections

    >     > +             * are treated differently.  A place-holder is

    > prepended to

    >     > the

    >     > +             * list of actions. */

    >     > +            if (!(packet->md.ct_state & CS_INVALID) &&

    >     > +                 (packet->md.ct_state & CS_TRACKED) &&

    >     > +                 (packet->md.ct_state & CS_ESTABLISHED) &&

    >     > +                 (packet->md.recirc_id)) {

    >     > +

    > 

    > 

    >     [Joe] "Just because the value of ct_state in the packet metadata of

    > this

    >     particular packet happens to have those particular state flags doesn't

    >     mean that the OpenFlow pipeline is enforcing this constraint for all

    >     packets hitting this flow. You would have to ensure that the mask in

    >     the flow is also ensuring that all future packets hitting the new flow

    >     will also have that particular ct_state."

    > 

    >     This part of the code should prepend a ct(Skip) action to the

    >     FLOW_RECIRC only and just for packets that belong to an established

    > conn.

    >     The FLOW_ORIG shouldn't be altered, because it will be hit by all

    > ingress

    >     packets.

    > 

    >     > +                /* Prepend an OVS_CT_ATTR_SKIP to the list of

    > actions

    >     > inside

    >     > +                 * add_actions.  It does not really invoke the

    > ConnTrack

    >     > module,

    >     > +                 * it's a Ct action that works as a place-holder.

    > It

    >     > will be

    >     > +                 * overwritten inside emc_patch_established_conn()

    > with

    >     > +                 * the proper ct(zone=X) action. */

    >     > +                struct dp_netdev_actions *new_actions;

    >     > +                uint8_t tmp_action[CT_TOT_ACTIONS_LEN] = {

    >     > +                        0x0C, 0x00, 0x0C, 0x00,

    >     > +                        0x06, 0x00, 0x09, 0x00,

    >     > +                        0x00, 0x00, 0x00, 0x00,

    >     > +                        /* will append here add_actions. */

    >     > +                };

    >     > +                memcpy(&tmp_action[CT_PREPEND_ACTION_LEN],

    >     > +                        add_actions->data,

    >     > +                        add_actions->size);

    >     > +                new_actions = dp_netdev_actions_create(

    >     > +                        (const struct nlattr *)tmp_action,

    >     > +                        CT_PREPEND_ACTION_LEN + add_actions->size);

    > 

    >     [Joe] "Hmm. In some ways I think this would be cleaner to be

    > implemented in

    >     the xlate code, around compose_conntrack_action().. but that code is

    >     currently pretty much dataplane-oblivious and this suggestion is a

    >     userspace-specific optimization."

    > 

    > 

    >     > +                netdev_flow = dp_netdev_flow_add(pmd, &match,

    > &ufid,

    >     > +                                                 new_actions-

    > >actions,

    >     > +                                                 new_actions-

    > >size);

    >     > +            } else {

    >     > +                netdev_flow = dp_netdev_flow_add(pmd, &match,

    > &ufid,

    >     > +                                                 add_actions->data,

    >     > +                                                 add_actions-

    > >size);

    >     > +            }

    >     >          }

    >     >          ovs_mutex_unlock(&pmd->flow_mutex);

    >     > -        emc_probabilistic_insert(pmd, key, netdev_flow);

    >     > +        /* For this RFC, probabilistic emc insertion is disabled.

    > */

    >     > +        emc_insert(&pmd->flow_cache, key, netdev_flow);

    >     >      }

    >     >  }

    >     >

    >     > @@ -4761,7 +4802,8 @@ fast_path_processing(struct

    > dp_netdev_pmd_thread

    >     > *pmd,

    >     >

    >     >          flow = dp_netdev_flow_cast(rules[i]);

    >     >

    >     > -        emc_probabilistic_insert(pmd, &keys[i], flow);

    >     > +        /* For testing purposes the emc_insert is restored here. */

    >     > +        emc_insert(&pmd->flow_cache, &keys[i], flow);

    >     >          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,

    >     > n_batches);

    >     >      }

    >     >

    >     > @@ -4998,6 +5040,78 @@ dp_execute_userspace_action(struct

    >     > dp_netdev_pmd_thread *pmd,

    >     >      }

    >     >  }

    >     >

    >     > +/* Search into EMC to retrieve the entry related to the original

    > packet

    >     > + * and the entry related to the recirculated packet.

    >     > + * If both EMC entries are alive, then:

    >     > + *  - The flow actions of the recirculated packet are updated with

    >     > + *    'ct(zone=N)' as retrieved from the actions of the original

    > flow.

    >     > + *  - The EMC orig entry flow is updated with the flow pointer of

    > recirc

    >     > entry.

    >     > + */

    >     > +static inline void

    >     > +emc_patch_established_conn(struct dp_netdev_pmd_thread *pmd,

    >     > +        struct dp_packet *packet, long long now OVS_UNUSED)

    >     > +{

    >     > +    struct emc_cache *cache = &pmd->flow_cache;

    >     > +    struct netdev_flow_key key;

    >     > +    struct emc_entry *orig_entry; /* EMC entry hit by the original

    >     > packet. */

    >     > +    struct emc_entry *recirc_entry; /* EMC entry for recirculated

    > packet.

    >     > */

    >     > +    uint32_t old_hash;

    >     > +

    >     > +    if (!packet->md.prev_emc_entry) {

    >     > +        return;

    >     > +    }

    >     > +

    >     > +    orig_entry = packet->md.prev_emc_entry;

    >     > +    if (!emc_entry_alive(orig_entry)) {

    >     > +        return;

    >     > +    }

    >     > +

    >     > +    /* Check that the original EMC entry was not overwritten. */

    >     > +    struct dp_netdev_flow *orig_flow = orig_entry->flow;

    >     > +    if (packet->md.prev_flow && (packet->md.prev_flow !=

    > orig_flow)) {

    >     > +       return;

    >     > +    }

    >     > +

    >     > +    /* TODO as we are calling miniflow_extract now, can we avoid to

    >     > invoke

    >     > +     * it again when we'll classify this recirculated packet? */

    >     > +    miniflow_extract(packet, &key.mf);

    >     > +    key.len = 0; /* Not computed yet. */

    >     > +    old_hash = dp_packet_get_rss_hash(packet);

    >     > +    key.hash = dpif_netdev_packet_get_rss_hash(packet, NULL);

    >     > +    dp_packet_set_rss_hash(packet, old_hash);

    >     > +

    >     > +    EMC_FOR_EACH_POS_WITH_HASH(cache, recirc_entry, key.hash) {

    >     > +        if (recirc_entry->key.hash == key.hash

    >     > +            && emc_entry_alive(recirc_entry)

    >     > +            && netdev_flow_key_equal_mf(&recirc_entry->key,

    > &key.mf)) {

    >     > +

    >     > +            if (orig_entry->flow == recirc_entry->flow) {

    >     > +                /* Nothing to do, already patched by a packet of

    > this

    >     > +                 * same batch. */

    >     > +                return;

    >     > +            }

    >     > +            /* We found the entry related to the recirculated

    > packet.

    >     > +             * Retrieve the actions for orig and recirc entries. *

    > */

    >     > +            struct dp_netdev_actions * orig_actions =

    >     > +                    dp_netdev_flow_get_actions(orig_entry->flow);

    >     > +            struct dp_netdev_actions * recirc_actions =

    >     > +                    dp_netdev_flow_get_actions(recirc_entry->flow);

    >     > +

    >     > +            /* The orig entry action contains a CT action with the

    > zone

    >     > info. */

    >     > +            struct nlattr *p = &orig_actions->actions[0];

    >     > +

    >     > +            /* Overwrite the CT Skip action of recirc entry with

    >     > ct(zone=N). */

    >     > +            memcpy(recirc_actions->actions, p, p->nla_len);

    >     > +

    >     > +            /* Update orig EMC entry. */

    >     > +            orig_entry->flow = recirc_entry->flow;

    >     > +            dp_netdev_flow_ref(orig_entry->flow);

    >     > +

    >     > +            return;

    >     > +        }

    >     > +    }

    >     > +}

    >     > +

    >     >  static void

    >     >  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,

    >     >                const struct nlattr *a, bool may_steal)

    >     > @@ -5024,7 +5138,6 @@ dp_execute_cb(void *aux_, struct

    > dp_packet_batch

    >     > *packets_,

    >     >              } else {

    >     >                  tx_qid = pmd->static_tx_qid;

    >     >              }

    >     > -

    >     >              netdev_send(p->port->netdev, tx_qid, packets_,

    > may_steal,

    >     >                          dynamic_txqs);

    >     >              return;

    > 

    >     [Joe] "Unrelated style change."

    > 

    >     Will do that.

    > 

    >     > @@ -5145,10 +5258,21 @@ dp_execute_cb(void *aux_, struct

    > dp_packet_batch

    >     > *packets_,

    >     >              struct dp_packet *packet;

    >     >              DP_PACKET_BATCH_FOR_EACH (packet, packets_) {

    >     >                  packet->md.recirc_id = nl_attr_get_u32(a);

    >     > +

    >     > +                if (!(packet->md.ct_state & CS_INVALID) &&

    >     > +                     (packet->md.ct_state & CS_TRACKED) &&

    >     > +                     (packet->md.ct_state & CS_ESTABLISHED)) {

    >     > +                    (*depth)++;

    >     > +                    emc_patch_established_conn(pmd, packet, now);

    >     > +                    (*depth)--;

    >     > +                }

    >     >              }

    >     >

    >     >              (*depth)++;

    >     >              dp_netdev_recirculate(pmd, packets_);

    >     > +            DP_PACKET_BATCH_FOR_EACH (packet, packets_) {

    >     > +                packet->md.recirc_id = 0;

    >     > +            }

    >     >              (*depth)--;

    >     >

    >     >              return;

    > 

    >     [Joe] "Can you give some more detail about why resetting the recirc_id

    > is

    >     correct here? Could/should this be applied as an independent change

    >     regardless of the rest of this series?"

    > 

    >     This addition is needed so that in handle_packet_upcall we add the

    > ct(Skip)

    >     just to the FLOW_RECIRC and not to the FLOW_ORIG.

    >     I guess this could be also applied as an independent change?

    > 

    >     > @@ -5166,6 +5290,7 @@ dp_execute_cb(void *aux_, struct

    > dp_packet_batch

    >     > *packets_,

    >     >          const char *helper = NULL;

    >     >          const uint32_t *setmark = NULL;

    >     >          const struct ovs_key_ct_labels *setlabel = NULL;

    >     > +        bool skip_ct = false;

    >     >

    >     >          NL_ATTR_FOR_EACH_UNSAFE (b, left, nl_attr_get(a),

    >     >                                   nl_attr_get_size(a)) {

    >     > @@ -5194,6 +5319,9 @@ dp_execute_cb(void *aux_, struct

    > dp_packet_batch

    >     > *packets_,

    >     >                  /* Silently ignored, as userspace datapath does not

    >     > generate

    >     >                   * netlink events. */

    >     >                  break;

    >     > +            case OVS_CT_ATTR_SKIP:

    >     > +                skip_ct = true;

    >     > +                break;

    >     >              case OVS_CT_ATTR_NAT:

    >     >              case OVS_CT_ATTR_UNSPEC:

    >     >              case __OVS_CT_ATTR_MAX:

    >     > @@ -5201,6 +5329,10 @@ dp_execute_cb(void *aux_, struct

    > dp_packet_batch

    >     > *packets_,

    >     >              }

    >     >          }

    >     >

    >     > +        if (OVS_UNLIKELY(skip_ct)) {

    >     > +            break;

    >     > +        }

    >     > +

    >     >          conntrack_execute(&dp->conntrack, packets_, aux->flow-

    > >dl_type,

    >     > force,

    >     >                            commit, zone, setmark, setlabel, helper);

    >     >          break;

    >     > diff --git a/lib/packets.h b/lib/packets.h

    >     > index 7dbf6dd..2c46a15 100644

    >     > --- a/lib/packets.h

    >     > +++ b/lib/packets.h

    >     > @@ -112,6 +112,8 @@ struct pkt_metadata {

    >     >      struct flow_tnl tunnel;     /* Encapsulating tunnel parameters.

    > Note

    >     > that

    >     >                                   * if 'ip_dst' == 0, the rest of

    > the

    >     > fields may

    >     >                                   * be uninitialized. */

    >     > +    void *prev_emc_entry;       /* EMC entry that this packet

    > matched. */

    >     > +    void *prev_flow;            /* Flow pointed by the matching EMC

    >     > entry. */

    >     >  };

    >     >

    >     >  static inline void

    >     > --

    >     > 2.4.11

    >     >

    >     > _______________________________________________

    >     > dev mailing list

    >     > dev@openvswitch.org

    >     > https://urldefense.proofpoint.com/v2/url?u=https-

    > 3A__mail.openvswitch.org_mailman_listinfo_ovs-

    > 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-

    > uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-

    > Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=

    >     _______________________________________________

    >     dev mailing list

    >     dev@openvswitch.org

    >     https://urldefense.proofpoint.com/v2/url?u=https-

    > 3A__mail.openvswitch.org_mailman_listinfo_ovs-

    > 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-

    > uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-

    > Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=

    >
diff mbox

Patch

diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
index d22102e..6dc5674 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -762,6 +762,7 @@  enum ovs_ct_attr {
 	OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */
 	OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */
 	OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */
+	OVS_CT_ATTR_SKIP,       /* No argument, does not invoke CT. */
 	__OVS_CT_ATTR_MAX
 };
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index b50164b..fbbb42e 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2101,7 +2101,8 @@  emc_probabilistic_insert(struct dp_netdev_pmd_thread *pmd,
 }
 
 static inline struct dp_netdev_flow *
-emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
+emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key,
+        void **pfound_entry)
 {
     struct emc_entry *current_entry;
 
@@ -2110,11 +2111,13 @@  emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
             && emc_entry_alive(current_entry)
             && netdev_flow_key_equal_mf(&current_entry->key, &key->mf)) {
 
+            *pfound_entry = current_entry;
             /* We found the entry with the 'key->mf' miniflow */
             return current_entry->flow;
         }
     }
 
+    *pfound_entry = NULL;
     return NULL;
 }
 
@@ -4583,10 +4586,12 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
         key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
 
         /* If EMC is disabled skip emc_lookup */
-        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);
+        flow = (cur_min == 0) ? NULL:
+                emc_lookup(flow_cache, key, &packet->md.prev_emc_entry);
         if (OVS_LIKELY(flow)) {
             dp_netdev_queue_batches(packet, flow, &key->mf, batches,
                                     n_batches);
+            packet->md.prev_flow = flow;
         } else {
             /* Exact match cache missed. Group missed packets together at
              * the beginning of the 'packets' array. */
@@ -4595,6 +4600,7 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
              * must be returned to the caller. The next key should be extracted
              * to 'keys[n_missed + 1]'. */
             key = &keys[++n_missed];
+            packet->md.prev_flow = NULL;
         }
     }
 
@@ -4604,6 +4610,9 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
     return dp_packet_batch_size(packets_);
 }
 
+#define CT_PREPEND_ACTION_LEN 0x0C
+#define CT_PREPEND_ACTION_SPARE_LEN 32
+#define CT_TOT_ACTIONS_LEN (CT_PREPEND_ACTION_LEN + CT_PREPEND_ACTION_SPARE_LEN)
 static inline void
 handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
                      struct dp_packet *packet,
@@ -4663,12 +4672,44 @@  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
         ovs_mutex_lock(&pmd->flow_mutex);
         netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key, NULL);
         if (OVS_LIKELY(!netdev_flow)) {
-            netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,
-                                             add_actions->data,
-                                             add_actions->size);
+            /* Recirculated packets that belong to established connections
+             * are treated differently.  A place-holder is prepended to the
+             * list of actions. */
+            if (!(packet->md.ct_state & CS_INVALID) &&
+                 (packet->md.ct_state & CS_TRACKED) &&
+                 (packet->md.ct_state & CS_ESTABLISHED) &&
+                 (packet->md.recirc_id)) {
+
+                /* Prepend an OVS_CT_ATTR_SKIP to the list of actions inside
+                 * add_actions.  It does not really invoke the ConnTrack module,
+                 * it's a Ct action that works as a place-holder.  It will be
+                 * overwritten inside emc_patch_established_conn() with
+                 * the proper ct(zone=X) action. */
+                struct dp_netdev_actions *new_actions;
+                uint8_t tmp_action[CT_TOT_ACTIONS_LEN] = {
+                        0x0C, 0x00, 0x0C, 0x00,
+                        0x06, 0x00, 0x09, 0x00,
+                        0x00, 0x00, 0x00, 0x00,
+                        /* will append here add_actions. */
+                };
+                memcpy(&tmp_action[CT_PREPEND_ACTION_LEN],
+                        add_actions->data,
+                        add_actions->size);
+                new_actions = dp_netdev_actions_create(
+                        (const struct nlattr *)tmp_action,
+                        CT_PREPEND_ACTION_LEN + add_actions->size);
+                netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,
+                                                 new_actions->actions,
+                                                 new_actions->size);
+            } else {
+                netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,
+                                                 add_actions->data,
+                                                 add_actions->size);
+            }
         }
         ovs_mutex_unlock(&pmd->flow_mutex);
-        emc_probabilistic_insert(pmd, key, netdev_flow);
+        /* For this RFC, probabilistic emc insertion is disabled. */
+        emc_insert(&pmd->flow_cache, key, netdev_flow);
     }
 }
 
@@ -4761,7 +4802,8 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
         flow = dp_netdev_flow_cast(rules[i]);
 
-        emc_probabilistic_insert(pmd, &keys[i], flow);
+        /* For testing purposes the emc_insert is restored here. */
+        emc_insert(&pmd->flow_cache, &keys[i], flow);
         dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
     }
 
@@ -4998,6 +5040,78 @@  dp_execute_userspace_action(struct dp_netdev_pmd_thread *pmd,
     }
 }
 
+/* Search into EMC to retrieve the entry related to the original packet
+ * and the entry related to the recirculated packet.
+ * If both EMC entries are alive, then:
+ *  - The flow actions of the recirculated packet are updated with
+ *    'ct(zone=N)' as retrieved from the actions of the original flow.
+ *  - The EMC orig entry flow is updated with the flow pointer of recirc entry.
+ */
+static inline void
+emc_patch_established_conn(struct dp_netdev_pmd_thread *pmd,
+        struct dp_packet *packet, long long now OVS_UNUSED)
+{
+    struct emc_cache *cache = &pmd->flow_cache;
+    struct netdev_flow_key key;
+    struct emc_entry *orig_entry; /* EMC entry hit by the original packet. */
+    struct emc_entry *recirc_entry; /* EMC entry for recirculated packet. */
+    uint32_t old_hash;
+
+    if (!packet->md.prev_emc_entry) {
+        return;
+    }
+
+    orig_entry = packet->md.prev_emc_entry;
+    if (!emc_entry_alive(orig_entry)) {
+        return;
+    }
+
+    /* Check that the original EMC entry was not overwritten. */
+    struct dp_netdev_flow *orig_flow = orig_entry->flow;
+    if (packet->md.prev_flow && (packet->md.prev_flow != orig_flow)) {
+       return;
+    }
+
+    /* TODO as we are calling miniflow_extract now, can we avoid to invoke
+     * it again when we'll classify this recirculated packet? */
+    miniflow_extract(packet, &key.mf);
+    key.len = 0; /* Not computed yet. */
+    old_hash = dp_packet_get_rss_hash(packet);
+    key.hash = dpif_netdev_packet_get_rss_hash(packet, NULL);
+    dp_packet_set_rss_hash(packet, old_hash);
+
+    EMC_FOR_EACH_POS_WITH_HASH(cache, recirc_entry, key.hash) {
+        if (recirc_entry->key.hash == key.hash
+            && emc_entry_alive(recirc_entry)
+            && netdev_flow_key_equal_mf(&recirc_entry->key, &key.mf)) {
+
+            if (orig_entry->flow == recirc_entry->flow) {
+                /* Nothing to do, already patched by a packet of this
+                 * same batch. */
+                return;
+            }
+            /* We found the entry related to the recirculated packet.
+             * Retrieve the actions for orig and recirc entries. * */
+            struct dp_netdev_actions * orig_actions =
+                    dp_netdev_flow_get_actions(orig_entry->flow);
+            struct dp_netdev_actions * recirc_actions =
+                    dp_netdev_flow_get_actions(recirc_entry->flow);
+
+            /* The orig entry action contains a CT action with the zone info. */
+            struct nlattr *p = &orig_actions->actions[0];
+
+            /* Overwrite the CT Skip action of recirc entry with ct(zone=N). */
+            memcpy(recirc_actions->actions, p, p->nla_len);
+
+            /* Update orig EMC entry. */
+            orig_entry->flow = recirc_entry->flow;
+            dp_netdev_flow_ref(orig_entry->flow);
+
+            return;
+        }
+    }
+}
+
 static void
 dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
               const struct nlattr *a, bool may_steal)
@@ -5024,7 +5138,6 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
             } else {
                 tx_qid = pmd->static_tx_qid;
             }
-
             netdev_send(p->port->netdev, tx_qid, packets_, may_steal,
                         dynamic_txqs);
             return;
@@ -5145,10 +5258,21 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
             struct dp_packet *packet;
             DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
                 packet->md.recirc_id = nl_attr_get_u32(a);
+
+                if (!(packet->md.ct_state & CS_INVALID) &&
+                     (packet->md.ct_state & CS_TRACKED) &&
+                     (packet->md.ct_state & CS_ESTABLISHED)) {
+                    (*depth)++;
+                    emc_patch_established_conn(pmd, packet, now);
+                    (*depth)--;
+                }
             }
 
             (*depth)++;
             dp_netdev_recirculate(pmd, packets_);
+            DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
+                packet->md.recirc_id = 0;
+            }
             (*depth)--;
 
             return;
@@ -5166,6 +5290,7 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
         const char *helper = NULL;
         const uint32_t *setmark = NULL;
         const struct ovs_key_ct_labels *setlabel = NULL;
+        bool skip_ct = false;
 
         NL_ATTR_FOR_EACH_UNSAFE (b, left, nl_attr_get(a),
                                  nl_attr_get_size(a)) {
@@ -5194,6 +5319,9 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
                 /* Silently ignored, as userspace datapath does not generate
                  * netlink events. */
                 break;
+            case OVS_CT_ATTR_SKIP:
+                skip_ct = true;
+                break;
             case OVS_CT_ATTR_NAT:
             case OVS_CT_ATTR_UNSPEC:
             case __OVS_CT_ATTR_MAX:
@@ -5201,6 +5329,10 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
             }
         }
 
+        if (OVS_UNLIKELY(skip_ct)) {
+            break;
+        }
+
         conntrack_execute(&dp->conntrack, packets_, aux->flow->dl_type, force,
                           commit, zone, setmark, setlabel, helper);
         break;
diff --git a/lib/packets.h b/lib/packets.h
index 7dbf6dd..2c46a15 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -112,6 +112,8 @@  struct pkt_metadata {
     struct flow_tnl tunnel;     /* Encapsulating tunnel parameters. Note that
                                  * if 'ip_dst' == 0, the rest of the fields may
                                  * be uninitialized. */
+    void *prev_emc_entry;       /* EMC entry that this packet matched. */
+    void *prev_flow;            /* Flow pointed by the matching EMC entry. */
 };
 
 static inline void