[ovs-dev,v3,3/3] tunneling: Avoid datapath-recirc by combining recirc actions at xlate.

Message ID 1500400161-76766-4-git-send-email-sugesh.chandran@intel.com
State Superseded
Delegated to: Joe Stringer

Commit Message

Chandran, Sugesh July 18, 2017, 5:49 p.m. UTC
This patch set removes the recirculation of encapsulated tunnel packets
where possible. It does so by computing the post-tunnel actions at
translation time. The combined nested action set is programmed in the
datapath using the CLONE action.

The following test results show the performance improvement offered by
this optimization for tunnel encapsulation.

            +-------------+
      dpdk0 |             |
         -->o    br-in    |
            |             o--> gre0
            +-------------+

                   --> LOCAL
            +-----------o-+
            |             | dpdk1
            |    br-p1    o-->
            |             |
            +-------------+
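For illustration (not taken from the patch; the exact action syntax below is
only a sketch), the optimization changes the datapath flow for this setup
roughly as follows:

```text
# Without the optimization: the encapsulated packet is recirculated so
# the datapath can match it again and find the output port on br-p1.
actions: tnl_push(header(...)), recirc(0x1)

# With the optimization: the post-tunnel actions computed at translation
# time are nested inside a CLONE action, avoiding the recirculation.
actions: clone(tnl_push(header(...)), output:dpdk1)
```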

Test result on OVS master with DPDK 16.11.2 (Without optimization):

 # dpdk0

 RX packets         : 7037641.60  / sec
 RX packet errors   : 0  / sec
 RX packets dropped : 7730632.90  / sec
 RX rate            : 402.69 MB/sec

 # dpdk1

 TX packets         : 7037641.60  / sec
 TX packet errors   : 0  / sec
 TX packets dropped : 0  / sec
 TX rate            : 657.73 MB/sec
 TX processing cost per TX packets in nsec : 142.09

Test result on OVS master + DPDK 16.11.2 (With optimization):

 # dpdk0

 RX packets         : 9386809.60  / sec
 RX packet errors   : 0  / sec
 RX packets dropped : 5381496.40  / sec
 RX rate            : 537.11 MB/sec

 # dpdk1

 TX packets         : 9386809.60  / sec
 TX packet errors   : 0  / sec
 TX packets dropped : 0  / sec
 TX rate            : 877.29 MB/sec
 TX processing cost per TX packets in nsec : 106.53

The offered performance gain is approx 30%.

Signed-off-by: Sugesh Chandran <sugesh.chandran@intel.com>
Signed-off-by: Zoltán Balogh <zoltan.balogh@ericsson.com>
Co-authored-by: Zoltán Balogh <zoltan.balogh@ericsson.com>
---
 lib/dpif-netdev.c                  |  18 +--
 ofproto/ofproto-dpif-xlate-cache.c |  32 ++++-
 ofproto/ofproto-dpif-xlate-cache.h |  14 ++-
 ofproto/ofproto-dpif-xlate.c       | 234 ++++++++++++++++++++++++++++++++++++-
 ofproto/ofproto-dpif.c             |   3 +-
 tests/packet-type-aware.at         |  27 ++---
 6 files changed, 283 insertions(+), 45 deletions(-)

Comments

Joe Stringer July 19, 2017, 12:40 a.m. UTC | #1
On 18 July 2017 at 10:49, Sugesh Chandran <sugesh.chandran@intel.com> wrote:
> This patch set removes the recirculation of encapsulated tunnel packets
> if possible. It is done by computing the post tunnel actions at the time of
> translation. The combined nested action set are programmed in the datapath
> using CLONE action.
>
> The following test results shows the performance improvement offered by
> this optimization for tunnel encap.
>
>           +-------------+
>       dpdk0 |             |
>          -->o    br-in    |
>             |             o--> gre0
>             +-------------+
>
>                    --> LOCAL
>             +-----------o-+
>             |             | dpdk1
>             |    br-p1    o-->
>             |             |
>             +-------------+
>
> Test result on OVS master with DPDK 16.11.2 (Without optimization):
>
>  # dpdk0
>
>  RX packets         : 7037641.60  / sec
>  RX packet errors   : 0  / sec
>  RX packets dropped : 7730632.90  / sec
>  RX rate            : 402.69 MB/sec
>
>  # dpdk1
>
>  TX packets         : 7037641.60  / sec
>  TX packet errors   : 0  / sec
>  TX packets dropped : 0  / sec
>  TX rate            : 657.73 MB/sec
>  TX processing cost per TX packets in nsec : 142.09
>
> Test result on OVS master + DPDK 16.11.2 (With optimization):
>
>  # dpdk0
>
>  RX packets         : 9386809.60  / sec
>  RX packet errors   : 0  / sec
>  RX packets dropped : 5381496.40  / sec
>  RX rate            : 537.11 MB/sec
>
>  # dpdk1
>
>  TX packets         : 9386809.60  / sec
>  TX packet errors   : 0  / sec
>  TX packets dropped : 0  / sec
>  TX rate            : 877.29 MB/sec
>  TX processing cost per TX packets in nsec : 106.53
>
> The offered performance gain is approx 30%.
>
> Signed-off-by: Sugesh Chandran <sugesh.chandran@intel.com>
> Signed-off-by: Zoltán Balogh <zoltan.balogh@ericsson.com>
> Co-authored-by: Zoltán Balogh <zoltan.balogh@ericsson.com>
> ---

Hi Sugesh,

I have some brief feedback below. Looks like this is getting close.

<snip>

> @@ -279,3 +289,21 @@ xlate_cache_delete(struct xlate_cache *xcache)
>      xlate_cache_uninit(xcache);
>      free(xcache);
>  }
> +
> +/* Append all the entries in src into dst and remove them from src.
> + * The caller must own both xc-caches to use this function.
> + * The 'src' entries are not freed in this function as its owned by caller.
> + */
> +void xlate_cache_steal_entries(struct xlate_cache *dst, struct
> +                               xlate_cache *src)

Usually, the type definition and the variable are kept on the same
line. Would you mind updating this to follow the surrounding code
style?

> +{
> +    if (!dst || !src) {
> +        return;
> +    }
> +    void *p;
> +    struct ofpbuf *src_entries = &src->entries;
> +    struct ofpbuf *dst_entries = &dst->entries;
> +    p = ofpbuf_put_zeros(dst_entries, src_entries->size);
> +    nullable_memcpy(p, src_entries->data, src_entries->size);
> +    ofpbuf_clear(src_entries);

I don't think we need to zero the buffer if we're going to write into
it, so we could use ofpbuf_put_uninit() instead. That function will
never return a NULL pointer, so a regular memcpy should be sufficient.

> @@ -134,11 +142,13 @@ struct xlate_cache {
>  void xlate_cache_init(struct xlate_cache *);
>  struct xlate_cache *xlate_cache_new(void);
>  struct xc_entry *xlate_cache_add_entry(struct xlate_cache *, enum xc_type);
> -void xlate_push_stats_entry(struct xc_entry *, const struct dpif_flow_stats *);
> -void xlate_push_stats(struct xlate_cache *, const struct dpif_flow_stats *);
> +void xlate_push_stats_entry(struct xc_entry *, struct dpif_flow_stats *);
> +void xlate_push_stats(struct xlate_cache *, struct dpif_flow_stats *);
>  void xlate_cache_clear_entry(struct xc_entry *);
>  void xlate_cache_clear(struct xlate_cache *);
>  void xlate_cache_uninit(struct xlate_cache *);
>  void xlate_cache_delete(struct xlate_cache *);
> +void xlate_cache_steal_entries(struct xlate_cache *, struct xlate_cache *);
> +

There's some extra unnecessary whitespace here.

> +/* Validate if the transalated combined actions are OK to proceed.
> + * If actions consist of TRUNC action, it is not allowed to do the
> + * tunnel_push combine as it cannot update stats correctly.
> + */
> +static bool
> +is_tunnel_actions_clone_ready(struct xlate_ctx *ctx)
> +{
> +    struct nlattr *tnl_actions;
> +    const struct nlattr *a;
> +    unsigned int left;
> +    size_t actions_len;
> +    struct ofpbuf *actions = ctx->odp_actions;
> +
> +    if (!actions) {
> +        /* No actions, no harm in doing combine. */
> +        return true;
> +    }
> +
> +    /* Cannot perform tunnel push on slow path action CONTROLLER_OUTPUT. */
> +    if (!ctx->xout->avoid_caching &&
> +        (ctx->xout->slow & SLOW_CONTROLLER)) {

Even if we are avoiding caching the flow, I still think that the
controller action cannot be correctly handled through this new path so
we should return false. Do you have a particular condition in mind for
why we need to care about the 'avoid_caching' flag in this case?

> +static bool
> +validate_and_combine_post_tnl_actions(struct xlate_ctx *ctx,
> +                                      const struct xport *xport,
> +                                      struct xport *out_dev,
> +                                      struct ovs_action_push_tnl tnl_push_data)
> +{
...
> +    if (ctx->odp_actions->size > push_action_size) {
> +        /* Update the CLONE action only when combined. */
> +        nl_msg_end_nested(ctx->odp_actions, clone_ofs);
> +    } else {
> +        /* No actions after the tunnel, no need of clone. */
> +        nl_msg_cancel_nested(ctx->odp_actions, clone_ofs);
> +        odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);

I looked at this line again for this version since I didn't understand
it last time around, and I'm still a bit confused. If there are no
actions to run on the second bridge, then the copy of the packet which
is tunneled is effectively dropped. If that copy of the packet is
dropped, then why do we need the tunnel action at all? If I follow
correctly, this means that if you have two bridges, where the first
bridge has output(tunnel),output(other_device) and the second bridge
where the tunneling occurs has no flows, then the datapath flow will
end up as something like:

push_tnl(...),output(other_device).

I realise that the testsuite breaks if you remove this line, but maybe
the testsuite needs fixing for these cases?

> +    }
> +
> +out:
> +    /* Restore context status. */
> +    ctx->xin->resubmit_stats = backup_resubmit_stats;
> +    xlate_cache_delete(ctx->xin->xcache);
> +    ctx->xin->xcache = backup_xcache;
> +    ctx->xin->allow_side_effects = backup_side_effects;
> +    ctx->xin->packet = backup_pkt;
> +    ctx->wc = backup_flow_wc_ptr;
> +    return nested_act_flag;
> +

There's some extra unnecessary whitespace here.
Joe Stringer July 19, 2017, 1:33 a.m. UTC | #2
On 18 July 2017 at 17:40, Joe Stringer <joe@ovn.org> wrote:
> On 18 July 2017 at 10:49, Sugesh Chandran <sugesh.chandran@intel.com> wrote:
>> +static bool
>> +validate_and_combine_post_tnl_actions(struct xlate_ctx *ctx,
>> +                                      const struct xport *xport,
>> +                                      struct xport *out_dev,
>> +                                      struct ovs_action_push_tnl tnl_push_data)
>> +{
> ...
>> +    if (ctx->odp_actions->size > push_action_size) {
>> +        /* Update the CLONE action only when combined. */
>> +        nl_msg_end_nested(ctx->odp_actions, clone_ofs);
>> +    } else {
>> +        /* No actions after the tunnel, no need of clone. */
>> +        nl_msg_cancel_nested(ctx->odp_actions, clone_ofs);
>> +        odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
>
> I looked at this line again for this version since I didn't understand
> it last time around, and I'm still a bit confused. If there's no
> actions to run on the second bridge, then the copy of the packet which
> is tunneled is effectively dropped. If that copy of the packet is
> dropped, then why do we need the tunnel action at all? If I follow
> correctly, then this means that if you have two bridges, where the
> first bridge has output(tunnel),output(other_device) then the second
> bridge where the tunneling occurs has no flows, then the datapath flow
> will end up as something like:
>
> push_tnl(...),output(other_device).
>
> I realise that the testsuite breaks if you remove this line, but maybe
> the testsuite needs fixing for these cases?

On second thought, this should probably be a logically separate
follow-up patch even if we agree to change it.
Chandran, Sugesh July 19, 2017, 8:21 a.m. UTC | #3
Hi Joe,

Thank you for providing the comments on this series.
Please see my answers below.

Regards
_Sugesh

> -----Original Message-----
> From: Joe Stringer [mailto:joe@ovn.org]
> Sent: Wednesday, July 19, 2017 1:40 AM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>
> Cc: ovs dev <dev@openvswitch.org>; Andy Zhou <azhou@ovn.org>; Zoltán
> Balogh <zoltan.balogh@ericsson.com>
> Subject: Re: [PATCH v3 3/3] tunneling: Avoid datapath-recirc by combining
> recirc actions at xlate.
>
> On 18 July 2017 at 10:49, Sugesh Chandran <sugesh.chandran@intel.com>
> wrote:

> > <snip commit message and test results>
> > ---
>

> Hi Sugesh,
>
> I have some brief feedback below. Looks like this is getting close.

[Sugesh] Happy to see that, and thank you for the quick feedback :)

> <snip>
>
> > @@ -279,3 +289,21 @@ xlate_cache_delete(struct xlate_cache *xcache)
> >      xlate_cache_uninit(xcache);
> >      free(xcache);
> >  }
> > +
> > +/* Append all the entries in src into dst and remove them from src.
> > + * The caller must own both xc-caches to use this function.
> > + * The 'src' entries are not freed in this function as its owned by caller.
> > + */
> > +void xlate_cache_steal_entries(struct xlate_cache *dst, struct
> > +                               xlate_cache *src)
>
> Usually, the type definition and the variable are kept on the same line.
> Would you mind updating this to follow the surrounding code style?

[Sugesh] Sure, will change it in the next series.

> > +{
> > +    if (!dst || !src) {
> > +        return;
> > +    }
> > +    void *p;
> > +    struct ofpbuf *src_entries = &src->entries;
> > +    struct ofpbuf *dst_entries = &dst->entries;
> > +    p = ofpbuf_put_zeros(dst_entries, src_entries->size);
> > +    nullable_memcpy(p, src_entries->data, src_entries->size);
> > +    ofpbuf_clear(src_entries);
>
> I don't think we need to zero the buffer if we're going to write into it, so we
> could use ofpbuf_put_uninit() instead. That function will never return a NULL
> pointer, so a regular memcpy should be sufficient.

[Sugesh] OK, will change these lines to:
+ p = ofpbuf_put_uninit(dst_entries, src_entries->size);
+ memcpy(p, src_entries->data, src_entries->size);

> > @@ -134,11 +142,13 @@ struct xlate_cache {
> >  void xlate_cache_init(struct xlate_cache *);
> >  struct xlate_cache *xlate_cache_new(void);
> >  struct xc_entry *xlate_cache_add_entry(struct xlate_cache *, enum xc_type);
> > -void xlate_push_stats_entry(struct xc_entry *, const struct dpif_flow_stats *);
> > -void xlate_push_stats(struct xlate_cache *, const struct dpif_flow_stats *);
> > +void xlate_push_stats_entry(struct xc_entry *, struct dpif_flow_stats *);
> > +void xlate_push_stats(struct xlate_cache *, struct dpif_flow_stats *);
> >  void xlate_cache_clear_entry(struct xc_entry *);
> >  void xlate_cache_clear(struct xlate_cache *);
> >  void xlate_cache_uninit(struct xlate_cache *);
> >  void xlate_cache_delete(struct xlate_cache *);
> > +void xlate_cache_steal_entries(struct xlate_cache *, struct xlate_cache *);
> > +
>
> There's some extra unnecessary whitespace here.

[Sugesh] Will remove it.

> > +/* Validate if the transalated combined actions are OK to proceed.
> > + * If actions consist of TRUNC action, it is not allowed to do the
> > + * tunnel_push combine as it cannot update stats correctly.
> > + */
> > +static bool
> > +is_tunnel_actions_clone_ready(struct xlate_ctx *ctx)
> > +{
> > +    struct nlattr *tnl_actions;
> > +    const struct nlattr *a;
> > +    unsigned int left;
> > +    size_t actions_len;
> > +    struct ofpbuf *actions = ctx->odp_actions;
> > +
> > +    if (!actions) {
> > +        /* No actions, no harm in doing combine. */
> > +        return true;
> > +    }
> > +
> > +    /* Cannot perform tunnel push on slow path action CONTROLLER_OUTPUT. */
> > +    if (!ctx->xout->avoid_caching &&
> > +        (ctx->xout->slow & SLOW_CONTROLLER)) {
>
> Even if we are avoiding caching the flow, I still think that the controller action
> cannot be correctly handled through this new path so we should return false.
> Do you have a particular condition in mind for why we need to care about the
> 'avoid_caching' flag in this case?

[Sugesh] I don't have any specific case in mind. Will remove it and change the
condition to:
+ if (ctx->xout->slow & SLOW_CONTROLLER) {

> > +static bool
> > +validate_and_combine_post_tnl_actions(struct xlate_ctx *ctx,
> > +                                      const struct xport *xport,
> > +                                      struct xport *out_dev,
> > +                                      struct ovs_action_push_tnl tnl_push_data)
> > +{
> ...
> > +    if (ctx->odp_actions->size > push_action_size) {
> > +        /* Update the CLONE action only when combined. */
> > +        nl_msg_end_nested(ctx->odp_actions, clone_ofs);
> > +    } else {
> > +        /* No actions after the tunnel, no need of clone. */
> > +        nl_msg_cancel_nested(ctx->odp_actions, clone_ofs);
> > +        odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
>
> I looked at this line again for this version since I didn't understand
> it last time around, and I'm still a bit confused. If there are no
> actions to run on the second bridge, then the copy of the packet which
> is tunneled is effectively dropped. If that copy of the packet is
> dropped, then why do we need the tunnel action at all? If I follow
> correctly, this means that if you have two bridges, where the first
> bridge has output(tunnel),output(other_device) and the second bridge
> where the tunneling occurs has no flows, then the datapath flow will
> end up as something like:
>
> push_tnl(...),output(other_device).
>
> I realise that the testsuite breaks if you remove this line, but maybe
> the testsuite needs fixing for these cases?

[Sugesh] Actually, as I mentioned earlier, this case is not really needed for any
use case (as far as I know). I totally agree that it's not necessary to push the
tunnel when there are no actions afterwards, but some of the tunnel and OVN test
cases fail because of this. Do you want the test suites fixed as part of this
series?

> > +    }
> > +
> > +out:
> > +    /* Restore context status. */
> > +    ctx->xin->resubmit_stats = backup_resubmit_stats;
> > +    xlate_cache_delete(ctx->xin->xcache);
> > +    ctx->xin->xcache = backup_xcache;
> > +    ctx->xin->allow_side_effects = backup_side_effects;
> > +    ctx->xin->packet = backup_pkt;
> > +    ctx->wc = backup_flow_wc_ptr;
> > +    return nested_act_flag;
> > +
>
> There's some extra unnecessary whitespace here.

[Sugesh] Will remove it in the next series.
Chandran, Sugesh July 19, 2017, 8:26 a.m. UTC | #4
Regards
_Sugesh


> -----Original Message-----
> From: Joe Stringer [mailto:joe@ovn.org]
> Sent: Wednesday, July 19, 2017 2:34 AM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>
> Cc: ovs dev <dev@openvswitch.org>; Andy Zhou <azhou@ovn.org>; Zoltán
> Balogh <zoltan.balogh@ericsson.com>
> Subject: Re: [PATCH v3 3/3] tunneling: Avoid datapath-recirc by combining
> recirc actions at xlate.
>
> On 18 July 2017 at 17:40, Joe Stringer <joe@ovn.org> wrote:
> > On 18 July 2017 at 10:49, Sugesh Chandran <sugesh.chandran@intel.com> wrote:
> >> +static bool
> >> +validate_and_combine_post_tnl_actions(struct xlate_ctx *ctx,
> >> +                                      const struct xport *xport,
> >> +                                      struct xport *out_dev,
> >> +                                      struct ovs_action_push_tnl tnl_push_data)
> >> +{
> > ...
> >> +    if (ctx->odp_actions->size > push_action_size) {
> >> +        /* Update the CLONE action only when combined. */
> >> +        nl_msg_end_nested(ctx->odp_actions, clone_ofs);
> >> +    } else {
> >> +        /* No actions after the tunnel, no need of clone. */
> >> +        nl_msg_cancel_nested(ctx->odp_actions, clone_ofs);
> >> +        odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
> >
> > <snip>
> >
> > I realise that the testsuite breaks if you remove this line, but maybe
> > the testsuite needs fixing for these cases?
>
> On second thought, this should probably be a logically separate follow-up
> patch even if we agree to change it.

[Sugesh] That makes sense. Perhaps we can add a comment like:

/*
 * XXX: Only needed for the unit test cases. A few tests in 'make check'
 * don't have any actions to handle encapsulated packets.
 */

Is that OK?
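As background for the stats handling discussed in this thread: the patch's
XC_TUNNEL_HEADER xlate-cache entry adjusts the byte counter by the pushed
header length at stats-push time. A minimal stand-alone sketch of that
arithmetic (hypothetical names, not OVS's actual types) could look like:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-alone model of the XC_TUNNEL_HEADER stats fix-up:
 * a pushed tunnel header accounts for 'hdr_size' extra bytes per packet,
 * while a removed header subtracts the same amount. */
enum tnl_hdr_op { TNL_HDR_ADD, TNL_HDR_REMOVE };

struct flow_stats {
    uint64_t n_packets;
    uint64_t n_bytes;
};

static void
adjust_tunnel_hdr_stats(struct flow_stats *stats, enum tnl_hdr_op op,
                        uint16_t hdr_size)
{
    uint64_t delta = stats->n_packets * (uint64_t) hdr_size;

    if (op == TNL_HDR_ADD) {
        stats->n_bytes += delta;
    } else {
        stats->n_bytes -= delta;
    }
}
```

This mirrors only the arithmetic of the patch's `xlate_push_stats_entry()`
case; the real code operates on `struct dpif_flow_stats` inside the xlate
cache machinery.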
Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 1dd0d63..8d909de 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -5048,24 +5048,8 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
 
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
         if (*depth < MAX_RECIRC_DEPTH) {
-            struct dp_packet_batch tnl_pkt;
-            struct dp_packet_batch *orig_packets_ = packets_;
-            int err;
-
-            if (!may_steal) {
-                dp_packet_batch_clone(&tnl_pkt, packets_);
-                packets_ = &tnl_pkt;
-                dp_packet_batch_reset_cutlen(orig_packets_);
-            }
-
             dp_packet_batch_apply_cutlen(packets_);
-
-            err = push_tnl_action(pmd, a, packets_);
-            if (!err) {
-                (*depth)++;
-                dp_netdev_recirculate(pmd, packets_);
-                (*depth)--;
-            }
+            push_tnl_action(pmd, a, packets_);
             return;
         }
         break;
diff --git a/ofproto/ofproto-dpif-xlate-cache.c b/ofproto/ofproto-dpif-xlate-cache.c
index 9161701..87d2e46 100644
--- a/ofproto/ofproto-dpif-xlate-cache.c
+++ b/ofproto/ofproto-dpif-xlate-cache.c
@@ -89,7 +89,7 @@  xlate_cache_netdev(struct xc_entry *entry, const struct dpif_flow_stats *stats)
 /* Push stats and perform side effects of flow translation. */
 void
 xlate_push_stats_entry(struct xc_entry *entry,
-                       const struct dpif_flow_stats *stats)
+                       struct dpif_flow_stats *stats)
 {
     struct eth_addr dmac;
 
@@ -160,6 +160,14 @@  xlate_push_stats_entry(struct xc_entry *entry,
             entry->controller.am = NULL; /* One time only. */
         }
         break;
+    case XC_TUNNEL_HEADER:
+        if (entry->tunnel_hdr.operation == ADD) {
+            stats->n_bytes += stats->n_packets * entry->tunnel_hdr.hdr_size;
+        } else {
+            stats->n_bytes -= stats->n_packets * entry->tunnel_hdr.hdr_size;
+        }
+
+        break;
     default:
         OVS_NOT_REACHED();
     }
@@ -167,7 +175,7 @@  xlate_push_stats_entry(struct xc_entry *entry,
 
 void
 xlate_push_stats(struct xlate_cache *xcache,
-                 const struct dpif_flow_stats *stats)
+                 struct dpif_flow_stats *stats)
 {
     if (!stats->n_packets) {
         return;
@@ -245,6 +253,8 @@  xlate_cache_clear_entry(struct xc_entry *entry)
             entry->controller.am = NULL;
         }
         break;
+    case XC_TUNNEL_HEADER:
+        break;
     default:
         OVS_NOT_REACHED();
     }
@@ -279,3 +289,21 @@  xlate_cache_delete(struct xlate_cache *xcache)
     xlate_cache_uninit(xcache);
     free(xcache);
 }
+
+/* Append all the entries in src into dst and remove them from src.
+ * The caller must own both xc-caches to use this function.
+ * The 'src' entries are not freed in this function as they are owned by the caller.
+ */
+void xlate_cache_steal_entries(struct xlate_cache *dst, struct
+                               xlate_cache *src)
+{
+    if (!dst || !src) {
+        return;
+    }
+    void *p;
+    struct ofpbuf *src_entries = &src->entries;
+    struct ofpbuf *dst_entries = &dst->entries;
+    p = ofpbuf_put_zeros(dst_entries, src_entries->size);
+    nullable_memcpy(p, src_entries->data, src_entries->size);
+    ofpbuf_clear(src_entries);
+}
diff --git a/ofproto/ofproto-dpif-xlate-cache.h b/ofproto/ofproto-dpif-xlate-cache.h
index 13f7cbc..40186a2 100644
--- a/ofproto/ofproto-dpif-xlate-cache.h
+++ b/ofproto/ofproto-dpif-xlate-cache.h
@@ -52,6 +52,7 @@  enum xc_type {
     XC_GROUP,
     XC_TNL_NEIGH,
     XC_CONTROLLER,
+    XC_TUNNEL_HEADER,
 };
 
 /* xlate_cache entries hold enough information to perform the side effects of
@@ -119,6 +120,13 @@  struct xc_entry {
             struct ofproto_dpif *ofproto;
             struct ofproto_async_msg *am;
         } controller;
+        struct {
+            enum {
+                ADD,
+                REMOVE,
+            } operation;
+            uint16_t hdr_size;
+        } tunnel_hdr;
     };
 };
 
@@ -134,11 +142,13 @@  struct xlate_cache {
 void xlate_cache_init(struct xlate_cache *);
 struct xlate_cache *xlate_cache_new(void);
 struct xc_entry *xlate_cache_add_entry(struct xlate_cache *, enum xc_type);
-void xlate_push_stats_entry(struct xc_entry *, const struct dpif_flow_stats *);
-void xlate_push_stats(struct xlate_cache *, const struct dpif_flow_stats *);
+void xlate_push_stats_entry(struct xc_entry *, struct dpif_flow_stats *);
+void xlate_push_stats(struct xlate_cache *, struct dpif_flow_stats *);
 void xlate_cache_clear_entry(struct xc_entry *);
 void xlate_cache_clear(struct xlate_cache *);
 void xlate_cache_uninit(struct xlate_cache *);
 void xlate_cache_delete(struct xlate_cache *);
+void xlate_cache_steal_entries(struct xlate_cache *, struct xlate_cache *);
+
 
 #endif /* ofproto-dpif-xlate-cache.h */
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 03e7a7f..1e9b183 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -3147,6 +3147,203 @@  tnl_send_arp_request(struct xlate_ctx *ctx, const struct xport *out_dev,
     dp_packet_uninit(&packet);
 }
 
+static void
+propagate_tunnel_data_to_flow__(struct flow *dst_flow,
+                                const struct flow *src_flow,
+                                struct eth_addr dmac, struct eth_addr smac,
+                                struct in6_addr s_ip6, ovs_be32 s_ip,
+                                bool is_tnl_ipv6, uint8_t nw_proto)
+{
+    dst_flow->dl_dst = dmac;
+    dst_flow->dl_src = smac;
+
+    dst_flow->packet_type = htonl(PT_ETH);
+    dst_flow->nw_dst = src_flow->tunnel.ip_dst;
+    dst_flow->nw_src = src_flow->tunnel.ip_src;
+    dst_flow->ipv6_dst = src_flow->tunnel.ipv6_dst;
+    dst_flow->ipv6_src = src_flow->tunnel.ipv6_src;
+
+    dst_flow->nw_tos = src_flow->tunnel.ip_tos;
+    dst_flow->nw_ttl = src_flow->tunnel.ip_ttl;
+    dst_flow->tp_dst = src_flow->tunnel.tp_dst;
+    dst_flow->tp_src = src_flow->tunnel.tp_src;
+
+    if (is_tnl_ipv6) {
+        dst_flow->dl_type = htons(ETH_TYPE_IPV6);
+        if (ipv6_mask_is_any(&dst_flow->ipv6_src)
+            && !ipv6_mask_is_any(&s_ip6)) {
+            dst_flow->ipv6_src = s_ip6;
+        }
+    } else {
+        dst_flow->dl_type = htons(ETH_TYPE_IP);
+        if (dst_flow->nw_src == 0 && s_ip) {
+            dst_flow->nw_src = s_ip;
+        }
+    }
+    dst_flow->nw_proto = nw_proto;
+}
+
+/*
+ * Populate the 'flow' and 'base_flow' L3 fields to do the post tunnel push
+ * translations.
+ */
+static void
+propagate_tunnel_data_to_flow(struct xlate_ctx *ctx, struct eth_addr dmac,
+                              struct eth_addr smac,   struct in6_addr s_ip6,
+                              ovs_be32 s_ip, bool is_tnl_ipv6,
+                              enum ovs_vport_type tnl_type)
+{
+    struct flow *base_flow, *flow;
+    flow = &ctx->xin->flow;
+    base_flow = &ctx->base_flow;
+    uint8_t nw_proto = 0;
+
+    switch (tnl_type) {
+    case OVS_VPORT_TYPE_GRE:
+        nw_proto = IPPROTO_GRE;
+        break;
+    case OVS_VPORT_TYPE_VXLAN:
+    case OVS_VPORT_TYPE_GENEVE:
+        nw_proto = IPPROTO_UDP;
+        break;
+    case OVS_VPORT_TYPE_LISP:
+    case OVS_VPORT_TYPE_STT:
+    case OVS_VPORT_TYPE_UNSPEC:
+    case OVS_VPORT_TYPE_NETDEV:
+    case OVS_VPORT_TYPE_INTERNAL:
+    case __OVS_VPORT_TYPE_MAX:
+    default:
+        OVS_NOT_REACHED();
+        break;
+    }
+    /*
+     * Update base_flow first followed by flow as the dst_flow gets modified
+     * in the function.
+     */
+    propagate_tunnel_data_to_flow__(base_flow, flow, dmac, smac, s_ip6, s_ip,
+                                    is_tnl_ipv6, nw_proto);
+    propagate_tunnel_data_to_flow__(flow, flow, dmac, smac, s_ip6, s_ip,
+                                    is_tnl_ipv6, nw_proto);
+}
+
+/* Validate if the translated combined actions are OK to proceed.
+ * If actions consist of TRUNC action, it is not allowed to do the
+ * tunnel_push combine as it cannot update stats correctly.
+ */
+static bool
+is_tunnel_actions_clone_ready(struct xlate_ctx *ctx)
+{
+    struct nlattr *tnl_actions;
+    const struct nlattr *a;
+    unsigned int left;
+    size_t actions_len;
+    struct ofpbuf *actions = ctx->odp_actions;
+
+    if (!actions) {
+        /* No actions, no harm in doing combine. */
+        return true;
+    }
+
+    /* Cannot perform tunnel push on slow path action CONTROLLER_OUTPUT. */
+    if (!ctx->xout->avoid_caching &&
+        (ctx->xout->slow & SLOW_CONTROLLER)) {
+        return false;
+    }
+    actions_len = actions->size;
+
+    tnl_actions = (struct nlattr *)(actions->data);
+    NL_ATTR_FOR_EACH_UNSAFE (a, left, tnl_actions, actions_len) {
+        int type = nl_attr_type(a);
+        if (type == OVS_ACTION_ATTR_TRUNC) {
+            VLOG_DBG("Cannot do tunnel action-combine on trunc action");
+            return false;
+        }
+    }
+    return true;
+}
+
+static bool
+validate_and_combine_post_tnl_actions(struct xlate_ctx *ctx,
+                                      const struct xport *xport,
+                                      struct xport *out_dev,
+                                      struct ovs_action_push_tnl tnl_push_data)
+{
+    const struct dpif_flow_stats *backup_resubmit_stats;
+    struct xlate_cache *backup_xcache;
+    bool nested_act_flag = false;
+    struct flow_wildcards tmp_flow_wc;
+    struct flow_wildcards *backup_flow_wc_ptr;
+    bool backup_side_effects;
+    const struct dp_packet *backup_pkt;
+
+    memset(&tmp_flow_wc, 0, sizeof tmp_flow_wc);
+    backup_flow_wc_ptr = ctx->wc;
+    ctx->wc = &tmp_flow_wc;
+    ctx->xin->wc = NULL;
+    backup_resubmit_stats = ctx->xin->resubmit_stats;
+    backup_xcache = ctx->xin->xcache;
+    backup_side_effects = ctx->xin->allow_side_effects;
+    backup_pkt = ctx->xin->packet;
+
+    size_t push_action_size = 0;
+    size_t clone_ofs = nl_msg_start_nested(ctx->odp_actions,
+                                           OVS_ACTION_ATTR_CLONE);
+    odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
+    push_action_size = ctx->odp_actions->size;
+
+    ctx->xin->resubmit_stats = NULL;
+    ctx->xin->xcache = xlate_cache_new(); /* Use new temporary cache. */
+    ctx->xin->allow_side_effects = false;
+    ctx->xin->packet = NULL;
+
+    /* Push the cache entry for the tunnel first. */
+    struct xc_entry *entry;
+    entry = xlate_cache_add_entry(ctx->xin->xcache, XC_TUNNEL_HEADER);
+    entry->tunnel_hdr.hdr_size = tnl_push_data.header_len;
+    entry->tunnel_hdr.operation = ADD;
+
+    apply_nested_clone_actions(ctx, xport, out_dev);
+    nested_act_flag = is_tunnel_actions_clone_ready(ctx);
+
+    if (nested_act_flag) {
+        /* Similar to the stats update in revalidation, the xcache entries
+         * populated by the previous translation are used to update the
+         * stats correctly.
+         */
+        if (backup_resubmit_stats) {
+            struct dpif_flow_stats tmp_resubmit_stats;
+            memcpy(&tmp_resubmit_stats, backup_resubmit_stats,
+                   sizeof tmp_resubmit_stats);
+            xlate_push_stats(ctx->xin->xcache, &tmp_resubmit_stats);
+        }
+        xlate_cache_steal_entries(backup_xcache, ctx->xin->xcache);
+    } else {
+        /* Combine is not valid. */
+        nl_msg_cancel_nested(ctx->odp_actions, clone_ofs);
+        goto out;
+    }
+    if (ctx->odp_actions->size > push_action_size) {
+        /* Update the CLONE action only when combined. */
+        nl_msg_end_nested(ctx->odp_actions, clone_ofs);
+    } else {
+        /* No actions after the tunnel, no need of clone. */
+        nl_msg_cancel_nested(ctx->odp_actions, clone_ofs);
+        odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
+    }
+
+out:
+    /* Restore context status. */
+    ctx->xin->resubmit_stats = backup_resubmit_stats;
+    xlate_cache_delete(ctx->xin->xcache);
+    ctx->xin->xcache = backup_xcache;
+    ctx->xin->allow_side_effects = backup_side_effects;
+    ctx->xin->packet = backup_pkt;
+    ctx->wc = backup_flow_wc_ptr;
+    return nested_act_flag;
+}
+
 static int
 build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
                   const struct flow *flow, odp_port_t tunnel_odp_port)
@@ -3163,6 +3360,14 @@  build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
     char buf_sip6[INET6_ADDRSTRLEN];
     char buf_dip6[INET6_ADDRSTRLEN];
 
+    /* Structures to backup Ethernet and IP of base_flow. */
+    struct flow old_base_flow;
+    struct flow old_flow;
+
+    /* Backup flow & base_flow data. */
+    memcpy(&old_base_flow, &ctx->base_flow, sizeof old_base_flow);
+    memcpy(&old_flow, &ctx->xin->flow, sizeof old_flow);
+
     err = tnl_route_lookup_flow(flow, &d_ip6, &s_ip6, &out_dev);
     if (err) {
         xlate_report(ctx, OFT_WARN, "native tunnel routing failed");
@@ -3222,12 +3427,31 @@  build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
     tnl_push_data.tnl_port = tunnel_odp_port;
     tnl_push_data.out_port = out_dev->odp_port;
 
-    /* After tunnel header has been added, packet_type of flow and base_flow
-     * need to be set to PT_ETH. */
-    ctx->xin->flow.packet_type = htonl(PT_ETH);
-    ctx->base_flow.packet_type = htonl(PT_ETH);
+    /* After the tunnel header has been added, the MAC and IP data of flow
+     * and base_flow need to be set properly, since there is no recirculation
+     * any more when sending the packet to the tunnel. */
 
-    odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
+    propagate_tunnel_data_to_flow(ctx, dmac, smac, s_ip6, s_ip,
+                                  tnl_params.is_ipv6, tnl_push_data.tnl_type);
+
+    /* Try to combine the post-tunnel actions into the clone action.
+     * Fall back to explicit recirculation if the combine is not valid.
+     */
+    if (!validate_and_combine_post_tnl_actions(ctx, xport, out_dev,
+                                               tnl_push_data)) {
+        /* Datapath is not doing the recirculation now, so let's make it
+         * happen explicitly.
+         */
+        size_t clone_ofs = nl_msg_start_nested(ctx->odp_actions,
+                                               OVS_ACTION_ATTR_CLONE);
+        odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
+        nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC, 0);
+        nl_msg_end_nested(ctx->odp_actions, clone_ofs);
+    }
+    /* Restore the flows after the translation. */
+    memcpy(&ctx->xin->flow, &old_flow, sizeof ctx->xin->flow);
+    memcpy(&ctx->base_flow, &old_base_flow, sizeof ctx->base_flow);
     return 0;
 }
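The clone bookkeeping in build_tunnel_send() and validate_and_combine_post_tnl_actions() follows one pattern: open a nested CLONE attribute, emit tnl_push, translate the post-tunnel actions, then either commit the wrapper or cancel it and emit tnl_push bare. A self-contained toy sketch of that pattern, using a plain string buffer and hypothetical helper names (`start_nested`, `cancel_nested`, `emit_tunnel_actions`) rather than the real OVS `nl_msg_*`/`ofpbuf` API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for the odp_actions buffer and netlink nesting helpers
 * (hypothetical, for illustration only): start_nested() remembers where a
 * nested attribute begins, end_nested() commits it, cancel_nested() rolls
 * the buffer back as if the attribute had never been started. */
struct buf {
    char data[256];
    size_t size;
};

static void put(struct buf *b, const char *s)
{
    size_t n = strlen(s);
    memcpy(b->data + b->size, s, n);
    b->size += n;
}

static size_t start_nested(struct buf *b, const char *tag)
{
    size_t ofs = b->size;           /* Offset to rewind to on cancel. */
    put(b, tag);
    return ofs;
}

static void end_nested(struct buf *b)
{
    put(b, ")");
}

static void cancel_nested(struct buf *b, size_t ofs)
{
    b->size = ofs;                  /* Drop everything since start. */
}

static const char *buf_str(struct buf *b)
{
    b->data[b->size] = '\0';
    return b->data;
}

/* Mirrors the decision in the patch: keep the clone wrapper only when
 * post-tunnel actions were actually combined behind tnl_push; otherwise
 * cancel the wrapper and emit tnl_push on its own. */
static void emit_tunnel_actions(struct buf *b, bool has_post_actions)
{
    size_t clone_ofs = start_nested(b, "clone(");
    put(b, "tnl_push");
    size_t push_action_size = b->size;

    if (has_post_actions) {
        put(b, ",set,output");      /* Combined post-tunnel actions. */
    }
    if (b->size > push_action_size) {
        end_nested(b);              /* Keep the clone wrapper. */
    } else {
        cancel_nested(b, clone_ofs);  /* No actions followed: drop it */
        put(b, "tnl_push");           /* and emit tnl_push bare. */
    }
}
```

This is why the test expectations below change from bare `tnl_push(...)` to `clone(tnl_push(...),...)` only for flows that carry combined post-tunnel actions.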
 
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 7cf1a40..50f440f 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -4601,7 +4601,7 @@  packet_xlate_revert(struct ofproto *ofproto OVS_UNUSED,
 static void
 ofproto_dpif_xcache_execute(struct ofproto_dpif *ofproto,
                             struct xlate_cache *xcache,
-                            const struct dpif_flow_stats *stats)
+                            struct dpif_flow_stats *stats)
     OVS_REQUIRES(ofproto_mutex)
 {
     struct xc_entry *entry;
@@ -4634,6 +4634,7 @@  ofproto_dpif_xcache_execute(struct ofproto_dpif *ofproto,
         case XC_GROUP:
         case XC_TNL_NEIGH:
         case XC_CONTROLLER:
+        case XC_TUNNEL_HEADER:
             xlate_push_stats_entry(entry, stats);
             break;
         default:
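The new XC_TUNNEL_HEADER case above lets the stats walk account for the encapsulation header bytes that the combined flow now carries (the patch adds an `ADD`-operation entry sized `tnl_push_data.header_len` before translating the nested actions). A minimal sketch of that accounting, with hypothetical names (`push_stats_entry`, `ENTRY_TUNNEL_HEADER`) rather than the real xlate cache types:

```c
#include <assert.h>

/* Miniature model of the tunnel-header stats adjustment: an
 * XC_TUNNEL_HEADER-style entry credits (ADD) or debits (REMOVE) the
 * header bytes per packet, so a flow whose tnl_push was combined into a
 * clone reports the same byte counts the recirculated flows would have. */
enum entry_type { ENTRY_RULE, ENTRY_TUNNEL_HEADER };
enum tnl_hdr_op { HDR_ADD, HDR_REMOVE };

struct flow_stats {
    unsigned long long n_packets;
    unsigned long long n_bytes;
};

static void push_stats_entry(enum entry_type type, enum tnl_hdr_op op,
                             unsigned int hdr_size, struct flow_stats *stats)
{
    if (type == ENTRY_TUNNEL_HEADER) {
        unsigned long long delta =
            (unsigned long long) hdr_size * stats->n_packets;
        if (op == HDR_ADD) {
            stats->n_bytes += delta;   /* Encap: header bytes added. */
        } else {
            stats->n_bytes -= delta;   /* Decap: header bytes removed. */
        }
    }
    /* Other entry types leave the stats untouched in this sketch. */
}
```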
diff --git a/tests/packet-type-aware.at b/tests/packet-type-aware.at
index 6c287c5..883d9a0 100644
--- a/tests/packet-type-aware.at
+++ b/tests/packet-type-aware.at
@@ -326,8 +326,7 @@  ovs-appctl time/warp 1000
 AT_CHECK([
     ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used | grep -v ipv6 | sort
 ], [0], [flow-dump from non-dpdk interfaces:
-recirc_id(0),in_port(br-p1),packet_type(ns=0,id=0),eth(dst=aa:55:00:00:00:03),eth_type(0x0800),ipv4(src=10.0.0.1,dst=10.0.0.3,proto=47,frag=no), packets:1, bytes:136, used:0.0s, actions:set(ipv4(src=30.0.0.1,dst=30.0.0.3)),tnl_pop(gre_sys)
-recirc_id(0),in_port(n1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.30,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:03,src=aa:55:00:00:00:01,dl_type=0x0800),ipv4(src=10.0.0.1,dst=10.0.0.3,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p1))
+recirc_id(0),in_port(n1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.30,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:03,src=aa:55:00:00:00:01,dl_type=0x0800),ipv4(src=10.0.0.1,dst=10.0.0.3,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p1)),set(ipv4(src=30.0.0.1,dst=30.0.0.3)),tnl_pop(gre_sys))
 tunnel(src=30.0.0.1,dst=30.0.0.3,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(dst=192.168.10.30,frag=no), packets:1, bytes:98, used:0.0s, actions:set(eth(dst=aa:55:aa:55:00:03)),n3
 ])
 
@@ -345,8 +344,7 @@  ovs-appctl time/warp 1000
 AT_CHECK([
     ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used | grep -v ipv6 | sort
 ], [0], [flow-dump from non-dpdk interfaces:
-recirc_id(0),in_port(br-p1),packet_type(ns=0,id=0),eth(dst=aa:55:00:00:00:02),eth_type(0x0800),ipv4(src=10.0.0.1,dst=10.0.0.2,proto=47,frag=no), packets:1, bytes:136, used:0.0s, actions:set(ipv4(src=20.0.0.1,dst=20.0.0.2)),tnl_pop(gre_sys)
-recirc_id(0),in_port(n1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.20,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:02,src=aa:55:00:00:00:01,dl_type=0x0800),ipv4(src=10.0.0.1,dst=10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p1))
+recirc_id(0),in_port(n1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.20,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:02,src=aa:55:00:00:00:01,dl_type=0x0800),ipv4(src=10.0.0.1,dst=10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p1)),set(ipv4(src=20.0.0.1,dst=20.0.0.2)),tnl_pop(gre_sys))
 tunnel(src=20.0.0.1,dst=20.0.0.2,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=0,id=0),eth(dst=46:1e:7d:1a:95:a1),eth_type(0x0800),ipv4(dst=192.168.10.20,frag=no), packets:1, bytes:98, used:0.0s, actions:set(eth(dst=aa:55:aa:55:00:02)),n2
 ])
 
@@ -364,8 +362,7 @@  ovs-appctl time/warp 1000
 AT_CHECK([
     ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used | grep -v ipv6 | sort
 ], [0], [flow-dump from non-dpdk interfaces:
-recirc_id(0),in_port(br-p2),packet_type(ns=0,id=0),eth(dst=aa:55:00:00:00:01),eth_type(0x0800),ipv4(src=20.0.0.2,dst=20.0.0.1,proto=47,frag=no), packets:1, bytes:136, used:0.0s, actions:set(ipv4(src=10.0.0.2,dst=10.0.0.1)),tnl_pop(gre_sys)
-recirc_id(0),in_port(n2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.10,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:01,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(src=20.0.0.2,dst=20.0.0.1,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p2))
+recirc_id(0),in_port(n2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.10,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:01,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(src=20.0.0.2,dst=20.0.0.1,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p2)),set(ipv4(src=10.0.0.2,dst=10.0.0.1)),tnl_pop(gre_sys))
 tunnel(src=10.0.0.2,dst=10.0.0.1,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=0,id=0),eth(dst=3a:6d:d2:09:9c:ab),eth_type(0x0800),ipv4(dst=192.168.10.10,frag=no), packets:1, bytes:98, used:0.0s, actions:set(eth(dst=aa:55:aa:55:00:01)),n1
 ])
 
@@ -383,10 +380,8 @@  ovs-appctl time/warp 1000
 AT_CHECK([
     ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used | grep -v ipv6 | sort
 ], [0], [flow-dump from non-dpdk interfaces:
-recirc_id(0),in_port(br-p1),packet_type(ns=0,id=0),eth(dst=aa:55:00:00:00:03),eth_type(0x0800),ipv4(src=10.0.0.1,dst=10.0.0.3,proto=47,frag=no), packets:1, bytes:136, used:0.0s, actions:set(ipv4(src=30.0.0.1,dst=30.0.0.3)),tnl_pop(gre_sys)
-recirc_id(0),in_port(br-p2),packet_type(ns=0,id=0),eth(dst=aa:55:00:00:00:01),eth_type(0x0800),ipv4(src=20.0.0.2,dst=20.0.0.1,proto=47,frag=no), packets:1, bytes:136, used:0.0s, actions:set(ipv4(src=10.0.0.2,dst=10.0.0.1)),tnl_pop(gre_sys)
-recirc_id(0),in_port(n2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.30,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:01,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(src=20.0.0.2,dst=20.0.0.1,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p2))
-tunnel(src=10.0.0.2,dst=10.0.0.1,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.30,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:03,src=aa:55:00:00:00:01,dl_type=0x0800),ipv4(src=10.0.0.1,dst=10.0.0.3,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p1))
+recirc_id(0),in_port(n2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.30,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:01,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(src=20.0.0.2,dst=20.0.0.1,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p2)),set(ipv4(src=10.0.0.2,dst=10.0.0.1)),tnl_pop(gre_sys))
+tunnel(src=10.0.0.2,dst=10.0.0.1,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.30,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:03,src=aa:55:00:00:00:01,dl_type=0x0800),ipv4(src=10.0.0.1,dst=10.0.0.3,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p1)),set(ipv4(src=30.0.0.1,dst=30.0.0.3)),tnl_pop(gre_sys))
 tunnel(src=30.0.0.1,dst=30.0.0.3,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(dst=192.168.10.30,frag=no), packets:1, bytes:98, used:0.0s, actions:set(eth(dst=aa:55:aa:55:00:03)),n3
 ])
 
@@ -404,11 +399,9 @@  ovs-appctl time/warp 1000
 AT_CHECK([
     ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used | grep -v ipv6 | sort
 ], [0], [flow-dump from non-dpdk interfaces:
-recirc_id(0),in_port(br-p2),packet_type(ns=0,id=0),eth(dst=aa:55:00:00:00:01),eth_type(0x0800),ipv4(src=20.0.0.2,dst=20.0.0.1,proto=47,frag=no), packets:1, bytes:122, used:0.0s, actions:set(ipv4(src=10.0.0.2,dst=10.0.0.1)),tnl_pop(gre_sys)
-recirc_id(0),in_port(br-p3),packet_type(ns=0,id=0),eth(dst=aa:55:00:00:00:02),eth_type(0x0800),ipv4(src=30.0.0.3,dst=30.0.0.2,proto=47,frag=no), packets:1, bytes:122, used:0.0s, actions:set(ipv4(src=20.0.0.3,dst=20.0.0.2)),tnl_pop(gre_sys)
-recirc_id(0),in_port(n3),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.10,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:pop_eth,tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:02,src=aa:55:00:00:00:03,dl_type=0x0800),ipv4(src=30.0.0.3,dst=30.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x800))),out_port(br-p3))
+recirc_id(0),in_port(n3),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.10,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:02,src=aa:55:00:00:00:03,dl_type=0x0800),ipv4(src=30.0.0.3,dst=30.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x800))),out_port(br-p3)),set(ipv4(src=20.0.0.3,dst=20.0.0.2)),tnl_pop(gre_sys))
 tunnel(src=10.0.0.2,dst=10.0.0.1,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=1,id=0x800),ipv4(dst=192.168.10.10,frag=no), packets:1, bytes:84, used:0.0s, actions:push_eth(src=00:00:00:00:00:00,dst=aa:55:aa:55:00:01),n1
-tunnel(src=20.0.0.3,dst=20.0.0.2,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=1,id=0x800),ipv4(dst=192.168.10.10,tos=0/0x3,frag=no), packets:1, bytes:84, used:0.0s, actions:tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:01,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(src=20.0.0.2,dst=20.0.0.1,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x800))),out_port(br-p2))
+tunnel(src=20.0.0.3,dst=20.0.0.2,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=1,id=0x800),ipv4(dst=192.168.10.10,tos=0/0x3,frag=no), packets:1, bytes:84, used:0.0s, actions:clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:01,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(src=20.0.0.2,dst=20.0.0.1,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x800))),out_port(br-p2)),set(ipv4(src=10.0.0.2,dst=10.0.0.1)),tnl_pop(gre_sys))
 ])
 
 # Clear up megaflow cache
@@ -425,8 +418,7 @@  ovs-appctl time/warp 1000
 AT_CHECK([
     ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used | grep -v ipv6 | sort
 ], [0], [flow-dump from non-dpdk interfaces:
-recirc_id(0),in_port(br-p3),packet_type(ns=0,id=0),eth(dst=aa:55:00:00:00:02),eth_type(0x0800),ipv4(src=30.0.0.3,dst=30.0.0.2,proto=47,frag=no), packets:1, bytes:136, used:0.0s, actions:set(ipv4(src=20.0.0.3,dst=20.0.0.2)),tnl_pop(gre_sys)
-recirc_id(0),in_port(n3),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.20,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:02,src=aa:55:00:00:00:03,dl_type=0x0800),ipv4(src=30.0.0.3,dst=30.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p3))
+recirc_id(0),in_port(n3),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.20,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:02,src=aa:55:00:00:00:03,dl_type=0x0800),ipv4(src=30.0.0.3,dst=30.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x6558))),out_port(br-p3)),set(ipv4(src=20.0.0.3,dst=20.0.0.2)),tnl_pop(gre_sys))
 tunnel(src=20.0.0.3,dst=20.0.0.2,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=0,id=0),eth(dst=46:1e:7d:1a:95:a1),eth_type(0x0800),ipv4(dst=192.168.10.20,frag=no), packets:1, bytes:98, used:0.0s, actions:set(eth(dst=aa:55:aa:55:00:02)),n2
 ])
 
@@ -512,8 +504,7 @@  ovs-appctl time/warp 1000
 AT_CHECK([
     ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used | grep -v ipv6 | sort
 ], [0], [flow-dump from non-dpdk interfaces:
-recirc_id(0),in_port(br-p3),packet_type(ns=0,id=0),eth(dst=aa:55:00:00:00:02),eth_type(0x0800),ipv4(src=30.0.0.3,dst=30.0.0.2,proto=47,frag=no), packets:1, bytes:122, used:0.0s, actions:set(ipv4(src=20.0.0.3,dst=20.0.0.2)),tnl_pop(gre_sys)
-recirc_id(0),in_port(n3),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.20,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:pop_eth,tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:02,src=aa:55:00:00:00:03,dl_type=0x0800),ipv4(src=30.0.0.3,dst=30.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x800))),out_port(br-p3))
+recirc_id(0),in_port(n3),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(dst=192.168.10.20,tos=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst=aa:55:00:00:00:02,src=aa:55:00:00:00:03,dl_type=0x0800),ipv4(src=30.0.0.3,dst=30.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x800))),out_port(br-p3)),set(ipv4(src=20.0.0.3,dst=20.0.0.2)),tnl_pop(gre_sys))
 tunnel(src=20.0.0.3,dst=20.0.0.2,flags(-df-csum)),recirc_id(0),in_port(gre_sys),packet_type(ns=1,id=0x800),ipv4(dst=192.168.10.20,frag=no), packets:1, bytes:84, used:0.0s, actions:drop
 ])