[ovs-dev,v5] Avoid dp_hash recirculation for balance-tcp bond selection mode

Message ID 1565254571-25137-2-git-send-email-vishal.deep.ajmera@ericsson.com
State Changes Requested

Commit Message

Vishal Deep Ajmera Aug. 8, 2019, 8:56 a.m. UTC
Problem:
--------
In OVS-DPDK, flows with output over a bond interface of type “balance-tcp”
(using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into
"HASH" and "RECIRC" datapath actions. After recirculation, the packet is
forwarded to the bond member port based on 8-bits of the datapath hash
value computed through dp_hash. This causes performance degradation in the
following ways:

1. The recirculation of the packet implies another lookup of the packet’s
flow key in the exact match cache (EMC) and potentially Megaflow classifier
(DPCLS). This is the biggest cost factor.

2. The recirculated packets have a new “RSS” hash and compete with the
original packets for the scarce number of EMC slots. This implies more
EMC misses and potentially EMC thrashing causing costly DPCLS lookups.

3. The 256 extra megaflow entries per bond for dp_hash bond selection put
additional load on the revalidation threads.

Owing to this performance degradation, deployments stick to “balance-slb”
bond mode even though it does not do active-active load balancing for
VXLAN- and GRE-tunnelled traffic because all tunnel packets have the same
source MAC address.

Proposed optimization:
----------------------
This proposal introduces a new load-balancing output action instead of
recirculation.

Maintain one table per-bond (could just be an array of uint16's) and
program it the same way internal flows are created today for each possible
hash value (256 entries) from ofproto layer. Use this table to load-balance
flows as part of output action processing.
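
As a rough illustration of the table described above, here is a minimal
sketch. The names bond_table, bond_table_fill and bond_table_select are
hypothetical and are not taken from the patch; only the 256-entry size and
the use of the low 8 bits of the hash come from the description.

```c
#include <assert.h>
#include <stdint.h>

#define N_BUCKETS 256                 /* One entry per 8-bit hash value. */
#define BUCKET_MASK (N_BUCKETS - 1)

/* Hypothetical per-bond table: bucket index -> member port id. */
struct bond_table {
    uint16_t member[N_BUCKETS];
};

/* Fills the table by cycling over 'n_members' ports.  This stands in for
 * the bucket assignment that the ofproto layer would program. */
static void
bond_table_fill(struct bond_table *t, const uint16_t *ports, int n_members)
{
    assert(n_members > 0);
    for (int i = 0; i < N_BUCKETS; i++) {
        t->member[i] = ports[i % n_members];
    }
}

/* Picks the output port for a packet from the low 8 bits of its hash. */
static uint16_t
bond_table_select(const struct bond_table *t, uint32_t hash)
{
    return t->member[hash & BUCKET_MASK];
}
```

With two member ports {1, 2}, a packet whose hash ends in 0xbd (an odd
bucket) would go out on port 2 in this sketch, with no second flow lookup.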

Currently xlate_normal() -> output_normal() -> bond_update_post_recirc_rules()
-> bond_may_recirc() and compose_output_action__() generate
“dp_hash(hash_l4(0))” and “recirc(<RecircID>)” actions. In this case the
RecircID identifies the bond. For the recirculated packets the ofproto layer
installs megaflow entries that match on RecircID and masked dp_hash and send
them to the corresponding output port.

Instead, we will now generate actions as
    "hash(l4(0)),lb_output(bond,<bond id>)"

This combines hash computation (only if needed, else re-use RSS hash) and
inline load-balancing over the bond. This action is used *only* for balance-tcp
bonds in OVS-DPDK datapath (the OVS kernel datapath remains unchanged).
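
The "only if needed" part can be pictured as follows. This is a sketch
under assumptions: struct pkt, pkt_l4_hash and software_l4_hash are
hypothetical names, and the FNV-style mixing is only a placeholder for
whatever 5-tuple hash the datapath really uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t proto;
};

/* Hypothetical packet metadata: an RSS hash may already be present. */
struct pkt {
    bool rss_valid;
    uint32_t rss_hash;
    struct five_tuple key;
};

/* Placeholder mixing step (FNV-style), not the real datapath hash. */
static uint32_t
mix(uint32_t h, uint32_t v)
{
    return (h ^ v) * 16777619u;
}

static uint32_t
software_l4_hash(const struct five_tuple *k)
{
    uint32_t h = 2166136261u;

    h = mix(h, k->src_ip);
    h = mix(h, k->dst_ip);
    h = mix(h, ((uint32_t) k->src_port << 16) | k->dst_port);
    h = mix(h, k->proto);
    return h;
}

/* Re-uses the RSS hash when valid, otherwise computes and caches one. */
static uint32_t
pkt_l4_hash(struct pkt *p)
{
    assert(p);
    if (!p->rss_valid) {
        p->rss_hash = software_l4_hash(&p->key);
        p->rss_valid = true;
    }
    return p->rss_hash;
}
```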

Example:
--------
Current scheme:
---------------
With 1 IP-UDP flow:

flow-dump from pmd on cpu core: 2
recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2828969, bytes:181054016, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)

recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:2828937, bytes:181051968, used:0.000s, actions:2

With 8 IP-UDP flows (with random UDP src port, all hitting the same DPCLS):

flow-dump from pmd on cpu core: 2
recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)

recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:377395, bytes:24153280, used:0.000s, actions:2
recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:333486, bytes:21343104, used:0.000s, actions:1
recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:348461, bytes:22301504, used:0.000s, actions:1
recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:633353, bytes:40534592, used:0.000s, actions:2
recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:319901, bytes:20473664, used:0.001s, actions:2
recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:334985, bytes:21439040, used:0.001s, actions:1
recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:326404, bytes:20889856, used:0.001s, actions:1
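
For readers unfamiliar with the value/mask notation in the dump above, the
following sketch shows the comparison a "dp_hash(value/0xff)" match
implies: only the low 8 bits take part. The helper names are hypothetical;
the hash values are copied from the dump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BOND_HASH_MASK 0xffu

/* True if 'hash' equals 'value' on the bits selected by 'mask'. */
static bool
masked_hash_match(uint32_t hash, uint32_t value, uint32_t mask)
{
    return (hash & mask) == (value & mask);
}

/* Bucket selected by a datapath hash under the 8-bit bond mask. */
static uint32_t
hash_bucket(uint32_t hash)
{
    return hash & BOND_HASH_MASK;
}

/* Checks against two of the dp_hash values in the flow dump. */
static void
check_dump_examples(void)
{
    assert(hash_bucket(0xf8e02b7eu) == 0x7e);
    assert(hash_bucket(0xb236c260u) == 0x60);
    assert(masked_hash_match(0x0000007eu, 0xf8e02b7eu, BOND_HASH_MASK));
    assert(!masked_hash_match(0xb236c260u, 0xf8e02b7eu, BOND_HASH_MASK));
}
```

Each distinct low byte therefore needs its own megaflow entry, which is
where the up to 256 extra entries per bond come from.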

New scheme:
-----------
We can do with a single flow entry (for any number of new flows):

in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:hash(l4(0)),lb_output(bond,1)

A new CLI has been added to dump the per-PMD bond cache as given below.

“sudo ovs-appctl dpif-netdev/pmd-bond-show”

root@ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/pmd-bond-show
pmd thread numa_id 0 core_id 4:
Bond cache:
        bond-id 1 :
                bucket 0 - slave 2
                bucket 1 - slave 1
                bucket 2 - slave 2
                bucket 3 - slave 1

Performance improvement:
------------------------
With a prototype of the proposed idea, the following perf improvement is seen
with Phy-VM-Phy UDP traffic, single flow. With multiple flows, the improvement
is even larger (due to the reduced number of datapath flow entries).

1 VM:
*****
+--------------------------------------+
|                 mpps                 |
+--------------------------------------+
| Flows  master  with-opt.   %delta    |
+--------------------------------------+
| 1      4.53    5.89        29.96     |
| 10     4.16    5.89        41.51     |
| 400    3.55    5.55        56.22     |
| 1k     3.44    5.45        58.30     |
| 10k    2.50    4.63        85.34     |
| 100k   2.29    4.27        86.15     |
| 500k   2.25    4.27        89.23     |
+--------------------------------------+
mpps: million packets per second.
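
The %delta column is presumably the relative gain over master, i.e.
(with-opt - master) / master * 100. A small sketch to recompute it follows;
pct_delta is a hypothetical helper. Recomputing from the rounded mpps
values in the table gives results within about one percentage point of the
listed figures (for example roughly 30.0 for 1 flow against 29.96 listed),
which suggests the column was derived from unrounded measurements.

```c
#include <assert.h>

/* Relative improvement of 'opt' over 'base', in percent. */
static double
pct_delta(double base, double opt)
{
    assert(base > 0.0);
    return (opt - base) / base * 100.0;
}
```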

Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>

CC: Jan Scheurich <jan.scheurich@ericsson.com>
CC: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
CC: Nitin Katiyar <nitin.katiyar@ericsson.com>
---
 datapath/linux/compat/include/linux/openvswitch.h |   2 +
 lib/dpif-netdev.c                                 | 528 ++++++++++++++++++++--
 lib/dpif-netlink.c                                |   3 +
 lib/dpif-provider.h                               |   8 +
 lib/dpif.c                                        |  48 ++
 lib/dpif.h                                        |   7 +
 lib/odp-execute.c                                 |   2 +
 lib/odp-util.c                                    |   4 +
 ofproto/bond.c                                    |  52 ++-
 ofproto/bond.h                                    |   9 +
 ofproto/ofproto-dpif-ipfix.c                      |   1 +
 ofproto/ofproto-dpif-sflow.c                      |   1 +
 ofproto/ofproto-dpif-xlate.c                      |  65 ++-
 ofproto/ofproto-dpif.c                            |  32 ++
 ofproto/ofproto-dpif.h                            |  12 +-
 tests/lacp.at                                     |   9 +
 vswitchd/bridge.c                                 |   4 +
 vswitchd/vswitch.xml                              |  10 +
 18 files changed, 736 insertions(+), 61 deletions(-)

Comments

Matteo Croce Aug. 10, 2019, 12:36 a.m. UTC | #1
On Thu, Aug 8, 2019 at 10:57 AM Vishal Deep Ajmera
<vishal.deep.ajmera@ericsson.com> wrote:
> --- a/ofproto/bond.c
> +++ b/ofproto/bond.c
> @@ -1939,3 +1959,9 @@ bond_get_changed_active_slave(const char *name, struct eth_addr *mac,
>
>      return false;
>  }
> +
> +bool
> +bond_get_cache_mode(const struct bond *bond)
> +{
> +    return bond->use_bond_cache;
> +}

Hi,

why not a static function in the header file? So it gets inlined.
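
Something along these lines, as a rough sketch only. It assumes the
definition of struct bond (or at least this field) is visible from the
header, which may not be the case in the current tree; the struct below is
a trimmed, hypothetical stand-in.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for the real 'struct bond'. */
struct bond {
    bool use_bond_cache;
};

/* Header-level accessor: 'static inline' lets every includer inline it. */
static inline bool
bond_get_cache_mode(const struct bond *bond)
{
    assert(bond);
    return bond->use_bond_cache;
}
```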

Regards,
Vishal Deep Ajmera Aug. 13, 2019, 7:52 a.m. UTC | #2
> 
> Hi,
> 
> why not a static function in the header file? So it gets inlined.
> 
> Regards,
> --
> Matteo Croce
> per aspera ad upstream

Thanks, Matteo, for looking into this patch set. Yes, I agree. I will address your suggestion in the next revision.

Warm Regards,
Vishal

Ilya Maximets Aug. 26, 2019, 2:02 p.m. UTC | #3
Hi.
Not a full review. A few comments inline.

Best regards, Ilya Maximets.

On 08.08.2019 11:56, Vishal Deep Ajmera wrote:
> ---

The patch version history should go here.
Otherwise it's hard to track changes.

>  datapath/linux/compat/include/linux/openvswitch.h |   2 +
>  lib/dpif-netdev.c                                 | 528 ++++++++++++++++++++--
>  lib/dpif-netlink.c                                |   3 +
>  lib/dpif-provider.h                               |   8 +
>  lib/dpif.c                                        |  48 ++
>  lib/dpif.h                                        |   7 +
>  lib/odp-execute.c                                 |   2 +
>  lib/odp-util.c                                    |   4 +
>  ofproto/bond.c                                    |  52 ++-
>  ofproto/bond.h                                    |   9 +
>  ofproto/ofproto-dpif-ipfix.c                      |   1 +
>  ofproto/ofproto-dpif-sflow.c                      |   1 +
>  ofproto/ofproto-dpif-xlate.c                      |  65 ++-
>  ofproto/ofproto-dpif.c                            |  32 ++
>  ofproto/ofproto-dpif.h                            |  12 +-
>  tests/lacp.at                                     |   9 +
>  vswitchd/bridge.c                                 |   4 +
>  vswitchd/vswitch.xml                              |  10 +
>  18 files changed, 736 insertions(+), 61 deletions(-)
> 
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
> index 65a003a..6dafcfb 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -734,6 +734,7 @@ enum ovs_hash_alg {
>  	OVS_HASH_ALG_L4,
>  #ifndef __KERNEL__
>  	OVS_HASH_ALG_SYM_L4,
> +        OVS_HASH_ALG_L4_RSS,

Linux kernel coding style should be used in kernel headers (the added line
is indented with spaces, while the neighbouring lines use a tab).

>  #endif
>  	__OVS_HASH_MAX
>  };
> @@ -989,6 +990,7 @@ enum ovs_action_attr {
>  #ifndef __KERNEL__
>  	OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
>  	OVS_ACTION_ATTR_TUNNEL_POP,    /* u32 port number. */
> +	OVS_ACTION_ATTR_LB_OUTPUT,     /* bond-id */

The size of a bond-id?

>  #endif
>  	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
>  				       * from userspace. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index d0a1c58..9db2a73 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -79,6 +79,7 @@
>  #include "unixctl.h"
>  #include "util.h"
>  #include "uuid.h"
> +#include "ofproto/bond.h"
>  
>  VLOG_DEFINE_THIS_MODULE(dpif_netdev);
>  
> @@ -366,6 +367,11 @@ struct dp_netdev {
>  
>      struct conntrack *conntrack;
>      struct pmd_auto_lb pmd_alb;
> +    /* Bonds.
> +     *
> +     * Any lookup into 'bonds' requires taking 'bond_mutex'. */
> +    struct ovs_mutex bond_mutex;
> +    struct hmap bonds;
>  };
>  
>  static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> @@ -596,6 +602,20 @@ struct tx_port {
>      struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>  };
>  
> +/* Contained by struct tx_bond 'slave_buckets' */
> +struct slave_entry {
> +    uint32_t slave_id;
> +    atomic_ullong n_packets;
> +    atomic_ullong n_bytes;
> +};
> +
> +/* Contained by struct dp_netdev_pmd_thread's 'bond_cache' or 'tx_bonds'. */
> +struct tx_bond {
> +    struct hmap_node node;
> +    uint32_t bond_id;
> +    struct slave_entry slave_buckets[BOND_BUCKETS];
> +};
> +
>  /* A set of properties for the current processing loop that is not directly
>   * associated with the pmd thread itself, but with the packets being
>   * processed or the short-term system configuration (for example, time).
> @@ -708,6 +728,11 @@ struct dp_netdev_pmd_thread {
>      atomic_bool reload_tx_qid;      /* Do we need to reload static_tx_qid? */
>      atomic_bool exit;               /* For terminating the pmd thread. */
>  
> +    atomic_bool reload_bond_cache;  /* Do we need to load tx bond cache?
> +                                     * Note: This flag is decoupled from 'reload'
> +                                     * flag otherwise full pmd reload will become
> +                                     * frequent and costly everytime bond
> +                                     * rebalancing is done. */

I'm not sure this is better than a full PMD reload. There was some recent
work on reload optimization, and a reload doesn't clear any caches, so
it should not be very disruptive to reload every ~10 seconds (the default
bond rebalancing interval). Did you measure the reload time in your case,
and how much better is this than a normal reload?
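
For reference, the lightweight pattern under discussion is roughly the
following. This is a sketch with C11 atomics and hypothetical names
(pmd_ctx, request_bond_cache_reload, maybe_reload_bond_cache), not the
patch's code. Unlike the patch, it clears the flag with an exchange before
copying, so a request that arrives during the copy is not lost, and it
uses an atomic integer as a stand-in for the mutex-protected bond data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pmd_ctx {
    atomic_bool reload_bond_cache;  /* Set by main thread, cleared by PMD. */
    atomic_int shared_version;      /* Stand-in for the shared bond data. */
    int cached_version;             /* PMD-private copy. */
};

static void
pmd_ctx_init(struct pmd_ctx *pmd)
{
    assert(pmd);
    atomic_init(&pmd->reload_bond_cache, false);
    atomic_init(&pmd->shared_version, 0);
    pmd->cached_version = 0;
}

/* Main thread: publish new data, then raise the flag. */
static void
request_bond_cache_reload(struct pmd_ctx *pmd, int new_version)
{
    atomic_store_explicit(&pmd->shared_version, new_version,
                          memory_order_relaxed);
    atomic_store_explicit(&pmd->reload_bond_cache, true,
                          memory_order_release);
}

/* PMD loop: cheap check each iteration; copy only when requested. */
static bool
maybe_reload_bond_cache(struct pmd_ctx *pmd)
{
    if (!atomic_exchange_explicit(&pmd->reload_bond_cache, false,
                                  memory_order_acq_rel)) {
        return false;
    }
    pmd->cached_version = atomic_load_explicit(&pmd->shared_version,
                                               memory_order_relaxed);
    return true;
}
```

The question is whether this extra path buys enough over the regular
reload to justify a second reload mechanism.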

>      pthread_t thread;
>      unsigned core_id;               /* CPU core id of this pmd thread. */
>      int numa_id;                    /* numa node id of this pmd thread. */
> @@ -728,6 +753,11 @@ struct dp_netdev_pmd_thread {
>       * read by the pmd thread. */
>      struct hmap tx_ports OVS_GUARDED;
>  
> +    struct ovs_mutex bond_mutex;    /* Mutex for 'tx_bonds'. */
> +    /* Map of 'tx_bond's used for transmission.  Written by the main thread,
> +     * read/written by the pmd thread. */
> +    struct hmap tx_bonds OVS_GUARDED;
> +
>      /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>       * ports (that support push_tunnel/pop_tunnel), the other contains ports
>       * with at least one txq (that support send).  A port can be in both.
> @@ -740,6 +770,8 @@ struct dp_netdev_pmd_thread {
>       * other instance will only be accessed by its own pmd thread. */
>      struct hmap tnl_port_cache;
>      struct hmap send_port_cache;
> +    /* These are thread-local copies of 'tx_bonds' */
> +    struct hmap bond_cache;
>  
>      /* Keep track of detailed PMD performance statistics. */
>      struct pmd_perf_stats perf_stats;
> @@ -819,6 +851,12 @@ static void dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
>  static int
>  dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
>                                     bool force);
> +static void dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
> +                                         struct tx_bond *bond)
> +    OVS_REQUIRES(pmd->bond_mutex);
> +static void dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
> +                                           struct tx_bond *tx)
> +    OVS_REQUIRES(pmd->bond_mutex);
>  
>  static void reconfigure_datapath(struct dp_netdev *dp)
>      OVS_REQUIRES(dp->port_mutex);
> @@ -827,6 +865,10 @@ static void dp_netdev_pmd_unref(struct dp_netdev_pmd_thread *pmd);
>  static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd);
>  static void pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
>      OVS_REQUIRES(pmd->port_mutex);
> +static void pmd_load_cached_bonds(struct dp_netdev_pmd_thread *pmd)
> +    OVS_REQUIRES(pmd->bond_mutex);
> +static void pmd_load_bond_cache(struct dp_netdev_pmd_thread *pmd);
> +
>  static inline void
>  dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>                             struct polled_queue *poll_list, int poll_cnt);
> @@ -1385,6 +1427,58 @@ pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
>      par.command_type = PMD_INFO_PERF_SHOW;
>      dpif_netdev_pmd_info(conn, argc, argv, &par);
>  }
> +
> +static void
> +dpif_netdev_pmd_bond_show(struct unixctl_conn *conn, int argc,
> +                          const char *argv[], void *aux OVS_UNUSED)
> +{
> +    struct ds reply = DS_EMPTY_INITIALIZER;
> +    struct dp_netdev_pmd_thread *pmd;
> +    struct dp_netdev *dp = NULL;
> +    uint32_t bucket;
> +    struct tx_bond *pmd_bond_entry = NULL;
> +
> +    ovs_mutex_lock(&dp_netdev_mutex);
> +
> +    if (argc == 2) {
> +        dp = shash_find_data(&dp_netdevs, argv[1]);
> +    } else if (shash_count(&dp_netdevs) == 1) {
> +        /* There's only one datapath */
> +        dp = shash_first(&dp_netdevs)->data;
> +    }
> +    if (!dp) {
> +        ovs_mutex_unlock(&dp_netdev_mutex);
> +        unixctl_command_reply_error(conn,
> +                                    "please specify an existing datapath");
> +        return;
> +    }
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        ds_put_cstr(&reply, (pmd->core_id == NON_PMD_CORE_ID)
> +                            ? "main thread" : "pmd thread");
> +        if (pmd->numa_id != OVS_NUMA_UNSPEC) {
> +            ds_put_format(&reply, " numa_id %d", pmd->numa_id);
> +        }
> +        if (pmd->core_id != OVS_CORE_UNSPEC &&
> +            pmd->core_id != NON_PMD_CORE_ID) {
> +            ds_put_format(&reply, " core_id %u", pmd->core_id);
> +        }
> +        ds_put_cstr(&reply, ":\n");
> +        ds_put_cstr(&reply, "\nBonds:\n");
> +        HMAP_FOR_EACH (pmd_bond_entry, node, &pmd->tx_bonds) {

This data should be the same for all the PMD threads. Do we really need
to print it on a per-thread basis? It should be enough to print only the
values stored in dp. This will also save accesses to PMD-local mutexes.
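
A sketch of what printing a single datapath-level entry could look like.
The names dp_bond and dp_bond_format are hypothetical, the bucket count is
shrunk to 4 for illustration (the patch uses BOND_BUCKETS), and plain
snprintf stands in for the ds_put_format() calls used in the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define N_BUCKETS 4   /* Illustration only. */

/* Hypothetical datapath-level bond entry (one copy, not per PMD). */
struct dp_bond {
    uint32_t bond_id;
    uint32_t slave_id[N_BUCKETS];
};

/* Formats 'bond' into 'buf'.  Returns the number of characters written,
 * or 0 if the output did not fit or formatting failed. */
static size_t
dp_bond_format(const struct dp_bond *bond, char *buf, size_t size)
{
    size_t used;
    int n;

    assert(size > 0);
    n = snprintf(buf, size, "bond-id %" PRIu32 ":\n", bond->bond_id);
    if (n < 0 || (size_t) n >= size) {
        return 0;
    }
    used = (size_t) n;

    for (int i = 0; i < N_BUCKETS; i++) {
        n = snprintf(buf + used, size - used,
                     "\tbucket %d - slave %" PRIu32 "\n",
                     i, bond->slave_id[i]);
        if (n < 0 || (size_t) n >= size - used) {
            return 0;
        }
        used += (size_t) n;
    }
    return used;
}
```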

> +            ds_put_format(&reply, "\tbond-id %u :\n",
> +                          pmd_bond_entry->bond_id);
> +            for (bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +                ds_put_format(&reply, "\t\tbucket %u - slave %u \n",
> +                          bucket,
> +                          pmd_bond_entry->slave_buckets[bucket].slave_id);
> +            }
> +        }
> +    }
> +    ovs_mutex_unlock(&dp_netdev_mutex);
> +    unixctl_command_reply(conn, ds_cstr(&reply));
> +    ds_destroy(&reply);
> +}
> +
>  
>  static int
>  dpif_netdev_init(void)
> @@ -1416,6 +1510,9 @@ dpif_netdev_init(void)
>                               "[-us usec] [-q qlen]",
>                               0, 10, pmd_perf_log_set_cmd,
>                               NULL);
> +    unixctl_command_register("dpif-netdev/pmd-bond-show", "[dp]",

In light of the previous comment, this should be renamed to something like
'dp-bond-show'.

> +                             0, 1, dpif_netdev_pmd_bond_show,
> +                             NULL);
>      return 0;
>  }
>  
> @@ -1531,6 +1628,9 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>      ovs_mutex_init(&dp->port_mutex);
>      hmap_init(&dp->ports);
>      dp->port_seq = seq_create();
> +    ovs_mutex_init(&dp->bond_mutex);
> +    hmap_init(&dp->bonds);
> +
>      fat_rwlock_init(&dp->upcall_rwlock);
>  
>      dp->reconfigure_seq = seq_create();
> @@ -1645,6 +1745,7 @@ dp_netdev_free(struct dp_netdev *dp)
>      OVS_REQUIRES(dp_netdev_mutex)
>  {
>      struct dp_netdev_port *port, *next;
> +    struct tx_bond *bond, *next_bond;
>  
>      shash_find_and_delete(&dp_netdevs, dp->name);
>  
> @@ -1654,6 +1755,13 @@ dp_netdev_free(struct dp_netdev *dp)
>      }
>      ovs_mutex_unlock(&dp->port_mutex);
>  
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    HMAP_FOR_EACH_SAFE (bond, next_bond, node, &dp->bonds) {
> +        hmap_remove(&dp->bonds, &bond->node);
> +        free(bond);
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
>      dp_netdev_destroy_all_pmds(dp, true);
>      cmap_destroy(&dp->poll_threads);
>  
> @@ -1672,6 +1780,9 @@ dp_netdev_free(struct dp_netdev *dp)
>      hmap_destroy(&dp->ports);
>      ovs_mutex_destroy(&dp->port_mutex);
>  
> +    hmap_destroy(&dp->bonds);
> +    ovs_mutex_destroy(&dp->bond_mutex);
> +
>      /* Upcalls must be disabled at this point */
>      dp_netdev_destroy_upcall_lock(dp);
>  
> @@ -1775,6 +1886,7 @@ dp_netdev_reload_pmd__(struct dp_netdev_pmd_thread *pmd)
>          ovs_mutex_lock(&pmd->port_mutex);
>          pmd_load_cached_ports(pmd);
>          ovs_mutex_unlock(&pmd->port_mutex);
> +        pmd_load_bond_cache(pmd);
>          ovs_mutex_unlock(&pmd->dp->non_pmd_mutex);
>          return;
>      }
> @@ -1789,6 +1901,12 @@ hash_port_no(odp_port_t port_no)
>      return hash_int(odp_to_u32(port_no), 0);
>  }
>  
> +static uint32_t
> +hash_bond_id(uint32_t bond_id)
> +{
> +    return hash_int(bond_id, 0);
> +}
> +
>  static int
>  port_create(const char *devname, const char *type,
>              odp_port_t port_no, struct dp_netdev_port **portp)
> @@ -4311,6 +4429,19 @@ tx_port_lookup(const struct hmap *hmap, odp_port_t port_no)
>      return NULL;
>  }
>  
> +static struct tx_bond *
> +tx_bond_lookup(const struct hmap *hmap, uint32_t bond_id)
> +{
> +    struct tx_bond *tx;
> +
> +    HMAP_FOR_EACH_IN_BUCKET (tx, node, hash_bond_id(bond_id), hmap) {
> +        if (tx->bond_id == bond_id) {
> +            return tx;
> +        }
> +    }
> +    return NULL;
> +}
> +
>  static int
>  port_reconfigure(struct dp_netdev_port *port)
>  {
> @@ -4788,6 +4919,27 @@ pmd_remove_stale_ports(struct dp_netdev *dp,
>      ovs_mutex_unlock(&pmd->port_mutex);
>  }
>  
> +static void
> +pmd_remove_stale_bonds(struct dp_netdev *dp,
> +                       struct dp_netdev_pmd_thread *pmd)
> +    OVS_EXCLUDED(pmd->bond_mutex)
> +    OVS_EXCLUDED(dp->bond_mutex)
> +{
> +    struct tx_bond *tx, *tx_next;
> +
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    ovs_mutex_lock(&pmd->bond_mutex);
> +
> +    HMAP_FOR_EACH_SAFE (tx, tx_next, node, &pmd->tx_bonds) {
> +        if (!tx_bond_lookup(&dp->bonds, tx->bond_id)) {
> +            dp_netdev_del_bond_tx_from_pmd(pmd, tx);
> +        }
> +    }
> +
> +    ovs_mutex_unlock(&pmd->bond_mutex);
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +}
> +
>  /* Must be called each time a port is added/removed or the cmask changes.
>   * This creates and destroys pmd threads, reconfigures ports, opens their
>   * rxqs and assigns all rxqs/txqs to pmd threads. */
> @@ -4798,6 +4950,7 @@ reconfigure_datapath(struct dp_netdev *dp)
>      struct hmapx busy_threads = HMAPX_INITIALIZER(&busy_threads);
>      struct dp_netdev_pmd_thread *pmd;
>      struct dp_netdev_port *port;
> +    struct tx_bond *bond;
>      int wanted_txqs;
>  
>      dp->last_reconfigure_seq = seq_read(dp->reconfigure_seq);
> @@ -4826,10 +4979,11 @@ reconfigure_datapath(struct dp_netdev *dp)
>          }
>      }
>  
> -    /* Remove from the pmd threads all the ports that have been deleted or
> -     * need reconfiguration. */
> +    /* Remove from the pmd threads all the ports/bonds that have been deleted
> +     * or need reconfiguration. */
>      CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
>          pmd_remove_stale_ports(dp, pmd);
> +        pmd_remove_stale_bonds(dp, pmd);
>      }
>  
>      /* Reload affected pmd threads.  We must wait for the pmd threads before
> @@ -4951,6 +5105,20 @@ reconfigure_datapath(struct dp_netdev *dp)
>          ovs_mutex_unlock(&pmd->port_mutex);
>      }
>  
> +    /* Add every bond to the tx cache of every pmd thread, if it's not
> +     * there already and if this pmd has at least one rxq to poll. */
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        ovs_mutex_lock(&pmd->bond_mutex);
> +        if (hmap_count(&pmd->poll_list) || pmd->core_id == NON_PMD_CORE_ID) {
> +            HMAP_FOR_EACH (bond, node, &dp->bonds) {
> +                dp_netdev_add_bond_tx_to_pmd(pmd, bond);
> +            }
> +        }
> +        ovs_mutex_unlock(&pmd->bond_mutex);
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
>      /* Reload affected pmd threads. */
>      reload_affected_pmds(dp);
>  
> @@ -5209,7 +5377,6 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
>      return ret;
>  }
>  
> -
>  /* Return true if needs to revalidate datapath flows. */
>  static bool
>  dpif_netdev_run(struct dpif *dpif)
> @@ -5379,6 +5546,58 @@ pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
>  }
>  
>  static void
> +pmd_free_cached_bonds(struct dp_netdev_pmd_thread *pmd)
> +{
> +    struct tx_bond *bond, *next;
> +
> +    /* Remove bonds from pmd which no longer exists. */
> +    HMAP_FOR_EACH_SAFE (bond, next, node, &pmd->bond_cache) {
> +        struct tx_bond *tx = NULL;
> +
> +        tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);
> +        if (!tx) {
> +            /* Bond no longer exist. Delete it from pmd. */
> +            hmap_remove(&pmd->bond_cache, &bond->node);
> +            free(bond);
> +        }
> +    }
> +}
> +
> +/* Copies bonds from 'pmd->tx_bonds' (shared with the main thread) to
> + * 'pmd->bond_cache' (thread local) */
> +static void
> +pmd_load_cached_bonds(struct dp_netdev_pmd_thread *pmd)
> +    OVS_REQUIRES(pmd->bond_mutex)
> +{
> +    struct tx_bond *tx_bond, *tx_bond_cached;
> +
> +    pmd_free_cached_bonds(pmd);
> +    hmap_shrink(&pmd->bond_cache);
> +
> +    HMAP_FOR_EACH (tx_bond, node, &pmd->tx_bonds) {
> +        uint32_t bucket = 0;
> +        /* Check if bond already exist on pmd. */
> +        tx_bond_cached = tx_bond_lookup(&pmd->bond_cache, tx_bond->bond_id);
> +
> +        if (!tx_bond_cached) {
> +            /* Create new bond entry in cache. */
> +            tx_bond_cached = xmemdup(tx_bond, sizeof *tx_bond_cached);
> +            hmap_insert(&pmd->bond_cache, &tx_bond_cached->node,
> +                        hash_bond_id(tx_bond_cached->bond_id));
> +        } else {
> +            /* Update the slave-map. */
> +            for (bucket = 0; bucket <= BOND_MASK; bucket++) {
> +                tx_bond_cached->slave_buckets[bucket].slave_id =
> +                    tx_bond->slave_buckets[bucket].slave_id;
> +            }
> +        }
> +        VLOG_DBG("Caching bond-id %d pmd %d\n",
> +                 tx_bond_cached->bond_id, pmd->core_id);
> +    }
> +}
> +
> +
> +static void
>  pmd_alloc_static_tx_qid(struct dp_netdev_pmd_thread *pmd)
>  {
>      ovs_mutex_lock(&pmd->dp->tx_qid_pool_mutex);
> @@ -5400,6 +5619,14 @@ pmd_free_static_tx_qid(struct dp_netdev_pmd_thread *pmd)
>      ovs_mutex_unlock(&pmd->dp->tx_qid_pool_mutex);
>  }
>  
> +static void
> +pmd_load_bond_cache(struct dp_netdev_pmd_thread *pmd)
> +{
> +    ovs_mutex_lock(&pmd->bond_mutex);
> +    pmd_load_cached_bonds(pmd);
> +    ovs_mutex_unlock(&pmd->bond_mutex);
> +}
> +
>  static int
>  pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
>                            struct polled_queue **ppoll_list)
> @@ -5427,6 +5654,8 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
>  
>      ovs_mutex_unlock(&pmd->port_mutex);
>  
> +    pmd_load_bond_cache(pmd);
> +
>      *ppoll_list = poll_list;
>      return i;
>  }
> @@ -5442,6 +5671,7 @@ pmd_thread_main(void *f_)
>      bool reload_tx_qid;
>      bool exiting;
>      bool reload;
> +    bool reload_bond_cache;
>      int poll_cnt;
>      int i;
>      int process_packets = 0;
> @@ -5538,6 +5768,13 @@ reload:
>                                   netdev_rxq_enabled(poll_list[i].rxq->rx);
>                  }
>              }
> +            atomic_read_explicit(&pmd->reload_bond_cache, &reload_bond_cache,
> +                                 memory_order_acquire);
> +            if (reload_bond_cache) {
> +                pmd_load_bond_cache(pmd);
> +                atomic_store_explicit(&pmd->reload_bond_cache, false,
> +                                      memory_order_release);
> +            }
>          }
>  
>          atomic_read_explicit(&pmd->reload, &reload, memory_order_acquire);
> @@ -5981,6 +6218,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      atomic_init(&pmd->reload, false);
>      ovs_mutex_init(&pmd->flow_mutex);
>      ovs_mutex_init(&pmd->port_mutex);
> +    ovs_mutex_init(&pmd->bond_mutex);
>      cmap_init(&pmd->flow_table);
>      cmap_init(&pmd->classifiers);
>      pmd->ctx.last_rxq = NULL;
> @@ -5991,6 +6229,8 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      hmap_init(&pmd->tx_ports);
>      hmap_init(&pmd->tnl_port_cache);
>      hmap_init(&pmd->send_port_cache);
> +    hmap_init(&pmd->tx_bonds);
> +    hmap_init(&pmd->bond_cache);
>      /* init the 'flow_cache' since there is no
>       * actual thread created for NON_PMD_CORE_ID. */
>      if (core_id == NON_PMD_CORE_ID) {
> @@ -6011,6 +6251,8 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>      hmap_destroy(&pmd->send_port_cache);
>      hmap_destroy(&pmd->tnl_port_cache);
>      hmap_destroy(&pmd->tx_ports);
> +    hmap_destroy(&pmd->bond_cache);
> +    hmap_destroy(&pmd->tx_bonds);
>      hmap_destroy(&pmd->poll_list);
>      /* All flows (including their dpcls_rules) have been deleted already */
>      CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
> @@ -6022,6 +6264,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>      ovs_mutex_destroy(&pmd->flow_mutex);
>      seq_destroy(pmd->reload_seq);
>      ovs_mutex_destroy(&pmd->port_mutex);
> +    ovs_mutex_destroy(&pmd->bond_mutex);
>      free(pmd);
>  }
>  
> @@ -6175,6 +6418,49 @@ dp_netdev_del_port_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
>      free(tx);
>      pmd->need_reload = true;
>  }
> +
> +static void
> +dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
> +                             struct tx_bond *bond)
> +    OVS_REQUIRES(pmd->bond_mutex)
> +{
> +    struct tx_bond *tx;
> +    uint32_t i;
> +    bool reload = false;
> +
> +    tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);
> +    if (tx) {
> +        /* Check if mapping is changed. */
> +        for (i = 0; i <= BOND_MASK; i++) {
> +            if (bond->slave_buckets[i].slave_id !=
> +                     tx->slave_buckets[i].slave_id) {
> +                /* Mapping is modified. Reload pmd bond cache again. */
> +                reload = true;
> +            }
> +            /* Copy the map always. */
> +            tx->slave_buckets[i].slave_id = bond->slave_buckets[i].slave_id;
> +        }
> +    } else {
> +        tx = xmemdup(bond, sizeof *tx);
> +        hmap_insert(&pmd->tx_bonds, &tx->node, hash_bond_id(bond->bond_id));
> +        reload = true;
> +    }
> +    if (reload == true) {
> +        atomic_store_explicit(&pmd->reload_bond_cache, true,
> +                              memory_order_release);
> +    }
> +}
> +
> +/* Del 'tx' from the tx bond cache of 'pmd' */
> +static void
> +dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
> +                               struct tx_bond *tx)
> +    OVS_REQUIRES(pmd->bond_mutex)
> +{
> +    hmap_remove(&pmd->tx_bonds, &tx->node);
> +    free(tx);
> +    atomic_store_explicit(&pmd->reload_bond_cache, true, memory_order_release);
> +}
>  
>  static char *
>  dpif_netdev_get_datapath_version(void)
> @@ -6946,6 +7232,13 @@ pmd_send_port_cache_lookup(const struct dp_netdev_pmd_thread *pmd,
>      return tx_port_lookup(&pmd->send_port_cache, port_no);
>  }
>  
> +static struct tx_bond *
> +pmd_tx_bond_cache_lookup(const struct dp_netdev_pmd_thread *pmd,
> +                         uint32_t bond_id)
> +{
> +    return tx_bond_lookup(&pmd->bond_cache, bond_id);
> +}
> +
>  static int
>  push_tnl_action(const struct dp_netdev_pmd_thread *pmd,
>                  const struct nlattr *attr,
> @@ -6995,6 +7288,51 @@ dp_execute_userspace_action(struct dp_netdev_pmd_thread *pmd,
>      }
>  }
>  
> +static int
> +dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
> +                         struct dp_packet_batch *packets_,
> +                         bool should_steal,
> +                         odp_port_t port_no)
> +{
> +    struct tx_port *p;
> +    p = pmd_send_port_cache_lookup(pmd, port_no);
> +    if (OVS_LIKELY(p)) {
> +        struct dp_packet *packet;
> +        struct dp_packet_batch out;
> +        if (!should_steal) {
> +            dp_packet_batch_clone(&out, packets_);
> +            dp_packet_batch_reset_cutlen(packets_);
> +            packets_ = &out;
> +        }
> +        dp_packet_batch_apply_cutlen(packets_);
> +#ifdef DPDK_NETDEV
> +        if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> +                         && packets_->packets[0]->source
> +                            != p->output_pkts.packets[0]->source)) {
> +            /* netdev-dpdk assumes that all packets in a single
> +             * output batch has the same source. Flush here to
> +             * avoid memory access issues. */
> +            dp_netdev_pmd_flush_output_on_port(pmd, p);
> +        }
> +#endif
> +        if (dp_packet_batch_size(&p->output_pkts)
> +            + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
> +            /* Flush here to avoid overflow. */
> +            dp_netdev_pmd_flush_output_on_port(pmd, p);
> +        }
> +        if (dp_packet_batch_is_empty(&p->output_pkts)) {
> +            pmd->n_output_batches++;
> +        }
> +        DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> +            p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> +                                                         pmd->ctx.last_rxq;
> +            dp_packet_batch_add(&p->output_pkts, packet);
> +        }
> +        return 0;
> +    }
> +    return -1;
> +}
> +
>  static void
>  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>                const struct nlattr *a, bool should_steal)
> @@ -7006,49 +7344,58 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>      struct dp_netdev *dp = pmd->dp;
>      int type = nl_attr_type(a);
>      struct tx_port *p;
> +    int ret;
>  
>      switch ((enum ovs_action_attr)type) {
>      case OVS_ACTION_ATTR_OUTPUT:
> -        p = pmd_send_port_cache_lookup(pmd, nl_attr_get_odp_port(a));
> -        if (OVS_LIKELY(p)) {
> -            struct dp_packet *packet;
> -            struct dp_packet_batch out;
> -
> -            if (!should_steal) {
> -                dp_packet_batch_clone(&out, packets_);
> -                dp_packet_batch_reset_cutlen(packets_);
> -                packets_ = &out;
> -            }
> -            dp_packet_batch_apply_cutlen(packets_);
> -
> -#ifdef DPDK_NETDEV
> -            if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> -                             && packets_->packets[0]->source
> -                                != p->output_pkts.packets[0]->source)) {
> -                /* XXX: netdev-dpdk assumes that all packets in a single
> -                 *      output batch has the same source. Flush here to
> -                 *      avoid memory access issues. */
> -                dp_netdev_pmd_flush_output_on_port(pmd, p);
> -            }
> -#endif
> -            if (dp_packet_batch_size(&p->output_pkts)
> -                + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
> -                /* Flush here to avoid overflow. */
> -                dp_netdev_pmd_flush_output_on_port(pmd, p);
> -            }
> -
> -            if (dp_packet_batch_is_empty(&p->output_pkts)) {
> -                pmd->n_output_batches++;
> -            }
> +        ret = dp_execute_output_action(pmd, packets_, should_steal,
> +                                       nl_attr_get_odp_port(a));
> +        if (ret == 0) {
> +            /* Output action executed successfully. */
> +            return;
> +        }
> +        break;
>  
> +    case OVS_ACTION_ATTR_LB_OUTPUT: {
> +        uint32_t bond = nl_attr_get_u32(a);
> +        uint32_t bond_member;
> +        uint32_t bucket;
> +        struct dp_packet_batch del_pkts;
> +        struct dp_packet_batch output_pkt;
> +        struct dp_packet *packet;
> +        struct tx_bond *p_bond;
> +        struct slave_entry *s_entry;
> +        uint32_t size;
> +
> +        p_bond = pmd_tx_bond_cache_lookup(pmd, bond);
> +        dp_packet_batch_init(&del_pkts);
> +        if (p_bond) {
>              DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> -                p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> -                                                             pmd->ctx.last_rxq;
> -                dp_packet_batch_add(&p->output_pkts, packet);
> +                /*
> +                 * Lookup the bond-hash table using hash to get the slave.
> +                 */
> +                bucket = (packet->md.dp_hash & BOND_MASK);
> +                s_entry = &p_bond->slave_buckets[bucket];
> +                bond_member = s_entry->slave_id;
> +                size = dp_packet_size(packet);
> +
> +                dp_packet_batch_init_packet(&output_pkt, packet);
> +                ret = dp_execute_output_action(pmd, &output_pkt, should_steal,
> +                                               u32_to_odp(bond_member));
> +                if (OVS_UNLIKELY(ret != 0)) {
> +                    dp_packet_batch_add(&del_pkts, packet);
> +                } else {
> +                    /* Update slave stats. */
> +                    non_atomic_ullong_add(&s_entry->n_packets, 1);
> +                    non_atomic_ullong_add(&s_entry->n_bytes, size);
> +                }
>              }
> +            /* Delete packets that failed OUTPUT action */
> +            dp_packet_delete_batch(&del_pkts, should_steal);
>              return;
>          }
>          break;
> +    }
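
For reference while reading the lookup above, here is a minimal standalone sketch of the per-packet bucket selection: mask the datapath hash down to 8 bits, index the per-bond table, and account the packet to that bucket. The struct and function names are hypothetical stand-ins, not the patch's tx_bond/slave_entry types, and plain integer adds replace the patch's non_atomic_ullong_add() calls.

```c
#include <stdint.h>

#define BOND_MASK 0xff
#define BOND_BUCKETS (BOND_MASK + 1)

/* Hypothetical, simplified stand-in for one entry of the per-bond table. */
struct bucket_sketch {
    uint32_t member_id;   /* Output port chosen for this hash bucket. */
    uint64_t n_packets;   /* Packets sent through this bucket. */
    uint64_t n_bytes;     /* Bytes sent through this bucket. */
};

/* Hypothetical per-bond table: one entry per possible masked hash value. */
struct bond_sketch {
    struct bucket_sketch buckets[BOND_BUCKETS];
};

/* Selects the member port for one packet and accounts the packet to the
 * bucket that its hash falls into.  'dp_hash' is the packet's datapath hash,
 * 'size' its length in bytes. */
static uint32_t
bond_sketch_select(struct bond_sketch *bond, uint32_t dp_hash, uint32_t size)
{
    struct bucket_sketch *b = &bond->buckets[dp_hash & BOND_MASK];

    b->n_packets++;
    b->n_bytes += size;
    return b->member_id;
}
```

Note that the table needs BOND_BUCKETS (256) entries, since the masked hash ranges over 0..BOND_MASK inclusive.
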
>  
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>          if (should_steal) {
> @@ -7477,6 +7824,110 @@ dpif_netdev_ipf_dump_done(struct dpif *dpif OVS_UNUSED, void *ipf_dump_ctx)
>  
>  }
>  
> +static int
> +dpif_netdev_bond_add(struct dpif *dpif, uint32_t bond_id, uint32_t slave_map[])
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +    uint32_t bucket;
> +    struct tx_bond *dp_bond_entry = NULL;
> +
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    /*
> +     * Lookup for the bond. If already exists, just update the slave-map.
> +     * Else create new.
> +     */
> +    dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id);
> +    if (dp_bond_entry) {
> +        for (bucket = 0; bucket <= BOND_MASK; bucket++) {
> +            dp_bond_entry->slave_buckets[bucket].slave_id = slave_map[bucket];
> +        }
> +        CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +            ovs_mutex_lock(&pmd->bond_mutex);
> +            dp_netdev_add_bond_tx_to_pmd(pmd, dp_bond_entry);
> +            ovs_mutex_unlock(&pmd->bond_mutex);
> +        }
> +    } else {
> +        struct tx_bond *dp_bond = xzalloc(sizeof *dp_bond);
> +        dp_bond->bond_id = bond_id;
> +        for (bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +            dp_bond->slave_buckets[bucket].slave_id = slave_map[bucket];
> +        }
> +        hmap_insert(&dp->bonds, &dp_bond->node,
> +                    hash_bond_id(dp_bond->bond_id));
> +        /* Insert the bond map in all pmds. */
> +        CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +            ovs_mutex_lock(&pmd->bond_mutex);
> +            dp_netdev_add_bond_tx_to_pmd(pmd, dp_bond);
> +            ovs_mutex_unlock(&pmd->bond_mutex);
> +        }
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +    return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +    struct tx_bond *dp_bond_entry = NULL;
> +
> +    ovs_mutex_lock(&dp->bond_mutex);
> +
> +    /* Find the bond and delete it if present */
> +    dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id);
> +    if (dp_bond_entry) {
> +        /* Remove the bond map in all pmds. */
> +        CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +            ovs_mutex_lock(&pmd->bond_mutex);
> +            dp_netdev_del_bond_tx_from_pmd(pmd, dp_bond_entry);
> +            ovs_mutex_unlock(&pmd->bond_mutex);
> +        }
> +        hmap_remove(&dp->bonds, &dp_bond_entry->node);
> +        free(dp_bond_entry);
> +    }
> +
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +    return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                           uint64_t *n_bytes)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +    struct tx_bond *dp_bond_entry = NULL;
> +    struct tx_bond *pmd_bond_entry = NULL;
> +    uint32_t i;
> +
> +    ovs_mutex_lock(&dp->bond_mutex);
> +
> +    /* Find the bond and retrieve stats if present */
> +    dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id);
> +    if (dp_bond_entry) {
> +        /* Search the bond in all PMDs */
> +        CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +            uint64_t pmd_n_bytes;
> +            ovs_mutex_lock(&pmd->bond_mutex);
> +            pmd_bond_entry = tx_bond_lookup(&pmd->bond_cache, bond_id);
> +            if (pmd_bond_entry) {
> +                /* Read bond stats. */
> +                for (i = 0;i <= BOND_MASK; i++) {
> +                    atomic_read_relaxed(
> +                         &pmd_bond_entry->slave_buckets[i].n_bytes,
> +                         &pmd_n_bytes);
> +                    n_bytes[i] += pmd_n_bytes;
> +                }
> +            }
> +            ovs_mutex_unlock(&pmd->bond_mutex);
> +        }
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +    return 0;
> +}
> +
>  const struct dpif_class dpif_netdev_class = {
>      "netdev",
>      true,                       /* cleanup_required */
> @@ -7540,6 +7991,9 @@ const struct dpif_class dpif_netdev_class = {
>      dpif_netdev_meter_set,
>      dpif_netdev_meter_get,
>      dpif_netdev_meter_del,
> +    dpif_netdev_bond_add,
> +    dpif_netdev_bond_del,
> +    dpif_netdev_bond_stats_get,
>  };
>  
>  static void
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 7bc71d6..a87c898 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -3440,6 +3440,9 @@ const struct dpif_class dpif_netlink_class = {
>      dpif_netlink_meter_set,
>      dpif_netlink_meter_get,
>      dpif_netlink_meter_del,
> +    NULL,                       /* bond_add */
> +    NULL,                       /* bond_del */
> +    NULL,                       /* bond_stats_get */
>  };
>  
>  static int
> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
> index 12898b9..043b885 100644
> --- a/lib/dpif-provider.h
> +++ b/lib/dpif-provider.h
> @@ -552,6 +552,14 @@ struct dpif_class {
>       * zero. */
>      int (*meter_del)(struct dpif *, ofproto_meter_id meter_id,
>                       struct ofputil_meter_stats *, uint16_t n_bands);
> +
> +    /* Adds a bond with 'bond_id' and the slave-map to 'dpif'. */
> +    int (*bond_add)(struct dpif *dpif, uint32_t bond_id, uint32_t slave_map[]);
> +    /* Removes bond identified by 'bond_id' from 'dpif'. */
> +    int (*bond_del)(struct dpif *dpif, uint32_t bond_id);
> +    /* Reads bond stats from 'dpif'. */
> +    int (*bond_stats_get)(struct dpif *dpif, uint32_t bond_id,
> +                          uint64_t *n_bytes);
>  };
>  
>  extern const struct dpif_class dpif_netlink_class;
> diff --git a/lib/dpif.c b/lib/dpif.c
> index c88b210..2411c2c 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1177,6 +1177,7 @@ dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
>  
>      case OVS_ACTION_ATTR_CT:
>      case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>      case OVS_ACTION_ATTR_TUNNEL_POP:
>      case OVS_ACTION_ATTR_USERSPACE:
> @@ -1227,6 +1228,7 @@ dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
>          struct dp_packet *clone = NULL;
>          uint32_t cutlen = dp_packet_get_cutlen(packet);
>          if (cutlen && (type == OVS_ACTION_ATTR_OUTPUT
> +                        || type == OVS_ACTION_ATTR_LB_OUTPUT
>                          || type == OVS_ACTION_ATTR_TUNNEL_PUSH
>                          || type == OVS_ACTION_ATTR_TUNNEL_POP
>                          || type == OVS_ACTION_ATTR_USERSPACE)) {
> @@ -1879,6 +1881,16 @@ dpif_supports_tnl_push_pop(const struct dpif *dpif)
>      return dpif_is_netdev(dpif);
>  }
>  
> +bool
> +dpif_supports_balance_tcp_opt(const struct dpif *dpif)
> +{
> +    /*
> +     * Balance-tcp optimization is currently supported in netdev
> +     * datapath only.
> +     */
> +    return dpif_is_netdev(dpif);
> +}
> +
>  /* Meters */
>  void
>  dpif_meter_get_features(const struct dpif *dpif,
> @@ -1976,3 +1988,39 @@ dpif_meter_del(struct dpif *dpif, ofproto_meter_id meter_id,
>      }
>      return error;
>  }
> +
> +int
> +dpif_bond_add(struct dpif *dpif, uint32_t bond_id, uint32_t slave_map[])
> +{
> +    int error = 0;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_add) {
> +        error = dpif->dpif_class->bond_add(dpif, bond_id, slave_map);
> +    }
> +
> +    return error;
> +}
> +
> +int
> +dpif_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> +    int error = 0;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> +        error = dpif->dpif_class->bond_del(dpif, bond_id);
> +    }
> +
> +    return error;
> +}
> +
> +int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                        uint64_t *n_bytes)
> +{
> +    int error = 0;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_stats_get) {
> +        error = dpif->dpif_class->bond_stats_get(dpif, bond_id, n_bytes);
> +    }
> +
> +    return error;
> +}
> diff --git a/lib/dpif.h b/lib/dpif.h
> index 289d574..9b84122 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -891,6 +891,13 @@ int dpif_get_pmds_for_port(const struct dpif * dpif, odp_port_t port_no,
>  char *dpif_get_dp_version(const struct dpif *);
>  bool dpif_supports_tnl_push_pop(const struct dpif *);
>  
> +bool dpif_supports_balance_tcp_opt(const struct dpif *);
> +
> +int dpif_bond_add(struct dpif *dpif, uint32_t bond_id, uint32_t slave_map[]);
> +int dpif_bond_del(struct dpif *dpif, uint32_t bond_id);
> +int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                        uint64_t *n_bytes);
> +
>  /* Log functions. */
>  struct vlog_module;
>  
> diff --git a/lib/odp-execute.c b/lib/odp-execute.c
> index 563ad1d..13e4e96 100644
> --- a/lib/odp-execute.c
> +++ b/lib/odp-execute.c
> @@ -725,6 +725,7 @@ requires_datapath_assistance(const struct nlattr *a)
>      switch (type) {
>          /* These only make sense in the context of a datapath. */
>      case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>      case OVS_ACTION_ATTR_TUNNEL_POP:
>      case OVS_ACTION_ATTR_USERSPACE:
> @@ -990,6 +991,7 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
>              break;
>  
>          case OVS_ACTION_ATTR_OUTPUT:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>          case OVS_ACTION_ATTR_TUNNEL_PUSH:
>          case OVS_ACTION_ATTR_TUNNEL_POP:
>          case OVS_ACTION_ATTR_USERSPACE:
> diff --git a/lib/odp-util.c b/lib/odp-util.c
> index 84ea4c1..e616da0 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> @@ -118,6 +118,7 @@ odp_action_len(uint16_t type)
>  
>      switch ((enum ovs_action_attr) type) {
>      case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t);
> +    case OVS_ACTION_ATTR_LB_OUTPUT: return sizeof(uint32_t);
>      case OVS_ACTION_ATTR_TRUNC: return sizeof(struct ovs_action_trunc);
>      case OVS_ACTION_ATTR_TUNNEL_PUSH: return ATTR_LEN_VARIABLE;
>      case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t);
> @@ -1113,6 +1114,9 @@ format_odp_action(struct ds *ds, const struct nlattr *a,
>      case OVS_ACTION_ATTR_OUTPUT:
>          odp_portno_name_format(portno_names, nl_attr_get_odp_port(a), ds);
>          break;
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
> +        ds_put_format(ds, "lb_output(bond,%"PRIu32")", nl_attr_get_u32(a));
> +        break;
>      case OVS_ACTION_ATTR_TRUNC: {
>          const struct ovs_action_trunc *trunc =
>                         nl_attr_get_unspec(a, sizeof *trunc);
> diff --git a/ofproto/bond.c b/ofproto/bond.c
> index c5d5f2c..15f1b40 100644
> --- a/ofproto/bond.c
> +++ b/ofproto/bond.c
> @@ -54,10 +54,6 @@ static struct ovs_rwlock rwlock = OVS_RWLOCK_INITIALIZER;
>  static struct hmap all_bonds__ = HMAP_INITIALIZER(&all_bonds__);
>  static struct hmap *const all_bonds OVS_GUARDED_BY(rwlock) = &all_bonds__;
>  
> -/* Bit-mask for hashing a flow down to a bucket. */
> -#define BOND_MASK 0xff
> -#define BOND_BUCKETS (BOND_MASK + 1)
> -
>  /* Priority for internal rules created to handle recirculation */
>  #define RECIRC_RULE_PRIORITY 20
>  
> @@ -126,6 +122,8 @@ struct bond {
>      enum lacp_status lacp_status; /* Status of LACP negotiations. */
>      bool bond_revalidate;       /* True if flows need revalidation. */
>      uint32_t basis;             /* Basis for flow hash function. */
> +    bool use_bond_cache;        /* Use bond cache to avoid recirculation.
> +                                   Applicable only for Balance TCP mode. */
>  
>      /* SLB specific bonding info. */
>      struct bond_entry *hash;     /* An array of BOND_BUCKETS elements. */
> @@ -185,7 +183,7 @@ static struct bond_slave *choose_output_slave(const struct bond *,
>                                                struct flow_wildcards *,
>                                                uint16_t vlan)
>      OVS_REQ_RDLOCK(rwlock);
> -static void update_recirc_rules__(struct bond *bond);
> +static void update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id);
>  static bool bond_is_falling_back_to_ab(const struct bond *);
>  
>  /* Attempts to parse 's' as the name of a bond balancing mode.  If successful,
> @@ -262,6 +260,7 @@ void
>  bond_unref(struct bond *bond)
>  {
>      struct bond_slave *slave;
> +    uint32_t bond_recirc_id = 0;
>  
>      if (!bond || ovs_refcount_unref_relaxed(&bond->ref_cnt) != 1) {
>          return;
> @@ -282,12 +281,13 @@ bond_unref(struct bond *bond)
>  
>      /* Free bond resources. Remove existing post recirc rules. */
>      if (bond->recirc_id) {
> +        bond_recirc_id = bond->recirc_id;
>          recirc_free_id(bond->recirc_id);
>          bond->recirc_id = 0;
>      }
>      free(bond->hash);
>      bond->hash = NULL;
> -    update_recirc_rules__(bond);
> +    update_recirc_rules__(bond, bond_recirc_id);
>  
>      hmap_destroy(&bond->pr_rule_ops);
>      free(bond->name);
> @@ -328,13 +328,14 @@ add_pr_rule(struct bond *bond, const struct match *match,
>   * lock annotation. Currently, only 'bond_unref()' calls
>   * this function directly.  */
>  static void
> -update_recirc_rules__(struct bond *bond)
> +update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id)
>  {
>      struct match match;
>      struct bond_pr_rule_op *pr_op, *next_op;
>      uint64_t ofpacts_stub[128 / 8];
>      struct ofpbuf ofpacts;
>      int i;
> +    uint32_t slave_map[BOND_BUCKETS];
>  
>      ofpbuf_use_stub(&ofpacts, ofpacts_stub, sizeof ofpacts_stub);
>  
> @@ -353,8 +354,14 @@ update_recirc_rules__(struct bond *bond)
>  
>                  add_pr_rule(bond, &match, slave->ofp_port,
>                              &bond->hash[i].pr_rule);
> +                slave_map[i] = slave->ofp_port;
> +            } else {
> +                slave_map[i] = -1;
>              }
>          }
> +        ofproto_dpif_bundle_add(bond->ofproto, bond->recirc_id, slave_map);
> +    } else {
> +        ofproto_dpif_bundle_del(bond->ofproto, bond_recirc_id);
>      }
>  
>      HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
> @@ -404,7 +411,7 @@ static void
>  update_recirc_rules(struct bond *bond)
>      OVS_REQ_RDLOCK(rwlock)
>  {
> -    update_recirc_rules__(bond);
> +    update_recirc_rules__(bond, bond->recirc_id);
>  }
>  
>  /* Updates 'bond''s overall configuration to 's'.
> @@ -467,6 +474,10 @@ bond_reconfigure(struct bond *bond, const struct bond_settings *s)
>          recirc_free_id(bond->recirc_id);
>          bond->recirc_id = 0;
>      }
> +    if (bond->use_bond_cache != s->use_bond_cache) {
> +        bond->use_bond_cache = s->use_bond_cache;
> +        revalidate = true;
> +    }
>  
>      if (bond->balance == BM_AB || !bond->hash || revalidate) {
>          bond_entry_reset(bond);
> @@ -940,6 +951,13 @@ bond_recirculation_account(struct bond *bond)
>      OVS_REQ_WRLOCK(rwlock)
>  {
>      int i;
> +    uint64_t n_bytes[BOND_BUCKETS] = {0};
> +
> +    if (bond->hash && bond->recirc_id) {
> +        /* Retrieve bond stats from datapath. */
> +        dpif_bond_stats_get(bond->ofproto->backer->dpif,
> +                            bond->recirc_id, n_bytes);
> +    }
>  
>      for (i=0; i<=BOND_MASK; i++) {
>          struct bond_entry *entry = &bond->hash[i];
> @@ -948,11 +966,11 @@ bond_recirculation_account(struct bond *bond)
>          if (rule) {
>              uint64_t n_packets OVS_UNUSED;
>              long long int used OVS_UNUSED;
> -            uint64_t n_bytes;
> -
> -            rule->ofproto->ofproto_class->rule_get_stats(
> -                rule, &n_packets, &n_bytes, &used);
> -            bond_entry_account(entry, n_bytes);
> +            if (!bond->ofproto->backer->rt_support.balance_tcp_opt) {
> +                rule->ofproto->ofproto_class->rule_get_stats(
> +                    rule, &n_packets, &n_bytes[i], &used);
> +            }
> +            bond_entry_account(entry, n_bytes[i]);
>          }
>      }
>  }
> @@ -1362,6 +1380,8 @@ bond_print_details(struct ds *ds, const struct bond *bond)
>                    may_recirc ? "yes" : "no", may_recirc ? recirc_id: -1);
>  
>      ds_put_format(ds, "bond-hash-basis: %"PRIu32"\n", bond->basis);
> +    ds_put_format(ds, "opt-bond-tcp: %s\n",
> +                  bond->use_bond_cache ? "enabled" : "disabled");
>  
>      ds_put_format(ds, "updelay: %d ms\n", bond->updelay);
>      ds_put_format(ds, "downdelay: %d ms\n", bond->downdelay);
> @@ -1939,3 +1959,9 @@ bond_get_changed_active_slave(const char *name, struct eth_addr *mac,
>  
>      return false;
>  }
> +
> +bool
> +bond_get_cache_mode(const struct bond *bond)
> +{
> +    return bond->use_bond_cache;
> +}
> diff --git a/ofproto/bond.h b/ofproto/bond.h
> index e7c3d9b..88a4de1 100644
> --- a/ofproto/bond.h
> +++ b/ofproto/bond.h
> @@ -22,6 +22,10 @@
>  #include "ofproto-provider.h"
>  #include "packets.h"
>  
> +/* Bit-mask for hashing a flow down to a bucket. */
> +#define BOND_MASK 0xff
> +#define BOND_BUCKETS (BOND_MASK + 1)
> +
>  struct flow;
>  struct netdev;
>  struct ofpbuf;
> @@ -58,6 +62,8 @@ struct bond_settings {
>                                  /* The MAC address of the interface
>                                     that was active during the last
>                                     ovs run. */
> +    bool use_bond_cache;        /* Use bond cache. Only applicable for
> +                                   bond mode BALANCE TCP. */
>  };
>  
>  /* Program startup. */
> @@ -122,4 +128,7 @@ void bond_rebalance(struct bond *);
>  */
>  void bond_update_post_recirc_rules(struct bond *, uint32_t *recirc_id,
>                                     uint32_t *hash_basis);
> +
> +bool bond_get_cache_mode(const struct bond *);
> +
>  #endif /* bond.h */
> diff --git a/ofproto/ofproto-dpif-ipfix.c b/ofproto/ofproto-dpif-ipfix.c
> index b8bd1b8..3daed47 100644
> --- a/ofproto/ofproto-dpif-ipfix.c
> +++ b/ofproto/ofproto-dpif-ipfix.c
> @@ -3016,6 +3016,7 @@ dpif_ipfix_read_actions(const struct flow *flow,
>          case OVS_ACTION_ATTR_POP_NSH:
>          case OVS_ACTION_ATTR_CHECK_PKT_LEN:
>          case OVS_ACTION_ATTR_UNSPEC:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>          case __OVS_ACTION_ATTR_MAX:
>          default:
>              break;
> diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
> index 03bd763..36f3f24 100644
> --- a/ofproto/ofproto-dpif-sflow.c
> +++ b/ofproto/ofproto-dpif-sflow.c
> @@ -1177,6 +1177,7 @@ dpif_sflow_read_actions(const struct flow *flow,
>          case OVS_ACTION_ATTR_CT:
>      case OVS_ACTION_ATTR_CT_CLEAR:
>          case OVS_ACTION_ATTR_METER:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>              break;
>  
>          case OVS_ACTION_ATTR_SET_MASKED:
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 28a7fdd..9d15ac7 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -409,6 +409,8 @@ struct xlate_ctx {
>      struct ofpbuf action_set;   /* Action set. */
>  
>      enum xlate_error error;     /* Translation failed. */
> +
> +    bool tnl_push_no_recirc;    /* Tunnel push recirculation status */
>  };
>  
>  /* Structure to track VLAN manipulation */
> @@ -2421,6 +2423,34 @@ output_normal(struct xlate_ctx *ctx, const struct xbundle *out_xbundle,
>                  /* Use recirculation instead of output. */
>                  use_recirc = true;
>                  xr.hash_alg = OVS_HASH_ALG_L4;
> +
> +                if (bond_get_cache_mode(out_xbundle->bond)) {
> +                    /*
> +                     * Select the hash-alg based on datapath's capability.
> +                     * If not supported, default to OVS_HASH_ALG_L4 for
> +                     * which HASH + RECIRC actions would be set in xlate. Else
> +                     * use the RSS hash for better throughput. With
> +                     * OVS_HASH_ALG_L4_RSS, RECIRC action is also avoided.
> +                     *
> +                     * NOTE:
> +                     * Do not use load-balanced-output action when tunnel push
> +                     * recirculation is avoided (via CLONE action), as L4 hash
> +                     * for bond balancing needs to be computed post tunnel
> +                     * encapsulation.
> +                     */
> +                    if (ctx->xbridge->support.balance_tcp_opt &&
> +                        !ctx->tnl_push_no_recirc) {
> +                        xr.hash_alg = OVS_HASH_ALG_L4_RSS;
> +                    }
> +
> +                    VLOG_DBG("xin-in_port: %u/%u base-flow-in_port: %u/%u "
> +                             "hash-algo = %d\n",
> +                             ctx->xin->flow.in_port.ofp_port,
> +                             ctx->xin->flow.in_port.odp_port,
> +                             ctx->base_flow.in_port.ofp_port,
> +                             ctx->base_flow.in_port.odp_port, xr.hash_alg);
> +                }
> +
>                  /* Recirculation does not require unmasking hash fields. */

But we're not always recirculating here with this patch. Not setting the
hash mask in the flow may cause a behaviour change.

In general, this part of the code looks very dangerous and hard to extend.
The comments above and the variable 'use_recirc' say that we're going to
recirculate, but the newly added code avoids recirculation while leaving
the surrounding code believing that we still do. This needs to be reworked.

I'm also still confused by the introduction of OVS_HASH_ALG_L4_RSS, which
is used only to be changed back to OVS_HASH_ALG_L4, which actually uses
RSS too. Can we avoid introducing it?

>                  wc = NULL;
>              }
> @@ -3697,12 +3727,16 @@ native_tunnel_output(struct xlate_ctx *ctx, const struct xport *xport,
>          ctx->xin->allow_side_effects = backup_side_effects;
>          ctx->xin->packet = backup_packet;
>          ctx->wc = backup_wc;
> +
> +        ctx->tnl_push_no_recirc = true;
>      } else {
>          /* In order to maintain accurate stats, use recirc for
>           * natvie tunneling.  */
>          nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC, 0);
>          nl_msg_end_nested(ctx->odp_actions, clone_ofs);
> -    }
> +
> +        ctx->tnl_push_no_recirc = false;

This variable is initialized only in the tunneling case. It must be explicitly
initialized in all cases, i.e. inside xlate_actions() and xlate_in_init().

> +   }
>  
>      /* Restore the flows after the translation. */
>      memcpy(&ctx->xin->flow, &old_flow, sizeof ctx->xin->flow);
> @@ -4128,24 +4162,36 @@ compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
>          xlate_commit_actions(ctx);
>  
>          if (xr) {
> -            /* Recirculate the packet. */
>              struct ovs_action_hash *act_hash;
>  
>              /* Hash action. */
>              enum ovs_hash_alg hash_alg = xr->hash_alg;
> -            if (hash_alg > ctx->xbridge->support.max_hash_alg) {
> +            if (hash_alg > ctx->xbridge->support.max_hash_alg ||
> +                hash_alg == OVS_HASH_ALG_L4_RSS) {
>                  /* Algorithm supported by all datapaths. */
>                  hash_alg = OVS_HASH_ALG_L4;
>              }
>              act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
> -                                                OVS_ACTION_ATTR_HASH,
> -                                                sizeof *act_hash);
> +                                            OVS_ACTION_ATTR_HASH,
> +                                            sizeof *act_hash);
>              act_hash->hash_alg = hash_alg;
>              act_hash->hash_basis = xr->hash_basis;
>  
> -            /* Recirc action. */
> -            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
> -                           xr->recirc_id);
> +            if (xr->hash_alg == OVS_HASH_ALG_L4_RSS) {
> +                /*
> +                 * If hash algorithm is RSS, use the hash directly
> +                 * for slave selection and avoid recirculation.
> +                 *
> +                 * Currently support for netdev datapath only.
> +                 */
> +                nl_msg_put_odp_port(ctx->odp_actions,
> +                                    OVS_ACTION_ATTR_LB_OUTPUT,
> +                                    xr->recirc_id);
> +            } else {
> +                /* Recirculate the packet. */
> +                nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
> +                               xr->recirc_id);
> +            }
>          } else if (is_native_tunnel) {
>              /* Output to native tunnel port. */
>              native_tunnel_output(ctx, xport, flow, odp_port, truncate);
> @@ -7170,7 +7216,8 @@ count_output_actions(const struct ofpbuf *odp_actions)
>      int n = 0;
>  
>      NL_ATTR_FOR_EACH_UNSAFE (a, left, odp_actions->data, odp_actions->size) {
> -        if (a->nla_type == OVS_ACTION_ATTR_OUTPUT) {
> +        if ((a->nla_type == OVS_ACTION_ATTR_OUTPUT) ||
> +            (a->nla_type == OVS_ACTION_ATTR_LB_OUTPUT)) {
>              n++;
>          }
>      }
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index 7515352..a591035 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -1441,6 +1441,8 @@ check_support(struct dpif_backer *backer)
>      backer->rt_support.ct_clear = check_ct_clear(backer);
>      backer->rt_support.max_hash_alg = check_max_dp_hash_alg(backer);
>      backer->rt_support.check_pkt_len = check_check_pkt_len(backer);
> +    backer->rt_support.balance_tcp_opt =
> +        dpif_supports_balance_tcp_opt(backer->dpif);
>  
>      /* Flow fields. */
>      backer->rt_support.odp.ct_state = check_ct_state(backer);
> @@ -3294,6 +3296,36 @@ bundle_remove(struct ofport *port_)
>      }
>  }
>  
> +int
> +ofproto_dpif_bundle_add(struct ofproto_dpif *ofproto,
> +                        uint32_t bond_id,
> +                        uint32_t slave_map[])
> +{
> +    int error;
> +    uint32_t bucket;
> +
> +    /* Convert ofp_port to odp_port */
> +    for (bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +        if (slave_map[bucket] != -1) {
> +            slave_map[bucket] =
> +                ofp_port_to_odp_port(ofproto, slave_map[bucket]);
> +        }
> +    }
> +
> +    error = dpif_bond_add(ofproto->backer->dpif, bond_id, slave_map);
> +    return error;
> +}
> +
> +int
> +ofproto_dpif_bundle_del(struct ofproto_dpif *ofproto,
> +                        uint32_t bond_id)
> +{
> +    int error;
> +
> +    error = dpif_bond_del(ofproto->backer->dpif, bond_id);
> +    return error;
> +}
> +
>  static void
>  send_pdu_cb(void *port_, const void *pdu, size_t pdu_size)
>  {
> diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
> index cd5321e..43ab09d 100644
> --- a/ofproto/ofproto-dpif.h
> +++ b/ofproto/ofproto-dpif.h
> @@ -194,8 +194,11 @@ struct group_dpif *group_dpif_lookup(struct ofproto_dpif *,
>      /* Highest supported dp_hash algorithm. */                              \
>      DPIF_SUPPORT_FIELD(size_t, max_hash_alg, "Max dp_hash algorithm")       \
>                                                                              \
> -    /* True if the datapath supports OVS_ACTION_ATTR_CHECK_PKT_LEN. */   \
> -    DPIF_SUPPORT_FIELD(bool, check_pkt_len, "Check pkt length action")
> +    /* True if the datapath supports OVS_ACTION_ATTR_CHECK_PKT_LEN. */      \
> +    DPIF_SUPPORT_FIELD(bool, check_pkt_len, "Check pkt length action")      \
> +                                                                            \
> +    /* True if the datapath supports balance_tcp optimization */            \
> +    DPIF_SUPPORT_FIELD(bool, balance_tcp_opt, "Balance-tcp opt")
>  
>  /* Stores the various features which the corresponding backer supports. */
>  struct dpif_backer_support {
> @@ -361,6 +364,11 @@ int ofproto_dpif_add_internal_flow(struct ofproto_dpif *,
>                                     struct rule **rulep);
>  int ofproto_dpif_delete_internal_flow(struct ofproto_dpif *, struct match *,
>                                        int priority);
> +int ofproto_dpif_bundle_add(struct ofproto_dpif *,
> +                            uint32_t bond_id,
> +                            uint32_t slave_map[]);
> +int ofproto_dpif_bundle_del(struct ofproto_dpif *,
> +                            uint32_t bond_id);
>  
>  bool ovs_native_tunneling_is_on(struct ofproto_dpif *);
>  
> diff --git a/tests/lacp.at b/tests/lacp.at
> index 7b460d7..26aa1b7 100644
> --- a/tests/lacp.at
> +++ b/tests/lacp.at
> @@ -121,6 +121,7 @@ AT_CHECK([ovs-appctl bond/show], [0], [dnl
>  bond_mode: active-backup
>  bond may use recirculation: no, Recirc-ID : -1
>  bond-hash-basis: 0
> +opt-bond-tcp: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -286,6 +287,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +opt-bond-tcp: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -301,6 +303,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +opt-bond-tcp: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -423,6 +426,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +opt-bond-tcp: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -440,6 +444,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +opt-bond-tcp: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -555,6 +560,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +opt-bond-tcp: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -572,6 +578,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +opt-bond-tcp: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -692,6 +699,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +opt-bond-tcp: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -709,6 +717,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +opt-bond-tcp: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index 2976771..6199a7b 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -4300,6 +4300,10 @@ port_configure_bond(struct port *port, struct bond_settings *s)
>          /* OVSDB did not store the last active interface */
>          s->active_slave_mac = eth_addr_zero;
>      }
> +    if (s->balance == BM_TCP) {
> +        s->use_bond_cache = smap_get_bool(&port->cfg->other_config,
> +                                        "opt-bond-tcp", false);
> +    }
>  }
>  
>  /* Returns true if 'port' is synthetic, that is, if we constructed it locally
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 027aee2..123f694 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -1963,6 +1963,16 @@
>          <code>active-backup</code>.
>        </column>
>  
> +      <column name="other_config" key="opt-bond-tcp"

The name is not readable, i.e. it's hard to understand what it means without
reading the whole description. Maybe something like "lb-output-action"?

> +              type='{"type": "boolean"}'>
> +        Enable/disable usage of RSS hash from the ingress port for load
> +        balancing flows among output slaves in load balanced bonds in
> +        <code>balance-tcp</code>. When enabled, it uses optimized path for
> +        balance-tcp mode by using rss hash and avoids recirculation.

Still the same: usual bonding with recirculation uses the RSS hash too.
The description above is misleading.

> +        It affects only new flows, i.e, existing flows remain unchanged.
> +        This knob does not affect other balancing modes.
> +      </column>
> +
>        <group title="Link Failure Detection">
>          <p>
>            An important part of link bonding is detecting that links are down so
>
Matteo Croce Aug. 26, 2019, 4:18 p.m. UTC | #4
On Thu, Aug 8, 2019 at 10:57 AM Vishal Deep Ajmera
<vishal.deep.ajmera@ericsson.com> wrote:
>
> Problem:
> --------
> In OVS-DPDK, flows with output over a bond interface of type “balance-tcp”
> (using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into
> "HASH" and "RECIRC" datapath actions. After recirculation, the packet is
> forwarded to the bond member port based on 8-bits of the datapath hash
> value computed through dp_hash. This causes performance degradation in the
> following ways:
>
> 1. The recirculation of the packet implies another lookup of the packet’s
> flow key in the exact match cache (EMC) and potentially Megaflow classifier
> (DPCLS). This is the biggest cost factor.
>
> 2. The recirculated packets have a new “RSS” hash and compete with the
> original packets for the scarce number of EMC slots. This implies more
> EMC misses and potentially EMC thrashing causing costly DPCLS lookups.
>
> 3. The 256 extra megaflow entries per bond for dp_hash bond selection put
> additional load on the revalidation threads.
>
> Owing to this performance degradation, deployments stick to “balance-slb”
> bond mode even though it does not do active-active load balancing for
> VXLAN- and GRE-tunnelled traffic because all tunnel packet have the same
> source MAC address.
>
> Proposed optimization:
> ----------------------
> This proposal introduces a new load-balancing output action instead of
> recirculation.
>
> Maintain one table per-bond (could just be an array of uint16's) and
> program it the same way internal flows are created today for each possible
> hash value(256 entries) from ofproto layer. Use this table to load-balance
> flows as part of output action processing.
>
> Currently xlate_normal() -> output_normal() -> bond_update_post_recirc_rules()
> -> bond_may_recirc() and compose_output_action__() generate
> “dp_hash(hash_l4(0))” and “recirc(<RecircID>)” actions. In this case the
> RecircID identifies the bond. For the recirculated packets the ofproto layer
> installs megaflow entries that match on RecircID and masked dp_hash and send
> them to the corresponding output port.
>
> Instead, we will now generate actions as
>     "hash(l4(0)),lb_output(bond,<bond id>)"
>
> This combines hash computation (only if needed, else re-use RSS hash) and
> inline load-balancing over the bond. This action is used *only* for balance-tcp
> bonds in OVS-DPDK datapath (the OVS kernel datapath remains unchanged).
>
> Example:
> --------
> Current scheme:
> ---------------
> With 1 IP-UDP flow:
>
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2828969, bytes:181054016, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)
>
> recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:2828937, bytes:181051968, used:0.000s, actions:2
>
> With 8 IP-UDP flows (with random UDP src port): (all hitting same DPCL):
>
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)
>
> recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:377395, bytes:24153280, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:333486, bytes:21343104, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:348461, bytes:22301504, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:633353, bytes:40534592, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:319901, bytes:20473664, used:0.001s, actions:2
> recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:334985, bytes:21439040, used:0.001s, actions:1
> recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:326404, bytes:20889856, used:0.001s, actions:1
>
> New scheme:
> -----------
> We can do with a single flow entry (for any number of new flows):
>
> in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:hash(l4(0)),lb_output(bond,1)
>
> A new CLI has been added to dump the per-PMD bond cache as given below.
>
> “sudo ovs-appctl dpif-netdev/pmd-bond-show”
>
> root@ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/pmd-bond-show
> pmd thread numa_id 0 core_id 4:
> Bond cache:
>         bond-id 1 :
>                 bucket 0 - slave 2
>                 bucket 1 - slave 1
>                 bucket 2 - slave 2
>                 bucket 3 - slave 1
>
> Performance improvement:
> ------------------------
> With a prototype of the proposed idea, the following perf improvement is seen
> with Phy-VM-Phy UDP traffic, single flow. With multiple flows, the improvement
> is even more enhanced (due to reduced number of flows).
>
> 1 VM:
> *****
> +--------------------------------------+
> |                 mpps                 |
> +--------------------------------------+
> | Flows  master  with-opt.   %delta    |
> +--------------------------------------+
> | 1      4.53    5.89        29.96
> | 10     4.16    5.89        41.51
> | 400    3.55    5.55        56.22
> | 1k     3.44    5.45        58.30
> | 10k    2.50    4.63        85.34
> | 100k   2.29    4.27        86.15
> | 500k   2.25    4.27        89.23
> +--------------------------------------+
> mpps: million packets per second.
>
> Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
> Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
>
> CC: Jan Scheurich <jan.scheurich@ericsson.com>
> CC: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> CC: Nitin Katiyar <nitin.katiyar@ericsson.com>

Hi Vishal,

I quickly tested your patch on two servers with 2 ixgbe cards each,
linked via a Juniper switch with LACP.
With testpmd using a single core, the switching rate rose from ~2.0
Mpps to ~2.4+ Mpps, so I see at least a +20% gain.

Please add an example command showing how to enable it to the commit message, e.g.

  ovs-vsctl set port bond0 other_config:opt-bond-tcp=true

Thanks,
Vishal Deep Ajmera Aug. 27, 2019, 10:22 a.m. UTC | #5
> 
> Hi Vishal,
> 
> I quickly tested your patch on two servers with 2 ixgbe cards each linked via a
> juniper switch with LACP.
> With testpmd using a single core the switching rate raised from ~2.0 Mpps to
> ~2.4+ Mpps, so I read at least a +20% gain.
> 
> Please add an example command on how to enable it in the commit message,
> e.g.
> 
>   ovs-vsctl set port bond0 other_config:opt-bond-tcp=true

Thanks Matteo for testing the patch and sharing the results. I will add an example to the commit message in the next patch-set.

Warm Regards,
Vishal Ajmera
Vishal Deep Ajmera Aug. 27, 2019, 10:24 a.m. UTC | #6
Thanks Ilya for the comments. I will address them in the next patch-set.

Warm Regards,
Vishal Ajmera
Patch

diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
index 65a003a..6dafcfb 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -734,6 +734,7 @@  enum ovs_hash_alg {
 	OVS_HASH_ALG_L4,
 #ifndef __KERNEL__
 	OVS_HASH_ALG_SYM_L4,
+        OVS_HASH_ALG_L4_RSS,
 #endif
 	__OVS_HASH_MAX
 };
@@ -989,6 +990,7 @@  enum ovs_action_attr {
 #ifndef __KERNEL__
 	OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
 	OVS_ACTION_ATTR_TUNNEL_POP,    /* u32 port number. */
+	OVS_ACTION_ATTR_LB_OUTPUT,     /* bond-id */
 #endif
 	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
 				       * from userspace. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index d0a1c58..9db2a73 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -79,6 +79,7 @@ 
 #include "unixctl.h"
 #include "util.h"
 #include "uuid.h"
+#include "ofproto/bond.h"
 
 VLOG_DEFINE_THIS_MODULE(dpif_netdev);
 
@@ -366,6 +367,11 @@  struct dp_netdev {
 
     struct conntrack *conntrack;
     struct pmd_auto_lb pmd_alb;
+    /* Bonds.
+     *
+     * Any lookup into 'bonds' requires taking 'bond_mutex'. */
+    struct ovs_mutex bond_mutex;
+    struct hmap bonds;
 };
 
 static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
@@ -596,6 +602,20 @@  struct tx_port {
     struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
 };
 
+/* Contained by struct tx_bond 'slave_buckets' */
+struct slave_entry {
+    uint32_t slave_id;
+    atomic_ullong n_packets;
+    atomic_ullong n_bytes;
+};
+
+/* Contained by struct dp_netdev_pmd_thread's 'bond_cache' or 'tx_bonds'. */
+struct tx_bond {
+    struct hmap_node node;
+    uint32_t bond_id;
+    struct slave_entry slave_buckets[BOND_BUCKETS];
+};
+
 /* A set of properties for the current processing loop that is not directly
  * associated with the pmd thread itself, but with the packets being
  * processed or the short-term system configuration (for example, time).
@@ -708,6 +728,11 @@  struct dp_netdev_pmd_thread {
     atomic_bool reload_tx_qid;      /* Do we need to reload static_tx_qid? */
     atomic_bool exit;               /* For terminating the pmd thread. */
 
+    atomic_bool reload_bond_cache;  /* Do we need to load tx bond cache?
+                                     * Note: This flag is decoupled from 'reload'
+                                     * flag otherwise full pmd reload will become
+                                     * frequent and costly everytime bond
+                                     * rebalancing is done. */
     pthread_t thread;
     unsigned core_id;               /* CPU core id of this pmd thread. */
     int numa_id;                    /* numa node id of this pmd thread. */
@@ -728,6 +753,11 @@  struct dp_netdev_pmd_thread {
      * read by the pmd thread. */
     struct hmap tx_ports OVS_GUARDED;
 
+    struct ovs_mutex bond_mutex;    /* Mutex for 'tx_bonds'. */
+    /* Map of 'tx_bond's used for transmission.  Written by the main thread,
+     * read/written by the pmd thread. */
+    struct hmap tx_bonds OVS_GUARDED;
+
     /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
      * ports (that support push_tunnel/pop_tunnel), the other contains ports
      * with at least one txq (that support send).  A port can be in both.
@@ -740,6 +770,8 @@  struct dp_netdev_pmd_thread {
      * other instance will only be accessed by its own pmd thread. */
     struct hmap tnl_port_cache;
     struct hmap send_port_cache;
+    /* These are thread-local copies of 'tx_bonds' */
+    struct hmap bond_cache;
 
     /* Keep track of detailed PMD performance statistics. */
     struct pmd_perf_stats perf_stats;
@@ -819,6 +851,12 @@  static void dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
 static int
 dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
                                    bool force);
+static void dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
+                                         struct tx_bond *bond)
+    OVS_REQUIRES(pmd->bond_mutex);
+static void dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
+                                           struct tx_bond *tx)
+    OVS_REQUIRES(pmd->bond_mutex);
 
 static void reconfigure_datapath(struct dp_netdev *dp)
     OVS_REQUIRES(dp->port_mutex);
@@ -827,6 +865,10 @@  static void dp_netdev_pmd_unref(struct dp_netdev_pmd_thread *pmd);
 static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd);
 static void pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
     OVS_REQUIRES(pmd->port_mutex);
+static void pmd_load_cached_bonds(struct dp_netdev_pmd_thread *pmd)
+    OVS_REQUIRES(pmd->bond_mutex);
+static void pmd_load_bond_cache(struct dp_netdev_pmd_thread *pmd);
+
 static inline void
 dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
                            struct polled_queue *poll_list, int poll_cnt);
@@ -1385,6 +1427,58 @@  pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
     par.command_type = PMD_INFO_PERF_SHOW;
     dpif_netdev_pmd_info(conn, argc, argv, &par);
 }
+
+static void
+dpif_netdev_pmd_bond_show(struct unixctl_conn *conn, int argc,
+                          const char *argv[], void *aux OVS_UNUSED)
+{
+    struct ds reply = DS_EMPTY_INITIALIZER;
+    struct dp_netdev_pmd_thread *pmd;
+    struct dp_netdev *dp = NULL;
+    uint32_t bucket;
+    struct tx_bond *pmd_bond_entry = NULL;
+
+    ovs_mutex_lock(&dp_netdev_mutex);
+
+    if (argc == 2) {
+        dp = shash_find_data(&dp_netdevs, argv[1]);
+    } else if (shash_count(&dp_netdevs) == 1) {
+        /* There's only one datapath */
+        dp = shash_first(&dp_netdevs)->data;
+    }
+    if (!dp) {
+        ovs_mutex_unlock(&dp_netdev_mutex);
+        unixctl_command_reply_error(conn,
+                                    "please specify an existing datapath");
+        return;
+    }
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        ds_put_cstr(&reply, (pmd->core_id == NON_PMD_CORE_ID)
+                            ? "main thread" : "pmd thread");
+        if (pmd->numa_id != OVS_NUMA_UNSPEC) {
+            ds_put_format(&reply, " numa_id %d", pmd->numa_id);
+        }
+        if (pmd->core_id != OVS_CORE_UNSPEC &&
+            pmd->core_id != NON_PMD_CORE_ID) {
+            ds_put_format(&reply, " core_id %u", pmd->core_id);
+        }
+        ds_put_cstr(&reply, ":\n");
+        ds_put_cstr(&reply, "\nBonds:\n");
+        HMAP_FOR_EACH (pmd_bond_entry, node, &pmd->tx_bonds) {
+            ds_put_format(&reply, "\tbond-id %u :\n",
+                          pmd_bond_entry->bond_id);
+            for (bucket = 0; bucket < BOND_BUCKETS; bucket++) {
+                ds_put_format(&reply, "\t\tbucket %u - slave %u \n",
+                          bucket,
+                          pmd_bond_entry->slave_buckets[bucket].slave_id);
+            }
+        }
+    }
+    ovs_mutex_unlock(&dp_netdev_mutex);
+    unixctl_command_reply(conn, ds_cstr(&reply));
+    ds_destroy(&reply);
+}
+
 
 static int
 dpif_netdev_init(void)
@@ -1416,6 +1510,9 @@  dpif_netdev_init(void)
                              "[-us usec] [-q qlen]",
                              0, 10, pmd_perf_log_set_cmd,
                              NULL);
+    unixctl_command_register("dpif-netdev/pmd-bond-show", "[dp]",
+                             0, 1, dpif_netdev_pmd_bond_show,
+                             NULL);
     return 0;
 }
 
@@ -1531,6 +1628,9 @@  create_dp_netdev(const char *name, const struct dpif_class *class,
     ovs_mutex_init(&dp->port_mutex);
     hmap_init(&dp->ports);
     dp->port_seq = seq_create();
+    ovs_mutex_init(&dp->bond_mutex);
+    hmap_init(&dp->bonds);
+
     fat_rwlock_init(&dp->upcall_rwlock);
 
     dp->reconfigure_seq = seq_create();
@@ -1645,6 +1745,7 @@  dp_netdev_free(struct dp_netdev *dp)
     OVS_REQUIRES(dp_netdev_mutex)
 {
     struct dp_netdev_port *port, *next;
+    struct tx_bond *bond, *next_bond;
 
     shash_find_and_delete(&dp_netdevs, dp->name);
 
@@ -1654,6 +1755,13 @@  dp_netdev_free(struct dp_netdev *dp)
     }
     ovs_mutex_unlock(&dp->port_mutex);
 
+    ovs_mutex_lock(&dp->bond_mutex);
+    HMAP_FOR_EACH_SAFE (bond, next_bond, node, &dp->bonds) {
+        hmap_remove(&dp->bonds, &bond->node);
+        free(bond);
+    }
+    ovs_mutex_unlock(&dp->bond_mutex);
+
     dp_netdev_destroy_all_pmds(dp, true);
     cmap_destroy(&dp->poll_threads);
 
@@ -1672,6 +1780,9 @@  dp_netdev_free(struct dp_netdev *dp)
     hmap_destroy(&dp->ports);
     ovs_mutex_destroy(&dp->port_mutex);
 
+    hmap_destroy(&dp->bonds);
+    ovs_mutex_destroy(&dp->bond_mutex);
+
     /* Upcalls must be disabled at this point */
     dp_netdev_destroy_upcall_lock(dp);
 
@@ -1775,6 +1886,7 @@  dp_netdev_reload_pmd__(struct dp_netdev_pmd_thread *pmd)
         ovs_mutex_lock(&pmd->port_mutex);
         pmd_load_cached_ports(pmd);
         ovs_mutex_unlock(&pmd->port_mutex);
+        pmd_load_bond_cache(pmd);
         ovs_mutex_unlock(&pmd->dp->non_pmd_mutex);
         return;
     }
@@ -1789,6 +1901,12 @@  hash_port_no(odp_port_t port_no)
     return hash_int(odp_to_u32(port_no), 0);
 }
 
+static uint32_t
+hash_bond_id(uint32_t bond_id)
+{
+    return hash_int(bond_id, 0);
+}
+
 static int
 port_create(const char *devname, const char *type,
             odp_port_t port_no, struct dp_netdev_port **portp)
@@ -4311,6 +4429,19 @@  tx_port_lookup(const struct hmap *hmap, odp_port_t port_no)
     return NULL;
 }
 
+static struct tx_bond *
+tx_bond_lookup(const struct hmap *hmap, uint32_t bond_id)
+{
+    struct tx_bond *tx;
+
+    HMAP_FOR_EACH_IN_BUCKET (tx, node, hash_bond_id(bond_id), hmap) {
+        if (tx->bond_id == bond_id) {
+            return tx;
+        }
+    }
+    return NULL;
+}
+
 static int
 port_reconfigure(struct dp_netdev_port *port)
 {
@@ -4788,6 +4919,27 @@  pmd_remove_stale_ports(struct dp_netdev *dp,
     ovs_mutex_unlock(&pmd->port_mutex);
 }
 
+static void
+pmd_remove_stale_bonds(struct dp_netdev *dp,
+                       struct dp_netdev_pmd_thread *pmd)
+    OVS_EXCLUDED(pmd->bond_mutex)
+    OVS_EXCLUDED(dp->bond_mutex)
+{
+    struct tx_bond *tx, *tx_next;
+
+    ovs_mutex_lock(&dp->bond_mutex);
+    ovs_mutex_lock(&pmd->bond_mutex);
+
+    HMAP_FOR_EACH_SAFE (tx, tx_next, node, &pmd->tx_bonds) {
+        if (!tx_bond_lookup(&dp->bonds, tx->bond_id)) {
+            dp_netdev_del_bond_tx_from_pmd(pmd, tx);
+        }
+    }
+
+    ovs_mutex_unlock(&pmd->bond_mutex);
+    ovs_mutex_unlock(&dp->bond_mutex);
+}
+
 /* Must be called each time a port is added/removed or the cmask changes.
  * This creates and destroys pmd threads, reconfigures ports, opens their
  * rxqs and assigns all rxqs/txqs to pmd threads. */
@@ -4798,6 +4950,7 @@  reconfigure_datapath(struct dp_netdev *dp)
     struct hmapx busy_threads = HMAPX_INITIALIZER(&busy_threads);
     struct dp_netdev_pmd_thread *pmd;
     struct dp_netdev_port *port;
+    struct tx_bond *bond;
     int wanted_txqs;
 
     dp->last_reconfigure_seq = seq_read(dp->reconfigure_seq);
@@ -4826,10 +4979,11 @@  reconfigure_datapath(struct dp_netdev *dp)
         }
     }
 
-    /* Remove from the pmd threads all the ports that have been deleted or
-     * need reconfiguration. */
+    /* Remove from the pmd threads all the ports/bonds that have been deleted
+     * or need reconfiguration. */
     CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
         pmd_remove_stale_ports(dp, pmd);
+        pmd_remove_stale_bonds(dp, pmd);
     }
 
     /* Reload affected pmd threads.  We must wait for the pmd threads before
@@ -4951,6 +5105,20 @@  reconfigure_datapath(struct dp_netdev *dp)
         ovs_mutex_unlock(&pmd->port_mutex);
     }
 
+    /* Add every bond to the tx cache of every pmd thread, if it's not
+     * there already and if this pmd has at least one rxq to poll. */
+    ovs_mutex_lock(&dp->bond_mutex);
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        ovs_mutex_lock(&pmd->bond_mutex);
+        if (hmap_count(&pmd->poll_list) || pmd->core_id == NON_PMD_CORE_ID) {
+            HMAP_FOR_EACH (bond, node, &dp->bonds) {
+                dp_netdev_add_bond_tx_to_pmd(pmd, bond);
+            }
+        }
+        ovs_mutex_unlock(&pmd->bond_mutex);
+    }
+    ovs_mutex_unlock(&dp->bond_mutex);
+
     /* Reload affected pmd threads. */
     reload_affected_pmds(dp);
 
@@ -5209,7 +5377,6 @@  pmd_rebalance_dry_run(struct dp_netdev *dp)
     return ret;
 }
 
-
 /* Return true if needs to revalidate datapath flows. */
 static bool
 dpif_netdev_run(struct dpif *dpif)
@@ -5379,6 +5546,58 @@  pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
 }
 
 static void
+pmd_free_cached_bonds(struct dp_netdev_pmd_thread *pmd)
+{
+    struct tx_bond *bond, *next;
+
+    /* Remove bonds from the pmd cache that no longer exist. */
+    HMAP_FOR_EACH_SAFE (bond, next, node, &pmd->bond_cache) {
+        struct tx_bond *tx = NULL;
+
+        tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);
+        if (!tx) {
+            /* Bond no longer exists. Delete it from pmd. */
+            hmap_remove(&pmd->bond_cache, &bond->node);
+            free(bond);
+        }
+    }
+}
+
+/* Copies bonds from 'pmd->tx_bonds' (shared with the main thread) to
+ * 'pmd->bond_cache' (thread local). */
+static void
+pmd_load_cached_bonds(struct dp_netdev_pmd_thread *pmd)
+    OVS_REQUIRES(pmd->bond_mutex)
+{
+    struct tx_bond *tx_bond, *tx_bond_cached;
+
+    pmd_free_cached_bonds(pmd);
+    hmap_shrink(&pmd->bond_cache);
+
+    HMAP_FOR_EACH (tx_bond, node, &pmd->tx_bonds) {
+        uint32_t bucket = 0;
+        /* Check if bond already exists on pmd. */
+        tx_bond_cached = tx_bond_lookup(&pmd->bond_cache, tx_bond->bond_id);
+
+        if (!tx_bond_cached) {
+            /* Create new bond entry in cache. */
+            tx_bond_cached = xmemdup(tx_bond, sizeof *tx_bond_cached);
+            hmap_insert(&pmd->bond_cache, &tx_bond_cached->node,
+                        hash_bond_id(tx_bond_cached->bond_id));
+        } else {
+            /* Update the slave-map. */
+            for (bucket = 0; bucket <= BOND_MASK; bucket++) {
+                tx_bond_cached->slave_buckets[bucket].slave_id =
+                    tx_bond->slave_buckets[bucket].slave_id;
+            }
+        }
+        VLOG_DBG("Caching bond-id %d pmd %d",
+                 tx_bond_cached->bond_id, pmd->core_id);
+    }
+}
+
+
+static void
 pmd_alloc_static_tx_qid(struct dp_netdev_pmd_thread *pmd)
 {
     ovs_mutex_lock(&pmd->dp->tx_qid_pool_mutex);
@@ -5400,6 +5619,14 @@  pmd_free_static_tx_qid(struct dp_netdev_pmd_thread *pmd)
     ovs_mutex_unlock(&pmd->dp->tx_qid_pool_mutex);
 }
 
+static void
+pmd_load_bond_cache(struct dp_netdev_pmd_thread *pmd)
+{
+    ovs_mutex_lock(&pmd->bond_mutex);
+    pmd_load_cached_bonds(pmd);
+    ovs_mutex_unlock(&pmd->bond_mutex);
+}
+
 static int
 pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
                           struct polled_queue **ppoll_list)
@@ -5427,6 +5654,8 @@  pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
 
     ovs_mutex_unlock(&pmd->port_mutex);
 
+    pmd_load_bond_cache(pmd);
+
     *ppoll_list = poll_list;
     return i;
 }
@@ -5442,6 +5671,7 @@  pmd_thread_main(void *f_)
     bool reload_tx_qid;
     bool exiting;
     bool reload;
+    bool reload_bond_cache;
     int poll_cnt;
     int i;
     int process_packets = 0;
@@ -5538,6 +5768,13 @@  reload:
                                  netdev_rxq_enabled(poll_list[i].rxq->rx);
                 }
             }
+            atomic_read_explicit(&pmd->reload_bond_cache, &reload_bond_cache,
+                                 memory_order_acquire);
+            if (reload_bond_cache) {
+                pmd_load_bond_cache(pmd);
+                atomic_store_explicit(&pmd->reload_bond_cache, false,
+                                      memory_order_release);
+            }
         }
 
         atomic_read_explicit(&pmd->reload, &reload, memory_order_acquire);
@@ -5981,6 +6218,7 @@  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     atomic_init(&pmd->reload, false);
     ovs_mutex_init(&pmd->flow_mutex);
     ovs_mutex_init(&pmd->port_mutex);
+    ovs_mutex_init(&pmd->bond_mutex);
     cmap_init(&pmd->flow_table);
     cmap_init(&pmd->classifiers);
     pmd->ctx.last_rxq = NULL;
@@ -5991,6 +6229,8 @@  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     hmap_init(&pmd->tx_ports);
     hmap_init(&pmd->tnl_port_cache);
     hmap_init(&pmd->send_port_cache);
+    hmap_init(&pmd->tx_bonds);
+    hmap_init(&pmd->bond_cache);
     /* init the 'flow_cache' since there is no
      * actual thread created for NON_PMD_CORE_ID. */
     if (core_id == NON_PMD_CORE_ID) {
@@ -6011,6 +6251,8 @@  dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
     hmap_destroy(&pmd->send_port_cache);
     hmap_destroy(&pmd->tnl_port_cache);
     hmap_destroy(&pmd->tx_ports);
+    hmap_destroy(&pmd->bond_cache);
+    hmap_destroy(&pmd->tx_bonds);
     hmap_destroy(&pmd->poll_list);
     /* All flows (including their dpcls_rules) have been deleted already */
     CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
@@ -6022,6 +6264,7 @@  dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
     ovs_mutex_destroy(&pmd->flow_mutex);
     seq_destroy(pmd->reload_seq);
     ovs_mutex_destroy(&pmd->port_mutex);
+    ovs_mutex_destroy(&pmd->bond_mutex);
     free(pmd);
 }
 
@@ -6175,6 +6418,49 @@  dp_netdev_del_port_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
     free(tx);
     pmd->need_reload = true;
 }
+
+static void
+dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
+                             struct tx_bond *bond)
+    OVS_REQUIRES(pmd->bond_mutex)
+{
+    struct tx_bond *tx;
+    uint32_t i;
+    bool reload = false;
+
+    tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);
+    if (tx) {
+        /* Check if mapping is changed. */
+        for (i = 0; i <= BOND_MASK; i++) {
+            if (bond->slave_buckets[i].slave_id !=
+                     tx->slave_buckets[i].slave_id) {
+                /* Mapping is modified. Reload pmd bond cache again. */
+                reload = true;
+            }
+            /* Copy the map always. */
+            tx->slave_buckets[i].slave_id = bond->slave_buckets[i].slave_id;
+        }
+    } else {
+        tx = xmemdup(bond, sizeof *tx);
+        hmap_insert(&pmd->tx_bonds, &tx->node, hash_bond_id(bond->bond_id));
+        reload = true;
+    }
+    if (reload) {
+        atomic_store_explicit(&pmd->reload_bond_cache, true,
+                              memory_order_release);
+    }
+}
+
+/* Deletes 'tx' from the tx bond cache of 'pmd'. */
+static void
+dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
+                               struct tx_bond *tx)
+    OVS_REQUIRES(pmd->bond_mutex)
+{
+    hmap_remove(&pmd->tx_bonds, &tx->node);
+    free(tx);
+    atomic_store_explicit(&pmd->reload_bond_cache, true, memory_order_release);
+}
 
 static char *
 dpif_netdev_get_datapath_version(void)
@@ -6946,6 +7232,13 @@  pmd_send_port_cache_lookup(const struct dp_netdev_pmd_thread *pmd,
     return tx_port_lookup(&pmd->send_port_cache, port_no);
 }
 
+static struct tx_bond *
+pmd_tx_bond_cache_lookup(const struct dp_netdev_pmd_thread *pmd,
+                         uint32_t bond_id)
+{
+    return tx_bond_lookup(&pmd->bond_cache, bond_id);
+}
+
 static int
 push_tnl_action(const struct dp_netdev_pmd_thread *pmd,
                 const struct nlattr *attr,
@@ -6995,6 +7288,51 @@  dp_execute_userspace_action(struct dp_netdev_pmd_thread *pmd,
     }
 }
 
+static int
+dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
+                         struct dp_packet_batch *packets_,
+                         bool should_steal,
+                         odp_port_t port_no)
+{
+    struct tx_port *p;
+    p = pmd_send_port_cache_lookup(pmd, port_no);
+    if (OVS_LIKELY(p)) {
+        struct dp_packet *packet;
+        struct dp_packet_batch out;
+        if (!should_steal) {
+            dp_packet_batch_clone(&out, packets_);
+            dp_packet_batch_reset_cutlen(packets_);
+            packets_ = &out;
+        }
+        dp_packet_batch_apply_cutlen(packets_);
+#ifdef DPDK_NETDEV
+        if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
+                         && packets_->packets[0]->source
+                            != p->output_pkts.packets[0]->source)) {
+            /* netdev-dpdk assumes that all packets in a single
+             * output batch have the same source. Flush here to
+             * avoid memory access issues. */
+            dp_netdev_pmd_flush_output_on_port(pmd, p);
+        }
+#endif
+        if (dp_packet_batch_size(&p->output_pkts)
+            + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
+            /* Flush here to avoid overflow. */
+            dp_netdev_pmd_flush_output_on_port(pmd, p);
+        }
+        if (dp_packet_batch_is_empty(&p->output_pkts)) {
+            pmd->n_output_batches++;
+        }
+        DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
+            p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
+                                                         pmd->ctx.last_rxq;
+            dp_packet_batch_add(&p->output_pkts, packet);
+        }
+        return 0;
+    }
+    return -1;
+}
+
 static void
 dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
               const struct nlattr *a, bool should_steal)
@@ -7006,49 +7344,58 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
     struct dp_netdev *dp = pmd->dp;
     int type = nl_attr_type(a);
     struct tx_port *p;
+    int ret;
 
     switch ((enum ovs_action_attr)type) {
     case OVS_ACTION_ATTR_OUTPUT:
-        p = pmd_send_port_cache_lookup(pmd, nl_attr_get_odp_port(a));
-        if (OVS_LIKELY(p)) {
-            struct dp_packet *packet;
-            struct dp_packet_batch out;
-
-            if (!should_steal) {
-                dp_packet_batch_clone(&out, packets_);
-                dp_packet_batch_reset_cutlen(packets_);
-                packets_ = &out;
-            }
-            dp_packet_batch_apply_cutlen(packets_);
-
-#ifdef DPDK_NETDEV
-            if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
-                             && packets_->packets[0]->source
-                                != p->output_pkts.packets[0]->source)) {
-                /* XXX: netdev-dpdk assumes that all packets in a single
-                 *      output batch has the same source. Flush here to
-                 *      avoid memory access issues. */
-                dp_netdev_pmd_flush_output_on_port(pmd, p);
-            }
-#endif
-            if (dp_packet_batch_size(&p->output_pkts)
-                + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
-                /* Flush here to avoid overflow. */
-                dp_netdev_pmd_flush_output_on_port(pmd, p);
-            }
-
-            if (dp_packet_batch_is_empty(&p->output_pkts)) {
-                pmd->n_output_batches++;
-            }
+        ret = dp_execute_output_action(pmd, packets_, should_steal,
+                                       nl_attr_get_odp_port(a));
+        if (ret == 0) {
+            /* Output action executed successfully. */
+            return;
+        }
+        break;
 
+    case OVS_ACTION_ATTR_LB_OUTPUT: {
+        uint32_t bond = nl_attr_get_u32(a);
+        uint32_t bond_member;
+        uint32_t bucket;
+        struct dp_packet_batch del_pkts;
+        struct dp_packet_batch output_pkt;
+        struct dp_packet *packet;
+        struct tx_bond *p_bond;
+        struct slave_entry *s_entry;
+        uint32_t size;
+
+        p_bond = pmd_tx_bond_cache_lookup(pmd, bond);
+        dp_packet_batch_init(&del_pkts);
+        if (p_bond) {
             DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
-                p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
-                                                             pmd->ctx.last_rxq;
-                dp_packet_batch_add(&p->output_pkts, packet);
+                /*
+                 * Look up the bond-hash table by hash to get the slave.
+                 */
+                bucket = (packet->md.dp_hash & BOND_MASK);
+                s_entry = &p_bond->slave_buckets[bucket];
+                bond_member = s_entry->slave_id;
+                size = dp_packet_size(packet);
+
+                dp_packet_batch_init_packet(&output_pkt, packet);
+                ret = dp_execute_output_action(pmd, &output_pkt, should_steal,
+                                               u32_to_odp(bond_member));
+                if (OVS_UNLIKELY(ret != 0)) {
+                    dp_packet_batch_add(&del_pkts, packet);
+                } else {
+                    /* Update slave stats. */
+                    non_atomic_ullong_add(&s_entry->n_packets, 1);
+                    non_atomic_ullong_add(&s_entry->n_bytes, size);
+                }
             }
+            /* Delete packets that failed OUTPUT action. */
+            dp_packet_delete_batch(&del_pkts, should_steal);
             return;
         }
         break;
+    }
 
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
         if (should_steal) {
@@ -7477,6 +7824,110 @@  dpif_netdev_ipf_dump_done(struct dpif *dpif OVS_UNUSED, void *ipf_dump_ctx)
 
 }
 
+static int
+dpif_netdev_bond_add(struct dpif *dpif, uint32_t bond_id, uint32_t slave_map[])
+{
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+    struct dp_netdev_pmd_thread *pmd;
+    uint32_t bucket;
+    struct tx_bond *dp_bond_entry = NULL;
+
+    ovs_mutex_lock(&dp->bond_mutex);
+    /*
+     * Look up the bond. If it already exists, just update the slave-map.
+     * Otherwise create a new one.
+     */
+    dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id);
+    if (dp_bond_entry) {
+        for (bucket = 0; bucket <= BOND_MASK; bucket++) {
+            dp_bond_entry->slave_buckets[bucket].slave_id = slave_map[bucket];
+        }
+        CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+            ovs_mutex_lock(&pmd->bond_mutex);
+            dp_netdev_add_bond_tx_to_pmd(pmd, dp_bond_entry);
+            ovs_mutex_unlock(&pmd->bond_mutex);
+        }
+    } else {
+        struct tx_bond *dp_bond = xzalloc(sizeof *dp_bond);
+        dp_bond->bond_id = bond_id;
+        for (bucket = 0; bucket < BOND_BUCKETS; bucket++) {
+            dp_bond->slave_buckets[bucket].slave_id = slave_map[bucket];
+        }
+        hmap_insert(&dp->bonds, &dp_bond->node,
+                    hash_bond_id(dp_bond->bond_id));
+        /* Insert the bond map in all pmds. */
+        CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+            ovs_mutex_lock(&pmd->bond_mutex);
+            dp_netdev_add_bond_tx_to_pmd(pmd, dp_bond);
+            ovs_mutex_unlock(&pmd->bond_mutex);
+        }
+    }
+    ovs_mutex_unlock(&dp->bond_mutex);
+    return 0;
+}
+
+static int
+dpif_netdev_bond_del(struct dpif *dpif, uint32_t bond_id)
+{
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+    struct dp_netdev_pmd_thread *pmd;
+    struct tx_bond *dp_bond_entry = NULL;
+
+    ovs_mutex_lock(&dp->bond_mutex);
+
+    /* Find the bond and delete it if present. */
+    dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id);
+    if (dp_bond_entry) {
+        /* Remove the bond map in all pmds. */
+        CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+            ovs_mutex_lock(&pmd->bond_mutex);
+            dp_netdev_del_bond_tx_from_pmd(pmd, dp_bond_entry);
+            ovs_mutex_unlock(&pmd->bond_mutex);
+        }
+        hmap_remove(&dp->bonds, &dp_bond_entry->node);
+        free(dp_bond_entry);
+    }
+
+    ovs_mutex_unlock(&dp->bond_mutex);
+    return 0;
+}
+
+static int
+dpif_netdev_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
+                           uint64_t *n_bytes)
+{
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+    struct dp_netdev_pmd_thread *pmd;
+    struct tx_bond *dp_bond_entry = NULL;
+    struct tx_bond *pmd_bond_entry = NULL;
+    uint32_t i;
+
+    ovs_mutex_lock(&dp->bond_mutex);
+
+    /* Find the bond and retrieve stats if present. */
+    dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id);
+    if (dp_bond_entry) {
+        /* Search for the bond in all PMDs. */
+        CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+            uint64_t pmd_n_bytes;
+            ovs_mutex_lock(&pmd->bond_mutex);
+            pmd_bond_entry = tx_bond_lookup(&pmd->bond_cache, bond_id);
+            if (pmd_bond_entry) {
+                /* Read bond stats. */
+                for (i = 0; i <= BOND_MASK; i++) {
+                    atomic_read_relaxed(
+                         &pmd_bond_entry->slave_buckets[i].n_bytes,
+                         &pmd_n_bytes);
+                    n_bytes[i] += pmd_n_bytes;
+                }
+            }
+            ovs_mutex_unlock(&pmd->bond_mutex);
+        }
+    }
+    ovs_mutex_unlock(&dp->bond_mutex);
+    return 0;
+}
+
 const struct dpif_class dpif_netdev_class = {
     "netdev",
     true,                       /* cleanup_required */
@@ -7540,6 +7991,9 @@  const struct dpif_class dpif_netdev_class = {
     dpif_netdev_meter_set,
     dpif_netdev_meter_get,
     dpif_netdev_meter_del,
+    dpif_netdev_bond_add,
+    dpif_netdev_bond_del,
+    dpif_netdev_bond_stats_get,
 };
 
 static void
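For review purposes, the per-packet selection that OVS_ACTION_ATTR_LB_OUTPUT
performs in dp_execute_cb() above can be reduced to the standalone sketch
below: the low 8 bits of the packet hash pick a bucket, and the bucket names
the member port and carries the counters. All sketch_* names are hypothetical
and are not part of this patch; only the 0xff mask value mirrors BOND_MASK.

```c
#include <stdint.h>

/* Mirrors BOND_MASK / BOND_BUCKETS from ofproto/bond.h (assumed values). */
#define SKETCH_BOND_MASK 0xff
#define SKETCH_BOND_BUCKETS (SKETCH_BOND_MASK + 1)

/* Hypothetical stand-in for one bucket of the per-bond table. */
struct sketch_bucket {
    uint32_t member_id;     /* Port the bucket currently maps to. */
    uint64_t n_packets;     /* Packets sent through this bucket. */
    uint64_t n_bytes;       /* Bytes sent through this bucket. */
};

/* Hypothetical stand-in for the per-pmd cached bond entry. */
struct sketch_bond {
    struct sketch_bucket buckets[SKETCH_BOND_BUCKETS];
};

/* Picks the member port for a packet with hash 'dp_hash' and accounts
 * 'pkt_size' bytes against the selected bucket. */
static uint32_t
sketch_select_member(struct sketch_bond *bond, uint32_t dp_hash,
                     uint32_t pkt_size)
{
    struct sketch_bucket *b = &bond->buckets[dp_hash & SKETCH_BOND_MASK];

    b->n_packets++;
    b->n_bytes += pkt_size;
    return b->member_id;
}
```

Because the table has exactly BOND_BUCKETS entries and the index is masked
with BOND_MASK, the lookup needs no bounds check on the fast path.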
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 7bc71d6..a87c898 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -3440,6 +3440,9 @@  const struct dpif_class dpif_netlink_class = {
     dpif_netlink_meter_set,
     dpif_netlink_meter_get,
     dpif_netlink_meter_del,
+    NULL,                       /* bond_add */
+    NULL,                       /* bond_del */
+    NULL,                       /* bond_stats_get */
 };
 
 static int
diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
index 12898b9..043b885 100644
--- a/lib/dpif-provider.h
+++ b/lib/dpif-provider.h
@@ -552,6 +552,14 @@  struct dpif_class {
      * zero. */
     int (*meter_del)(struct dpif *, ofproto_meter_id meter_id,
                      struct ofputil_meter_stats *, uint16_t n_bands);
+
+    /* Adds a bond with 'bond_id' and the slave-map to 'dpif'. */
+    int (*bond_add)(struct dpif *dpif, uint32_t bond_id, uint32_t slave_map[]);
+    /* Removes bond identified by 'bond_id' from 'dpif'. */
+    int (*bond_del)(struct dpif *dpif, uint32_t bond_id);
+    /* Reads bond stats from 'dpif'. */
+    int (*bond_stats_get)(struct dpif *dpif, uint32_t bond_id,
+                          uint64_t *n_bytes);
 };
 
 extern const struct dpif_class dpif_netlink_class;
diff --git a/lib/dpif.c b/lib/dpif.c
index c88b210..2411c2c 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -1177,6 +1177,7 @@  dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
 
     case OVS_ACTION_ATTR_CT:
     case OVS_ACTION_ATTR_OUTPUT:
+    case OVS_ACTION_ATTR_LB_OUTPUT:
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
     case OVS_ACTION_ATTR_TUNNEL_POP:
     case OVS_ACTION_ATTR_USERSPACE:
@@ -1227,6 +1228,7 @@  dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
         struct dp_packet *clone = NULL;
         uint32_t cutlen = dp_packet_get_cutlen(packet);
         if (cutlen && (type == OVS_ACTION_ATTR_OUTPUT
+                        || type == OVS_ACTION_ATTR_LB_OUTPUT
                         || type == OVS_ACTION_ATTR_TUNNEL_PUSH
                         || type == OVS_ACTION_ATTR_TUNNEL_POP
                         || type == OVS_ACTION_ATTR_USERSPACE)) {
@@ -1879,6 +1881,16 @@  dpif_supports_tnl_push_pop(const struct dpif *dpif)
     return dpif_is_netdev(dpif);
 }
 
+bool
+dpif_supports_balance_tcp_opt(const struct dpif *dpif)
+{
+    /*
+     * Balance-tcp optimization is currently supported in netdev
+     * datapath only.
+     */
+    return dpif_is_netdev(dpif);
+}
+
 /* Meters */
 void
 dpif_meter_get_features(const struct dpif *dpif,
@@ -1976,3 +1988,39 @@  dpif_meter_del(struct dpif *dpif, ofproto_meter_id meter_id,
     }
     return error;
 }
+
+int
+dpif_bond_add(struct dpif *dpif, uint32_t bond_id, uint32_t slave_map[])
+{
+    int error = 0;
+
+    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_add) {
+        error = dpif->dpif_class->bond_add(dpif, bond_id, slave_map);
+    }
+
+    return error;
+}
+
+int
+dpif_bond_del(struct dpif *dpif, uint32_t bond_id)
+{
+    int error = 0;
+
+    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
+        error = dpif->dpif_class->bond_del(dpif, bond_id);
+    }
+
+    return error;
+}
+
+int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
+                        uint64_t *n_bytes)
+{
+    int error = 0;
+
+    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_stats_get) {
+        error = dpif->dpif_class->bond_stats_get(dpif, bond_id, n_bytes);
+    }
+
+    return error;
+}
diff --git a/lib/dpif.h b/lib/dpif.h
index 289d574..9b84122 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -891,6 +891,13 @@  int dpif_get_pmds_for_port(const struct dpif * dpif, odp_port_t port_no,
 char *dpif_get_dp_version(const struct dpif *);
 bool dpif_supports_tnl_push_pop(const struct dpif *);
 
+bool dpif_supports_balance_tcp_opt(const struct dpif *);
+
+int dpif_bond_add(struct dpif *dpif, uint32_t bond_id, uint32_t slave_map[]);
+int dpif_bond_del(struct dpif *dpif, uint32_t bond_id);
+int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
+                        uint64_t *n_bytes);
+
 /* Log functions. */
 struct vlog_module;
 
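A note on the dpif_bond_*() wrappers declared above: as written in lib/dpif.c
they return 0 when the provider leaves the hook NULL (as dpif-netlink does),
so a caller cannot tell "unsupported" from "done". The standalone sketch below
shows that dispatch pattern in isolation. The sketch_* types and functions are
hypothetical and only model the shape of the wrapper, not the real struct
dpif_class.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_dpif;

/* Hypothetical provider vtable with a single optional hook. */
struct sketch_dpif_class {
    int (*bond_del)(struct sketch_dpif *, uint32_t bond_id);
};

struct sketch_dpif {
    const struct sketch_dpif_class *dpif_class;
};

/* Counts how often the example provider hook below was invoked. */
static int sketch_del_calls;

/* Example provider implementation of the hook. */
static int
sketch_provider_bond_del(struct sketch_dpif *dpif, uint32_t bond_id)
{
    (void) dpif;
    (void) bond_id;
    sketch_del_calls++;
    return 0;
}

/* Calls the provider hook if there is one; otherwise reports success. */
static int
sketch_bond_del(struct sketch_dpif *dpif, uint32_t bond_id)
{
    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
        return dpif->dpif_class->bond_del(dpif, bond_id);
    }
    return 0;
}
```

If callers ever need to distinguish the two cases, returning EOPNOTSUPP for a
missing hook would be one option; that is a suggestion, not something this
patch does.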
diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 563ad1d..13e4e96 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -725,6 +725,7 @@  requires_datapath_assistance(const struct nlattr *a)
     switch (type) {
         /* These only make sense in the context of a datapath. */
     case OVS_ACTION_ATTR_OUTPUT:
+    case OVS_ACTION_ATTR_LB_OUTPUT:
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
     case OVS_ACTION_ATTR_TUNNEL_POP:
     case OVS_ACTION_ATTR_USERSPACE:
@@ -990,6 +991,7 @@  odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
             break;
 
         case OVS_ACTION_ATTR_OUTPUT:
+        case OVS_ACTION_ATTR_LB_OUTPUT:
         case OVS_ACTION_ATTR_TUNNEL_PUSH:
         case OVS_ACTION_ATTR_TUNNEL_POP:
         case OVS_ACTION_ATTR_USERSPACE:
diff --git a/lib/odp-util.c b/lib/odp-util.c
index 84ea4c1..e616da0 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -118,6 +118,7 @@  odp_action_len(uint16_t type)
 
     switch ((enum ovs_action_attr) type) {
     case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t);
+    case OVS_ACTION_ATTR_LB_OUTPUT: return sizeof(uint32_t);
     case OVS_ACTION_ATTR_TRUNC: return sizeof(struct ovs_action_trunc);
     case OVS_ACTION_ATTR_TUNNEL_PUSH: return ATTR_LEN_VARIABLE;
     case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t);
@@ -1113,6 +1114,9 @@  format_odp_action(struct ds *ds, const struct nlattr *a,
     case OVS_ACTION_ATTR_OUTPUT:
         odp_portno_name_format(portno_names, nl_attr_get_odp_port(a), ds);
         break;
+    case OVS_ACTION_ATTR_LB_OUTPUT:
+        ds_put_format(ds, "lb_output(bond,%"PRIu32")", nl_attr_get_u32(a));
+        break;
     case OVS_ACTION_ATTR_TRUNC: {
         const struct ovs_action_trunc *trunc =
                        nl_attr_get_unspec(a, sizeof *trunc);
diff --git a/ofproto/bond.c b/ofproto/bond.c
index c5d5f2c..15f1b40 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -54,10 +54,6 @@  static struct ovs_rwlock rwlock = OVS_RWLOCK_INITIALIZER;
 static struct hmap all_bonds__ = HMAP_INITIALIZER(&all_bonds__);
 static struct hmap *const all_bonds OVS_GUARDED_BY(rwlock) = &all_bonds__;
 
-/* Bit-mask for hashing a flow down to a bucket. */
-#define BOND_MASK 0xff
-#define BOND_BUCKETS (BOND_MASK + 1)
-
 /* Priority for internal rules created to handle recirculation */
 #define RECIRC_RULE_PRIORITY 20
 
@@ -126,6 +122,8 @@  struct bond {
     enum lacp_status lacp_status; /* Status of LACP negotiations. */
     bool bond_revalidate;       /* True if flows need revalidation. */
     uint32_t basis;             /* Basis for flow hash function. */
+    bool use_bond_cache;        /* Use bond cache to avoid recirculation.
+                                   Applicable only for Balance TCP mode. */
 
     /* SLB specific bonding info. */
     struct bond_entry *hash;     /* An array of BOND_BUCKETS elements. */
@@ -185,7 +183,7 @@  static struct bond_slave *choose_output_slave(const struct bond *,
                                               struct flow_wildcards *,
                                               uint16_t vlan)
     OVS_REQ_RDLOCK(rwlock);
-static void update_recirc_rules__(struct bond *bond);
+static void update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id);
 static bool bond_is_falling_back_to_ab(const struct bond *);
 
 /* Attempts to parse 's' as the name of a bond balancing mode.  If successful,
@@ -262,6 +260,7 @@  void
 bond_unref(struct bond *bond)
 {
     struct bond_slave *slave;
+    uint32_t bond_recirc_id = 0;
 
     if (!bond || ovs_refcount_unref_relaxed(&bond->ref_cnt) != 1) {
         return;
@@ -282,12 +281,13 @@  bond_unref(struct bond *bond)
 
     /* Free bond resources. Remove existing post recirc rules. */
     if (bond->recirc_id) {
+        bond_recirc_id = bond->recirc_id;
         recirc_free_id(bond->recirc_id);
         bond->recirc_id = 0;
     }
     free(bond->hash);
     bond->hash = NULL;
-    update_recirc_rules__(bond);
+    update_recirc_rules__(bond, bond_recirc_id);
 
     hmap_destroy(&bond->pr_rule_ops);
     free(bond->name);
@@ -328,13 +328,14 @@  add_pr_rule(struct bond *bond, const struct match *match,
  * lock annotation. Currently, only 'bond_unref()' calls
  * this function directly.  */
 static void
-update_recirc_rules__(struct bond *bond)
+update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id)
 {
     struct match match;
     struct bond_pr_rule_op *pr_op, *next_op;
     uint64_t ofpacts_stub[128 / 8];
     struct ofpbuf ofpacts;
     int i;
+    uint32_t slave_map[BOND_BUCKETS];
 
     ofpbuf_use_stub(&ofpacts, ofpacts_stub, sizeof ofpacts_stub);
 
@@ -353,8 +354,14 @@  update_recirc_rules__(struct bond *bond)
 
                 add_pr_rule(bond, &match, slave->ofp_port,
                             &bond->hash[i].pr_rule);
+                slave_map[i] = slave->ofp_port;
+            } else {
+                slave_map[i] = -1;
             }
         }
+        ofproto_dpif_bundle_add(bond->ofproto, bond->recirc_id, slave_map);
+    } else {
+        ofproto_dpif_bundle_del(bond->ofproto, bond_recirc_id);
     }
 
     HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
@@ -404,7 +411,7 @@  static void
 update_recirc_rules(struct bond *bond)
     OVS_REQ_RDLOCK(rwlock)
 {
-    update_recirc_rules__(bond);
+    update_recirc_rules__(bond, bond->recirc_id);
 }
 
 /* Updates 'bond''s overall configuration to 's'.
@@ -467,6 +474,10 @@  bond_reconfigure(struct bond *bond, const struct bond_settings *s)
         recirc_free_id(bond->recirc_id);
         bond->recirc_id = 0;
     }
+    if (bond->use_bond_cache != s->use_bond_cache) {
+        bond->use_bond_cache = s->use_bond_cache;
+        revalidate = true;
+    }
 
     if (bond->balance == BM_AB || !bond->hash || revalidate) {
         bond_entry_reset(bond);
@@ -940,6 +951,13 @@  bond_recirculation_account(struct bond *bond)
     OVS_REQ_WRLOCK(rwlock)
 {
     int i;
+    uint64_t n_bytes[BOND_BUCKETS] = {0};
+
+    if (bond->hash && bond->recirc_id) {
+        /* Retrieve bond stats from datapath. */
+        dpif_bond_stats_get(bond->ofproto->backer->dpif,
+                            bond->recirc_id, n_bytes);
+    }
 
     for (i=0; i<=BOND_MASK; i++) {
         struct bond_entry *entry = &bond->hash[i];
@@ -948,11 +966,11 @@  bond_recirculation_account(struct bond *bond)
         if (rule) {
             uint64_t n_packets OVS_UNUSED;
             long long int used OVS_UNUSED;
-            uint64_t n_bytes;
-
-            rule->ofproto->ofproto_class->rule_get_stats(
-                rule, &n_packets, &n_bytes, &used);
-            bond_entry_account(entry, n_bytes);
+            if (!bond->ofproto->backer->rt_support.balance_tcp_opt) {
+                rule->ofproto->ofproto_class->rule_get_stats(
+                    rule, &n_packets, &n_bytes[i], &used);
+            }
+            bond_entry_account(entry, n_bytes[i]);
         }
     }
 }
@@ -1362,6 +1380,8 @@  bond_print_details(struct ds *ds, const struct bond *bond)
                   may_recirc ? "yes" : "no", may_recirc ? recirc_id: -1);
 
     ds_put_format(ds, "bond-hash-basis: %"PRIu32"\n", bond->basis);
+    ds_put_format(ds, "opt-bond-tcp: %s\n",
+                  bond->use_bond_cache ? "enabled" : "disabled");
 
     ds_put_format(ds, "updelay: %d ms\n", bond->updelay);
     ds_put_format(ds, "downdelay: %d ms\n", bond->downdelay);
@@ -1939,3 +1959,9 @@  bond_get_changed_active_slave(const char *name, struct eth_addr *mac,
 
     return false;
 }
+
+bool
+bond_get_cache_mode(const struct bond *bond)
+{
+    return bond->use_bond_cache;
+}
diff --git a/ofproto/bond.h b/ofproto/bond.h
index e7c3d9b..88a4de1 100644
--- a/ofproto/bond.h
+++ b/ofproto/bond.h
@@ -22,6 +22,10 @@ 
 #include "ofproto-provider.h"
 #include "packets.h"
 
+/* Bit-mask for hashing a flow down to a bucket. */
+#define BOND_MASK 0xff
+#define BOND_BUCKETS (BOND_MASK + 1)
+
 struct flow;
 struct netdev;
 struct ofpbuf;
@@ -58,6 +62,8 @@  struct bond_settings {
                                 /* The MAC address of the interface
                                    that was active during the last
                                    ovs run. */
+    bool use_bond_cache;        /* Use bond cache. Only applicable for
+                                   bond mode BALANCE TCP. */
 };
 
 /* Program startup. */
@@ -122,4 +128,7 @@  void bond_rebalance(struct bond *);
 */
 void bond_update_post_recirc_rules(struct bond *, uint32_t *recirc_id,
                                    uint32_t *hash_basis);
+
+bool bond_get_cache_mode(const struct bond *);
+
 #endif /* bond.h */
diff --git a/ofproto/ofproto-dpif-ipfix.c b/ofproto/ofproto-dpif-ipfix.c
index b8bd1b8..3daed47 100644
--- a/ofproto/ofproto-dpif-ipfix.c
+++ b/ofproto/ofproto-dpif-ipfix.c
@@ -3016,6 +3016,7 @@  dpif_ipfix_read_actions(const struct flow *flow,
         case OVS_ACTION_ATTR_POP_NSH:
         case OVS_ACTION_ATTR_CHECK_PKT_LEN:
         case OVS_ACTION_ATTR_UNSPEC:
+        case OVS_ACTION_ATTR_LB_OUTPUT:
         case __OVS_ACTION_ATTR_MAX:
         default:
             break;
diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
index 03bd763..36f3f24 100644
--- a/ofproto/ofproto-dpif-sflow.c
+++ b/ofproto/ofproto-dpif-sflow.c
@@ -1177,6 +1177,7 @@  dpif_sflow_read_actions(const struct flow *flow,
         case OVS_ACTION_ATTR_CT:
     case OVS_ACTION_ATTR_CT_CLEAR:
         case OVS_ACTION_ATTR_METER:
+        case OVS_ACTION_ATTR_LB_OUTPUT:
             break;
 
         case OVS_ACTION_ATTR_SET_MASKED:
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 28a7fdd..9d15ac7 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -409,6 +409,8 @@  struct xlate_ctx {
     struct ofpbuf action_set;   /* Action set. */
 
     enum xlate_error error;     /* Translation failed. */
+
+    bool tnl_push_no_recirc;    /* True if tunnel push avoided recirc. */
 };
 
 /* Structure to track VLAN manipulation */
@@ -2421,6 +2423,34 @@  output_normal(struct xlate_ctx *ctx, const struct xbundle *out_xbundle,
                 /* Use recirculation instead of output. */
                 use_recirc = true;
                 xr.hash_alg = OVS_HASH_ALG_L4;
+
+                if (bond_get_cache_mode(out_xbundle->bond)) {
+                    /*
+                     * Select the hash-alg based on datapath's capability.
+                     * If not supported, default to OVS_HASH_ALG_L4 for
+                     * which HASH + RECIRC actions would be set in xlate. Else
+                     * use the RSS hash for better throughput. With
+                     * OVS_HASH_ALG_L4_RSS, RECIRC action is also avoided.
+                     *
+                     * NOTE:
+                     * Do not use load-balanced-output action when tunnel push
+                     * recirculation is avoided (via CLONE action), as L4 hash
+                     * for bond balancing needs to be computed post tunnel
+                     * encapsulation.
+                     */
+                    if (ctx->xbridge->support.balance_tcp_opt &&
+                        !ctx->tnl_push_no_recirc) {
+                        xr.hash_alg = OVS_HASH_ALG_L4_RSS;
+                    }
+
+                    VLOG_DBG("xin-in_port: %u/%u base-flow-in_port: %u/%u "
+                             "hash-algo = %d",
+                             ctx->xin->flow.in_port.ofp_port,
+                             ctx->xin->flow.in_port.odp_port,
+                             ctx->base_flow.in_port.ofp_port,
+                             ctx->base_flow.in_port.odp_port, xr.hash_alg);
+                }
+
                 /* Recirculation does not require unmasking hash fields. */
                 wc = NULL;
             }
@@ -3697,12 +3727,16 @@  native_tunnel_output(struct xlate_ctx *ctx, const struct xport *xport,
         ctx->xin->allow_side_effects = backup_side_effects;
         ctx->xin->packet = backup_packet;
         ctx->wc = backup_wc;
+
+        ctx->tnl_push_no_recirc = true;
     } else {
         /* In order to maintain accurate stats, use recirc for
          * natvie tunneling.  */
         nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC, 0);
         nl_msg_end_nested(ctx->odp_actions, clone_ofs);
-    }
+
+        ctx->tnl_push_no_recirc = false;
+    }
 
     /* Restore the flows after the translation. */
     memcpy(&ctx->xin->flow, &old_flow, sizeof ctx->xin->flow);
@@ -4128,24 +4162,36 @@  compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
         xlate_commit_actions(ctx);
 
         if (xr) {
-            /* Recirculate the packet. */
             struct ovs_action_hash *act_hash;
 
             /* Hash action. */
             enum ovs_hash_alg hash_alg = xr->hash_alg;
-            if (hash_alg > ctx->xbridge->support.max_hash_alg) {
+            if (hash_alg > ctx->xbridge->support.max_hash_alg ||
+                hash_alg == OVS_HASH_ALG_L4_RSS) {
                 /* Algorithm supported by all datapaths. */
                 hash_alg = OVS_HASH_ALG_L4;
             }
             act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
                                                 OVS_ACTION_ATTR_HASH,
                                                 sizeof *act_hash);
             act_hash->hash_alg = hash_alg;
             act_hash->hash_basis = xr->hash_basis;
 
-            /* Recirc action. */
-            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
-                           xr->recirc_id);
+            if (xr->hash_alg == OVS_HASH_ALG_L4_RSS) {
+                /*
+                 * If the hash algorithm is RSS, use the hash directly
+                 * for slave selection and avoid recirculation.
+                 *
+                 * Currently supported for the netdev datapath only.
+                 */
+                nl_msg_put_u32(ctx->odp_actions,
+                               OVS_ACTION_ATTR_LB_OUTPUT,
+                               xr->recirc_id);
+            } else {
+                /* Recirculate the packet. */
+                nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
+                               xr->recirc_id);
+            }
         } else if (is_native_tunnel) {
             /* Output to native tunnel port. */
             native_tunnel_output(ctx, xport, flow, odp_port, truncate);
@@ -7170,7 +7216,8 @@  count_output_actions(const struct ofpbuf *odp_actions)
     int n = 0;
 
     NL_ATTR_FOR_EACH_UNSAFE (a, left, odp_actions->data, odp_actions->size) {
-        if (a->nla_type == OVS_ACTION_ATTR_OUTPUT) {
+        if (a->nla_type == OVS_ACTION_ATTR_OUTPUT ||
+            a->nla_type == OVS_ACTION_ATTR_LB_OUTPUT) {
             n++;
         }
     }
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 7515352..a591035 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -1441,6 +1441,8 @@  check_support(struct dpif_backer *backer)
     backer->rt_support.ct_clear = check_ct_clear(backer);
     backer->rt_support.max_hash_alg = check_max_dp_hash_alg(backer);
     backer->rt_support.check_pkt_len = check_check_pkt_len(backer);
+    backer->rt_support.balance_tcp_opt =
+        dpif_supports_balance_tcp_opt(backer->dpif);
 
     /* Flow fields. */
     backer->rt_support.odp.ct_state = check_ct_state(backer);
@@ -3294,6 +3296,36 @@  bundle_remove(struct ofport *port_)
     }
 }
 
+int
+ofproto_dpif_bundle_add(struct ofproto_dpif *ofproto,
+                        uint32_t bond_id,
+                        uint32_t slave_map[])
+{
+    int error;
+    uint32_t bucket;
+
+    /* Convert ofp_port to odp_port. */
+    for (bucket = 0; bucket < BOND_BUCKETS; bucket++) {
+        if (slave_map[bucket] != UINT32_MAX) {
+            slave_map[bucket] =
+                ofp_port_to_odp_port(ofproto, slave_map[bucket]);
+        }
+    }
+
+    error = dpif_bond_add(ofproto->backer->dpif, bond_id, slave_map);
+    return error;
+}
+
+int
+ofproto_dpif_bundle_del(struct ofproto_dpif *ofproto,
+                        uint32_t bond_id)
+{
+    int error;
+
+    error = dpif_bond_del(ofproto->backer->dpif, bond_id);
+    return error;
+}
+
 static void
 send_pdu_cb(void *port_, const void *pdu, size_t pdu_size)
 {
diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
index cd5321e..43ab09d 100644
--- a/ofproto/ofproto-dpif.h
+++ b/ofproto/ofproto-dpif.h
@@ -194,8 +194,11 @@  struct group_dpif *group_dpif_lookup(struct ofproto_dpif *,
     /* Highest supported dp_hash algorithm. */                              \
     DPIF_SUPPORT_FIELD(size_t, max_hash_alg, "Max dp_hash algorithm")       \
                                                                             \
-    /* True if the datapath supports OVS_ACTION_ATTR_CHECK_PKT_LEN. */   \
-    DPIF_SUPPORT_FIELD(bool, check_pkt_len, "Check pkt length action")
+    /* True if the datapath supports OVS_ACTION_ATTR_CHECK_PKT_LEN. */      \
+    DPIF_SUPPORT_FIELD(bool, check_pkt_len, "Check pkt length action")      \
+                                                                            \
+    /* True if the datapath supports balance_tcp optimization. */           \
+    DPIF_SUPPORT_FIELD(bool, balance_tcp_opt, "Balance-tcp opt")
 
 /* Stores the various features which the corresponding backer supports. */
 struct dpif_backer_support {
@@ -361,6 +364,11 @@  int ofproto_dpif_add_internal_flow(struct ofproto_dpif *,
                                    struct rule **rulep);
 int ofproto_dpif_delete_internal_flow(struct ofproto_dpif *, struct match *,
                                       int priority);
+int ofproto_dpif_bundle_add(struct ofproto_dpif *,
+                            uint32_t bond_id,
+                            uint32_t slave_map[]);
+int ofproto_dpif_bundle_del(struct ofproto_dpif *,
+                            uint32_t bond_id);
 
 bool ovs_native_tunneling_is_on(struct ofproto_dpif *);
 
diff --git a/tests/lacp.at b/tests/lacp.at
index 7b460d7..26aa1b7 100644
--- a/tests/lacp.at
+++ b/tests/lacp.at
@@ -121,6 +121,7 @@  AT_CHECK([ovs-appctl bond/show], [0], [dnl
 bond_mode: active-backup
 bond may use recirculation: no, Recirc-ID : -1
 bond-hash-basis: 0
+opt-bond-tcp: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -286,6 +287,7 @@  slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+opt-bond-tcp: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -301,6 +303,7 @@  slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+opt-bond-tcp: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -423,6 +426,7 @@  slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+opt-bond-tcp: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -440,6 +444,7 @@  slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+opt-bond-tcp: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -555,6 +560,7 @@  slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+opt-bond-tcp: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -572,6 +578,7 @@  slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+opt-bond-tcp: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -692,6 +699,7 @@  slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+opt-bond-tcp: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -709,6 +717,7 @@  slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+opt-bond-tcp: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
index 2976771..6199a7b 100644
--- a/vswitchd/bridge.c
+++ b/vswitchd/bridge.c
@@ -4300,6 +4300,10 @@  port_configure_bond(struct port *port, struct bond_settings *s)
         /* OVSDB did not store the last active interface */
         s->active_slave_mac = eth_addr_zero;
     }
+    if (s->balance == BM_TCP) {
+        s->use_bond_cache = smap_get_bool(&port->cfg->other_config,
+                                          "opt-bond-tcp", false);
+    }
 }
 
 /* Returns true if 'port' is synthetic, that is, if we constructed it locally
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 027aee2..123f694 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -1963,6 +1963,16 @@ 
         <code>active-backup</code>.
       </column>
 
+      <column name="other_config" key="opt-bond-tcp"
+              type='{"type": "boolean"}'>
+        Enables or disables use of the RSS hash from the ingress port to
+        balance flows across the output slaves of a bond in
+        <code>balance-tcp</code> mode. When enabled, the datapath selects
+        the slave directly from the RSS hash and avoids recirculation.
+        Only new flows are affected; existing flows remain unchanged.
+        Defaults to false. This option has no effect in other bond modes.
+      </column>
+
       <group title="Link Failure Detection">
         <p>
           An important part of link bonding is detecting that links are down so