diff mbox series

[ovs-dev,v11,1/1] Avoid dp_hash recirculation for balance-tcp bond selection mode

Message ID 1581420483-24451-2-git-send-email-vishal.deep.ajmera@ericsson.com
State New
Headers show
Series Balance-tcp bond mode optimization | expand

Commit Message

姜立东 via dev Feb. 11, 2020, 11:28 a.m. UTC
Problem:
--------
In OVS-DPDK, flows with output over a bond interface of type “balance-tcp”
(using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into
"HASH" and "RECIRC" datapath actions. After recirculation, the packet is
forwarded to the bond member port based on 8-bits of the datapath hash
value computed through dp_hash. This causes performance degradation in the
following ways:

1. The recirculation of the packet implies another lookup of the packet’s
flow key in the exact match cache (EMC) and potentially Megaflow classifier
(DPCLS). This is the biggest cost factor.

2. The recirculated packets have a new “RSS” hash and compete with the
original packets for the scarce number of EMC slots. This implies more
EMC misses and potentially EMC thrashing causing costly DPCLS lookups.

3. The 256 extra megaflow entries per bond for dp_hash bond selection put
additional load on the revalidation threads.

Owing to this performance degradation, deployments stick to “balance-slb”
bond mode even though it does not do active-active load balancing for
VXLAN- and GRE-tunnelled traffic because all tunnel packet have the same
source MAC address.

Proposed optimization:
----------------------
This proposal introduces a new load-balancing output action instead of
recirculation.

Maintain one table per-bond (could just be an array of uint16's) and
program it the same way internal flows are created today for each possible
hash value(256 entries) from ofproto layer. Use this table to load-balance
flows as part of output action processing.

Currently xlate_normal() -> output_normal() -> bond_update_post_recirc_rules()
-> bond_may_recirc() and compose_output_action__() generate
“dp_hash(hash_l4(0))” and “recirc(<RecircID>)” actions. In this case the
RecircID identifies the bond. For the recirculated packets the ofproto layer
installs megaflow entries that match on RecircID and masked dp_hash and send
them to the corresponding output port.

Instead, we will now generate action as
    "lb_output(bond,<bond id>)"

This combines hash computation (only if needed, else re-use RSS hash) and
inline load-balancing over the bond. This action is used *only* for balance-tcp
bonds in OVS-DPDK datapath (the OVS kernel datapath remains unchanged).

Example:
--------
Current scheme:
---------------
With 1 IP-UDP flow:

flow-dump from pmd on cpu core: 2
recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2828969, bytes:181054016, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)

recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:2828937, bytes:181051968, used:0.000s, actions:2

With 8 IP-UDP flows (with random UDP src port): (all hitting same DPCL):

flow-dump from pmd on cpu core: 2
recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)

recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:377395, bytes:24153280, used:0.000s, actions:2
recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:333486, bytes:21343104, used:0.000s, actions:1
recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:348461, bytes:22301504, used:0.000s, actions:1
recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:633353, bytes:40534592, used:0.000s, actions:2
recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:319901, bytes:20473664, used:0.001s, actions:2
recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:334985, bytes:21439040, used:0.001s, actions:1
recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:326404, bytes:20889856, used:0.001s, actions:1

New scheme:
-----------
We can do with a single flow entry (for any number of new flows):

in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:lb_output(bond,1)

A new CLI has been added to dump datapath bond cache as given below.

“sudo ovs-appctl dpif-netdev/dp-bond-show [dp]”

root@ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/dp-bond-show
Bond cache:
        bond-id 1 :
                bucket 0 - slave 2
                bucket 1 - slave 1
                bucket 2 - slave 2
                bucket 3 - slave 1

Performance improvement:
------------------------
With a prototype of the proposed idea, the following perf improvement is seen
with Phy-VM-Phy UDP traffic, single flow. With multiple flows, the improvement
is even more enhanced (due to reduced number of flows).

1 VM:
*****
+--------------------------------------+
|                 mpps                 |
+--------------------------------------+
| Flows  master  with-opt.   %delta    |
+--------------------------------------+
| 1      4.34    5.59        28.86
| 10     4.09    5.43        32.61
| 400    3.44    5.35        55.25
| 1k     3.31    5.25        58.41
| 10k    2.46    4.43        79.78
| 100k   2.32    4.27        83.59
| 500k   2.29    4.23        84.57
+--------------------------------------+
mpps: million packets per second.
packet size 64 bytes.

Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>

CC: Jan Scheurich <jan.scheurich@ericsson.com>
CC: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
CC: Ben Pfaff <blp@ovn.org>
CC: Ilya Maximets <i.maximets@ovn.org>
CC: Ian Stokes <ian.stokes@intel.com>
CC: David Marchand <david.marchand@redhat.com>
CC: Matteo Croce <mcroce@redhat.com>
---
 datapath/linux/compat/include/linux/openvswitch.h |   1 +
 lib/dpif-netdev.c                                 | 399 ++++++++++++++++++++--
 lib/dpif-netlink.c                                |   3 +
 lib/dpif-provider.h                               |   9 +
 lib/dpif.c                                        |  49 +++
 lib/dpif.h                                        |   7 +
 lib/odp-execute.c                                 |   2 +
 lib/odp-util.c                                    |   4 +
 ofproto/bond.c                                    |  58 +++-
 ofproto/bond.h                                    |   9 +
 ofproto/ofproto-dpif-ipfix.c                      |   1 +
 ofproto/ofproto-dpif-sflow.c                      |   1 +
 ofproto/ofproto-dpif-xlate.c                      |  31 +-
 ofproto/ofproto-dpif.c                            |  24 ++
 ofproto/ofproto-dpif.h                            |   9 +-
 tests/lacp.at                                     |   9 +
 vswitchd/bridge.c                                 |   5 +
 vswitchd/vswitch.xml                              |  23 ++
 18 files changed, 587 insertions(+), 57 deletions(-)

Comments

Eelco Chaudron Feb. 14, 2020, 3:27 p.m. UTC | #1
Hi Vishal,

Thanks for sending out yet another version! I have some final comments 
below.

Cheers,


Eelco


On 11 Feb 2020, at 12:28, Vishal Deep Ajmera via dev wrote:

EC> Even without enabling lb-output-action=true I see the bond cache 
being populated is this the desired effect?
EC> There is no error checking when other_config:lb-output-action=true 
is done on a non supporting backend (i.e. kernel)
EC> There is no documentation of this feature (or any balance-tcp for 
that matters) but we should add it somewhere
EC> In bond_reconfigure() the recirc_free_id() call is called if bond 
mode changes, but this is not cleaning up the datapath

> Problem:
> --------
> In OVS-DPDK, flows with output over a bond interface of type 
> “balance-tcp”
> (using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer 
> into
> "HASH" and "RECIRC" datapath actions. After recirculation, the packet 
> is
> forwarded to the bond member port based on 8-bits of the datapath hash
> value computed through dp_hash. This causes performance degradation in 
> the
> following ways:
>
> 1. The recirculation of the packet implies another lookup of the 
> packet’s
> flow key in the exact match cache (EMC) and potentially Megaflow 
> classifier
> (DPCLS). This is the biggest cost factor.
>
> 2. The recirculated packets have a new “RSS” hash and compete with 
> the
> original packets for the scarce number of EMC slots. This implies more
> EMC misses and potentially EMC thrashing causing costly DPCLS lookups.
>
> 3. The 256 extra megaflow entries per bond for dp_hash bond selection 
> put
> additional load on the revalidation threads.
>
> Owing to this performance degradation, deployments stick to 
> “balance-slb”
> bond mode even though it does not do active-active load balancing for
> VXLAN- and GRE-tunnelled traffic because all tunnel packet have the 
> same
> source MAC address.
>
> Proposed optimization:
> ----------------------
> This proposal introduces a new load-balancing output action instead of
> recirculation.
>
> Maintain one table per-bond (could just be an array of uint16's) and
> program it the same way internal flows are created today for each 
> possible
> hash value(256 entries) from ofproto layer. Use this table to 
> load-balance
> flows as part of output action processing.
>
> Currently xlate_normal() -> output_normal() -> 
> bond_update_post_recirc_rules()
> -> bond_may_recirc() and compose_output_action__() generate
> “dp_hash(hash_l4(0))” and “recirc(<RecircID>)” actions. In 
> this case the
> RecircID identifies the bond. For the recirculated packets the ofproto 
> layer
> installs megaflow entries that match on RecircID and masked dp_hash 
> and send
> them to the corresponding output port.
>
> Instead, we will now generate action as
>     "lb_output(bond,<bond id>)"
>
> This combines hash computation (only if needed, else re-use RSS hash) 
> and
> inline load-balancing over the bond. This action is used *only* for 
> balance-tcp
> bonds in OVS-DPDK datapath (the OVS kernel datapath remains 
> unchanged).
>
> Example:
> --------
> Current scheme:
> ---------------
> With 1 IP-UDP flow:
>
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), 
> packets:2828969, bytes:181054016, used:0.000s, 
> actions:hash(hash_l4(0)),recirc(0x1)
>
> recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:2828937, bytes:181051968, used:0.000s, actions:2
>
> With 8 IP-UDP flows (with random UDP src port): (all hitting same 
> DPCL):
>
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), 
> packets:2674009, bytes:171136576, used:0.000s, 
> actions:hash(hash_l4(0)),recirc(0x1)
>
> recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:377395, bytes:24153280, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:333486, bytes:21343104, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:348461, bytes:22301504, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:633353, bytes:40534592, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:319901, bytes:20473664, used:0.001s, actions:2
> recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:334985, bytes:21439040, used:0.001s, actions:1
> recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:326404, bytes:20889856, used:0.001s, actions:1
>
> New scheme:
> -----------
> We can do with a single flow entry (for any number of new flows):
>
> in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), 
> packets:2674009, bytes:171136576, used:0.000s, 
> actions:lb_output(bond,1)
>
> A new CLI has been added to dump datapath bond cache as given below.
>
> “sudo ovs-appctl dpif-netdev/dp-bond-show [dp]”
>
> root@ubuntu-190:performance_scripts # sudo ovs-appctl 
> dpif-netdev/dp-bond-show
> Bond cache:
>         bond-id 1 :
>                 bucket 0 - slave 2
>                 bucket 1 - slave 1
>                 bucket 2 - slave 2
>                 bucket 3 - slave 1
>
> Performance improvement:
> ------------------------
> With a prototype of the proposed idea, the following perf improvement 
> is seen
> with Phy-VM-Phy UDP traffic, single flow. With multiple flows, the 
> improvement
> is even more enhanced (due to reduced number of flows).
>
> 1 VM:
> *****
> +--------------------------------------+
> |                 mpps                 |
> +--------------------------------------+
> | Flows  master  with-opt.   %delta    |
> +--------------------------------------+
> | 1      4.34    5.59        28.86
> | 10     4.09    5.43        32.61
> | 400    3.44    5.35        55.25
> | 1k     3.31    5.25        58.41
> | 10k    2.46    4.43        79.78
> | 100k   2.32    4.27        83.59
> | 500k   2.29    4.23        84.57
> +--------------------------------------+
> mpps: million packets per second.
> packet size 64 bytes.
>
> Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
> Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
>
> CC: Jan Scheurich <jan.scheurich@ericsson.com>
> CC: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> CC: Ben Pfaff <blp@ovn.org>
> CC: Ilya Maximets <i.maximets@ovn.org>
> CC: Ian Stokes <ian.stokes@intel.com>
> CC: David Marchand <david.marchand@redhat.com>
> CC: Matteo Croce <mcroce@redhat.com>
> ---
>  datapath/linux/compat/include/linux/openvswitch.h |   1 +
>  lib/dpif-netdev.c                                 | 399 
> ++++++++++++++++++++--
>  lib/dpif-netlink.c                                |   3 +
>  lib/dpif-provider.h                               |   9 +
>  lib/dpif.c                                        |  49 +++
>  lib/dpif.h                                        |   7 +
>  lib/odp-execute.c                                 |   2 +
>  lib/odp-util.c                                    |   4 +
>  ofproto/bond.c                                    |  58 +++-
>  ofproto/bond.h                                    |   9 +
>  ofproto/ofproto-dpif-ipfix.c                      |   1 +
>  ofproto/ofproto-dpif-sflow.c                      |   1 +
>  ofproto/ofproto-dpif-xlate.c                      |  31 +-
>  ofproto/ofproto-dpif.c                            |  24 ++
>  ofproto/ofproto-dpif.h                            |   9 +-
>  tests/lacp.at                                     |   9 +
>  vswitchd/bridge.c                                 |   5 +
>  vswitchd/vswitch.xml                              |  23 ++
>  18 files changed, 587 insertions(+), 57 deletions(-)
>
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h 
> b/datapath/linux/compat/include/linux/openvswitch.h
> index 2f0c655..7604a2a 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -1021,6 +1021,7 @@ enum ovs_action_attr {
>  	OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
>  	OVS_ACTION_ATTR_TUNNEL_POP,    /* u32 port number. */
>  	OVS_ACTION_ATTR_DROP,          /* u32 xlate_error. */
> +	OVS_ACTION_ATTR_LB_OUTPUT,     /* u32 bond-id */
>  #endif
>  	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
>  				       * from userspace. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index d393aab..f550192 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -79,6 +79,7 @@
>  #include "unixctl.h"
>  #include "util.h"
>  #include "uuid.h"
> +#include "ofproto/bond.h"

EC> We should not include ofproto/bond.h, should be the other way 
around. Guess this is for BOND_BUCKETS? We need a way for ofproto to 
take this from a /lib include.

>
>  VLOG_DEFINE_THIS_MODULE(dpif_netdev);
>
> @@ -376,6 +377,12 @@ struct dp_netdev {
>
>      struct conntrack *conntrack;
>      struct pmd_auto_lb pmd_alb;
> +
> +    /* Bonds.
> +     *
> +     * Any lookup into 'bonds' requires taking 'bond_mutex'. */
> +    struct ovs_mutex bond_mutex;
> +    struct cmap tx_bonds; /* Contains "struct tx_bond"s. */
>  };
>
>  static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> @@ -607,6 +614,20 @@ struct tx_port {
>      struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>  };
>
> +/* Contained by struct tx_bond 'slave_buckets' */
> +struct slave_entry {
> +    odp_port_t slave_id;
> +    atomic_ullong n_packets;
> +    atomic_ullong n_bytes;
> +};
> +
> +/* Contained by struct dp_netdev_pmd_thread's 'tx_bonds'. */
> +struct tx_bond {
> +    struct cmap_node node;
> +    uint32_t bond_id;
> +    struct slave_entry slave_buckets[BOND_BUCKETS];
> +};
> +
>  /* A set of properties for the current processing loop that is not 
> directly
>   * associated with the pmd thread itself, but with the packets being
>   * processed or the short-term system configuration (for example, 
> time).
> @@ -739,6 +760,11 @@ struct dp_netdev_pmd_thread {
>       * read by the pmd thread. */
>      struct hmap tx_ports OVS_GUARDED;
>
> +    struct ovs_mutex bond_mutex;    /* Mutex for 'tx_bonds'. */
> +    /* Map of 'tx_bond's used for transmission.  Written by the main 
> thread,
> +     * read/written by the pmd thread. */
> +    struct cmap tx_bonds OVS_GUARDED;
> +
>      /* These are thread-local copies of 'tx_ports'.  One contains 
> only tunnel
>       * ports (that support push_tunnel/pop_tunnel), the other 
> contains ports
>       * with at least one txq (that support send).  A port can be in 
> both.
> @@ -830,6 +856,12 @@ static void dp_netdev_del_rxq_from_pmd(struct 
> dp_netdev_pmd_thread *pmd,
>  static int
>  dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
>                                     bool force);
> +static void dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread 
> *pmd,
> +                                         struct tx_bond *bond)
> +    OVS_REQUIRES(pmd->bond_mutex);
> +static void dp_netdev_del_bond_tx_from_pmd(struct 
> dp_netdev_pmd_thread *pmd,
> +                                           struct tx_bond *tx)
> +    OVS_REQUIRES(pmd->bond_mutex);
>
>  static void reconfigure_datapath(struct dp_netdev *dp)
>      OVS_REQUIRES(dp->port_mutex);
> @@ -1396,6 +1428,45 @@ pmd_perf_show_cmd(struct unixctl_conn *conn, 
> int argc,
>      par.command_type = PMD_INFO_PERF_SHOW;
>      dpif_netdev_pmd_info(conn, argc, argv, &par);
>  }
> +
> +static void
> +dpif_netdev_dp_bond_show(struct unixctl_conn *conn, int argc,
> +                         const char *argv[], void *aux OVS_UNUSED)
> +{
> +    struct ds reply = DS_EMPTY_INITIALIZER;
> +
> +    ovs_mutex_lock(&dp_netdev_mutex);
> +
> +    struct dp_netdev *dp = NULL;
> +    if (argc == 2) {
> +        dp = shash_find_data(&dp_netdevs, argv[1]);
> +    } else if (shash_count(&dp_netdevs) == 1) {
> +        /* There's only one datapath */
> +        dp = shash_first(&dp_netdevs)->data;
> +    }
> +    if (!dp) {
> +        ovs_mutex_unlock(&dp_netdev_mutex);
> +        unixctl_command_reply_error(conn,
> +                                    "please specify an existing 
> datapath");
> +        return;
> +    }

EC> No need to hold the dp->bond_mutex?

EC> Might be my OCD kicking in but what about:
EC> if (cmap_count(&dp->tx_bonds) > 0) {
  > +    ds_put_cstr(&reply, "\nBonds:\n");
EC> }
> +
> +    struct tx_bond *dp_bond_entry;
> +    CMAP_FOR_EACH (dp_bond_entry, node, &dp->tx_bonds) {
> +        ds_put_format(&reply, "\tbond-id %u :\n",

EC> Remove the space before :, i.e. ""\tbond-id %u:\n"

> +                      dp_bond_entry->bond_id);
> +        for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +            ds_put_format(&reply, "\t\tbucket %u - slave %u \n",
> +                      bucket,
> +                      dp_bond_entry->slave_buckets[bucket].slave_id);
> +        }
> +    }
> +    ovs_mutex_unlock(&dp_netdev_mutex);
> +    unixctl_command_reply(conn, ds_cstr(&reply));
> +    ds_destroy(&reply);
> +}
> +
>  
>  static int
>  dpif_netdev_init(void)
> @@ -1427,6 +1498,9 @@ dpif_netdev_init(void)
>                               "[-us usec] [-q qlen]",
>                               0, 10, pmd_perf_log_set_cmd,
>                               NULL);
> +    unixctl_command_register("dpif-netdev/dp-bond-show", "[dp]",
> +                             0, 1, dpif_netdev_dp_bond_show,
> +                             NULL);
>      return 0;
>  }
>
> @@ -1551,6 +1625,9 @@ create_dp_netdev(const char *name, const struct 
> dpif_class *class,
>      ovs_mutex_init_recursive(&dp->port_mutex);
>      hmap_init(&dp->ports);
>      dp->port_seq = seq_create();
> +    ovs_mutex_init(&dp->bond_mutex);
> +    cmap_init(&dp->tx_bonds);
> +
>      fat_rwlock_init(&dp->upcall_rwlock);
>
>      dp->reconfigure_seq = seq_create();
> @@ -1657,6 +1734,12 @@ dp_delete_meter(struct dp_netdev *dp, uint32_t 
> meter_id)
>      }
>  }
>
> +static uint32_t
> +hash_bond_id(uint32_t bond_id)
> +{
> +    return hash_int(bond_id, 0);
> +}

EC> Does this ensure we get no duplicate hash values, don’t think so? 
We need to handle this in tx_bond_lookup()

> +
>  /* Requires dp_netdev_mutex so that we can't get a new reference to 
> 'dp'
>   * through the 'dp_netdevs' shash while freeing 'dp'. */
>  static void
> @@ -1664,6 +1747,7 @@ dp_netdev_free(struct dp_netdev *dp)
>      OVS_REQUIRES(dp_netdev_mutex)
>  {
>      struct dp_netdev_port *port, *next;
> +    struct tx_bond *bond;
>
>      shash_find_and_delete(&dp_netdevs, dp->name);
>
> @@ -1673,6 +1757,13 @@ dp_netdev_free(struct dp_netdev *dp)
>      }
>      ovs_mutex_unlock(&dp->port_mutex);
>
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
> +        cmap_remove(&dp->tx_bonds, &bond->node, 
> hash_bond_id(bond->bond_id));
> +        ovsrcu_postpone(free, bond);
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
>      dp_netdev_destroy_all_pmds(dp, true);
>      cmap_destroy(&dp->poll_threads);
>
> @@ -1691,6 +1782,9 @@ dp_netdev_free(struct dp_netdev *dp)
>      hmap_destroy(&dp->ports);
>      ovs_mutex_destroy(&dp->port_mutex);
>
> +    cmap_destroy(&dp->tx_bonds);
> +    ovs_mutex_destroy(&dp->bond_mutex);
> +
>      /* Upcalls must be disabled at this point */
>      dp_netdev_destroy_upcall_lock(dp);
>
> @@ -4421,6 +4515,20 @@ tx_port_lookup(const struct hmap *hmap, 
> odp_port_t port_no)
>      return NULL;
>  }
>
> +static struct tx_bond *
> +tx_bond_lookup(const struct cmap *tx_bonds, uint32_t bond_id)
> +{
> +    struct tx_bond *tx;
> +    const struct cmap_node *pnode;
> +
> +    pnode = cmap_find(tx_bonds, hash_bond_id(bond_id));
> +    if (!pnode) {
> +        return NULL;
> +    }

EC> If duplicate hash_bond_id's can exist we need to use 
CMAP_FOR_EACH_WITH_HASH(), see tx_port_lookup() how to implement this.

> +    tx = CONTAINER_OF(pnode, struct tx_bond, node);
> +    return tx;
> +}
> +
>  static int
>  port_reconfigure(struct dp_netdev_port *port)
>  {
> @@ -5072,6 +5180,21 @@ reconfigure_datapath(struct dp_netdev *dp)
>          ovs_mutex_unlock(&pmd->port_mutex);
>      }
>
> +    /* Add every bond to the cmap of each pmd thread, if it's not
> +     * there already and if this pmd has at least one rxq to poll. */
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        ovs_mutex_lock(&pmd->bond_mutex);
> +        if (hmap_count(&pmd->poll_list) || pmd->core_id == 
> NON_PMD_CORE_ID) {
> +            struct tx_bond *bond;
> +            CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
> +                dp_netdev_add_bond_tx_to_pmd(pmd, bond);
> +            }
> +        }

EC> Should we remove the cmaps if this is not the case, i.e. else 
dp_netdev_del_bond_tx_from_pmd()

> +        ovs_mutex_unlock(&pmd->bond_mutex);
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
>      /* Reload affected pmd threads. */
>      reload_affected_pmds(dp);
>
> @@ -6110,6 +6233,7 @@ dp_netdev_configure_pmd(struct 
> dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      atomic_init(&pmd->reload, false);
>      ovs_mutex_init(&pmd->flow_mutex);
>      ovs_mutex_init(&pmd->port_mutex);
> +    ovs_mutex_init(&pmd->bond_mutex);
>      cmap_init(&pmd->flow_table);
>      cmap_init(&pmd->classifiers);
>      pmd->ctx.last_rxq = NULL;
> @@ -6120,6 +6244,7 @@ dp_netdev_configure_pmd(struct 
> dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      hmap_init(&pmd->tx_ports);
>      hmap_init(&pmd->tnl_port_cache);
>      hmap_init(&pmd->send_port_cache);
> +    cmap_init(&pmd->tx_bonds);
>      /* init the 'flow_cache' since there is no
>       * actual thread created for NON_PMD_CORE_ID. */
>      if (core_id == NON_PMD_CORE_ID) {
> @@ -6135,11 +6260,17 @@ static void
>  dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>  {
>      struct dpcls *cls;
> +    struct tx_bond *tx;
>
>      dp_netdev_pmd_flow_flush(pmd);
>      hmap_destroy(&pmd->send_port_cache);
>      hmap_destroy(&pmd->tnl_port_cache);
>      hmap_destroy(&pmd->tx_ports);
> +    CMAP_FOR_EACH (tx, node, &pmd->tx_bonds) {
> +        cmap_remove(&pmd->tx_bonds, &tx->node, 
> hash_bond_id(tx->bond_id));
> +        ovsrcu_postpone(free, tx);
> +    }
> +    cmap_destroy(&pmd->tx_bonds);
>      hmap_destroy(&pmd->poll_list);
>      /* All flows (including their dpcls_rules) have been deleted 
> already */
>      CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
> @@ -6151,6 +6282,7 @@ dp_netdev_destroy_pmd(struct 
> dp_netdev_pmd_thread *pmd)
>      ovs_mutex_destroy(&pmd->flow_mutex);
>      seq_destroy(pmd->reload_seq);
>      ovs_mutex_destroy(&pmd->port_mutex);
> +    ovs_mutex_destroy(&pmd->bond_mutex);
>      free(pmd);
>  }
>
> @@ -6304,6 +6436,41 @@ dp_netdev_del_port_tx_from_pmd(struct 
> dp_netdev_pmd_thread *pmd,
>      free(tx);
>      pmd->need_reload = true;
>  }
> +
> +static void
> +dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
> +                             struct tx_bond *bond)
> +    OVS_REQUIRES(pmd->bond_mutex)
> +{
> +    struct tx_bond *tx = tx_bond_lookup(&pmd->tx_bonds, 
> bond->bond_id);
> +    if (tx) {
> +        struct tx_bond *new_tx = xmemdup(bond, sizeof *tx);
> +        /* Copy the stats for each bucket. */
> +        for (int i = 0;i < BOND_BUCKETS; i++) {

EC> Space after 0, i.e. (int i = 0; i <...)

> +            uint64_t n_packets, n_bytes;
> +            atomic_read_relaxed(&tx->slave_buckets[i].n_packets, 
> &n_packets);
> +            atomic_read_relaxed(&tx->slave_buckets[i].n_bytes, 
> &n_bytes);
> +            new_tx->slave_buckets[i].n_packets = n_packets;
> +            new_tx->slave_buckets[i].n_bytes = n_bytes;

EC> But if we read them relaxed we should also write them relaxes, i.e. 
atomic_store_relaxed() to be consistent.
EC> Are we ok with inaccurate stats due to bond member updates? If so we 
should document it in the (missing) documentation.


> +        }
> +        cmap_replace(&pmd->tx_bonds, &tx->node, &new_tx->node,
> +                     hash_bond_id(bond->bond_id));
> +        ovsrcu_postpone(free, tx);
> +    } else {
> +        tx = xmemdup(bond, sizeof *tx);
> +        cmap_insert(&pmd->tx_bonds, &tx->node, 
> hash_bond_id(bond->bond_id));
> +    }
> +}
> +
> +/* Del 'tx' from the tx bond cache of 'pmd' */
> +static void
> +dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
> +                               struct tx_bond *tx)
> +    OVS_REQUIRES(pmd->bond_mutex)
> +{
> +    cmap_remove(&pmd->tx_bonds, &tx->node, 
> hash_bond_id(tx->bond_id));
> +    ovsrcu_postpone(free, tx);
> +}
>  
>  static char *
>  dpif_netdev_get_datapath_version(void)
> @@ -7129,6 +7296,94 @@ dp_execute_userspace_action(struct 
> dp_netdev_pmd_thread *pmd,
>      }
>  }
>
> +static bool
> +dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
> +                         struct dp_packet_batch *packets_,
> +                         bool should_steal, odp_port_t port_no)
> +{
> +    struct tx_port *p = pmd_send_port_cache_lookup(pmd, port_no);
> +    if (!OVS_LIKELY(p)) {
> +        return false;
> +    }
> +
> +    struct dp_packet_batch out;
> +    if (!should_steal) {
> +        dp_packet_batch_clone(&out, packets_);
> +        dp_packet_batch_reset_cutlen(packets_);
> +        packets_ = &out;
> +    }
> +    dp_packet_batch_apply_cutlen(packets_);
> +#ifdef DPDK_NETDEV
> +    if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> +                     && packets_->packets[0]->source
> +                     != p->output_pkts.packets[0]->source)) {
> +        /* netdev-dpdk assumes that all packets in a single
> +         * output batch has the same source. Flush here to
> +         * avoid memory access issues. */
> +        dp_netdev_pmd_flush_output_on_port(pmd, p);
> +    }
> +#endif
> +    if (dp_packet_batch_size(&p->output_pkts)
> +        + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
> +        /* Flush here to avoid overflow. */
> +        dp_netdev_pmd_flush_output_on_port(pmd, p);
> +    }
> +    if (dp_packet_batch_is_empty(&p->output_pkts)) {
> +        pmd->n_output_batches++;
> +    }
> +
> +    struct dp_packet *packet;
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> +        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> +            pmd->ctx.last_rxq;
> +        dp_packet_batch_add(&p->output_pkts, packet);
> +    }
> +    return true;
> +}
> +
> +static bool
> +dp_execute_lb_output_action(struct dp_netdev_pmd_thread *pmd,
> +                            struct dp_packet_batch *packets_,
> +                            bool should_steal, uint32_t bond)
> +{
> +    struct tx_bond *p_bond = tx_bond_lookup(&pmd->tx_bonds, bond);
> +    if (!p_bond) {
> +        return false;
> +    }
> +
> +    struct dp_packet_batch del_pkts;
> +    dp_packet_batch_init(&del_pkts);
> +
> +    struct dp_packet *packet;
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> +        /*
> +         * Lookup the bond-hash table using hash to get the slave.
> +         */
> +        uint32_t hash = dp_packet_get_rss_hash(packet);
> +        struct slave_entry *s_entry = &p_bond->slave_buckets[hash & 
> BOND_MASK];
> +        odp_port_t bond_member = s_entry->slave_id;
> +        uint32_t size = dp_packet_size(packet);
> +
> +        struct dp_packet_batch output_pkt;
> +        dp_packet_batch_init_packet(&output_pkt, packet);
> +        if (OVS_LIKELY(dp_execute_output_action(pmd, &output_pkt, 
> should_steal,
> +                                                bond_member))) {
> +            /* Update slave stats. */
> +            non_atomic_ullong_add(&s_entry->n_packets, 1);
> +            non_atomic_ullong_add(&s_entry->n_bytes, size);
> +        } else {
> +            dp_packet_batch_add(&del_pkts, packet);
> +        }
> +    }
> +
> +    /* Delete packets that failed OUTPUT action */
> +    COVERAGE_ADD(datapath_drop_invalid_port,
> +                 dp_packet_batch_size(&del_pkts));
> +    dp_packet_delete_batch(&del_pkts, should_steal);
> +
> +    return true;
> +}
> +
>  static void
>  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>                const struct nlattr *a, bool should_steal)
> @@ -7144,43 +7399,18 @@ dp_execute_cb(void *aux_, struct 
> dp_packet_batch *packets_,
>
>      switch ((enum ovs_action_attr)type) {
>      case OVS_ACTION_ATTR_OUTPUT:
> -        p = pmd_send_port_cache_lookup(pmd, nl_attr_get_odp_port(a));
> -        if (OVS_LIKELY(p)) {
> -            struct dp_packet *packet;
> -            struct dp_packet_batch out;
> -
> -            if (!should_steal) {
> -                dp_packet_batch_clone(&out, packets_);
> -                dp_packet_batch_reset_cutlen(packets_);
> -                packets_ = &out;
> -            }
> -            dp_packet_batch_apply_cutlen(packets_);
> -
> -#ifdef DPDK_NETDEV
> -            if 
> (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> -                             && packets_->packets[0]->source
> -                                != 
> p->output_pkts.packets[0]->source)) {
> -                /* XXX: netdev-dpdk assumes that all packets in a 
> single
> -                 *      output batch has the same source. Flush here 
> to
> -                 *      avoid memory access issues. */
> -                dp_netdev_pmd_flush_output_on_port(pmd, p);
> -            }
> -#endif
> -            if (dp_packet_batch_size(&p->output_pkts)
> -                + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) 
> {
> -                /* Flush here to avoid overflow. */
> -                dp_netdev_pmd_flush_output_on_port(pmd, p);
> -            }
> -
> -            if (dp_packet_batch_is_empty(&p->output_pkts)) {
> -                pmd->n_output_batches++;
> -            }
> +        if (dp_execute_output_action(pmd, packets_, should_steal,
> +                                     nl_attr_get_odp_port(a))) {
> +            return;
> +        } else {
> +            COVERAGE_ADD(datapath_drop_invalid_port,
> +                         dp_packet_batch_size(packets_));
> +        }
> +        break;
>
> -            DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> -                
> p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> -                                                             
> pmd->ctx.last_rxq;
> -                dp_packet_batch_add(&p->output_pkts, packet);
> -            }
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
> +        if (dp_execute_lb_output_action(pmd, packets_, should_steal,
> +                                        nl_attr_get_u32(a))) {
>              return;
>          } else {
>              COVERAGE_ADD(datapath_drop_invalid_port,
> @@ -7738,6 +7968,100 @@ dpif_netdev_ipf_dump_done(struct dpif *dpif 
> OVS_UNUSED, void *ipf_dump_ctx)
>
>  }
>
> +static int
> +dpif_netdev_bond_add(struct dpif *dpif, uint32_t bond_id,
> +                     odp_port_t slave_map[])
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +
> +    /* Prepare new bond mapping. */
> +    struct tx_bond *new_tx = xzalloc(sizeof(struct tx_bond));
> +
> +    new_tx->bond_id = bond_id;
> +    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +        new_tx->slave_buckets[bucket].slave_id = slave_map[bucket];
> +    }
> +
> +    /* Check if bond already existed */
> +    struct tx_bond *old_tx = tx_bond_lookup(&dp->tx_bonds, bond_id);

EC> Does the above not need to be done while holding dp->bond_mutex 
below?

> +    ovs_mutex_lock(&dp->bond_mutex);
> +    if (old_tx) {
> +        cmap_replace(&dp->tx_bonds, &old_tx->node, &new_tx->node,
> +                     hash_bond_id(bond_id));
> +        ovsrcu_postpone(free, old_tx);
> +    } else {
> +        cmap_insert(&dp->tx_bonds, &new_tx->node,
> +                    hash_bond_id(bond_id));
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
> +    /* Update all PMDs with new bond mapping. */
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        ovs_mutex_lock(&pmd->bond_mutex);
> +        dp_netdev_add_bond_tx_to_pmd(pmd, new_tx);
> +        ovs_mutex_unlock(&pmd->bond_mutex);
> +    }
> +    return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +
> +    /* Check if bond existed. */
> +    struct tx_bond *tx = tx_bond_lookup(&dp->tx_bonds, bond_id);

EC> Mutex?

> +    if (tx) {
> +        ovs_mutex_lock(&dp->bond_mutex);
> +        cmap_remove(&dp->tx_bonds, &tx->node, hash_bond_id(bond_id));
> +        ovsrcu_postpone(free, tx);
> +        ovs_mutex_unlock(&dp->bond_mutex);
> +    } else {
> +        /* Bond is not present. */
> +        return 0;
> +    }
> +
> +    /* Remove the bond map in all pmds. */
> +    struct dp_netdev_pmd_thread *pmd;
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        tx = tx_bond_lookup(&pmd->tx_bonds, bond_id);

EC> Should below mutex be moved above the lookup, outside the if 
statement?

> +        if (tx) {
> +            ovs_mutex_lock(&pmd->bond_mutex);
> +            dp_netdev_del_bond_tx_from_pmd(pmd, tx);
> +            ovs_mutex_unlock(&pmd->bond_mutex);
> +        }
> +    }
> +    return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                           uint64_t *n_bytes)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +
> +    /* Search the bond in all PMDs */
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        struct tx_bond *pmd_bond_entry
> +            = tx_bond_lookup(&pmd->tx_bonds, bond_id);

EC> Mutex?

> +        ovs_mutex_lock(&pmd->bond_mutex);
> +        if (pmd_bond_entry) {
> +             /* Read bond stats. */
> +            for (int i = 0; i < BOND_BUCKETS; i++) {
> +                uint64_t pmd_n_bytes;
> +                atomic_read_relaxed(
> +                     &pmd_bond_entry->slave_buckets[i].n_bytes,
> +                     &pmd_n_bytes);
> +                n_bytes[i] += pmd_n_bytes;
> +            }
> +        }
> +        ovs_mutex_unlock(&pmd->bond_mutex);
> +    }
> +    return 0;
> +}
> +
>  const struct dpif_class dpif_netdev_class = {
>      "netdev",
>      true,                       /* cleanup_required */
> @@ -7811,6 +8135,9 @@ const struct dpif_class dpif_netdev_class = {
>      dpif_netdev_meter_set,
>      dpif_netdev_meter_get,
>      dpif_netdev_meter_del,
> +    dpif_netdev_bond_add,
> +    dpif_netdev_bond_del,
> +    dpif_netdev_bond_stats_get,
>  };
>
>  static void
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 5b5c96d..116f8fc 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -3993,6 +3993,9 @@ const struct dpif_class dpif_netlink_class = {
>      dpif_netlink_meter_set,
>      dpif_netlink_meter_get,
>      dpif_netlink_meter_del,
> +    NULL,                       /* bond_add */
> +    NULL,                       /* bond_del */
> +    NULL,                       /* bond_stats_get */
>  };
>
>  static int
> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
> index b77317b..4415836 100644
> --- a/lib/dpif-provider.h
> +++ b/lib/dpif-provider.h
> @@ -616,6 +616,15 @@ struct dpif_class {
>       * zero. */
>      int (*meter_del)(struct dpif *, ofproto_meter_id meter_id,
>                       struct ofputil_meter_stats *, uint16_t n_bands);
> +
> +    /* Adds a bond with 'bond_id' and the slave-map to 'dpif'. */
> +    int (*bond_add)(struct dpif *dpif, uint32_t bond_id,
> +                    odp_port_t slave_map[]);

EC> Can we use a pointer instead of [] to be inline with the other 
definitions
EC> Also a new line after each function definition, like done above

> +    /* Removes bond identified by 'bond_id' from 'dpif'. */
> +    int (*bond_del)(struct dpif *dpif, uint32_t bond_id);

> +    /* Reads bond stats from 'dpif'. */
> +    int (*bond_stats_get)(struct dpif *dpif, uint32_t bond_id,
> +                          uint64_t *n_bytes);
>  };
>
>  extern const struct dpif_class dpif_netlink_class;
> diff --git a/lib/dpif.c b/lib/dpif.c
> index 9d9c716..22168de 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1170,6 +1170,7 @@ dpif_execute_helper_cb(void *aux_, struct 
> dp_packet_batch *packets_,
>
>      case OVS_ACTION_ATTR_CT:
>      case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>      case OVS_ACTION_ATTR_TUNNEL_POP:
>      case OVS_ACTION_ATTR_USERSPACE:
> @@ -1220,6 +1221,7 @@ dpif_execute_helper_cb(void *aux_, struct 
> dp_packet_batch *packets_,
>          struct dp_packet *clone = NULL;
>          uint32_t cutlen = dp_packet_get_cutlen(packet);
>          if (cutlen && (type == OVS_ACTION_ATTR_OUTPUT
> +                        || type == OVS_ACTION_ATTR_LB_OUTPUT
>                          || type == OVS_ACTION_ATTR_TUNNEL_PUSH
>                          || type == OVS_ACTION_ATTR_TUNNEL_POP
>                          || type == OVS_ACTION_ATTR_USERSPACE)) {
> @@ -1879,6 +1881,16 @@ dpif_supports_explicit_drop_action(const struct 
> dpif *dpif)
>      return dpif_is_netdev(dpif);
>  }
>
> +bool
> +dpif_supports_lb_output_action(const struct dpif *dpif)
> +{
> +    /*
> +     * Balance-tcp optimization is currently supported in netdev
> +     * datapath only.
> +     */
> +    return dpif_is_netdev(dpif);
> +}
> +
>  /* Meters */
>  void
>  dpif_meter_get_features(const struct dpif *dpif,
> @@ -1976,3 +1988,40 @@ dpif_meter_del(struct dpif *dpif, 
> ofproto_meter_id meter_id,
>      }
>      return error;
>  }
> +
> +int
> +dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t 
> slave_map[])
> +{
> +    int error = -1;

EC> Maybe error = -ENOTSUP

> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> +        error = dpif->dpif_class->bond_add(dpif, bond_id, slave_map);
> +    }
> +
> +    return error;
> +}
> +
> +int
> +dpif_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> +    int error = -1;

EC> Maybe error = -ENOTSUP

> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> +        error = dpif->dpif_class->bond_del(dpif, bond_id);
> +    }
> +
> +    return error;
> +}
> +
> +int
> +dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                    uint64_t *n_bytes)
> +{
> +    int error = -1;

EC> Maybe error = -ENOTSUP

> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_stats_get) 
> {
> +        error = dpif->dpif_class->bond_stats_get(dpif, bond_id, 
> n_bytes);
> +    }
> +
> +    return error;
> +}
> diff --git a/lib/dpif.h b/lib/dpif.h
> index 4df8f7c..898fa25 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -907,6 +907,13 @@ char *dpif_get_dp_version(const struct dpif *);
>  bool dpif_supports_tnl_push_pop(const struct dpif *);
>  bool dpif_supports_explicit_drop_action(const struct dpif *);
>

EC> Maybe add /* Bonding */ and move it above /* Miscellaneous. */

> +bool dpif_supports_lb_output_action(const struct dpif *);
> +
> +int dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t 
> slave_map[]);
> +int dpif_bond_del(struct dpif *dpif, uint32_t bond_id);
> +int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                        uint64_t *n_bytes);
> +
>  /* Log functions. */
>  struct vlog_module;
>
> diff --git a/lib/odp-execute.c b/lib/odp-execute.c
> index 42d3335..6018e37 100644
> --- a/lib/odp-execute.c
> +++ b/lib/odp-execute.c
> @@ -793,6 +793,7 @@ requires_datapath_assistance(const struct nlattr 
> *a)
>      switch (type) {
>          /* These only make sense in the context of a datapath. */
>      case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>      case OVS_ACTION_ATTR_TUNNEL_POP:
>      case OVS_ACTION_ATTR_USERSPACE:
> @@ -1068,6 +1069,7 @@ odp_execute_actions(void *dp, struct 
> dp_packet_batch *batch, bool steal,
>              return;
>          }
>          case OVS_ACTION_ATTR_OUTPUT:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>          case OVS_ACTION_ATTR_TUNNEL_PUSH:
>          case OVS_ACTION_ATTR_TUNNEL_POP:
>          case OVS_ACTION_ATTR_USERSPACE:
> diff --git a/lib/odp-util.c b/lib/odp-util.c
> index 746d1e9..4ef4cdc 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> @@ -119,6 +119,7 @@ odp_action_len(uint16_t type)
>
>      switch ((enum ovs_action_attr) type) {
>      case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t);
> +    case OVS_ACTION_ATTR_LB_OUTPUT: return sizeof(uint32_t);
>      case OVS_ACTION_ATTR_TRUNC: return sizeof(struct 
> ovs_action_trunc);
>      case OVS_ACTION_ATTR_TUNNEL_PUSH: return ATTR_LEN_VARIABLE;
>      case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t);
> @@ -1122,6 +1123,9 @@ format_odp_action(struct ds *ds, const struct 
> nlattr *a,
>      case OVS_ACTION_ATTR_OUTPUT:
>          odp_portno_name_format(portno_names, nl_attr_get_odp_port(a), 
> ds);
>          break;
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
> +        ds_put_format(ds, "lb_output(bond,%"PRIu32")", 
> nl_attr_get_u32(a));
> +        break;
>      case OVS_ACTION_ATTR_TRUNC: {
>          const struct ovs_action_trunc *trunc =
>                         nl_attr_get_unspec(a, sizeof *trunc);
> diff --git a/ofproto/bond.c b/ofproto/bond.c
> index 405202f..baeb5ee 100644
> --- a/ofproto/bond.c
> +++ b/ofproto/bond.c
> @@ -54,10 +54,6 @@ static struct ovs_rwlock rwlock = 
> OVS_RWLOCK_INITIALIZER;
>  static struct hmap all_bonds__ = HMAP_INITIALIZER(&all_bonds__);
>  static struct hmap *const all_bonds OVS_GUARDED_BY(rwlock) = 
> &all_bonds__;
>
> -/* Bit-mask for hashing a flow down to a bucket. */
> -#define BOND_MASK 0xff
> -#define BOND_BUCKETS (BOND_MASK + 1)
> -
>  /* Priority for internal rules created to handle recirculation */
>  #define RECIRC_RULE_PRIORITY 20
>
> @@ -126,6 +122,8 @@ struct bond {
>      enum lacp_status lacp_status; /* Status of LACP negotiations. */
>      bool bond_revalidate;       /* True if flows need revalidation. 
> */
>      uint32_t basis;             /* Basis for flow hash function. */
> +    bool use_bond_cache;        /* Use bond cache to avoid 
> recirculation.
> +                                   Applicable only for Balance TCP 
> mode. */
>
>      /* SLB specific bonding info. */
>      struct bond_entry *hash;     /* An array of BOND_BUCKETS 
> elements. */
> @@ -185,7 +183,7 @@ static struct bond_slave 
> *choose_output_slave(const struct bond *,
>                                                struct flow_wildcards 
> *,
>                                                uint16_t vlan)
>      OVS_REQ_RDLOCK(rwlock);
> -static void update_recirc_rules__(struct bond *bond);
> +static void update_recirc_rules__(struct bond *bond, uint32_t 
> bond_recirc_id);
>  static bool bond_is_falling_back_to_ab(const struct bond *);
>
>  /* Attempts to parse 's' as the name of a bond balancing mode.  If 
> successful,
> @@ -262,6 +260,7 @@ void
>  bond_unref(struct bond *bond)
>  {
>      struct bond_slave *slave;
> +    uint32_t bond_recirc_id = 0;
>
>      if (!bond || ovs_refcount_unref_relaxed(&bond->ref_cnt) != 1) {
>          return;
> @@ -282,12 +281,13 @@ bond_unref(struct bond *bond)
>
>      /* Free bond resources. Remove existing post recirc rules. */
>      if (bond->recirc_id) {
> +        bond_recirc_id = bond->recirc_id;
>          recirc_free_id(bond->recirc_id);
>          bond->recirc_id = 0;
>      }
>      free(bond->hash);
>      bond->hash = NULL;
> -    update_recirc_rules__(bond);
> +    update_recirc_rules__(bond, bond_recirc_id);
>
>      hmap_destroy(&bond->pr_rule_ops);
>      free(bond->name);
> @@ -328,13 +328,14 @@ add_pr_rule(struct bond *bond, const struct 
> match *match,
>   * lock annotation. Currently, only 'bond_unref()' calls
>   * this function directly.  */
>  static void
> -update_recirc_rules__(struct bond *bond)
> +update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id)

EC> Maybe the name bond_recirc_id should be more meaningful, 
del_bond_id?

>  {
>      struct match match;
>      struct bond_pr_rule_op *pr_op, *next_op;
>      uint64_t ofpacts_stub[128 / 8];
>      struct ofpbuf ofpacts;
>      int i;
> +    ofp_port_t slave_map[BOND_MASK];
>
>      ofpbuf_use_stub(&ofpacts, ofpacts_stub, sizeof ofpacts_stub);
>
> @@ -353,8 +354,18 @@ update_recirc_rules__(struct bond *bond)
>
>                  add_pr_rule(bond, &match, slave->ofp_port,
>                              &bond->hash[i].pr_rule);
> +                slave_map[i] = slave->ofp_port;
> +            } else {
> +                slave_map[i] = OFP_PORT_C(-1);
>              }
>          }
> +        if (bond->ofproto->backer->rt_support.lb_output_action) {
> +            ofproto_dpif_bundle_add(bond->ofproto, bond->recirc_id, 
> slave_map);
> +        }
> +    } else {
> +        if (bond->ofproto->backer->rt_support.lb_output_action) {
> +            ofproto_dpif_bundle_del(bond->ofproto, bond_recirc_id);
> +        }
>      }
>
>      HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) 
> {
> @@ -404,7 +415,7 @@ static void
>  update_recirc_rules(struct bond *bond)
>      OVS_REQ_RDLOCK(rwlock)
>  {
> -    update_recirc_rules__(bond);
> +    update_recirc_rules__(bond, bond->recirc_id);
>  }
>
>  /* Updates 'bond''s overall configuration to 's'.
> @@ -467,6 +478,10 @@ bond_reconfigure(struct bond *bond, const struct 
> bond_settings *s)
>          recirc_free_id(bond->recirc_id);
>          bond->recirc_id = 0;
>      }
> +    if (bond->use_bond_cache != s->use_bond_cache) {
> +        bond->use_bond_cache = s->use_bond_cache;
> +        revalidate = true;
> +    }
>
>      if (bond->balance == BM_AB || !bond->hash || revalidate) {
>          bond_entry_reset(bond);
> @@ -944,6 +959,17 @@ bond_recirculation_account(struct bond *bond)
>      OVS_REQ_WRLOCK(rwlock)
>  {
>      int i;
> +    uint64_t n_bytes[BOND_BUCKETS] = {0};
> +    bool use_bond_cache =
> +        (bond->balance == BM_TCP)
> +        && bond->ofproto->backer->rt_support.lb_output_action
> +        && bond->use_bond_cache;

EC> Maybe this can be bond_get_cache_mod() [use new name] && 
bond->ofproto->backer->rt_support.lb_output_action
EC> If there is a proper check and bond->use_bond_cache can only be true 
if the datapath is enabled the support check can be removed

> +
> +    if (use_bond_cache) {
> +        /* Retrieve bond stats from datapath. */
> +        dpif_bond_stats_get(bond->ofproto->backer->dpif,
> +                            bond->recirc_id, n_bytes);
> +    }
>
>      for (i=0; i<=BOND_MASK; i++) {
>          struct bond_entry *entry = &bond->hash[i];
> @@ -953,8 +979,12 @@ bond_recirculation_account(struct bond *bond)
>              struct pkt_stats stats;
>              long long int used OVS_UNUSED;
>
> -            rule->ofproto->ofproto_class->rule_get_stats(
> -                rule, &stats, &used);
> +            if (!use_bond_cache) {
> +                rule->ofproto->ofproto_class->rule_get_stats(
> +                    rule, &stats, &used);
> +            } else {
> +                stats.n_bytes = n_bytes[i];
> +            }
>              bond_entry_account(entry, stats.n_bytes);
>          }
>      }
> @@ -1365,6 +1395,8 @@ bond_print_details(struct ds *ds, const struct 
> bond *bond)
>                    may_recirc ? "yes" : "no", may_recirc ? recirc_id: 
> -1);
>
>      ds_put_format(ds, "bond-hash-basis: %"PRIu32"\n", bond->basis);
> +    ds_put_format(ds, "lb-output-action: %s\n",
> +                  bond->use_bond_cache ? "enabled" : "disabled");

EC> Can we also mention the bond_id here also if enabled? I know it's 
the same as recirc ID, but it's not clear
EC> i.e.: "lb-output-action: enabled, bond-id: 1"

>
>      ds_put_format(ds, "updelay: %d ms\n", bond->updelay);
>      ds_put_format(ds, "downdelay: %d ms\n", bond->downdelay);
> @@ -1942,3 +1974,9 @@ bond_get_changed_active_slave(const char *name, 
> struct eth_addr *mac,
>
>      return false;
>  }
> +
> +bool
> +bond_get_cache_mode(const struct bond *bond)

EC> Name is odd, should be more like bond_is_cache_mode_enabled()

> +{
> +    return (bond->balance == BM_TCP) && bond->use_bond_cache;
> +}
> diff --git a/ofproto/bond.h b/ofproto/bond.h
> index e7c3d9b..88a4de1 100644
> --- a/ofproto/bond.h
> +++ b/ofproto/bond.h
> @@ -22,6 +22,10 @@
>  #include "ofproto-provider.h"
>  #include "packets.h"
>
> +/* Bit-mask for hashing a flow down to a bucket. */
> +#define BOND_MASK 0xff
> +#define BOND_BUCKETS (BOND_MASK + 1)
> +
>  struct flow;
>  struct netdev;
>  struct ofpbuf;
> @@ -58,6 +62,8 @@ struct bond_settings {
>                                  /* The MAC address of the interface
>                                     that was active during the last
>                                     ovs run. */
> +    bool use_bond_cache;        /* Use bond cache. Only applicable 
> for
> +                                   bond mode BALANCE TCP. */
>  };
>
>  /* Program startup. */
> @@ -122,4 +128,7 @@ void bond_rebalance(struct bond *);
>  */
>  void bond_update_post_recirc_rules(struct bond *, uint32_t 
> *recirc_id,
>                                     uint32_t *hash_basis);
> +
> +bool bond_get_cache_mode(const struct bond *);
> +
>  #endif /* bond.h */
> diff --git a/ofproto/ofproto-dpif-ipfix.c 
> b/ofproto/ofproto-dpif-ipfix.c
> index b413768..5030978 100644
> --- a/ofproto/ofproto-dpif-ipfix.c
> +++ b/ofproto/ofproto-dpif-ipfix.c
> @@ -3017,6 +3017,7 @@ dpif_ipfix_read_actions(const struct flow *flow,
>          case OVS_ACTION_ATTR_CHECK_PKT_LEN:
>          case OVS_ACTION_ATTR_UNSPEC:
>          case OVS_ACTION_ATTR_DROP:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:

EC> Should this not be handles the same as OVS_ACTION_ATTR_OUTPUT above, 
i.e. "ipfix_actions->output_action = true;"

>          case __OVS_ACTION_ATTR_MAX:
>          default:
>              break;
> diff --git a/ofproto/ofproto-dpif-sflow.c 
> b/ofproto/ofproto-dpif-sflow.c
> index f9ea47a..21dafa6 100644
> --- a/ofproto/ofproto-dpif-sflow.c
> +++ b/ofproto/ofproto-dpif-sflow.c
> @@ -1177,6 +1177,7 @@ dpif_sflow_read_actions(const struct flow *flow,
>          case OVS_ACTION_ATTR_CT:
>      case OVS_ACTION_ATTR_CT_CLEAR:

EC> While we make the change below, maybe also fix this indentation

>          case OVS_ACTION_ATTR_METER:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>              break;
>
>          case OVS_ACTION_ATTR_SET_MASKED:
> diff --git a/ofproto/ofproto-dpif-xlate.c 
> b/ofproto/ofproto-dpif-xlate.c
> index 0b45ecf..1656d47 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -4187,7 +4187,6 @@ compose_output_action__(struct xlate_ctx *ctx, 
> ofp_port_t ofp_port,
>          xlate_commit_actions(ctx);
>
>          if (xr) {
> -            /* Recirculate the packet. */
>              struct ovs_action_hash *act_hash;
>
>              /* Hash action. */
> @@ -4196,15 +4195,28 @@ compose_output_action__(struct xlate_ctx *ctx, 
> ofp_port_t ofp_port,
>                  /* Algorithm supported by all datapaths. */
>                  hash_alg = OVS_HASH_ALG_L4;
>              }
> -            act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
> +
> +            if ((ctx->xbridge->support.lb_output_action) &&
> +                bond_get_cache_mode(xport->xbundle->bond)) {
> +                /*
> +                 * If bond mode is balance-tcp and optimize balance 
> tcp
> +                 * is enabled then use the hash directly for slave 
> selection
> +                 * and avoid recirculation.
> +                 *
> +                 * Currently support for netdev datapath only.
> +                 */
> +                nl_msg_put_u32(ctx->odp_actions, 
> OVS_ACTION_ATTR_LB_OUTPUT,
> +                               xr->recirc_id);
> +            } else {
> +                act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
>                                                  OVS_ACTION_ATTR_HASH,
>                                                  sizeof *act_hash);

EC> indent is not correct in the above

> -            act_hash->hash_alg = hash_alg;
> -            act_hash->hash_basis = xr->hash_basis;
> -
> -            /* Recirc action. */
> -            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
> -                           xr->recirc_id);
> +                act_hash->hash_alg = hash_alg;
> +                act_hash->hash_basis = xr->hash_basis;
> +                /* Recirculate the packet. */
> +                nl_msg_put_u32(ctx->odp_actions, 
> OVS_ACTION_ATTR_RECIRC,
> +                               xr->recirc_id);
> +            }
>          } else if (is_native_tunnel) {
>              /* Output to native tunnel port. */
>              native_tunnel_output(ctx, xport, flow, odp_port, 
> truncate);
> @@ -7265,7 +7277,8 @@ count_output_actions(const struct ofpbuf 
> *odp_actions)
>      int n = 0;
>
>      NL_ATTR_FOR_EACH_UNSAFE (a, left, odp_actions->data, 
> odp_actions->size) {
> -        if (a->nla_type == OVS_ACTION_ATTR_OUTPUT) {
> +        if ((a->nla_type == OVS_ACTION_ATTR_OUTPUT) ||
> +            (a->nla_type == OVS_ACTION_ATTR_LB_OUTPUT)) {
>              n++;
>          }
>      }
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index 0222ec8..9981be6 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -1580,6 +1580,8 @@ check_support(struct dpif_backer *backer)
>      backer->rt_support.ct_timeout = check_ct_timeout_policy(backer);
>      backer->rt_support.explicit_drop_action =
>          dpif_supports_explicit_drop_action(backer->dpif);
> +    backer->rt_support.lb_output_action=
> +        dpif_supports_lb_output_action(backer->dpif);
>
>      /* Flow fields. */
>      backer->rt_support.odp.ct_state = check_ct_state(backer);
> @@ -3429,6 +3431,27 @@ bundle_remove(struct ofport *port_)
>      }
>  }
>
> +int
> +ofproto_dpif_bundle_add(struct ofproto_dpif *ofproto,
> +                        uint32_t bond_id, const ofp_port_t 
> slave_map[])
> +{
> +    /* Convert ofp_port to odp_port */
> +    odp_port_t odp_map[BOND_BUCKETS];
> +    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +        odp_map[bucket] = (slave_map[bucket] == OFP_PORT_C(-1)
> +                           ? ODP_PORT_C(-1)
> +                           : ofp_port_to_odp_port(ofproto, 
> slave_map[bucket]));
> +    }
> +
> +    return dpif_bond_add(ofproto->backer->dpif, bond_id, odp_map);
> +}
> +
> +int
> +ofproto_dpif_bundle_del(struct ofproto_dpif *ofproto, uint32_t 
> bond_id)
> +{
> +    return dpif_bond_del(ofproto->backer->dpif, bond_id);
> +}
> +
>  static void
>  send_pdu_cb(void *port_, const void *pdu, size_t pdu_size)
>  {
> @@ -5559,6 +5582,7 @@ get_datapath_cap(const char *datapath_type, 
> struct smap *cap)
>      smap_add(cap, "ct_timeout", s.ct_timeout ? "true" : "false");
>      smap_add(cap, "explicit_drop_action",
>               s.explicit_drop_action ? "true" :"false");
> +    smap_add(cap, "lb_output_action", s.lb_output_action ? "true" : 
> "false");
>  }
>
>  /* Gets timeout policy name in 'backer' based on 'zone', 'dl_type' 
> and
> diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
> index c9d5df3..384a269 100644
> --- a/ofproto/ofproto-dpif.h
> +++ b/ofproto/ofproto-dpif.h
> @@ -202,8 +202,10 @@ struct group_dpif *group_dpif_lookup(struct 
> ofproto_dpif *,
>      DPIF_SUPPORT_FIELD(bool, ct_timeout, "Conntrack timeout policy")  
>       \
>                                                                              \
>      /* True if the datapath supports explicit drop action. */         
>       \
> -    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop 
> action")
> -

EC> Removal of new line

> +    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop 
> action")  \
> +                                                                      
>       \
> +    /* True if the datapath supports balance_tcp optimization */      
>       \
> +    DPIF_SUPPORT_FIELD(bool, lb_output_action, "Optimized Balance TCP 
> mode")
>
>  /* Stores the various features which the corresponding backer 
> supports. */
>  struct dpif_backer_support {
> @@ -379,6 +381,9 @@ int ofproto_dpif_add_internal_flow(struct 
> ofproto_dpif *,
>                                     struct rule **rulep);
>  int ofproto_dpif_delete_internal_flow(struct ofproto_dpif *, struct 
> match *,
>                                        int priority);
> +int ofproto_dpif_bundle_add(struct ofproto_dpif *, uint32_t bond_id,
> +                            const ofp_port_t slave_map[]);
> +int ofproto_dpif_bundle_del(struct ofproto_dpif *, uint32_t bond_id);
>
>  bool ovs_native_tunneling_is_on(struct ofproto_dpif *);
>
> diff --git a/tests/lacp.at b/tests/lacp.at
> index 7b460d7..c881a0b 100644
> --- a/tests/lacp.at
> +++ b/tests/lacp.at
> @@ -121,6 +121,7 @@ AT_CHECK([ovs-appctl bond/show], [0], [dnl
>  bond_mode: active-backup
>  bond may use recirculation: no, Recirc-ID : -1
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -286,6 +287,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -301,6 +303,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -423,6 +426,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -440,6 +444,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -555,6 +560,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -572,6 +578,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -692,6 +699,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -709,6 +717,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index fe73c38..4a5e893 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -4600,6 +4600,11 @@ port_configure_bond(struct port *port, struct 
> bond_settings *s)
>          /* OVSDB did not store the last active interface */
>          s->active_slave_mac = eth_addr_zero;
>      }
> +
> +    /* use_bond_cache is disabled by default. */
> +    s->use_bond_cache = (s->balance == BM_TCP)
> +                        && smap_get_bool(&port->cfg->other_config,
> +                                      "lb-output-action", false);
>  }
>
>  /* Returns true if 'port' is synthetic, that is, if we constructed it 
> locally
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 4a74ed3..3da8d0c 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -1994,6 +1994,16 @@
>          <code>active-backup</code>.
>        </column>
>
> +      <column name="other_config" key="lb-output-action"
> +              type='{"type": "boolean"}'>
> +        Enable/disable usage of optimized lb-output-action for
> +        balancing flows among output slaves in load balanced bonds in
> +        <code>balance-tcp</code>. When enabled, it uses optimized 
> path for
> +        balance-tcp mode by using rss hash and avoids recirculation.
> +        It affects only new flows, i.e, existing flows remain 
> unchanged.
> +        This knob does not affect other balancing modes.
> +      </column>
> +
>        <group title="Link Failure Detection">
>          <p>
>            An important part of link bonding is detecting that links 
> are down so
> @@ -5788,6 +5798,19 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 
> type=patch options:peer=p1 \
>          higher performance for MPLS and active-active load balancing
>          bonding modes.
>        </column>
> +      <column name="capabilities" key="lb_output_action"
> +              type='{"type": "boolean"}'>
> +        If this is true, then the datapath supports optimized 
> balance-tcp
> +        bond mode. This capability replaces existing 
> <code>hash</code> and
> +        <code>recirc</code> actions with new action 
> <code>lb_output</code>
> +        and avoids recirculation of packet in datapath. It is 
> supported
> +        only for balance-tcp bond mode in netdev datapath. The new 
> action
> +        gives higer performance by using bond buckets instead of post
> +        recirculation flows for selection of slave port from bond. By 
> default
> +        this new action is disabled, however it can be enabled by 
> setting
> +        <ref column="other-config" key="lb-output-action"/> in
> +        <ref table="Port"/> table.
> +      </column>
>        <group title="Connection-Tracking Capabilities">
>          <p>
>            These capabilities are granular because Open vSwitch and 
> its
> -- 
> 2.7.4
Ilya Maximets Feb. 14, 2020, 6:25 p.m. UTC | #2
On 2/11/20 12:28 PM, Vishal Deep Ajmera wrote:
> Problem:
> --------
> In OVS-DPDK, flows with output over a bond interface of type “balance-tcp”
> (using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into
> "HASH" and "RECIRC" datapath actions. After recirculation, the packet is
> forwarded to the bond member port based on 8-bits of the datapath hash
> value computed through dp_hash. This causes performance degradation in the
> following ways:
> 
> 1. The recirculation of the packet implies another lookup of the packet’s
> flow key in the exact match cache (EMC) and potentially Megaflow classifier
> (DPCLS). This is the biggest cost factor.
> 
> 2. The recirculated packets have a new “RSS” hash and compete with the
> original packets for the scarce number of EMC slots. This implies more
> EMC misses and potentially EMC thrashing causing costly DPCLS lookups.
> 
> 3. The 256 extra megaflow entries per bond for dp_hash bond selection put
> additional load on the revalidation threads.
> 
> Owing to this performance degradation, deployments stick to “balance-slb”
> bond mode even though it does not do active-active load balancing for
> VXLAN- and GRE-tunnelled traffic because all tunnel packet have the same
> source MAC address.
> 
> Proposed optimization:
> ----------------------
> This proposal introduces a new load-balancing output action instead of
> recirculation.
> 
> Maintain one table per-bond (could just be an array of uint16's) and
> program it the same way internal flows are created today for each possible
> hash value(256 entries) from ofproto layer. Use this table to load-balance
> flows as part of output action processing.
> 
> Currently xlate_normal() -> output_normal() -> bond_update_post_recirc_rules()
> -> bond_may_recirc() and compose_output_action__() generate
> “dp_hash(hash_l4(0))” and “recirc(<RecircID>)” actions. In this case the
> RecircID identifies the bond. For the recirculated packets the ofproto layer
> installs megaflow entries that match on RecircID and masked dp_hash and send
> them to the corresponding output port.
> 
> Instead, we will now generate action as
>     "lb_output(bond,<bond id>)"
> 
> This combines hash computation (only if needed, else re-use RSS hash) and
> inline load-balancing over the bond. This action is used *only* for balance-tcp
> bonds in OVS-DPDK datapath (the OVS kernel datapath remains unchanged).
> 
> Example:
> --------
> Current scheme:
> ---------------
> With 1 IP-UDP flow:
> 
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2828969, bytes:181054016, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)
> 
> recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:2828937, bytes:181051968, used:0.000s, actions:2
> 
> With 8 IP-UDP flows (with random UDP src port): (all hitting same DPCL):
> 
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)
> 
> recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:377395, bytes:24153280, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:333486, bytes:21343104, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:348461, bytes:22301504, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:633353, bytes:40534592, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:319901, bytes:20473664, used:0.001s, actions:2
> recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:334985, bytes:21439040, used:0.001s, actions:1
> recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:326404, bytes:20889856, used:0.001s, actions:1
> 
> New scheme:
> -----------
> We can do with a single flow entry (for any number of new flows):
> 
> in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:lb_output(bond,1)
> 
> A new CLI has been added to dump datapath bond cache as given below.
> 
> “sudo ovs-appctl dpif-netdev/dp-bond-show [dp]”
> 
> root@ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/dp-bond-show
> Bond cache:
>         bond-id 1 :
>                 bucket 0 - slave 2
>                 bucket 1 - slave 1
>                 bucket 2 - slave 2
>                 bucket 3 - slave 1
> 
> Performance improvement:
> ------------------------
> With a prototype of the proposed idea, the following perf improvement is seen
> with Phy-VM-Phy UDP traffic, single flow. With multiple flows, the improvement
> is even more enhanced (due to reduced number of flows).
> 
> 1 VM:
> *****
> +--------------------------------------+
> |                 mpps                 |
> +--------------------------------------+
> | Flows  master  with-opt.   %delta    |
> +--------------------------------------+
> | 1      4.34    5.59        28.86
> | 10     4.09    5.43        32.61
> | 400    3.44    5.35        55.25
> | 1k     3.31    5.25        58.41
> | 10k    2.46    4.43        79.78
> | 100k   2.32    4.27        83.59
> | 500k   2.29    4.23        84.57
> +--------------------------------------+
> mpps: million packets per second.
> packet size 64 bytes.
> 
> Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
> Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
> 
> CC: Jan Scheurich <jan.scheurich@ericsson.com>
> CC: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> CC: Ben Pfaff <blp@ovn.org>
> CC: Ilya Maximets <i.maximets@ovn.org>
> CC: Ian Stokes <ian.stokes@intel.com>
> CC: David Marchand <david.marchand@redhat.com>
> CC: Matteo Croce <mcroce@redhat.com>
> ---
>  datapath/linux/compat/include/linux/openvswitch.h |   1 +
>  lib/dpif-netdev.c                                 | 399 ++++++++++++++++++++--
>  lib/dpif-netlink.c                                |   3 +
>  lib/dpif-provider.h                               |   9 +
>  lib/dpif.c                                        |  49 +++
>  lib/dpif.h                                        |   7 +
>  lib/odp-execute.c                                 |   2 +
>  lib/odp-util.c                                    |   4 +
>  ofproto/bond.c                                    |  58 +++-
>  ofproto/bond.h                                    |   9 +
>  ofproto/ofproto-dpif-ipfix.c                      |   1 +
>  ofproto/ofproto-dpif-sflow.c                      |   1 +
>  ofproto/ofproto-dpif-xlate.c                      |  31 +-
>  ofproto/ofproto-dpif.c                            |  24 ++
>  ofproto/ofproto-dpif.h                            |   9 +-
>  tests/lacp.at                                     |   9 +
>  vswitchd/bridge.c                                 |   5 +
>  vswitchd/vswitch.xml                              |  23 ++
>  18 files changed, 587 insertions(+), 57 deletions(-)

I agree with most of the things Eelco highlited in his revew.
Comments inline (some might be duplicates with review from Eelco).


Best regards, Ilya Maximets.

> 
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
> index 2f0c655..7604a2a 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -1021,6 +1021,7 @@ enum ovs_action_attr {
>  	OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
>  	OVS_ACTION_ATTR_TUNNEL_POP,    /* u32 port number. */
>  	OVS_ACTION_ATTR_DROP,          /* u32 xlate_error. */
> +	OVS_ACTION_ATTR_LB_OUTPUT,     /* u32 bond-id */

Period at the end of comment.

>  #endif
>  	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
>  				       * from userspace. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index d393aab..f550192 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -79,6 +79,7 @@
>  #include "unixctl.h"
>  #include "util.h"
>  #include "uuid.h"
> +#include "ofproto/bond.h"

'lib' code should not include 'ofproto' headers.
Move required things to 'lib' or common 'include' headers.

>  
>  VLOG_DEFINE_THIS_MODULE(dpif_netdev);
>  
> @@ -376,6 +377,12 @@ struct dp_netdev {
>  
>      struct conntrack *conntrack;
>      struct pmd_auto_lb pmd_alb;
> +
> +    /* Bonds.
> +     *
> +     * Any lookup into 'bonds' requires taking 'bond_mutex'. */

Lookup doesn't require holding any mutexes.  That is RCU made for.
And there is no such structure member as 'bonds'.

> +    struct ovs_mutex bond_mutex;
> +    struct cmap tx_bonds; /* Contains "struct tx_bond"s. */

Single quotes, please.

>  };
>  
>  static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> @@ -607,6 +614,20 @@ struct tx_port {
>      struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>  };
>  
> +/* Contained by struct tx_bond 'slave_buckets' */

Period at the end of the comment.

> +struct slave_entry {
> +    odp_port_t slave_id;
> +    atomic_ullong n_packets;
> +    atomic_ullong n_bytes;
> +};
> +
> +/* Contained by struct dp_netdev_pmd_thread's 'tx_bonds'. */
> +struct tx_bond {
> +    struct cmap_node node;
> +    uint32_t bond_id;
> +    struct slave_entry slave_buckets[BOND_BUCKETS];
> +};
> +
>  /* A set of properties for the current processing loop that is not directly
>   * associated with the pmd thread itself, but with the packets being
>   * processed or the short-term system configuration (for example, time).
> @@ -739,6 +760,11 @@ struct dp_netdev_pmd_thread {
>       * read by the pmd thread. */
>      struct hmap tx_ports OVS_GUARDED;
>  
> +    struct ovs_mutex bond_mutex;    /* Mutex for 'tx_bonds'. */


s/Mutex for 'tx_bonds'./Protects updates of 'tx_bonds'/ ?

Also, this new mutex should be mentioned in "Acquisition order" in the
comment above the structure.

> +    /* Map of 'tx_bond's used for transmission.  Written by the main thread,
> +     * read/written by the pmd thread. */

Why PMD thread needs to update cmap?

> +    struct cmap tx_bonds OVS_GUARDED;

I'm not sure about OVS_GUARDED since this might produce clang
warnings if we'll iterate over cmap without holding any mutex.

> +
>      /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>       * ports (that support push_tunnel/pop_tunnel), the other contains ports
>       * with at least one txq (that support send).  A port can be in both.
> @@ -830,6 +856,12 @@ static void dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
>  static int
>  dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
>                                     bool force);
> +static void dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
> +                                         struct tx_bond *bond)
> +    OVS_REQUIRES(pmd->bond_mutex);
> +static void dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
> +                                           struct tx_bond *tx)
> +    OVS_REQUIRES(pmd->bond_mutex);
>  
>  static void reconfigure_datapath(struct dp_netdev *dp)
>      OVS_REQUIRES(dp->port_mutex);
> @@ -1396,6 +1428,45 @@ pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
>      par.command_type = PMD_INFO_PERF_SHOW;
>      dpif_netdev_pmd_info(conn, argc, argv, &par);
>  }
> +
> +static void
> +dpif_netdev_dp_bond_show(struct unixctl_conn *conn, int argc,
> +                         const char *argv[], void *aux OVS_UNUSED)
> +{
> +    struct ds reply = DS_EMPTY_INITIALIZER;
> +
> +    ovs_mutex_lock(&dp_netdev_mutex);
> +
> +    struct dp_netdev *dp = NULL;
> +    if (argc == 2) {
> +        dp = shash_find_data(&dp_netdevs, argv[1]);
> +    } else if (shash_count(&dp_netdevs) == 1) {
> +        /* There's only one datapath */

Period at the end.
I see that it's a copy-paste, but it's better to not copy the bad style.

> +        dp = shash_first(&dp_netdevs)->data;
> +    }
> +    if (!dp) {
> +        ovs_mutex_unlock(&dp_netdev_mutex);
> +        unixctl_command_reply_error(conn,
> +                                    "please specify an existing datapath");
> +        return;
> +    }
> +    ds_put_cstr(&reply, "\nBonds:\n");
> +
> +    struct tx_bond *dp_bond_entry;
> +    CMAP_FOR_EACH (dp_bond_entry, node, &dp->tx_bonds) {
> +        ds_put_format(&reply, "\tbond-id %u :\n",

PRIu32

> +                      dp_bond_entry->bond_id);
> +        for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +            ds_put_format(&reply, "\t\tbucket %u - slave %u \n",

%d - PRIu32

> +                      bucket,
> +                      dp_bond_entry->slave_buckets[bucket].slave_id);

odp_to_u32(dp_bond_entry->slave_buckets[bucket].slave_id)


> +        }
> +    }
> +    ovs_mutex_unlock(&dp_netdev_mutex);
> +    unixctl_command_reply(conn, ds_cstr(&reply));
> +    ds_destroy(&reply);
> +}
> +
>  
>  static int
>  dpif_netdev_init(void)
> @@ -1427,6 +1498,9 @@ dpif_netdev_init(void)
>                               "[-us usec] [-q qlen]",
>                               0, 10, pmd_perf_log_set_cmd,
>                               NULL);
> +    unixctl_command_register("dpif-netdev/dp-bond-show", "[dp]",

Why dp-bond-show?
Isn't it clear to just have 'dpif-netdev/bond-show' ?

> +                             0, 1, dpif_netdev_dp_bond_show,
> +                             NULL);
>      return 0;
>  }
>  
> @@ -1551,6 +1625,9 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>      ovs_mutex_init_recursive(&dp->port_mutex);
>      hmap_init(&dp->ports);
>      dp->port_seq = seq_create();
> +    ovs_mutex_init(&dp->bond_mutex);
> +    cmap_init(&dp->tx_bonds);
> +
>      fat_rwlock_init(&dp->upcall_rwlock);
>  
>      dp->reconfigure_seq = seq_create();
> @@ -1657,6 +1734,12 @@ dp_delete_meter(struct dp_netdev *dp, uint32_t meter_id)
>      }
>  }
>  
> +static uint32_t
> +hash_bond_id(uint32_t bond_id)
> +{
> +    return hash_int(bond_id, 0);
> +}
> +
>  /* Requires dp_netdev_mutex so that we can't get a new reference to 'dp'
>   * through the 'dp_netdevs' shash while freeing 'dp'. */
>  static void
> @@ -1664,6 +1747,7 @@ dp_netdev_free(struct dp_netdev *dp)
>      OVS_REQUIRES(dp_netdev_mutex)
>  {
>      struct dp_netdev_port *port, *next;
> +    struct tx_bond *bond;
>  
>      shash_find_and_delete(&dp_netdevs, dp->name);
>  
> @@ -1673,6 +1757,13 @@ dp_netdev_free(struct dp_netdev *dp)
>      }
>      ovs_mutex_unlock(&dp->port_mutex);
>  
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
> +        cmap_remove(&dp->tx_bonds, &bond->node, hash_bond_id(bond->bond_id));
> +        ovsrcu_postpone(free, bond);
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
>      dp_netdev_destroy_all_pmds(dp, true);
>      cmap_destroy(&dp->poll_threads);
>  
> @@ -1691,6 +1782,9 @@ dp_netdev_free(struct dp_netdev *dp)
>      hmap_destroy(&dp->ports);
>      ovs_mutex_destroy(&dp->port_mutex);
>  
> +    cmap_destroy(&dp->tx_bonds);
> +    ovs_mutex_destroy(&dp->bond_mutex);
> +
>      /* Upcalls must be disabled at this point */
>      dp_netdev_destroy_upcall_lock(dp);
>  
> @@ -4421,6 +4515,20 @@ tx_port_lookup(const struct hmap *hmap, odp_port_t port_no)
>      return NULL;
>  }
>  
> +static struct tx_bond *
> +tx_bond_lookup(const struct cmap *tx_bonds, uint32_t bond_id)
> +{
> +    struct tx_bond *tx;
> +    const struct cmap_node *pnode;
> +
> +    pnode = cmap_find(tx_bonds, hash_bond_id(bond_id));

As Eelco already pointed out you have to be careful with possible hash
collisions and check that returned node actually has desired bond_id,
and continue search if not.  See tx_port_lookup() for example.

> +    if (!pnode) {
> +        return NULL;
> +    }
> +    tx = CONTAINER_OF(pnode, struct tx_bond, node);
> +    return tx;
> +}
> +
>  static int
>  port_reconfigure(struct dp_netdev_port *port)
>  {
> @@ -5072,6 +5180,21 @@ reconfigure_datapath(struct dp_netdev *dp)
>          ovs_mutex_unlock(&pmd->port_mutex);
>      }
>  
> +    /* Add every bond to the cmap of each pmd thread, if it's not
> +     * there already and if this pmd has at least one rxq to poll. */
> +    ovs_mutex_lock(&dp->bond_mutex);

We're only reading dp->tx_bonds, no need to hold the mutex.

> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        ovs_mutex_lock(&pmd->bond_mutex);

Looking at all the callers of dp_netdev_add_bond_tx_to_pmd(),
you may just move mutex acquiring into the function.

> +        if (hmap_count(&pmd->poll_list) || pmd->core_id == NON_PMD_CORE_ID) {
> +            struct tx_bond *bond;
> +            CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
> +                dp_netdev_add_bond_tx_to_pmd(pmd, bond);

Do we really need to re-create all the nodes here?
Can we add another argument like ''bool update_if_exists' to the
dp_netdev_add_bond_tx_to_pmd() and avoid useless memory copy?

> +            }

Can we just do that in the same loop where we're adding ports?
It will reduce code duplication.

> +        }
> +        ovs_mutex_unlock(&pmd->bond_mutex);
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
>      /* Reload affected pmd threads. */
>      reload_affected_pmds(dp);

Not relevant taking into account above comments, but
reload should be done after updating ports and before updating bonds
to avoid confusion that bonds update requires PMD reloading.

>  
> @@ -6110,6 +6233,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      atomic_init(&pmd->reload, false);
>      ovs_mutex_init(&pmd->flow_mutex);
>      ovs_mutex_init(&pmd->port_mutex);
> +    ovs_mutex_init(&pmd->bond_mutex);
>      cmap_init(&pmd->flow_table);
>      cmap_init(&pmd->classifiers);
>      pmd->ctx.last_rxq = NULL;
> @@ -6120,6 +6244,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      hmap_init(&pmd->tx_ports);
>      hmap_init(&pmd->tnl_port_cache);
>      hmap_init(&pmd->send_port_cache);
> +    cmap_init(&pmd->tx_bonds);
>      /* init the 'flow_cache' since there is no
>       * actual thread created for NON_PMD_CORE_ID. */
>      if (core_id == NON_PMD_CORE_ID) {
> @@ -6135,11 +6260,17 @@ static void
>  dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>  {
>      struct dpcls *cls;
> +    struct tx_bond *tx;
>  
>      dp_netdev_pmd_flow_flush(pmd);
>      hmap_destroy(&pmd->send_port_cache);
>      hmap_destroy(&pmd->tnl_port_cache);
>      hmap_destroy(&pmd->tx_ports);
> +    CMAP_FOR_EACH (tx, node, &pmd->tx_bonds) {
> +        cmap_remove(&pmd->tx_bonds, &tx->node, hash_bond_id(tx->bond_id));
> +        ovsrcu_postpone(free, tx);
> +    }
> +    cmap_destroy(&pmd->tx_bonds);
>      hmap_destroy(&pmd->poll_list);
>      /* All flows (including their dpcls_rules) have been deleted already */
>      CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
> @@ -6151,6 +6282,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>      ovs_mutex_destroy(&pmd->flow_mutex);
>      seq_destroy(pmd->reload_seq);
>      ovs_mutex_destroy(&pmd->port_mutex);
> +    ovs_mutex_destroy(&pmd->bond_mutex);
>      free(pmd);
>  }
>  
> @@ -6304,6 +6436,41 @@ dp_netdev_del_port_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
>      free(tx);
>      pmd->need_reload = true;
>  }
> +
> +static void
> +dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
> +                             struct tx_bond *bond)
> +    OVS_REQUIRES(pmd->bond_mutex)
> +{
> +    struct tx_bond *tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);

Same as before.  Collisions should be handled.

> +    if (tx) {
> +        struct tx_bond *new_tx = xmemdup(bond, sizeof *tx);

sizeof *bond, since you're copying it.

> +        /* Copy the stats for each bucket. */
> +        for (int i = 0;i < BOND_BUCKETS; i++) {
> +            uint64_t n_packets, n_bytes;

Empty line after definitions.

> +            atomic_read_relaxed(&tx->slave_buckets[i].n_packets, &n_packets);
> +            atomic_read_relaxed(&tx->slave_buckets[i].n_bytes, &n_bytes);
> +            new_tx->slave_buckets[i].n_packets = n_packets;
> +            new_tx->slave_buckets[i].n_bytes = n_bytes;

atomic_init ?

> +        }
> +        cmap_replace(&pmd->tx_bonds, &tx->node, &new_tx->node,
> +                     hash_bond_id(bond->bond_id));
> +        ovsrcu_postpone(free, tx);
> +    } else {
> +        tx = xmemdup(bond, sizeof *tx);

sizeof *bond

> +        cmap_insert(&pmd->tx_bonds, &tx->node, hash_bond_id(bond->bond_id));
> +    }
> +}
> +
> +/* Del 'tx' from the tx bond cache of 'pmd' */

Period at the end of the comment.
s/Del/Delete/

> +static void
> +dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
> +                               struct tx_bond *tx)
> +    OVS_REQUIRES(pmd->bond_mutex)
> +{

Technically, mutex acquisition could be moved inside this function too.

> +    cmap_remove(&pmd->tx_bonds, &tx->node, hash_bond_id(tx->bond_id));
> +    ovsrcu_postpone(free, tx);
> +}
>  
>  static char *
>  dpif_netdev_get_datapath_version(void)
> @@ -7129,6 +7296,94 @@ dp_execute_userspace_action(struct dp_netdev_pmd_thread *pmd,
>      }
>  }
>  
> +static bool
> +dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
> +                         struct dp_packet_batch *packets_,
> +                         bool should_steal, odp_port_t port_no)
> +{
> +    struct tx_port *p = pmd_send_port_cache_lookup(pmd, port_no);
> +    if (!OVS_LIKELY(p)) {
> +        return false;
> +    }
> +
> +    struct dp_packet_batch out;
> +    if (!should_steal) {
> +        dp_packet_batch_clone(&out, packets_);
> +        dp_packet_batch_reset_cutlen(packets_);
> +        packets_ = &out;
> +    }
> +    dp_packet_batch_apply_cutlen(packets_);
> +#ifdef DPDK_NETDEV
> +    if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> +                     && packets_->packets[0]->source
> +                     != p->output_pkts.packets[0]->source)) {

I already asked in a previous review. Please, keep alignments as
in the original code.

> +        /* netdev-dpdk assumes that all packets in a single
> +         * output batch has the same source. Flush here to
> +         * avoid memory access issues. */
> +        dp_netdev_pmd_flush_output_on_port(pmd, p);
> +    }
> +#endif
> +    if (dp_packet_batch_size(&p->output_pkts)
> +        + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
> +        /* Flush here to avoid overflow. */
> +        dp_netdev_pmd_flush_output_on_port(pmd, p);
> +    }
> +    if (dp_packet_batch_is_empty(&p->output_pkts)) {
> +        pmd->n_output_batches++;
> +    }
> +
> +    struct dp_packet *packet;
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> +        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> +            pmd->ctx.last_rxq;
> +        dp_packet_batch_add(&p->output_pkts, packet);
> +    }
> +    return true;
> +}
> +
> +static bool
> +dp_execute_lb_output_action(struct dp_netdev_pmd_thread *pmd,
> +                            struct dp_packet_batch *packets_,
> +                            bool should_steal, uint32_t bond)
> +{
> +    struct tx_bond *p_bond = tx_bond_lookup(&pmd->tx_bonds, bond);
> +    if (!p_bond) {
> +        return false;
> +    }
> +
> +    struct dp_packet_batch del_pkts;
> +    dp_packet_batch_init(&del_pkts);
> +
> +    struct dp_packet *packet;
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {

For loops like this we have DP_PACKET_BATCH_REFILL_FOR_EACH.

> +        /*
> +         * Lookup the bond-hash table using hash to get the slave.
> +         */
> +        uint32_t hash = dp_packet_get_rss_hash(packet);
> +        struct slave_entry *s_entry = &p_bond->slave_buckets[hash & BOND_MASK];
> +        odp_port_t bond_member = s_entry->slave_id;
> +        uint32_t size = dp_packet_size(packet);
> +
> +        struct dp_packet_batch output_pkt;
> +        dp_packet_batch_init_packet(&output_pkt, packet);
> +        if (OVS_LIKELY(dp_execute_output_action(pmd, &output_pkt, should_steal,
> +                                                bond_member))) {
> +            /* Update slave stats. */
> +            non_atomic_ullong_add(&s_entry->n_packets, 1);
> +            non_atomic_ullong_add(&s_entry->n_bytes, size);
> +        } else {
> +            dp_packet_batch_add(&del_pkts, packet);
> +        }
> +    }
> +
> +    /* Delete packets that failed OUTPUT action */

Period at the end of comment.

> +    COVERAGE_ADD(datapath_drop_invalid_port,
> +                 dp_packet_batch_size(&del_pkts));
> +    dp_packet_delete_batch(&del_pkts, should_steal);

If you will refill oroginal batch, you may just return false in case of
any failures and caller will take care of remaining packets.

> +
> +    return true;
> +}
> +
>  static void
>  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>                const struct nlattr *a, bool should_steal)
> @@ -7144,43 +7399,18 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>  
>      switch ((enum ovs_action_attr)type) {
>      case OVS_ACTION_ATTR_OUTPUT:
> -        p = pmd_send_port_cache_lookup(pmd, nl_attr_get_odp_port(a));
> -        if (OVS_LIKELY(p)) {
> -            struct dp_packet *packet;
> -            struct dp_packet_batch out;
> -
> -            if (!should_steal) {
> -                dp_packet_batch_clone(&out, packets_);
> -                dp_packet_batch_reset_cutlen(packets_);
> -                packets_ = &out;
> -            }
> -            dp_packet_batch_apply_cutlen(packets_);
> -
> -#ifdef DPDK_NETDEV
> -            if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> -                             && packets_->packets[0]->source
> -                                != p->output_pkts.packets[0]->source)) {
> -                /* XXX: netdev-dpdk assumes that all packets in a single
> -                 *      output batch has the same source. Flush here to
> -                 *      avoid memory access issues. */
> -                dp_netdev_pmd_flush_output_on_port(pmd, p);
> -            }
> -#endif
> -            if (dp_packet_batch_size(&p->output_pkts)
> -                + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
> -                /* Flush here to avoid overflow. */
> -                dp_netdev_pmd_flush_output_on_port(pmd, p);
> -            }
> -
> -            if (dp_packet_batch_is_empty(&p->output_pkts)) {
> -                pmd->n_output_batches++;
> -            }
> +        if (dp_execute_output_action(pmd, packets_, should_steal,
> +                                     nl_attr_get_odp_port(a))) {
> +            return;
> +        } else {
> +            COVERAGE_ADD(datapath_drop_invalid_port,
> +                         dp_packet_batch_size(packets_));
> +        }
> +        break;
>  
> -            DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> -                p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> -                                                             pmd->ctx.last_rxq;
> -                dp_packet_batch_add(&p->output_pkts, packet);
> -            }
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
> +        if (dp_execute_lb_output_action(pmd, packets_, should_steal,
> +                                        nl_attr_get_u32(a))) {
>              return;
>          } else {
>              COVERAGE_ADD(datapath_drop_invalid_port,

You need to create different counter for bond failures.

> @@ -7738,6 +7968,100 @@ dpif_netdev_ipf_dump_done(struct dpif *dpif OVS_UNUSED, void *ipf_dump_ctx)
>  
>  }
>  
> +static int
> +dpif_netdev_bond_add(struct dpif *dpif, uint32_t bond_id,
> +                     odp_port_t slave_map[])
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +
> +    /* Prepare new bond mapping. */
> +    struct tx_bond *new_tx = xzalloc(sizeof(struct tx_bond));

sizeof *new_tx

> +
> +    new_tx->bond_id = bond_id;
> +    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +        new_tx->slave_buckets[bucket].slave_id = slave_map[bucket];
> +    }
> +
> +    /* Check if bond already existed */

Period.

> +    struct tx_bond *old_tx = tx_bond_lookup(&dp->tx_bonds, bond_id);

lookup must be under bond_mutex to be sure that cmap didn't change
between lookup and update.

> +    ovs_mutex_lock(&dp->bond_mutex);
> +    if (old_tx) {
> +        cmap_replace(&dp->tx_bonds, &old_tx->node, &new_tx->node,
> +                     hash_bond_id(bond_id));
> +        ovsrcu_postpone(free, old_tx);
> +    } else {
> +        cmap_insert(&dp->tx_bonds, &new_tx->node,
> +                    hash_bond_id(bond_id));
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
> +    /* Update all PMDs with new bond mapping. */
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        ovs_mutex_lock(&pmd->bond_mutex);
> +        dp_netdev_add_bond_tx_to_pmd(pmd, new_tx);
> +        ovs_mutex_unlock(&pmd->bond_mutex);
> +    }
> +    return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +
> +    /* Check if bond existed. */
> +    struct tx_bond *tx = tx_bond_lookup(&dp->tx_bonds, bond_id);

lookup should be under the mutex.

> +    if (tx) {
> +        ovs_mutex_lock(&dp->bond_mutex);
> +        cmap_remove(&dp->tx_bonds, &tx->node, hash_bond_id(bond_id));
> +        ovsrcu_postpone(free, tx);

postpone does't require mutex.

> +        ovs_mutex_unlock(&dp->bond_mutex);
> +    } else {
> +        /* Bond is not present. */
> +        return 0;
> +    }
> +
> +    /* Remove the bond map in all pmds. */
> +    struct dp_netdev_pmd_thread *pmd;
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        tx = tx_bond_lookup(&pmd->tx_bonds, bond_id);

lookup should be under the mutex.

BTW, why we're iterating twice?  Once for lookup and once for
actual remove?  Can't we just iterate once and remove on the way?

Same for dp->tx_bonds.

> +        if (tx) {
> +            ovs_mutex_lock(&pmd->bond_mutex);
> +            dp_netdev_del_bond_tx_from_pmd(pmd, tx);
> +            ovs_mutex_unlock(&pmd->bond_mutex);
> +        }
> +    }
> +    return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                           uint64_t *n_bytes)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +
> +    /* Search the bond in all PMDs */

Period.

> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        struct tx_bond *pmd_bond_entry
> +            = tx_bond_lookup(&pmd->tx_bonds, bond_id);
> +        ovs_mutex_lock(&pmd->bond_mutex);

No need to hold mutex for reads.

> +        if (pmd_bond_entry) {

You could save some indentations by using 'continue' here.

> +             /* Read bond stats. */
> +            for (int i = 0; i < BOND_BUCKETS; i++) {
> +                uint64_t pmd_n_bytes;
> +                atomic_read_relaxed(
> +                     &pmd_bond_entry->slave_buckets[i].n_bytes,
> +                     &pmd_n_bytes);
> +                n_bytes[i] += pmd_n_bytes;
> +            }
> +        }
> +        ovs_mutex_unlock(&pmd->bond_mutex);
> +    }
> +    return 0;
> +}
> +
>  const struct dpif_class dpif_netdev_class = {
>      "netdev",
>      true,                       /* cleanup_required */
> @@ -7811,6 +8135,9 @@ const struct dpif_class dpif_netdev_class = {
>      dpif_netdev_meter_set,
>      dpif_netdev_meter_get,
>      dpif_netdev_meter_del,
> +    dpif_netdev_bond_add,
> +    dpif_netdev_bond_del,
> +    dpif_netdev_bond_stats_get,
>  };
>  
>  static void
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 5b5c96d..116f8fc 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -3993,6 +3993,9 @@ const struct dpif_class dpif_netlink_class = {
>      dpif_netlink_meter_set,
>      dpif_netlink_meter_get,
>      dpif_netlink_meter_del,
> +    NULL,                       /* bond_add */
> +    NULL,                       /* bond_del */
> +    NULL,                       /* bond_stats_get */
>  };
>  
>  static int
> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
> index b77317b..4415836 100644
> --- a/lib/dpif-provider.h
> +++ b/lib/dpif-provider.h
> @@ -616,6 +616,15 @@ struct dpif_class {
>       * zero. */
>      int (*meter_del)(struct dpif *, ofproto_meter_id meter_id,
>                       struct ofputil_meter_stats *, uint16_t n_bands);
> +
> +    /* Adds a bond with 'bond_id' and the slave-map to 'dpif'. */
> +    int (*bond_add)(struct dpif *dpif, uint32_t bond_id,
> +                    odp_port_t slave_map[]);

I agree with Eelco that it's better to use pointer instead of [].

> +    /* Removes bond identified by 'bond_id' from 'dpif'. */
> +    int (*bond_del)(struct dpif *dpif, uint32_t bond_id);
> +    /* Reads bond stats from 'dpif'. */
> +    int (*bond_stats_get)(struct dpif *dpif, uint32_t bond_id,
> +                          uint64_t *n_bytes);
>  };
>  
>  extern const struct dpif_class dpif_netlink_class;
> diff --git a/lib/dpif.c b/lib/dpif.c
> index 9d9c716..22168de 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1170,6 +1170,7 @@ dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
>  
>      case OVS_ACTION_ATTR_CT:
>      case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>      case OVS_ACTION_ATTR_TUNNEL_POP:
>      case OVS_ACTION_ATTR_USERSPACE:
> @@ -1220,6 +1221,7 @@ dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
>          struct dp_packet *clone = NULL;
>          uint32_t cutlen = dp_packet_get_cutlen(packet);
>          if (cutlen && (type == OVS_ACTION_ATTR_OUTPUT
> +                        || type == OVS_ACTION_ATTR_LB_OUTPUT
>                          || type == OVS_ACTION_ATTR_TUNNEL_PUSH
>                          || type == OVS_ACTION_ATTR_TUNNEL_POP
>                          || type == OVS_ACTION_ATTR_USERSPACE)) {
> @@ -1879,6 +1881,16 @@ dpif_supports_explicit_drop_action(const struct dpif *dpif)
>      return dpif_is_netdev(dpif);
>  }
>  
> +bool
> +dpif_supports_lb_output_action(const struct dpif *dpif)
> +{
> +    /*
> +     * Balance-tcp optimization is currently supported in netdev
> +     * datapath only.
> +     */
> +    return dpif_is_netdev(dpif);
> +}
> +
>  /* Meters */
>  void
>  dpif_meter_get_features(const struct dpif *dpif,
> @@ -1976,3 +1988,40 @@ dpif_meter_del(struct dpif *dpif, ofproto_meter_id meter_id,
>      }
>      return error;
>  }
> +
> +int
> +dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t slave_map[])
> +{
> +    int error = -1;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> +        error = dpif->dpif_class->bond_add(dpif, bond_id, slave_map);
> +    }
> +
> +    return error;
> +}
> +
> +int
> +dpif_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> +    int error = -1;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> +        error = dpif->dpif_class->bond_del(dpif, bond_id);
> +    }
> +
> +    return error;
> +}
> +
> +int
> +dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                    uint64_t *n_bytes)
> +{
> +    int error = -1;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_stats_get) {
> +        error = dpif->dpif_class->bond_stats_get(dpif, bond_id, n_bytes);
> +    }
> +
> +    return error;
> +}
> diff --git a/lib/dpif.h b/lib/dpif.h
> index 4df8f7c..898fa25 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -907,6 +907,13 @@ char *dpif_get_dp_version(const struct dpif *);
>  bool dpif_supports_tnl_push_pop(const struct dpif *);
>  bool dpif_supports_explicit_drop_action(const struct dpif *);
>  
> +bool dpif_supports_lb_output_action(const struct dpif *);
> +
> +int dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t slave_map[]);
> +int dpif_bond_del(struct dpif *dpif, uint32_t bond_id);
> +int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                        uint64_t *n_bytes);
> +
>  /* Log functions. */
>  struct vlog_module;
>  
> diff --git a/lib/odp-execute.c b/lib/odp-execute.c
> index 42d3335..6018e37 100644
> --- a/lib/odp-execute.c
> +++ b/lib/odp-execute.c
> @@ -793,6 +793,7 @@ requires_datapath_assistance(const struct nlattr *a)
>      switch (type) {
>          /* These only make sense in the context of a datapath. */
>      case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>      case OVS_ACTION_ATTR_TUNNEL_POP:
>      case OVS_ACTION_ATTR_USERSPACE:
> @@ -1068,6 +1069,7 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
>              return;
>          }
>          case OVS_ACTION_ATTR_OUTPUT:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>          case OVS_ACTION_ATTR_TUNNEL_PUSH:
>          case OVS_ACTION_ATTR_TUNNEL_POP:
>          case OVS_ACTION_ATTR_USERSPACE:
> diff --git a/lib/odp-util.c b/lib/odp-util.c
> index 746d1e9..4ef4cdc 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> @@ -119,6 +119,7 @@ odp_action_len(uint16_t type)
>  
>      switch ((enum ovs_action_attr) type) {
>      case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t);
> +    case OVS_ACTION_ATTR_LB_OUTPUT: return sizeof(uint32_t);
>      case OVS_ACTION_ATTR_TRUNC: return sizeof(struct ovs_action_trunc);
>      case OVS_ACTION_ATTR_TUNNEL_PUSH: return ATTR_LEN_VARIABLE;
>      case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t);
> @@ -1122,6 +1123,9 @@ format_odp_action(struct ds *ds, const struct nlattr *a,
>      case OVS_ACTION_ATTR_OUTPUT:
>          odp_portno_name_format(portno_names, nl_attr_get_odp_port(a), ds);
>          break;
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
> +        ds_put_format(ds, "lb_output(bond,%"PRIu32")", nl_attr_get_u32(a));
> +        break;
>      case OVS_ACTION_ATTR_TRUNC: {
>          const struct ovs_action_trunc *trunc =
>                         nl_attr_get_unspec(a, sizeof *trunc);
> diff --git a/ofproto/bond.c b/ofproto/bond.c
> index 405202f..baeb5ee 100644
> --- a/ofproto/bond.c
> +++ b/ofproto/bond.c
> @@ -54,10 +54,6 @@ static struct ovs_rwlock rwlock = OVS_RWLOCK_INITIALIZER;
>  static struct hmap all_bonds__ = HMAP_INITIALIZER(&all_bonds__);
>  static struct hmap *const all_bonds OVS_GUARDED_BY(rwlock) = &all_bonds__;
>  
> -/* Bit-mask for hashing a flow down to a bucket. */
> -#define BOND_MASK 0xff
> -#define BOND_BUCKETS (BOND_MASK + 1)
> -
>  /* Priority for internal rules created to handle recirculation */
>  #define RECIRC_RULE_PRIORITY 20
>  
> @@ -126,6 +122,8 @@ struct bond {
>      enum lacp_status lacp_status; /* Status of LACP negotiations. */
>      bool bond_revalidate;       /* True if flows need revalidation. */
>      uint32_t basis;             /* Basis for flow hash function. */
> +    bool use_bond_cache;        /* Use bond cache to avoid recirculation.
> +                                   Applicable only for Balance TCP mode. */
>  
>      /* SLB specific bonding info. */
>      struct bond_entry *hash;     /* An array of BOND_BUCKETS elements. */
> @@ -185,7 +183,7 @@ static struct bond_slave *choose_output_slave(const struct bond *,
>                                                struct flow_wildcards *,
>                                                uint16_t vlan)
>      OVS_REQ_RDLOCK(rwlock);
> -static void update_recirc_rules__(struct bond *bond);
> +static void update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id);
>  static bool bond_is_falling_back_to_ab(const struct bond *);
>  
>  /* Attempts to parse 's' as the name of a bond balancing mode.  If successful,
> @@ -262,6 +260,7 @@ void
>  bond_unref(struct bond *bond)
>  {
>      struct bond_slave *slave;
> +    uint32_t bond_recirc_id = 0;
>  
>      if (!bond || ovs_refcount_unref_relaxed(&bond->ref_cnt) != 1) {
>          return;
> @@ -282,12 +281,13 @@ bond_unref(struct bond *bond)
>  
>      /* Free bond resources. Remove existing post recirc rules. */
>      if (bond->recirc_id) {
> +        bond_recirc_id = bond->recirc_id;
>          recirc_free_id(bond->recirc_id);
>          bond->recirc_id = 0;
>      }
>      free(bond->hash);
>      bond->hash = NULL;
> -    update_recirc_rules__(bond);
> +    update_recirc_rules__(bond, bond_recirc_id);
>  
>      hmap_destroy(&bond->pr_rule_ops);
>      free(bond->name);
> @@ -328,13 +328,14 @@ add_pr_rule(struct bond *bond, const struct match *match,
>   * lock annotation. Currently, only 'bond_unref()' calls
>   * this function directly.  */
>  static void
> -update_recirc_rules__(struct bond *bond)
> +update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id)
>  {
>      struct match match;
>      struct bond_pr_rule_op *pr_op, *next_op;
>      uint64_t ofpacts_stub[128 / 8];
>      struct ofpbuf ofpacts;
>      int i;
> +    ofp_port_t slave_map[BOND_MASK];
>  
>      ofpbuf_use_stub(&ofpacts, ofpacts_stub, sizeof ofpacts_stub);
>  
> @@ -353,8 +354,18 @@ update_recirc_rules__(struct bond *bond)
>  
>                  add_pr_rule(bond, &match, slave->ofp_port,
>                              &bond->hash[i].pr_rule);
> +                slave_map[i] = slave->ofp_port;
> +            } else {
> +                slave_map[i] = OFP_PORT_C(-1);


Fo rthe whole patch.  You, probably, need to use ODPP/OFPP_NONE instead.

>              }
>          }
> +        if (bond->ofproto->backer->rt_support.lb_output_action) {
> +            ofproto_dpif_bundle_add(bond->ofproto, bond->recirc_id, slave_map);
> +        }
> +    } else {
> +        if (bond->ofproto->backer->rt_support.lb_output_action) {
> +            ofproto_dpif_bundle_del(bond->ofproto, bond_recirc_id);
> +        }
>      }
>  
>      HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
> @@ -404,7 +415,7 @@ static void
>  update_recirc_rules(struct bond *bond)
>      OVS_REQ_RDLOCK(rwlock)
>  {
> -    update_recirc_rules__(bond);
> +    update_recirc_rules__(bond, bond->recirc_id);
>  }
>  
>  /* Updates 'bond''s overall configuration to 's'.
> @@ -467,6 +478,10 @@ bond_reconfigure(struct bond *bond, const struct bond_settings *s)
>          recirc_free_id(bond->recirc_id);
>          bond->recirc_id = 0;
>      }
> +    if (bond->use_bond_cache != s->use_bond_cache) {
> +        bond->use_bond_cache = s->use_bond_cache;
> +        revalidate = true;
> +    }
>  
>      if (bond->balance == BM_AB || !bond->hash || revalidate) {
>          bond_entry_reset(bond);
> @@ -944,6 +959,17 @@ bond_recirculation_account(struct bond *bond)
>      OVS_REQ_WRLOCK(rwlock)
>  {
>      int i;
> +    uint64_t n_bytes[BOND_BUCKETS] = {0};
> +    bool use_bond_cache =
> +        (bond->balance == BM_TCP)
> +        && bond->ofproto->backer->rt_support.lb_output_action
> +        && bond->use_bond_cache;
> +
> +    if (use_bond_cache) {
> +        /* Retrieve bond stats from datapath. */
> +        dpif_bond_stats_get(bond->ofproto->backer->dpif,
> +                            bond->recirc_id, n_bytes);
> +    }
>  
>      for (i=0; i<=BOND_MASK; i++) {
>          struct bond_entry *entry = &bond->hash[i];
> @@ -953,8 +979,12 @@ bond_recirculation_account(struct bond *bond)
>              struct pkt_stats stats;
>              long long int used OVS_UNUSED;
>  
> -            rule->ofproto->ofproto_class->rule_get_stats(
> -                rule, &stats, &used);
> +            if (!use_bond_cache) {
> +                rule->ofproto->ofproto_class->rule_get_stats(
> +                    rule, &stats, &used);
> +            } else {
> +                stats.n_bytes = n_bytes[i];
> +            }
>              bond_entry_account(entry, stats.n_bytes);
>          }
>      }
> @@ -1365,6 +1395,8 @@ bond_print_details(struct ds *ds, const struct bond *bond)
>                    may_recirc ? "yes" : "no", may_recirc ? recirc_id: -1);
>  
>      ds_put_format(ds, "bond-hash-basis: %"PRIu32"\n", bond->basis);
> +    ds_put_format(ds, "lb-output-action: %s\n",
> +                  bond->use_bond_cache ? "enabled" : "disabled");
>  
>      ds_put_format(ds, "updelay: %d ms\n", bond->updelay);
>      ds_put_format(ds, "downdelay: %d ms\n", bond->downdelay);
> @@ -1942,3 +1974,9 @@ bond_get_changed_active_slave(const char *name, struct eth_addr *mac,
>  
>      return false;
>  }
> +
> +bool
> +bond_get_cache_mode(const struct bond *bond)
> +{
> +    return (bond->balance == BM_TCP) && bond->use_bond_cache;
> +}
> diff --git a/ofproto/bond.h b/ofproto/bond.h
> index e7c3d9b..88a4de1 100644
> --- a/ofproto/bond.h
> +++ b/ofproto/bond.h
> @@ -22,6 +22,10 @@
>  #include "ofproto-provider.h"
>  #include "packets.h"
>  
> +/* Bit-mask for hashing a flow down to a bucket. */
> +#define BOND_MASK 0xff
> +#define BOND_BUCKETS (BOND_MASK + 1)
> +
>  struct flow;
>  struct netdev;
>  struct ofpbuf;
> @@ -58,6 +62,8 @@ struct bond_settings {
>                                  /* The MAC address of the interface
>                                     that was active during the last
>                                     ovs run. */
> +    bool use_bond_cache;        /* Use bond cache. Only applicable for
> +                                   bond mode BALANCE TCP. */
>  };
>  
>  /* Program startup. */
> @@ -122,4 +128,7 @@ void bond_rebalance(struct bond *);
>  */
>  void bond_update_post_recirc_rules(struct bond *, uint32_t *recirc_id,
>                                     uint32_t *hash_basis);
> +
> +bool bond_get_cache_mode(const struct bond *);

This is confusing name.

> +
>  #endif /* bond.h */
> diff --git a/ofproto/ofproto-dpif-ipfix.c b/ofproto/ofproto-dpif-ipfix.c
> index b413768..5030978 100644
> --- a/ofproto/ofproto-dpif-ipfix.c
> +++ b/ofproto/ofproto-dpif-ipfix.c
> @@ -3017,6 +3017,7 @@ dpif_ipfix_read_actions(const struct flow *flow,
>          case OVS_ACTION_ATTR_CHECK_PKT_LEN:
>          case OVS_ACTION_ATTR_UNSPEC:
>          case OVS_ACTION_ATTR_DROP:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>          case __OVS_ACTION_ATTR_MAX:
>          default:
>              break;
> diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
> index f9ea47a..21dafa6 100644
> --- a/ofproto/ofproto-dpif-sflow.c
> +++ b/ofproto/ofproto-dpif-sflow.c
> @@ -1177,6 +1177,7 @@ dpif_sflow_read_actions(const struct flow *flow,
>          case OVS_ACTION_ATTR_CT:
>      case OVS_ACTION_ATTR_CT_CLEAR:
>          case OVS_ACTION_ATTR_METER:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>              break;
>  
>          case OVS_ACTION_ATTR_SET_MASKED:
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 0b45ecf..1656d47 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -4187,7 +4187,6 @@ compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
>          xlate_commit_actions(ctx);
>  
>          if (xr) {
> -            /* Recirculate the packet. */
>              struct ovs_action_hash *act_hash;
>  
>              /* Hash action. */
> @@ -4196,15 +4195,28 @@ compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
>                  /* Algorithm supported by all datapaths. */
>                  hash_alg = OVS_HASH_ALG_L4;
>              }
> -            act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
> +
> +            if ((ctx->xbridge->support.lb_output_action) &&
> +                bond_get_cache_mode(xport->xbundle->bond)) {
> +                /*
> +                 * If bond mode is balance-tcp and optimize balance tcp
> +                 * is enabled then use the hash directly for slave selection
> +                 * and avoid recirculation.
> +                 *
> +                 * Currently support for netdev datapath only.
> +                 */
> +                nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_LB_OUTPUT,
> +                               xr->recirc_id);

This doesn't look correct.  Why you're using recirc_id as a bond identificator?
On configuration changes recirc_id will be cleared 

> +            } else {
> +                act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
>                                                  OVS_ACTION_ATTR_HASH,
>                                                  sizeof *act_hash);
> -            act_hash->hash_alg = hash_alg;
> -            act_hash->hash_basis = xr->hash_basis;
> -
> -            /* Recirc action. */
> -            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
> -                           xr->recirc_id);
> +                act_hash->hash_alg = hash_alg;
> +                act_hash->hash_basis = xr->hash_basis;
> +                /* Recirculate the packet. */
> +                nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
> +                               xr->recirc_id);
> +            }
>          } else if (is_native_tunnel) {
>              /* Output to native tunnel port. */
>              native_tunnel_output(ctx, xport, flow, odp_port, truncate);
> @@ -7265,7 +7277,8 @@ count_output_actions(const struct ofpbuf *odp_actions)
>      int n = 0;
>  
>      NL_ATTR_FOR_EACH_UNSAFE (a, left, odp_actions->data, odp_actions->size) {
> -        if (a->nla_type == OVS_ACTION_ATTR_OUTPUT) {
> +        if ((a->nla_type == OVS_ACTION_ATTR_OUTPUT) ||
> +            (a->nla_type == OVS_ACTION_ATTR_LB_OUTPUT)) {
>              n++;
>          }
>      }
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index 0222ec8..9981be6 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -1580,6 +1580,8 @@ check_support(struct dpif_backer *backer)
>      backer->rt_support.ct_timeout = check_ct_timeout_policy(backer);
>      backer->rt_support.explicit_drop_action =
>          dpif_supports_explicit_drop_action(backer->dpif);
> +    backer->rt_support.lb_output_action=
> +        dpif_supports_lb_output_action(backer->dpif);
>  
>      /* Flow fields. */
>      backer->rt_support.odp.ct_state = check_ct_state(backer);
> @@ -3429,6 +3431,27 @@ bundle_remove(struct ofport *port_)
>      }
>  }
>  
> +int
> +ofproto_dpif_bundle_add(struct ofproto_dpif *ofproto,

Why this function has 'bundle' in its name?

> +                        uint32_t bond_id, const ofp_port_t slave_map[])
> +{
> +    /* Convert ofp_port to odp_port */
> +    odp_port_t odp_map[BOND_BUCKETS];
> +    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +        odp_map[bucket] = (slave_map[bucket] == OFP_PORT_C(-1)
> +                           ? ODP_PORT_C(-1)
> +                           : ofp_port_to_odp_port(ofproto, slave_map[bucket]));
> +    }
> +
> +    return dpif_bond_add(ofproto->backer->dpif, bond_id, odp_map);
> +}
> +
> +int
> +ofproto_dpif_bundle_del(struct ofproto_dpif *ofproto, uint32_t bond_id)
> +{
> +    return dpif_bond_del(ofproto->backer->dpif, bond_id);
> +}
> +
>  static void
>  send_pdu_cb(void *port_, const void *pdu, size_t pdu_size)
>  {
> @@ -5559,6 +5582,7 @@ get_datapath_cap(const char *datapath_type, struct smap *cap)
>      smap_add(cap, "ct_timeout", s.ct_timeout ? "true" : "false");
>      smap_add(cap, "explicit_drop_action",
>               s.explicit_drop_action ? "true" :"false");
> +    smap_add(cap, "lb_output_action", s.lb_output_action ? "true" : "false");
>  }
>  
>  /* Gets timeout policy name in 'backer' based on 'zone', 'dl_type' and
> diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
> index c9d5df3..384a269 100644
> --- a/ofproto/ofproto-dpif.h
> +++ b/ofproto/ofproto-dpif.h
> @@ -202,8 +202,10 @@ struct group_dpif *group_dpif_lookup(struct ofproto_dpif *,
>      DPIF_SUPPORT_FIELD(bool, ct_timeout, "Conntrack timeout policy")        \
>                                                                              \
>      /* True if the datapath supports explicit drop action. */               \
> -    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action")
> -
> +    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action")  \
> +                                                                            \
> +    /* True if the datapath supports balance_tcp optimization */            \
> +    DPIF_SUPPORT_FIELD(bool, lb_output_action, "Optimized Balance TCP mode")
>  
>  /* Stores the various features which the corresponding backer supports. */
>  struct dpif_backer_support {
> @@ -379,6 +381,9 @@ int ofproto_dpif_add_internal_flow(struct ofproto_dpif *,
>                                     struct rule **rulep);
>  int ofproto_dpif_delete_internal_flow(struct ofproto_dpif *, struct match *,
>                                        int priority);
> +int ofproto_dpif_bundle_add(struct ofproto_dpif *, uint32_t bond_id,
> +                            const ofp_port_t slave_map[]);
> +int ofproto_dpif_bundle_del(struct ofproto_dpif *, uint32_t bond_id);
>  
>  bool ovs_native_tunneling_is_on(struct ofproto_dpif *);
>  
> diff --git a/tests/lacp.at b/tests/lacp.at
> index 7b460d7..c881a0b 100644
> --- a/tests/lacp.at
> +++ b/tests/lacp.at
> @@ -121,6 +121,7 @@ AT_CHECK([ovs-appctl bond/show], [0], [dnl
>  bond_mode: active-backup
>  bond may use recirculation: no, Recirc-ID : -1
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -286,6 +287,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -301,6 +303,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -423,6 +426,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -440,6 +444,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -555,6 +560,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -572,6 +578,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -692,6 +699,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -709,6 +717,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index fe73c38..4a5e893 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -4600,6 +4600,11 @@ port_configure_bond(struct port *port, struct bond_settings *s)
>          /* OVSDB did not store the last active interface */
>          s->active_slave_mac = eth_addr_zero;
>      }
> +
> +    /* use_bond_cache is disabled by default. */
> +    s->use_bond_cache = (s->balance == BM_TCP)
> +                        && smap_get_bool(&port->cfg->other_config,
> +                                      "lb-output-action", false);
>  }
>  
>  /* Returns true if 'port' is synthetic, that is, if we constructed it locally
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 4a74ed3..3da8d0c 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -1994,6 +1994,16 @@
>          <code>active-backup</code>.
>        </column>
>  
> +      <column name="other_config" key="lb-output-action"
> +              type='{"type": "boolean"}'>
> +        Enable/disable usage of optimized lb-output-action for
> +        balancing flows among output slaves in load balanced bonds in
> +        <code>balance-tcp</code>. When enabled, it uses optimized path for
> +        balance-tcp mode by using rss hash and avoids recirculation.
> +        It affects only new flows, i.e, existing flows remain unchanged.
> +        This knob does not affect other balancing modes.
> +      </column>
> +
>        <group title="Link Failure Detection">
>          <p>
>            An important part of link bonding is detecting that links are down so
> @@ -5788,6 +5798,19 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          higher performance for MPLS and active-active load balancing
>          bonding modes.
>        </column>
> +      <column name="capabilities" key="lb_output_action"
> +              type='{"type": "boolean"}'>
> +        If this is true, then the datapath supports optimized balance-tcp
> +        bond mode. This capability replaces existing <code>hash</code> and
> +        <code>recirc</code> actions with new action <code>lb_output</code>
> +        and avoids recirculation of packet in datapath. It is supported
> +        only for balance-tcp bond mode in netdev datapath. The new action
> +        gives higer performance by using bond buckets instead of post
> +        recirculation flows for selection of slave port from bond. By default
> +        this new action is disabled, however it can be enabled by setting
> +        <ref column="other-config" key="lb-output-action"/> in
> +        <ref table="Port"/> table.
> +      </column>
>        <group title="Connection-Tracking Capabilities">
>          <p>
>            These capabilities are granular because Open vSwitch and its
>
diff mbox series

Patch

diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
index 2f0c655..7604a2a 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -1021,6 +1021,7 @@  enum ovs_action_attr {
 	OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
 	OVS_ACTION_ATTR_TUNNEL_POP,    /* u32 port number. */
 	OVS_ACTION_ATTR_DROP,          /* u32 xlate_error. */
+	OVS_ACTION_ATTR_LB_OUTPUT,     /* u32 bond-id */
 #endif
 	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
 				       * from userspace. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index d393aab..f550192 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -79,6 +79,7 @@ 
 #include "unixctl.h"
 #include "util.h"
 #include "uuid.h"
+#include "ofproto/bond.h"
 
 VLOG_DEFINE_THIS_MODULE(dpif_netdev);
 
@@ -376,6 +377,12 @@  struct dp_netdev {
 
     struct conntrack *conntrack;
     struct pmd_auto_lb pmd_alb;
+
+    /* Bonds.
+     *
+     * Any lookup into 'bonds' requires taking 'bond_mutex'. */
+    struct ovs_mutex bond_mutex;
+    struct cmap tx_bonds; /* Contains "struct tx_bond"s. */
 };
 
 static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
@@ -607,6 +614,20 @@  struct tx_port {
     struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
 };
 
+/* Contained by struct tx_bond 'slave_buckets' */
+struct slave_entry {
+    odp_port_t slave_id;
+    atomic_ullong n_packets;
+    atomic_ullong n_bytes;
+};
+
+/* Contained by struct dp_netdev_pmd_thread's 'tx_bonds'. */
+struct tx_bond {
+    struct cmap_node node;
+    uint32_t bond_id;
+    struct slave_entry slave_buckets[BOND_BUCKETS];
+};
+
 /* A set of properties for the current processing loop that is not directly
  * associated with the pmd thread itself, but with the packets being
  * processed or the short-term system configuration (for example, time).
@@ -739,6 +760,11 @@  struct dp_netdev_pmd_thread {
      * read by the pmd thread. */
     struct hmap tx_ports OVS_GUARDED;
 
+    struct ovs_mutex bond_mutex;    /* Mutex for 'tx_bonds'. */
+    /* Map of 'tx_bond's used for transmission.  Written by the main thread,
+     * read/written by the pmd thread. */
+    struct cmap tx_bonds OVS_GUARDED;
+
     /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
      * ports (that support push_tunnel/pop_tunnel), the other contains ports
      * with at least one txq (that support send).  A port can be in both.
@@ -830,6 +856,12 @@  static void dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
 static int
 dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
                                    bool force);
+static void dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
+                                         struct tx_bond *bond)
+    OVS_REQUIRES(pmd->bond_mutex);
+static void dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
+                                           struct tx_bond *tx)
+    OVS_REQUIRES(pmd->bond_mutex);
 
 static void reconfigure_datapath(struct dp_netdev *dp)
     OVS_REQUIRES(dp->port_mutex);
@@ -1396,6 +1428,45 @@  pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
     par.command_type = PMD_INFO_PERF_SHOW;
     dpif_netdev_pmd_info(conn, argc, argv, &par);
 }
+
+static void
+dpif_netdev_dp_bond_show(struct unixctl_conn *conn, int argc,
+                         const char *argv[], void *aux OVS_UNUSED)
+{
+    struct ds reply = DS_EMPTY_INITIALIZER;
+
+    ovs_mutex_lock(&dp_netdev_mutex);
+
+    struct dp_netdev *dp = NULL;
+    if (argc == 2) {
+        dp = shash_find_data(&dp_netdevs, argv[1]);
+    } else if (shash_count(&dp_netdevs) == 1) {
+        /* There's only one datapath */
+        dp = shash_first(&dp_netdevs)->data;
+    }
+    if (!dp) {
+        ovs_mutex_unlock(&dp_netdev_mutex);
+        unixctl_command_reply_error(conn,
+                                    "please specify an existing datapath");
+        return;
+    }
+    ds_put_cstr(&reply, "\nBonds:\n");
+
+    struct tx_bond *dp_bond_entry;
+    CMAP_FOR_EACH (dp_bond_entry, node, &dp->tx_bonds) {
+        ds_put_format(&reply, "\tbond-id %u :\n",
+                      dp_bond_entry->bond_id);
+        for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
+            ds_put_format(&reply, "\t\tbucket %u - slave %u \n",
+                      bucket,
+                      dp_bond_entry->slave_buckets[bucket].slave_id);
+        }
+    }
+    ovs_mutex_unlock(&dp_netdev_mutex);
+    unixctl_command_reply(conn, ds_cstr(&reply));
+    ds_destroy(&reply);
+}
+
 
 static int
 dpif_netdev_init(void)
@@ -1427,6 +1498,9 @@  dpif_netdev_init(void)
                              "[-us usec] [-q qlen]",
                              0, 10, pmd_perf_log_set_cmd,
                              NULL);
+    unixctl_command_register("dpif-netdev/dp-bond-show", "[dp]",
+                             0, 1, dpif_netdev_dp_bond_show,
+                             NULL);
     return 0;
 }
 
@@ -1551,6 +1625,9 @@  create_dp_netdev(const char *name, const struct dpif_class *class,
     ovs_mutex_init_recursive(&dp->port_mutex);
     hmap_init(&dp->ports);
     dp->port_seq = seq_create();
+    ovs_mutex_init(&dp->bond_mutex);
+    cmap_init(&dp->tx_bonds);
+
     fat_rwlock_init(&dp->upcall_rwlock);
 
     dp->reconfigure_seq = seq_create();
@@ -1657,6 +1734,12 @@  dp_delete_meter(struct dp_netdev *dp, uint32_t meter_id)
     }
 }
 
+static uint32_t
+hash_bond_id(uint32_t bond_id)
+{
+    return hash_int(bond_id, 0);
+}
+
 /* Requires dp_netdev_mutex so that we can't get a new reference to 'dp'
  * through the 'dp_netdevs' shash while freeing 'dp'. */
 static void
@@ -1664,6 +1747,7 @@  dp_netdev_free(struct dp_netdev *dp)
     OVS_REQUIRES(dp_netdev_mutex)
 {
     struct dp_netdev_port *port, *next;
+    struct tx_bond *bond;
 
     shash_find_and_delete(&dp_netdevs, dp->name);
 
@@ -1673,6 +1757,13 @@  dp_netdev_free(struct dp_netdev *dp)
     }
     ovs_mutex_unlock(&dp->port_mutex);
 
+    ovs_mutex_lock(&dp->bond_mutex);
+    CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
+        cmap_remove(&dp->tx_bonds, &bond->node, hash_bond_id(bond->bond_id));
+        ovsrcu_postpone(free, bond);
+    }
+    ovs_mutex_unlock(&dp->bond_mutex);
+
     dp_netdev_destroy_all_pmds(dp, true);
     cmap_destroy(&dp->poll_threads);
 
@@ -1691,6 +1782,9 @@  dp_netdev_free(struct dp_netdev *dp)
     hmap_destroy(&dp->ports);
     ovs_mutex_destroy(&dp->port_mutex);
 
+    cmap_destroy(&dp->tx_bonds);
+    ovs_mutex_destroy(&dp->bond_mutex);
+
     /* Upcalls must be disabled at this point */
     dp_netdev_destroy_upcall_lock(dp);
 
@@ -4421,6 +4515,20 @@  tx_port_lookup(const struct hmap *hmap, odp_port_t port_no)
     return NULL;
 }
 
+static struct tx_bond *
+tx_bond_lookup(const struct cmap *tx_bonds, uint32_t bond_id)
+{
+    struct tx_bond *tx;
+    const struct cmap_node *pnode;
+
+    pnode = cmap_find(tx_bonds, hash_bond_id(bond_id));
+    if (!pnode) {
+        return NULL;
+    }
+    tx = CONTAINER_OF(pnode, struct tx_bond, node);
+    return tx;
+}
+
 static int
 port_reconfigure(struct dp_netdev_port *port)
 {
@@ -5072,6 +5180,21 @@  reconfigure_datapath(struct dp_netdev *dp)
         ovs_mutex_unlock(&pmd->port_mutex);
     }
 
+    /* Add every bond to the cmap of each pmd thread, if it's not
+     * there already and if this pmd has at least one rxq to poll. */
+    ovs_mutex_lock(&dp->bond_mutex);
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        ovs_mutex_lock(&pmd->bond_mutex);
+        if (hmap_count(&pmd->poll_list) || pmd->core_id == NON_PMD_CORE_ID) {
+            struct tx_bond *bond;
+            CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
+                dp_netdev_add_bond_tx_to_pmd(pmd, bond);
+            }
+        }
+        ovs_mutex_unlock(&pmd->bond_mutex);
+    }
+    ovs_mutex_unlock(&dp->bond_mutex);
+
     /* Reload affected pmd threads. */
     reload_affected_pmds(dp);
 
@@ -6110,6 +6233,7 @@  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     atomic_init(&pmd->reload, false);
     ovs_mutex_init(&pmd->flow_mutex);
     ovs_mutex_init(&pmd->port_mutex);
+    ovs_mutex_init(&pmd->bond_mutex);
     cmap_init(&pmd->flow_table);
     cmap_init(&pmd->classifiers);
     pmd->ctx.last_rxq = NULL;
@@ -6120,6 +6244,7 @@  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     hmap_init(&pmd->tx_ports);
     hmap_init(&pmd->tnl_port_cache);
     hmap_init(&pmd->send_port_cache);
+    cmap_init(&pmd->tx_bonds);
     /* init the 'flow_cache' since there is no
      * actual thread created for NON_PMD_CORE_ID. */
     if (core_id == NON_PMD_CORE_ID) {
@@ -6135,11 +6260,17 @@  static void
 dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
 {
     struct dpcls *cls;
+    struct tx_bond *tx;
 
     dp_netdev_pmd_flow_flush(pmd);
     hmap_destroy(&pmd->send_port_cache);
     hmap_destroy(&pmd->tnl_port_cache);
     hmap_destroy(&pmd->tx_ports);
+    CMAP_FOR_EACH (tx, node, &pmd->tx_bonds) {
+        cmap_remove(&pmd->tx_bonds, &tx->node, hash_bond_id(tx->bond_id));
+        ovsrcu_postpone(free, tx);
+    }
+    cmap_destroy(&pmd->tx_bonds);
     hmap_destroy(&pmd->poll_list);
     /* All flows (including their dpcls_rules) have been deleted already */
     CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
@@ -6151,6 +6282,7 @@  dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
     ovs_mutex_destroy(&pmd->flow_mutex);
     seq_destroy(pmd->reload_seq);
     ovs_mutex_destroy(&pmd->port_mutex);
+    ovs_mutex_destroy(&pmd->bond_mutex);
     free(pmd);
 }
 
@@ -6304,6 +6436,41 @@  dp_netdev_del_port_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
     free(tx);
     pmd->need_reload = true;
 }
+
+static void
+dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
+                             struct tx_bond *bond)
+    OVS_REQUIRES(pmd->bond_mutex)
+{
+    struct tx_bond *tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);
+    if (tx) {
+        struct tx_bond *new_tx = xmemdup(bond, sizeof *tx);
+        /* Copy the stats for each bucket. */
+        for (int i = 0;i < BOND_BUCKETS; i++) {
+            uint64_t n_packets, n_bytes;
+            atomic_read_relaxed(&tx->slave_buckets[i].n_packets, &n_packets);
+            atomic_read_relaxed(&tx->slave_buckets[i].n_bytes, &n_bytes);
+            new_tx->slave_buckets[i].n_packets = n_packets;
+            new_tx->slave_buckets[i].n_bytes = n_bytes;
+        }
+        cmap_replace(&pmd->tx_bonds, &tx->node, &new_tx->node,
+                     hash_bond_id(bond->bond_id));
+        ovsrcu_postpone(free, tx);
+    } else {
+        tx = xmemdup(bond, sizeof *tx);
+        cmap_insert(&pmd->tx_bonds, &tx->node, hash_bond_id(bond->bond_id));
+    }
+}
+
+/* Del 'tx' from the tx bond cache of 'pmd' */
+static void
+dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
+                               struct tx_bond *tx)
+    OVS_REQUIRES(pmd->bond_mutex)
+{
+    cmap_remove(&pmd->tx_bonds, &tx->node, hash_bond_id(tx->bond_id));
+    ovsrcu_postpone(free, tx);
+}
 
 static char *
 dpif_netdev_get_datapath_version(void)
@@ -7129,6 +7296,94 @@  dp_execute_userspace_action(struct dp_netdev_pmd_thread *pmd,
     }
 }
 
+static bool
+dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
+                         struct dp_packet_batch *packets_,
+                         bool should_steal, odp_port_t port_no)
+{
+    struct tx_port *p = pmd_send_port_cache_lookup(pmd, port_no);
+    if (!OVS_LIKELY(p)) {
+        return false;
+    }
+
+    struct dp_packet_batch out;
+    if (!should_steal) {
+        dp_packet_batch_clone(&out, packets_);
+        dp_packet_batch_reset_cutlen(packets_);
+        packets_ = &out;
+    }
+    dp_packet_batch_apply_cutlen(packets_);
+#ifdef DPDK_NETDEV
+    if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
+                     && packets_->packets[0]->source
+                     != p->output_pkts.packets[0]->source)) {
+        /* netdev-dpdk assumes that all packets in a single
+         * output batch has the same source. Flush here to
+         * avoid memory access issues. */
+        dp_netdev_pmd_flush_output_on_port(pmd, p);
+    }
+#endif
+    if (dp_packet_batch_size(&p->output_pkts)
+        + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
+        /* Flush here to avoid overflow. */
+        dp_netdev_pmd_flush_output_on_port(pmd, p);
+    }
+    if (dp_packet_batch_is_empty(&p->output_pkts)) {
+        pmd->n_output_batches++;
+    }
+
+    struct dp_packet *packet;
+    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
+        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
+            pmd->ctx.last_rxq;
+        dp_packet_batch_add(&p->output_pkts, packet);
+    }
+    return true;
+}
+
+static bool
+dp_execute_lb_output_action(struct dp_netdev_pmd_thread *pmd,
+                            struct dp_packet_batch *packets_,
+                            bool should_steal, uint32_t bond)
+{
+    struct tx_bond *p_bond = tx_bond_lookup(&pmd->tx_bonds, bond);
+    if (!p_bond) {
+        return false;
+    }
+
+    struct dp_packet_batch del_pkts;
+    dp_packet_batch_init(&del_pkts);
+
+    struct dp_packet *packet;
+    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
+        /*
+         * Lookup the bond-hash table using hash to get the slave.
+         */
+        uint32_t hash = dp_packet_get_rss_hash(packet);
+        struct slave_entry *s_entry = &p_bond->slave_buckets[hash & BOND_MASK];
+        odp_port_t bond_member = s_entry->slave_id;
+        uint32_t size = dp_packet_size(packet);
+
+        struct dp_packet_batch output_pkt;
+        dp_packet_batch_init_packet(&output_pkt, packet);
+        if (OVS_LIKELY(dp_execute_output_action(pmd, &output_pkt, should_steal,
+                                                bond_member))) {
+            /* Update slave stats. */
+            non_atomic_ullong_add(&s_entry->n_packets, 1);
+            non_atomic_ullong_add(&s_entry->n_bytes, size);
+        } else {
+            dp_packet_batch_add(&del_pkts, packet);
+        }
+    }
+
+    /* Delete packets that failed OUTPUT action */
+    COVERAGE_ADD(datapath_drop_invalid_port,
+                 dp_packet_batch_size(&del_pkts));
+    dp_packet_delete_batch(&del_pkts, should_steal);
+
+    return true;
+}
+
 static void
 dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
               const struct nlattr *a, bool should_steal)
@@ -7144,43 +7399,18 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
 
     switch ((enum ovs_action_attr)type) {
     case OVS_ACTION_ATTR_OUTPUT:
-        p = pmd_send_port_cache_lookup(pmd, nl_attr_get_odp_port(a));
-        if (OVS_LIKELY(p)) {
-            struct dp_packet *packet;
-            struct dp_packet_batch out;
-
-            if (!should_steal) {
-                dp_packet_batch_clone(&out, packets_);
-                dp_packet_batch_reset_cutlen(packets_);
-                packets_ = &out;
-            }
-            dp_packet_batch_apply_cutlen(packets_);
-
-#ifdef DPDK_NETDEV
-            if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
-                             && packets_->packets[0]->source
-                                != p->output_pkts.packets[0]->source)) {
-                /* XXX: netdev-dpdk assumes that all packets in a single
-                 *      output batch has the same source. Flush here to
-                 *      avoid memory access issues. */
-                dp_netdev_pmd_flush_output_on_port(pmd, p);
-            }
-#endif
-            if (dp_packet_batch_size(&p->output_pkts)
-                + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
-                /* Flush here to avoid overflow. */
-                dp_netdev_pmd_flush_output_on_port(pmd, p);
-            }
-
-            if (dp_packet_batch_is_empty(&p->output_pkts)) {
-                pmd->n_output_batches++;
-            }
+        if (dp_execute_output_action(pmd, packets_, should_steal,
+                                     nl_attr_get_odp_port(a))) {
+            return;
+        } else {
+            COVERAGE_ADD(datapath_drop_invalid_port,
+                         dp_packet_batch_size(packets_));
+        }
+        break;
 
-            DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
-                p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
-                                                             pmd->ctx.last_rxq;
-                dp_packet_batch_add(&p->output_pkts, packet);
-            }
+    case OVS_ACTION_ATTR_LB_OUTPUT:
+        if (dp_execute_lb_output_action(pmd, packets_, should_steal,
+                                        nl_attr_get_u32(a))) {
             return;
         } else {
             COVERAGE_ADD(datapath_drop_invalid_port,
@@ -7738,6 +7968,100 @@  dpif_netdev_ipf_dump_done(struct dpif *dpif OVS_UNUSED, void *ipf_dump_ctx)
 
 }
 
+static int
+dpif_netdev_bond_add(struct dpif *dpif, uint32_t bond_id,
+                     odp_port_t slave_map[])
+{
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+    struct dp_netdev_pmd_thread *pmd;
+
+    /* Prepare new bond mapping. */
+    struct tx_bond *new_tx = xzalloc(sizeof(struct tx_bond));
+
+    new_tx->bond_id = bond_id;
+    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
+        new_tx->slave_buckets[bucket].slave_id = slave_map[bucket];
+    }
+
+    /* Check if bond already existed */
+    struct tx_bond *old_tx = tx_bond_lookup(&dp->tx_bonds, bond_id);
+    ovs_mutex_lock(&dp->bond_mutex);
+    if (old_tx) {
+        cmap_replace(&dp->tx_bonds, &old_tx->node, &new_tx->node,
+                     hash_bond_id(bond_id));
+        ovsrcu_postpone(free, old_tx);
+    } else {
+        cmap_insert(&dp->tx_bonds, &new_tx->node,
+                    hash_bond_id(bond_id));
+    }
+    ovs_mutex_unlock(&dp->bond_mutex);
+
+    /* Update all PMDs with new bond mapping. */
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        ovs_mutex_lock(&pmd->bond_mutex);
+        dp_netdev_add_bond_tx_to_pmd(pmd, new_tx);
+        ovs_mutex_unlock(&pmd->bond_mutex);
+    }
+    return 0;
+}
+
+static int
+dpif_netdev_bond_del(struct dpif *dpif, uint32_t bond_id)
+{
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+
+    /* Check if bond existed. */
+    struct tx_bond *tx = tx_bond_lookup(&dp->tx_bonds, bond_id);
+    if (tx) {
+        ovs_mutex_lock(&dp->bond_mutex);
+        cmap_remove(&dp->tx_bonds, &tx->node, hash_bond_id(bond_id));
+        ovsrcu_postpone(free, tx);
+        ovs_mutex_unlock(&dp->bond_mutex);
+    } else {
+        /* Bond is not present. */
+        return 0;
+    }
+
+    /* Remove the bond map in all pmds. */
+    struct dp_netdev_pmd_thread *pmd;
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        tx = tx_bond_lookup(&pmd->tx_bonds, bond_id);
+        if (tx) {
+            ovs_mutex_lock(&pmd->bond_mutex);
+            dp_netdev_del_bond_tx_from_pmd(pmd, tx);
+            ovs_mutex_unlock(&pmd->bond_mutex);
+        }
+    }
+    return 0;
+}
+
+static int
+dpif_netdev_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
+                           uint64_t *n_bytes)
+{
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+    struct dp_netdev_pmd_thread *pmd;
+
+    /* Search the bond in all PMDs */
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        struct tx_bond *pmd_bond_entry
+            = tx_bond_lookup(&pmd->tx_bonds, bond_id);
+        ovs_mutex_lock(&pmd->bond_mutex);
+        if (pmd_bond_entry) {
+             /* Read bond stats. */
+            for (int i = 0; i < BOND_BUCKETS; i++) {
+                uint64_t pmd_n_bytes;
+                atomic_read_relaxed(
+                     &pmd_bond_entry->slave_buckets[i].n_bytes,
+                     &pmd_n_bytes);
+                n_bytes[i] += pmd_n_bytes;
+            }
+        }
+        ovs_mutex_unlock(&pmd->bond_mutex);
+    }
+    return 0;
+}
+
 const struct dpif_class dpif_netdev_class = {
     "netdev",
     true,                       /* cleanup_required */
@@ -7811,6 +8135,9 @@  const struct dpif_class dpif_netdev_class = {
     dpif_netdev_meter_set,
     dpif_netdev_meter_get,
     dpif_netdev_meter_del,
+    dpif_netdev_bond_add,
+    dpif_netdev_bond_del,
+    dpif_netdev_bond_stats_get,
 };
 
 static void
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 5b5c96d..116f8fc 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -3993,6 +3993,9 @@  const struct dpif_class dpif_netlink_class = {
     dpif_netlink_meter_set,
     dpif_netlink_meter_get,
     dpif_netlink_meter_del,
+    NULL,                       /* bond_add */
+    NULL,                       /* bond_del */
+    NULL,                       /* bond_stats_get */
 };
 
 static int
diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
index b77317b..4415836 100644
--- a/lib/dpif-provider.h
+++ b/lib/dpif-provider.h
@@ -616,6 +616,15 @@  struct dpif_class {
      * zero. */
     int (*meter_del)(struct dpif *, ofproto_meter_id meter_id,
                      struct ofputil_meter_stats *, uint16_t n_bands);
+
+    /* Adds a bond with 'bond_id' and the slave-map to 'dpif'. */
+    int (*bond_add)(struct dpif *dpif, uint32_t bond_id,
+                    odp_port_t slave_map[]);
+    /* Removes bond identified by 'bond_id' from 'dpif'. */
+    int (*bond_del)(struct dpif *dpif, uint32_t bond_id);
+    /* Reads bond stats from 'dpif'. */
+    int (*bond_stats_get)(struct dpif *dpif, uint32_t bond_id,
+                          uint64_t *n_bytes);
 };
 
 extern const struct dpif_class dpif_netlink_class;
diff --git a/lib/dpif.c b/lib/dpif.c
index 9d9c716..22168de 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -1170,6 +1170,7 @@  dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
 
     case OVS_ACTION_ATTR_CT:
     case OVS_ACTION_ATTR_OUTPUT:
+    case OVS_ACTION_ATTR_LB_OUTPUT:
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
     case OVS_ACTION_ATTR_TUNNEL_POP:
     case OVS_ACTION_ATTR_USERSPACE:
@@ -1220,6 +1221,7 @@  dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
         struct dp_packet *clone = NULL;
         uint32_t cutlen = dp_packet_get_cutlen(packet);
         if (cutlen && (type == OVS_ACTION_ATTR_OUTPUT
+                        || type == OVS_ACTION_ATTR_LB_OUTPUT
                         || type == OVS_ACTION_ATTR_TUNNEL_PUSH
                         || type == OVS_ACTION_ATTR_TUNNEL_POP
                         || type == OVS_ACTION_ATTR_USERSPACE)) {
@@ -1879,6 +1881,16 @@  dpif_supports_explicit_drop_action(const struct dpif *dpif)
     return dpif_is_netdev(dpif);
 }
 
+bool
+dpif_supports_lb_output_action(const struct dpif *dpif)
+{
+    /*
+     * Balance-tcp optimization is currently supported in netdev
+     * datapath only.
+     */
+    return dpif_is_netdev(dpif);
+}
+
 /* Meters */
 void
 dpif_meter_get_features(const struct dpif *dpif,
@@ -1976,3 +1988,40 @@  dpif_meter_del(struct dpif *dpif, ofproto_meter_id meter_id,
     }
     return error;
 }
+
+int
+dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t slave_map[])
+{
+    int error = -1;
+
+    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
+        error = dpif->dpif_class->bond_add(dpif, bond_id, slave_map);
+    }
+
+    return error;
+}
+
+int
+dpif_bond_del(struct dpif *dpif, uint32_t bond_id)
+{
+    int error = -1;
+
+    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
+        error = dpif->dpif_class->bond_del(dpif, bond_id);
+    }
+
+    return error;
+}
+
+int
+dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
+                    uint64_t *n_bytes)
+{
+    int error = -1;
+
+    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_stats_get) {
+        error = dpif->dpif_class->bond_stats_get(dpif, bond_id, n_bytes);
+    }
+
+    return error;
+}
diff --git a/lib/dpif.h b/lib/dpif.h
index 4df8f7c..898fa25 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -907,6 +907,13 @@  char *dpif_get_dp_version(const struct dpif *);
 bool dpif_supports_tnl_push_pop(const struct dpif *);
 bool dpif_supports_explicit_drop_action(const struct dpif *);
 
+bool dpif_supports_lb_output_action(const struct dpif *);
+
+int dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t slave_map[]);
+int dpif_bond_del(struct dpif *dpif, uint32_t bond_id);
+int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
+                        uint64_t *n_bytes);
+
 /* Log functions. */
 struct vlog_module;
 
diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 42d3335..6018e37 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -793,6 +793,7 @@  requires_datapath_assistance(const struct nlattr *a)
     switch (type) {
         /* These only make sense in the context of a datapath. */
     case OVS_ACTION_ATTR_OUTPUT:
+    case OVS_ACTION_ATTR_LB_OUTPUT:
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
     case OVS_ACTION_ATTR_TUNNEL_POP:
     case OVS_ACTION_ATTR_USERSPACE:
@@ -1068,6 +1069,7 @@  odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
             return;
         }
         case OVS_ACTION_ATTR_OUTPUT:
+        case OVS_ACTION_ATTR_LB_OUTPUT:
         case OVS_ACTION_ATTR_TUNNEL_PUSH:
         case OVS_ACTION_ATTR_TUNNEL_POP:
         case OVS_ACTION_ATTR_USERSPACE:
diff --git a/lib/odp-util.c b/lib/odp-util.c
index 746d1e9..4ef4cdc 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -119,6 +119,7 @@  odp_action_len(uint16_t type)
 
     switch ((enum ovs_action_attr) type) {
     case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t);
+    case OVS_ACTION_ATTR_LB_OUTPUT: return sizeof(uint32_t);
     case OVS_ACTION_ATTR_TRUNC: return sizeof(struct ovs_action_trunc);
     case OVS_ACTION_ATTR_TUNNEL_PUSH: return ATTR_LEN_VARIABLE;
     case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t);
@@ -1122,6 +1123,9 @@  format_odp_action(struct ds *ds, const struct nlattr *a,
     case OVS_ACTION_ATTR_OUTPUT:
         odp_portno_name_format(portno_names, nl_attr_get_odp_port(a), ds);
         break;
+    case OVS_ACTION_ATTR_LB_OUTPUT:
+        ds_put_format(ds, "lb_output(bond,%"PRIu32")", nl_attr_get_u32(a));
+        break;
     case OVS_ACTION_ATTR_TRUNC: {
         const struct ovs_action_trunc *trunc =
                        nl_attr_get_unspec(a, sizeof *trunc);
diff --git a/ofproto/bond.c b/ofproto/bond.c
index 405202f..baeb5ee 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -54,10 +54,6 @@  static struct ovs_rwlock rwlock = OVS_RWLOCK_INITIALIZER;
 static struct hmap all_bonds__ = HMAP_INITIALIZER(&all_bonds__);
 static struct hmap *const all_bonds OVS_GUARDED_BY(rwlock) = &all_bonds__;
 
-/* Bit-mask for hashing a flow down to a bucket. */
-#define BOND_MASK 0xff
-#define BOND_BUCKETS (BOND_MASK + 1)
-
 /* Priority for internal rules created to handle recirculation */
 #define RECIRC_RULE_PRIORITY 20
 
@@ -126,6 +122,8 @@  struct bond {
     enum lacp_status lacp_status; /* Status of LACP negotiations. */
     bool bond_revalidate;       /* True if flows need revalidation. */
     uint32_t basis;             /* Basis for flow hash function. */
+    bool use_bond_cache;        /* Use bond cache to avoid recirculation.
+                                   Applicable only for Balance TCP mode. */
 
     /* SLB specific bonding info. */
     struct bond_entry *hash;     /* An array of BOND_BUCKETS elements. */
@@ -185,7 +183,7 @@  static struct bond_slave *choose_output_slave(const struct bond *,
                                               struct flow_wildcards *,
                                               uint16_t vlan)
     OVS_REQ_RDLOCK(rwlock);
-static void update_recirc_rules__(struct bond *bond);
+static void update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id);
 static bool bond_is_falling_back_to_ab(const struct bond *);
 
 /* Attempts to parse 's' as the name of a bond balancing mode.  If successful,
@@ -262,6 +260,7 @@  void
 bond_unref(struct bond *bond)
 {
     struct bond_slave *slave;
+    uint32_t bond_recirc_id = 0;
 
     if (!bond || ovs_refcount_unref_relaxed(&bond->ref_cnt) != 1) {
         return;
@@ -282,12 +281,13 @@  bond_unref(struct bond *bond)
 
     /* Free bond resources. Remove existing post recirc rules. */
     if (bond->recirc_id) {
+        bond_recirc_id = bond->recirc_id;
         recirc_free_id(bond->recirc_id);
         bond->recirc_id = 0;
     }
     free(bond->hash);
     bond->hash = NULL;
-    update_recirc_rules__(bond);
+    update_recirc_rules__(bond, bond_recirc_id);
 
     hmap_destroy(&bond->pr_rule_ops);
     free(bond->name);
@@ -328,13 +328,14 @@  add_pr_rule(struct bond *bond, const struct match *match,
  * lock annotation. Currently, only 'bond_unref()' calls
  * this function directly.  */
 static void
-update_recirc_rules__(struct bond *bond)
+update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id)
 {
     struct match match;
     struct bond_pr_rule_op *pr_op, *next_op;
     uint64_t ofpacts_stub[128 / 8];
     struct ofpbuf ofpacts;
     int i;
+    ofp_port_t slave_map[BOND_MASK];
 
     ofpbuf_use_stub(&ofpacts, ofpacts_stub, sizeof ofpacts_stub);
 
@@ -353,8 +354,18 @@  update_recirc_rules__(struct bond *bond)
 
                 add_pr_rule(bond, &match, slave->ofp_port,
                             &bond->hash[i].pr_rule);
+                slave_map[i] = slave->ofp_port;
+            } else {
+                slave_map[i] = OFP_PORT_C(-1);
             }
         }
+        if (bond->ofproto->backer->rt_support.lb_output_action) {
+            ofproto_dpif_bundle_add(bond->ofproto, bond->recirc_id, slave_map);
+        }
+    } else {
+        if (bond->ofproto->backer->rt_support.lb_output_action) {
+            ofproto_dpif_bundle_del(bond->ofproto, bond_recirc_id);
+        }
     }
 
     HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
@@ -404,7 +415,7 @@  static void
 update_recirc_rules(struct bond *bond)
     OVS_REQ_RDLOCK(rwlock)
 {
-    update_recirc_rules__(bond);
+    update_recirc_rules__(bond, bond->recirc_id);
 }
 
 /* Updates 'bond''s overall configuration to 's'.
@@ -467,6 +478,10 @@  bond_reconfigure(struct bond *bond, const struct bond_settings *s)
         recirc_free_id(bond->recirc_id);
         bond->recirc_id = 0;
     }
+    if (bond->use_bond_cache != s->use_bond_cache) {
+        bond->use_bond_cache = s->use_bond_cache;
+        revalidate = true;
+    }
 
     if (bond->balance == BM_AB || !bond->hash || revalidate) {
         bond_entry_reset(bond);
@@ -944,6 +959,17 @@  bond_recirculation_account(struct bond *bond)
     OVS_REQ_WRLOCK(rwlock)
 {
     int i;
+    uint64_t n_bytes[BOND_BUCKETS] = {0};
+    bool use_bond_cache =
+        (bond->balance == BM_TCP)
+        && bond->ofproto->backer->rt_support.lb_output_action
+        && bond->use_bond_cache;
+
+    if (use_bond_cache) {
+        /* Retrieve bond stats from datapath. */
+        dpif_bond_stats_get(bond->ofproto->backer->dpif,
+                            bond->recirc_id, n_bytes);
+    }
 
     for (i=0; i<=BOND_MASK; i++) {
         struct bond_entry *entry = &bond->hash[i];
@@ -953,8 +979,12 @@  bond_recirculation_account(struct bond *bond)
             struct pkt_stats stats;
             long long int used OVS_UNUSED;
 
-            rule->ofproto->ofproto_class->rule_get_stats(
-                rule, &stats, &used);
+            if (!use_bond_cache) {
+                rule->ofproto->ofproto_class->rule_get_stats(
+                    rule, &stats, &used);
+            } else {
+                stats.n_bytes = n_bytes[i];
+            }
             bond_entry_account(entry, stats.n_bytes);
         }
     }
@@ -1365,6 +1395,8 @@  bond_print_details(struct ds *ds, const struct bond *bond)
                   may_recirc ? "yes" : "no", may_recirc ? recirc_id: -1);
 
     ds_put_format(ds, "bond-hash-basis: %"PRIu32"\n", bond->basis);
+    ds_put_format(ds, "lb-output-action: %s\n",
+                  bond->use_bond_cache ? "enabled" : "disabled");
 
     ds_put_format(ds, "updelay: %d ms\n", bond->updelay);
     ds_put_format(ds, "downdelay: %d ms\n", bond->downdelay);
@@ -1942,3 +1974,9 @@  bond_get_changed_active_slave(const char *name, struct eth_addr *mac,
 
     return false;
 }
+
+bool
+bond_get_cache_mode(const struct bond *bond)
+{
+    return (bond->balance == BM_TCP) && bond->use_bond_cache;
+}
diff --git a/ofproto/bond.h b/ofproto/bond.h
index e7c3d9b..88a4de1 100644
--- a/ofproto/bond.h
+++ b/ofproto/bond.h
@@ -22,6 +22,10 @@ 
 #include "ofproto-provider.h"
 #include "packets.h"
 
+/* Bit-mask for hashing a flow down to a bucket. */
+#define BOND_MASK 0xff
+#define BOND_BUCKETS (BOND_MASK + 1)
+
 struct flow;
 struct netdev;
 struct ofpbuf;
@@ -58,6 +62,8 @@  struct bond_settings {
                                 /* The MAC address of the interface
                                    that was active during the last
                                    ovs run. */
+    bool use_bond_cache;        /* Use bond cache. Only applicable for
+                                   bond mode BALANCE TCP. */
 };
 
 /* Program startup. */
@@ -122,4 +128,7 @@  void bond_rebalance(struct bond *);
 */
 void bond_update_post_recirc_rules(struct bond *, uint32_t *recirc_id,
                                    uint32_t *hash_basis);
+
+bool bond_get_cache_mode(const struct bond *);
+
 #endif /* bond.h */
diff --git a/ofproto/ofproto-dpif-ipfix.c b/ofproto/ofproto-dpif-ipfix.c
index b413768..5030978 100644
--- a/ofproto/ofproto-dpif-ipfix.c
+++ b/ofproto/ofproto-dpif-ipfix.c
@@ -3017,6 +3017,7 @@  dpif_ipfix_read_actions(const struct flow *flow,
         case OVS_ACTION_ATTR_CHECK_PKT_LEN:
         case OVS_ACTION_ATTR_UNSPEC:
         case OVS_ACTION_ATTR_DROP:
+        case OVS_ACTION_ATTR_LB_OUTPUT:
         case __OVS_ACTION_ATTR_MAX:
         default:
             break;
diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
index f9ea47a..21dafa6 100644
--- a/ofproto/ofproto-dpif-sflow.c
+++ b/ofproto/ofproto-dpif-sflow.c
@@ -1177,6 +1177,7 @@  dpif_sflow_read_actions(const struct flow *flow,
         case OVS_ACTION_ATTR_CT:
     case OVS_ACTION_ATTR_CT_CLEAR:
         case OVS_ACTION_ATTR_METER:
+        case OVS_ACTION_ATTR_LB_OUTPUT:
             break;
 
         case OVS_ACTION_ATTR_SET_MASKED:
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 0b45ecf..1656d47 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4187,7 +4187,6 @@  compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
         xlate_commit_actions(ctx);
 
         if (xr) {
-            /* Recirculate the packet. */
             struct ovs_action_hash *act_hash;
 
             /* Hash action. */
@@ -4196,15 +4195,28 @@  compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
                 /* Algorithm supported by all datapaths. */
                 hash_alg = OVS_HASH_ALG_L4;
             }
-            act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
+
+            if ((ctx->xbridge->support.lb_output_action) &&
+                bond_get_cache_mode(xport->xbundle->bond)) {
+                /*
+                 * If bond mode is balance-tcp and optimize balance tcp
+                 * is enabled then use the hash directly for slave selection
+                 * and avoid recirculation.
+                 *
+                 * Currently support for netdev datapath only.
+                 */
+                nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_LB_OUTPUT,
+                               xr->recirc_id);
+            } else {
+                act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
                                                 OVS_ACTION_ATTR_HASH,
                                                 sizeof *act_hash);
-            act_hash->hash_alg = hash_alg;
-            act_hash->hash_basis = xr->hash_basis;
-
-            /* Recirc action. */
-            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
-                           xr->recirc_id);
+                act_hash->hash_alg = hash_alg;
+                act_hash->hash_basis = xr->hash_basis;
+                /* Recirculate the packet. */
+                nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
+                               xr->recirc_id);
+            }
         } else if (is_native_tunnel) {
             /* Output to native tunnel port. */
             native_tunnel_output(ctx, xport, flow, odp_port, truncate);
@@ -7265,7 +7277,8 @@  count_output_actions(const struct ofpbuf *odp_actions)
     int n = 0;
 
     NL_ATTR_FOR_EACH_UNSAFE (a, left, odp_actions->data, odp_actions->size) {
-        if (a->nla_type == OVS_ACTION_ATTR_OUTPUT) {
+        if ((a->nla_type == OVS_ACTION_ATTR_OUTPUT) ||
+            (a->nla_type == OVS_ACTION_ATTR_LB_OUTPUT)) {
             n++;
         }
     }
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 0222ec8..9981be6 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -1580,6 +1580,8 @@  check_support(struct dpif_backer *backer)
     backer->rt_support.ct_timeout = check_ct_timeout_policy(backer);
     backer->rt_support.explicit_drop_action =
         dpif_supports_explicit_drop_action(backer->dpif);
+    backer->rt_support.lb_output_action=
+        dpif_supports_lb_output_action(backer->dpif);
 
     /* Flow fields. */
     backer->rt_support.odp.ct_state = check_ct_state(backer);
@@ -3429,6 +3431,27 @@  bundle_remove(struct ofport *port_)
     }
 }
 
+int
+ofproto_dpif_bundle_add(struct ofproto_dpif *ofproto,
+                        uint32_t bond_id, const ofp_port_t slave_map[])
+{
+    /* Convert ofp_port to odp_port */
+    odp_port_t odp_map[BOND_BUCKETS];
+    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
+        odp_map[bucket] = (slave_map[bucket] == OFP_PORT_C(-1)
+                           ? ODP_PORT_C(-1)
+                           : ofp_port_to_odp_port(ofproto, slave_map[bucket]));
+    }
+
+    return dpif_bond_add(ofproto->backer->dpif, bond_id, odp_map);
+}
+
+int
+ofproto_dpif_bundle_del(struct ofproto_dpif *ofproto, uint32_t bond_id)
+{
+    return dpif_bond_del(ofproto->backer->dpif, bond_id);
+}
+
 static void
 send_pdu_cb(void *port_, const void *pdu, size_t pdu_size)
 {
@@ -5559,6 +5582,7 @@  get_datapath_cap(const char *datapath_type, struct smap *cap)
     smap_add(cap, "ct_timeout", s.ct_timeout ? "true" : "false");
     smap_add(cap, "explicit_drop_action",
              s.explicit_drop_action ? "true" :"false");
+    smap_add(cap, "lb_output_action", s.lb_output_action ? "true" : "false");
 }
 
 /* Gets timeout policy name in 'backer' based on 'zone', 'dl_type' and
diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
index c9d5df3..384a269 100644
--- a/ofproto/ofproto-dpif.h
+++ b/ofproto/ofproto-dpif.h
@@ -202,8 +202,10 @@  struct group_dpif *group_dpif_lookup(struct ofproto_dpif *,
     DPIF_SUPPORT_FIELD(bool, ct_timeout, "Conntrack timeout policy")        \
                                                                             \
     /* True if the datapath supports explicit drop action. */               \
-    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action")
-
+    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action")  \
+                                                                            \
+    /* True if the datapath supports balance_tcp optimization */            \
+    DPIF_SUPPORT_FIELD(bool, lb_output_action, "Optimized Balance TCP mode")
 
 /* Stores the various features which the corresponding backer supports. */
 struct dpif_backer_support {
@@ -379,6 +381,9 @@  int ofproto_dpif_add_internal_flow(struct ofproto_dpif *,
                                    struct rule **rulep);
 int ofproto_dpif_delete_internal_flow(struct ofproto_dpif *, struct match *,
                                       int priority);
+int ofproto_dpif_bundle_add(struct ofproto_dpif *, uint32_t bond_id,
+                            const ofp_port_t slave_map[]);
+int ofproto_dpif_bundle_del(struct ofproto_dpif *, uint32_t bond_id);
 
 bool ovs_native_tunneling_is_on(struct ofproto_dpif *);
 
diff --git a/tests/lacp.at b/tests/lacp.at
index 7b460d7..c881a0b 100644
--- a/tests/lacp.at
+++ b/tests/lacp.at
@@ -121,6 +121,7 @@  AT_CHECK([ovs-appctl bond/show], [0], [dnl
 bond_mode: active-backup
 bond may use recirculation: no, Recirc-ID : -1
 bond-hash-basis: 0
+lb-output-action: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -286,6 +287,7 @@  slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -301,6 +303,7 @@  slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -423,6 +426,7 @@  slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -440,6 +444,7 @@  slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -555,6 +560,7 @@  slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -572,6 +578,7 @@  slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -692,6 +699,7 @@  slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -709,6 +717,7 @@  slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
index fe73c38..4a5e893 100644
--- a/vswitchd/bridge.c
+++ b/vswitchd/bridge.c
@@ -4600,6 +4600,11 @@  port_configure_bond(struct port *port, struct bond_settings *s)
         /* OVSDB did not store the last active interface */
         s->active_slave_mac = eth_addr_zero;
     }
+
+    /* use_bond_cache is disabled by default. */
+    s->use_bond_cache = (s->balance == BM_TCP)
+                        && smap_get_bool(&port->cfg->other_config,
+                                      "lb-output-action", false);
 }
 
 /* Returns true if 'port' is synthetic, that is, if we constructed it locally
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 4a74ed3..3da8d0c 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -1994,6 +1994,16 @@ 
         <code>active-backup</code>.
       </column>
 
+      <column name="other_config" key="lb-output-action"
+              type='{"type": "boolean"}'>
+        Enable/disable usage of optimized lb-output-action for
+        balancing flows among output slaves in load balanced bonds in
+        <code>balance-tcp</code>. When enabled, it uses optimized path for
+        balance-tcp mode by using rss hash and avoids recirculation.
+        It affects only new flows, i.e, existing flows remain unchanged.
+        This knob does not affect other balancing modes.
+      </column>
+
       <group title="Link Failure Detection">
         <p>
           An important part of link bonding is detecting that links are down so
@@ -5788,6 +5798,19 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         higher performance for MPLS and active-active load balancing
         bonding modes.
       </column>
+      <column name="capabilities" key="lb_output_action"
+              type='{"type": "boolean"}'>
+        If this is true, then the datapath supports optimized balance-tcp
+        bond mode. This capability replaces existing <code>hash</code> and
+        <code>recirc</code> actions with new action <code>lb_output</code>
+        and avoids recirculation of packet in datapath. It is supported
+        only for balance-tcp bond mode in netdev datapath. The new action
+        gives higer performance by using bond buckets instead of post
+        recirculation flows for selection of slave port from bond. By default
+        this new action is disabled, however it can be enabled by setting
+        <ref column="other-config" key="lb-output-action"/> in
+        <ref table="Port"/> table.
+      </column>
       <group title="Connection-Tracking Capabilities">
         <p>
           These capabilities are granular because Open vSwitch and its