From patchwork Tue Jul 4 09:50:34 2017
X-Patchwork-Submitter: Ilya Maximets
X-Patchwork-Id: 783828
From: Ilya Maximets
To: Darrell Ball, Jan Scheurich, "ovs-dev@openvswitch.org", Bhanuprakash Bodireddy
Date: Tue, 04 Jul 2017 12:50:34 +0300
Message-id: <1ea3e4e1-315c-d7ba-0ede-24bc636c67e4@samsung.com>
Subject: Re: [ovs-dev] [PATCH 0/3] Output packet batching.
Cc: Heetae Ahn
References: <1498824162-2774-1-git-send-email-i.maximets@samsung.com>

On 03.07.2017 20:33, Darrell Ball wrote:
>
> On 7/3/17, 7:31 AM, "ovs-dev-bounces@openvswitch.org on behalf of Jan Scheurich" wrote:
>
> I like this generic approach of collecting the packets to be output per port
> for each Rx batch in dpif-netdev. It is indeed simpler than the approach
> in [1].
>
> However, [1] originally had a larger scope, namely to buffer packets in an
> intermediate queue per netdev tx queue *across* multiple rx batches.
> Unfortunately this aspect was "optimized" away in v3 to minimize the impact
> on minimum and average latency for ports with little traffic.
>
> Limiting the output batching to a single rx batch has little benefit unless
> the average rx batch size is significant. According to our measurements
> with an instrumented netdev-dpdk datapath, this happens only very close to
> or above PMD saturation.
> At realistic production load levels (say 95% PMD processing cycles), the
> average rx batch size is still only around 1-2, so the gain we can expect
> is negligible. The primary use case for this patch appears to be to push up
> the saturation throughput in benchmarks.
>
> [Darrell]
> This patch series suggested by Ilya is mainly a subset of the series
> suggested by Bhanu here
> [1] https://patchwork.ozlabs.org/patch/706461/
> and hence the two are not really comparable.
> Ilya's series collects packets per rx batch and sends each subset as a tx
> batch per output port. As Jan points out here, and as has come up many
> times before, real rx batch sizes are themselves small. Also, in general,
> an implementation should not rely on a very lucky distribution of input
> packets within one rx batch to provide some benefit. I also think the
> benefit will be negligible in the general case. It would be good to see
> what the extra cost of the additional checking is when the general case of
> negligible benefit is in effect.
>
> I agree that one valid concept in Ilya's patch set is putting the code in a
> common code path across netdevs, though this may be an oversimplification
> in some cases. If tx batching belongs at the netdev layer in some cases, I
> don't think we should put some of it in dpif-netdev and some in netdev.
> Mixing layers for one function (i.e. tx batching) will be more confusing
> and less maintainable.
>
> From our perspective, a perhaps more important use case for tx batching
> would be to reduce the number of virtio interrupts a PMD needs to trigger
> in Qemu/KVM guests using the kernel virtio-net driver (currently one per
> tx batch).
>
> In internal benchmarks with kernel applications we have observed that the
> OVS PMD spends up to 30% of its cycles kicking the eventfd, and the guest
> OS in the VM spends 50% or more of a vCPU processing virtio-net interrupts.
> This seriously degrades the performance of both OVS and the application.
>
> With a vhostuser tx batching patch that is not limited to a single rx batch
> but relies on periodic flushing (e.g. every 50-100 us), we have been able
> to reduce this interrupt overhead and significantly improve e.g. iperf
> performance.
>
> My suggestion would be to complement the generic per-rx-batch output packet
> batching in dpif-netdev in this patch with specific support for tx batching
> for vhostuser ports in netdev-dpdk, using an intermediate queue with a
> configurable flushing interval (including the possibility to turn buffering
> off when latency is an issue).
>
> [Darrell]
> If some tx batching logically belongs at the netdev layer, then we should
> put all of it there, since in this situation the behavior really is "class
> or subclass" specific and we should recognize that. I don't think we should
> mix and match tx batching across layers.
>
> The general idea of decoupling rx and tx, which is the ultimate direction
> of Bhanu's patch, has merit in some situations, especially where added
> latency matters less than throughput. This would ultimately allow
> collecting the packets of a tx batch across different rx batches and over a
> longer time than a single rx batch. However, as Jan points out, there is
> going to be a tradeoff between latency and throughput; I don't think this
> is a surprise to anyone. That logically leads us to more configuration
> knobs, which I think is ok here.
>
> BR, Jan
>
> > -----Original Message-----
> > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Ilya Maximets
> > Sent: Friday, 30 June, 2017 14:03
> > To: ovs-dev@openvswitch.org; Bhanuprakash Bodireddy
> > Cc: Heetae Ahn; Ilya Maximets
> > Subject: [ovs-dev] [PATCH 0/3] Output packet batching.
> >
> > This patch-set is inspired by [1] from Bhanuprakash Bodireddy.
> > The implementation in [1] looks very complex and introduces many pitfalls
> > for later code modifications, such as packets getting stuck.
> >
> > This version aims to provide simple and flexible output packet batching
> > at a higher level, without complicating the netdev layer (and even
> > simplifying it).
> >
> > The patch set consists of 3 patches. All the functionality is introduced
> > in the first patch. The other two are just cleanups so that netdevs do
> > not do unnecessary things.
> >
> > Basic testing of the 'PVP with OVS bonding on phy ports' scenario shows a
> > significant performance improvement. More accurate and intensive testing
> > is required.
> >
> > [1] [PATCH 0/6] netdev-dpdk: Use intermediate queue during packet transmission.
> >     https://mail.openvswitch.org/pipermail/ovs-dev/2017-June/334762.html
> >
> > Ilya Maximets (3):
> >   dpif-netdev: Output packet batching.
> >   netdev: Remove unused may_steal.
> >   netdev: Remove useless cutlen.
> >
> >  lib/dpif-netdev.c     | 81 ++++++++++++++++++++++++++++++++++++---------------
> >  lib/netdev-bsd.c      |  7 ++---
> >  lib/netdev-dpdk.c     | 30 +++++++------------
> >  lib/netdev-dummy.c    |  6 ++--
> >  lib/netdev-linux.c    |  7 ++---
> >  lib/netdev-provider.h |  7 ++---
> >  lib/netdev.c          | 12 +++-----
> >  lib/netdev.h          |  2 +-
> >  8 files changed, 83 insertions(+), 69 deletions(-)
> >
> > --
> > 2.7.4

Hi Darrell and Jan. Thanks for looking at this.

I agree with Darrell that mixing implementations on two different levels is a
bad idea, but as I already wrote in my reply to Bhanuprakash [2], there is no
issue with implementing output batching across more than one rx batch.

[2] https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/334808.html

Look at the incremental below; this is how it may look. I can easily change
the milliseconds granularity to microseconds if you wish. For the final
version, part of this incremental will be squashed into the first patch.

One difference from a netdev-layer solution is that we batch only the packets
received by the current thread, never mixing packets from different threads.
I think that is an advantage of this solution, because we will never have the
performance issues connected with flushing from a non-local thread (see the
issue description in my reply to Bhanuprakash [2]).
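In short, the flushing policy in the incremental is: the first packet queued
for a port stamps a deadline of now + output-max-latency; a batch that fills
up to NETDEV_MAX_BURST is flushed immediately; anything left over is flushed
by the poll loop once its deadline passes (or unconditionally when forced).
A minimal standalone sketch of just this policy, with toy types and names
(out_batch, MAX_BURST) standing in for the real OVS ones:

    #include <stdbool.h>

    #define MAX_BURST 32            /* Stands in for NETDEV_MAX_BURST. */

    struct out_batch {
        int count;                  /* Packets currently queued for the port. */
        long long deadline;         /* Stamped when the first packet is queued. */
    };

    /* Stands in for netdev_send() + dp_packet_batch_init(). */
    static void
    flush(struct out_batch *b)
    {
        b->count = 0;
    }

    /* Queue one packet; flush immediately if the batch becomes full.
     * Mirrors the dp_execute_cb() hunk of the incremental. */
    static void
    enqueue(struct out_batch *b, long long now, long long max_latency)
    {
        if (b->count == 0) {
            b->deadline = now + max_latency;
        }
        if (++b->count == MAX_BURST) {
            flush(b);
        }
    }

    /* Called once per poll iteration (force=false) and on teardown or
     * non-PMD execution (force=true).  Mirrors
     * dp_netdev_pmd_flush_output_packets(). */
    static void
    maybe_flush(struct out_batch *b, long long now, bool force)
    {
        if (b->count > 0 && (force || b->deadline <= now)) {
            flush(b);
        }
    }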
Best regards, Ilya Maximets.

---8<---------------------------------------------------------------->8---
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 38282bd..360e737 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -710,6 +710,12 @@ dp_packet_batch_is_empty(const struct dp_packet_batch *batch)
     return !dp_packet_batch_size(batch);
 }
 
+static inline bool
+dp_packet_batch_is_full(const struct dp_packet_batch *batch)
+{
+    return dp_packet_batch_size(batch) == NETDEV_MAX_BURST;
+}
+
 #define DP_PACKET_BATCH_FOR_EACH(PACKET, BATCH)                  \
     for (size_t i = 0; i < dp_packet_batch_size(BATCH); i++)     \
         if (PACKET = BATCH->packets[i], true)
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index e49f665..181dcb8 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -84,6 +84,9 @@ VLOG_DEFINE_THIS_MODULE(dpif_netdev);
 #define MAX_RECIRC_DEPTH 5
 DEFINE_STATIC_PER_THREAD_DATA(uint32_t, recirc_depth, 0)
 
+/* Use instant packet send by default. */
+#define DEFAULT_OUTPUT_MAX_LATENCY 0
+
 /* Configuration parameters. */
 enum { MAX_FLOWS = 65536 };     /* Maximum number of flows in flow table. */
 enum { MAX_METERS = 65536 };    /* Maximum number of meters. */
@@ -262,6 +265,9 @@ struct dp_netdev {
     struct ovs_mutex meter_locks[N_METER_LOCKS];
     struct dp_meter *meters[MAX_METERS]; /* Meter bands. */
 
+    /* The time that a packet can wait in output batch for sending. */
+    atomic_uint32_t output_max_latency;
+
     /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
 
@@ -494,6 +500,7 @@ struct tx_port {
     int qid;
     long long last_used;
     struct hmap_node node;
+    long long output_time;
     struct dp_packet_batch output_pkts;
 };
 
@@ -660,7 +667,7 @@ static void dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
                                        struct rxq_poll *poll)
     OVS_REQUIRES(pmd->port_mutex);
 static void dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *,
-                                               long long now);
+                                               long long now, bool force);
 static void reconfigure_datapath(struct dp_netdev *dp)
     OVS_REQUIRES(dp->port_mutex);
 static bool dp_netdev_pmd_try_ref(struct dp_netdev_pmd_thread *pmd);
@@ -1182,6 +1189,7 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
 
     conntrack_init(&dp->conntrack);
 
+    atomic_init(&dp->output_max_latency, DEFAULT_OUTPUT_MAX_LATENCY);
     atomic_init(&dp->emc_insert_min, DEFAULT_EM_FLOW_INSERT_MIN);
 
     cmap_init(&dp->poll_threads);
@@ -2848,7 +2856,7 @@ dpif_netdev_execute(struct dpif *dpif, struct dpif_execute *execute)
     dp_packet_batch_init_packet(&pp, execute->packet);
     dp_netdev_execute_actions(pmd, &pp, false, execute->flow,
                               execute->actions, execute->actions_len, now);
-    dp_netdev_pmd_flush_output_packets(pmd, now);
+    dp_netdev_pmd_flush_output_packets(pmd, now, true);
 
     if (pmd->core_id == NON_PMD_CORE_ID) {
         ovs_mutex_unlock(&dp->non_pmd_mutex);
@@ -2897,6 +2905,16 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
         smap_get_ullong(other_config, "emc-insert-inv-prob",
                         DEFAULT_EM_FLOW_INSERT_INV_PROB);
     uint32_t insert_min, cur_min;
+    uint32_t output_max_latency, cur_max_latency;
+
+    output_max_latency = smap_get_int(other_config, "output-max-latency",
+                                      DEFAULT_OUTPUT_MAX_LATENCY);
+    atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
+    if (output_max_latency != cur_max_latency) {
+        atomic_store_relaxed(&dp->output_max_latency, output_max_latency);
+        VLOG_INFO("Output maximum latency set to %"PRIu32" ms",
+                  output_max_latency);
+    }
 
     if (!nullable_string_is_equal(dp->pmd_cmask, cmask)) {
         free(dp->pmd_cmask);
@@ -3085,26 +3103,34 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
 }
 
 static void
+dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
+                                   struct tx_port *p, long long now)
+{
+    int tx_qid;
+    bool dynamic_txqs;
+
+    dynamic_txqs = p->port->dynamic_txqs;
+    if (dynamic_txqs) {
+        tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p, now);
+    } else {
+        tx_qid = pmd->static_tx_qid;
+    }
+
+    netdev_send(p->port->netdev, tx_qid, &p->output_pkts,
+                dynamic_txqs);
+    dp_packet_batch_init(&p->output_pkts);
+}
+
+static void
 dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
-                                   long long now)
+                                   long long now, bool force)
 {
     struct tx_port *p;
 
     HMAP_FOR_EACH (p, node, &pmd->send_port_cache) {
-        if (!dp_packet_batch_is_empty(&p->output_pkts)) {
-            int tx_qid;
-            bool dynamic_txqs;
-
-            dynamic_txqs = p->port->dynamic_txqs;
-            if (dynamic_txqs) {
-                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p, now);
-            } else {
-                tx_qid = pmd->static_tx_qid;
-            }
-
-            netdev_send(p->port->netdev, tx_qid, &p->output_pkts,
-                        dynamic_txqs);
-            dp_packet_batch_init(&p->output_pkts);
+        if (!dp_packet_batch_is_empty(&p->output_pkts)
+            && (force || p->output_time <= now)) {
+            dp_netdev_pmd_flush_output_on_port(pmd, p, now);
         }
     }
 }
@@ -3128,7 +3154,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
         cycles_count_start(pmd);
         dp_netdev_input(pmd, &batch, port_no, now);
-        dp_netdev_pmd_flush_output_packets(pmd, now);
+        dp_netdev_pmd_flush_output_packets(pmd, now, false);
         cycles_count_end(pmd, PMD_CYCLES_PROCESSING);
     } else if (error != EAGAIN && error != EOPNOTSUPP) {
         static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
@@ -3663,6 +3689,8 @@ pmd_free_cached_ports(struct dp_netdev_pmd_thread *pmd)
 {
     struct tx_port *tx_port_cached;
 
+    /* Flush all the queued packets. */
+    dp_netdev_pmd_flush_output_packets(pmd, 0, true);
     /* Free all used tx queue ids. */
     dpif_netdev_xps_revalidate_pmd(pmd, 0, true);
@@ -4388,6 +4416,7 @@ dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
 
     tx->port = port;
     tx->qid = -1;
+    tx->output_time = 0LL;
     dp_packet_batch_init(&tx->output_pkts);
 
     hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
@@ -5054,8 +5083,18 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
             }
             dp_packet_batch_apply_cutlen(packets_);
 
+            if (dp_packet_batch_is_empty(&p->output_pkts)) {
+                uint32_t cur_max_latency;
+
+                atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
+                p->output_time = now + cur_max_latency;
+            }
+
             DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
                 dp_packet_batch_add(&p->output_pkts, packet);
+                if (OVS_UNLIKELY(dp_packet_batch_is_full(&p->output_pkts))) {
+                    dp_netdev_pmd_flush_output_on_port(pmd, p, now);
+                }
             }
             return;
         }
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index abfe397..34ee908 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -344,6 +344,21 @@
+      <column name="other_config" key="output-max-latency">
+        <p>
+          Specifies the time in milliseconds that a packet can wait in the
+          output batch for sending, i.e. the amount of time that a packet can
+          spend in an intermediate output queue before being sent to the
+          netdev.  This option can be used to configure the balance between
+          throughput and latency: lower values decrease latency, while
+          higher values may be useful to achieve higher performance.
+        </p>
+        <p>
+          Defaults to 0, i.e. instant packet sending (latency optimized).
+        </p>
+      </column>
+
---8<---------------------------------------------------------------->8---
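Usage note, assuming the knob lands with the key name used in the
dpif_netdev_set_config() hunk above: like other dpif-netdev options, it is
read from the Open_vSwitch table's other_config column, so it could be set
with something like 'ovs-vsctl set Open_vSwitch . other_config:output-max-latency=50'
to let packets wait up to 50 ms in a per-port output batch, and reset to 0
to return to instant (per-rx-batch) sending.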