
[RFC,net-next,3/3] packet: make use of deferred TX queue flushing

Message ID 1408887738-7661-4-git-send-email-dborkman@redhat.com
State RFC, archived
Delegated to: David Miller

Commit Message

Daniel Borkmann Aug. 24, 2014, 1:42 p.m. UTC
This adds a first use-case of deferred tail pointer flushing
for AF_PACKET's TX_RING in QDISC_BYPASS mode.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/packet/af_packet.c | 49 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 37 insertions(+), 12 deletions(-)
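
For context: the helpers used by this patch, __netdev_xmit_only() and
__netdev_xmit_flush(), are introduced earlier in the series (not shown
here). They split the driver hand-off into a "queue only" step and an
explicit tail-pointer (doorbell) kick. The following is only a rough
sketch of how a caller is assumed to batch frames with them; the
xmit_batch() wrapper is illustration, not code from the series:

	/* Illustration only: queue several frames without touching the
	 * device doorbell, then issue a single tail-pointer flush that
	 * covers the whole batch.
	 */
	static int xmit_batch(struct sk_buff **skbs, unsigned int n,
			      struct net_device *dev, u16 queue)
	{
		unsigned int i;

		for (i = 0; i < n; i++) {
			/* hand the frame to the driver TX ring only */
			if (__netdev_xmit_only(skbs[i], dev) != NETDEV_TX_OK)
				break;
		}

		/* one MMIO tail-pointer write for all queued frames */
		__netdev_xmit_flush(dev, queue);
		return i;
	}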

Comments

David Miller Aug. 25, 2014, 5:57 a.m. UTC | #1
From: Daniel Borkmann <dborkman@redhat.com>
Date: Sun, 24 Aug 2014 15:42:18 +0200

> This adds a first use-case of deferred tail pointer flushing
> for AF_PACKET's TX_RING in QDISC_BYPASS mode.
> 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>

If last_queue changes, you'll need to force a flush, does that
end up happening with your changes here?  I really couldn't
tell for sure.
Daniel Borkmann Aug. 25, 2014, 6:40 a.m. UTC | #2
On 08/25/2014 07:57 AM, David Miller wrote:
> From: Daniel Borkmann <dborkman@redhat.com>
> Date: Sun, 24 Aug 2014 15:42:18 +0200
>
>> This adds a first use-case of deferred tail pointer flushing
>> for AF_PACKET's TX_RING in QDISC_BYPASS mode.
>>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
>
> If last_queue changes, you'll need to force a flush, does that
> end up happening with your changes here?  I really couldn't
> tell for sure.

Yes indeed, I noticed that later on as well. :)

I will fix that up and resubmit the series, thanks.
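
For illustration, the forced flush on a queue change could look roughly
like this in tpacket_snd(); a hedged sketch of the idea discussed above
(the queue_index local is hypothetical), not the actual resubmission:

		queue_index = packet_pick_tx_queue(dev, skb);
		if (packet_use_direct_xmit(po) && pending > 0 &&
		    queue_index != last_queue) {
			/* kick the previous queue's tail pointer before
			 * we start filling a different one */
			__netdev_xmit_flush(dev, last_queue);
			pending = 0;
		}
		last_queue = queue_index;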
Jesper Dangaard Brouer Aug. 25, 2014, 1:54 p.m. UTC | #3
On Sun, 24 Aug 2014 15:42:18 +0200
Daniel Borkmann <dborkman@redhat.com> wrote:

> This adds a first use-case of deferred tail pointer flushing
> for AF_PACKET's TX_RING in QDISC_BYPASS mode.

Testing with trafgen.  For this testing, I've updated patch 1/3 to NOT
call mmiowb(); see why in my other post.

trafgen cmdline:
 trafgen --cpp  --dev eth5 --conf udp_example01.trafgen -V --cpus 1
 * Only use 1 CPU
 * default is mmap
 * default is QDISC_BYPASS mode

BASELINE(no-patches): trafgen QDISC_BYPASS and mmap:
 - tx:1562539 pps

With PACKET_FLUSH_THRESH=8, and QDISC_BYPASS and mmap:
 - tx:1683746 pps

Improvement:
 + 121207 pps
 - 46 ns (1/1562539*10^9)-(1/1683746*10^9)

This is a significant improvement! :-)
Jesper Dangaard Brouer Aug. 25, 2014, 3:16 p.m. UTC | #4
On Mon, 25 Aug 2014 15:54:02 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Sun, 24 Aug 2014 15:42:18 +0200
> Daniel Borkmann <dborkman@redhat.com> wrote:
> 
> > This adds a first use-case of deferred tail pointer flushing
> > for AF_PACKET's TX_RING in QDISC_BYPASS mode.
> 
> Testing with trafgen.  For this testing, I've updated patch 1/3 to NOT
> call mmiowb(); see why in my other post.
> 
> trafgen cmdline:
>  trafgen --cpp  --dev eth5 --conf udp_example01.trafgen -V --cpus 1
>  * Only use 1 CPU
>  * default is mmap
>  * default is QDISC_BYPASS mode
> 
> BASELINE(no-patches): trafgen QDISC_BYPASS and mmap:
>  - tx:1562539 pps
> 
> With PACKET_FLUSH_THRESH=8, and QDISC_BYPASS and mmap:
>  - tx:1683746 pps
> 
> Improvement:
>  + 121207 pps
>  - 46 ns (1/1562539*10^9)-(1/1683746*10^9)
> 
> This is a significant improvement! :-)

I'm unfortunately seeing a regression if I'm NOT bypassing the qdisc
layer but still using mmap.  Trafgen has an option --qdisc-path for
this. (I believe most other solutions don't set the QDISC_BYPASS
socket option.)

trafgen command:
 # trafgen --cpp --dev eth5 --conf udp_example01.trafgen -V  --qdisc-path --cpus 1
 * still use mmap
 * choose normal qdisc code path via --qdisc-path

BASELINE(no-patches): trafgen using --qdisc-path and mmap:
 - tx:1371307 pps

(Patched): trafgen using --qdisc-path and mmap
 - tx:1345999 pps

Regression:
 * 25308 pps slower than before
 * 13.71 nanosec slower (1/1371307*10^9)-(1/1345999*10^9)

How can we explain this?!?

As can be deduced from the baseline numbers, the cost of the qdisc
path is fairly high, with 89.24 ns ((1/1562539*10^9)-(1/1371307*10^9)).
(This is a bit higher than I expected based on my data from:
http://people.netfilter.org/hawk/presentations/nfws2014/dp-accel-qdisc-lockless.pdf
where I measured it to be 60ns).

(Does this make sense?):  The above results say we can save 46ns by
delaying tailptr updates.  But the qdisc path itself adds 89ns of
delay between packets, which is then too large to take advantage of the
tailptr win.  (Not sure this explains the issue... feel free to come up
with a better explanation.)
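
For reference, the per-packet numbers behind the pps figures can be
recomputed with a trivial stand-alone program (not part of the patch;
the constants are the measurements quoted in this thread):

	#include <stdio.h>

	int main(void)
	{
		/* pps values reported earlier in this thread */
		const double bypass_base    = 1562539.0;
		const double bypass_patched = 1683746.0;
		const double qdisc_base     = 1371307.0;
		const double qdisc_patched  = 1345999.0;

		printf("bypass: %.1f -> %.1f ns/pkt (delta %.1f ns)\n",
		       1e9 / bypass_base, 1e9 / bypass_patched,
		       1e9 / bypass_base - 1e9 / bypass_patched);
		printf("qdisc:  %.1f -> %.1f ns/pkt (delta %.1f ns)\n",
		       1e9 / qdisc_base, 1e9 / qdisc_patched,
		       1e9 / qdisc_base - 1e9 / qdisc_patched);
		printf("qdisc path cost: %.1f ns/pkt\n",
		       1e9 / qdisc_base - 1e9 / bypass_base);
		return 0;
	}

This prints roughly 640.0 -> 593.9 ns/pkt for the bypass case (46 ns
saved), 729.2 -> 742.9 ns/pkt for the qdisc path (13.7 ns lost), and an
89.2 ns/pkt qdisc path cost, matching the figures above.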

Patch

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 0dfa990..27457e8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -216,7 +216,8 @@  static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
 static void packet_flush_mclist(struct sock *sk);
 
 struct packet_skb_cb {
-	unsigned int origlen;
+	u32 enforce_flush:1,
+	    origlen:31;
 	union {
 		struct sockaddr_pkt pkt;
 		struct sockaddr_ll ll;
@@ -237,8 +238,11 @@  struct packet_skb_cb {
 static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
 static void __fanout_link(struct sock *sk, struct packet_sock *po);
 
+#define PACKET_FLUSH_THRESH	8
+
 static int packet_direct_xmit(struct sk_buff *skb)
 {
+	bool flush = PACKET_SKB_CB(skb)->enforce_flush;
 	struct net_device *dev = skb->dev;
 	netdev_features_t features;
 	struct netdev_queue *txq;
@@ -261,9 +265,12 @@  static int packet_direct_xmit(struct sk_buff *skb)
 
 	HARD_TX_LOCK(dev, txq, smp_processor_id());
 	if (!netif_xmit_frozen_or_drv_stopped(txq)) {
-		ret = netdev_start_xmit(skb, dev);
-		if (ret == NETDEV_TX_OK)
+		ret = __netdev_xmit_only(skb, dev);
+		if (ret == NETDEV_TX_OK) {
+			if (flush)
+				__netdev_xmit_flush(dev, queue_map);
 			txq_trans_update(txq);
+		}
 	}
 	HARD_TX_UNLOCK(dev, txq);
 
@@ -313,7 +320,7 @@  static u16 __packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
 	return (u16) raw_smp_processor_id() % dev->real_num_tx_queues;
 }
 
-static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
+static u16 packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
 	u16 queue_index;
@@ -327,6 +334,7 @@  static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
 	}
 
 	skb_set_queue_mapping(skb, queue_index);
+	return queue_index;
 }
 
 /* register_prot_hook must be invoked with the po->bind_lock held,
@@ -2237,7 +2245,8 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	unsigned char *addr;
 	int len_sum = 0;
 	int status = TP_STATUS_AVAILABLE;
-	int hlen, tlen;
+	int hlen, tlen, pending = 0;
+	u16 last_queue = 0;
 
 	mutex_lock(&po->pg_vec_lock);
 
@@ -2276,18 +2285,22 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		ph = packet_current_frame(po, &po->tx_ring,
 					  TP_STATUS_SEND_REQUEST);
 		if (unlikely(ph == NULL)) {
-			if (need_wait && need_resched())
+			if (need_wait && need_resched()) {
+				if (packet_use_direct_xmit(po) && pending > 0) {
+					__netdev_xmit_flush(dev, last_queue);
+					pending = 0;
+				}
 				schedule();
+			}
 			continue;
 		}
 
 		status = TP_STATUS_SEND_REQUEST;
 		hlen = LL_RESERVED_SPACE(dev);
-		tlen = dev->needed_tailroom;
-		skb = sock_alloc_send_skb(&po->sk,
-				hlen + tlen + sizeof(struct sockaddr_ll),
-				0, &err);
 
+		tlen = dev->needed_tailroom;
+		skb = sock_alloc_send_skb(&po->sk, hlen + tlen +
+					  sizeof(struct sockaddr_ll), 0, &err);
 		if (unlikely(skb == NULL))
 			goto out_status;
 
@@ -2319,13 +2332,18 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			}
 		}
 
-		packet_pick_tx_queue(dev, skb);
+		last_queue = packet_pick_tx_queue(dev, skb);
 
 		skb->destructor = tpacket_destruct_skb;
 		__packet_set_status(po, ph, TP_STATUS_SENDING);
 		packet_inc_pending(&po->tx_ring);
 
 		status = TP_STATUS_SEND_REQUEST;
+		if (pending >= PACKET_FLUSH_THRESH) {
+			PACKET_SKB_CB(skb)->enforce_flush = 1;
+			pending = -1;
+		}
+
 		err = po->xmit(skb);
 		if (unlikely(err > 0)) {
 			err = net_xmit_errno(err);
@@ -2340,7 +2358,11 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			 * let's treat it like congestion or err < 0
 			 */
 			err = 0;
+		} else {
+			/* Successfully sent out. */
+			pending++;
 		}
+
 		packet_increment_head(&po->tx_ring);
 		len_sum += tp_len;
 	} while (likely((ph != NULL) ||
@@ -2354,11 +2376,12 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 
 	err = len_sum;
 	goto out_put;
-
 out_status:
 	__packet_set_status(po, ph, status);
 	kfree_skb(skb);
 out_put:
+	if (packet_use_direct_xmit(po) && pending > 0)
+		__netdev_xmit_flush(dev, last_queue);
 	dev_put(dev);
 out:
 	mutex_unlock(&po->pg_vec_lock);
@@ -2561,6 +2584,8 @@  static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	if (unlikely(extra_len == 4))
 		skb->no_fcs = 1;
 
+	PACKET_SKB_CB(skb)->enforce_flush = 1;
+
 	err = po->xmit(skb);
 	if (err > 0 && (err = net_xmit_errno(err)) != 0)
 		goto out_unlock;
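
For completeness, the mode measured above (mmap'ed TX_RING plus
QDISC_BYPASS, i.e. what trafgen enables by default) is set up from user
space roughly as follows.  A minimal sketch: error handling is omitted
and the ring sizes are arbitrary illustrative values; it is not part of
this patch:

	#include <sys/socket.h>
	#include <sys/mman.h>
	#include <net/if.h>
	#include <linux/if_packet.h>

	int setup_tx_ring(const char *ifname)
	{
		struct tpacket_req req = {
			.tp_block_size = 4096,
			.tp_frame_size = 2048,
			.tp_block_nr   = 64,
			.tp_frame_nr   = 128,	/* block_nr * frames per block */
		};
		struct sockaddr_ll ll = {
			.sll_family  = AF_PACKET,
			.sll_ifindex = if_nametoindex(ifname),
		};
		int one = 1;
		int fd = socket(AF_PACKET, SOCK_RAW, 0);	/* TX only */

		/* skip the qdisc layer on TX (the QDISC_BYPASS mode above) */
		setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));

		/* set up and map the TX ring that tpacket_snd() drains */
		setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req));
		mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
		     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

		bind(fd, (struct sockaddr *)&ll, sizeof(ll));

		/* frames are written into the mapped ring and then kicked
		 * out with sendto(fd, NULL, 0, 0, NULL, 0); */
		return fd;
	}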