diff mbox

[net-next] af_packet: Provide a TPACKET_V2 compatible Tx path for TPACKET_V3

Message ID 1483116853-45769-1-git-send-email-sowmini.varadhan@oracle.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Sowmini Varadhan Dec. 30, 2016, 4:54 p.m. UTC
Although TPACKET_V3 Rx has some benefits over TPACKET_V2 Rx,
*_v3 does not currently have Tx support. As a result an application
that wants the best perf for Tx and Rx (e.g. to handle
request/response transacations) ends up needing 2 sockets,
one with *_v2 for Tx and another with *_v3 for Rx.

This patch enables TPACKET_V2 compatible Tx features in TPACKET_V3
so that _v3 supports at least as many features as _v2.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 Documentation/networking/packet_mmap.txt |    6 ++++--
 net/packet/af_packet.c                   |   23 +++++++++++++++--------
 2 files changed, 19 insertions(+), 10 deletions(-)

Comments

Willem de Bruijn Dec. 30, 2016, 9:33 p.m. UTC | #1
On Fri, Dec 30, 2016 at 11:54 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> Although TPACKET_V3 Rx has some benefits over TPACKET_V2 Rx,
> *_v3 does not currently have Tx support. As a result an application
> that wants the best perf for Tx and Rx (e.g. to handle
> request/response transacations) ends up needing 2 sockets,
> one with *_v2 for Tx and another with *_v3 for Rx.
>
> This patch enables TPACKET_V2 compatible Tx features in TPACKET_V3
> so that _v3 supports at least as many features as _v2.

Once we define the interface as equivalent to v2, we cannot redefine it to
support v3-only features later.

> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  Documentation/networking/packet_mmap.txt |    6 ++++--
>  net/packet/af_packet.c                   |   23 +++++++++++++++--------
>  2 files changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
> index daa015a..6425062 100644
> --- a/Documentation/networking/packet_mmap.txt
> +++ b/Documentation/networking/packet_mmap.txt
> @@ -565,7 +565,7 @@ where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
>                    (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
>
>  TPACKET_V2 --> TPACKET_V3:
> -       - Flexible buffer implementation:
> +       - Flexible buffer implementation for RX_RING:
>                 1. Blocks can be configured with non-static frame-size

This is one of the main advantages of the v3 interface, and also
relevant to Tx. The current implementation does not consult
tpacket3_hdr->tp_next_offset and would preclude adding that
later.
Sowmini Varadhan Dec. 30, 2016, 11:03 p.m. UTC | #2
On (12/30/16 16:33), Willem de Bruijn wrote:
> 
> Once we define the interface as equivalent to v2, we cannot redefine it to
> support v3-only features later.

What v3 only features do we think we want to support?

Tpacket_v3 went in 

  commit f6fb8f100b807378fda19e83e5ac6828b638603a
  :
  Date:   Fri Aug 19 10:18:16 2011 +0000

since then apps that want to use the Rx benefits
have to deal with this dual socket feature, where
with "one socket for super-fast rx, zero Tx".
The zero-tx part sounds like a regression to me.

If we want to have something that does block Tx,
and we cannot figure out how to retro-fit it to 
the exisiting APIs, we can always go for TPACKET_V4. 


> >  TPACKET_V2 --> TPACKET_V3:
> > -       - Flexible buffer implementation:
> > +       - Flexible buffer implementation for RX_RING:
> >                 1. Blocks can be configured with non-static frame-size
> 
> This is one of the main advantages of the v3 interface, and also

sure, and we see some marginal benefits for this, when
we try to use it for our apps. But the "marginal" part
is not worth it, if I have to use separate sockets for tx and rx.

> relevant to Tx. The current implementation does not consult
> tpacket3_hdr->tp_next_offset and would preclude adding that
> later.

When is "later"? its been 6+ years.

--Sowmini
Willem de Bruijn Dec. 30, 2016, 11:39 p.m. UTC | #3
>> Once we define the interface as equivalent to v2, we cannot redefine it to
>> support v3-only features later.
>
> What v3 only features do we think we want to support?

Variable length slots seems like the only one from that list that
makes sense on Tx.

It is already possible to prepare multiple buffers before triggering
transmit, so the block-based signal moderation is not very relevant.

> since then apps that want to use the Rx benefits
> have to deal with this dual socket feature, where
> with "one socket for super-fast rx, zero Tx".
> The zero-tx part sounds like a regression to me.

What is the issue with using separate sockets that you are
having? I generally end up using that even with V2.

> If we want to have something that does block Tx,
> and we cannot figure out how to retro-fit it to
> the exisiting APIs, we can always go for TPACKET_V4.

But the semantics for V3 are currenty well defined. Calling something
V3, but using V2 semantics is a somewhat unintuitive interface to me.

I don't see a benefit in defining something that does not implement
any new features. Especially if it blocks adding the expected
semantics later.

That said, if there is a workload that benefits from using a
single socket, and especially if it can be forward compatible with
support for variable sized slots, then I do not object. I was just
having a look at that second point, actually.

Could you also extend the TX_RING test in
tools/testing/selftests/net/psock_tpacket.c if there are no other
blocking issues?
Sowmini Varadhan Dec. 31, 2016, 12:48 a.m. UTC | #4
On (12/30/16 18:39), Willem de Bruijn wrote:
> 
> Variable length slots seems like the only one from that list that
> makes sense on Tx.
> 
> It is already possible to prepare multiple buffers before triggering
> transmit, so the block-based signal moderation is not very relevant.

FWIW, here is our experience

In our use cases, the blocking on the RX side comes quite naturally
to the application (since, upon waking from select(), we try 
to read as many requests as possible, until we run out of buffers
and/or input), but the response side is not batched today: the server
application sends out one response at a time, and trying to change
this would need additional batching-intelligence in the server. We
are working on the latter part, but as you point out, we can  prepare
multiple buffers before triggering transmit, so some variant of block
TX seems achievable.

Our response messages are usually well-defined multiples of PAGE_SIZE,
(and we are able to set Jumbo MTU) so the variable length slots is not
an issue we foresee (see additional comment on this below).

The block RX is interesting because it allows the server better control
over context-switches and system-calls. This is important because our
input request stream tends to be bursty - the senders (clients) of the
request have to do some computationally intense work before sending the
next request, so being able to adjust the timeout for poll wakeup at
the server is a useful knob.

Having 2 sockets instead of one is unattractive because it just makes
the existing API more clumsy - today  we are using UDP, RDS-TCP and
RDS-IB sockets, and all of this is built around a POSIX-like paradigm
of having some type of select(), sendmsg(), recvmsg() API with a single
socket. Even just extending this to also handle TPACKET_V2 (and tracking
needed context) is messy. Having to convert all this to a 2-socket model
would need significant perf justification, and we havent seen that
justification in our micro-benchmarks yet.

(and fwiw, the POSIX-like API with a single file desc for all I/O
is a major consideration, since the I/O can come from other sources 
like disk, fs etc, and it's cleanest if we follow the same paradigm
for networking as well)

> > since then apps that want to use the Rx benefits
> > have to deal with this dual socket feature, where
> > with "one socket for super-fast rx, zero Tx".
> > The zero-tx part sounds like a regression to me.
> 
> What is the issue with using separate sockets that you are
> having? I generally end up using that even with V2.

Why do you end up having to use 2 sockets with V2? That part
worked out quite nicely for my case (for a simple netserver like 
req/resp handler).

> But the semantics for V3 are currenty well defined. Calling something
> V3, but using V2 semantics is a somewhat unintuitive interface to me.

One fundamental part of tpacket that makes it attractive to 
alternatives like netmap, dpdk etc is that the API follows the
semantics of the classic  unix socket and fd APIs: support for basic
select/sendmsg/recvmsg that work for everything until _V3. 

> I don't see a benefit in defining something that does not implement
> any new features. Especially if it blocks adding the expected
> semantics later.

V3 removed the sendmsg feature.  This patch puts back that feature.

> That said, if there is a workload that benefits from using a
> single socket, and especially if it can be forward compatible with
> support for variable sized slots, then I do not object. I was just
> having a look at that second point, actually.

Actually I'm not averse to looking at extensions (or at least,
place-holders) to allow variable sized slots- do you have any
suggestions? As I mentioned before, the use-cases that I see
do not need variable length slots, thus I have not thought
too deeply about it. But if we think this may be needed in the 
future can't it be accomodated by additional sockopts (or even 
per-packet cmsghdr?) on top of V3? 

> Could you also extend the TX_RING test in
> tools/testing/selftests/net/psock_tpacket.c if there are no other
> blocking issues?

sure, I can do that. Let me do this for patchv2. 

--Sowmini
Willem de Bruijn Dec. 31, 2016, 4:59 a.m. UTC | #5
>> What is the issue with using separate sockets that you are
>> having? I generally end up using that even with V2.
>
> Why do you end up having to use 2 sockets with V2? That part
> worked out quite nicely for my case (for a simple netserver like
> req/resp handler).

Yes, that should work fine, actually. I was thinking of multi-threaded
setups where using per-cpu tx sockets avoids cacheline bouncing, but
there is no reason to have a receive socket on each cpu. But that's
not a standard setup.

>> But the semantics for V3 are currenty well defined. Calling something
>> V3, but using V2 semantics is a somewhat unintuitive interface to me.
>
> One fundamental part of tpacket that makes it attractive to
> alternatives like netmap, dpdk etc is that the API follows the
> semantics of the classic  unix socket and fd APIs: support for basic
> select/sendmsg/recvmsg that work for everything until _V3.
>
>> I don't see a benefit in defining something that does not implement
>> any new features. Especially if it blocks adding the expected
>> semantics later.
>
> V3 removed the sendmsg feature.  This patch puts back that feature.

You mean send ring, right? The sendmsg call works fine on a single socket
alongside an RX_RING until a TX_RING is installed.

>> That said, if there is a workload that benefits from using a
>> single socket, and especially if it can be forward compatible with
>> support for variable sized slots, then I do not object. I was just
>> having a look at that second point, actually.
>
> Actually I'm not averse to looking at extensions (or at least,
> place-holders) to allow variable sized slots- do you have any
> suggestions?

It is probably enough to just enforce that tp_next_offset is always
sane. Either that it points to the start of the next packet, even
though that currently can also be inferred from frame_size. Or, that
it is always zero unless the ring is to be interpreted as holding
variable sized frames. If the field is non-zero, drop the packet.

> As I mentioned before, the use-cases that I see
> do not need variable length slots, thus I have not thought
> too deeply about it. But if we think this may be needed in the
> future can't it be accomodated by additional sockopts (or even
> per-packet cmsghdr?) on top of V3?
>
>> Could you also extend the TX_RING test in
>> tools/testing/selftests/net/psock_tpacket.c if there are no other
>> blocking issues?
>
> sure, I can do that. Let me do this for patchv2.

Great. Thanks!
Sowmini Varadhan Dec. 31, 2016, 12:21 p.m. UTC | #6
On (12/30/16 23:59), Willem de Bruijn wrote:
> > Actually I'm not averse to looking at extensions (or at least,
> > place-holders) to allow variable sized slots- do you have any
> > suggestions?
> 
> It is probably enough to just enforce that tp_next_offset is always
> sane. Either that it points to the start of the next packet, even
> though that currently can also be inferred from frame_size. Or, that
> it is always zero unless the ring is to be interpreted as holding
> variable sized frames. If the field is non-zero, drop the packet.

Ok, let me add this to patchv2, along with extra test cases.

--Sowmini
diff mbox

Patch

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index daa015a..6425062 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -565,7 +565,7 @@  where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
 		   (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
 
 TPACKET_V2 --> TPACKET_V3:
-	- Flexible buffer implementation:
+	- Flexible buffer implementation for RX_RING:
 		1. Blocks can be configured with non-static frame-size
 		2. Read/poll is at a block-level (as opposed to packet-level)
 		3. Added poll timeout to avoid indefinite user-space wait
@@ -574,7 +574,9 @@  where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
 			4.1 block::timeout
 			4.2 tpkt_hdr::sk_rxhash
 	- RX Hash data available in user space
-	- Currently only RX_RING available
+	- TX_RING semantics are conceptually similar to TPACKET_V2;
+	  use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
+	  instead of TPACKET2_HDRLEN.
 
 -------------------------------------------------------------------------------
 + AF_PACKET fanout mode
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 89f2e8c..57692af 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -409,6 +409,9 @@  static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		break;
 	case TPACKET_V3:
+		h.h3->tp_status = status;
+		flush_dcache_page(pgv_to_page(&h.h3->tp_status));
+		break;
 	default:
 		WARN(1, "TPACKET version not supported.\n");
 		BUG();
@@ -432,6 +435,8 @@  static int __packet_get_status(struct packet_sock *po, void *frame)
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		return h.h2->tp_status;
 	case TPACKET_V3:
+		flush_dcache_page(pgv_to_page(&h.h3->tp_status));
+		return h.h3->tp_status;
 	default:
 		WARN(1, "TPACKET version not supported.\n");
 		BUG();
@@ -2500,6 +2505,9 @@  static int tpacket_parse_header(struct packet_sock *po, void *frame,
 	ph.raw = frame;
 
 	switch (po->tp_version) {
+	case TPACKET_V3:
+		tp_len = ph.h3->tp_len;
+		break;
 	case TPACKET_V2:
 		tp_len = ph.h2->tp_len;
 		break;
@@ -2519,6 +2527,9 @@  static int tpacket_parse_header(struct packet_sock *po, void *frame,
 		off_max = po->tx_ring.frame_size - tp_len;
 		if (po->sk.sk_type == SOCK_DGRAM) {
 			switch (po->tp_version) {
+			case TPACKET_V3:
+				off = ph.h3->tp_net;
+				break;
 			case TPACKET_V2:
 				off = ph.h2->tp_net;
 				break;
@@ -2528,6 +2539,9 @@  static int tpacket_parse_header(struct packet_sock *po, void *frame,
 			}
 		} else {
 			switch (po->tp_version) {
+			case TPACKET_V3:
+				off = ph.h3->tp_mac;
+				break;
 			case TPACKET_V2:
 				off = ph.h2->tp_mac;
 				break;
@@ -4116,11 +4130,6 @@  static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 	struct tpacket_req *req = &req_u->req;
 
 	lock_sock(sk);
-	/* Opening a Tx-ring is NOT supported in TPACKET_V3 */
-	if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {
-		net_warn_ratelimited("Tx-ring is not supported.\n");
-		goto out;
-	}
 
 	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
 	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
@@ -4180,9 +4189,7 @@  static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 			goto out;
 		switch (po->tp_version) {
 		case TPACKET_V3:
-		/* Transmit path is not supported. We checked
-		 * it above but just being paranoid
-		 */
+			/* Block transmit is not supported yet */
 			if (!tx_ring)
 				init_prb_bdqc(po, rb, pg_vec, req_u);
 			break;