Message ID: 1483116853-45769-1-git-send-email-sowmini.varadhan@oracle.com
State: Changes Requested, archived
Delegated to: David Miller
On Fri, Dec 30, 2016 at 11:54 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> Although TPACKET_V3 Rx has some benefits over TPACKET_V2 Rx,
> *_v3 does not currently have Tx support. As a result an application
> that wants the best perf for Tx and Rx (e.g. to handle
> request/response transactions) ends up needing 2 sockets,
> one with *_v2 for Tx and another with *_v3 for Rx.
>
> This patch enables TPACKET_V2 compatible Tx features in TPACKET_V3
> so that _v3 supports at least as many features as _v2.

Once we define the interface as equivalent to v2, we cannot redefine it
to support v3-only features later.

> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  Documentation/networking/packet_mmap.txt |  6 ++++--
>  net/packet/af_packet.c                   | 23 +++++++++++++++--------
>  2 files changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
> index daa015a..6425062 100644
> --- a/Documentation/networking/packet_mmap.txt
> +++ b/Documentation/networking/packet_mmap.txt
> @@ -565,7 +565,7 @@ where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
>      (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
>
>  TPACKET_V2 --> TPACKET_V3:
> -	- Flexible buffer implementation:
> +	- Flexible buffer implementation for RX_RING:
>  		1. Blocks can be configured with non-static frame-size

This is one of the main advantages of the v3 interface, and also
relevant to Tx. The current implementation does not consult
tpacket3_hdr->tp_next_offset and would preclude adding that later.
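For context, the TX ring geometry the patch would accept is the same under v2 and v3: packet_set_ring() requires page-aligned blocks, TPACKET_ALIGNMENT-aligned frames, and a frame count that exactly fills the blocks. A minimal user-space sketch of those sanity checks (the helper name and the hard-coded 4096-byte page size are illustrative assumptions, not kernel code):

```c
#include <assert.h>

#define TPACKET_ALIGNMENT 16   /* matches the kernel's uapi constant */
#define PAGE_SZ 4096           /* assumed page size for this sketch */

/* Returns 1 if the ring parameters would pass packet_set_ring()'s
 * basic geometry checks, 0 otherwise. */
static int tx_ring_geometry_ok(unsigned int block_size, unsigned int block_nr,
                               unsigned int frame_size, unsigned int frame_nr)
{
    if (block_size == 0 || block_size % PAGE_SZ)
        return 0;                        /* blocks must be page-aligned */
    if (frame_size == 0 || frame_size % TPACKET_ALIGNMENT)
        return 0;                        /* frames are 16-byte aligned */
    unsigned int per_block = block_size / frame_size;
    if (per_block == 0)
        return 0;                        /* a frame must fit in a block */
    return frame_nr == per_block * block_nr;  /* totals must match */
}
```

A caller would validate its tpacket_req before issuing the PACKET_TX_RING setsockopt; the kernel rejects anything that fails these checks with -EINVAL.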
On (12/30/16 16:33), Willem de Bruijn wrote:
> Once we define the interface as equivalent to v2, we cannot redefine it to
> support v3-only features later.

What v3-only features do we think we want to support?

Tpacket_v3 went in commit f6fb8f100b807378fda19e83e5ac6828b638603a:
    Date: Fri Aug 19 10:18:16 2011 +0000
Since then, apps that want to use the Rx benefits have had to deal with
this dual-socket feature, i.e., "one socket for super-fast rx, zero Tx".
The zero-tx part sounds like a regression to me.

If we want to have something that does block Tx, and we cannot figure
out how to retrofit it to the existing APIs, we can always go for
TPACKET_V4.

> > TPACKET_V2 --> TPACKET_V3:
> > -	- Flexible buffer implementation:
> > +	- Flexible buffer implementation for RX_RING:
> >  		1. Blocks can be configured with non-static frame-size
>
> This is one of the main advantages of the v3 interface, and also

Sure, and we see some marginal benefits from this when we try to use it
for our apps. But the "marginal" part is not worth it if I have to use
separate sockets for tx and rx.

> relevant to Tx. The current implementation does not consult
> tpacket3_hdr->tp_next_offset and would preclude adding that
> later.

When is "later"? It's been 6+ years.

--Sowmini
>> Once we define the interface as equivalent to v2, we cannot redefine it to
>> support v3-only features later.
>
> What v3-only features do we think we want to support?

Variable length slots seems like the only one from that list that
makes sense on Tx.

It is already possible to prepare multiple buffers before triggering
transmit, so the block-based signal moderation is not very relevant.

> since then apps that want to use the Rx benefits
> have to deal with this dual socket feature, i.e.,
> "one socket for super-fast rx, zero Tx".
> The zero-tx part sounds like a regression to me.

What is the issue with using separate sockets that you are
having? I generally end up using that even with V2.

> If we want to have something that does block Tx,
> and we cannot figure out how to retrofit it to
> the existing APIs, we can always go for TPACKET_V4.

But the semantics for V3 are currently well defined. Calling something
V3, but using V2 semantics, is a somewhat unintuitive interface to me.
I don't see a benefit in defining something that does not implement
any new features. Especially if it blocks adding the expected
semantics later.

That said, if there is a workload that benefits from using a
single socket, and especially if it can be forward compatible with
support for variable sized slots, then I do not object. I was just
having a look at that second point, actually.

Could you also extend the TX_RING test in
tools/testing/selftests/net/psock_tpacket.c if there are no other
blocking issues?
On (12/30/16 18:39), Willem de Bruijn wrote:
> Variable length slots seems like the only one from that list that
> makes sense on Tx.
>
> It is already possible to prepare multiple buffers before triggering
> transmit, so the block-based signal moderation is not very relevant.

FWIW, here is our experience. In our use cases, the blocking on the RX
side comes quite naturally to the application (since, upon waking from
select(), we try to read as many requests as possible, until we run out
of buffers and/or input), but the response side is not batched today:
the server application sends out one response at a time, and trying to
change this would need additional batching intelligence in the server.
We are working on the latter part, but as you point out, we can prepare
multiple buffers before triggering transmit, so some variant of block
TX seems achievable.

Our response messages are usually well-defined multiples of PAGE_SIZE
(and we are able to set Jumbo MTU), so variable length slots are not
an issue we foresee (see additional comment on this below).

The block RX is interesting because it allows the server better control
over context-switches and system-calls. This is important because our
input request stream tends to be bursty - the senders (clients) of the
requests have to do some computationally intense work before sending
the next request, so being able to adjust the timeout for poll wakeup
at the server is a useful knob.

Having 2 sockets instead of one is unattractive because it just makes
the existing API more clumsy - today we are using UDP, RDS-TCP and
RDS-IB sockets, and all of this is built around a POSIX-like paradigm
of having some type of select(), sendmsg(), recvmsg() API with a single
socket. Even just extending this to also handle TPACKET_V2 (and
tracking the needed context) is messy. Having to convert all this to a
2-socket model would need significant perf justification, and we
haven't seen that justification in our micro-benchmarks yet.
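The "timeout for poll wakeup" knob mentioned above is TPACKET_V3's per-block retire timer, tp_retire_blk_tov, which bounds how long a partially filled block waits before being handed to user space. A sketch of filling the request struct for a bursty workload; the struct below mirrors struct tpacket_req3 from <linux/if_packet.h> so the example is self-contained, and the helper name and sizing choices are illustrative assumptions:

```c
#include <assert.h>

struct req3 {                        /* mirrors struct tpacket_req3 */
    unsigned int tp_block_size;
    unsigned int tp_block_nr;
    unsigned int tp_frame_size;
    unsigned int tp_frame_nr;
    unsigned int tp_retire_blk_tov;  /* block retire timeout, in ms */
    unsigned int tp_sizeof_priv;
    unsigned int tp_feature_req_word;
};

/* Build an RX ring request whose poll() latency is capped at tov_ms
 * even when a block is only partially filled. */
static struct req3 make_bursty_rx_ring(unsigned int tov_ms)
{
    struct req3 req = {0};
    req.tp_block_size = 1u << 22;    /* 4 MB blocks */
    req.tp_block_nr = 64;
    req.tp_frame_size = 2048;        /* a hint only; V3 rx frames vary */
    req.tp_frame_nr = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;
    req.tp_retire_blk_tov = tov_ms;  /* the wakeup knob for bursty input */
    return req;
}
```

Real code would pass this struct to setsockopt(fd, SOL_PACKET, PACKET_RX_RING, ...) on a TPACKET_V3 socket and then mmap() the ring.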
(And fwiw, the POSIX-like API with a single file descriptor for all I/O
is a major consideration, since the I/O can come from other sources
like disk, fs etc, and it's cleanest if we follow the same paradigm for
networking as well.)

> > since then apps that want to use the Rx benefits
> > have to deal with this dual socket feature, i.e.,
> > "one socket for super-fast rx, zero Tx".
> > The zero-tx part sounds like a regression to me.
>
> What is the issue with using separate sockets that you are
> having? I generally end up using that even with V2.

Why do you end up having to use 2 sockets with V2? That part
worked out quite nicely for my case (for a simple netserver-like
req/resp handler).

> But the semantics for V3 are currently well defined. Calling something
> V3, but using V2 semantics, is a somewhat unintuitive interface to me.

One fundamental part of tpacket that makes it attractive compared to
alternatives like netmap, dpdk etc is that the API follows the
semantics of the classic unix socket and fd APIs: support for the basic
select/sendmsg/recvmsg that works for everything until _V3.

> I don't see a benefit in defining something that does not implement
> any new features. Especially if it blocks adding the expected
> semantics later.

V3 removed the sendmsg feature. This patch puts back that feature.

> That said, if there is a workload that benefits from using a
> single socket, and especially if it can be forward compatible with
> support for variable sized slots, then I do not object. I was just
> having a look at that second point, actually.

Actually, I'm not averse to looking at extensions (or at least
place-holders) to allow variable sized slots - do you have any
suggestions?
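The single-fd paradigm argued for here is the classic poll()-driven loop: one descriptor, readiness for both directions. A runnable sketch of the pattern, using a pipe as a stand-in for the PF_PACKET socket since the real thing needs CAP_NET_RAW (helper name is mine; with the patch applied, fd would be one TPACKET_V3 socket carrying both RX_RING and TX_RING):

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Wait for readiness on a single fd; returns the revents mask,
 * or 0 on timeout/error. */
static short wait_one_fd(int fd, short events, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = events };
    int n = poll(&pfd, 1, timeout_ms);
    return n > 0 ? pfd.revents : 0;
}
```

In the request/response server described above, the loop would poll for POLLIN | POLLOUT on the one socket, drain the RX ring on POLLIN, and fill TX slots plus send() on POLLOUT, with no second descriptor to track.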
> Could you also extend the TX_RING test in
> tools/testing/selftests/net/psock_tpacket.c if there are no other
> blocking issues?

Sure, I can do that. Let me do this for patchv2.

--Sowmini
>> What is the issue with using separate sockets that you are
>> having? I generally end up using that even with V2.
>
> Why do you end up having to use 2 sockets with V2? That part
> worked out quite nicely for my case (for a simple netserver-like
> req/resp handler).

Yes, that should work fine, actually. I was thinking of multi-threaded
setups where using per-cpu tx sockets avoids cacheline bouncing, but
there is no reason to have a receive socket on each cpu. But that's
not a standard setup.

>> But the semantics for V3 are currently well defined. Calling something
>> V3, but using V2 semantics, is a somewhat unintuitive interface to me.
>
> One fundamental part of tpacket that makes it attractive compared to
> alternatives like netmap, dpdk etc is that the API follows the
> semantics of the classic unix socket and fd APIs: support for the basic
> select/sendmsg/recvmsg that works for everything until _V3.
>
>> I don't see a benefit in defining something that does not implement
>> any new features. Especially if it blocks adding the expected
>> semantics later.
>
> V3 removed the sendmsg feature. This patch puts back that feature.

You mean send ring, right? The sendmsg call works fine on a single
socket alongside an RX_RING until a TX_RING is installed.

>> That said, if there is a workload that benefits from using a
>> single socket, and especially if it can be forward compatible with
>> support for variable sized slots, then I do not object. I was just
>> having a look at that second point, actually.
>
> Actually, I'm not averse to looking at extensions (or at least
> place-holders) to allow variable sized slots - do you have any
> suggestions?

It is probably enough to just enforce that tp_next_offset is always
sane. Either that it points to the start of the next packet, even
though that currently can also be inferred from frame_size. Or, that
it is always zero unless the ring is to be interpreted as holding
variable sized frames. If the field is non-zero, drop the packet.
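The second option suggested above is a one-line check on the TX path: reject any frame whose tp_next_offset is non-zero, so the field stays reserved for a future variable-sized-slot mode without breaking old binaries. A sketch of that rule (the function name and the ACCEPT/DROP values are illustrative, not kernel code):

```c
#include <assert.h>

enum { TX_FRAME_ACCEPT = 0, TX_FRAME_DROP = 1 };

/* Forward-compatibility check for TPACKET_V3 TX frames: until
 * variable-sized slots are defined, user space must leave
 * tpacket3_hdr.tp_next_offset zero. Dropping anything else keeps the
 * field available for later semantics. */
static int check_tp_next_offset(unsigned int tp_next_offset)
{
    return tp_next_offset ? TX_FRAME_DROP : TX_FRAME_ACCEPT;
}
```

In the kernel this would live in the TX header-parsing path, next to the existing tp_len and offset validation, with the drop surfaced as TP_STATUS_WRONG_FORMAT on the frame.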
> As I mentioned before, the use-cases that I see
> do not need variable length slots, thus I have not thought
> too deeply about it. But if we think this may be needed in the
> future, can't it be accommodated by additional sockopts (or even a
> per-packet cmsghdr?) on top of V3?
>
>> Could you also extend the TX_RING test in
>> tools/testing/selftests/net/psock_tpacket.c if there are no other
>> blocking issues?
>
> Sure, I can do that. Let me do this for patchv2.

Great. Thanks!
On (12/30/16 23:59), Willem de Bruijn wrote:
> > Actually, I'm not averse to looking at extensions (or at least
> > place-holders) to allow variable sized slots - do you have any
> > suggestions?
>
> It is probably enough to just enforce that tp_next_offset is always
> sane. Either that it points to the start of the next packet, even
> though that currently can also be inferred from frame_size. Or, that
> it is always zero unless the ring is to be interpreted as holding
> variable sized frames. If the field is non-zero, drop the packet.

Ok, let me add this to patchv2, along with extra test cases.

--Sowmini
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index daa015a..6425062 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -565,7 +565,7 @@ where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
     (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
 
 TPACKET_V2 --> TPACKET_V3:
-	- Flexible buffer implementation:
+	- Flexible buffer implementation for RX_RING:
 		1. Blocks can be configured with non-static frame-size
 		2. Read/poll is at a block-level (as opposed to packet-level)
 		3. Added poll timeout to avoid indefinite user-space wait
@@ -574,7 +574,9 @@ where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
 		4.1 block::timeout
 		4.2 tpkt_hdr::sk_rxhash
 	- RX Hash data available in user space
-	- Currently only RX_RING available
+	- TX_RING semantics are conceptually similar to TPACKET_V2;
+	  use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
+	  instead of TPACKET2_HDRLEN.
 
 -------------------------------------------------------------------------------
 + AF_PACKET fanout mode
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 89f2e8c..57692af 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -409,6 +409,9 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		break;
 	case TPACKET_V3:
+		h.h3->tp_status = status;
+		flush_dcache_page(pgv_to_page(&h.h3->tp_status));
+		break;
 	default:
 		WARN(1, "TPACKET version not supported.\n");
 		BUG();
@@ -432,6 +435,8 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		return h.h2->tp_status;
 	case TPACKET_V3:
+		flush_dcache_page(pgv_to_page(&h.h3->tp_status));
+		return h.h3->tp_status;
 	default:
 		WARN(1, "TPACKET version not supported.\n");
 		BUG();
@@ -2500,6 +2505,9 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame,
 	ph.raw = frame;
 
 	switch (po->tp_version) {
+	case TPACKET_V3:
+		tp_len = ph.h3->tp_len;
+		break;
 	case TPACKET_V2:
 		tp_len = ph.h2->tp_len;
 		break;
@@ -2519,6 +2527,9 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame,
 	off_max = po->tx_ring.frame_size - tp_len;
 	if (po->sk.sk_type == SOCK_DGRAM) {
 		switch (po->tp_version) {
+		case TPACKET_V3:
+			off = ph.h3->tp_net;
+			break;
 		case TPACKET_V2:
 			off = ph.h2->tp_net;
 			break;
@@ -2528,6 +2539,9 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame,
 	} else {
 		switch (po->tp_version) {
+		case TPACKET_V3:
+			off = ph.h3->tp_mac;
+			break;
 		case TPACKET_V2:
 			off = ph.h2->tp_mac;
 			break;
@@ -4116,11 +4130,6 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 	struct tpacket_req *req = &req_u->req;
 
 	lock_sock(sk);
-	/* Opening a Tx-ring is NOT supported in TPACKET_V3 */
-	if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {
-		net_warn_ratelimited("Tx-ring is not supported.\n");
-		goto out;
-	}
 
 	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
 	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
@@ -4180,9 +4189,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 			goto out;
 		switch (po->tp_version) {
 		case TPACKET_V3:
-		/* Transmit path is not supported. We checked
-		 * it above but just being paranoid
-		 */
+			/* Block transmit is not supported yet */
 			if (!tx_ring)
 				init_prb_bdqc(po, rb, pg_vec, req_u);
 			break;
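The user-space side this patch enables is the usual TX-ring slot walk: claim a TP_STATUS_AVAILABLE slot, copy the packet in, flip it to TP_STATUS_SEND_REQUEST, then kick the kernel with send()/sendto(). The sketch below uses a plain in-memory mock slot (a simplified stand-in for tpacket3_hdr plus payload) so the ownership handoff can be shown without a PF_PACKET socket; the status values match <linux/if_packet.h>, everything else is illustrative:

```c
#include <assert.h>
#include <string.h>

#define TP_STATUS_AVAILABLE     0   /* slot owned by user space */
#define TP_STATUS_SEND_REQUEST  1   /* slot handed to the kernel */

struct v3_slot {                    /* mock of tpacket3_hdr + payload */
    unsigned int tp_status;
    unsigned int tp_len;
    unsigned char data[2048];
};

/* Queue one packet into the ring; returns the slot index used,
 * or -1 if the next slot is still owned by the kernel (ring full). */
static int tx_enqueue(struct v3_slot *ring, int nslots, int *next,
                      const void *pkt, unsigned int len)
{
    struct v3_slot *s = &ring[*next];
    if (s->tp_status != TP_STATUS_AVAILABLE || len > sizeof(s->data))
        return -1;                         /* kernel hasn't drained it yet */
    memcpy(s->data, pkt, len);
    s->tp_len = len;
    s->tp_status = TP_STATUS_SEND_REQUEST; /* hand ownership to the kernel */
    int used = *next;
    *next = (*next + 1) % nslots;          /* advance the ring cursor */
    return used;
}
```

Against a real ring, the status word lives at the start of each mmap()ed frame and the kernel flips it back to TP_STATUS_AVAILABLE once the frame has been transmitted, which is what makes batching several tx_enqueue() calls before a single send() kick cheap.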
Although TPACKET_V3 Rx has some benefits over TPACKET_V2 Rx,
*_v3 does not currently have Tx support. As a result an application
that wants the best perf for Tx and Rx (e.g. to handle
request/response transactions) ends up needing 2 sockets,
one with *_v2 for Tx and another with *_v3 for Rx.

This patch enables TPACKET_V2 compatible Tx features in TPACKET_V3
so that _v3 supports at least as many features as _v2.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 Documentation/networking/packet_mmap.txt |  6 ++++--
 net/packet/af_packet.c                   | 23 +++++++++++++++--------
 2 files changed, 19 insertions(+), 10 deletions(-)