
[RFC,net-next,1/6] sock: MSG_PEEK support for sk_error_queue

Message ID 05d060dc1169649d84c37ad51b0f8fe54a2a3185.1516147540.git.sowmini.varadhan@oracle.com
State RFC, archived
Delegated to: David Miller
Series rds: zerocopy support

Commit Message

Sowmini Varadhan Jan. 17, 2018, 12:19 p.m. UTC
Allow the application to use MSG_PEEK with sk_error_queue
so that it can peek and re-read the message in cases where MSG_TRUNC
may be encountered.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 drivers/net/tun.c      |    2 +-
 include/net/sock.h     |    2 +-
 net/core/sock.c        |    7 +++++--
 net/packet/af_packet.c |    3 ++-
 4 files changed, 9 insertions(+), 5 deletions(-)
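
A minimal userspace sketch of the intended usage (buffer sizes, includes and
the retry step are illustrative assumptions, not part of the patch):

#include <sys/socket.h>
#include <sys/uio.h>

static void read_errqueue_notification(int fd)
{
	char data[64], control[64];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = control, .msg_controllen = sizeof(control),
	};

	/* With MSG_PEEK the notification skb is requeued, not freed. */
	if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_PEEK) < 0)
		return;
	if (msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC)) {
		/* buffers were too small: enlarge data/control here
		 * (not shown) before consuming the message for real */
	}
	msg.msg_controllen = sizeof(control);	/* reset before re-reading */
	recvmsg(fd, &msg, MSG_ERRQUEUE);	/* this dequeue frees the skb */
}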

Comments

Willem de Bruijn Jan. 17, 2018, 11:50 p.m. UTC | #1
On Wed, Jan 17, 2018 at 7:19 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> Allow the application to use MSG_PEEK with sk_error_queue
> so that it can peek and re-read the message in cases where MSG_TRUNC
> may be encountered.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>

>  int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len,
> -                      int level, int type)
> +                      int level, int type, int flags)
>  {
>         struct sock_exterr_skb *serr;
>         struct sk_buff *skb;
> @@ -2916,7 +2916,10 @@ int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len,
>         err = copied;
>
>  out_free_skb:
> -       kfree_skb(skb);
> +       if (likely(!(flags & MSG_PEEK)))
> +               kfree_skb(skb);
> +       else
> +               skb_queue_head(&sk->sk_error_queue, skb);

This can cause reordering with parallel readers. Can we avoid the need
for peeking? It also caused a slew of subtle bugs previously.

How about just define a max number of cookies and require the caller
to always read with sufficient room to hold them?
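
A rough illustration of that suggestion from the reader's side (MAX_ZC_COOKIES
is an assumed constant, not existing ABI): if a notification never carries more
than a fixed number of cookies, the caller can always offer enough room and the
truncation/peek problem goes away.

#define MAX_ZC_COOKIES 16	/* assumed per-notification upper bound */

uint32_t cookies[MAX_ZC_COOKIES];
struct iovec iov = { .iov_base = cookies, .iov_len = sizeof(cookies) };
/* hand &iov to the msghdr passed to recvmsg(fd, &msg, MSG_ERRQUEUE);
 * with this bound the data portion can never be truncated */
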
Sowmini Varadhan Jan. 18, 2018, 11:02 a.m. UTC | #2
On (01/17/18 18:50), Willem de Bruijn wrote:
> 
> This can cause reordering with parallel readers. Can we avoid the need
> for peeking? It also caused a slew of subtle bugs previously.

Yes, I did notice the potential for re-ordering when writing the patch,
but these are not actually messages from the wire, so is re-ordering
fatal?

In general, I'm not particularly attached to this solution - in my
testing, I'm seeing that it's possible to reduce the latency and still
take a hit on the throughput if the application does not reap the
completion notification (and send out new data) efficiently.

Some (radically different) alternatives that were suggested to me:

- send up all the cookies as ancillary data with recvmsg (i.e., send
  them as cmsg data along with actual data from the wire). In most
  cases, the application has data to read anyway. If it doesn't (pure
  sender), we could wake up recvmsg with 0 bytes of data, but with
  the cookie info in the ancillary data. This feels not-so-elegant
  to me, but I suppose it would have the benefit of optimizing on
  the syscall overhead (and you could use MSG_CTRUNC to handle
  the case of insufficient buffer space for cookies, sending the rest
  on the next call). (A rough sketch of this follows the list below.)
- allow the application to use a setsockopt on the RDS socket, with
  some shmem region into which the kernel could write the cookies,
  letting the application reap cookies from that shmem region without
  syscall overhead.
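
A hypothetical userspace sketch of the first alternative (cookies delivered as
a cmsg alongside regular data). The cmsg type and cookie layout are made up for
illustration; SOL_RDS may not be exported by libc headers, hence the fallback
define.

#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#define RDS_CMSG_ZCOPY_COOKIES 12	/* hypothetical cmsg type for this sketch */

static void reap_cookies(int fd)
{
	char data[4096];
	char control[CMSG_SPACE(16 * sizeof(uint32_t))];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = control, .msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, 0) < 0)
		return;
	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		uint32_t *cookies;
		size_t n;

		if (cm->cmsg_level != SOL_RDS ||
		    cm->cmsg_type != RDS_CMSG_ZCOPY_COOKIES)
			continue;
		cookies = (uint32_t *)CMSG_DATA(cm);
		n = (cm->cmsg_len - CMSG_LEN(0)) / sizeof(uint32_t);
		/* release the zerocopy buffers named by cookies[0..n) */
	}
	if (msg.msg_flags & MSG_CTRUNC) {
		/* control buffer was too small; under this scheme the
		 * remaining cookies would arrive on a later recvmsg() */
	}
}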

> How about just define a max number of cookies and require the caller
> to always read with sufficient room to hold them?

This may be "good enough" as well - maybe allow a max of (say) 16 cookies,
and set up the skbs in the error queue to send up batches of 16 cookies
at a time?

--Sowmini
Eric Dumazet Jan. 18, 2018, 3:51 p.m. UTC | #3
On Wed, 2018-01-17 at 04:19 -0800, Sowmini Varadhan wrote:
> Allow the application to use MSG_PEEK with sk_error_queue
> so that it can peek and re-read the message in cases where MSG_TRUNC
> may be encountered.
> 
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>


Let's not add a buggy feature that only fuzzers love to use to trigger
awful bugs.

No serious (performance-sensitive) application uses MSG_PEEK, because of
the extra syscall overhead.
Eric Dumazet Jan. 18, 2018, 3:54 p.m. UTC | #4
On Thu, 2018-01-18 at 06:02 -0500, Sowmini Varadhan wrote:
> On (01/17/18 18:50), Willem de Bruijn wrote:
> > 
> > This can cause reordering with parallel readers. Can we avoid the need
> > for peeking? It also caused a slew of subtle bugs previously.
> 
> Yes, I did notice the potential for re-ordering when writing the patch,
> but these are not actually messages from the wire, so is re-ordering
> fatal?

Some applications out there would break horribly, trust me.
Sowmini Varadhan Jan. 18, 2018, 4:10 p.m. UTC | #5
On (01/18/18 07:54), Eric Dumazet wrote:
> 
> Some applications out there would break horribly, trust me.
> 

so I'm not particularly attached to that solution, and I appreciate
the wisdom (and the NACK), but let's try to find a useful alternative.

The current zcopy completion notification mechanism involves syscall
overhead already, and is also inadequate for threaded applications sharing
an fd. Plus it won't work for datagram sockets.

I'm fine with Willem's suggestion of passing a fixed number of cookies
as ancillary data (with MSG_CTRUNC to denote an inadequate buffer), but if we
are really so thrifty about syscall overhead, we should not be using
sk_error_queue in the first place - perhaps we can pass up the completion
notification as ancillary data with recvmsg() on the POLLIN channel
itself (which is weird if there is no data to recv, and only ancillary
info to pass up, but hey, we are "performant"!).

--Sowmini
Eric Dumazet Jan. 18, 2018, 4:53 p.m. UTC | #6
On Thu, 2018-01-18 at 11:10 -0500, Sowmini Varadhan wrote:
> On (01/18/18 07:54), Eric Dumazet wrote:
> > 
> > Some applications out there would break horribly, trust me.
> > 
> 
> so I'm not particularly attached to that solution, and I appreciate
> the wisdom (and the NACK), but let's try to find a useful alternative.
> 
> The current zcopy completion notification mechanism involves syscall
> overhead already, and is also inadequate for threaded applications sharing
> an fd. Plus it won't work for datagram sockets.
> 
> I'm fine with Willem's suggestion of passing a fixed number of cookies
> as ancillary data (with MSG_CTRUNC to denote an inadequate buffer), but if we
> are really so thrifty about syscall overhead, we should not be using
> sk_error_queue in the first place - perhaps we can pass up the completion
> notification as ancillary data with recvmsg() on the POLLIN channel
> itself (which is weird if there is no data to recv, and only ancillary
> info to pass up, but hey, we are "performant"!).

The thing is : MSG_PEEK 'support' will also need SO_PEEK_OFF support.

And so begins the crazy stuff.

So let's properly design things, and not re-use legacy stuff that has
proven to be not multi-thread ready and too complex.

If you want to design a new channel of communication, do it, and
maintain it.
Sowmini Varadhan Jan. 18, 2018, 5:12 p.m. UTC | #7
On (01/18/18 08:53), Eric Dumazet wrote:
> 
> The thing is : MSG_PEEK 'support' will also need SO_PEEK_OFF support.

sure, I'll drop the MSG_PEEK idea (which I wasn't very thrilled
about anyway)

> So let's properly design things, and not re-use legacy stuff that has
> proven to be not multi-thread ready and too complex.
> 
> If you want to design a new channel of communication, do it, and
> maintain it.

My instinct is to go with the fixed-size ancillary data, which itself
allows two options:

1. cmsg_data has a sock_extended_err preamble
   with ee_origin = SO_EE_ORIGIN_ZEROCOPY_COOKIE (or similar),
   and the ee_data is an array of 32 bit cookies (can pack at most 8 
   32-bit cookies, if we want to pack this into an skb->cb)

   Using the sock_extended_err as preamble will allow this to be usable by
   existing tcp zcopy applications (they can use the ee_origin to find
   out if this is a batch of cookies or the existing hi/lo values).
   (A rough layout sketch follows this list.)

2. If we have the option of passing completion-notification up as ancillary 
   data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE)
   we don't have to try to retain "backward compat" with the
   SO_EE_ORIGIN_ZEROCOPY API: we can just use a completely new data
   struct for the notification and potentially pack more cookies into
   48 bytes (RDS could be the first guinea pig for this - it doesn't even
   have to be done across all protocol families on day 1).
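
A rough layout sketch of option 1. The origin value, placing the cookie array
right after the sock_extended_err preamble, and carrying the count in ee_data
are all assumptions about the proposal, not existing ABI.

#include <stdint.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

#define SO_EE_ORIGIN_ZEROCOPY_COOKIE 6	/* proposed value, not in uapi */

static void parse_zcopy_notification(struct cmsghdr *cm)
{
	struct sock_extended_err *serr =
			(struct sock_extended_err *)CMSG_DATA(cm);

	if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY_COOKIE) {
		uint32_t *cookies = (uint32_t *)(serr + 1);
		uint32_t n = serr->ee_data;	/* at most 8 if kept in skb->cb */

		/* reap cookies[0..n) */
	} else if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
		/* existing completion range: [ee_info, ee_data] */
	}
}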

I think the shmem channel suggestion would be an optional optimization
that can be added later - it may not even be necessary, since most
applications will likely be sending *and* receiving data, so passing up
cookies with recvmsg should be "good enough" to save syscall overhead
for the common case.

I can work on #2, if there are no objections to it.

--Sowmini
Willem de Bruijn Jan. 18, 2018, 10:54 p.m. UTC | #8
On Thu, Jan 18, 2018 at 12:12 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/18/18 08:53), Eric Dumazet wrote:
>>
>> The thing is : MSG_PEEK 'support' will also need SO_PEEK_OFF support.
>
> sure, I'll drop the MSG_PEEK idea (which I wasn't very thrilled
> about anyway)
>
>> So let's properly design things, and not re-use legacy stuff that has
>> proven to be not multi-thread ready and too complex.
>>
>> If you want to design a new channel of communication, do it, and
>> maintain it.
>
> My instinct is to go with the fixed-size ancillary data, which itself
> allows two options:
>
> 1. cmsg_data has a sock_extended_err preamble
>    with ee_origin = SO_EE_ORIGIN_ZEROCOPY_COOKIE (or similar),
>    and the ee_data is an array of 32 bit cookies (can pack at most 8
>    32-bit cookies, if we want to pack this into an skb->cb)
>
>    Using the sock_extended_err as preamble will allow this to be usable by
>    existing tcp zcopy applications (they can use the ee_origin to find
>    out if this is a batch of cookies or the existing hi/lo values).
>
> 2. If we have the option of passing completion-notification up as ancillary
>    data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE)

This assumes a somewhat symmetric workload, where there are enough recv
calls to reap the notification associated with the send calls.

>    we don't have to try to retain "backward compat" with the
>    SO_EE_ORIGIN_ZEROCOPY API: we can just use a completely new data
>    struct for the notification and potentially pack more cookies into
>    48 bytes (RDS could be the first guinea pig for this - it doesn't even
>    have to be done across all protocol families on day 1).
>
> I think the shmem channel suggestion would be an optional optimization
> that can be added later- it may not even be necessary, since most
> applications will likely be sending *and* receiving data, so passing up
> cookies with recvmsg should be "good enough" to save syscall overhead
> for the common case.
>
> I can work #2, if there are no objections to it.

I would stay with MSG_ERRQUEUE processing. One option is to pass data
up to userspace in the data portion of the notification skb instead of
encoding it in ancillary data, like tcp_get_timestamping_opt_stats.
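
A hedged kernel-side sketch of that approach: put the cookie batch in the
notification skb's data portion (the way tcp_get_timestamping_opt_stats()
returns stats in an skb), so userspace reads it through the msg_iov of the
MSG_ERRQUEUE recvmsg(). The constant, function name and error-origin choice
are assumptions.

#include <linux/errqueue.h>
#include <linux/skbuff.h>
#include <net/sock.h>

#define MAX_ZC_COOKIES 16

static int queue_zcopy_cookie_skb(struct sock *sk, const u32 *cookies, int n)
{
	struct sock_exterr_skb *serr;
	struct sk_buff *skb;

	skb = alloc_skb(MAX_ZC_COOKIES * sizeof(u32), GFP_ATOMIC);
	if (!skb)
		return -ENOMEM;
	/* cookies travel in the skb data, read via msg_iov on recvmsg */
	skb_put_data(skb, cookies, n * sizeof(u32));

	serr = SKB_EXT_ERR(skb);
	memset(serr, 0, sizeof(*serr));
	serr->ee.ee_errno = 0;
	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;	/* or a new origin */
	serr->ee.ee_data = n;				/* cookie count */

	return sock_queue_err_skb(sk, skb);
}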
Sowmini Varadhan Jan. 18, 2018, 11:03 p.m. UTC | #9
On (01/18/18 17:54), Willem de Bruijn wrote:
> > 2. If we have the option of passing completion-notification up as ancillary
> >    data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE)
> 
> This assumes a somewhat symmetric workload, where there are enough recv
> calls to reap the notification associated with the send calls.

Your comment about the assumption is true, but at least for the database
use-cases, we have a request-response model, so the assumption works out.
I don't know whether many other workloads that send large buffers have this
pattern.

> I would stay with MSG_ERRQUEUE processing. One option is to pass data
> up to userspace in the data portion of the notification skb instead of
> encoding it in ancillary data, like tcp_get_timestamping_opt_stats.

that's similar to what I have, except that it does not have the
MSG_PEEK part (you'd need to enforce that the data portion
is upper-bounded, and that the application has the responsibility
of sending down "enough" buffer with recvmsg). 

Note that any one of these choices is OK with me - I have no
special attachment to any of them.

--Sowmini
Willem de Bruijn Jan. 18, 2018, 11:09 p.m. UTC | #10
On Thu, Jan 18, 2018 at 6:03 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/18/18 17:54), Willem de Bruijn wrote:
>> > 2. If we have the option of passing completion-notification up as ancillary
>> >    data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE)
>>
>> This assumes a somewhat symmetric workload, where there are enough recv
>> calls to reap the notification associated with the send calls.
>
> Your comment about the assumption is true, but at least for the database
> use-cases, we have a request-response model, so the assumption works out.
> I don't know whether many other workloads that send large buffers have this
> pattern.

If that is true in general for PF_RDS, then it is a reasonable approach.
How about treating it as a (follow-on) optimization path. Opportunistic
piggybacking of notifications on data reads is more widely applicable.

>
>> I would stay with MSG_ERRQUEUE processing. One option is to pass data
>> up to userspace in the data portion of the notification skb instead of
>> encoding it in ancillary data, like tcp_get_timestamping_opt_stats.
>
> that's similar to what I have, except that it does not have the
> MSG_PEEK part (you'd need to enforce that the data portion
> is upper-bounded, and that the application has the responsibility
> of sending down "enough" buffer with recvmsg).

Right. I think that an upper bound is the simplest solution here.

By the way, if you allocate an skb immediately on page pinning, then
there are always sufficient skbs to store all notifications. On errqueue
enqueue just drop the new skb and copy its notification to the body of
the skb already on the queue, if one exists and it has room. That is
essentially what the tcp zerocopy code does with the [data, info] range.
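
A minimal sketch of that coalescing step, assuming the cookies live in the
notification skb's data portion; locking scope, the cb layout and the wakeup
are glossed over, and the function name is made up.

#include <linux/skbuff.h>
#include <net/sock.h>

static bool zcopy_coalesce_cookie(struct sock *sk, u32 cookie)
{
	struct sk_buff_head *q = &sk->sk_error_queue;
	struct sk_buff *tail;
	unsigned long flags;
	bool coalesced = false;

	spin_lock_irqsave(&q->lock, flags);
	tail = skb_peek_tail(q);
	if (tail && skb_tailroom(tail) >= sizeof(u32)) {
		/* append to the notification already queued */
		memcpy(skb_put(tail, sizeof(u32)), &cookie, sizeof(u32));
		coalesced = true;
	}
	spin_unlock_irqrestore(&q->lock, flags);

	return coalesced;	/* if false, the caller queues its own skb */
}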

> Note that any one of these choices is OK with me - I have no
> special attachment to any of them.
Sowmini Varadhan Jan. 18, 2018, 11:20 p.m. UTC | #11
On (01/18/18 18:09), Willem de Bruijn wrote:
> If that is true in general for PF_RDS, then it is a reasonable approach.
> How about treating it as a (follow-on) optimization path. Opportunistic
> piggybacking of notifications on data reads is more widely applicable.

sounds good.

> > that's similar to what I have, except that it does not have the
> > MSG_PEEK part (you'd need to enforce that the data portion
> > is upper-bounded, and that the application has the responsibility
> > of sending down "enough" buffer with recvmsg).
> 
> Right. I think that an upper bound is the simplest solution here.
> 
> By the way, if you allocate an skb immediately on page pinning, then
> there are always sufficient skbs to store all notifications. On errqueue
> enqueue just drop the new skb and copy its notification to the body of
> the skb already on the queue, if one exists and it has room. That is
> essentially what the tcp zerocopy code does with the [data, info] range.

ok, I'll give that a shot (I'm working through the other review comments
as well)

fwiw, the data-corruption issue I mentioned turned out to be a day-one
bug in rds-tcp (patched in http://patchwork.ozlabs.org/patch/863183/). 
The buffer reaping with zcopy (and aggressiveness of rds-stress) brought
this one out..

--Sowmini
Willem de Bruijn Jan. 18, 2018, 11:24 p.m. UTC | #12
On Thu, Jan 18, 2018 at 6:20 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/18/18 18:09), Willem de Bruijn wrote:
>> If that is true in general for PF_RDS, then it is a reasonable approach.
>> How about treating it as a (follow-on) optimization path. Opportunistic
>> piggybacking of notifications on data reads is more widely applicable.
>
> sounds good.
>
>> > that's similar to what I have, except that it does not have the
>> > MSG_PEEK part (you'd need to enforce that the data portion
>> > is upper-bounded, and that the application has the responsibility
>> > of sending down "enough" buffer with recvmsg).
>>
>> Right. I think that an upper bound is the simplest solution here.
>>
>> By the way, if you allocate an skb immediately on page pinning, then
>> there are always sufficient skbs to store all notifications. On errqueue
>> enqueue just drop the new skb and copy its notification to the body of
>> the skb already on the queue, if one exists and it has room. That is
>> essentially what the tcp zerocopy code does with the [data, info] range.
>
> ok, I'll give that a shot (I'm working through the other review comments
> as well)
>
> fwiw, the data-corruption issue I mentioned turned out to be a day-one
> bug in rds-tcp (patched in http://patchwork.ozlabs.org/patch/863183/).
> The buffer reaping with zcopy (and aggressiveness of rds-stress) brought
> this one out..

Thanks. Good to hear that it's not in zerocopy itself.

Patch

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2fba3be..cfd0e0f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2313,7 +2313,7 @@  static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 	}
 	if (flags & MSG_ERRQUEUE) {
 		ret = sock_recv_errqueue(sock->sk, m, total_len,
-					 SOL_PACKET, TUN_TX_TIMESTAMP);
+					 SOL_PACKET, TUN_TX_TIMESTAMP, flags);
 		goto out;
 	}
 	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT, ptr);
diff --git a/include/net/sock.h b/include/net/sock.h
index 73b7830..f0b6990 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2343,7 +2343,7 @@  static inline bool sk_listener(const struct sock *sk)
 int sock_get_timestamp(struct sock *, struct timeval __user *);
 int sock_get_timestampns(struct sock *, struct timespec __user *);
 int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
-		       int type);
+		       int type, int flags);
 
 bool sk_ns_capable(const struct sock *sk,
 		   struct user_namespace *user_ns, int cap);
diff --git a/net/core/sock.c b/net/core/sock.c
index 72d14b2..4f52677 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2887,7 +2887,7 @@  void sock_enable_timestamp(struct sock *sk, int flag)
 }
 
 int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len,
-		       int level, int type)
+		       int level, int type, int flags)
 {
 	struct sock_exterr_skb *serr;
 	struct sk_buff *skb;
@@ -2916,7 +2916,10 @@  int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len,
 	err = copied;
 
 out_free_skb:
-	kfree_skb(skb);
+	if (likely(!(flags & MSG_PEEK)))
+		kfree_skb(skb);
+	else
+		skb_queue_head(&sk->sk_error_queue, skb);
 out:
 	return err;
 }
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ee7aa0b..4314f31 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3294,7 +3294,8 @@  static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 
 	if (flags & MSG_ERRQUEUE) {
 		err = sock_recv_errqueue(sk, msg, len,
-					 SOL_PACKET, PACKET_TX_TIMESTAMP);
+					 SOL_PACKET, PACKET_TX_TIMESTAMP,
+					 flags);
 		goto out;
 	}