Message ID | 20170814055259.31078-1-matthew@mjdsystems.ca |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
On Mon, 2017-08-14 at 01:52 -0400, Matthew Dawson wrote: > Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove > headers from UDP packets before queueing"), when udp packets are being > peeked the requested extra offset is always 0 as there is no need to skip > the udp header. However, when the offset is 0 and the next skb is > of length 0, it is only returned once. The behaviour can be seen with > the following python script: > > #!/usr/bin/env python3 > from socket import *; > f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0); > g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0); > f.bind(('::', 0)); > addr=('::1', f.getsockname()[1]); > g.sendto(b'', addr) > g.sendto(b'b', addr) > print(f.recvfrom(10, MSG_PEEK)); > print(f.recvfrom(10, MSG_PEEK)); > > Where the expected output should be the empty string twice. > > Instead, make sk_peek_offset return negative values, and pass those values > to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed offset > to __skb_try_recv_from_queue is negative, the checked skb is never skipped. > After the call, the offset is set to 0 if negative to ensure all further > calculations are correct. > > Also simplify the if condition in __skb_try_recv_from_queue. If _off is > greater then 0, and off is greater then or equal to skb->len, then (_off || > skb->len) must always be true assuming skb->len >= 0 is always true. > > Also remove a redundant check around a call to sk_peek_offset in af_unix.c, > as it double checked if MSG_PEEK was set in the flags. > > Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca> > --- > include/net/sock.h | 4 +--- > net/core/datagram.c | 4 ++-- > net/ipv4/udp.c | 4 ++++ > net/ipv6/udp.c | 4 ++++ > net/unix/af_unix.c | 8 +++++--- > 5 files changed, 16 insertions(+), 8 deletions(-) > > diff --git a/include/net/sock.h b/include/net/sock.h > index 7c0632c7e870..aeeec62992ca 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -507,9 +507,7 @@ int sk_set_peek_off(struct sock *sk, int val); > static inline int sk_peek_offset(struct sock *sk, int flags) > { > if (unlikely(flags & MSG_PEEK)) { > - s32 off = READ_ONCE(sk->sk_peek_off); > - if (off >= 0) > - return off; > + return READ_ONCE(sk->sk_peek_off); > } > > return 0; You probably want/must also update sk_set_peek_off() to allow negative values, elsewhere this will break as soon as the user will do SOL_SOCKET/SO_PEEK_OFF. I'm wondering adding an explicit SOCK_PEEK_OFF/MSG_PEEK_OFF socket flag would help simplyifing the code: no need for negative offset; set such flag when SOL_SOCKET/SO_PEEK_OFF with a non negative value is called (and clear it when a negative value is used), forward such flag to __skb_try_recv_datagram/__skb_try_recv_from_queue and use it to select the proper peek behaviour. Paolo
On Monday, August 14, 2017 5:27:26 AM EDT Paolo Abeni wrote: > On Mon, 2017-08-14 at 01:52 -0400, Matthew Dawson wrote: > > Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove > > headers from UDP packets before queueing"), when udp packets are being > > peeked the requested extra offset is always 0 as there is no need to skip > > the udp header. However, when the offset is 0 and the next skb is > > of length 0, it is only returned once. The behaviour can be seen with > > the following python script: > > > > #!/usr/bin/env python3 > > from socket import *; > > f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0); > > g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0); > > f.bind(('::', 0)); > > addr=('::1', f.getsockname()[1]); > > g.sendto(b'', addr) > > g.sendto(b'b', addr) > > print(f.recvfrom(10, MSG_PEEK)); > > print(f.recvfrom(10, MSG_PEEK)); > > > > Where the expected output should be the empty string twice. > > > > Instead, make sk_peek_offset return negative values, and pass those values > > to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed > > offset > > to __skb_try_recv_from_queue is negative, the checked skb is never > > skipped. > > After the call, the offset is set to 0 if negative to ensure all further > > calculations are correct. > > > > Also simplify the if condition in __skb_try_recv_from_queue. If _off is > > greater then 0, and off is greater then or equal to skb->len, then (_off > > || > > skb->len) must always be true assuming skb->len >= 0 is always true. > > > > Also remove a redundant check around a call to sk_peek_offset in > > af_unix.c, > > as it double checked if MSG_PEEK was set in the flags. > > > > Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca> > > --- > > > > include/net/sock.h | 4 +--- > > net/core/datagram.c | 4 ++-- > > net/ipv4/udp.c | 4 ++++ > > net/ipv6/udp.c | 4 ++++ > > net/unix/af_unix.c | 8 +++++--- > > 5 files changed, 16 insertions(+), 8 deletions(-) > > > > diff --git a/include/net/sock.h b/include/net/sock.h > > index 7c0632c7e870..aeeec62992ca 100644 > > --- a/include/net/sock.h > > +++ b/include/net/sock.h > > @@ -507,9 +507,7 @@ int sk_set_peek_off(struct sock *sk, int val); > > > > static inline int sk_peek_offset(struct sock *sk, int flags) > > { > > > > if (unlikely(flags & MSG_PEEK)) { > > > > - s32 off = READ_ONCE(sk->sk_peek_off); > > - if (off >= 0) > > - return off; > > + return READ_ONCE(sk->sk_peek_off); > > > > } > > > > return 0; > > You probably want/must also update sk_set_peek_off() to allow negative > values, elsewhere this will break as soon as the user will do > SOL_SOCKET/SO_PEEK_OFF. I'm happy to do this either in this patch or as a second patch and make this a series, whatever the networking community prefers. I'm actually surprised that only unix sockets can have negative values. Is there a reason for that? I had assumed that sk_set_peek_off would allow negative values as the code already has to support negative values due to what the initial value is. > I'm wondering adding an explicit SOCK_PEEK_OFF/MSG_PEEK_OFF socket flag > would help simplyifing the code: no need for negative offset; set such > flag when SOL_SOCKET/SO_PEEK_OFF with a non negative value is called > (and clear it when a negative value is used), forward such flag to > __skb_try_recv_datagram/__skb_try_recv_from_queue and use it to select > the proper peek behaviour. The negative value is documented and supported for unix sockets, so I don't think we can just reject all negative values. However, I do see a way to implement that while supporting user space sending negative values. If that is preferred let me know and I'll see what I can make. I assume that would make this patch target net-next? Would it be possible to pull this into net so it can fix this regression for the next kernel release, while I work on getting the better solution finished? I'm happy to to make __skb_try_recv_datagram/__skb_try_recv_from_queue accept the extra parameter so they don't see an offset less then 0 in this patch.
> I'm actually surprised that only unix sockets can have negative values. Is > there a reason for that? I had assumed that sk_set_peek_off would allow > negative values as the code already has to support negative values due to what > the initial value is. A negative initial value indicates that PEEK_OFF is disabled. It only makes sense to peek from a positive offset from the start of the data. >> I'm wondering adding an explicit SOCK_PEEK_OFF/MSG_PEEK_OFF socket flag >> would help simplyifing the code: The behavior needs to be bifurcated between peeking with offset and without offset. When peeking with offset is enabled by setting SO_PEEK_OFF, subsequent reads do move the offset, so the observed behavior is correct. When sk->sk_peek_offset is negative, offset mode is disabled and the same packet must be read twice. An explicit boolean flag to discern between the two may make the code simpler to understand, not sure whether that is logically required. >> no need for negative offset; set such >> flag when SOL_SOCKET/SO_PEEK_OFF with a non negative value is called >> (and clear it when a negative value is used), forward such flag to >> __skb_try_recv_datagram/__skb_try_recv_from_queue and use it to select >> the proper peek behaviour. > The negative value is documented and supported for unix sockets, so I don't > think we can just reject all negative values. What is the documented behavior on those sockets? The original patch that introduced it explicitly mentions peek mode as disabled when the value is negative: https://www.spinics.net/lists/netdev/msg189589.html > However, I do see a way to > implement that while supporting user space sending negative values. If that > is preferred let me know and I'll see what I can make. > > I assume that would make this patch target net-next? I suspect that we can fix this with a small enough patch to be able to send to net. > Would it be possible to > pull this into net so it can fix this regression for the next kernel release, > while I work on getting the better solution finished? I'm happy to to make > __skb_try_recv_datagram/__skb_try_recv_from_queue accept the extra parameter > so they don't see an offset less then 0 in this patch. > > -- > Matthew
On Mon, 2017-08-14 at 11:03 -0400, Willem de Bruijn wrote: > > I'm actually surprised that only unix sockets can have negative values. Is > > there a reason for that? I had assumed that sk_set_peek_off would allow > > negative values as the code already has to support negative values due to what > > the initial value is. > > A negative initial value indicates that PEEK_OFF is disabled. It only > makes sense to peek from a positive offset from the start of the data. With the current code, the user space can disable peeking with offset setting a negative offset value, after that peeking with offset has been enabled. But only for UNIX sockets. I think the same should be allowed for UDP sockets. > > > I'm wondering adding an explicit SOCK_PEEK_OFF/MSG_PEEK_OFF socket flag > > > would help simplyifing the code: > > The behavior needs to be bifurcated between peeking with > offset and without offset. > > When peeking with offset is enabled by setting SO_PEEK_OFF, > subsequent reads do move the offset, so the observed behavior > is correct. > > When sk->sk_peek_offset is negative, offset mode is disabled > and the same packet must be read twice. > > An explicit boolean flag to discern between the two may make > the code simpler to understand, not sure whether that is logically > required. Yes, an explicit PEEK_OFF flag is just to keep the code simpler, so that there is no need to add checks at every sk_peek_offset() call site and the relevant logic can be fully encapsulated under the MSG_PEEK branch in __skb_try_recv_from_queue(), I think/hope. It's not a functional requirement. Cheers, Paolo
On Monday, 14 August 2017 08:03:50 PDT Willem de Bruijn wrote: > > I'm actually surprised that only unix sockets can have negative values. > > Is > > there a reason for that? I had assumed that sk_set_peek_off would allow > > negative values as the code already has to support negative values due to > > what the initial value is. > > A negative initial value indicates that PEEK_OFF is disabled. It only > makes sense to peek from a positive offset from the start of the data. But here's a question: if the peek offset is equal to the length, should the reading return an empty datagram? This would indicate to the caller that there was a datagram there, which was skipped over. That's how we deal with empty datagrams anyway.
On Mon, Aug 14, 2017 at 12:06 PM, Thiago Macieira <thiago.macieira@intel.com> wrote: > On Monday, 14 August 2017 08:03:50 PDT Willem de Bruijn wrote: >> > I'm actually surprised that only unix sockets can have negative values. >> > Is >> > there a reason for that? I had assumed that sk_set_peek_off would allow >> > negative values as the code already has to support negative values due to >> > what the initial value is. >> >> A negative initial value indicates that PEEK_OFF is disabled. It only >> makes sense to peek from a positive offset from the start of the data. > > But here's a question: if the peek offset is equal to the length, should the > reading return an empty datagram? This would indicate to the caller that there > was a datagram there, which was skipped over. In the general case, no, it should read at the offset, which is the next skb. Zero length packets are a special case. This did come up before and we chose to signal their existence in the queue by returning 0 for each once, even in the offset-enabled mode. Since we only need to change no-offset semantics to fix this bug, I would not change this behavior, which is also expected by some applications by now. > > That's how we deal with empty datagrams anyway. What is? With no-offset and a zero payload skb at the head, peek or recv returns 0, right?
On Monday, 14 August 2017 09:33:48 PDT Willem de Bruijn wrote: > > But here's a question: if the peek offset is equal to the length, should > > the reading return an empty datagram? This would indicate to the caller > > that there was a datagram there, which was skipped over. > > In the general case, no, it should read at the offset, which is the next > skb. I beg to differ. In this particular case, we are talking about datagrams. If it were stream sockets, I would agree with you: just skip to the next. But in datagrams, the same way you do return zero-sized ones, I would return an empty one if you peeked at or past the end. > Since we only need to change no-offset semantics to fix this bug, > I would not change this behavior, which is also expected by some > applications by now. Do applications using SOCK_DGRAM rely on the behaviour of skipping over datagrams that are too short? > > That's how we deal with empty datagrams anyway. > > What is? With no-offset and a zero payload skb at the head, peek > or recv returns 0, right? Right.
On Mon, Aug 14, 2017 at 1:02 PM, Thiago Macieira <thiago.macieira@intel.com> wrote: > On Monday, 14 August 2017 09:33:48 PDT Willem de Bruijn wrote: >> > But here's a question: if the peek offset is equal to the length, should >> > the reading return an empty datagram? This would indicate to the caller >> > that there was a datagram there, which was skipped over. >> >> In the general case, no, it should read at the offset, which is the next >> skb. > > I beg to differ. In this particular case, we are talking about datagrams. If > it were stream sockets, I would agree with you: just skip to the next. But in > datagrams, the same way you do return zero-sized ones, I would return an empty > one if you peeked at or past the end. It can be argued either way. I would not change it in the scope of this bug. > >> Since we only need to change no-offset semantics to fix this bug, >> I would not change this behavior, which is also expected by some >> applications by now. > > Do applications using SOCK_DGRAM rely on the behaviour of skipping over > datagrams that are too short? It is established behavior. It cannot be ruled out that an application somewhere depends on it.
On Monday, 14 August 2017 11:25:23 PDT Willem de Bruijn wrote: > > Do applications using SOCK_DGRAM rely on the behaviour of skipping over > > datagrams that are too short? > > It is established behavior. It cannot be ruled out that an application > somewhere depends on it. Understood. By the way, what were the usecases for the peek offset feature? Also, do they apply to non-peeking recv?
On Mon, Aug 14, 2017 at 2:33 PM, Thiago Macieira <thiago.macieira@intel.com> wrote: > On Monday, 14 August 2017 11:25:23 PDT Willem de Bruijn wrote: >> > Do applications using SOCK_DGRAM rely on the behaviour of skipping over >> > datagrams that are too short? >> >> It is established behavior. It cannot be ruled out that an application >> somewhere depends on it. > > Understood. > > By the way, what were the usecases for the peek offset feature? The idea was to be able to peek at application headers of upper layer protocols and multiplex messages among threads. It proved so complex even for UDP that we did not attempt the same feature for TCP. Also, KCM implements demultiplexing using eBPF today. > Also, do they apply to non-peeking recv? Reading at an offset is not implemented. Especially for RPC over TCP, reading at an offset could make sense. Say, if a message is received completely, but it is head-of-line blocked behind another that has a hole. Here, too, KCM is an alternative.
On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote: > > By the way, what were the usecases for the peek offset feature? > > The idea was to be able to peek at application headers of upper > layer protocols and multiplex messages among threads. It proved > so complex even for UDP that we did not attempt the same feature > for TCP. Also, KCM implements demultiplexing using eBPF today. Interesting, but how would userspace coordinate like that? Suppose multiple threads are woken up by a datagram being received, they peek at a certain offset shared among them all to see which one reads. Suppose that thread is slow or blocked and, while it's getting its act together, another datagram arrives. Because of that, the other threads can't disable their polling. They will continually be woken up by the kernel if they go back to poll/select. Even with epoll, there's no new edge trigger since event is already at level. How will they avoid busy-waiting? And won't this secondary coordination obviate the need for offset peeking?
On Mon, Aug 14, 2017 at 2:58 PM, Thiago Macieira <thiago.macieira@intel.com> wrote: > On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote: >> > By the way, what were the usecases for the peek offset feature? >> >> The idea was to be able to peek at application headers of upper >> layer protocols and multiplex messages among threads. It proved >> so complex even for UDP that we did not attempt the same feature >> for TCP. Also, KCM implements demultiplexing using eBPF today. > > Interesting, but how would userspace coordinate like that? Suppose multiple > threads are woken up by a datagram being received This assumes a separate listener thread and worker threadpool.
On Monday, 14 August 2017 12:03:16 PDT Willem de Bruijn wrote: > On Mon, Aug 14, 2017 at 2:58 PM, Thiago Macieira > > <thiago.macieira@intel.com> wrote: > > On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote: > >> > By the way, what were the usecases for the peek offset feature? > >> > >> The idea was to be able to peek at application headers of upper > >> layer protocols and multiplex messages among threads. It proved > >> so complex even for UDP that we did not attempt the same feature > >> for TCP. Also, KCM implements demultiplexing using eBPF today. > > > > Interesting, but how would userspace coordinate like that? Suppose > > multiple > > threads are woken up by a datagram being received > > This assumes a separate listener thread and worker threadpool. The listener thread still needs to synchronise with the worker that got activated and wait for it to recv from the socket before the listener thread can go back to poll(). If we are really talking about threads in the same process, it might be easier for the listener to just read the datagram anyway and pass it on to the worker. That way, it can proceed immediately to the next datagram and not have to wait for the possibly slow worker. If it is a separate process, then I don't see another way and this might be necessary. By the way, what does recv with MSG_PEEK | MSG_TRUNC return? Is it the full datagram's size or is it the size minus the peek offset?
On Mon, Aug 14, 2017 at 3:15 PM, Thiago Macieira <thiago.macieira@intel.com> wrote: > On Monday, 14 August 2017 12:03:16 PDT Willem de Bruijn wrote: >> On Mon, Aug 14, 2017 at 2:58 PM, Thiago Macieira >> >> <thiago.macieira@intel.com> wrote: >> > On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote: >> >> > By the way, what were the usecases for the peek offset feature? >> >> >> >> The idea was to be able to peek at application headers of upper >> >> layer protocols and multiplex messages among threads. It proved >> >> so complex even for UDP that we did not attempt the same feature >> >> for TCP. Also, KCM implements demultiplexing using eBPF today. >> > >> > Interesting, but how would userspace coordinate like that? Suppose >> > multiple >> > threads are woken up by a datagram being received >> >> This assumes a separate listener thread and worker threadpool. > > The listener thread still needs to synchronise with the worker that got > activated and wait for it to recv from the socket before the listener thread > can go back to poll(). > > If we are really talking about threads in the same process, it might be easier > for the listener to just read the datagram anyway and pass it on to the > worker. That way, it can proceed immediately to the next datagram and not have > to wait for the possibly slow worker. > > If it is a separate process, then I don't see another way and this might be > necessary. > > By the way, what does recv with MSG_PEEK | MSG_TRUNC return? Is it the full > datagram's size or is it the size minus the peek offset? udp_recvmsg returns ulen if the flag is passed. if (flags & MSG_TRUNC) err = ulen; This is computed earlier in the function as udp payload length and not modified after. ulen = udp_skb_len(skb);
diff --git a/include/net/sock.h b/include/net/sock.h index 7c0632c7e870..aeeec62992ca 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -507,9 +507,7 @@ int sk_set_peek_off(struct sock *sk, int val); static inline int sk_peek_offset(struct sock *sk, int flags) { if (unlikely(flags & MSG_PEEK)) { - s32 off = READ_ONCE(sk->sk_peek_off); - if (off >= 0) - return off; + return READ_ONCE(sk->sk_peek_off); } return 0; diff --git a/net/core/datagram.c b/net/core/datagram.c index ee5647bd91b3..a91cd8aa244b 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -175,8 +175,8 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk, *last = queue->prev; skb_queue_walk(queue, skb) { if (flags & MSG_PEEK) { - if (_off >= skb->len && (skb->len || _off || - skb->peeked)) { + if (_off >= 0 && _off >= skb->len && + (_off || skb->peeked)) { _off -= skb->len; continue; } diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index a7c804f73990..a0d8d6d285f4 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1576,6 +1576,10 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock, try_again: peeking = off = sk_peek_offset(sk, flags); skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err); + if (off < 0) { + off = 0; + peeking = 0; + } if (!skb) return err; diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 578142b7ca3e..34d8f9f8d87d 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -364,6 +364,10 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, try_again: peeking = off = sk_peek_offset(sk, flags); skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err); + if (peeking < 0) { + off = 0; + peeking = 0; + } if (!skb) return err; diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 7b52a380d710..390e0627dc05 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -2127,6 +2127,8 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, skip = sk_peek_offset(sk, flags); skb = __skb_try_recv_datagram(sk, flags, NULL, &peeked, &skip, &err, &last); + if (skip < 0) + skip = 0; if (skb) break; @@ -2304,10 +2306,10 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state, */ mutex_lock(&u->iolock); - if (flags & MSG_PEEK) - skip = sk_peek_offset(sk, flags); - else + skip = sk_peek_offset(sk, flags); + if (skip < 0) { skip = 0; + } do { int chunk;
Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove headers from UDP packets before queueing"), when udp packets are being peeked the requested extra offset is always 0 as there is no need to skip the udp header. However, when the offset is 0 and the next skb is of length 0, it is only returned once. The behaviour can be seen with the following python script: #!/usr/bin/env python3 from socket import *; f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0); g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0); f.bind(('::', 0)); addr=('::1', f.getsockname()[1]); g.sendto(b'', addr) g.sendto(b'b', addr) print(f.recvfrom(10, MSG_PEEK)); print(f.recvfrom(10, MSG_PEEK)); Where the expected output should be the empty string twice. Instead, make sk_peek_offset return negative values, and pass those values to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed offset to __skb_try_recv_from_queue is negative, the checked skb is never skipped. After the call, the offset is set to 0 if negative to ensure all further calculations are correct. Also simplify the if condition in __skb_try_recv_from_queue. If _off is greater then 0, and off is greater then or equal to skb->len, then (_off || skb->len) must always be true assuming skb->len >= 0 is always true. Also remove a redundant check around a call to sk_peek_offset in af_unix.c, as it double checked if MSG_PEEK was set in the flags. Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca> --- include/net/sock.h | 4 +--- net/core/datagram.c | 4 ++-- net/ipv4/udp.c | 4 ++++ net/ipv6/udp.c | 4 ++++ net/unix/af_unix.c | 8 +++++--- 5 files changed, 16 insertions(+), 8 deletions(-)