diff mbox

[net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs

Message ID CAF=yD-Lacw9T7G0XZBdHZwEb6HgRuaBoyUN1oTvWixu4a0Fy6Q@mail.gmail.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Willem de Bruijn Aug. 15, 2017, 4:45 p.m. UTC
On Tue, Aug 15, 2017 at 11:40 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> On Mon, 2017-08-14 at 21:35 -0400, Willem de Bruijn wrote:
>> It is not infeasible to fix that and fix up all callers, as Matthew's
>> patch does. But perhaps this patch is simpler to reason about. Thoughts?
>>
>> +static inline bool sk_peek_at_offset(struct sock *sk)
>> +{
>> +       return READ_ONCE(sk->sk_peek_off) >= 0;
>> +}
>> +
>>  static inline int sk_peek_offset(struct sock *sk, int flags)
>>  {
>>         if (unlikely(flags & MSG_PEEK)) {
>> diff --git a/net/core/datagram.c b/net/core/datagram.c
>> index ee5647bd91b3..30b53932af73 100644
>> --- a/net/core/datagram.c
>> +++ b/net/core/datagram.c
>> @@ -175,12 +175,14 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
>>         *last = queue->prev;
>>         skb_queue_walk(queue, skb) {
>>                 if (flags & MSG_PEEK) {
>> -                       if (_off >= skb->len && (skb->len || _off ||
>> -                                                skb->peeked)) {
>> +                       if (_off >= skb->len && sk_peek_at_offset(sk) &&
>> +                           (skb->len || _off || skb->peeked)) {
>
> I like the idea a lot, but I think/fear that bad things could happen
> since sk_peek_off is read twice at different times: one in
> sk_peek_offset() and one in the above chunk.
>
> If the above is not an issue (I am not sure) I'm very fine with this
> approach.

Good point. I don't think we need to read it in the callers at all.
Then int *off is only used to return to offset within the skb back
to the callers.

Though it should be read only once here, outside the loop.

Something like (also untested):

                return err;

The explicit peek_at_off is a bit verbose, but we need to be careful
about signedness between _off and skb->len.

> For the record, I thought something like the following (uncomplete,
> does not even compile):
> ---
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 8b13db5163cc..5085cf003b88 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -286,6 +286,7 @@ struct ucred {
>  #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
>  #define MSG_BATCH      0x40000 /* sendmmsg(): more messages coming */
>  #define MSG_EOF         MSG_FIN
> +#define MSG_PEEK_OFF   0x80000

Yes, that also works well.

I'm afraid about exhausting the MSG_* flag space here for a
feature that is not exposed to userspace. We don't have many flags
left. We could shadow an existing flag that is unused in this context.

There is another difference between reading sk_peek_offset in the
caller or in __skb_try_recv_from_queue. The latter is called repeatedly
when it returns NULL. Each call can modify *off. I believe that it needs
to restart with _off at sk->sk_peek_off each time, as it restarts from the
head of the queue each time.

>
>  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
>  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7c0632c7e870..e75e024b55d2 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -504,12 +504,14 @@ enum sk_pacing {
>
>  int sk_set_peek_off(struct sock *sk, int val);
>
> -static inline int sk_peek_offset(struct sock *sk, int flags)
> +static inline int sk_peek_offset(struct sock *sk, int *flags)
>  {
> -       if (unlikely(flags & MSG_PEEK)) {
> +       if (unlikely(*flags & MSG_PEEK)) {
>                 s32 off = READ_ONCE(sk->sk_peek_off);
> -               if (off >= 0)
> +               if (off >= 0) {
> +                       *flags |= MSG_PEEK_OFF;
>                         return off;
> +               }
>         }
>
>         return 0;
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index ee5647bd91b3..91e1d014d64c 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -175,8 +175,8 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
>         *last = queue->prev;
>         skb_queue_walk(queue, skb) {
>                 if (flags & MSG_PEEK) {
> -                       if (_off >= skb->len && (skb->len || _off ||
> -                                                skb->peeked)) {
> +                       if (flags & MSG_PEEK_OFF && _off >= skb->len &&
> +                           (skb->len || _off || skb->peeked)) {
>                                 _off -= skb->len;
>                                 continue;
>                         }
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index a7c804f73990..7e1bcd3500f4 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1574,7 +1574,7 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
>                 return ip_recv_error(sk, msg, len, addr_len);
>
>  try_again:
> -       peeking = off = sk_peek_offset(sk, flags);
> +       peeking = off = sk_peek_offset(sk, &flags);
>         skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
>         if (!skb)
>                 return err;
> ---
> Paolo

Comments

Willem de Bruijn Aug. 15, 2017, 5 p.m. UTC | #1
> There is another difference between reading sk_peek_offset in the
> caller or in __skb_try_recv_from_queue. The latter is called repeatedly
> when it returns NULL. Each call can modify *off. I believe that it needs
> to restart with _off at sk->sk_peek_off each time, as it restarts from the
> head of the queue each time.

I made a mistake here. *off is not updated when returning NULL.

In that case, it is better to read sk_peek_offset once, than to read
it each time __skb_try_recv_from_queue is entered.
Paolo Abeni Aug. 15, 2017, 6:17 p.m. UTC | #2
On Tue, 2017-08-15 at 12:45 -0400, Willem de Bruijn wrote:
> On Tue, Aug 15, 2017 at 11:40 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> > For the record, I thought something like the following (uncomplete,
> > does not even compile):
> > ---
> > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > index 8b13db5163cc..5085cf003b88 100644
> > --- a/include/linux/socket.h
> > +++ b/include/linux/socket.h
> > @@ -286,6 +286,7 @@ struct ucred {
> >  #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
> >  #define MSG_BATCH      0x40000 /* sendmmsg(): more messages coming */
> >  #define MSG_EOF         MSG_FIN
> > +#define MSG_PEEK_OFF   0x80000
> 
> Yes, that also works well.
> 
> I'm afraid about exhausting the MSG_* flag space here for a
> feature that is not exposed to userspace. We don't have many flags
> left. We could shadow an existing flag that is unused in this context.

That was my concern, too. Fortunately there are a bunch of flags
defined but apparently unused (MSG_FIN, MSG_SYN, MSG_RST) since long
time (if I'm not too low on coffee).

We can shadow one of them (and ev. drop the above define, if really
unused).

I think that the MSG_PEEK_OFF should be explicitly cleared in
sk_peek_offset() when the 'sk_peek_off' is negative, to avoid beeing
fooled by stray bits into the 'flags' argument.

I'll try to scatch-up something tomorrow.

Thanks,

Paolo
diff mbox

Patch

diff --git a/net/core/datagram.c b/net/core/datagram.c
index ee5647bd91b3..06bad8726612 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -170,13 +170,15 @@  struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
                                          struct sk_buff **last)
 {
        struct sk_buff *skb;
-       int _off = *off;
+       int _off;
+       bool peek_at_off;

+       _off = sk_peek_offset(sk, flags);
+       peek_at_off = _off >= 0;
        *last = queue->prev;
        skb_queue_walk(queue, skb) {
                if (flags & MSG_PEEK) {
-                       if (_off >= skb->len && (skb->len || _off ||
-                                                skb->peeked)) {
+                       if (peek_at_off && off >= skb->len &&
+                           (skb->len || _off || skb->peeked)) {
                                _off -= skb->len;
                                continue;
                        }

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a7c804f73990..4b51b9853406 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1574,7 +1574,7 @@  int udp_recvmsg(struct sock *sk, struct msghdr
*msg, size_t len, int noblock,
                return ip_recv_error(sk, msg, len, addr_len);

 try_again:
-       peeking = off = sk_peek_offset(sk, flags);
+       peeking = flags & MSG_PEEK;
        skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
        if (!skb)