[net-next] sock: do not set sk_err in sock_dequeue_err_skb

Message ID 1478211867-24569-1-git-send-email-soheil.kdev@gmail.com
State Accepted, archived
Delegated to: David Miller

Commit Message

Soheil Hassas Yeganeh Nov. 3, 2016, 10:24 p.m. UTC
From: Soheil Hassas Yeganeh <soheil@google.com>

Do not set sk_err when dequeuing errors from the error queue.
Doing so results in:
a) Bugs: By overwriting existing sk_err values, it possibly
   hides legitimate errors. It is also incorrect when local
   errors are queued with ip_local_error. That happens in the
   context of a system call, which already returns the error
   code.
b) Inconsistent behavior: When there are pending errors on
   the error queue, sk_err is sometimes 0 (e.g., for
   the first timestamp on the error queue) and sometimes
   set to an error code (after dequeuing the first
   timestamp).
c) Suboptimality: Setting sk_err to ENOMSG on simple
   TX timestamps can abort parallel reads and writes.

Removing this line doesn't break userspace. This is because
userspace code cannot rely on sk_err for detecting whether
there is something on the error queue. Except for ICMP messages
received for UDP and RAW, sk_err is not set at enqueue time,
and as a result sk_err can be 0 while there are plenty of
errors on the error queue.

For ICMP packets in UDP and RAW, sk_err is set when they are
enqueued on the error queue, but that does not result in aborting
reads and writes. For such cases, sk_err is only readable via
getsockopt(SO_ERROR) which will reset the value of sk_err on
its own. More importantly, prior to this patch,
recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
sk_err is set by sock_dequeue_err_skb without atomic ops or
locks) which can store 0 in sk_err even when we have ICMP
messages pending. Removing this line from sock_dequeue_err_skb
eliminates that race.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/core/skbuff.c | 1 -
 1 file changed, 1 deletion(-)
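
To make the commit message concrete: userspace that consumes the error
queue reads a per-message error code from the sock_extended_err control
message returned by recvmsg(MSG_ERRQUEUE), and never needs to look at
sk_err. The sketch below is illustrative only (it is not part of the
patch) and assumes a Linux UDP or RAW socket with IP_RECVERR or
SO_TIMESTAMPING enabled; the helper name is made up.

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

/* Illustrative helper: dequeue one message from the error queue and
 * print its per-message error code. Nothing here depends on sk_err;
 * each queued skb carries its own struct sock_extended_err.
 */
static int drain_one_error(int fd)
{
	char data[256], control[512];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = control, .msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0)
		return -errno;	/* -EAGAIN means the queue is empty */

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct sock_extended_err *serr;

		if (cm->cmsg_level != IPPROTO_IP &&	/* == SOL_IP */
		    cm->cmsg_level != IPPROTO_IPV6)	/* == SOL_IPV6 */
			continue;
		serr = (struct sock_extended_err *)CMSG_DATA(cm);
		/* A TX timestamp arrives as ENOMSG with origin
		 * SO_EE_ORIGIN_TIMESTAMPING, not as a "real" error.
		 */
		printf("ee_errno=%u ee_origin=%u\n",
		       serr->ee_errno, serr->ee_origin);
	}
	return 0;
}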

Comments

Hannes Frederic Sowa Nov. 3, 2016, 11:10 p.m. UTC | #1
[also cc'ed Andy, albeit this doesn't seem to solve his initial problem,
right? <http://www.spinics.net/lists/netdev/msg331753.html>]

On 03.11.2016 23:24, Soheil Hassas Yeganeh wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
> 
> Do not set sk_err when dequeuing errors from the error queue.
> Doing so results in:
> a) Bugs: By overwriting existing sk_err values, it possibly
>    hides legitimate errors. It is also incorrect when local
>    errors are queued with ip_local_error. That happens in the
>    context of a system call, which already returns the error
>    code.
> b) Inconsistent behavior: When there are pending errors on
>    the error queue, sk_err is sometimes 0 (e.g., for
>    the first timestamp on the error queue) and sometimes
>    set to an error code (after dequeuing the first
>    timestamp).
> c) Suboptimality: Setting sk_err to ENOMSG on simple
>    TX timestamps can abort parallel reads and writes.
> 
> Removing this line doesn't break userspace. This is because
> userspace code cannot rely on sk_err for detecting whether
> there is something on the error queue. Except for ICMP messages
> received for UDP and RAW, sk_err is not set at enqueue time,
> and as a result sk_err can be 0 while there are plenty of
> errors on the error queue.
> 
> For ICMP packets in UDP and RAW, sk_err is set when they are
> enqueued on the error queue, but that does not result in aborting
> reads and writes. For such cases, sk_err is only readable via
> getsockopt(SO_ERROR) which will reset the value of sk_err on
> its own. More importantly, prior to this patch,
> recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
> sk_err is set by sock_dequeue_err_skb without atomic ops or
> locks) which can store 0 in sk_err even when we have ICMP
> messages pending. Removing this line from sock_dequeue_err_skb
> eliminates that race.
> 
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

I think it makes sense to remove this given your argumentation.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Willem de Bruijn Nov. 4, 2016, 7:26 p.m. UTC | #2
On Thu, Nov 3, 2016 at 7:10 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> [also cc'ed Andy, albeit this doesn't seem to solve his initial problem,
> right? <http://www.spinics.net/lists/netdev/msg331753.html>]

Indeed, this does not help disambiguate the source of an error
returned by a socketcall. It only reduces how often sk_err is leaked
into the socket call datapath. Local errors and timestamps never set
sk_err in the first place, so there can be no expectation that they
set it on dequeue.

I suspect that that line in sock_dequeue_err_skb originates from icmp
errors, as those do set sk_err whenever queueing an error. So
setting sk_err again when more errors remain on the queue would make
sense if all queued errors were icmp errors. But they are not. And that
path is racy, so there is no guarantee, either way. Processes that
rely on syscall blocking to read the errqueue still block for icmp
errors after this patch, due to the initial sk_err assignment on
enqueue.

Note that the comment in the patch that only TCP socket calls are
blocked is probably incorrect. TCP has its own check in tcp_sendmsg
and returns EPIPE when an sk_err is set, without clearing it. But
other send, recv, connect, .. implementations include a path to
sock_error that aborts the call (e.g., in __skb_try_recv_datagram),
returns the value of sk_err, and clears it. This is an inconsistency
between the protocols.
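
For reference, the clear-on-read helper mentioned above looks roughly
like this (include/net/sock.h around this release; abridged):

static inline int sock_error(struct sock *sk)
{
	int err;

	if (likely(!sk->sk_err))
		return 0;
	/* read-and-clear: a stray sk_err aborts exactly one call */
	err = xchg(&sk->sk_err, 0);
	return -err;
}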

Andy's original problem mentions recvmsg. We could probably make that
unambiguous by returning a cmsg with the value of sk_err if RECVERR is
set and sk_err is the source of the error. But this still does not
solve the general issue, as other socketcalls, like send and connect,
can also return sk_err. We can modify the number of places that check
for it and currently abort system calls. Mainly sock_error(). One
option is to add a socket option (SOL_SOCKET, as sk_err is not limited
to when IP_RECVERR is set) to make that a no-op. If the caller sets
this option it instead has to wait with poll to receive POLLERR and
call getsockopt SO_ERROR and recv MSG_ERRQUEUE to get the asynchronous
error.
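
For concreteness, the flow in the last paragraph would look roughly
like the sketch below on the application side. The opt-out socket
option is only a proposal in this thread and does not exist, so the
sketch shows just the POLLERR / SO_ERROR / MSG_ERRQUEUE part; the
helper name is made up.

#include <poll.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Illustrative only: wait for an asynchronous error rather than having
 * it abort an unrelated send/recv, then consume it explicitly.
 */
static void wait_and_consume_errors(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = 0 };	/* POLLERR is always reported */
	char data[256], control[512];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = control, .msg_controllen = sizeof(control),
	};
	int err = 0;
	socklen_t len = sizeof(err);

	if (poll(&pfd, 1, -1) <= 0 || !(pfd.revents & POLLERR))
		return;

	/* getsockopt(SO_ERROR) returns sk_err and clears it */
	getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);

	/* drain the error queue; each cmsg carries a sock_extended_err */
	for (;;) {
		msg.msg_controllen = sizeof(control);
		if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0)
			break;	/* EAGAIN: nothing left */
	}
}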
Andy Lutomirski Nov. 7, 2016, 5:18 a.m. UTC | #3
On Nov 4, 2016 1:26 PM, "Willem de Bruijn"
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Thu, Nov 3, 2016 at 7:10 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > [also cc'ed Andy, albeit this doesn't seem to solve his initial problem,
> > right? <http://www.spinics.net/lists/netdev/msg331753.html>]
>
> Indeed, this does not help disambiguate the source of an error
> returned by a socketcall. It only reduces how often sk_err is leaked
> into the socket call datapath. Local errors and timestamps never set
> sk_err in the first place, so there can be no expectation that they
> set it on dequeue.
>
> I suspect that that line in sock_dequeue_err_skb originates from icmp
> errors, as those do set sk_err whenever queueing an error. So
> setting sk_err again when more errors remain on the queue would make
> sense if all queued errors were icmp errors. But they are not. And that
> path is racy, so there is no guarantee, either way. Processes that
> rely on syscall blocking to read the errqueue still block for icmp
> errors after this patch, due to the initial sk_err assignment on
> enqueue.
>
> Note that the comment in the patch that only TCP socket calls are
> blocked is probably incorrect. TCP has its own check in tcp_sendmsg
> and returns EPIPE when an sk_err is set, without clearing it. But
> other send, recv, connect, .. implementations include a path to
> sock_error that aborts the call (e.g., in __skb_try_recv_datagram),
> returns the value of sk_err, and clears it. This is an inconsistency
> between the protocols.
>
> Andy's original problem mentions recvmsg. We could probably make that
> unambiguous by returning a cmsg with the value of sk_err if RECVERR is
> set and sk_err is the source of the error. But this still does not
> solve the general issue, as other socketcalls, like send and connect,
> can also return sk_err. We can modify the number of places that check
> for it and currently abort system calls. Mainly sock_error(). One
> option is to add a socket option (SOL_SOCKET, as sk_err is not limited
> to when IP_RECVERR is set) to make that a no-op. If the caller sets
> this option it instead has to wait with poll to receive POLLERR and
> call getsockopt SO_ERROR and recv MSG_ERRQUEUE to get the asynchronous
> error.

That would work for me, I think.  If this socket option were set, I
would hope that SO_ERROR wouldn't report anything if the only error
were a queued message on the error queue.
David Miller Nov. 8, 2016, 1:29 a.m. UTC | #4
From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Date: Thu,  3 Nov 2016 18:24:27 -0400

> From: Soheil Hassas Yeganeh <soheil@google.com>
> 
> Do not set sk_err when dequeuing errors from the error queue.
> Doing so results in:
> a) Bugs: By overwriting existing sk_err values, it possibly
>    hides legitimate errors. It is also incorrect when local
>    errors are queued with ip_local_error. That happens in the
>    context of a system call, which already returns the error
>    code.
> b) Inconsistent behavior: When there are pending errors on
>    the error queue, sk_err is sometimes 0 (e.g., for
>    the first timestamp on the error queue) and sometimes
>    set to an error code (after dequeuing the first
>    timestamp).
> c) Suboptimality: Setting sk_err to ENOMSG on simple
>    TX timestamps can abort parallel reads and writes.
> 
> Removing this line doesn't break userspace. This is because
> userspace code cannot rely on sk_err for detecting whether
> there is something on the error queue. Except for ICMP messages
> received for UDP and RAW, sk_err is not set at enqueue time,
> and as a result sk_err can be 0 while there are plenty of
> errors on the error queue.
> 
> For ICMP packets in UDP and RAW, sk_err is set when they are
> enqueued on the error queue, but that does not result in aborting
> reads and writes. For such cases, sk_err is only readable via
> getsockopt(SO_ERROR) which will reset the value of sk_err on
> its own. More importantly, prior to this patch,
> recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
> sk_err is set by sock_dequeue_err_skb without atomic ops or
> locks) which can store 0 in sk_err even when we have ICMP
> messages pending. Removing this line from sock_dequeue_err_skb
> eliminates that race.
> 
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Ok, applied.

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1e3e008..0b2a6e9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3725,7 +3725,6 @@  struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
 		err = SKB_EXT_ERR(skb_next)->ee.ee_errno;
 	spin_unlock_irqrestore(&q->lock, flags);
 
-	sk->sk_err = err;
 	if (err)
 		sk->sk_error_report(sk);
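
With the assignment removed, the function reads roughly as follows
(abridged from net/core/skbuff.c around this revision, for context
only; the lines surrounding the change match the hunk above):

struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
{
	struct sk_buff_head *q = &sk->sk_error_queue;
	struct sk_buff *skb, *skb_next;
	unsigned long flags;
	int err = 0;

	spin_lock_irqsave(&q->lock, flags);
	skb = __skb_dequeue(q);
	if (skb && (skb_next = skb_peek(q)))
		err = SKB_EXT_ERR(skb_next)->ee.ee_errno;
	spin_unlock_irqrestore(&q->lock, flags);

	/* sk_err is no longer overwritten here; readers are only woken
	 * if another error is still queued.
	 */
	if (err)
		sk->sk_error_report(sk);

	return skb;
}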