diff mbox

[net-next] sock: consistent errqueue errors and signals

Message ID 1409534896-372-1-git-send-email-willemb@google.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Willem de Bruijn Sept. 1, 2014, 1:28 a.m. UTC
When a socket error is pending, send()/recv() must abort their normal
operation and return the error. An error means having non-zero
sk->sk_err or having non-empty sk->sk_error_queue.

Currently, the behavior for the second is inconsistent depending on
whether an error has previously been dequeued. In all cases,
recv()/send() test sk->sk_err. This is not modified on enqueue onto
the error queue, so may be 0. It is modified on dequeue, however, to
match the queued skb's errno. I observed the following when two errors
were queued:

  ret = poll(pollfd, 1, -1);
  assert(ret == 1);
  assert(pollfd.revents == POLLERR);

  ret = recv(fd, buf, size, MSG_NONBLOCK);
  assert(ret == -1 && errno == EAGAIN);		/* <-- A */

  ret = recv(fd, buf, size, MSG_ERRQUEUE);
  assert(ret > 0);

  ret = recv(fd, buf, size, MSG_NONBLOCK);
  assert(ret == -1 && errno == ENOMSG);		/* <-- B */

  ret = recv(fd, buf, size, MSG_ERRQUEUE);
  assert(ret > 0);

The recv call in B returns the error code embedded in
SKB_EXT_ERR(skb), in this case ENOMSG, because I am working with
timestamps. The recv call in A should have returned the
same.

Implement this behavior. This may surprise existing applications.

Also make the wake-up signal when data is ready on the error queue
consistent between enqueue and dequeue: use sk_error_report in both
cases.

Signed-off-by: Willem de Bruijn <willemb@google.com>

---

This approach leaves one issue:
The states of sk->sk_err and sk->sk_error_queue are related, but only
loosely. Error queue enqueue, dequeue and other code may overwrite
sk->sk_err unconditionally. For one, sock_error will reset
sk->sk_err to 0 even if sk->sk_error_queue is not empty. If socket
calls should abort on all errors, then should be change to test
sk_error_queue.qlen. But, doing so requires taking a lock in a busy
data path.
---
 net/core/skbuff.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Hannes Frederic Sowa Sept. 1, 2014, 9:25 p.m. UTC | #1
On So, 2014-08-31 at 21:28 -0400, Willem de Bruijn wrote:
> When a socket error is pending, send()/recv() must abort their normal
> operation and return the error. An error means having non-zero
> sk->sk_err or having non-empty sk->sk_error_queue.
> 
> Currently, the behavior for the second is inconsistent depending on
> whether an error has previously been dequeued. In all cases,
> recv()/send() test sk->sk_err. This is not modified on enqueue onto
> the error queue, so may be 0. It is modified on dequeue, however, to
> match the queued skb's errno. I observed the following when two errors
> were queued:
> 
>   ret = poll(pollfd, 1, -1);
>   assert(ret == 1);
>   assert(pollfd.revents == POLLERR);
> 
>   ret = recv(fd, buf, size, MSG_NONBLOCK);
>   assert(ret == -1 && errno == EAGAIN);		/* <-- A */
> 
>   ret = recv(fd, buf, size, MSG_ERRQUEUE);
>   assert(ret > 0);
> 
>   ret = recv(fd, buf, size, MSG_NONBLOCK);
>   assert(ret == -1 && errno == ENOMSG);		/* <-- B */
> 
>   ret = recv(fd, buf, size, MSG_ERRQUEUE);
>   assert(ret > 0);
> 
> The recv call in B returns the error code embedded in
> SKB_EXT_ERR(skb), in this case ENOMSG, because I am working with
> timestamps. The recv call in A should have returned the
> same.
> 
> Implement this behavior. This may surprise existing applications.
> 
> Also make the wake-up signal when data is ready on the error queue
> consistent between enqueue and dequeue: use sk_error_report in both
> cases.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> 
> ---
> 
> This approach leaves one issue:
> The states of sk->sk_err and sk->sk_error_queue are related, but only
> loosely. Error queue enqueue, dequeue and other code may overwrite
> sk->sk_err unconditionally. For one, sock_error will reset
> sk->sk_err to 0 even if sk->sk_error_queue is not empty. If socket
> calls should abort on all errors, then should be change to test
> sk_error_queue.qlen. But, doing so requires taking a lock in a busy
> data path.
> ---
>  net/core/skbuff.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 163b673..f7a280b 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3485,8 +3485,11 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
>  	skb_dst_force(skb);
>  
>  	skb_queue_tail(&sk->sk_error_queue, skb);
> +	sk->sk_err = SKB_EXT_ERR(skb)->ee.ee_errno;
> +
>  	if (!sock_flag(sk, SOCK_DEAD))
> -		sk->sk_data_ready(sk);
> +		sk->sk_error_report(sk);
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL(sock_queue_err_skb);

From my experience in IPv6 code, we only do sk->sk_err updates directly
in protocol error handling code. In case of UDP IPv6 errors for example
we now notify sk_error_report two times with this patch (before the
patch we did sk_data_ready (this is what you changed) and
sk_error_report).

I really wonder if setting sk->sk_err in this function is the right
thing to do. It also depends on socket state bits (e.g. np->recverr) if
the update happens. So we still cannot get rid of the protocol dependent
sk->sk_err updates.

It looks like we have to check all error handling functions in the
protocols. Maybe timestamp code needs to adapt?

Thanks,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willem de Bruijn Sept. 2, 2014, 3:20 p.m. UTC | #2
> From my experience in IPv6 code, we only do sk->sk_err updates directly
> in protocol error handling code. In case of UDP IPv6 errors for example
> we now notify sk_error_report two times with this patch (before the
> patch we did sk_data_ready (this is what you changed) and
> sk_error_report).

If the event is that an error is ready, is this not correct? The
wake up key should be POLLERR in both cases. In implementation
of sock_def_error_report and sock_def_readable, the difference
otherwise seems slim. I haven't checked all sk_data_ready and
sk_error_report implementations, though, so may have missed
differences for specific protocols. If this is not as obviously a strict
improvement as I thought, I'll just drop it.

> I really wonder if setting sk->sk_err in this function is the right
> thing to do.

I agree, in that it is hard to verify that this does not overwrite
an existing error. This patch only makes the behavior
consistent between enqueue and dequeue, but perhaps a
better way to achieve that is to change the dequeue side:
remove the assignment to sk->sk_err there. If so, then all
locations that currently check the state of sk->sk_err should
be changed to also check the qlen of the error queue and
if non-zero return the embedded error of the first skb. I'll
take a look whether that is feasible without adding locks
or atomics in the common path.

> It also depends on socket state bits (e.g. np->recverr) if
> the update happens. So we still cannot get rid of the protocol dependent
> sk->sk_err updates.
>
> It looks like we have to check all error handling functions in the
> protocols. Maybe timestamp code needs to adapt?

Does the above sound okay, or did you mean something else?

>
> Thanks,
> Hannes
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Sept. 2, 2014, 9:18 p.m. UTC | #3
Hello,

On Di, 2014-09-02 at 11:20 -0400, Willem de Bruijn wrote:
> > From my experience in IPv6 code, we only do sk->sk_err updates directly
> > in protocol error handling code. In case of UDP IPv6 errors for example
> > we now notify sk_error_report two times with this patch (before the
> > patch we did sk_data_ready (this is what you changed) and
> > sk_error_report).
> 
> If the event is that an error is ready, is this not correct? The
> wake up key should be POLLERR in both cases. In implementation
> of sock_def_error_report and sock_def_readable, the difference
> otherwise seems slim. I haven't checked all sk_data_ready and
> sk_error_report implementations, though, so may have missed
> differences for specific protocols. If this is not as obviously a strict
> improvement as I thought, I'll just drop it.

I am not sure if waking up the socket two times has unforeseen
consequences, we do so e.g. in __udp6_lib_err if recverr is set.

> > I really wonder if setting sk->sk_err in this function is the right
> > thing to do.
> 
> I agree, in that it is hard to verify that this does not overwrite
> an existing error. This patch only makes the behavior
> consistent between enqueue and dequeue, but perhaps a
> better way to achieve that is to change the dequeue side:
> remove the assignment to sk->sk_err there. If so, then all
> locations that currently check the state of sk->sk_err should
> be changed to also check the qlen of the error queue and
> if non-zero return the embedded error of the first skb. I'll
> take a look whether that is feasible without adding locks
> or atomics in the common path.

That would be great.

> 
> > It also depends on socket state bits (e.g. np->recverr) if
> > the update happens. So we still cannot get rid of the protocol dependent
> > sk->sk_err updates.
> >
> > It looks like we have to check all error handling functions in the
> > protocols. Maybe timestamp code needs to adapt?
> 
> Does the above sound okay, or did you mean something else?

Best thing would be to not keep the error status two times per socket.
Maybe it would make sense to always synchronize on the error queue and
don't check for sk->sk_err at all? It seems to get very hairy without
taking any locks though.

I even don't know what the semantics for sk_err should be. Should we
leave the oldest error in place until it got fetched? Then we could use
cmpxchg in slow path with 0 as the old value. They could easily become
unsynchronized if the user switches off recverr setsockopt. But I don't
think we need to handle that.

I think best effort should would be ok, too. Not having locked
instructions in fast path is much more important.

Thanks,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willem de Bruijn Sept. 3, 2014, 2:34 a.m. UTC | #4
>> I agree, in that it is hard to verify that this does not overwrite
>> an existing error. This patch only makes the behavior
>> consistent between enqueue and dequeue, but perhaps a
>> better way to achieve that is to change the dequeue side:
>> remove the assignment to sk->sk_err there. If so, then all
>> locations that currently check the state of sk->sk_err should
>> be changed to also check the qlen of the error queue and
>> if non-zero return the embedded error of the first skb. I'll
>> take a look whether that is feasible without adding locks
>> or atomics in the common path.
>
> That would be great.

This looks feasible, though the patch is considerably longer.
If we can be sure that in skb_dequeue_err_skb in this patch

+       sk->sk_err = SKB_EXT_ERR(skb)->ee.ee_errno;

never overwrites an existing error, I think that this short
patch is preferable.

I wrote a preliminary alternative patch that modifies
sock_error to check the queue and converts other
code that manually tests sk->sk_err to instead call a new
function sock_has_error that similarly tests both. The patch
is rough and minimally tested, but I will send it out as rfc.

>>
>> > It also depends on socket state bits (e.g. np->recverr) if
>> > the update happens. So we still cannot get rid of the protocol dependent
>> > sk->sk_err updates.
>> >
>> > It looks like we have to check all error handling functions in the
>> > protocols. Maybe timestamp code needs to adapt?
>>
>> Does the above sound okay, or did you mean something else?
>
> Best thing would be to not keep the error status two times per socket.

I agree in principle, but this may be hard to achieve.

> Maybe it would make sense to always synchronize on the error queue and
> don't check for sk->sk_err at all?

sk->sk_err is set in many locations. I don't think that we can change all
those sites.

>It seems to get very hairy without
> taking any locks though.

To fix this particular issue, I prefer to leave it sk->sk_err as is, instead
separate the error queue logic from it and make sure that that has
consistent semantics.

> I even don't know what the semantics for sk_err should be.

Neither do I. That's one reason I'd rather not touch it :)

> Should we
> leave the oldest error in place until it got fetched? Then we could use
> cmpxchg in slow path with 0 as the old value. They could easily become
> unsynchronized if the user switches off recverr setsockopt. But I don't
> think we need to handle that.
>
> I think best effort should would be ok, too. Not having locked
> instructions in fast path is much more important.

We can optimistically test the qlen without a lock and only take the
lock and peek into the queue in the unlikely case that qlen is non-zero.
This is racy, but already done in various poll routines. It seems to be
protected some other way, or the race is deemed benign. Either way,
the patch does not change these callers. For recv/send, which it does
change, best effort seems acceptable to me, too. Perhaps we can do
better, but I'll send out the patch as is for now.

On a related note, a comment in net/core/sock.c states that
sk->sk_err is only changed under lock. This is not currently
true. skb_dequeue_err_skb does not take the socket lock,
for one.

> Thanks,
> Hannes

Thanks for having a look at the patch!
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Sept. 3, 2014, 12:14 p.m. UTC | #5
On So, 2014-08-31 at 21:28 -0400, Willem de Bruijn wrote:
> When a socket error is pending, send()/recv() must abort their normal
> operation and return the error. An error means having non-zero
> sk->sk_err or having non-empty sk->sk_error_queue.
> 
> Currently, the behavior for the second is inconsistent depending on
> whether an error has previously been dequeued. In all cases,
> recv()/send() test sk->sk_err. This is not modified on enqueue onto
> the error queue, so may be 0. It is modified on dequeue, however, to
> match the queued skb's errno. I observed the following when two errors
> were queued:
> 
>   ret = poll(pollfd, 1, -1);
>   assert(ret == 1);
>   assert(pollfd.revents == POLLERR);
> 
>   ret = recv(fd, buf, size, MSG_NONBLOCK);
>   assert(ret == -1 && errno == EAGAIN);		/* <-- A */
> 
>   ret = recv(fd, buf, size, MSG_ERRQUEUE);
>   assert(ret > 0);
> 
>   ret = recv(fd, buf, size, MSG_NONBLOCK);
>   assert(ret == -1 && errno == ENOMSG);		/* <-- B */
> 
>   ret = recv(fd, buf, size, MSG_ERRQUEUE);
>   assert(ret > 0);
> 
> The recv call in B returns the error code embedded in
> SKB_EXT_ERR(skb), in this case ENOMSG, because I am working with
> timestamps. The recv call in A should have returned the
> same.
> 
> Implement this behavior. This may surprise existing applications.
> 
> Also make the wake-up signal when data is ready on the error queue
> consistent between enqueue and dequeue: use sk_error_report in both
> cases.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> 
> ---
> 
> This approach leaves one issue:
> The states of sk->sk_err and sk->sk_error_queue are related, but only
> loosely. Error queue enqueue, dequeue and other code may overwrite
> sk->sk_err unconditionally. For one, sock_error will reset
> sk->sk_err to 0 even if sk->sk_error_queue is not empty. If socket
> calls should abort on all errors, then should be change to test
> sk_error_queue.qlen. But, doing so requires taking a lock in a busy
> data path.
> ---
>  net/core/skbuff.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 163b673..f7a280b 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3485,8 +3485,11 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
>  	skb_dst_force(skb);
>  
>  	skb_queue_tail(&sk->sk_error_queue, skb);
> +	sk->sk_err = SKB_EXT_ERR(skb)->ee.ee_errno;
> +
>  	if (!sock_flag(sk, SOCK_DEAD))
> -		sk->sk_data_ready(sk);
> +		sk->sk_error_report(sk);
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL(sock_queue_err_skb);

I just noticed another problem with this approach why I think we cannot
use it.

In case we generate an error back to the socket locally because
something (e.g. the packet was too big) in output path, we must not
update sk->sk_err because we return it immediately to the sender but
nonetheless must update the error queue.

It seems to me that this patch would report the same error two times to
the program then.

I'll check your other patch later today, thanks!

Bye,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willem de Bruijn Sept. 3, 2014, 7:40 p.m. UTC | #6
> I just noticed another problem with this approach why I think we cannot
> use it.
>
> In case we generate an error back to the socket locally because
> something (e.g. the packet was too big) in output path, we must not
> update sk->sk_err because we return it immediately to the sender but
> nonetheless must update the error queue.
>
> It seems to me that this patch would report the same error two times to
> the program then.

For instance in __ip_append_data reaching

  ip_local_error(sk, EMSGSIZE, ..);
  return -EMSGSIZE;

With this patch, where sk->sk_err is set, the error will continuously
be returned on send/recv until the caller reads from the error queue.
A recvmsg MSG_ERRQUEUE will cause the error to be reset immediately.
If socket option IP_RECVERR is set, the process should immediately
read the error queue when a send call fails with EMSGSIZE or any other
relevant error. In that case, the error is not reported more than
once.

The alternative patch does not overwrite sk->sk_err, but results in
this same modified user-visible behavior.

Without the patch, after a send call fails in the above code, the
process can continue calling send/recv normally while ignoring the
error, since all "if (sk->sk_err)" checks fall through. If we have to
treat this as legacy behavior, then neither patch should be applied.
Then, the semantics are that queued errors are non-blocking. In that
case, the only patch needed for consistent semantics is to remove the
sk->sk_err assigment in skb_dequeue_err_skb.

On a related note, when returning an error, returning the errno from
the item on the queue is a confusing signal. Some error codes are the
result of protocol processing and have nothing queued (e.g., EINVAL),
others are due to an queued error (ENOMSG) and some are simply
ambiguous (EMSGSIZE). It is not reasonable to expect processes to
figure out the three sets. One solution would be to dedicate a special
error code to denote "there is a message queued onto sk_error_queue,
read the actual errno from there". This choice only relevant if we
decide to return an error, at all, of course.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Sept. 4, 2014, 12:17 a.m. UTC | #7
Hi,

On Wed, Sep 3, 2014, at 21:40, Willem de Bruijn wrote:
> > I just noticed another problem with this approach why I think we cannot
> > use it.
> >
> > In case we generate an error back to the socket locally because
> > something (e.g. the packet was too big) in output path, we must not
> > update sk->sk_err because we return it immediately to the sender but
> > nonetheless must update the error queue.
> >
> > It seems to me that this patch would report the same error two times to
> > the program then.
> 
> For instance in __ip_append_data reaching
> 
>   ip_local_error(sk, EMSGSIZE, ..);
>   return -EMSGSIZE;
> 
> With this patch, where sk->sk_err is set, the error will continuously
> be returned on send/recv until the caller reads from the error queue.
> A recvmsg MSG_ERRQUEUE will cause the error to be reset immediately.

sk->sk_err will be reset by sock_error(), called from
__skb_recv_datagram or sock_alloc_send_pskb. So a normal receive or send
(even without MSG_ERRQUEUE) should clean the error code and report it.
No further looks into the error queue should happen currently (I am
currently talking about the patch in this thread).

I was concerned that we report the error twice. Once in the sendmsg path
and again on recvmsg or sendmsg, which finally clears sk_err.

> If socket option IP_RECVERR is set, the process should immediately
> read the error queue when a send call fails with EMSGSIZE or any other
> relevant error. In that case, the error is not reported more than
> once.

I agree but am afraid of applications start to break.
 
> The alternative patch does not overwrite sk->sk_err, but results in
> this same modified user-visible behavior.
> 
> Without the patch, after a send call fails in the above code, the
> process can continue calling send/recv normally while ignoring the
> error, since all "if (sk->sk_err)" checks fall through. If we have to
> treat this as legacy behavior, then neither patch should be applied.
> Then, the semantics are that queued errors are non-blocking. In that
> case, the only patch needed for consistent semantics is to remove the
> sk->sk_err assigment in skb_dequeue_err_skb.

I am afraid that we need to go down this route. :/

I would even let the sk->sk_error_report call be handled by the protocol
error reporting function, as it would need to deal with sk->sk_err, too.

Most applications susceptible to breakage here are ping and
tracepath/traceroute. I don't think they all check MSG_ERRQUEUE always
on every error. But it shouldn't matter that much because they are only
short running apps and error queue will be freed if application exits
before socket accounting will shut down the socket. If we start to
report one instance of an error event multiple times, I think they might
become confused.

> On a related note, when returning an error, returning the errno from
> the item on the queue is a confusing signal. Some error codes are the
> result of protocol processing and have nothing queued (e.g., EINVAL),
> others are due to an queued error (ENOMSG) and some are simply
> ambiguous (EMSGSIZE). It is not reasonable to expect processes to
> figure out the three sets. One solution would be to dedicate a special
> error code to denote "there is a message queued onto sk_error_queue,
> read the actual errno from there". This choice only relevant if we
> decide to return an error, at all, of course.

errno messages are higly sensitive regarding backwards compatibility. I
am totally with you that a special errno marker to indicate that a
packet was enqueued on the error queue would be much better. Currently
applications always have to query error queue on error indication,
otherwise socket accounting might shut down the socket at some point.
Sadly I don't see this change to happen.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willem de Bruijn Sept. 4, 2014, 6:13 p.m. UTC | #8
> I was concerned that we report the error twice. Once in the sendmsg path
> and again on recvmsg or sendmsg, which finally clears sk_err.

I had overlooked that messages can be queued onto the error queue
during send call processing (most cases of ip_local_error) as well as from
asynchronous events (ip_icmp_error, skb_complete_tx_timestamp, ..).

For the second case, the patch fixes that an enqueued message does
not have its error returned on the first subsequent system call. By
setting sk->sk_err unconditionally, the patch indeed introduces a
double notification in the other case.

Both paths can be made correct if sk->sk_err is set on enqueue
only when not called from inside the send() context. The easiest
change for timestamping is to set it in __skb_tstamp_tx directly.
Slightly more general would be to add

* /* queue errors from asynchronous events: report the error on the
next syscall */
+ int sock_queue_err_skb_and_notify(struct sock *sk, struct sk_buff *skb)
+ {
+  int ret = sock_queue_err_skb(sk);
+  if (!ret)
+    sk->sk_err = SKB_EXT_ERR(skb)->ee.ee_errno;
+ }

and call this from __skb_tstamp_tx, skb_complete_tx_timestamp,
skb_complete_wifi_ack. It is perhaps also safe to call from
ip[v6]_icmp_error, but as you point out there are many legacy expectations
there, so it may be best to leave that code path as is.

>> If socket option IP_RECVERR is set, the process should immediately
>> read the error queue when a send call fails with EMSGSIZE or any other
>> relevant error. In that case, the error is not reported more than
>> once.
>
> I agree but am afraid of applications start to break.
>
>> The alternative patch does not overwrite sk->sk_err, but results in
>> this same modified user-visible behavior.
>>
>> Without the patch, after a send call fails in the above code, the
>> process can continue calling send/recv normally while ignoring the
>> error, since all "if (sk->sk_err)" checks fall through. If we have to
>> treat this as legacy behavior, then neither patch should be applied.
>> Then, the semantics are that queued errors are non-blocking. In that
>> case, the only patch needed for consistent semantics is to remove the
>> sk->sk_err assigment in skb_dequeue_err_skb.
>
> I am afraid that we need to go down this route. :/
>
> I would even let the sk->sk_error_report call be handled by the protocol
> error reporting function, as it would need to deal with sk->sk_err, too.
>
> Most applications susceptible to breakage here are ping and
> tracepath/traceroute. I don't think they all check MSG_ERRQUEUE always
> on every error. But it shouldn't matter that much because they are only
> short running apps and error queue will be freed if application exits
> before socket accounting will shut down the socket. If we start to
> report one instance of an error event multiple times, I think they might
> become confused.
>
>> On a related note, when returning an error, returning the errno from
>> the item on the queue is a confusing signal. Some error codes are the
>> result of protocol processing and have nothing queued (e.g., EINVAL),
>> others are due to an queued error (ENOMSG) and some are simply
>> ambiguous (EMSGSIZE). It is not reasonable to expect processes to
>> figure out the three sets. One solution would be to dedicate a special
>> error code to denote "there is a message queued onto sk_error_queue,
>> read the actual errno from there". This choice only relevant if we
>> decide to return an error, at all, of course.
>
> errno messages are higly sensitive regarding backwards compatibility. I
> am totally with you that a special errno marker to indicate that a
> packet was enqueued on the error queue would be much better. Currently
> applications always have to query error queue on error indication,
> otherwise socket accounting might shut down the socket at some point.
> Sadly I don't see this change to happen.
>
> Bye,
> Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willem de Bruijn Sept. 4, 2014, 6:18 p.m. UTC | #9
On Thu, Sep 4, 2014 at 2:13 PM, Willem de Bruijn <willemb@google.com> wrote:
>> I was concerned that we report the error twice. Once in the sendmsg path
>> and again on recvmsg or sendmsg, which finally clears sk_err.
>
> I had overlooked that messages can be queued onto the error queue
> during send call processing (most cases of ip_local_error) as well as from
> asynchronous events (ip_icmp_error, skb_complete_tx_timestamp, ..).
>
> For the second case, the patch fixes that an enqueued message does
> not have its error returned on the first subsequent system call. By
> setting sk->sk_err unconditionally, the patch indeed introduces a
> double notification in the other case.
>
> Both paths can be made correct if sk->sk_err is set on enqueue
> only when not called from inside the send() context. The easiest
> change for timestamping is to set it in __skb_tstamp_tx directly.
> Slightly more general would be to add
>
> * /* queue errors from asynchronous events: report the error on the
> next syscall */
> + int sock_queue_err_skb_and_notify(struct sock *sk, struct sk_buff *skb)
> + {
> +  int ret = sock_queue_err_skb(sk);
> +  if (!ret)
> +    sk->sk_err = SKB_EXT_ERR(skb)->ee.ee_errno;
> + }


This was just a quick, buggy, sketch. The errno should be read and
cached locally before calling sock_queue_err_skb.

+ int ret, errno;
+
+ errno = SKB_EXT_ERR(skb)->ee.ee_errno;
+ ret = sock_queue_err_skb(sk, skb);
+ if (!ret)
+   sk->sk_err = errno;

The assignment is also not atomic and not protected by the socket
lock here. Neither is it on dequeue, but this warrants looking into.

> and call this from __skb_tstamp_tx, skb_complete_tx_timestamp,
> skb_complete_wifi_ack. It is perhaps also safe to call from
> ip[v6]_icmp_error, but as you point out there are many legacy expectations
> there, so it may be best to leave that code path as is.
>
>>> If socket option IP_RECVERR is set, the process should immediately
>>> read the error queue when a send call fails with EMSGSIZE or any other
>>> relevant error. In that case, the error is not reported more than
>>> once.
>>
>> I agree but am afraid of applications start to break.
>>
>>> The alternative patch does not overwrite sk->sk_err, but results in
>>> this same modified user-visible behavior.
>>>
>>> Without the patch, after a send call fails in the above code, the
>>> process can continue calling send/recv normally while ignoring the
>>> error, since all "if (sk->sk_err)" checks fall through. If we have to
>>> treat this as legacy behavior, then neither patch should be applied.
>>> Then, the semantics are that queued errors are non-blocking. In that
>>> case, the only patch needed for consistent semantics is to remove the
>>> sk->sk_err assigment in skb_dequeue_err_skb.
>>
>> I am afraid that we need to go down this route. :/
>>
>> I would even let the sk->sk_error_report call be handled by the protocol
>> error reporting function, as it would need to deal with sk->sk_err, too.
>>
>> Most applications susceptible to breakage here are ping and
>> tracepath/traceroute. I don't think they all check MSG_ERRQUEUE always
>> on every error. But it shouldn't matter that much because they are only
>> short running apps and error queue will be freed if application exits
>> before socket accounting will shut down the socket. If we start to
>> report one instance of an error event multiple times, I think they might
>> become confused.
>>
>>> On a related note, when returning an error, returning the errno from
>>> the item on the queue is a confusing signal. Some error codes are the
>>> result of protocol processing and have nothing queued (e.g., EINVAL),
>>> others are due to an queued error (ENOMSG) and some are simply
>>> ambiguous (EMSGSIZE). It is not reasonable to expect processes to
>>> figure out the three sets. One solution would be to dedicate a special
>>> error code to denote "there is a message queued onto sk_error_queue,
>>> read the actual errno from there". This choice only relevant if we
>>> decide to return an error, at all, of course.
>>
>> errno messages are higly sensitive regarding backwards compatibility. I
>> am totally with you that a special errno marker to indicate that a
>> packet was enqueued on the error queue would be much better. Currently
>> applications always have to query error queue on error indication,
>> otherwise socket accounting might shut down the socket at some point.
>> Sadly I don't see this change to happen.
>>
>> Bye,
>> Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 163b673..f7a280b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3485,8 +3485,11 @@  int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
 	skb_dst_force(skb);
 
 	skb_queue_tail(&sk->sk_error_queue, skb);
+	sk->sk_err = SKB_EXT_ERR(skb)->ee.ee_errno;
+
 	if (!sock_flag(sk, SOCK_DEAD))
-		sk->sk_data_ready(sk);
+		sk->sk_error_report(sk);
+
 	return 0;
 }
 EXPORT_SYMBOL(sock_queue_err_skb);