diff mbox

PROBLEM: network data corruption (bisected to e5a4b0bb803b)

Message ID 20170210214513.GA8937@ZenIV.linux.org.uk
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Al Viro Feb. 10, 2017, 9:45 p.m. UTC
[repost with netdev added - hadn't realized it wasn't in Cc]

On Tue, Aug 09, 2016 at 03:58:36PM +0100, Al Viro wrote:

> Actually returning to the original behaviour would be "restore ->msg_iter
> if we tried skb_copy_and_csum_datagram() and failed for any reason".  Which
> would be bloody inconsistent wrt EFAULT, since the other branch (chunk
> large enough to cover the entire recvmsg()) will copy as much as it can
> and (in old kernel) drain iovec or (on the current one) leave iov_iter
> advance unreverted.

To resurrect the old thread: the problem is still there.  Namely, csum
mismatch on packet should leave the iterator as it had been.  That much
is clear; the question is what should be done on EFAULT halfway through.

Semantics of both csum and non-csum skb_copy_datagram_msg() variants in
EFAULT case is an interesting question.  None of that family report
partial copy; it's full or -EFAULT.  So for the sake of basic sanity
it would be better to leave iterator in the original state when that
kind of thing happens.  On the other hand, quite a few callers don't
care about the state of iterator after that and I wonder if the overhead
would be sensitive.  OTTH, the overhead in question is "save 5 words into
local variable and don't use it in the normal case" - in the code that
copies an skb worth of data.

AFAICS, the following gives consistent (and minimally surprising) semantics,
as well as fixing the outright bug with iov_iter left advanced in case of csum
errors.  Comments?

Comments

Christian Lamparter Feb. 11, 2017, 7:37 p.m. UTC | #1
On Friday, February 10, 2017 9:45:13 PM CET Al Viro wrote:
> On Tue, Aug 09, 2016 at 03:58:36PM +0100, Al Viro wrote:
> 
> > Actually returning to the original behaviour would be "restore ->msg_iter
> > if we tried skb_copy_and_csum_datagram() and failed for any reason".  Which
> > would be bloody inconsistent wrt EFAULT, since the other branch (chunk
> > large enough to cover the entire recvmsg()) will copy as much as it can
> > and (in old kernel) drain iovec or (on the current one) leave iov_iter
> > advance unreverted.
> 
> To resurrect the old thread: the problem is still there.  Namely, csum
> mismatch on packet should leave the iterator as it had been.  That much
> is clear; the question is what should be done on EFAULT halfway through.
Thanks for being very persistent with this. The original problem report 
was just about the data corruption issue. I think everyone involved agrees
that restoring the iterator for cases where the checksum check failed
is definitely the right action. (And of course: It is high time that the
data corruption issue gets fixed).

However, it's as you have said earlier about -EFAULT:
"[...] That's why I hadn't simply ACKed the proposed patch; it very much smells
like we have something bogus with EFAULT handling in the whole area."

Because from the explanations that: (*) "-EFAULT can happen at any point, with
zero warning before you get actual page fault when copying the data and
have handle_mm_fault() return VM_FAULT_ERROR. "

I think if you follow through with this argument. You have the problem of:
How to handle EFAULT from skb_copy_datagram_* (and all it's "wrappers")?

Because on one hand, the iovec could be partially bad. I remember that 
the application could do the following shenanigans during recvmsg: 
 - mprotect() could've flipped page read-only and back to read-write.
 - Or truncate() could've shortened the mmapped file,
 - etc.

In this case the error should be propagated back to the userspace.

But OTOH, it could just be a temporary failure (*) and restoring the
iovec and trying again is needed.

Is this a correct/complete assessment of the problem at hand? Or did
I make a mistake / wrong assumption in there?

> Semantics of both csum and non-csum skb_copy_datagram_msg() variants in
> EFAULT case is an interesting question.  None of that family report
                                                              support?
> partial copy; it's full or -EFAULT.  So for the sake of basic sanity
> it would be better to leave iterator in the original state when that
> kind of thing happens.  On the other hand, quite a few callers don't
> care about the state of iterator after that and I wonder if the overhead
> would be sensitive.  OTTH, the overhead in question is "save 5 words into
> local variable and don't use it in the normal case" - in the code that
> copies an skb worth of data.
> 
> AFAICS, the following gives consistent (and minimally surprising) semantics,
> as well as fixing the outright bug with iov_iter left advanced in case of csum
> errors.  Comments?

I'm looking at:
<http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c#L4668>
<http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c#L5232>
<http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c#L5465>

From what I can see, the tcp functions tcp_data_queue(),
tcp_copy_to_iovec() and tcp_rcv_established() would need to be
extended to handle EFAULT. Because if the iovec is restored
and the application did something bad (mprotect(), truncate(),
 ...), this code would sort of loop?

If this is the case: How many retries do we want, before we can
say it is a permament failure (and abort)?

Regards,
Christian

[...]
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index f1adddc1c5ac..ee8d962373af 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3060,8 +3060,17 @@ struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned flags, int noblock,
>  				  int *err);
>  unsigned int datagram_poll(struct file *file, struct socket *sock,
>  			   struct poll_table_struct *wait);
> -int skb_copy_datagram_iter(const struct sk_buff *from, int offset,
> +int __skb_copy_datagram_iter(const struct sk_buff *from, int offset,
>  			   struct iov_iter *to, int size);
> +static inline int skb_copy_datagram_iter(const struct sk_buff *from, int offset,
> +					struct iov_iter *to, int size)
> +{
> +	struct iov_iter saved = *to;
> +	int res = __skb_copy_datagram_iter(from, offset, to, size);
> +	if (unlikely(res))
> +		*to = saved;
> +	return res;
> +}
>  static inline int skb_copy_datagram_msg(const struct sk_buff *from, int offset,
>  					struct msghdr *msg, int size)
>  {
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index ea633342ab0d..33ff2046dda1 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -394,7 +394,7 @@ EXPORT_SYMBOL(skb_kill_datagram);
>   *	@to: iovec iterator to copy to
>   *	@len: amount of data to copy from buffer to iovec
>   */
> -int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
> +int __skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
>  			   struct iov_iter *to, int len)
>  {
>  	int start = skb_headlen(skb);
> @@ -445,7 +445,7 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
>  		if ((copy = end - offset) > 0) {
>  			if (copy > len)
>  				copy = len;
> -			if (skb_copy_datagram_iter(frag_iter, offset - start,
> +			if (__skb_copy_datagram_iter(frag_iter, offset - start,
>  						   to, copy))
>  				goto fault;
>  			if ((len -= copy) == 0)
> @@ -471,7 +471,7 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
>  
>  	return 0;
>  }
> -EXPORT_SYMBOL(skb_copy_datagram_iter);
> +EXPORT_SYMBOL(__skb_copy_datagram_iter);
>  
>  /**
>   *	skb_copy_datagram_from_iter - Copy a datagram from an iov_iter.
> @@ -750,14 +750,16 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
>  {
>  	__wsum csum;
>  	int chunk = skb->len - hlen;
> +	struct iov_iter saved;
>  
>  	if (!chunk)
>  		return 0;
>  
> +	saved = msg->msg_iter;
>  	if (msg_data_left(msg) < chunk) {
>  		if (__skb_checksum_complete(skb))
>  			goto csum_error;
> -		if (skb_copy_datagram_msg(skb, hlen, msg, chunk))
> +		if (__skb_copy_datagram_iter(skb, hlen, &msg->msg_iter, chunk))
>  			goto fault;
>  	} else {
>  		csum = csum_partial(skb->data, hlen, skb->csum);
> @@ -771,8 +773,10 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
>  	}
>  	return 0;
>  csum_error:
> +	msg->msg_iter = saved;
>  	return -EINVAL;
>  fault:
> +	msg->msg_iter = saved;
>  	return -EFAULT;
>  }
>  EXPORT_SYMBOL(skb_copy_and_csum_datagram_msg);

You mentioned "overhead" a few times. I tried to measure it with
iperf3 over loopback on a i7-4770. I lowered the MTU to 1500 in 
the hope to see see any difference and It still was very minute: 
188.33 GiB (without) vs 187.89 GiB (patched) for 100 seconds TCP
over IPv4. Of course, I would like to hear more results. Are 
there any special settings that could be interesting and worthwile?
Al Viro Feb. 12, 2017, 5:42 a.m. UTC | #2
On Sat, Feb 11, 2017 at 08:37:06PM +0100, Christian Lamparter wrote:

> I think if you follow through with this argument. You have the problem of:
> How to handle EFAULT from skb_copy_datagram_* (and all it's "wrappers")?
> 
> Because on one hand, the iovec could be partially bad. I remember that 
> the application could do the following shenanigans during recvmsg: 
>  - mprotect() could've flipped page read-only and back to read-write.
>  - Or truncate() could've shortened the mmapped file,
>  - etc.
> 
> In this case the error should be propagated back to the userspace.
> 
> But OTOH, it could just be a temporary failure (*) and restoring the
> iovec and trying again is needed.

No.  You can't _rely_ upon -EFAULT being repeated, but it's not something
you would expect to retry your way out of.

The sane semantics is
	* fail read/recvmsg (with EFAULT) if it's a datagram socket
	* fail if it's a stream socket and nothing has been read by
that point
	* a short read if something has been already read.

> Is this a correct/complete assessment of the problem at hand? Or did
> I make a mistake / wrong assumption in there?

> I'm looking at:
> <http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c#L4668>
> <http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c#L5232>
> <http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c#L5465>
> 
> >From what I can see, the tcp functions tcp_data_queue(),
> tcp_copy_to_iovec() and tcp_rcv_established() would need to be
> extended to handle EFAULT. Because if the iovec is restored
> and the application did something bad (mprotect(), truncate(),
>  ...), this code would sort of loop?

tcp_v4_do_rcv() has every right to copy nothing whatsoever - it's a fastpath
and when e.g. it's called in context of another thread or when skb isn't the
next fragment expected it won't bother with tcp_copy_to_iovec() at all.
Failure to copy anything in there is just fine, as long as you don't end
up buggering tp->ucopy state (in particular, tp->ucopy.msg->msg_iter).

> If this is the case: How many retries do we want, before we can
> say it is a permament failure (and abort)?

We don't want any.  What happens is that this path won't copy anything and
when that skb gets to
                        err = skb_copy_datagram_msg(skb, offset, msg, used);
                        if (err) {
                                /* Exception. Bailout! */
                                if (!copied)
                                        copied = -EFAULT;
                                break;
                        }
in tcp_recvmsg() we'll get our short read.

Again, the trouble is not with tcp_v4_do_rcv() failing to copy something -
it's failing to copy and ending up with iov_iter advanced that might be
a problem.  E.g. tp->ucopy.len getting out of sync with tp->ucopy.msg->msg_iter,
etc.

Short read on fault is fine.  So's full copy if somebody had been flipping
memprotect() and slow path ends up catching the moment when the buffer is
writable.  Both outcomes are fine.  Having the same memprotect() flipping
leave ->msg_iter more than one would expect by tp->ucopy.len and everything
back with copy_to_user working again, OTOH, might confuse tcp_input.c.
Christian Lamparter Feb. 13, 2017, 9:56 p.m. UTC | #3
On Sunday, February 12, 2017 5:42:18 AM CET Al Viro wrote:
> On Sat, Feb 11, 2017 at 08:37:06PM +0100, Christian Lamparter wrote:
> 
> > I think if you follow through with this argument. You have the problem of:
> > How to handle EFAULT from skb_copy_datagram_* (and all it's "wrappers")?
> > 
> > Because on one hand, the iovec could be partially bad. I remember that 
> > the application could do the following shenanigans during recvmsg: 
> >  - mprotect() could've flipped page read-only and back to read-write.
> >  - Or truncate() could've shortened the mmapped file,
> >  - etc.
> > 
> > In this case the error should be propagated back to the userspace.
> > 
> > But OTOH, it could just be a temporary failure (*) and restoring the
> > iovec and trying again is needed.
> 
> No.  You can't _rely_ upon -EFAULT being repeated, but it's not something
> you would expect to retry your way out of.
> 
> The sane semantics is
> 	* fail read/recvmsg (with EFAULT) if it's a datagram socket
> 	* fail if it's a stream socket and nothing has been read by
> that point
> 	* a short read if something has been already read.
> 
> > Is this a correct/complete assessment of the problem at hand? Or did
> > I make a mistake / wrong assumption in there?
> 
> > I'm looking at:
> > <http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c#L4668>
> > <http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c#L5232>
> > <http://lxr.free-electrons.com/source/net/ipv4/tcp_input.c#L5465>
> > 
> > >From what I can see, the tcp functions tcp_data_queue(),
> > tcp_copy_to_iovec() and tcp_rcv_established() would need to be
> > extended to handle EFAULT. Because if the iovec is restored
> > and the application did something bad (mprotect(), truncate(),
> >  ...), this code would sort of loop?
> 
> tcp_v4_do_rcv() has every right to copy nothing whatsoever - it's a fastpath
> and when e.g. it's called in context of another thread or when skb isn't the
> next fragment expected it won't bother with tcp_copy_to_iovec() at all.
> Failure to copy anything in there is just fine, as long as you don't end
> up buggering tp->ucopy state (in particular, tp->ucopy.msg->msg_iter).
> 
> > If this is the case: How many retries do we want, before we can
> > say it is a permament failure (and abort)?
> 
> We don't want any.  What happens is that this path won't copy anything and
> when that skb gets to
>                         err = skb_copy_datagram_msg(skb, offset, msg, used);
>                         if (err) {
>                                 /* Exception. Bailout! */
>                                 if (!copied)
>                                         copied = -EFAULT;
>                                 break;
>                         }
> in tcp_recvmsg() we'll get our short read.
> 
> Again, the trouble is not with tcp_v4_do_rcv() failing to copy something -
> it's failing to copy and ending up with iov_iter advanced that might be
> a problem.  E.g. tp->ucopy.len getting out of sync with tp->ucopy.msg->msg_iter,
> etc.
> 
> Short read on fault is fine.  So's full copy if somebody had been flipping
> memprotect() and slow path ends up catching the moment when the buffer is
> writable.  Both outcomes are fine.  Having the same memprotect() flipping
> leave ->msg_iter more than one would expect by tp->ucopy.len and everything
> back with copy_to_user working again, OTOH, might confuse tcp_input.c.
> 
Ok, thank you for sticking around. As for the patch: I've tested it with
the dlbug program from <https://lkml.org/lkml/2016/7/26/25> (modified to
pull from a local server) and the netem corruption policy as described
in <https://lkml.org/lkml/2016/8/3/181>. 
It works as expected. I did not get a single corruption with the patch
applied. Without the patch: every try had corruptions in random places. 

Tested-by: Christian Lamparter <chunkeey@googlemail.com>

Regards,
Christian
diff mbox

Patch

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index c27011bbe30c..14ae17e77603 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -848,7 +848,7 @@  static ssize_t macvtap_put_user(struct macvtap_queue *q,
 		vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
 		total += VLAN_HLEN;
 
-		ret = skb_copy_datagram_iter(skb, 0, iter, vlan_offset);
+		ret = __skb_copy_datagram_iter(skb, 0, iter, vlan_offset);
 		if (ret || !iov_iter_count(iter))
 			goto done;
 
@@ -857,7 +857,7 @@  static ssize_t macvtap_put_user(struct macvtap_queue *q,
 			goto done;
 	}
 
-	ret = skb_copy_datagram_iter(skb, vlan_offset, iter,
+	ret = __skb_copy_datagram_iter(skb, vlan_offset, iter,
 				     skb->len - vlan_offset);
 
 done:
@@ -899,11 +899,14 @@  static ssize_t macvtap_do_read(struct macvtap_queue *q,
 		finish_wait(sk_sleep(&q->sk), &wait);
 
 	if (skb) {
+		struct iov_iter saved = *to;
 		ret = macvtap_put_user(q, skb, to);
-		if (unlikely(ret < 0))
+		if (unlikely(ret < 0)) {
+			*to = saved;
 			kfree_skb(skb);
-		else
+		} else {
 			consume_skb(skb);
+		}
 	}
 	return ret;
 }
diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index a411b43a69eb..0d8badc3c4e9 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -480,7 +480,7 @@  static ssize_t ppp_read(struct file *file, char __user *buf,
 	iov.iov_base = buf;
 	iov.iov_len = count;
 	iov_iter_init(&to, READ, &iov, 1, count);
-	if (skb_copy_datagram_iter(skb, 0, &to, skb->len))
+	if (__skb_copy_datagram_iter(skb, 0, &to, skb->len))
 		goto outf;
 	ret = skb->len;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 30863e378925..2003b8c9970e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1430,7 +1430,7 @@  static ssize_t tun_put_user(struct tun_struct *tun,
 
 		vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
 
-		ret = skb_copy_datagram_iter(skb, 0, iter, vlan_offset);
+		ret = __skb_copy_datagram_iter(skb, 0, iter, vlan_offset);
 		if (ret || !iov_iter_count(iter))
 			goto done;
 
@@ -1439,7 +1439,8 @@  static ssize_t tun_put_user(struct tun_struct *tun,
 			goto done;
 	}
 
-	skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset);
+	/* XXX: no error check? */
+	__skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset);
 
 done:
 	/* caller is in process context, */
@@ -1501,6 +1502,7 @@  static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 {
 	struct sk_buff *skb;
 	ssize_t ret;
+	struct iov_iter saved;
 	int err;
 
 	tun_debug(KERN_INFO, tun, "tun_do_read\n");
@@ -1513,11 +1515,14 @@  static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 	if (!skb)
 		return err;
 
+	saved = *to;
 	ret = tun_put_user(tun, tfile, skb, to);
-	if (unlikely(ret < 0))
+	if (unlikely(ret < 0)) {
+		*to = saved;
 		kfree_skb(skb);
-	else
+	} else {
 		consume_skb(skb);
+	}
 
 	return ret;
 }
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f1adddc1c5ac..ee8d962373af 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3060,8 +3060,17 @@  struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned flags, int noblock,
 				  int *err);
 unsigned int datagram_poll(struct file *file, struct socket *sock,
 			   struct poll_table_struct *wait);
-int skb_copy_datagram_iter(const struct sk_buff *from, int offset,
+int __skb_copy_datagram_iter(const struct sk_buff *from, int offset,
 			   struct iov_iter *to, int size);
+static inline int skb_copy_datagram_iter(const struct sk_buff *from, int offset,
+					struct iov_iter *to, int size)
+{
+	struct iov_iter saved = *to;
+	int res = __skb_copy_datagram_iter(from, offset, to, size);
+	if (unlikely(res))
+		*to = saved;
+	return res;
+}
 static inline int skb_copy_datagram_msg(const struct sk_buff *from, int offset,
 					struct msghdr *msg, int size)
 {
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ea633342ab0d..33ff2046dda1 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -394,7 +394,7 @@  EXPORT_SYMBOL(skb_kill_datagram);
  *	@to: iovec iterator to copy to
  *	@len: amount of data to copy from buffer to iovec
  */
-int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
+int __skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 			   struct iov_iter *to, int len)
 {
 	int start = skb_headlen(skb);
@@ -445,7 +445,7 @@  int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 		if ((copy = end - offset) > 0) {
 			if (copy > len)
 				copy = len;
-			if (skb_copy_datagram_iter(frag_iter, offset - start,
+			if (__skb_copy_datagram_iter(frag_iter, offset - start,
 						   to, copy))
 				goto fault;
 			if ((len -= copy) == 0)
@@ -471,7 +471,7 @@  int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 
 	return 0;
 }
-EXPORT_SYMBOL(skb_copy_datagram_iter);
+EXPORT_SYMBOL(__skb_copy_datagram_iter);
 
 /**
  *	skb_copy_datagram_from_iter - Copy a datagram from an iov_iter.
@@ -750,14 +750,16 @@  int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
 {
 	__wsum csum;
 	int chunk = skb->len - hlen;
+	struct iov_iter saved;
 
 	if (!chunk)
 		return 0;
 
+	saved = msg->msg_iter;
 	if (msg_data_left(msg) < chunk) {
 		if (__skb_checksum_complete(skb))
 			goto csum_error;
-		if (skb_copy_datagram_msg(skb, hlen, msg, chunk))
+		if (__skb_copy_datagram_iter(skb, hlen, &msg->msg_iter, chunk))
 			goto fault;
 	} else {
 		csum = csum_partial(skb->data, hlen, skb->csum);
@@ -771,8 +773,10 @@  int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
 	}
 	return 0;
 csum_error:
+	msg->msg_iter = saved;
 	return -EINVAL;
 fault:
+	msg->msg_iter = saved;
 	return -EFAULT;
 }
 EXPORT_SYMBOL(skb_copy_and_csum_datagram_msg);