diff mbox

[v3] xmit_compl_seq: information to reclaim vmsplice buffers

Message ID alpine.DEB.1.00.1009231426450.11579@pokey.mtv.corp.google.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Tom Herbert Sept. 23, 2010, 9:35 p.m. UTC
In this patch we propose to adds some socket API to retrieve the
 "transmit completion sequence number", essentially a byte counter
for the number of bytes that have been transmitted and will not be
retransmitted.  In the case of TCP, this should correspond to snd_una.

The purpose of this API is to provide information to userspace about
which buffers can be reclaimed when sending with vmsplice() on a
socket.

There are two methods for retrieving the completed sequence number:
through a simple getsockopt (implemented here for TCP), as well as
returning the value in the ancilary data of a recvmsg.

The expected flow would be something like:
   - Connect is created
   - Initial completion seq # is retrieved through the sockopt, and is
     stored in userspace "compl_seq" variable for the connection.
   - Whenever a send is done, compl_seq += # bytes sent.
   - When doing a vmsplice the completion sequence number is saved
     for each user space buffer, buffer_compl_seq = compl_seq.
   - When recvmsg returns with a completion sequence number in
     ancillary data, any buffers cover by that sequence number
     (where buffer_compl_seq < recvmsg_compl_seq) are reclaimed
     and can be written to again.
   - If no data is receieved on a connection (recvmsg does not
     return), a timeout can be used to call the getsockopt and
     reclaim buffers as a fallback.

Using recvmsg data in this manner is sort of a cheap way to get a
"callback" for when a vmspliced buffer is consumed.  It will work
well for a client where the response causes recvmsg to return.
On the server side it works well if there are a sufficient
number of requests coming on the connection (resorting to the
timeout if necessary as described above).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Sridhar Raman <sridharr@google.com>
---
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Sept. 24, 2010, 1:48 a.m. UTC | #1
Le jeudi 23 septembre 2010 à 14:35 -0700, Tom Herbert a écrit :

> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3e8a4db..3d8d33f 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1768,6 +1768,12 @@ skip_copy:
>  
>  	TCP_CHECK_TIMER(sk);
>  	release_sock(sk);
> +
> +	/* Copy the first unacked seq into the receive msg control part. */
> +	if (sock_flag(sk, SOCK_XMIT_COMPL_SEQ))
> +		put_cmsg(msg, SOL_SOCKET, SCM_XMIT_COMPL_SEQ,
> +		    sizeof(tp->snd_una), &tp->snd_una);
> +
>  	return copied;
>  
>  out:

Tom,

I took the time to suggest : copy tp->snd_una before release_sock().

Why do you try to avoid this copy and introduce a bug ?

Hint : put_cmsg() (and copy_to_user()) makes no guarantee the copy is
atomic. 

It can be implemented by a 4 bytes copy, faut during this copy or being
interrupted, and application might get a garbled value, if tcp stack
modifies tp->snd_una during the copy.

If you want to optimize this (because release_sock() can process the
socket backlog, and update snd_una), please use :


	/* Copy the first unacked seq into the receive msg control part.
	 * As put_cmsg() might sleep, we must copy snd_una in a private var.
	 * Integer reads are atomic on all arches, so this copy can be
	 * performed while socket is unlocked.
	 */
	if (sock_flag(sk, SOCK_XMIT_COMPL_SEQ)) {
		u32 snd_una = tp->snd_una;

		put_cmsg(msg, SOL_SOCKET, SCM_XMIT_COMPL_SEQ,
			 sizeof(snd_una), &snd_una);
	}

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Sept. 27, 2010, 5:23 p.m. UTC | #2
On Thu, Sep 23, 2010 at 02:35:16PM -0700, Tom Herbert wrote:
> In this patch we propose to adds some socket API to retrieve the
>  "transmit completion sequence number", essentially a byte counter
> for the number of bytes that have been transmitted and will not be
> retransmitted.  In the case of TCP, this should correspond to snd_una.
> 
> The purpose of this API is to provide information to userspace about
> which buffers can be reclaimed when sending with vmsplice() on a
> socket.
> 
> There are two methods for retrieving the completed sequence number:
> through a simple getsockopt (implemented here for TCP), as well as
> returning the value in the ancilary data of a recvmsg.
> 
> The expected flow would be something like:
>    - Connect is created
>    - Initial completion seq # is retrieved through the sockopt, and is
>      stored in userspace "compl_seq" variable for the connection.
>    - Whenever a send is done, compl_seq += # bytes sent.
>    - When doing a vmsplice the completion sequence number is saved
>      for each user space buffer, buffer_compl_seq = compl_seq.
>    - When recvmsg returns with a completion sequence number in
>      ancillary data, any buffers cover by that sequence number
>      (where buffer_compl_seq < recvmsg_compl_seq) are reclaimed
>      and can be written to again.
>    - If no data is receieved on a connection (recvmsg does not
>      return), a timeout can be used to call the getsockopt and
>      reclaim buffers as a fallback.
> 
> Using recvmsg data in this manner is sort of a cheap way to get a
> "callback" for when a vmspliced buffer is consumed.  It will work
> well for a client where the response causes recvmsg to return.
> On the server side it works well if there are a sufficient
> number of requests coming on the connection (resorting to the
> timeout if necessary as described above).
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> Signed-off-by: Sridhar Raman <sridharr@google.com>

Can not packets referencing this memory
still be outstanding at the NIC device, if retransmit happens
before the ack but after the packet was passed to a device?

It's true that the reftransmit will likely get discarded
by the remote end, but this might be a security issue
if an application puts sensitive data in the buffer
and that gets inadvertently sent on the wire, can it not?
Tom Herbert Sept. 27, 2010, 6:38 p.m. UTC | #3
>
> Can not packets referencing this memory
> still be outstanding at the NIC device, if retransmit happens
> before the ack but after the packet was passed to a device?
>
> It's true that the reftransmit will likely get discarded
> by the remote end, but this might be a security issue
> if an application puts sensitive data in the buffer
> and that gets inadvertently sent on the wire, can it not?
>
Yes, this hole probably does exist.  I don't know how to fix it other
than more API that specifically reports when all references to a
buffer are released.  In the case of TCP that probably means we'd need
a destructor.

> --
> MST
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 27, 2010, 7:12 p.m. UTC | #4
Le lundi 27 septembre 2010 à 11:38 -0700, Tom Herbert a écrit :
> >
> > Can not packets referencing this memory
> > still be outstanding at the NIC device, if retransmit happens
> > before the ack but after the packet was passed to a device?
> >
> > It's true that the reftransmit will likely get discarded
> > by the remote end, but this might be a security issue
> > if an application puts sensitive data in the buffer
> > and that gets inadvertently sent on the wire, can it not?
> >
> Yes, this hole probably does exist.  I don't know how to fix it other
> than more API that specifically reports when all references to a
> buffer are released.  In the case of TCP that probably means we'd need
> a destructor.

Hmm, thats a serious problem IMHO. This would need destructor on data,
not on skb (see previous discussion about early skb_orphan and
af_packet)

Then, it needs to handle an ordered list of packets in flight, to be
able to return the sequence of the first packet.

Alternative would be to copy data on retransmits, for tcp sockets using
SOCK_XMIT_COMPL_SEQ. (ie not using skb_clone but skb_copy())




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert Sept. 27, 2010, 9:49 p.m. UTC | #5
> Alternative would be to copy data on retransmits, for tcp sockets using
> SOCK_XMIT_COMPL_SEQ. (ie not using skb_clone but skb_copy())
>

That's coming awfully close to COW! ;-)

>
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/alpha/include/asm/socket.h b/arch/alpha/include/asm/socket.h
index 06edfef..3587082 100644
--- a/arch/alpha/include/asm/socket.h
+++ b/arch/alpha/include/asm/socket.h
@@ -69,6 +69,9 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
+
 /* O_NONBLOCK clashes with the bits used for socket types.  Therefore we
  * have to define SOCK_NONBLOCK to a different value here.
  */
diff --git a/arch/arm/include/asm/socket.h b/arch/arm/include/asm/socket.h
index 90ffd04..962c365 100644
--- a/arch/arm/include/asm/socket.h
+++ b/arch/arm/include/asm/socket.h
@@ -62,4 +62,6 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/avr32/include/asm/socket.h b/arch/avr32/include/asm/socket.h
index c8d1fae..de59dfe 100644
--- a/arch/avr32/include/asm/socket.h
+++ b/arch/avr32/include/asm/socket.h
@@ -62,4 +62,6 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* __ASM_AVR32_SOCKET_H */
diff --git a/arch/cris/include/asm/socket.h b/arch/cris/include/asm/socket.h
index 1a4a619..fe9a1ba 100644
--- a/arch/cris/include/asm/socket.h
+++ b/arch/cris/include/asm/socket.h
@@ -64,6 +64,8 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* _ASM_SOCKET_H */
 
 
diff --git a/arch/frv/include/asm/socket.h b/arch/frv/include/asm/socket.h
index a6b2688..769f1f5 100644
--- a/arch/frv/include/asm/socket.h
+++ b/arch/frv/include/asm/socket.h
@@ -62,5 +62,7 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/h8300/include/asm/socket.h b/arch/h8300/include/asm/socket.h
index 04c0f45..5ef4551 100644
--- a/arch/h8300/include/asm/socket.h
+++ b/arch/h8300/include/asm/socket.h
@@ -62,4 +62,6 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/ia64/include/asm/socket.h b/arch/ia64/include/asm/socket.h
index 51427ea..217e4d8 100644
--- a/arch/ia64/include/asm/socket.h
+++ b/arch/ia64/include/asm/socket.h
@@ -71,4 +71,6 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/asm/socket.h b/arch/m32r/include/asm/socket.h
index 469787c..e0830c6 100644
--- a/arch/m32r/include/asm/socket.h
+++ b/arch/m32r/include/asm/socket.h
@@ -62,4 +62,6 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/m68k/include/asm/socket.h b/arch/m68k/include/asm/socket.h
index 9bf49c8..438cb8e 100644
--- a/arch/m68k/include/asm/socket.h
+++ b/arch/m68k/include/asm/socket.h
@@ -62,4 +62,6 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/mips/include/asm/socket.h b/arch/mips/include/asm/socket.h
index 9de5190..cb927d7 100644
--- a/arch/mips/include/asm/socket.h
+++ b/arch/mips/include/asm/socket.h
@@ -82,6 +82,9 @@  To add: #define SO_REUSEPORT 0x0200	/* Allow local address and port reuse.  */
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
+
 #ifdef __KERNEL__
 
 /** sock_type - Socket types
diff --git a/arch/mn10300/include/asm/socket.h b/arch/mn10300/include/asm/socket.h
index 4e60c42..1c09487 100644
--- a/arch/mn10300/include/asm/socket.h
+++ b/arch/mn10300/include/asm/socket.h
@@ -62,4 +62,6 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/asm/socket.h b/arch/parisc/include/asm/socket.h
index 225b7d6..c1b0479 100644
--- a/arch/parisc/include/asm/socket.h
+++ b/arch/parisc/include/asm/socket.h
@@ -61,6 +61,9 @@ 
 
 #define SO_RXQ_OVFL             0x4021
 
+#define SO_XMIT_COMPL_SEQ	0x4022
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
+
 /* O_NONBLOCK clashes with the bits used for socket types.  Therefore we
  * have to define SOCK_NONBLOCK to a different value here.
  */
diff --git a/arch/powerpc/include/asm/socket.h b/arch/powerpc/include/asm/socket.h
index 866f760..9452547 100644
--- a/arch/powerpc/include/asm/socket.h
+++ b/arch/powerpc/include/asm/socket.h
@@ -69,4 +69,6 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif	/* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/asm/socket.h b/arch/s390/include/asm/socket.h
index fdff1e9..3e3c17b 100644
--- a/arch/s390/include/asm/socket.h
+++ b/arch/s390/include/asm/socket.h
@@ -70,4 +70,6 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/asm/socket.h b/arch/sparc/include/asm/socket.h
index 9d3fefc..0675f54 100644
--- a/arch/sparc/include/asm/socket.h
+++ b/arch/sparc/include/asm/socket.h
@@ -58,6 +58,9 @@ 
 
 #define SO_RXQ_OVFL             0x0024
 
+#define SO_XMIT_COMPL_SEQ	0x0025
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/asm/socket.h b/arch/xtensa/include/asm/socket.h
index cbdf2ff..3c5afbf 100644
--- a/arch/xtensa/include/asm/socket.h
+++ b/arch/xtensa/include/asm/socket.h
@@ -73,4 +73,6 @@ 
 
 #define SO_RXQ_OVFL             40
 
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 9a6115e..6dc1ed8 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -64,4 +64,7 @@ 
 #define SO_DOMAIN		39
 
 #define SO_RXQ_OVFL             40
+
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index e64f4c6..f044aff 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -106,6 +106,7 @@  enum {
 #define TCP_THIN_LINEAR_TIMEOUTS 16      /* Use linear timeouts for thin streams*/
 #define TCP_THIN_DUPACK         17      /* Fast retrans. after 1 dupack */
 #define TCP_USER_TIMEOUT	18	/* How long for loss retry before timeout */
+#define TCP_XMIT_COMPL_SEQ	19	/* Return current snd_una */
 
 /* for TCP_INFO socket option */
 #define TCPI_OPT_TIMESTAMPS	1
diff --git a/include/net/sock.h b/include/net/sock.h
index 8ae97c4..e820e2b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -543,6 +543,7 @@  enum sock_flags {
 	SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
 	SOCK_FASYNC, /* fasync() active */
 	SOCK_RXQ_OVFL,
+	SOCK_XMIT_COMPL_SEQ, /* SO_XMIT_COMPL_SEQ setting */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
diff --git a/net/core/sock.c b/net/core/sock.c
index f3a06c4..7a10215 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -740,6 +740,12 @@  set_rcvbuf:
 		else
 			sock_reset_flag(sk, SOCK_RXQ_OVFL);
 		break;
+	case SO_XMIT_COMPL_SEQ:
+		if (valbool)
+			sock_set_flag(sk, SOCK_XMIT_COMPL_SEQ);
+		else
+			sock_reset_flag(sk, SOCK_XMIT_COMPL_SEQ);
+		break;
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -961,6 +967,10 @@  int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val = !!sock_flag(sk, SOCK_RXQ_OVFL);
 		break;
 
+	case SO_XMIT_COMPL_SEQ:
+		v.val = !!sock_flag(sk, SOCK_XMIT_COMPL_SEQ);
+		break;
+
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3e8a4db..3d8d33f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1768,6 +1768,12 @@  skip_copy:
 
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
+
+	/* Copy the first unacked seq into the receive msg control part. */
+	if (sock_flag(sk, SOCK_XMIT_COMPL_SEQ))
+		put_cmsg(msg, SOL_SOCKET, SCM_XMIT_COMPL_SEQ,
+		    sizeof(tp->snd_una), &tp->snd_una);
+
 	return copied;
 
 out:
@@ -2617,6 +2623,9 @@  static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_USER_TIMEOUT:
 		val = jiffies_to_msecs(icsk->icsk_user_timeout);
 		break;
+	case TCP_XMIT_COMPL_SEQ:
+		val = tp->snd_una;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}