tcp: SO_TIMESTAMP implementation for TCP

Message ID alpine.DEB.1.00.1004292246440.12776@pokey.mtv.corp.google.com
State Rejected, archived
Delegated to: David Miller

Commit Message

Tom Herbert April 30, 2010, 6:07 a.m. UTC
Implement SO_TIMESTAMP{NS} for TCP.  When this socket option is enabled
on a TCP socket, a timestamp for received data can be returned in the
ancillary data of a recvmsg with control message type SCM_TIMESTAMP{NS}.
The timestamp chosen is that of the skb most recently received from
which data was copied.  This is useful in debugging and timing
network operations.
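
As an illustration only (a userspace sketch, not part of this patch),
an application would enable the option with setsockopt() and then pull
the timestamp out of the ancillary data returned by recvmsg(), just as
it would for a UDP socket today:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/uio.h>

/* Read from a connected TCP socket 'fd' and print the timestamp of the
 * newest skb that data was copied from, as reported via SCM_TIMESTAMP.
 */
static void recv_with_timestamp(int fd)
{
	char data[4096];
	char cbuf[CMSG_SPACE(sizeof(struct timeval))];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cmsg;
	int one = 1;
	ssize_t n;

	/* Enable receive timestamps; with this patch the option also
	 * takes effect on TCP (use SO_TIMESTAMPNS for a timespec).
	 */
	setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &one, sizeof(one));

	n = recvmsg(fd, &msg, 0);
	if (n <= 0)
		return;

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
	     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_TIMESTAMP) {
			struct timeval tv;

			memcpy(&tv, CMSG_DATA(cmsg), sizeof(tv));
			printf("%zd bytes, newest skb at %ld.%06ld\n",
			       n, (long)tv.tv_sec, (long)tv.tv_usec);
		}
	}
}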

Signed-off-by: Tom Herbert <therbert@google.com>
---

Comments

David Miller April 30, 2010, 6:39 a.m. UTC | #1
From: Tom Herbert <therbert@google.com>
Date: Thu, 29 Apr 2010 23:07:54 -0700 (PDT)

> Implement SO_TIMESTAMP{NS} for TCP.  When this socket option is enabled
> on a TCP socket, a timestamp for received data can be returned in the
> ancillary data of a recvmsg with control message type SCM_TIMESTAMP{NS}.
> The timestamp chosen is that of the skb most recently received from
> which data was copied.  This is useful in debugging and timing
> network operations.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>

That's not what you're implementing here.

You're only updating the socket timestamp if the SKB passed into
the update function has a more recent timestamp.

There is nothing that says the timestamps have to be increasing, and
with retransmits and such, if it were me I would want to see the real
timestamp even if it was earlier than the most recently reported one.
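
To illustrate what I mean (untested, just a sketch against your patch),
I would expect the helper to simply record the timestamp of the skb
that was actually copied from:

static inline void tcp_update_lrxcopytime(struct sock *sk,
					  const struct sk_buff *skb)
{
	/* Report the real timestamp of the skb we copied from, even if
	 * it is older than the one reported last time.
	 */
	tcp_sk(sk)->lrxcopytime = skb->tstamp;
}

rather than only moving the stamp forward.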

I don't know, I really don't like this feature at all.  SO_TIMESTAMP
is really meant for datagram-oriented sockets, where there is a
clearly defined "packet" whose timestamp you get.  A TCP receive can
involve hundreds of tiny packets, so the timestamp can lack any
meaning.

All these new checks and branches for a feature of questionable value.
If you can modify your apps to grab this information you can also probe
for the information using external probing tools.

Sorry, I don't think I'll be applying this.
David Miller April 30, 2010, 11:41 p.m. UTC | #2
From: Tom Herbert <therbert@google.com>
Date: Fri, 30 Apr 2010 00:58:32 -0700

>> All these new checks and branches for a feature of questionable value.
> 
>> If you can modify your apps to grab this information you can also probe
>> for the information using external probing tools.
>>
> I don't see a nice way to do that; we're profiling a significant
> percentage of millions of connections over thousands of paths as part
> of standard operations while incurring negligible overhead.  The app
> can easily timestamp its operations, but without some mechanism for
> getting timestamps out of a TCP connection, the networking portion of
> servicing requests is pretty much a black box in that respect.

If other people have an opinion about this, now would be the time
to speak up. :-)
Bill Fink May 1, 2010, 5:07 a.m. UTC | #3
On Fri, 30 Apr 2010, David Miller wrote:

> From: Tom Herbert <therbert@google.com>
> Date: Fri, 30 Apr 2010 00:58:32 -0700
> 
> >> All these new checks and branches for a feature of questionable value.
> > 
> >> If you can modify your apps to grab this information you can also probe
> >> for the information using external probing tools.
> >>
> > I don't see a nice way to do that; we're profiling a significant
> > percentage of millions of connections over thousands of paths as part
> > of standard operations while incurring negligible overhead.  The app
> > can easily timestamp its operations, but without some mechanism for
> > getting timestamps out of a TCP connection, the networking portion of
> > servicing requests is pretty much a black box in that respect.
> 
> If other people have an opinion about this, now would be the time
> to speak up. :-)

Not being a kernel hacker, I will naively ask if the kernel tracing
facility could somehow be used to provide the desired info (or could
be modified to provide it).

						-Bill
Tom Herbert May 1, 2010, 5:31 a.m. UTC | #4
>> I don't see a nice way to do that; we're profiling a significant
>> percentage of millions of connections over thousands of paths as part
>> of standard operations while incurring negligible overhead.  The app
>> can easily timestamp its operations, but without some mechanism for
>> getting timestamps out of a TCP connection, the networking portion of
>> servicing requests is pretty much a black box in that respect.
>
> If other people have an opinion about this, now would be the time
> to speak up. :-)
>
The use case that motivated this patch is really the same as that of
UDP, in that the application is receiving messages that it wants to
timestamp; in the case of TCP the application extracts the frames out
of the stream.  The lack of a timestamp to discern when a message was
received over TCP is readily apparent when designing a message-based
ULP that can dynamically select which protocol to run over.
Tom Herbert May 1, 2010, 5:40 a.m. UTC | #5
> Not being a kernel hacker, I will naively ask if the kernel tracing
> facility could somehow be used to provide the desired info (or could
> be modified to provide it).
>

We did consider kernel tracing (more in the context of implementing
RFC 4898).  For getting per-packet timestamps, though, the cost of
correlating a ktrace event with an application message is probably too
high to make it practical.  If it weren't for the cost of timestamping
every single skb being received, we'd probably have SO_TIMESTAMP turned
on permanently for many connections.  For now we're settling for
sampling a percentage of messages.
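
To give a rough idea (a sketch only, with made-up names, not our actual
code), the sampling amounts to enabling the option on only a fraction
of connections at any given time:

#include <stdlib.h>
#include <sys/socket.h>

/* Hypothetical helper: enable receive timestamps on roughly one
 * connection in 'rate' (rate must be > 0), approximating the sampling
 * described above.
 */
static void maybe_sample_timestamps(int fd, unsigned int rate)
{
	int on = (rand() % rate) == 0;

	if (on)
		setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));
}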

Tom
Eric Dumazet May 1, 2010, 6 a.m. UTC | #6
On Friday, April 30, 2010 at 22:40 -0700, Tom Herbert wrote:
> > Not being a kernel hacker, I will naively ask if the kernel tracing
> > facility could somehow be used to provide the desired info (or could
> > be modified to provide it).
> >
> 
> We did consider kernel tracing (more in the context of implementing
> RFC 4898).  For getting per-packet timestamps, though, the cost of
> correlating a ktrace event with an application message is probably too
> high to make it practical.  If it weren't for the cost of timestamping
> every single skb being received, we'd probably have SO_TIMESTAMP turned
> on permanently for many connections.  For now we're settling for
> sampling a percentage of messages.

Tom, did you try to reuse the existing skb or sk tstamps?



Paul LeoNerd Evans May 1, 2010, 12:06 p.m. UTC | #7
On Fri, Apr 30, 2010 at 04:41:15PM -0700, David Miller wrote:
> If other people have an opinion about this, now would be the time
> to speak up. :-)

I have to say I agree with David.

The "receive timestamp" for a TCP recv() call is completely meaningless.
Each byte in the stream arguably could have a set of receive timestamps,
being the timestamp of the underlying IPv4 packet containing a fragment
of a TCP segment that covered that byte. One recv() call could cover
many packets, or many recv() calls could be required to consume one packet.
We just don't know from userland.

The point about IPv4 fragments in UDP is a reasonable one: because of
IPv4 fragmentation there are still potentially multiple timestamps
that could be relevant to a single UDP recv() call. But no two recv()
calls can possibly relate to the same IPv4 fragments, so I feel this is
more defined. Plus, of all the IPv4 fragments that go into a single UDP
packet, one of them is special - the first one, the one containing the
UDP header. We could easily say "the timestamp of a UDP recv() call
shall be the time at which its header was received, even if other
fragments arrived before or after it". 

We cannot make any such distinction for some window in a TCP stream. All
TCP segments are indistinct in this manner.

Patch

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a778ee0..7dbb662 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -314,6 +314,7 @@  struct tcp_sock {
  	u32	snd_sml;	/* Last byte of the most recently transmitted small packet */
 	u32	rcv_tstamp;	/* timestamp of last received ACK (for keepalives) */
 	u32	lsndtime;	/* timestamp of last sent data packet (for restart window) */
+	ktime_t lrxcopytime;	/* timestamp of newest data in copy to user space */
 
 	/* Data for direct copy to user */
 	struct {
@@ -466,6 +467,15 @@  static inline struct tcp_sock *tcp_sk(const struct sock *sk)
 	return (struct tcp_sock *)sk;
 }
 
+static inline void tcp_update_lrxcopytime(struct sock *sk,
+					  const struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (unlikely(skb->tstamp.tv64 >  tp->lrxcopytime.tv64))
+		tp->lrxcopytime = skb->tstamp;
+}
+
 struct tcp_timewait_sock {
 	struct inet_timewait_sock tw_sk;
 	u32			  tw_rcv_nxt;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8ce2974..c7f107a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1381,6 +1381,27 @@  int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	return copied;
 }
 
+static inline void tcp_sock_recv_timestamp(struct msghdr *msg, struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (likely(!sock_flag(sk, SOCK_RCVTSTAMP)))
+		return;
+
+	if (msg->msg_controllen >= sizeof(struct timeval) &&
+	    tp->lrxcopytime.tv64) {
+		if (sock_flag(sk, SOCK_RCVTSTAMPNS)) {
+			struct timespec ts = ktime_to_timespec(tp->lrxcopytime);
+			put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPNS,
+				 sizeof(ts), &ts);
+		} else {
+			struct timeval tv = ktime_to_timeval(tp->lrxcopytime);
+			put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+			    sizeof(tv), &tv);
+		}
+	}
+}
+
 /*
  *	This routine copies from a sock struct into the user buffer.
  *
@@ -1414,6 +1435,7 @@  int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		goto out;
 
 	timeo = sock_rcvtimeo(sk, nonblock);
+	tp->lrxcopytime.tv64 = 0;
 
 	/* Urgent data needs to be handled specially. */
 	if (flags & MSG_OOB)
@@ -1691,6 +1713,8 @@  do_prequeue:
 					break;
 				}
 			}
+
+			tcp_update_lrxcopytime(sk, skb);
 		}
 
 		*seq += used;
@@ -1758,6 +1782,8 @@  skip_copy:
 	 * on connected socket. I was just happy when found this 8) --ANK
 	 */
 
+	tcp_sock_recv_timestamp(msg, sk);
+
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e82162c..b94ad16 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4397,6 +4397,7 @@  static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 				tp->copied_seq += chunk;
 				eaten = (chunk == skb->len && !th->fin);
 				tcp_rcv_space_adjust(sk);
+				tcp_update_lrxcopytime(sk, skb);
 			}
 			local_bh_disable();
 		}
@@ -5061,6 +5062,7 @@  static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
 		tp->ucopy.len -= chunk;
 		tp->copied_seq += chunk;
 		tcp_rcv_space_adjust(sk);
+		tcp_update_lrxcopytime(sk, skb);
 	}
 
 	local_bh_disable();
@@ -5120,6 +5122,7 @@  static int tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
 		tp->ucopy.len -= chunk;
 		tp->copied_seq += chunk;
 		tcp_rcv_space_adjust(sk);
+		tcp_update_lrxcopytime(sk, skb);
 
 		if ((tp->ucopy.len == 0) ||
 		    (tcp_flag_word(tcp_hdr(skb)) & TCP_FLAG_PSH) ||