
resurrecting tcphealth

Message ID 87741204cd72d363d54dadf9a94bb6fe.squirrel@webmail.univie.ac.at
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Piotr Sawuk July 16, 2012, 11:33 a.m. UTC
On Sa, 14.07.2012, 01:55, Stephen Hemminger wrote:
> I am not sure if this is really necessary since most
> of the stats are available elsewhere.

if by "most" you mean address and port then you're right.
but even the rtt reported by "ss -i" seems to differ from tcphealth.

however, if instead by "elsewhere" you mean "on the server"...
>

>>+		seq_printf(seq,
>>+		"TCP Health Monitoring (established connections only)\n"
>>+		" -Duplicate ACKs indicate lost or reordered packets on the
>>connection.\n"
>>+		" -Duplicate Packets Received signal a slow and badly inefficient
>>connection.\n"
>>+		" -RttEst estimates how long future packets will take on a round trip
>>over the connection.\n"
>>+		"id   Local Address        Remote Address       RttEst(ms) AcksSent "
>
> Header seems excessive, just put one line of header please.

I guess the header was sort of documentation for this patch.
I've put it into Kconfig instead.
>
>
>>+		"DupAcksSent PktsRecv DupPktsRecv\n");
>>+		goto out;
>>+	}
>>+
>>+	/* Loop through established TCP connections */
>>+	st = seq->private;
>>+
>>+
>>+	if (st->state == TCP_SEQ_STATE_ESTABLISHED)
>>+	{
>>+/*	; //insert read-lock here */
>
> Don't think you need read-lock

you mean I won't get a segfault reading a tcp_sock that's gone?
>

> Kernel has %pI4 to print IP addresses.

thanks, I didn't know.
>

>>+		seq_printf(seq, "%*s\n", LINESZ - 1 - len, "");
>
> This padding of line is bogus, just print variable length line.
> Are you trying to make it fixed length record file?

I guess so, /proc/net/tcp is doing the same.
won't question the authors of that user interface.

OK, new version, this time with Kconfig changed:



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet July 16, 2012, 11:46 a.m. UTC | #1
On Mon, 2012-07-16 at 13:33 +0200, Piotr Sawuk wrote:
> On Sa, 14.07.2012, 01:55, Stephen Hemminger wrote:
> > I am not sure if this is really necessary since most
> > of the stats are available elsewhere.
> 
> if by "most" you mean address and port then you're right.
> but even the rtt reported by "ss -i" seems to differ from tcphealth.

That's because tcphealth is wrong; it assumes HZ=1000?

tp->srtt unit is jiffies, not ms.

tcphealth is a gross hack.


Piotr Sawuk July 16, 2012, 1:03 p.m. UTC | #2
On Mo, 16.07.2012, 13:46, Eric Dumazet wrote:
> On Mon, 2012-07-16 at 13:33 +0200, Piotr Sawuk wrote:
>> On Sa, 14.07.2012, 01:55, Stephen Hemminger wrote:
>> > I am not sure if this is really necessary since most
>> > of the stats are available elsewhere.
>>
>> if by "most" you mean address and port then you're right.
>> but even the rtt reported by "ss -i" seems to differ from tcphealth.
>
> That's because tcphealth is wrong; it assumes HZ=1000?
>
> tp->srtt unit is jiffies, not ms.

thanks. are there any conversion functions in the kernel for that?
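[Editor's note: the helper in question is the kernel's jiffies_to_msecs(). A minimal userspace sketch of the arithmetic it performs follows; HZ=250 is an assumed example value, not taken from the thread, and it relies on tp->srtt being stored scaled by 8 (left-shifted by 3).]

```c
#include <assert.h>

/* Assumed example clock rate; real kernels configure HZ at build time. */
#define HZ 250

/* tp->srtt holds the smoothed RTT in jiffies, scaled by 8 (<< 3). */
static unsigned int srtt_field_to_ms(unsigned int srtt_field)
{
	unsigned int rtt_jiffies = srtt_field >> 3;	/* undo the 8x scaling */
	return (rtt_jiffies * 1000u) / HZ;		/* jiffies -> milliseconds */
}
```

So with HZ=250, a raw srtt field of 800 (100 jiffies) is 400 ms, not the 100 ms a HZ=1000 assumption would suggest.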
>
> tcphealth is a gross hack.

what would you do if you tried making it less gross?

I've not found any similar functionality in the kernel.
I want to know an estimate for the percentage of data lost in tcp,
and I want to know that without actually sending many packets.
after all I'm on the receiving end most of the time.
the percentage of duplicate packets received is nice too.
you have any suggestions?

Yuchung Cheng July 20, 2012, 2:06 p.m. UTC | #3
On Mon, Jul 16, 2012 at 6:03 AM, Piotr Sawuk <a9702387@unet.univie.ac.at> wrote:
> On Mo, 16.07.2012, 13:46, Eric Dumazet wrote:
>> On Mon, 2012-07-16 at 13:33 +0200, Piotr Sawuk wrote:
>>> On Sa, 14.07.2012, 01:55, Stephen Hemminger wrote:
>>> > I am not sure if this is really necessary since most
>>> > of the stats are available elsewhere.
>>>
>>> if by "most" you mean address and port then you're right.
>>> but even the rtt reported by "ss -i" seems to differ from tcphealth.
>>
>> That's because tcphealth is wrong; it assumes HZ=1000?
>>
>> tp->srtt unit is jiffies, not ms.
>
> thanks. any conversion-functions in the kernel for that?
>>
>> tcphealth is a gross hack.
>
> what would you do if you tried making it less gross?
>
> I've not found any similar functionality, in the kernel.
> I want to know an estimate for the percentage of data lost in tcp.
> and I want to know that without actually sending much packets.
> afterall I'm on the receiving end most of the time.
> percentage of duplicate packets received is nice too.
> you have any suggestions?

counting dupacks may not be as reliable as you'd like.
say the remote sends you 100 packets and only the first one is lost:
you'll see 99 dupacks. Moreover any small-degree reordering (<3)
will generate substantial dupacks even though the network is perfectly fine
(see Craig Partridge's "reordering is not pathological" paper).
unfortunately the receiver can't, and does not have to, distinguish loss
from reordering. you can infer that, but it should not be the kernel's job.
there are public tools that inspect tcpdump traces to do that.
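[Editor's note: the 100-packet example above can be sketched in a few lines. This is a toy illustration, not code from the thread: the receiver keeps advertising the sequence number of the hole, so every segment arriving after the single loss triggers a duplicate ACK.]

```c
#include <assert.h>

/* Toy receiver model: n segments are sent, segment 1 is lost, segments
 * 2..n arrive out of order. Every ACK advertises rcv_nxt; an ACK that
 * repeats the previously advertised value counts as a duplicate. */
static int dupacks_after_first_loss(int n)
{
	int rcv_nxt = 1;	/* segment 1 is the hole, so rcv_nxt stays at 1 */
	int last_ack = 1;	/* the data before the gap was already acked */
	int dupacks = 0;

	for (int seg = 2; seg <= n; seg++) {	/* each arrival forces an ACK */
		if (rcv_nxt == last_ack)
			dupacks++;
		last_ack = rcv_nxt;
	}
	return dupacks;
}
```

One lost segment out of 100 thus yields 99 duplicate ACKs: a 1% loss rate shows up as a 99% dupack rate, which is why the counter alone is a poor loss estimator.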

exposing duplicate packets received can be done via getsockopt(TCP_INFO)
although I don't know what that gives you. the remote can be too
aggressive in retransmission (not just because of a bad RTO) or
the network RTT fluctuates.

I don't know if tracking last_ack_sent (the latest RCV.NXT) without
knowing the ISN is useful.

btw the cited term-project paper's conclusion that SACK is not useful is
simply wrong. This makes me suspicious about how rigorous and thoughtful
its design is.

Piotr Sawuk July 21, 2012, 10:34 a.m. UTC | #4
On Fr, 20.07.2012, 16:06, Yuchung Cheng wrote:
> On Mon, Jul 16, 2012 at 6:03 AM, Piotr Sawuk <a9702387@unet.univie.ac.at>
> wrote:
>> On Mo, 16.07.2012, 13:46, Eric Dumazet wrote:
>>> On Mon, 2012-07-16 at 13:33 +0200, Piotr Sawuk wrote:
>>>> On Sa, 14.07.2012, 01:55, Stephen Hemminger wrote:
>>>> > I am not sure if this is really necessary since most
>>>> > of the stats are available elsewhere.
>>>>
>>>> if by "most" you mean address and port then you're right.
>>>> but even the rtt reported by "ss -i" seems to differ from tcphealth.
>>>
>>> That's because tcphealth is wrong; it assumes HZ=1000?
>>>
>>> tp->srtt unit is jiffies, not ms.
>>
>> thanks. any conversion-functions in the kernel for that?
>>>
>>> tcphealth is a gross hack.
>>
>> what would you do if you tried making it less gross?
>>
>> I've not found any similar functionality, in the kernel.
>> I want to know an estimate for the percentage of data lost in tcp.
>> and I want to know that without actually sending much packets.
>> afterall I'm on the receiving end most of the time.
>> percentage of duplicate packets received is nice too.
>> you have any suggestions?
>
> counting dupacks may not be as reliable as you'd like.
> say the remote sends you 100 packets and only the first one is lost:
> you'll see 99 dupacks. Moreover any small-degree reordering (<3)
> will generate substantial dupacks even though the network is perfectly fine

I understand that.
but you must consider the difference between network health and tcp health.
network health on my end I can see by looking at WLAN signal strength.
slow downloads can have many causes; a loose cable is only one possibility.

for example I once played a LAN game, 2 computers connected directly.
however, one computer was 10 times slower than the other.
so when the slow computer would act as server, the game would never start.
the reason wasn't a bad connection, it was packet loss caused by slowness.
and it had to do with the protocol being used (i.e. not tcp).

so when in tcp I get a high percentage of dupacks I see something's wrong.
not necessarily with the physical connection, but with protocol handling.
the paper showed dupacks indicate spikes in network usage.
and as we all know the bottleneck isn't the cable, it's data processing.
when there is a spike, lots of users connecting, the network itself is ok.
however, reordering and lost packets indicate something's up with the server.
and that's actually the info I'm after.

if I were the net admin I would be interested in network health too.
a bad connection indicated by packet loss means I've got to check cables.
but a user might have a much wider area of interest.
the user can't do anything about the cables, but is still interested in them.
i.e. useless info for the net admin could be interesting for the user.
that's why I do not recommend tcphealth for servers: useless overhead.

so, if you want to judge the usefulness of this patch, consider the situation:
you are powerless but interested in the responsiveness of thousands of servers.
you want to learn how those servers behave at different times of day.
aren't dupacks and duplicate packets the best info on that you can possibly get?
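[Editor's note: to make that use case concrete, here is a sketch, not from the thread, of computing a dupack percentage from one /proc/net/tcphealth record. It assumes the patch is applied so the file exists, that address and port print as a single addr:port token, and that fields follow the header order id, local, remote, RttEst, AcksSent, DupAcksSent, PktsRecv, DupPktsRecv; the addresses below are hypothetical.]

```c
#include <assert.h>
#include <stdio.h>

/* Parse one assumed /proc/net/tcphealth record and return the percentage
 * of ACKs sent that were duplicates, or -1.0 if the line doesn't parse. */
static double dup_ack_percent(const char *line)
{
	int id;
	char local[32], remote[32];
	unsigned int rtt, acks, dupacks, pkts, duppkts;

	if (sscanf(line, "%d: %31s %31s %u %u %u %u %u",
		   &id, local, remote, &rtt, &acks, &dupacks, &pkts, &duppkts) != 8)
		return -1.0;
	return acks ? 100.0 * dupacks / acks : 0.0;	/* dupacks as % of acks */
}
```

A record showing 10 duplicate ACKs out of 200 sent would report 5%, which is the kind of per-server, per-time-of-day signal described above.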

> (see Craig Partridge's "reordering is not pathological" paper).

thanks, will look into that.

> unfortunately receiver can't and does not have to distinguish loss

true, not needed for the protocol.
on a higher level it still can be interesting though.
most of the work for preventing packet loss must be done by the sender.
but as I said before, the receiver can do something too: avoid traffic jams!
in a network many things are predictable and can be planned around.
this way a network could become more efficient as a whole.
that's what piques my interest in tcphealth: thinking more globally.

>  or reordering. you can infer that but it should not be kernel's job.

that's why I made it an option, as opposed to what the original author did.
theoretically it should be possible to get the same functionality without it:
just read the raw network data and emulate the work of tcp and tcphealth.
but that would definitely add a big overhead, as tcp handling is duplicated.

> there are public tools that inspect tcpdump traces to do that

good example. so to figure out dupacks you filter out the acks.
and you must somehow compare them, or you parse them the way the kernel does.
in the latter case you'll have to recreate the kernel's internal data.
definitely faster, but it could result in duplicate code, which requires space.

also you should consider that not all users have privileges for tcpdump.
and if they had, it would add another security risk to their computer.
and you'd have to consider multiple users on one computer using that service.
I can imagine a daemon in the background doing what tcphealth does.
that's the alternative; it allows for more fine-grained security.
it could disallow spying on what connections other users have.
(of course then you'd need to remove the /proc/net/tcp output too.)
but imagine the nightmare of keeping that daemon secure.
after all it must be privileged to read all network data.

if the kernel provided what I'm looking for, this daemon could still run.
but then it wouldn't need those risky privileges and could focus on other stuff.
the task of preventing users from seeing each other's connections is enough...
>
> exposing duplicate packets received can be done via getsockopt(TCP_INFO)
> although I don't know what that gives you. the remote can be too
> aggressive in retransmission (not just because of a bad RTO) or
> the network RTT fluctuates.

TCP_INFO contains only duplicate packets *sent* (retransmits), not received!
am I missing something? can you give a code example that can obtain such info?
if running that code in userspace results in the same values as tcphealth...
well, actually dupacks are more interesting than duplicate packets.
after all in usual usage the latter will always be zero.
>
> I don't what if tracking last_ack_sent (the latest RCV.NXT) without
> knowing the ISN is useful.

so you suggest I should store and compare the ISN too, for accuracy?
you think the gain in accuracy justifies the added overhead?
>
> btw the cited term-project paper's conclusion that SACK is not useful is
> simply wrong. This makes me suspicious about how rigorous and thoughtful
> its design is.

it isn't my paper.
but I'd guess the usefulness of SACK is only doubted from the users' pov.
remember, users connect to many servers.
if a server behaves badly, choose another one.
servers do not have such a choice; for them SACK naturally is important.

a server would just need to look at TCP_INFO to see how useful SACK is.
so I would conclude the author was quite thoughtful about the users' pov.
(and quite ignorant about the servers.)

no matter how little knowledge the authors have, tcphealth is interesting.
maybe it was a random discovery by sheer luck.
the correlation between the data it provides and reality is compelling.
if we'd judge inventions by the stupidity of their inventors...


Patch

diff -rub A/include/linux/tcp.h B/include/linux/tcp.h
--- A/include/linux/tcp.h	2012-07-08 02:23:56.000000000 +0200
+++ B/include/linux/tcp.h	2012-07-16 09:04:54.000000000 +0200
@@ -492,6 +492,17 @@ 
 	 * contains related tcp_cookie_transactions fields.
 	 */
 	struct tcp_cookie_values  *cookie_values;
+
+#ifdef CONFIG_TCPHEALTH
+	/*
+	 * TCP health monitoring counters.
+	 */
+	__u32	dup_acks_sent;
+	__u32	dup_pkts_recv;
+	__u32	acks_sent;
+	__u32	pkts_recv;
+	__u32	last_ack_sent;	/* Sequence number of the last ack we sent. */
+#endif
 };

 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
diff -rub A/net/ipv4/Kconfig B/net/ipv4/Kconfig
--- A/net/ipv4/Kconfig	2012-07-08 02:23:56.000000000 +0200
+++ B/net/ipv4/Kconfig	2012-07-16 11:56:15.000000000 +0200
@@ -619,6 +619,28 @@ 
 	default "reno" if DEFAULT_RENO
 	default "cubic"

+config TCPHEALTH
+	bool "TCP: client-side health-statistics (/proc/net/tcphealth)"
+	default n
+	---help---
+	IPv4 TCP Health Monitoring (established connections only):
+	 -Duplicate ACKs indicate there could be lost or reordered packets
+	  on the connection.
+	 -Duplicate Packets Received signal a slow and badly inefficient
+	  connection.
+	 -RttEst estimates how long future packets will take on a round trip
+	  over the connection.
+
+	Additionally you get the total number of ACKs sent and packets received.
+	All these values are displayed separately for each connection.
+	If you are running a dedicated server you won't need this:
+	Duplicate ACKs refers only to those sent upon receiving a packet,
+	and a server most likely doesn't receive many packets to count.
+	Hence for a server these statistics won't be meaningful,
+	especially since they are split into individual connections.
+
+	If you plan to investigate why some download is slow, say Y.
+
 config TCP_MD5SIG
 	bool "TCP: MD5 Signature Option support (RFC2385) (EXPERIMENTAL)"
 	depends on EXPERIMENTAL
diff -rub A/net/ipv4/tcp_input.c B/net/ipv4/tcp_input.c
--- A/net/ipv4/tcp_input.c	2012-07-08 02:23:56.000000000 +0200
+++ B/net/ipv4/tcp_input.c	2012-07-16 09:28:23.000000000 +0200
@@ -4492,6 +4492,11 @@ 
 		}

 		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
+#ifdef CONFIG_TCPHEALTH
+			/* A coarse timeout caused a retransmit inefficiency:
+			 * this packet has been received twice. */
+			tp->dup_pkts_recv++;
+#endif
 			SOCK_DEBUG(sk, "ofo packet was already received\n");
 			__skb_unlink(skb, &tp->out_of_order_queue);
 			__kfree_skb(skb);
@@ -4824,6 +4829,12 @@ 
 		return;
 	}

+#ifdef CONFIG_TCPHEALTH
+	/* A packet is a "duplicate" if it contains bytes we have
+	 * already received. */
+	if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
+		tp->dup_pkts_recv++;
+#endif
+
 	if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
 		/* A retransmit, 2nd most common case.  Force an immediate ack. */
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
@@ -5535,6 +5546,12 @@ 

 	tp->rx_opt.saw_tstamp = 0;

+#ifdef CONFIG_TCPHEALTH
+	/*
+	 *	total per-connection packet arrivals.
+	 */
+	tp->pkts_recv++;
+#endif
 	/*	pred_flags is 0xS?10 << 16 + snd_wnd
 	 *	if header_prediction is to be made
 	 *	'S' will always be tp->tcp_header_len >> 2
diff -rub A/net/ipv4/tcp_ipv4.c B/net/ipv4/tcp_ipv4.c
--- A/net/ipv4/tcp_ipv4.c	2012-07-08 02:23:56.000000000 +0200
+++ B/net/ipv4/tcp_ipv4.c	2012-07-16 10:12:48.000000000 +0200
@@ -2500,6 +2500,57 @@ 
 	return 0;
 }

+#ifdef CONFIG_TCPHEALTH
+/*
+ *	Output /proc/net/tcphealth
+ */
+#define LINESZ 128
+
+int tcp_health_seq_show(struct seq_file *seq, void *v)
+{
+	int len, num;
+	struct tcp_iter_state *st;
+
+	if (v == SEQ_START_TOKEN) {
+		seq_printf(seq,
+		"id   Local Address        Remote Address       RttEst(ms) AcksSent "
+		"DupAcksSent PktsRecv DupPktsRecv\n");
+		goto out;
+	}
+
+	/* Loop through established TCP connections */
+	st = seq->private;
+
+	if (st->state == TCP_SEQ_STATE_ESTABLISHED) {
+		const struct tcp_sock *tp = tcp_sk(v);
+		const struct inet_sock *inet = inet_sk(v);
+
+		seq_printf(seq, "%d: %-21pI4:%u %-21pI4:%u "
+				"%8u %8u %8u %8u %8u%n",
+				st->num,
+				&inet->inet_rcv_saddr,
+				ntohs(inet->inet_sport),
+				&inet->inet_daddr,
+				ntohs(inet->inet_dport),
+				jiffies_to_msecs(tp->srtt >> 3),
+				tp->acks_sent,
+				tp->dup_acks_sent,
+				tp->pkts_recv,
+				tp->dup_pkts_recv,
+				&len);
+
+		seq_printf(seq, "%*s\n", LINESZ - 1 - len, "");
+	}
+
+out:
+	return 0;
+}
+#endif /* CONFIG_TCPHEALTH */
+
 static const struct file_operations tcp_afinfo_seq_fops = {
 	.owner   = THIS_MODULE,
 	.open    = tcp_seq_open,
@@ -2508,6 +2559,17 @@ 
 	.release = seq_release_net
 };

+#ifdef CONFIG_TCPHEALTH
+static struct tcp_seq_afinfo tcphealth_seq_afinfo = {
+	.name		= "tcphealth",
+	.family		= AF_INET,
+	.seq_fops	= &tcp_afinfo_seq_fops,
+	.seq_ops	= {
+		.show		= tcp_health_seq_show,
+	},
+};
+#endif
+
 static struct tcp_seq_afinfo tcp4_seq_afinfo = {
 	.name		= "tcp",
 	.family		= AF_INET,
@@ -2519,12 +2581,20 @@ 

 static int __net_init tcp4_proc_init_net(struct net *net)
 {
-	return tcp_proc_register(net, &tcp4_seq_afinfo);
+	int ret = tcp_proc_register(net, &tcp4_seq_afinfo);
+#ifdef CONFIG_TCPHEALTH
+	if (ret == 0)
+		ret = tcp_proc_register(net, &tcphealth_seq_afinfo);
+#endif
+	return ret;
 }

 static void __net_exit tcp4_proc_exit_net(struct net *net)
 {
 	tcp_proc_unregister(net, &tcp4_seq_afinfo);
+#ifdef CONFIG_TCPHEALTH
+	tcp_proc_unregister(net, &tcphealth_seq_afinfo);
+#endif
 }

 static struct pernet_operations tcp4_net_ops = {
diff -rub A/net/ipv4/tcp_output.c B/net/ipv4/tcp_output.c
--- A/net/ipv4/tcp_output.c	2012-07-08 02:23:56.000000000 +0200
+++ B/net/ipv4/tcp_output.c	2012-07-16 09:44:02.000000000 +0200
@@ -2772,8 +2772,19 @@ 
 	skb_reserve(buff, MAX_TCP_HEADER);
 	tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK);

+#ifdef CONFIG_TCPHEALTH
+	/* If the rcv_nxt has not advanced since sending our last
+	 * ACK, this is a duplicate. */
+	if (tcp_sk(sk)->rcv_nxt == tcp_sk(sk)->last_ack_sent)
+		tcp_sk(sk)->dup_acks_sent++;
+	/* Record the total number of acks sent on this connection. */
+	tcp_sk(sk)->acks_sent++;
+#endif
+
 	/* Send it off, this clears delayed acks for us. */
 	TCP_SKB_CB(buff)->when = tcp_time_stamp;
+#ifdef CONFIG_TCPHEALTH
+	tcp_sk(sk)->last_ack_sent = tcp_sk(sk)->rcv_nxt;
+#endif
 	tcp_transmit_skb(sk, buff, 0, GFP_ATOMIC);
 }