diff mbox series

[net-next,v3] tcp: add a tracepoint for tcp retransmission

Message ID 20171013200316.28561-1-xiyou.wangcong@gmail.com
State Accepted, archived
Delegated to: David Miller
Headers show
Series [net-next,v3] tcp: add a tracepoint for tcp retransmission | expand

Commit Message

Cong Wang Oct. 13, 2017, 8:03 p.m. UTC
We need a real-time notification for tcp retransmission
for monitoring.

Of course we could use ftrace to dynamically instrument this
kernel function too, however we can't retrieve the connection
information at the same time, for example perf-tools [1] reads
/proc/net/tcp for socket details, which is slow when we have
a lots of connections.

Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
and exposes src/dst IP addresses and ports of the connection.
This also makes it easier to integrate into perf.

Note, I expose both IPv4 and IPv6 addresses at the same time:
for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
Also, add sk and skb pointers as they are useful for BPF.

1. https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans

Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/trace/events/tcp.h | 68 ++++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-traces.c      |  1 +
 net/ipv4/tcp_output.c      |  3 ++
 3 files changed, 72 insertions(+)
 create mode 100644 include/trace/events/tcp.h

Comments

Brendan Gregg Oct. 13, 2017, 8:50 p.m. UTC | #1
On Fri, Oct 13, 2017 at 1:03 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> We need a real-time notification for tcp retransmission
> for monitoring.
>
> Of course we could use ftrace to dynamically instrument this
> kernel function too, however we can't retrieve the connection
> information at the same time, for example perf-tools [1] reads
> /proc/net/tcp for socket details, which is slow when we have
> a lots of connections.
>
> Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
> and exposes src/dst IP addresses and ports of the connection.
> This also makes it easier to integrate into perf.
>
> Note, I expose both IPv4 and IPv6 addresses at the same time:
> for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
> for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
> Also, add sk and skb pointers as they are useful for BPF.

Thanks, a TCP retransmit tracepoint would be great. (tcp_set_state
would be highly useful too, which Alexei already has in his list).

Should skp->__sk_common.skc_state be included in the format string, so
we don't have to always dig it out of the skaddr? For retransmits I
always want to know the TCP state, to determine if it is ESTABLISHED
(packet drop) or SYN_SENT (backlog full) or something else.

We probably need a tracepoint for tcp_send_loss_probe() (TLP) as well,
for tracing at the same time as retransmits (like my tools do), but
that can be added later.

Brendan
Alexei Starovoitov Oct. 13, 2017, 10:09 p.m. UTC | #2
On Fri, Oct 13, 2017 at 01:50:44PM -0700, Brendan Gregg wrote:
> On Fri, Oct 13, 2017 at 1:03 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > We need a real-time notification for tcp retransmission
> > for monitoring.
> >
> > Of course we could use ftrace to dynamically instrument this
> > kernel function too, however we can't retrieve the connection
> > information at the same time, for example perf-tools [1] reads
> > /proc/net/tcp for socket details, which is slow when we have
> > a lots of connections.
> >
> > Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
> > and exposes src/dst IP addresses and ports of the connection.
> > This also makes it easier to integrate into perf.
> >
> > Note, I expose both IPv4 and IPv6 addresses at the same time:
> > for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
> > for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
> > Also, add sk and skb pointers as they are useful for BPF.
> 
> Thanks, a TCP retransmit tracepoint would be great. (tcp_set_state
> would be highly useful too, which Alexei already has in his list).
> 
> Should skp->__sk_common.skc_state be included in the format string, so
> we don't have to always dig it out of the skaddr? For retransmits I
> always want to know the TCP state, to determine if it is ESTABLISHED
> (packet drop) or SYN_SENT (backlog full) or something else.

let's not expose internal socket fields into tp fields.
Few people still believe that tp fields are abi, so to be safe
no such fields should be exposed.
It's trivial enough to read sk_state from bpf program
with bpf_probe_read().

> We probably need a tracepoint for tcp_send_loss_probe() (TLP) as well,
> for tracing at the same time as retransmits (like my tools do), but
> that can be added later.

hmm. why?
This single tracepoint will cover both cases of retransmits.

For the patch:
Acked-by: Alexei Starovoitov <ast@kernel.org>

I believe it will fit our use case perfectly.
David Ahern Oct. 13, 2017, 11:13 p.m. UTC | #3
On 10/13/17 2:03 PM, Cong Wang wrote:
> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> new file mode 100644
> index 000000000000..3d1cbd072b7e
> --- /dev/null
> +++ b/include/trace/events/tcp.h
> @@ -0,0 +1,68 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM tcp
> +
> +#if !defined(_TRACE_TCP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_TCP_H
> +
> +#include <linux/ipv6.h>
> +#include <linux/tcp.h>
> +#include <linux/tracepoint.h>
> +#include <net/ipv6.h>
> +
> +TRACE_EVENT(tcp_retransmit_skb,
> +
> +	TP_PROTO(struct sock *sk, struct sk_buff *skb),
> +
> +	TP_ARGS(sk, skb),
> +
> +	TP_STRUCT__entry(
> +		__field(void *, skbaddr)
> +		__field(void *, skaddr)
> +		__field(__u16, sport)
> +		__field(__u16, dport)
> +		__array(__u8, saddr, 4)
> +		__array(__u8, daddr, 4)
> +		__array(__u8, saddr_v6, 16)
> +		__array(__u8, daddr_v6, 16)
> +	),
> +
> +	TP_fast_assign(
> +		struct ipv6_pinfo *np = inet6_sk(sk);
> +		struct inet_sock *inet = inet_sk(sk);
> +		struct in6_addr *pin6;
> +		__be32 *p32;
> +
> +		__entry->skbaddr = skb;
> +		__entry->skaddr = sk;
> +
> +		__entry->sport = ntohs(inet->inet_sport);
> +		__entry->dport = ntohs(inet->inet_dport);
> +
> +		p32 = (__be32 *) __entry->saddr;
> +		*p32 = inet->inet_saddr;
> +
> +		p32 = (__be32 *) __entry->daddr;
> +		*p32 =  inet->inet_daddr;
> +
> +		if (np) {
> +			pin6 = (struct in6_addr *)__entry->saddr_v6;
> +			*pin6 = np->saddr;
> +			pin6 = (struct in6_addr *)__entry->daddr_v6;
> +			*pin6 = *(np->daddr_cache);
> +		} else {
> +			pin6 = (struct in6_addr *)__entry->saddr_v6;
> +			ipv6_addr_set_v4mapped(inet->inet_saddr, pin6);
> +			pin6 = (struct in6_addr *)__entry->daddr_v6;
> +			ipv6_addr_set_v4mapped(inet->inet_daddr, pin6);
> +		}
> +	),
> +
> +	TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6 daddrv6=%pI6",

%pI6c is more user friendly for IPv6 addresses.
Brendan Gregg Oct. 13, 2017, 11:58 p.m. UTC | #4
On Fri, Oct 13, 2017 at 3:09 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Oct 13, 2017 at 01:50:44PM -0700, Brendan Gregg wrote:
>> On Fri, Oct 13, 2017 at 1:03 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > We need a real-time notification for tcp retransmission
>> > for monitoring.
>> >
>> > Of course we could use ftrace to dynamically instrument this
>> > kernel function too, however we can't retrieve the connection
>> > information at the same time, for example perf-tools [1] reads
>> > /proc/net/tcp for socket details, which is slow when we have
>> > a lots of connections.
>> >
>> > Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
>> > and exposes src/dst IP addresses and ports of the connection.
>> > This also makes it easier to integrate into perf.
>> >
>> > Note, I expose both IPv4 and IPv6 addresses at the same time:
>> > for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
>> > for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
>> > Also, add sk and skb pointers as they are useful for BPF.
>>
>> Thanks, a TCP retransmit tracepoint would be great. (tcp_set_state
>> would be highly useful too, which Alexei already has in his list).
>>
>> Should skp->__sk_common.skc_state be included in the format string, so
>> we don't have to always dig it out of the skaddr? For retransmits I
>> always want to know the TCP state, to determine if it is ESTABLISHED
>> (packet drop) or SYN_SENT (backlog full) or something else.
>
> let's not expose internal socket fields into tp fields.
> Few people still believe that tp fields are abi, so to be safe
> no such fields should be exposed.
> It's trivial enough to read sk_state from bpf program
> with bpf_probe_read().

Ah, right, the number mapping for TCP_ESTABLISHED and friends is a
Linux implementation detail, and not from the RFCs. Ok, I can dig it
from the skp instead.

>
>> We probably need a tracepoint for tcp_send_loss_probe() (TLP) as well,
>> for tracing at the same time as retransmits (like my tools do), but
>> that can be added later.
>
> hmm. why?
> This single tracepoint will cover both cases of retransmits.

I don't think tcp_send_loss_probe() TLP goes through
__tcp_retransmit_skb(): look at the path to bumping
LINUX_MIB_TCPLOSSPROBES. I was thinking that later on we might want to
add a tcp:tcp_send_tlp tracepoint, in addition to this
tcp:tcp_retransmit_skb tracepoint, for investigating the same kind of
issues: packet loss. This existing tcp:tcp_retransmit_skb tracepoint
patch is ok.

Acked-by: Brendan Gregg <bgregg@netflix.com>

(with or without %pI6c)

Brendan
David Miller Oct. 15, 2017, 1:45 a.m. UTC | #5
From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Fri, 13 Oct 2017 13:03:16 -0700

> We need a real-time notification for tcp retransmission
> for monitoring.
> 
> Of course we could use ftrace to dynamically instrument this
> kernel function too, however we can't retrieve the connection
> information at the same time, for example perf-tools [1] reads
> /proc/net/tcp for socket details, which is slow when we have
> a lots of connections.
> 
> Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
> and exposes src/dst IP addresses and ports of the connection.
> This also makes it easier to integrate into perf.
> 
> Note, I expose both IPv4 and IPv6 addresses at the same time:
> for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
> for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
> Also, add sk and skb pointers as they are useful for BPF.
> 
> 1. https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans
> 
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied, thank you.
diff mbox series

Patch

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
new file mode 100644
index 000000000000..3d1cbd072b7e
--- /dev/null
+++ b/include/trace/events/tcp.h
@@ -0,0 +1,68 @@ 
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tcp
+
+#if !defined(_TRACE_TCP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TCP_H
+
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/tracepoint.h>
+#include <net/ipv6.h>
+
+TRACE_EVENT(tcp_retransmit_skb,
+
+	TP_PROTO(struct sock *sk, struct sk_buff *skb),
+
+	TP_ARGS(sk, skb),
+
+	TP_STRUCT__entry(
+		__field(void *, skbaddr)
+		__field(void *, skaddr)
+		__field(__u16, sport)
+		__field(__u16, dport)
+		__array(__u8, saddr, 4)
+		__array(__u8, daddr, 4)
+		__array(__u8, saddr_v6, 16)
+		__array(__u8, daddr_v6, 16)
+	),
+
+	TP_fast_assign(
+		struct ipv6_pinfo *np = inet6_sk(sk);
+		struct inet_sock *inet = inet_sk(sk);
+		struct in6_addr *pin6;
+		__be32 *p32;
+
+		__entry->skbaddr = skb;
+		__entry->skaddr = sk;
+
+		__entry->sport = ntohs(inet->inet_sport);
+		__entry->dport = ntohs(inet->inet_dport);
+
+		p32 = (__be32 *) __entry->saddr;
+		*p32 = inet->inet_saddr;
+
+		p32 = (__be32 *) __entry->daddr;
+		*p32 =  inet->inet_daddr;
+
+		if (np) {
+			pin6 = (struct in6_addr *)__entry->saddr_v6;
+			*pin6 = np->saddr;
+			pin6 = (struct in6_addr *)__entry->daddr_v6;
+			*pin6 = *(np->daddr_cache);
+		} else {
+			pin6 = (struct in6_addr *)__entry->saddr_v6;
+			ipv6_addr_set_v4mapped(inet->inet_saddr, pin6);
+			pin6 = (struct in6_addr *)__entry->daddr_v6;
+			ipv6_addr_set_v4mapped(inet->inet_daddr, pin6);
+		}
+	),
+
+	TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6 daddrv6=%pI6",
+		  __entry->sport, __entry->dport, __entry->saddr, __entry->daddr,
+		  __entry->saddr_v6, __entry->daddr_v6)
+);
+
+#endif /* _TRACE_TCP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index 1132820c8e62..f4e4fa2db505 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -31,6 +31,7 @@ 
 #include <trace/events/napi.h>
 #include <trace/events/sock.h>
 #include <trace/events/udp.h>
+#include <trace/events/tcp.h>
 #include <trace/events/fib.h>
 #include <trace/events/qdisc.h>
 #if IS_ENABLED(CONFIG_IPV6)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 696b0a168f16..6c74f2a39778 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -42,6 +42,8 @@ 
 #include <linux/gfp.h>
 #include <linux/module.h>
 
+#include <trace/events/tcp.h>
+
 /* People can turn this off for buggy TCP's found in printers etc. */
 int sysctl_tcp_retrans_collapse __read_mostly = 1;
 
@@ -2875,6 +2877,7 @@  int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 
 	if (likely(!err)) {
 		TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
+		trace_tcp_retransmit_skb(sk, skb);
 	} else if (err != -EBUSY) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
 	}