Message ID | 65795E11DBF1E645A09CEC7EAEE94B9C3FBC0707@USINDEVS02.corp.hds.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Tue, Jun 14, 2011 at 03:24:14PM -0400, Satoru Moriya wrote: > Hi, > > kernel drops packets when the amount of memory which is used for socket buffer > exceeds limitations such as /proc/sys/net/ipv4/udp_mem. But currently we can't > catch that event and know why packets are dropped. And also it is difficult to > configure sysctl knob appropriately because we don't know when/why packets > dropped. > There are several ways to do this already. Every drop that occurs in the stack should have a corresponding statistical counter exposed for it, and we also have a tracepoint in kfree_skb that the dropwatch system monitors to inform us of dropped packets in a certralized fashion. Not saying this tracepoint isn't worthwhile, only that it covers already covered ground. > This patch adds tracepoint to __sk_mem_schedule(), which is called each time > the socket memory usage exceeds limitations and kernel drops a packet. > It allows us to hook in and examine when and why it happens. > > Note that this patch only collects information which is needed for udp > because it's a RFC patch to show its concept and acutually we need it(*). > If you guys need to get other parameters, please let me know. I'll add it. > > (*) Reason why we need this tracepoint for UDP > Transaction data is sent by UDP multicast in finance systems because of its > low overhead characteristics. UDP itself does not guarantee reliability, > ordering and data integrity, but the system is designed not to drop any packets > even when it is high load situation. And in that system if kernel drops packets, > we need to find a root cause to avoid it next time. > Again, this is why dropwatch exists. UDP gets into this path from: __udp_queue_rcv_skb ip_queue_rcv_skb sock_queue_rcv_skb sk_rmem_schedule __sk_mem_schedule If ip_queue_rcv_skb fails we increment the UDP_MIB_RCVBUFERRORS counter as well as the UDP_MIB_INERRORS counter, and on the kfree_skb call after those increments, dropwatch will report the frame loss and the fact that it occured in __udp_queue_rcv_skb I still think its an interesting tracepoint, just because it might be nice to know which sockets are expanding their snd/rcv buffer space, but why not modify the tracepoint so that it accepts the return code of __sk_mem_schedule and call it from both sk_rmem_schedule and sk_wmem_schedule. That way you can use the tracepoint to record both successfull expansion and failed expansions. Neil > Any comments are welcome. > > Signed-off-by: Satoru Moriya <satoru.moriya@hds.com> > --- > include/trace/events/sock.h | 46 +++++++++++++++++++++++++++++++++++++++++++ > net/core/net-traces.c | 1 + > net/core/sock.c | 4 +++ > 3 files changed, 51 insertions(+), 0 deletions(-) > create mode 100644 include/trace/events/sock.h > > diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h > new file mode 100644 > index 0000000..409735a > --- /dev/null > +++ b/include/trace/events/sock.h > @@ -0,0 +1,46 @@ > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM sock > + > +#if !defined(_TRACE_SOCK_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_SOCK_H > + > +#include <net/sock.h> > +#include <linux/tracepoint.h> > + > +TRACE_EVENT(sock_exceed_buf_limit, > + > + TP_PROTO(struct sock *sk, struct proto *prot, long allocated), > + > + TP_ARGS(sk, prot, allocated), > + > + TP_STRUCT__entry( > + __array(char, name, 32) > + __field(long *, sysctl_mem) > + __field(long, allocated) > + __field(int, sysctl_rmem) > + __field(int, rmem_alloc) > + ), > + > + TP_fast_assign( > + strncpy(__entry->name, prot->name, 32); > + __entry->sysctl_mem = prot->sysctl_mem; > + __entry->allocated = allocated; > + __entry->sysctl_rmem = atomic_read(&sk->sk_rmem_alloc); > + __entry->rmem_alloc = prot->sysctl_rmem[0]; > + ), > + > + TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld " > + "sysctl_rmem=%d rmem_alloc=%d", > + __entry->name, > + __entry->sysctl_mem[0], > + __entry->sysctl_mem[1], > + __entry->sysctl_mem[2], > + __entry->allocated, > + __entry->sysctl_rmem, > + __entry->rmem_alloc) > +); > + > +#endif /* _TRACE_SOCK_H */ > + > +/* This part must be outside protection */ > +#include <trace/define_trace.h> > diff --git a/net/core/net-traces.c b/net/core/net-traces.c > index 7f1bb2a..b9756f5 100644 > --- a/net/core/net-traces.c > +++ b/net/core/net-traces.c > @@ -28,6 +28,7 @@ > #include <trace/events/skb.h> > #include <trace/events/net.h> > #include <trace/events/napi.h> > +#include <trace/events/sock.h> > > EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb); > > diff --git a/net/core/sock.c b/net/core/sock.c > index 6e81978..8389032 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -128,6 +128,8 @@ > > #include <linux/filter.h> > > +#include <trace/events/sock.h> > + > #ifdef CONFIG_INET > #include <net/tcp.h> > #endif > @@ -1736,6 +1738,8 @@ suppress_allocation: > return 1; > } > > + trace_sock_exceed_buf_limit(sk, prot, allocated); > + > /* Alas. Undo changes. */ > sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM; > atomic_long_sub(amt, prot->memory_allocated); > -- > 1.7.1 > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Neil, Thank you for looking at it. I'm afraid that my explanation was not enough. On 06/15/2011 07:07 AM, Neil Horman wrote: > On Tue, Jun 14, 2011 at 03:24:14PM -0400, Satoru Moriya wrote: > > There are several ways to do this already. Every drop that occurs in the stack > should have a corresponding statistical counter exposed for it, and we also have > a tracepoint in kfree_skb that the dropwatch system monitors to inform us of > dropped packets in a certralized fashion. Not saying this tracepoint isn't > worthwhile, only that it covers already covered ground. Right. We can catch packet drop events via dropwatch/kfree_skb tracepoint. OTOH, what I'd like to know is why kernel dropped packets. Basically, we're able to know the reason because kfree_skb tracepoint records its return address. But in this case it is difficult to know a detailed reason because there are some possibilities. I'll try to explain UDP case. > Again, this is why dropwatch exists. UDP gets into this path from: > __udp_queue_rcv_skb > ip_queue_rcv_skb > sock_queue_rcv_skb > sk_rmem_schedule > __sk_mem_schedule > > If ip_queue_rcv_skb fails we increment the UDP_MIB_RCVBUFERRORS counter as well > as the UDP_MIB_INERRORS counter, and on the kfree_skb call after those > increments, dropwatch will report the frame loss and the fact that it occured in > __udp_queue_rcv_skb As you said, we're able to catch the packet drop in __udp_queue_rcv_skb and it means ip_queue_rcv_skb/sock_queue_rcv_skb returned negative value. In sock_queue_rcv_skb there are 3 possibilities where it returns nagative val but we can't separate them from the kfree_skb tracepoint. Moreover sock_queue_rcv_skb calls __sk_mem_schedule and there are several if-statement to decide whether kernel should drop a packet in it. I'd like to know the condition when it returned error because some of them are related to sysctl knob e.g. /proc/sys/net/ipv4/udp_mem, udp_rmem_min, udp_wmem_min for UDP case and we can tune kernel behavior easily. Also they are needed to show root cause of packet drop to our customers. Does it make sense? > I still think its an interesting tracepoint, just because it might be nice to > know which sockets are expanding their snd/rcv buffer space, but why not modify > the tracepoint so that it accepts the return code of __sk_mem_schedule and call > it from both sk_rmem_schedule and sk_wmem_schedule. That way you can use the > tracepoint to record both successfull expansion and failed expansions. It may be good but not enough for me as I mentioned above. Regards, Satoru-- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 15, 2011 at 03:18:09PM -0400, Satoru Moriya wrote: > Hi Neil, > > Thank you for looking at it. > I'm afraid that my explanation was not enough. > > On 06/15/2011 07:07 AM, Neil Horman wrote: > > On Tue, Jun 14, 2011 at 03:24:14PM -0400, Satoru Moriya wrote: > > > > There are several ways to do this already. Every drop that occurs in the stack > > should have a corresponding statistical counter exposed for it, and we also have > > a tracepoint in kfree_skb that the dropwatch system monitors to inform us of > > dropped packets in a certralized fashion. Not saying this tracepoint isn't > > worthwhile, only that it covers already covered ground. > > Right. We can catch packet drop events via dropwatch/kfree_skb tracepoint. > OTOH, what I'd like to know is why kernel dropped packets. Basically, > we're able to know the reason because kfree_skb tracepoint records its > return address. But in this case it is difficult to know a detailed reason > because there are some possibilities. I'll try to explain UDP case. Ok, but you're patch only gets us incrmentally closer to that goal. You can tell if theres too much memory pressue to accept another packet on a socket, but it doesn't tell you if for instance if sk_filter indicated an error, or something else failed. In short it sounds to me like you're trying to debug a specific problem, rather than add something generally useful. > > > Again, this is why dropwatch exists. UDP gets into this path from: > > __udp_queue_rcv_skb > > ip_queue_rcv_skb > > sock_queue_rcv_skb > > sk_rmem_schedule > > __sk_mem_schedule > > > > If ip_queue_rcv_skb fails we increment the UDP_MIB_RCVBUFERRORS counter as well > > as the UDP_MIB_INERRORS counter, and on the kfree_skb call after those > > increments, dropwatch will report the frame loss and the fact that it occured in > > __udp_queue_rcv_skb > > As you said, we're able to catch the packet drop in __udp_queue_rcv_skb > and it means ip_queue_rcv_skb/sock_queue_rcv_skb returned negative value. > In sock_queue_rcv_skb there are 3 possibilities where it returns nagative > val but we can't separate them from the kfree_skb tracepoint. Moreover This is true, but you're patch doesn't do that either (not that either feature holds that as a specific goal). > sock_queue_rcv_skb calls __sk_mem_schedule and there are several > if-statement to decide whether kernel should drop a packet in it. I'd like > to know the condition when it returned error because some of them are > related to sysctl knob e.g. /proc/sys/net/ipv4/udp_mem, udp_rmem_min, > udp_wmem_min for UDP case and we can tune kernel behavior easily. > Also they are needed to show root cause of packet drop to our customers. > > Does it make sense? > Yes, it makes sense, but I think the patch could be made generally more useful. > > I still think its an interesting tracepoint, just because it might be nice to > > know which sockets are expanding their snd/rcv buffer space, but why not modify > > the tracepoint so that it accepts the return code of __sk_mem_schedule and call > > it from both sk_rmem_schedule and sk_wmem_schedule. That way you can use the > > tracepoint to record both successfull expansion and failed expansions. > > It may be good but not enough for me as I mentioned above. > > > Regards, > Satoru-- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/15/2011 04:04 PM, Neil Horman wrote: > On Wed, Jun 15, 2011 at 03:18:09PM -0400, Satoru Moriya wrote: > >> Right. We can catch packet drop events via dropwatch/kfree_skb tracepoint. >> OTOH, what I'd like to know is why kernel dropped packets. Basically, >> we're able to know the reason because kfree_skb tracepoint records its >> return address. But in this case it is difficult to know a detailed reason >> because there are some possibilities. I'll try to explain UDP case. > Ok, but you're patch only gets us incrmentally closer to that goal. You can > tell if theres too much memory pressue to accept another packet on a socket, but > it doesn't tell you if for instance if sk_filter indicated an error, or > something else failed. In short it sounds to me like you're trying to debug a > specific problem, rather than add something generally useful. You're right. >> As you said, we're able to catch the packet drop in __udp_queue_rcv_skb >> and it means ip_queue_rcv_skb/sock_queue_rcv_skb returned negative value. >> In sock_queue_rcv_skb there are 3 possibilities where it returns nagative >> val but we can't separate them from the kfree_skb tracepoint. Moreover > This is true, but you're patch doesn't do that either (not that either feature > holds that as a specific goal). Right. > >> sock_queue_rcv_skb calls __sk_mem_schedule and there are several >> if-statement to decide whether kernel should drop a packet in it. I'd like >> to know the condition when it returned error because some of them are >> related to sysctl knob e.g. /proc/sys/net/ipv4/udp_mem, udp_rmem_min, >> udp_wmem_min for UDP case and we can tune kernel behavior easily. >> Also they are needed to show root cause of packet drop to our customers. >> >> Does it make sense? >> > Yes, it makes sense, but I think the patch could be made generally more useful. OK. I'll update my patch to make it more useful. Thanks, Satoru-- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h new file mode 100644 index 0000000..409735a --- /dev/null +++ b/include/trace/events/sock.h @@ -0,0 +1,46 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM sock + +#if !defined(_TRACE_SOCK_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_SOCK_H + +#include <net/sock.h> +#include <linux/tracepoint.h> + +TRACE_EVENT(sock_exceed_buf_limit, + + TP_PROTO(struct sock *sk, struct proto *prot, long allocated), + + TP_ARGS(sk, prot, allocated), + + TP_STRUCT__entry( + __array(char, name, 32) + __field(long *, sysctl_mem) + __field(long, allocated) + __field(int, sysctl_rmem) + __field(int, rmem_alloc) + ), + + TP_fast_assign( + strncpy(__entry->name, prot->name, 32); + __entry->sysctl_mem = prot->sysctl_mem; + __entry->allocated = allocated; + __entry->sysctl_rmem = atomic_read(&sk->sk_rmem_alloc); + __entry->rmem_alloc = prot->sysctl_rmem[0]; + ), + + TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld " + "sysctl_rmem=%d rmem_alloc=%d", + __entry->name, + __entry->sysctl_mem[0], + __entry->sysctl_mem[1], + __entry->sysctl_mem[2], + __entry->allocated, + __entry->sysctl_rmem, + __entry->rmem_alloc) +); + +#endif /* _TRACE_SOCK_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> diff --git a/net/core/net-traces.c b/net/core/net-traces.c index 7f1bb2a..b9756f5 100644 --- a/net/core/net-traces.c +++ b/net/core/net-traces.c @@ -28,6 +28,7 @@ #include <trace/events/skb.h> #include <trace/events/net.h> #include <trace/events/napi.h> +#include <trace/events/sock.h> EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb); diff --git a/net/core/sock.c b/net/core/sock.c index 6e81978..8389032 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -128,6 +128,8 @@ #include <linux/filter.h> +#include <trace/events/sock.h> + #ifdef CONFIG_INET #include <net/tcp.h> #endif @@ -1736,6 +1738,8 @@ suppress_allocation: return 1; } + trace_sock_exceed_buf_limit(sk, prot, allocated); + /* Alas. Undo changes. */ sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM; atomic_long_sub(amt, prot->memory_allocated);
Hi, kernel drops packets when the amount of memory which is used for socket buffer exceeds limitations such as /proc/sys/net/ipv4/udp_mem. But currently we can't catch that event and know why packets are dropped. And also it is difficult to configure sysctl knob appropriately because we don't know when/why packets dropped. This patch adds tracepoint to __sk_mem_schedule(), which is called each time the socket memory usage exceeds limitations and kernel drops a packet. It allows us to hook in and examine when and why it happens. Note that this patch only collects information which is needed for udp because it's a RFC patch to show its concept and acutually we need it(*). If you guys need to get other parameters, please let me know. I'll add it. (*) Reason why we need this tracepoint for UDP Transaction data is sent by UDP multicast in finance systems because of its low overhead characteristics. UDP itself does not guarantee reliability, ordering and data integrity, but the system is designed not to drop any packets even when it is high load situation. And in that system if kernel drops packets, we need to find a root cause to avoid it next time. Any comments are welcome. Signed-off-by: Satoru Moriya <satoru.moriya@hds.com> --- include/trace/events/sock.h | 46 +++++++++++++++++++++++++++++++++++++++++++ net/core/net-traces.c | 1 + net/core/sock.c | 4 +++ 3 files changed, 51 insertions(+), 0 deletions(-) create mode 100644 include/trace/events/sock.h