diff mbox series

[v2] net: tracepoint: fix print wrong sysctl_mem value

Message ID 20200908020939.7653-1-dust.li@linux.alibaba.com
State Changes Requested
Delegated to: David Miller
Headers show
Series [v2] net: tracepoint: fix print wrong sysctl_mem value | expand

Commit Message

Dust Li Sept. 8, 2020, 2:09 a.m. UTC
sysctl_mem is an point, and tracepoint entry do not support
been visited like an array. Use an __array(3) to store sysctl_mem
instead.

tracpoint output with and without this fix:
- without fix:
   28821.074 sock:sock_exceed_buf_limit:proto:UDP
   sysctl_mem=-1741233440,19,322156906942464 allocated=19 sysctl_rmem=4096
   rmem_alloc=75008 sysctl_wmem=4096 wmem_alloc=1 wmem_queued=0
   kind=SK_MEM_RECV

- with fix:
  2126.136 sock:sock_exceed_buf_limit:proto:UDP
  sysctl_mem=18,122845,184266 allocated=19 sysctl_rmem=4096
  rmem_alloc=73728 sysctl_wmem=4096 wmem_alloc=1 wmem_queued=0
  kind=SK_MEM_RECV

v2: use __array(3) instead of 3 long type to store sysctl_mem

Fixes: 3847ce32aea9fdf ("core: add tracepoints for queueing skb to rcvbuf")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
---
 include/trace/events/sock.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

David Miller Sept. 9, 2020, 2:49 a.m. UTC | #1
From: Dust Li <dust.li@linux.alibaba.com>
Date: Tue,  8 Sep 2020 10:09:39 +0800

> @@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>  
>  	TP_STRUCT__entry(
>  		__array(char, name, 32)
> -		__field(long *, sysctl_mem)
> +		__array(long, sysctl_mem, 3)
>  		__field(long, allocated)
>  		__field(int, sysctl_rmem)
>  		__field(int, rmem_alloc)
> @@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>  
>  	TP_fast_assign(
>  		strncpy(__entry->name, prot->name, 32);
> -		__entry->sysctl_mem = prot->sysctl_mem;
> +		__entry->sysctl_mem[0] = prot->sysctl_mem[0];
> +		__entry->sysctl_mem[1] = prot->sysctl_mem[1];
> +		__entry->sysctl_mem[2] = prot->sysctl_mem[2];

I can't understand at all why the current code doesn't work.

We assign a pointer to entry->sysctl_mem and then print out the
three words pointed to by that.

It's so wasteful to copy this over every tracepoint entry so
the pointer approach is very desirable.
Dust Li Sept. 10, 2020, 2:34 p.m. UTC | #2
On Tue, Sep 08, 2020 at 07:49:52PM -0700, David Miller wrote:
>From: Dust Li <dust.li@linux.alibaba.com>
>Date: Tue,  8 Sep 2020 10:09:39 +0800
>
>> @@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>  
>>  	TP_STRUCT__entry(
>>  		__array(char, name, 32)
>> -		__field(long *, sysctl_mem)
>> +		__array(long, sysctl_mem, 3)
>>  		__field(long, allocated)
>>  		__field(int, sysctl_rmem)
>>  		__field(int, rmem_alloc)
>> @@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>  
>>  	TP_fast_assign(
>>  		strncpy(__entry->name, prot->name, 32);
>> -		__entry->sysctl_mem = prot->sysctl_mem;
>> +		__entry->sysctl_mem[0] = prot->sysctl_mem[0];
>> +		__entry->sysctl_mem[1] = prot->sysctl_mem[1];
>> +		__entry->sysctl_mem[2] = prot->sysctl_mem[2];
>
>I can't understand at all why the current code doesn't work.
>
>We assign a pointer to entry->sysctl_mem and then print out the
>three words pointed to by that.
>
>It's so wasteful to copy this over every tracepoint entry so
>the pointer approach is very desirable.

Thanks for your reply!

I took a close look at the code generated by tracepoint and
found the problem is not the tracepoint itself, but `perf trace`.

My previous output was got by running:
  `perf trace -e sock:sock_exceed_buf_limit`

This time, I tried directly read from /sys/kernel/debug/tracing/trace,
and everything is right :)

So I checked the code of perf tool, and found the foundamatal difference
is `perf trace` did the string formatting in the userspace, but raw ftrace
did it in the kernel.

When using `perf trace`, the kernel passes the string format and the
data to perf using the perf ringbuffer, and no one in the kernel will
try to visit the pointer sysctl_mem is pointed to, so the the userspace
perf got the original pointer of sysctl_mem and tries to do the formating,
which result in the wrong data in the commit log.

The key call trace when using `perf trace` is this:

trace_sock_exceed_buf_limit()
    --> perf_trace_sock_exceed_buf_limit()
        {
	        ...
                perf_fetch_caller_regs(__regs);
                {
                        strncpy(entry->name, prot->name, 32);
                        entry->sysctl_mem = prot->sysctl_mem;
                        entry->allocated = allocated;
                        entry->sysctl_rmem = sk_get_rmem0(sk, prot);
                        entry->rmem_alloc = atomic_read(&sk->sk_backlog.rmem_alloc);
                        entry->sysctl_wmem = sk_get_wmem0(sk, prot);
                        entry->wmem_alloc = refcount_read(&sk->sk_wmem_alloc);
                        entry->wmem_queued = sk->sk_wmem_queued;
                        entry->kind = kind;
                }
                perf_trace_run_bpf_submit(entry, __entry_size, rctx, event_call, \
					  __count, __regs, head, __task);
        }


Here *entry* is directly passed in to perf_trace_run_bpf_submit()
as raw data, and perf_trace_run_bpf_submit() won't do string formatting
but just pass them to the userspace perf, which will finnally did the
formatting, but it's already too late to get sysctl_mem[x].

So, any pointer dereference in tracepoint entry should failed in
`perf trace`.

Thanks.
Dust.Li
diff mbox series

Patch

diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index a966d4b5ab37..914e58938480 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -98,7 +98,7 @@  TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_STRUCT__entry(
 		__array(char, name, 32)
-		__field(long *, sysctl_mem)
+		__array(long, sysctl_mem, 3)
 		__field(long, allocated)
 		__field(int, sysctl_rmem)
 		__field(int, rmem_alloc)
@@ -110,7 +110,9 @@  TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_fast_assign(
 		strncpy(__entry->name, prot->name, 32);
-		__entry->sysctl_mem = prot->sysctl_mem;
+		__entry->sysctl_mem[0] = prot->sysctl_mem[0];
+		__entry->sysctl_mem[1] = prot->sysctl_mem[1];
+		__entry->sysctl_mem[2] = prot->sysctl_mem[2];
 		__entry->allocated = allocated;
 		__entry->sysctl_rmem = sk_get_rmem0(sk, prot);
 		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);