Message ID | 169451498457.387657.18163846390132907462.stgit@ebuild |
---|---
State | Accepted |
Commit | 13dde113107b21d3e1c5f4197590f7a1ce71459e |
Series | [ovs-dev,v4] utilities: Add kernel_delay.py script to debug a busy Linux kernel. |
Context | Check | Description |
---|---|---
ovsrobot/apply-robot | warning | apply and check: warning |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/intel-ovs-compilation | success | test: success |
On Tue, Sep 12, 2023 at 12:36:41PM +0200, Eelco Chaudron wrote:
> This patch adds a utility that can be used to determine if
> an issue is related to a lack of Linux kernel resources.
>
> This tool is also featured in a Red Hat developers blog article:
>
> https://developers.redhat.com/articles/2023/07/24/troubleshooting-open-vswitch-kernel-blame
>
> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>

Acked-by: Simon Horman <horms@ovn.org>
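A note for readers working through the patch quoted below: nearly every BPF map in kernel_delay.py is keyed on the packed pid/tgid value returned by `bpf_get_current_pid_tgid()`, and the sched probes rebuild the same value by hand as `(u64) tgid << 32 | pid`. The following is a minimal illustrative sketch of that packing in plain Python (the helper names are invented here, not part of the patch):

```python
# bpf_get_current_pid_tgid() returns one u64: the TGID (what userspace
# calls the "PID") in the upper 32 bits, and the kernel thread ID
# (userspace TID) in the lower 32 bits.
def pack_pid_tgid(tgid, tid):
    # Mirrors the patch's "(u64) tgid << 32 | pid" expression.
    return (tgid << 32) | tid

def unpack_pid_tgid(pid_tgid):
    # Mirrors "pid_tgid >> 32" (the MONITOR_PID check) and
    # "(u32)pid_tgid" (the per-thread key fields) in the eBPF source.
    return pid_tgid >> 32, pid_tgid & 0xFFFFFFFF
```

This is why the script can filter on the monitored process with a single comparison (`pid_tgid >> 32 != MONITOR_PID`) while still keeping per-thread statistics.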
On 9/12/23 12:36, Eelco Chaudron wrote: > This patch adds an utility that can be used to determine if > an issue is related to a lack of Linux kernel resources. > > This tool is also featured in a Red Hat developers blog article: > > https://developers.redhat.com/articles/2023/07/24/troubleshooting-open-vswitch-kernel-blame > > Signed-off-by: Eelco Chaudron <echaudro@redhat.com> > Reviewed-by: Adrian Moreno <amorenoz@redhat.com> > --- > v2: Addressed review comments from Aaron. > v3: Changed wording in documentation. > v4: Addressed review comments from Adrian. > > utilities/automake.mk | 4 > utilities/usdt-scripts/kernel_delay.py | 1420 +++++++++++++++++++++++++++++++ > utilities/usdt-scripts/kernel_delay.rst | 596 +++++++++++++ > 3 files changed, 2020 insertions(+) > create mode 100755 utilities/usdt-scripts/kernel_delay.py > create mode 100644 utilities/usdt-scripts/kernel_delay.rst > > diff --git a/utilities/automake.mk b/utilities/automake.mk > index 37d679f82..9a2114df4 100644 > --- a/utilities/automake.mk > +++ b/utilities/automake.mk > @@ -23,6 +23,8 @@ scripts_DATA += utilities/ovs-lib > usdt_SCRIPTS += \ > utilities/usdt-scripts/bridge_loop.bt \ > utilities/usdt-scripts/dpif_nl_exec_monitor.py \ > + utilities/usdt-scripts/kernel_delay.py \ > + utilities/usdt-scripts/kernel_delay.rst \ > utilities/usdt-scripts/reval_monitor.py \ > utilities/usdt-scripts/upcall_cost.py \ > utilities/usdt-scripts/upcall_monitor.py > @@ -70,6 +72,8 @@ EXTRA_DIST += \ > utilities/docker/debian/build-kernel-modules.sh \ > utilities/usdt-scripts/bridge_loop.bt \ > utilities/usdt-scripts/dpif_nl_exec_monitor.py \ > + utilities/usdt-scripts/kernel_delay.py \ > + utilities/usdt-scripts/kernel_delay.rst \ > utilities/usdt-scripts/reval_monitor.py \ > utilities/usdt-scripts/upcall_cost.py \ > utilities/usdt-scripts/upcall_monitor.py > diff --git a/utilities/usdt-scripts/kernel_delay.py b/utilities/usdt-scripts/kernel_delay.py > new file mode 100755 > index 000000000..636e108be > --- 
/dev/null > +++ b/utilities/usdt-scripts/kernel_delay.py > @@ -0,0 +1,1420 @@ > +#!/usr/bin/env python3 > +# > +# Copyright (c) 2022,2023 Red Hat, Inc. > +# > +# Licensed under the Apache License, Version 2.0 (the "License"); > +# you may not use this file except in compliance with the License. > +# You may obtain a copy of the License at: > +# > +# http://www.apache.org/licenses/LICENSE-2.0 > +# > +# Unless required by applicable law or agreed to in writing, software > +# distributed under the License is distributed on an "AS IS" BASIS, > +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > +# See the License for the specific language governing permissions and > +# limitations under the License. > +# > +# > +# Script information: > +# ------------------- > +# This script allows a developer to quickly identify if the issue at hand > +# might be related to the kernel running out of resources or if it really is > +# an Open vSwitch issue. > +# > +# For documentation see the kernel_delay.rst file. 
> +# > +# > +# Dependencies: > +# ------------- > +# You need to install the BCC package for your specific platform or build it > +# yourself using the following instructions: > +# https://raw.githubusercontent.com/iovisor/bcc/master/INSTALL.md > +# > +# Python needs the following additional packages installed: > +# - pytz > +# - psutil > +# > +# You can either install your distribution specific package or use pip: > +# pip install pytz psutil > +# > +import argparse > +import datetime > +import os > +import pytz > +import psutil > +import re > +import sys > +import time > + > +import ctypes as ct > + > +try: > + from bcc import BPF, USDT, USDTException > + from bcc.syscall import syscalls, syscall_name > +except ModuleNotFoundError: > + print("ERROR: Can't find the BPF Compiler Collection (BCC) tools!") > + sys.exit(os.EX_OSFILE) > + > +from enum import IntEnum > + > + > +# > +# Actual eBPF source code > +# > +EBPF_SOURCE = """ > +#include <linux/irq.h> > +#include <linux/sched.h> > + > +#define MONITOR_PID <MONITOR_PID> > + > +enum { > +<EVENT_ENUM> > +}; > + > +struct event_t { > + u64 ts; > + u32 tid; > + u32 id; > + > + int user_stack_id; > + int kernel_stack_id; > + > + u32 syscall; > + u64 entry_ts; > + > +}; > + > +BPF_RINGBUF_OUTPUT(events, <BUFFER_PAGE_CNT>); > +BPF_STACK_TRACE(stack_traces, <STACK_TRACE_SIZE>); > +BPF_TABLE("percpu_array", uint32_t, uint64_t, dropcnt, 1); > +BPF_TABLE("percpu_array", uint32_t, uint64_t, trigger_miss, 1); > + > +BPF_ARRAY(capture_on, u64, 1); > +static inline bool capture_enabled(u64 pid_tgid) { > + int key = 0; > + u64 *ret; > + > + if ((pid_tgid >> 32) != MONITOR_PID) > + return false; > + > + ret = capture_on.lookup(&key); > + return ret && *ret == 1; > +} > + > +static inline bool capture_enabled__() { > + int key = 0; > + u64 *ret; > + > + ret = capture_on.lookup(&key); > + return ret && *ret == 1; > +} > + > +static struct event_t *get_event(uint32_t id) { > + struct event_t *event = 
events.ringbuf_reserve(sizeof(struct event_t)); > + > + if (!event) { > + dropcnt.increment(0); > + return NULL; > + } > + > + event->id = id; > + event->ts = bpf_ktime_get_ns(); > + event->tid = bpf_get_current_pid_tgid(); > + > + return event; > +} > + > +static int start_trigger() { > + int key = 0; > + u64 *val = capture_on.lookup(&key); > + > + /* If the value is -1 we can't start as we are still processing the > + * results in userspace. */ > + if (!val || *val != 0) { > + trigger_miss.increment(0); > + return 0; > + } > + > + struct event_t *event = get_event(EVENT_START_TRIGGER); > + if (event) { > + events.ringbuf_submit(event, 0); > + *val = 1; > + } else { > + trigger_miss.increment(0); > + } > + return 0; > +} > + > +static int stop_trigger() { > + int key = 0; > + u64 *val = capture_on.lookup(&key); > + > + if (!val || *val != 1) > + return 0; > + > + struct event_t *event = get_event(EVENT_STOP_TRIGGER); > + > + if (event) > + events.ringbuf_submit(event, 0); > + > + if (val) > + *val = -1; > + > + return 0; > +} > + > +<START_TRIGGER> > +<STOP_TRIGGER> > + > + > +/* > + * For the syscall monitor the following probes get installed. 
> + */ > +struct syscall_data_t { > + u64 count; > + u64 total_ns; > + u64 worst_ns; > +}; > + > +struct syscall_data_key_t { > + u32 pid; > + u32 tid; > + u32 syscall; > +}; > + > +BPF_HASH(syscall_start, u64, u64); > +BPF_HASH(syscall_data, struct syscall_data_key_t, struct syscall_data_t); > + > +TRACEPOINT_PROBE(raw_syscalls, sys_enter) { > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + u64 t = bpf_ktime_get_ns(); > + syscall_start.update(&pid_tgid, &t); > + > + return 0; > +} > + > +TRACEPOINT_PROBE(raw_syscalls, sys_exit) { > + struct syscall_data_t *val, zero = {}; > + struct syscall_data_key_t key; > + > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + key.pid = pid_tgid >> 32; > + key.tid = (u32)pid_tgid; > + key.syscall = args->id; > + > + u64 *start_ns = syscall_start.lookup(&pid_tgid); > + > + if (!start_ns) > + return 0; > + > + val = syscall_data.lookup_or_try_init(&key, &zero); > + if (val) { > + u64 delta = bpf_ktime_get_ns() - *start_ns; > + val->count++; > + val->total_ns += delta; > + if (val->worst_ns == 0 || delta > val->worst_ns) > + val->worst_ns = delta; > + > + if (<SYSCALL_TRACE_EVENTS>) { > + struct event_t *event = get_event(EVENT_SYSCALL); > + if (event) { > + event->syscall = args->id; > + event->entry_ts = *start_ns; > + if (<STACK_TRACE_ENABLED>) { > + event->user_stack_id = stack_traces.get_stackid( > + args, BPF_F_USER_STACK); > + event->kernel_stack_id = stack_traces.get_stackid( > + args, 0); > + } > + events.ringbuf_submit(event, 0); > + } > + } > + } > + return 0; > +} > + > + > +/* > + * For measuring the thread run time, we need the following. 
> + */ > +struct run_time_data_t { > + u64 count; > + u64 total_ns; > + u64 max_ns; > + u64 min_ns; > +}; > + > +struct pid_tid_key_t { > + u32 pid; > + u32 tid; > +}; > + > +BPF_HASH(run_start, u64, u64); > +BPF_HASH(run_data, struct pid_tid_key_t, struct run_time_data_t); > + > +static inline void thread_start_run(u64 pid_tgid, u64 ktime) > +{ > + run_start.update(&pid_tgid, &ktime); > +} > + > +static inline void thread_stop_run(u32 pid, u32 tgid, u64 ktime) > +{ > + u64 pid_tgid = (u64) tgid << 32 | pid; > + u64 *start_ns = run_start.lookup(&pid_tgid); > + > + if (!start_ns || *start_ns == 0) > + return; > + > + struct run_time_data_t *val, zero = {}; > + struct pid_tid_key_t key = { .pid = tgid, > + .tid = pid }; > + > + val = run_data.lookup_or_try_init(&key, &zero); > + if (val) { > + u64 delta = ktime - *start_ns; > + val->count++; > + val->total_ns += delta; > + if (val->max_ns == 0 || delta > val->max_ns) > + val->max_ns = delta; > + if (val->min_ns == 0 || delta < val->min_ns) > + val->min_ns = delta; > + } > + *start_ns = 0; > +} > + > + > +/* > + * For measuring the thread-ready delay, we need the following. 
> + */ > +struct ready_data_t { > + u64 count; > + u64 total_ns; > + u64 worst_ns; > +}; > + > +BPF_HASH(ready_start, u64, u64); > +BPF_HASH(ready_data, struct pid_tid_key_t, struct ready_data_t); > + > +static inline int sched_wakeup__(u32 pid, u32 tgid) > +{ > + u64 pid_tgid = (u64) tgid << 32 | pid; > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + u64 t = bpf_ktime_get_ns(); > + ready_start.update(&pid_tgid, &t); > + return 0; > +} > + > +RAW_TRACEPOINT_PROBE(sched_wakeup) > +{ > + struct task_struct *t = (struct task_struct *)ctx->args[0]; > + return sched_wakeup__(t->pid, t->tgid); > +} > + > +RAW_TRACEPOINT_PROBE(sched_wakeup_new) > +{ > + struct task_struct *t = (struct task_struct *)ctx->args[0]; > + return sched_wakeup__(t->pid, t->tgid); > +} > + > +RAW_TRACEPOINT_PROBE(sched_switch) > +{ > + struct task_struct *prev = (struct task_struct *)ctx->args[1]; > + struct task_struct *next= (struct task_struct *)ctx->args[2]; > + u64 ktime = 0; > + > + if (!capture_enabled__()) > + return 0; > + > + if (prev-><STATE_FIELD> == TASK_RUNNING && prev->tgid == MONITOR_PID) > + sched_wakeup__(prev->pid, prev->tgid); > + > + if (prev->tgid == MONITOR_PID) { > + ktime = bpf_ktime_get_ns(); > + thread_stop_run(prev->pid, prev->tgid, ktime); > + } > + > + u64 pid_tgid = (u64)next->tgid << 32 | next->pid; > + > + if (next->tgid != MONITOR_PID) > + return 0; > + > + if (ktime == 0) > + ktime = bpf_ktime_get_ns(); > + > + u64 *start_ns = ready_start.lookup(&pid_tgid); > + > + if (start_ns && *start_ns != 0) { > + > + struct ready_data_t *val, zero = {}; > + struct pid_tid_key_t key = { .pid = next->tgid, > + .tid = next->pid }; > + > + val = ready_data.lookup_or_try_init(&key, &zero); > + if (val) { > + u64 delta = ktime - *start_ns; > + val->count++; > + val->total_ns += delta; > + if (val->worst_ns == 0 || delta > val->worst_ns) > + val->worst_ns = delta; > + } > + *start_ns = 0; > + } > + > + thread_start_run(pid_tgid, ktime); > + return 0; > +} > + > + > +/* 
> + * For measuring the hard irq time, we need the following. > + */ > +struct hardirq_start_data_t { > + u64 start_ns; > + char irq_name[32]; > +}; > + > +struct hardirq_data_t { > + u64 count; > + u64 total_ns; > + u64 worst_ns; > +}; > + > +struct hardirq_data_key_t { > + u32 pid; > + u32 tid; > + char irq_name[32]; > +}; > + > +BPF_HASH(hardirq_start, u64, struct hardirq_start_data_t); > +BPF_HASH(hardirq_data, struct hardirq_data_key_t, struct hardirq_data_t); > + > +TRACEPOINT_PROBE(irq, irq_handler_entry) > +{ > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + struct hardirq_start_data_t data = {}; > + > + data.start_ns = bpf_ktime_get_ns(); > + TP_DATA_LOC_READ_STR(&data.irq_name, name, sizeof(data.irq_name)); > + hardirq_start.update(&pid_tgid, &data); > + return 0; > +} > + > +TRACEPOINT_PROBE(irq, irq_handler_exit) > +{ > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + struct hardirq_start_data_t *data; > + data = hardirq_start.lookup(&pid_tgid); > + if (!data || data->start_ns == 0) > + return 0; > + > + if (args->ret != IRQ_NONE) { > + struct hardirq_data_t *val, zero = {}; > + struct hardirq_data_key_t key = { .pid = pid_tgid >> 32, > + .tid = (u32)pid_tgid }; > + > + bpf_probe_read_kernel(&key.irq_name, sizeof(key.irq_name), > + data->irq_name); > + val = hardirq_data.lookup_or_try_init(&key, &zero); > + if (val) { > + u64 delta = bpf_ktime_get_ns() - data->start_ns; > + val->count++; > + val->total_ns += delta; > + if (val->worst_ns == 0 || delta > val->worst_ns) > + val->worst_ns = delta; > + } > + } > + > + data->start_ns = 0; > + return 0; > +} > + > + > +/* > + * For measuring the soft irq time, we need the following. 
> + */ > +struct softirq_start_data_t { > + u64 start_ns; > + u32 vec_nr; > +}; > + > +struct softirq_data_t { > + u64 count; > + u64 total_ns; > + u64 worst_ns; > +}; > + > +struct softirq_data_key_t { > + u32 pid; > + u32 tid; > + u32 vec_nr; > +}; > + > +BPF_HASH(softirq_start, u64, struct softirq_start_data_t); > +BPF_HASH(softirq_data, struct softirq_data_key_t, struct softirq_data_t); > + > +TRACEPOINT_PROBE(irq, softirq_entry) > +{ > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + struct softirq_start_data_t data = {}; > + > + data.start_ns = bpf_ktime_get_ns(); > + data.vec_nr = args->vec; > + softirq_start.update(&pid_tgid, &data); > + return 0; > +} > + > +TRACEPOINT_PROBE(irq, softirq_exit) > +{ > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + struct softirq_start_data_t *data; > + data = softirq_start.lookup(&pid_tgid); > + if (!data || data->start_ns == 0) > + return 0; > + > + struct softirq_data_t *val, zero = {}; > + struct softirq_data_key_t key = { .pid = pid_tgid >> 32, > + .tid = (u32)pid_tgid, > + .vec_nr = data->vec_nr}; > + > + val = softirq_data.lookup_or_try_init(&key, &zero); > + if (val) { > + u64 delta = bpf_ktime_get_ns() - data->start_ns; > + val->count++; > + val->total_ns += delta; > + if (val->worst_ns == 0 || delta > val->worst_ns) > + val->worst_ns = delta; > + } > + > + data->start_ns = 0; > + return 0; > +} > +""" > + > + > +# > +# time_ns() > +# > +try: > + from time import time_ns > +except ImportError: > + # For compatibility with Python <= v3.6. > + def time_ns(): > + now = datetime.datetime.now() > + return int(now.timestamp() * 1e9) > + > + > +# > +# Probe class to use for the start/stop triggers > +# > +class Probe(object): > + ''' > + The goal for this object is to support as many as possible > + probe/events as supported by BCC. 
See > + https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments > + ''' > + def __init__(self, probe, pid=None): > + self.pid = pid > + self.text_probe = probe > + self._parse_text_probe() > + > + def __str__(self): > + if self.probe_type == "usdt": > + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, > + self.usdt_provider, self.usdt_probe) > + elif self.probe_type == "trace": > + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, > + self.trace_system, self.trace_event) > + elif self.probe_type == "kprobe" or self.probe_type == "kretprobe": > + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, > + self.kprobe_function) > + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": > + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, > + self.uprobe_function) > + else: > + return "[{}] <{}:unknown probe>".format(self.text_probe, > + self.probe_type) > + > + def _raise(self, error): > + raise ValueError("[{}]; {}".format(self.text_probe, error)) > + > + def _verify_kprobe_probe(self): > + # Nothing to verify for now, just return. > + return > + > + def _verify_trace_probe(self): > + # Nothing to verify for now, just return. > + return > + > + def _verify_uprobe_probe(self): > + # Nothing to verify for now, just return. 
> + return > + > + def _verify_usdt_probe(self): > + if not self.pid: > + self._raise("USDT probes need a valid PID.") > + > + usdt = USDT(pid=self.pid) > + > + for probe in usdt.enumerate_probes(): > + if probe.provider.decode('utf-8') == self.usdt_provider and \ > + probe.name.decode('utf-8') == self.usdt_probe: > + return > + > + self._raise("Can't find UDST probe '{}:{}'".format(self.usdt_provider, > + self.usdt_probe)) > + > + def _parse_text_probe(self): > + ''' > + The text probe format is defined as follows: > + <probe_type>:<probe_specific> > + > + Types: > + USDT: u|usdt:<provider>:<probe> > + TRACE: t|trace:<system>:<event> > + KPROBE: k|kprobe:<kernel_function> > + KRETPROBE: kr|kretprobe:<kernel_function> > + UPROBE: up|uprobe:<function> > + URETPROBE: ur|uretprobe:<function> > + ''' > + args = self.text_probe.split(":") > + if len(args) <= 1: > + self._raise("Can't extract probe type.") > + > + if args[0] not in ["k", "kprobe", "kr", "kretprobe", "t", "trace", > + "u", "usdt", "up", "uprobe", "ur", "uretprobe"]: > + self._raise("Invalid probe type '{}'".format(args[0])) > + > + self.probe_type = "kprobe" if args[0] == "k" else args[0] > + self.probe_type = "kretprobe" if args[0] == "kr" else self.probe_type > + self.probe_type = "trace" if args[0] == "t" else self.probe_type > + self.probe_type = "usdt" if args[0] == "u" else self.probe_type > + self.probe_type = "uprobe" if args[0] == "up" else self.probe_type > + self.probe_type = "uretprobe" if args[0] == "ur" else self.probe_type > + > + if self.probe_type == "usdt": > + if len(args) != 3: > + self._raise("Invalid number of arguments for USDT") > + > + self.usdt_provider = args[1] > + self.usdt_probe = args[2] > + self._verify_usdt_probe() > + > + elif self.probe_type == "trace": > + if len(args) != 3: > + self._raise("Invalid number of arguments for TRACE") > + > + self.trace_system = args[1] > + self.trace_event = args[2] > + self._verify_trace_probe() > + > + elif self.probe_type == "kprobe" or 
self.probe_type == "kretprobe": > + if len(args) != 2: > + self._raise("Invalid number of arguments for K(RET)PROBE") > + self.kprobe_function = args[1] > + self._verify_kprobe_probe() > + > + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": > + if len(args) != 2: > + self._raise("Invalid number of arguments for U(RET)PROBE") > + self.uprobe_function = args[1] > + self._verify_uprobe_probe() > + > + def _get_kprobe_c_code(self, function_name, function_content): > + # > + # The kprobe__* do not require a function name, so it's > + # ignored in the code generation. > + # > + return """ > +int {}__{}(struct pt_regs *ctx) {{ > + {} > +}} > +""".format(self.probe_type, self.kprobe_function, function_content) > + > + def _get_trace_c_code(self, function_name, function_content): > + # > + # The TRACEPOINT_PROBE() do not require a function name, so it's > + # ignored in the code generation. > + # > + return """ > +TRACEPOINT_PROBE({},{}) {{ > + {} > +}} > +""".format(self.trace_system, self.trace_event, function_content) > + > + def _get_uprobe_c_code(self, function_name, function_content): > + return """ > +int {}(struct pt_regs *ctx) {{ > + {} > +}} > +""".format(function_name, function_content) > + > + def _get_usdt_c_code(self, function_name, function_content): > + return """ > +int {}(struct pt_regs *ctx) {{ > + {} > +}} > +""".format(function_name, function_content) > + > + def get_c_code(self, function_name, function_content): > + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': > + return self._get_kprobe_c_code(function_name, function_content) > + elif self.probe_type == 'trace': > + return self._get_trace_c_code(function_name, function_content) > + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': > + return self._get_uprobe_c_code(function_name, function_content) > + elif self.probe_type == 'usdt': > + return self._get_usdt_c_code(function_name, function_content) > + > + return "" > + > + def 
probe_name(self): > + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': > + return "{}".format(self.kprobe_function) > + elif self.probe_type == 'trace': > + return "{}:{}".format(self.trace_system, > + self.trace_event) > + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': > + return "{}".format(self.uprobe_function) > + elif self.probe_type == 'usdt': > + return "{}:{}".format(self.usdt_provider, > + self.usdt_probe) > + > + return "" > + > + > +# > +# event_to_dict() > +# > +def event_to_dict(event): > + return dict([(field, getattr(event, field)) > + for (field, _) in event._fields_ > + if isinstance(getattr(event, field), (int, bytes))]) > + > + > +# > +# Event enum > +# > +Event = IntEnum("Event", ["SYSCALL", "START_TRIGGER", "STOP_TRIGGER"], > + start=0) > + > + > +# > +# process_event() > +# > +def process_event(ctx, data, size): > + global start_trigger_ts > + global stop_trigger_ts > + > + event = bpf['events'].event(data) > + if event.id == Event.SYSCALL: > + syscall_events.append({"tid": event.tid, > + "ts_entry": event.entry_ts, > + "ts_exit": event.ts, > + "syscall": event.syscall, > + "user_stack_id": event.user_stack_id, > + "kernel_stack_id": event.kernel_stack_id}) > + elif event.id == Event.START_TRIGGER: > + # > + # This event would have started the trigger already, so all we need to > + # do is record the start timestamp. > + # > + start_trigger_ts = event.ts > + > + elif event.id == Event.STOP_TRIGGER: > + # > + # This event would have stopped the trigger already, so all we need to > + # do is record the start timestamp. 
> + stop_trigger_ts = event.ts > + > + > +# > +# next_power_of_two() > +# > +def next_power_of_two(val): > + np = 1 > + while np < val: > + np *= 2 > + return np > + > + > +# > +# unsigned_int() > +# > +def unsigned_int(value): > + try: > + value = int(value) > + except ValueError: > + raise argparse.ArgumentTypeError("must be an integer") > + > + if value < 0: > + raise argparse.ArgumentTypeError("must be positive") > + return value > + > + > +# > +# unsigned_nonzero_int() > +# > +def unsigned_nonzero_int(value): > + value = unsigned_int(value) > + if value == 0: > + raise argparse.ArgumentTypeError("must be nonzero") > + return value > + > + > +# > +# get_thread_name() > +# > +def get_thread_name(pid, tid): > + try: > + with open(f"/proc/{pid}/task/{tid}/comm", encoding="utf8") as f: > + return f.readline().strip("\n") > + except FileNotFoundError: > + pass > + > + return f"<unknown:{pid}/{tid}>" > + > + > +# > +# get_vec_nr_name() > +# > +def get_vec_nr_name(vec_nr): > + known_vec_nr = ["hi", "timer", "net_tx", "net_rx", "block", "irq_poll", > + "tasklet", "sched", "hrtimer", "rcu"] > + > + if vec_nr < 0 or vec_nr > len(known_vec_nr): > + return f"<unknown:{vec_nr}>" > + > + return known_vec_nr[vec_nr] > + > + > +# > +# start/stop/reset capture > +# > +def start_capture(): > + bpf["capture_on"][ct.c_int(0)] = ct.c_int(1) > + > + > +def stop_capture(force=False): > + if force: > + bpf["capture_on"][ct.c_int(0)] = ct.c_int(0xffff) > + else: > + bpf["capture_on"][ct.c_int(0)] = ct.c_int(0) > + > + > +def capture_running(): > + return bpf["capture_on"][ct.c_int(0)].value == 1 > + > + > +def reset_capture(): > + bpf["syscall_start"].clear() > + bpf["syscall_data"].clear() > + bpf["run_start"].clear() > + bpf["run_data"].clear() > + bpf["ready_start"].clear() > + bpf["ready_data"].clear() > + bpf["hardirq_start"].clear() > + bpf["hardirq_data"].clear() > + bpf["softirq_start"].clear() > + bpf["softirq_data"].clear() > + bpf["stack_traces"].clear() > + > + > +# > +# 
Display timestamp > +# > +def print_timestamp(msg): > + ltz = datetime.datetime.now() > + utc = ltz.astimezone(pytz.utc) > + time_string = "{} @{} ({} UTC)".format( > + msg, ltz.isoformat(), utc.strftime("%H:%M:%S")) > + print(time_string) > + > + > +# > +# process_results() > +# > +def process_results(syscall_events=None, trigger_delta=None): > + if trigger_delta: > + print_timestamp("# Triggered sample dump, stop-start delta {:,} ns". > + format(trigger_delta)) > + else: > + print_timestamp("# Sample dump") > + > + # > + # First get a list of all threads we need to report on. > + # > + threads_syscall = {k.tid for k, _ in bpf["syscall_data"].items() > + if k.syscall != 0xffffffff} > + > + threads_run = {k.tid for k, _ in bpf["run_data"].items() > + if k.pid != 0xffffffff} > + > + threads_ready = {k.tid for k, _ in bpf["ready_data"].items() > + if k.pid != 0xffffffff} > + > + threads_hardirq = {k.tid for k, _ in bpf["hardirq_data"].items() > + if k.pid != 0xffffffff} > + > + threads_softirq = {k.tid for k, _ in bpf["softirq_data"].items() > + if k.pid != 0xffffffff} > + > + threads = sorted(threads_syscall | threads_run | threads_ready | > + threads_hardirq | threads_softirq, > + key=lambda x: get_thread_name(options.pid, x)) > + > + # > + # Print header... > + # > + print("{:10} {:16} {}".format("TID", "THREAD", "<RESOURCE SPECIFIC>")) > + print("{:10} {:16} {}".format("-" * 10, "-" * 16, "-" * 76)) > + indent = 28 * " " > + > + # > + # Print all events/statistics per threads. 
> + # > + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0] > + for thread in threads: > + > + if thread != threads[0]: > + print("") > + > + # > + # SYSCALL_STATISTICS > + # > + print("{:10} {:16} {}\n{}{:20} {:>6} {:>10} {:>16} {:>16}".format( > + thread, get_thread_name(options.pid, thread), > + "[SYSCALL STATISTICS]", indent, > + "NAME", "NUMBER", "COUNT", "TOTAL ns", "MAX ns")) > + > + total_count = 0 > + total_ns = 0 > + for k, v in sorted(filter(lambda t: t[0].tid == thread, > + bpf["syscall_data"].items()), > + key=lambda kv: -kv[1].total_ns): > + > + print("{}{:20.20} {:6} {:10} {:16,} {:16,}".format( > + indent, syscall_name(k.syscall).decode('utf-8'), k.syscall, > + v.count, v.total_ns, v.worst_ns)) > + if k.syscall != poll_id: > + total_count += v.count > + total_ns += v.total_ns > + > + if total_count > 0: > + print("{}{:20.20} {:6} {:10} {:16,}".format( > + indent, "TOTAL( - poll):", "", total_count, total_ns)) > + > + # > + # THREAD RUN STATISTICS > + # > + print("\n{:10} {:16} {}\n{}{:10} {:>16} {:>16} {:>16}".format( > + "", "", "[THREAD RUN STATISTICS]", indent, > + "SCHED_CNT", "TOTAL ns", "MIN ns", "MAX ns")) > + > + for k, v in filter(lambda t: t[0].tid == thread, > + bpf["run_data"].items()): > + > + print("{}{:10} {:16,} {:16,} {:16,}".format( > + indent, v.count, v.total_ns, v.min_ns, v.max_ns)) > + > + # > + # THREAD READY STATISTICS > + # > + print("\n{:10} {:16} {}\n{}{:10} {:>16} {:>16}".format( > + "", "", "[THREAD READY STATISTICS]", indent, > + "SCHED_CNT", "TOTAL ns", "MAX ns")) > + > + for k, v in filter(lambda t: t[0].tid == thread, > + bpf["ready_data"].items()): > + > + print("{}{:10} {:16,} {:16,}".format( > + indent, v.count, v.total_ns, v.worst_ns)) > + > + # > + # HARD IRQ STATISTICS > + # > + total_ns = 0 > + total_count = 0 > + header_printed = False > + for k, v in sorted(filter(lambda t: t[0].tid == thread, > + bpf["hardirq_data"].items()), > + key=lambda kv: -kv[1].total_ns): > + > + if not header_printed: > 
+ print("\n{:10} {:16} {}\n{}{:20} {:>10} {:>16} {:>16}". > + format("", "", "[HARD IRQ STATISTICS]", indent, > + "NAME", "COUNT", "TOTAL ns", "MAX ns")) > + header_printed = True > + > + print("{}{:20.20} {:10} {:16,} {:16,}".format( > + indent, k.irq_name.decode('utf-8'), > + v.count, v.total_ns, v.worst_ns)) > + > + total_count += v.count > + total_ns += v.total_ns > + > + if total_count > 0: > + print("{}{:20.20} {:10} {:16,}".format( > + indent, "TOTAL:", total_count, total_ns)) > + > + # > + # SOFT IRQ STATISTICS > + # > + total_ns = 0 > + total_count = 0 > + header_printed = False > + for k, v in sorted(filter(lambda t: t[0].tid == thread, > + bpf["softirq_data"].items()), > + key=lambda kv: -kv[1].total_ns): > + > + if not header_printed: > + print("\n{:10} {:16} {}\n" > + "{}{:20} {:>7} {:>10} {:>16} {:>16}". > + format("", "", "[SOFT IRQ STATISTICS]", indent, > + "NAME", "VECT_NR", "COUNT", "TOTAL ns", "MAX ns")) > + header_printed = True > + > + print("{}{:20.20} {:>7} {:10} {:16,} {:16,}".format( > + indent, get_vec_nr_name(k.vec_nr), k.vec_nr, > + v.count, v.total_ns, v.worst_ns)) > + > + total_count += v.count > + total_ns += v.total_ns > + > + if total_count > 0: > + print("{}{:20.20} {:7} {:10} {:16,}".format( > + indent, "TOTAL:", "", total_count, total_ns)) > + > + # > + # Print events > + # > + lost_stack_traces = 0 > + if syscall_events: > + stack_traces = bpf.get_table("stack_traces") > + > + print("\n\n# SYSCALL EVENTS:" > + "\n{}{:>19} {:>19} {:>10} {:16} {:>10} {}".format( > + 2 * " ", "ENTRY (ns)", "EXIT (ns)", "TID", "COMM", > + "DELTA (us)", "SYSCALL")) > + print("{}{:19} {:19} {:10} {:16} {:10} {}".format( > + 2 * " ", "-" * 19, "-" * 19, "-" * 10, "-" * 16, > + "-" * 10, "-" * 16)) > + for event in syscall_events: > + print("{}{:19} {:19} {:10} {:16} {:10,} {}".format( > + " " * 2, > + event["ts_entry"], event["ts_exit"], event["tid"], > + get_thread_name(options.pid, event["tid"]), > + int((event["ts_exit"] - event["ts_entry"]) / 
1000), > + syscall_name(event["syscall"]).decode('utf-8'))) > + # > + # Not sure where to put this, but I'll add some info on stack > + # traces here... Userspace stack traces are very limited due to > + # the fact that bcc does not support dwarf backtraces. As OVS > + # gets compiled without frame pointers we will not see much. > + # If however, OVS does get built with frame pointers, we should not > + # use the BPF_STACK_TRACE_BUILDID as it does not seem to handle > + # the debug symbols correctly. Also, note that for kernel > + # traces you should not use BPF_STACK_TRACE_BUILDID, so two > + # buffers are needed. > + # > + # Some info on manual dwarf walk support: > + # https://github.com/iovisor/bcc/issues/3515 > + # https://github.com/iovisor/bcc/pull/4463 > + # > + if options.stack_trace_size == 0: > + continue > + > + if event['kernel_stack_id'] < 0 or event['user_stack_id'] < 0: > + lost_stack_traces += 1 > + > + kernel_stack = stack_traces.walk(event['kernel_stack_id']) \ > + if event['kernel_stack_id'] >= 0 else [] > + user_stack = stack_traces.walk(event['user_stack_id']) \ > + if event['user_stack_id'] >= 0 else [] > + > + for addr in kernel_stack: > + print("{}{}".format( > + " " * 10, > + bpf.ksym(addr, show_module=True, > + show_offset=True).decode('utf-8', 'replace'))) > + > + for addr in user_stack: > + addr_str = bpf.sym(addr, options.pid, show_module=True, > + show_offset=True).decode('utf-8', 'replace') > + > + if addr_str == "[unknown]": > + addr_str += " 0x{:x}".format(addr) > + > + print("{}{}".format(" " * 10, addr_str)) > + > + # > + # Print any footer messages. 
> + # > + if lost_stack_traces > 0: > + print("\n#WARNING: We where not able to display {} stack traces!\n" > + "# Consider increasing the stack trace size using\n" > + "# the '--stack-trace-size' option.\n" > + "# Note that this can also happen due to a stack id\n" > + "# collision.".format(lost_stack_traces)) > + > + > +# > +# main() > +# > +def main(): > + # > + # Don't like these globals, but ctx passing does not seem to work with the > + # existing open_ring_buffer() API :( > + # > + global bpf > + global options > + global syscall_events > + global start_trigger_ts > + global stop_trigger_ts > + > + start_trigger_ts = 0 > + stop_trigger_ts = 0 > + > + # > + # Argument parsing > + # > + parser = argparse.ArgumentParser() > + > + parser.add_argument("-D", "--debug", > + help="Enable eBPF debugging", > + type=int, const=0x3f, default=0, nargs='?') > + parser.add_argument("-p", "--pid", metavar="VSWITCHD_PID", > + help="ovs-vswitch's PID", > + type=unsigned_int, default=None) > + parser.add_argument("-s", "--syscall-events", metavar="DURATION_NS", > + help="Record syscall events that take longer than " > + "DURATION_NS. 
Omit the duration value to record all " > + "syscall events", > + type=unsigned_int, const=0, default=None, nargs='?') > + parser.add_argument("--buffer-page-count", > + help="Number of BPF ring buffer pages, default 1024", > + type=unsigned_int, default=1024, metavar="NUMBER") > + parser.add_argument("--sample-count", > + help="Number of sample runs, default 1", > + type=unsigned_nonzero_int, default=1, metavar="RUNS") > + parser.add_argument("--sample-interval", > + help="Delay between sample runs, default 0", > + type=float, default=0, metavar="SECONDS") > + parser.add_argument("--sample-time", > + help="Sample time, default 0.5 seconds", > + type=float, default=0.5, metavar="SECONDS") > + parser.add_argument("--skip-syscall-poll-events", > + help="Skip poll() syscalls with --syscall-events", > + action="store_true") > + parser.add_argument("--stack-trace-size", > + help="Number of unique stack traces that can be " > + "recorded, default 4096. 0 to disable", > + type=unsigned_int, default=4096) > + parser.add_argument("--start-trigger", metavar="TRIGGER", > + help="Start trigger, see documentation for details", > + type=str, default=None) > + parser.add_argument("--stop-trigger", metavar="TRIGGER", > + help="Stop trigger, see documentation for details", > + type=str, default=None) > + parser.add_argument("--trigger-delta", metavar="DURATION_NS", > + help="Only report event when the trigger duration > " > + "DURATION_NS, default 0 (all events)", > + type=unsigned_int, const=0, default=0, nargs='?') > + > + options = parser.parse_args() > + > + # > + # Find the PID of the ovs-vswitchd daemon if not specified. > + # > + if not options.pid: > + for proc in psutil.process_iter(): > + if 'ovs-vswitchd' in proc.name(): > + if options.pid: > + print("ERROR: Multiple ovs-vswitchd daemons running, " > + "use the -p option!") > + sys.exit(os.EX_NOINPUT) > + > + options.pid = proc.pid > + > + # > + # Error checking on input parameters. 
> +    #
> +    if not options.pid:
> +        print("ERROR: Failed to find ovs-vswitchd's PID!")
> +        sys.exit(os.EX_UNAVAILABLE)
> +
> +    options.buffer_page_count = next_power_of_two(options.buffer_page_count)
> +
> +    #
> +    # Make sure we are running as root, or else we can not attach the probes.
> +    #
> +    if os.geteuid() != 0:
> +        print("ERROR: We need to run as root to attach probes!")
> +        sys.exit(os.EX_NOPERM)
> +
> +    #
> +    # Set up any of the start/stop triggers.
> +    #
> +    if options.start_trigger is not None:
> +        try:
> +            start_trigger = Probe(options.start_trigger, pid=options.pid)
> +        except ValueError as e:
> +            print(f"ERROR: Invalid start trigger {str(e)}")
> +            sys.exit(os.EX_CONFIG)
> +    else:
> +        start_trigger = None
> +
> +    if options.stop_trigger is not None:
> +        try:
> +            stop_trigger = Probe(options.stop_trigger, pid=options.pid)
> +        except ValueError as e:
> +            print(f"ERROR: Invalid stop trigger {str(e)}")
> +            sys.exit(os.EX_CONFIG)
> +    else:
> +        stop_trigger = None
> +
> +    #
> +    # Attach probes to the running process. 
> +    #
> +    source = EBPF_SOURCE.replace("<EVENT_ENUM>", "\n".join(
> +        ["    EVENT_{} = {},".format(
> +            event.name, event.value) for event in Event]))
> +    source = source.replace("<BUFFER_PAGE_CNT>",
> +                            str(options.buffer_page_count))
> +    source = source.replace("<MONITOR_PID>", str(options.pid))
> +
> +    if BPF.kernel_struct_has_field(b'task_struct', b'state') == 1:
> +        source = source.replace('<STATE_FIELD>', 'state')
> +    else:
> +        source = source.replace('<STATE_FIELD>', '__state')
> +
> +    poll_id = [k for k, v in syscalls.items() if v == b'poll'][0]
> +    if options.syscall_events is None:
> +        syscall_trace_events = "false"
> +    elif options.syscall_events == 0:
> +        if not options.skip_syscall_poll_events:
> +            syscall_trace_events = "true"
> +        else:
> +            syscall_trace_events = f"args->id != {poll_id}"
> +    else:
> +        syscall_trace_events = "delta > {}".format(options.syscall_events)
> +        if options.skip_syscall_poll_events:
> +            syscall_trace_events += f" && args->id != {poll_id}"
> +
> +    source = source.replace("<SYSCALL_TRACE_EVENTS>",
> +                            syscall_trace_events)
> +
> +    source = source.replace("<STACK_TRACE_SIZE>",
> +                            str(options.stack_trace_size))
> +
> +    source = source.replace("<STACK_TRACE_ENABLED>", "true"
> +                            if options.stack_trace_size > 0 else "false")
> +
> +    #
> +    # Handle start/stop probes.
> +    #
> +    if start_trigger:
> +        source = source.replace("<START_TRIGGER>",
> +                                start_trigger.get_c_code(
> +                                    "start_trigger_probe",
> +                                    "return start_trigger();"))
> +    else:
> +        source = source.replace("<START_TRIGGER>", "")
> +
> +    if stop_trigger:
> +        source = source.replace("<STOP_TRIGGER>",
> +                                stop_trigger.get_c_code(
> +                                    "stop_trigger_probe",
> +                                    "return stop_trigger();"))
> +    else:
> +        source = source.replace("<STOP_TRIGGER>", "")
> +
> +    #
> +    # Set up USDT or other probes that need handling through the BPF class. 
> + # > + usdt = USDT(pid=int(options.pid)) > + try: > + if start_trigger and start_trigger.probe_type == 'usdt': > + usdt.enable_probe(probe=start_trigger.probe_name(), > + fn_name="start_trigger_probe") > + if stop_trigger and stop_trigger.probe_type == 'usdt': > + usdt.enable_probe(probe=stop_trigger.probe_name(), > + fn_name="stop_trigger_probe") > + > + except USDTException as e: > + print("ERROR: {}".format( > + (re.sub('^', ' ' * 7, str(e), flags=re.MULTILINE)).strip(). > + replace("--with-dtrace or --enable-dtrace", > + "--enable-usdt-probes"))) > + sys.exit(os.EX_OSERR) > + > + bpf = BPF(text=source, usdt_contexts=[usdt], debug=options.debug) > + > + if start_trigger: > + try: > + if start_trigger.probe_type == "uprobe": > + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe", > + sym=start_trigger.probe_name(), > + fn_name="start_trigger_probe", > + pid=options.pid) > + > + if start_trigger.probe_type == "uretprobe": > + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe", > + sym=start_trigger.probe_name(), > + fn_name="start_trigger_probe", > + pid=options.pid) > + except Exception as e: > + print("ERROR: Failed attaching uprobe start trigger " > + f"'{start_trigger.probe_name()}';\n {str(e)}") > + sys.exit(os.EX_OSERR) > + > + if stop_trigger: > + try: > + if stop_trigger.probe_type == "uprobe": > + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe", > + sym=stop_trigger.probe_name(), > + fn_name="stop_trigger_probe", > + pid=options.pid) > + > + if stop_trigger.probe_type == "uretprobe": > + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe", > + sym=stop_trigger.probe_name(), > + fn_name="stop_trigger_probe", > + pid=options.pid) > + except Exception as e: > + print("ERROR: Failed attaching uprobe stop trigger" > + f"'{stop_trigger.probe_name()}';\n {str(e)}") > + sys.exit(os.EX_OSERR) > + > + # > + # If no triggers are configured use the delay configuration > + # > + bpf['events'].open_ring_buffer(process_event) > + > + sample_count = 0 > + while 
sample_count < options.sample_count: > + sample_count += 1 > + syscall_events = [] > + > + if not options.start_trigger: > + print_timestamp("# Start sampling") > + start_capture() > + stop_time = -1 if options.stop_trigger else \ > + time_ns() + options.sample_time * 1000000000 > + else: > + # For start triggers the stop time depends on the start trigger > + # time, or depends on the stop trigger if configured. > + stop_time = -1 if options.stop_trigger else 0 > + > + while True: > + keyboard_interrupt = False > + try: > + last_start_ts = start_trigger_ts > + last_stop_ts = stop_trigger_ts > + > + if stop_time > 0: > + delay = int((stop_time - time_ns()) / 1000000) > + if delay <= 0: > + break > + else: > + delay = -1 > + > + bpf.ring_buffer_poll(timeout=delay) > + > + if stop_time <= 0 and last_start_ts != start_trigger_ts: > + print_timestamp( > + "# Start sampling (trigger@{})".format( > + start_trigger_ts)) > + > + if not options.stop_trigger: > + stop_time = time_ns() + \ > + options.sample_time * 1000000000 > + > + if last_stop_ts != stop_trigger_ts: > + break > + > + except KeyboardInterrupt: > + keyboard_interrupt = True > + break > + > + if options.stop_trigger and not capture_running(): > + print_timestamp("# Stop sampling (trigger@{})".format( > + stop_trigger_ts)) > + else: > + print_timestamp("# Stop sampling") > + > + if stop_trigger_ts != 0 and start_trigger_ts != 0: > + trigger_delta = stop_trigger_ts - start_trigger_ts > + else: > + trigger_delta = None > + > + if not trigger_delta or trigger_delta >= options.trigger_delta: > + stop_capture(force=True) # Prevent a new trigger to start. 
> + process_results(syscall_events=syscall_events, > + trigger_delta=trigger_delta) > + elif trigger_delta: > + sample_count -= 1 > + print_timestamp("# Sample dump skipped, delta {:,} ns".format( > + trigger_delta)) > + > + reset_capture() > + stop_capture() > + > + if keyboard_interrupt: > + break > + > + if options.sample_interval > 0: > + time.sleep(options.sample_interval) > + > + # > + # Report lost events. > + # > + dropcnt = bpf.get_table("dropcnt") > + for k in dropcnt.keys(): > + count = dropcnt.sum(k).value > + if k.value == 0 and count > 0: > + print("\n# WARNING: Not all events were captured, {} were " > + "dropped!\n# Increase the BPF ring buffer size " > + "with the --buffer-page-count option.".format(count)) > + > + if (options.sample_count > 1): > + trigger_miss = bpf.get_table("trigger_miss") > + for k in trigger_miss.keys(): > + count = trigger_miss.sum(k).value > + if k.value == 0 and count > 0: > + print("\n# WARNING: Not all start triggers were successful. " > + "{} were missed due to\n# slow userspace " > + "processing!".format(count)) > + > + > +# > +# Start main() as the default entry point... > +# > +if __name__ == '__main__': > + main() > diff --git a/utilities/usdt-scripts/kernel_delay.rst b/utilities/usdt-scripts/kernel_delay.rst > new file mode 100644 > index 000000000..0ebd30afb > --- /dev/null > +++ b/utilities/usdt-scripts/kernel_delay.rst > @@ -0,0 +1,596 @@ > +Troubleshooting Open vSwitch: Is the kernel to blame? > +===================================================== > +Often, when troubleshooting Open vSwitch (OVS) in the field, you might be left > +wondering if the issue is really OVS-related, or if it's a problem with the > +kernel being overloaded. Messages in the log like > +``Unreasonably long XXXXms poll interval`` might suggest it's OVS, but from > +experience, these are mostly related to an overloaded Linux Kernel. 
> +The kernel_delay.py tool can help you quickly identify if the focus of your
> +investigation should be OVS or the Linux kernel.
> +
> +
> +Introduction
> +------------
> +``kernel_delay.py`` consists of a Python script that uses the BCC [#BCC]_
> +framework to install eBPF probes. The data the eBPF probes collect is
> +analyzed and presented to the user by the Python script. Some of the presented
> +data can also be captured by the individual scripts included in the BCC [#BCC]_
> +framework.
> +
> +kernel_delay.py has two modes of operation:
> +
> +- In **time mode**, the tool runs for a specific time and collects the
> +  information.
> +- In **trigger mode**, event collection can be started and/or stopped based on
> +  a specific eBPF probe. Currently, the following probes are supported:
> +
> +  - USDT probes
> +  - Kernel tracepoints
> +  - kprobe
> +  - kretprobe
> +  - uprobe
> +  - uretprobe
> +
> +
> +In addition, the ``--sample-count`` option exists to specify how many
> +iterations you would like to do. When using triggers, you can also ignore
> +samples shorter than a given number of nanoseconds with the
> +``--trigger-delta`` option. The latter might be useful when debugging Linux
> +syscalls which take a long time to complete. More on this later. Finally, you
> +can configure the delay between two sample runs with the ``--sample-interval``
> +option.
> +
> +Before getting into more details, you can run the tool without any options
> +to see what the output looks like. Notice that it will try to automatically
> +get the process ID of the running ``ovs-vswitchd``. You can override this
> +with the ``--pid`` option.
> +
> +.. 
code-block:: console > + > + $ sudo ./kernel_delay.py > + # Start sampling @2023-06-08T12:17:22.725127 (10:17:22 UTC) > + # Stop sampling @2023-06-08T12:17:23.224781 (10:17:23 UTC) > + # Sample dump @2023-06-08T12:17:23.224855 (10:17:23 UTC) > + TID THREAD <RESOURCE SPECIFIC> > + ---------- ---------------- ---------------------------------------------------------------------------- > + 27090 ovs-vswitchd [SYSCALL STATISTICS] > + <EDIT: REMOVED DATA FOR ovs-vswitchd THREAD> > + > + 31741 revalidator122 [SYSCALL STATISTICS] > + NAME NUMBER COUNT TOTAL ns MAX ns > + poll 7 5 184,193,176 184,191,520 > + recvmsg 47 494 125,208,756 310,331 > + futex 202 8 18,768,758 4,023,039 > + sendto 44 10 375,861 266,867 > + sendmsg 46 4 43,294 11,213 > + write 1 1 5,949 5,949 > + getrusage 98 1 1,424 1,424 > + read 0 1 1,292 1,292 > + TOTAL( - poll): 519 144,405,334 > + > + [THREAD RUN STATISTICS] > + SCHED_CNT TOTAL ns MIN ns MAX ns > + 6 136,764,071 1,480 115,146,424 > + > + [THREAD READY STATISTICS] > + SCHED_CNT TOTAL ns MAX ns > + 7 11,334 6,636 > + > + [HARD IRQ STATISTICS] > + NAME COUNT TOTAL ns MAX ns > + eno8303-rx-1 1 3,586 3,586 > + TOTAL: 1 3,586 > + > + [SOFT IRQ STATISTICS] > + NAME VECT_NR COUNT TOTAL ns MAX ns > + net_rx 3 1 17,699 17,699 > + sched 7 6 13,820 3,226 > + rcu 9 16 13,586 1,554 > + timer 1 3 10,259 3,815 > + TOTAL: 26 55,364 > + > + > +By default, the tool will run for half a second in `time mode`. To extend this > +you can use the ``--sample-time`` option. > + > + > +What will it report > +------------------- > +The above sample output separates the captured data on a per-thread basis. > +For this, it displays the thread's id (``TID``) and name (``THREAD``), > +followed by resource-specific data. Which are: > + > +- ``SYSCALL STATISTICS`` > +- ``THREAD RUN STATISTICS`` > +- ``THREAD READY STATISTICS`` > +- ``HARD IRQ STATISTICS`` > +- ``SOFT IRQ STATISTICS`` > + > +The following sections will describe in detail what statistics they report. 
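(Aside for readers of this doc: the per-syscall aggregation behind the tables above can be illustrated with a small sketch. The flat list of ``(name, duration_ns)`` samples is a hypothetical input format used only for illustration — the script itself keeps these counters in eBPF maps.)

```python
from collections import defaultdict

def summarize_syscalls(samples):
    """Aggregate (name, duration_ns) samples into the COUNT / TOTAL ns /
    MAX ns columns of a SYSCALL STATISTICS table, plus a TOTAL( - poll)
    line mirroring the tool's output."""
    stats = defaultdict(lambda: [0, 0, 0])  # name -> [count, total_ns, max_ns]
    for name, ns in samples:
        entry = stats[name]
        entry[0] += 1
        entry[1] += ns
        entry[2] = max(entry[2], ns)
    # The grand total excludes poll, since poll mostly measures idle waiting.
    count = sum(e[0] for n, e in stats.items() if n != "poll")
    total_ns = sum(e[1] for n, e in stats.items() if n != "poll")
    return dict(stats), (count, total_ns)

stats, total = summarize_syscalls(
    [("poll", 184_191_520), ("recvmsg", 310_331),
     ("recvmsg", 250_000), ("futex", 4_023_039)])
```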
> +
> +
> +``SYSCALL STATISTICS``
> +~~~~~~~~~~~~~~~~~~~~~~
> +``SYSCALL STATISTICS`` tell you which Linux system calls were executed during
> +the measurement interval. This includes the number of times the syscall was
> +called (``COUNT``), the total time spent in the system calls (``TOTAL ns``),
> +and the worst-case duration of a single call (``MAX ns``).
> +
> +It also shows the total of all system calls, but it excludes the poll system
> +call, as the purpose of this call is to wait for activity on a set of sockets,
> +and usually, the thread gets swapped out.
> +
> +Note that it only counts calls that started and stopped during the
> +measurement interval!
> +
> +
> +``THREAD RUN STATISTICS``
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +``THREAD RUN STATISTICS`` tell you how long the thread was running on a CPU
> +during the measurement interval.
> +
> +Note that these statistics only count events where the thread started and
> +stopped running on a CPU during the measurement interval. For example, if
> +this was a PMD thread, you should see zero ``SCHED_CNT`` and ``TOTAL_ns``.
> +If not, there might be a misconfiguration.
> +
> +
> +``THREAD READY STATISTICS``
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +``THREAD READY STATISTICS`` tell you the time between the thread being ready
> +to run and it actually running on the CPU.
> +
> +Note that these statistics only count events where the thread was getting
> +ready to run and started running during the measurement interval.
> +
> +
> +``HARD IRQ STATISTICS``
> +~~~~~~~~~~~~~~~~~~~~~~~
> +``HARD IRQ STATISTICS`` tell you how much time was spent servicing hard
> +interrupts during the thread's run time.
> +
> +It shows the interrupt name (``NAME``), the number of interrupts (``COUNT``),
> +the total time spent in the interrupt handler (``TOTAL ns``), and the
> +worst-case duration (``MAX ns``). 
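(A quick way to read the ``HARD IRQ`` numbers, such as the 3,586 ns seen earlier, is to relate them to the length of the sampling window. The sketch below is plain arithmetic; the 0.5 s window corresponds to the tool's ``--sample-time`` default.)

```python
def irq_share(total_irq_ns, sample_time_s):
    """Fraction of the sampling window spent servicing interrupts."""
    return total_irq_ns / (sample_time_s * 1e9)

# 3,586 ns of hard-IRQ time within the default 0.5 s sample window is a
# negligible share of the interval.
share = irq_share(3586, 0.5)
```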
> +
> +
> +``SOFT IRQ STATISTICS``
> +~~~~~~~~~~~~~~~~~~~~~~~
> +``SOFT IRQ STATISTICS`` tell you how much time was spent servicing soft
> +interrupts during the thread's run time.
> +
> +It shows the interrupt name (``NAME``), vector number (``VECT_NR``), the
> +number of interrupts (``COUNT``), the total time spent in the interrupt
> +handler (``TOTAL ns``), and the worst-case duration (``MAX ns``).
> +
> +
> +The ``--syscall-events`` option
> +-------------------------------
> +In addition to reporting global syscall statistics in ``SYSCALL_STATISTICS``,
> +the tool can also report each individual syscall. This can be a useful
> +second step if the ``SYSCALL_STATISTICS`` show high latency numbers.
> +
> +All you need to do is add the ``--syscall-events`` option, with or without
> +the additional ``DURATION_NS`` parameter. The ``DURATION_NS`` parameter
> +allows you to exclude events that take less than the supplied time.
> +
> +The ``--skip-syscall-poll-events`` option allows you to exclude poll
> +syscalls from the report.
> +
> +Below is an example run; note that the resource-specific data is removed
> +to highlight the syscall events:
> +
> +.. code-block:: console
> +
> +    $ sudo ./kernel_delay.py --syscall-events 50000 --skip-syscall-poll-events
> +    # Start sampling @2023-06-13T17:10:46.460874 (15:10:46 UTC)
> +    # Stop sampling @2023-06-13T17:10:46.960727 (15:10:46 UTC)
> +    # Sample dump @2023-06-13T17:10:46.961033 (15:10:46 UTC)
> +    TID        THREAD           <RESOURCE SPECIFIC>
> +    ---------- ---------------- ----------------------------------------------------------------------------
> +    3359686    ipf_clean2       [SYSCALL STATISTICS]
> +    ...
> +    3359635    ovs-vswitchd     [SYSCALL STATISTICS]
> +    ...
> +    3359697    revalidator12    [SYSCALL STATISTICS]
> +    ...
> +    3359698    revalidator13    [SYSCALL STATISTICS]
> +    ...
> +    3359699    revalidator14    [SYSCALL STATISTICS]
> +    ...
> +    3359700    revalidator15    [SYSCALL STATISTICS]
> +    ... 
> +
> +    # SYSCALL EVENTS:
> +    ENTRY (ns)          EXIT (ns)           TID        COMM             DELTA (us) SYSCALL
> +    ------------------- ------------------- ---------- ---------------- ---------- ----------------
> +    2161821694935486    2161821695031201    3359699    revalidator14    95         futex
> +        syscall_exit_to_user_mode_prepare+0x161 [kernel]
> +        syscall_exit_to_user_mode_prepare+0x161 [kernel]
> +        syscall_exit_to_user_mode+0x9 [kernel]
> +        do_syscall_64+0x68 [kernel]
> +        entry_SYSCALL_64_after_hwframe+0x72 [kernel]
> +        __GI___lll_lock_wait+0x30 [libc.so.6]
> +        ovs_mutex_lock_at+0x18 [ovs-vswitchd]
> +        [unknown] 0x696c003936313a63
> +    2161821695276882    2161821695333687    3359698    revalidator13    56         futex
> +        syscall_exit_to_user_mode_prepare+0x161 [kernel]
> +        syscall_exit_to_user_mode_prepare+0x161 [kernel]
> +        syscall_exit_to_user_mode+0x9 [kernel]
> +        do_syscall_64+0x68 [kernel]
> +        entry_SYSCALL_64_after_hwframe+0x72 [kernel]
> +        __GI___lll_lock_wait+0x30 [libc.so.6]
> +        ovs_mutex_lock_at+0x18 [ovs-vswitchd]
> +        [unknown] 0x696c003134313a63
> +    2161821695275820    2161821695405733    3359700    revalidator15    129        futex
> +        syscall_exit_to_user_mode_prepare+0x161 [kernel]
> +        syscall_exit_to_user_mode_prepare+0x161 [kernel]
> +        syscall_exit_to_user_mode+0x9 [kernel]
> +        do_syscall_64+0x68 [kernel]
> +        entry_SYSCALL_64_after_hwframe+0x72 [kernel]
> +        __GI___lll_lock_wait+0x30 [libc.so.6]
> +        ovs_mutex_lock_at+0x18 [ovs-vswitchd]
> +        [unknown] 0x696c003936313a63
> +    2161821695964969    2161821696052021    3359635    ovs-vswitchd     87         accept
> +        syscall_exit_to_user_mode_prepare+0x161 [kernel]
> +        syscall_exit_to_user_mode_prepare+0x161 [kernel]
> +        syscall_exit_to_user_mode+0x9 [kernel]
> +        do_syscall_64+0x68 [kernel]
> +        entry_SYSCALL_64_after_hwframe+0x72 [kernel]
> +        __GI_accept+0x4d [libc.so.6]
> +        pfd_accept+0x3a [ovs-vswitchd]
> +        [unknown] 0x7fff19f2bd00
> +        [unknown] 0xe4b8001f0f
> +
> +As you can see above, the output also shows the stack backtrace. You can
> +disable this using the ``--stack-trace-size 0`` option. 
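(For reference, the ``DURATION_NS`` threshold and poll filtering described in this section are compiled into a small C filter expression that is substituted into the eBPF source; the logic in the script's ``main()`` reduces to the sketch below. The ``poll_id`` default of 7 is only an example value — the script looks the real number up in its syscall table at runtime.)

```python
def build_syscall_filter(duration_ns, skip_poll, poll_id=7):
    """Build the <SYSCALL_TRACE_EVENTS> filter expression, as main() does."""
    if duration_ns is None:
        return "false"                       # --syscall-events not given
    if duration_ns == 0:                     # record every syscall event
        return f"args->id != {poll_id}" if skip_poll else "true"
    expr = f"delta > {duration_ns}"          # only events over the threshold
    if skip_poll:
        expr += f" && args->id != {poll_id}"
    return expr
```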
> +
> +The backtrace shown does not reveal a lot of useful information, because the
> +BCC [#BCC]_ toolkit does not support DWARF decoding. So to further
> +analyze system call backtraces, you could use perf. The following perf
> +script can do this for you (refer to the embedded instructions):
> +
> +https://github.com/chaudron/perf_scripts/blob/master/analyze_perf_pmd_syscall.py
> +
> +
> +Using triggers
> +--------------
> +The tool supports start and/or stop triggers. These allow you to capture
> +statistics triggered by a specific event. The following combinations of
> +start and stop triggers can be used.
> +
> +If you only use ``--start-trigger``, the inspection starts when the trigger
> +happens and runs until the ``--sample-time`` number of seconds has passed.
> +The example below shows all the supported options in this scenario.
> +
> +.. code-block:: console
> +
> +    $ sudo ./kernel_delay.py --start-trigger up:bridge_run --sample-time 4 \
> +        --sample-count 4 --sample-interval 1
> +
> +
> +If you only use ``--stop-trigger``, the inspection starts immediately and
> +stops when the trigger happens. The example below shows all the supported
> +options in this scenario.
> +
> +.. code-block:: console
> +
> +    $ sudo ./kernel_delay.py --stop-trigger upr:bridge_run \
> +        --sample-count 4 --sample-interval 1
> +
> +
> +If you use both ``--start-trigger`` and ``--stop-trigger`` triggers, the
> +statistics are captured between the first occurrences of these two events.
> +The example below shows all the supported options in this scenario.
> +
> +.. code-block:: console
> +
> +    $ sudo ./kernel_delay.py --start-trigger up:bridge_run \
> +        --stop-trigger upr:bridge_run \
> +        --sample-count 4 --sample-interval 1 \
> +        --trigger-delta 50000
> +
> +What triggers are supported? Note that what ``kernel_delay.py`` calls triggers,
> +BCC [#BCC]_ calls events; these are eBPF tracepoints you can attach to. 
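(The trigger arguments used in the examples above, such as ``up:bridge_run``, follow a ``type:target`` shape. Below is a hypothetical sketch of how such specs could split into a probe type and target — the script's actual ``Probe`` class is not shown in this excerpt and may differ in detail.)

```python
# Hypothetical mapping of spec prefixes to probe types; the real parsing
# lives in the script's Probe class, which is not part of this excerpt.
TRIGGER_TYPES = {
    "u": "usdt", "usdt": "usdt",
    "t": "trace", "trace": "trace",
    "k": "kprobe", "kprobe": "kprobe",
    "kr": "kretprobe", "kretprobe": "kretprobe",
    "up": "uprobe", "uprobe": "uprobe",
    "upr": "uretprobe", "uretprobe": "uretprobe",
}

def parse_trigger(spec):
    """Split 'up:bridge_run' style specs into (probe_type, target)."""
    prefix, _, target = spec.partition(":")
    if prefix not in TRIGGER_TYPES or not target:
        raise ValueError(f"invalid trigger '{spec}'")
    return TRIGGER_TYPES[prefix], target
```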
> +For more details on the supported tracepoints, check out the BCC
> +documentation [#BCC_EVENT]_.
> +
> +The list below shows the supported triggers and their argument format:
> +
> +**USDT probes:**
> +  [u|usdt]:{provider}:{probe}
> +**Kernel tracepoint:**
> +  [t|trace]:{system}:{event}
> +**kprobe:**
> +  [k|kprobe]:{kernel_function}
> +**kretprobe:**
> +  [kr|kretprobe]:{kernel_function}
> +**uprobe:**
> +  [up|uprobe]:{function}
> +**uretprobe:**
> +  [upr|uretprobe]:{function}
> +
> +Here are a couple of trigger examples; more use-case-specific examples can be
> +found in the *Examples* section.
> +
> +.. code-block:: console
> +
> +    --start|stop-trigger u:udpif_revalidator:start_dump
> +    --start|stop-trigger t:openvswitch:ovs_dp_upcall
> +    --start|stop-trigger k:ovs_dp_process_packet
> +    --start|stop-trigger kr:ovs_dp_process_packet
> +    --start|stop-trigger up:bridge_run
> +    --start|stop-trigger upr:bridge_run
> +
> +
> +Examples
> +--------
> +This section will give some examples of how to use this tool in real-world
> +scenarios. Let's start with the issue where Open vSwitch reports
> +``Unreasonably long XXXXms poll interval`` on your revalidator threads. Note
> +that there is a blog available explaining how the revalidator process works
> +in OVS [#REVAL_BLOG]_.
> +
> +First, let me explain this log message. It gets logged if the time delta
> +between two ``poll_block()`` calls is more than 1 second. In other words,
> +the process was spending a lot of time processing stuff that was made
> +available by the return of the ``poll_block()`` function.
> +
> +Do a run with the tool using the existing USDT revalidator probes as a start
> +and stop trigger (note that the resource-specific data is removed from the
> +non-revalidator threads):
> +
> +.. 
code-block:: console > + > + $ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done > + # Start sampling (trigger@791777093512008) @2023-06-14T14:52:00.110303 (12:52:00 UTC) > + # Stop sampling (trigger@791778281498462) @2023-06-14T14:52:01.297975 (12:52:01 UTC) > + # Triggered sample dump, stop-start delta 1,187,986,454 ns @2023-06-14T14:52:01.298021 (12:52:01 UTC) > + TID THREAD <RESOURCE SPECIFIC> > + ---------- ---------------- ---------------------------------------------------------------------------- > + 1457761 handler24 [SYSCALL STATISTICS] > + NAME NUMBER COUNT TOTAL ns MAX ns > + sendmsg 46 6110 123,274,761 41,776 > + recvmsg 47 136299 99,397,508 49,896 > + futex 202 51 7,655,832 7,536,776 > + poll 7 4068 1,202,883 2,907 > + getrusage 98 2034 586,602 1,398 > + sendto 44 9 213,682 27,417 > + TOTAL( - poll): 144503 231,128,385 > + > + [THREAD RUN STATISTICS] > + SCHED_CNT TOTAL ns MIN ns MAX ns > + > + [THREAD READY STATISTICS] > + SCHED_CNT TOTAL ns MAX ns > + 1 1,438 1,438 > + > + [SOFT IRQ STATISTICS] > + NAME VECT_NR COUNT TOTAL ns MAX ns > + sched 7 21 59,145 3,769 > + rcu 9 50 42,917 2,234 > + TOTAL: 71 102,062 > + 1457733 ovs-vswitchd [SYSCALL STATISTICS] > + ... 
> +    1457792    revalidator55    [SYSCALL STATISTICS]
> +                                NAME                 NUMBER       COUNT          TOTAL ns            MAX ns
> +                                futex                   202          73       572,576,329       19,621,600
> +                                recvmsg                  47         815       296,697,618          405,338
> +                                sendto                   44           3            78,302            26,837
> +                                sendmsg                  46           3            38,712            13,250
> +                                write                     1           1             5,073             5,073
> +                                TOTAL( - poll):                     895       869,396,034
> +
> +                                [THREAD RUN STATISTICS]
> +                                SCHED_CNT           TOTAL ns            MIN ns            MAX ns
> +                                       48        394,350,393             1,729       140,455,796
> +
> +                                [THREAD READY STATISTICS]
> +                                SCHED_CNT           TOTAL ns            MAX ns
> +                                       49             23,650             1,559
> +
> +                                [SOFT IRQ STATISTICS]
> +                                NAME                 VECT_NR       COUNT          TOTAL ns            MAX ns
> +                                sched                      7          14            26,889             3,041
> +                                rcu                        9          28            23,024             1,600
> +                                TOTAL:                                42            49,913
> +
> +
> +Above you can see from the start of the output that the trigger took more than
> +a second (1,187,986,454 ns), which was already known from looking at the
> +output of the ``ovs-appctl upcall/show`` command.
> +
> +From *revalidator55*'s ``SYSCALL STATISTICS`` you can see it
> +spent almost 870ms handling syscalls, and there were no poll() calls being
> +executed. The ``THREAD RUN STATISTICS`` here are a bit misleading,
> +as it looks like OVS only spent 394ms on the CPU. But earlier, it was mentioned
> +that this time does not include the time being on the CPU at the start or stop
> +of an event. Which is exactly the case here, because USDT probes were used.
> +
> +From the above data and maybe some ``top`` output, it can be determined that
> +the *revalidator55* thread is taking a lot of CPU time, probably because it
> +has to do a lot of revalidator work by itself. The solution here is to increase
> +the number of revalidator threads, so more work can be done in parallel.
> +
> +Here is another run of the same command in another scenario:
> +
> +.. 
code-block:: console
> +
> +    $ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done
> +    # Start sampling (trigger@795160501758971) @2023-06-14T15:48:23.518512 (13:48:23 UTC)
> +    # Stop sampling (trigger@795160764940201) @2023-06-14T15:48:23.781381 (13:48:23 UTC)
> +    # Triggered sample dump, stop-start delta 263,181,230 ns @2023-06-14T15:48:23.781414 (13:48:23 UTC)
> +    TID        THREAD           <RESOURCE SPECIFIC>
> +    ---------- ---------------- ----------------------------------------------------------------------------
> +    1457733    ovs-vswitchd     [SYSCALL STATISTICS]
> +    ...
> +    1457792    revalidator55    [SYSCALL STATISTICS]
> +                                NAME                 NUMBER       COUNT          TOTAL ns            MAX ns
> +                                recvmsg                  47         284       193,422,110        46,248,418
> +                                sendto                   44           2            46,685            23,665
> +                                sendmsg                  46           2            24,916            12,703
> +                                write                     1           1             6,534             6,534
> +                                TOTAL( - poll):                     289       193,500,245
> +
> +                                [THREAD RUN STATISTICS]
> +                                SCHED_CNT           TOTAL ns            MIN ns            MAX ns
> +                                        2         47,333,558           331,516        47,002,042
> +
> +                                [THREAD READY STATISTICS]
> +                                SCHED_CNT           TOTAL ns            MAX ns
> +                                        3         87,000,403        45,999,712
> +
> +                                [SOFT IRQ STATISTICS]
> +                                NAME                 VECT_NR       COUNT          TOTAL ns            MAX ns
> +                                sched                      7           2             9,504             5,109
> +                                TOTAL:                                 2             9,504
> +
> +
> +Here you can see the revalidator run took about 263ms, which does not look
> +odd. However, the ``THREAD READY STATISTICS`` show that OVS was
> +waiting 87ms for a CPU to run on. This means the revalidator process could
> +have finished 87ms faster. Looking at the ``MAX ns`` value, a worst-case delay
> +of almost 46ms can be seen, which hints at an overloaded system.
> +
> +One final example that uses a ``uprobe`` to get some statistics on a
> +``bridge_run()`` execution that takes more than 1ms.
> +
> +.. 
code-block:: console > + > + $ sudo ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run --trigger-delta 1000000 > + # Start sampling (trigger@2245245432101270) @2023-06-14T16:21:10.467919 (14:21:10 UTC) > + # Stop sampling (trigger@2245245432414656) @2023-06-14T16:21:10.468296 (14:21:10 UTC) > + # Sample dump skipped, delta 313,386 ns @2023-06-14T16:21:10.468419 (14:21:10 UTC) > + # Start sampling (trigger@2245245505301745) @2023-06-14T16:21:10.540970 (14:21:10 UTC) > + # Stop sampling (trigger@2245245506911119) @2023-06-14T16:21:10.542499 (14:21:10 UTC) > + # Triggered sample dump, stop-start delta 1,609,374 ns @2023-06-14T16:21:10.542565 (14:21:10 UTC) > + TID THREAD <RESOURCE SPECIFIC> > + ---------- ---------------- ---------------------------------------------------------------------------- > + 3371035 <unknown:3366258/3371035> [SYSCALL STATISTICS] > + ... <REMOVED 7 MORE unknown THREADS> > + 3371102 handler66 [SYSCALL STATISTICS] > + ... <REMOVED 7 MORE HANDLER THREADS> > + 3366258 ovs-vswitchd [SYSCALL STATISTICS] > + NAME NUMBER COUNT TOTAL ns MAX ns > + futex 202 43 403,469 199,312 > + clone3 435 13 174,394 30,731 > + munmap 11 8 115,774 21,861 > + poll 7 5 92,969 38,307 > + unlink 87 2 49,918 35,741 > + mprotect 10 8 47,618 13,201 > + accept 43 10 31,360 6,976 > + mmap 9 8 30,279 5,776 > + write 1 6 27,720 11,774 > + rt_sigprocmask 14 28 12,281 970 > + read 0 6 9,478 2,318 > + recvfrom 45 3 7,024 4,024 > + sendto 44 1 4,684 4,684 > + getrusage 98 5 4,594 1,342 > + close 3 2 2,918 1,627 > + recvmsg 47 1 2,722 2,722 > + TOTAL( - poll): 144 924,233 > + > + [THREAD RUN STATISTICS] > + SCHED_CNT TOTAL ns MIN ns MAX ns > + 13 817,605 5,433 524,376 > + > + [THREAD READY STATISTICS] > + SCHED_CNT TOTAL ns MAX ns > + 14 28,646 11,566 > + > + [SOFT IRQ STATISTICS] > + NAME VECT_NR COUNT TOTAL ns MAX ns > + rcu 9 1 2,838 2,838 > + TOTAL: 1 2,838 > + > + 3371110 revalidator74 [SYSCALL STATISTICS] > + ... 
<REMOVED 7 MORE NEW revalidator THREADS>
> +    3366311    urcu3            [SYSCALL STATISTICS]
> +    ...
> +
> +
> +In the output above, some threads and their resource-specific data were
> +removed, but based on the ``<unknown:3366258/3371035>`` thread name, you can
> +determine that some threads no longer exist. In the ``ovs-vswitchd`` thread,
> +you can see some ``clone3`` syscalls, indicating threads were created. In this
> +example, it was due to the deletion of a bridge, which resulted in the
> +recreation of the revalidator and handler threads.
> +
> +
> +Use with OpenShift
> +------------------
> +This section describes how you would use the tool on a node in an OpenShift
> +cluster. It assumes you have console access to the node, either directly or
> +through a debug container.
> +
> +A base Fedora 38 container will be used through podman, as this will allow
> +the use of some additional tools and packages needed.
> +
> +First the container needs to be started:
> +
> +.. code-block:: console
> +
> +    [core@sno-master ~]$ sudo podman run -it --rm \
> +         -e PS1='[(DEBUG)\u@\h \W]\$ ' \
> +         --privileged --network=host --pid=host \
> +         -v /lib/modules:/lib/modules:ro \
> +         -v /sys/kernel/debug:/sys/kernel/debug \
> +         -v /proc:/proc \
> +         -v /:/mnt/rootdir \
> +         quay.io/fedora/fedora:38-x86_64
> +
> +    [(DEBUG)root@sno-master /]#
> +
> +
> +Next, add the ``kernel_delay.py`` dependencies:
> +
> +.. code-block:: console
> +
> +    [(DEBUG)root@sno-master /]# dnf install -y bcc-tools perl-interpreter \
> +        python3-pytz python3-psutil
> +
> +
> +You need to install the devel, debug and source RPMs for your OVS and kernel
> +version:
> +
> +.. code-block:: console
> +
> +    [(DEBUG)root@sno-master home]# rpm -i \
> +        openvswitch2.17-debuginfo-2.17.0-67.el8fdp.x86_64.rpm \
> +        openvswitch2.17-debugsource-2.17.0-67.el8fdp.x86_64.rpm \
> +        kernel-devel-4.18.0-372.41.1.el8_6.x86_64.rpm
> +
> +
> +Now the tool can be started. Here the above ``bridge_run()`` example is used:
> +
> +.. 
code-block:: console > + > + [(DEBUG)root@sno-master home]# ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run > + # Start sampling (trigger@75279117343513) @2023-06-15T11:44:07.628372 (11:44:07 UTC) > + # Stop sampling (trigger@75279117443980) @2023-06-15T11:44:07.628529 (11:44:07 UTC) > + # Triggered sample dump, stop-start delta 100,467 ns @2023-06-15T11:44:07.628569 (11:44:07 UTC) > + TID THREAD <RESOURCE SPECIFIC> > + ---------- ---------------- ---------------------------------------------------------------------------- > + 1246 ovs-vswitchd [SYSCALL STATISTICS] > + NAME NUMBER COUNT TOTAL ns MAX ns > + getdents64 217 2 8,560 8,162 > + openat 257 1 6,951 6,951 > + accept 43 4 6,942 3,763 > + recvfrom 45 1 3,726 3,726 > + recvmsg 47 2 2,880 2,188 > + stat 4 2 1,946 1,384 > + close 3 1 1,393 1,393 > + fstat 5 1 1,324 1,324 > + TOTAL( - poll): 14 33,722 > + > + [THREAD RUN STATISTICS] > + SCHED_CNT TOTAL ns MIN ns MAX ns > + > + [THREAD READY STATISTICS] > + SCHED_CNT TOTAL ns MAX ns > + > + > +.. rubric:: Footnotes > + > +.. [#BCC] https://github.com/iovisor/bcc > +.. [#BCC_EVENT] https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments > +.. [#REVAL_BLOG] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained >
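The per-thread tables shown in the examples above use a fixed column layout, which makes them easy to post-process. As a quick illustration (this helper is hypothetical and not part of the patch), a ``[SYSCALL STATISTICS]`` row can be parsed back into plain numbers, assuming the column layout from the sample output above:

```python
import re

# Matches one "[SYSCALL STATISTICS]" data row, e.g.
#   "futex        202    43    403,469    199,312".
# Header lines (uppercase) and "TOTAL( - poll):" rows do not match.
ROW_RE = re.compile(r"^\s*(?P<name>[a-z0-9_]+)\s+(?P<number>\d+)\s+"
                    r"(?P<count>\d+)\s+(?P<total_ns>[\d,]+)\s+"
                    r"(?P<max_ns>[\d,]+)\s*$")


def parse_syscall_row(line):
    """Parse one statistics row; returns None for headers and totals."""
    m = ROW_RE.match(line)
    if m is None:
        return None
    return {"name": m["name"],
            "number": int(m["number"]),
            "count": int(m["count"]),
            "total_ns": int(m["total_ns"].replace(",", "")),
            "max_ns": int(m["max_ns"].replace(",", ""))}


row = parse_syscall_row("    futex      202      43      403,469      199,312")
print(row["total_ns"])  # 403469
```

This kind of helper can feed the numbers into further analysis (for example, summing ``total_ns`` across captures), without re-running the capture.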
On 25 Sep 2023, at 13:49, Adrian Moreno wrote: > On 9/12/23 12:36, Eelco Chaudron wrote: >> This patch adds an utility that can be used to determine if >> an issue is related to a lack of Linux kernel resources. >> >> This tool is also featured in a Red Hat developers blog article: >> >> https://developers.redhat.com/articles/2023/07/24/troubleshooting-open-vswitch-kernel-blame >> >> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> >> > > Reviewed-by: Adrian Moreno <amorenoz@redhat.com> Thanks Adrian! Aaron, are you still planning to look at this after the changes? If not, I'll go ahead and apply it. Cheers, Eelco >> --- >> v2: Addressed review comments from Aaron. >> v3: Changed wording in documentation. >> v4: Addressed review comments from Adrian. >> >> utilities/automake.mk | 4 >> utilities/usdt-scripts/kernel_delay.py | 1420 +++++++++++++++++++++++++++++++ >> utilities/usdt-scripts/kernel_delay.rst | 596 +++++++++++++ >> 3 files changed, 2020 insertions(+) >> create mode 100755 utilities/usdt-scripts/kernel_delay.py >> create mode 100644 utilities/usdt-scripts/kernel_delay.rst >> >> diff --git a/utilities/automake.mk b/utilities/automake.mk >> index 37d679f82..9a2114df4 100644 >> --- a/utilities/automake.mk >> +++ b/utilities/automake.mk >> @@ -23,6 +23,8 @@ scripts_DATA += utilities/ovs-lib >> usdt_SCRIPTS += \ >> utilities/usdt-scripts/bridge_loop.bt \ >> utilities/usdt-scripts/dpif_nl_exec_monitor.py \ >> + utilities/usdt-scripts/kernel_delay.py \ >> + utilities/usdt-scripts/kernel_delay.rst \ >> utilities/usdt-scripts/reval_monitor.py \ >> utilities/usdt-scripts/upcall_cost.py \ >> utilities/usdt-scripts/upcall_monitor.py >> @@ -70,6 +72,8 @@ EXTRA_DIST += \ >> utilities/docker/debian/build-kernel-modules.sh \ >> utilities/usdt-scripts/bridge_loop.bt \ >> utilities/usdt-scripts/dpif_nl_exec_monitor.py \ >> + utilities/usdt-scripts/kernel_delay.py \ >> + utilities/usdt-scripts/kernel_delay.rst \ >> utilities/usdt-scripts/reval_monitor.py \ >>
utilities/usdt-scripts/upcall_cost.py \ >> utilities/usdt-scripts/upcall_monitor.py >> diff --git a/utilities/usdt-scripts/kernel_delay.py b/utilities/usdt-scripts/kernel_delay.py >> new file mode 100755 >> index 000000000..636e108be >> --- /dev/null >> +++ b/utilities/usdt-scripts/kernel_delay.py >> @@ -0,0 +1,1420 @@ >> +#!/usr/bin/env python3 >> +# >> +# Copyright (c) 2022,2023 Red Hat, Inc. >> +# >> +# Licensed under the Apache License, Version 2.0 (the "License"); >> +# you may not use this file except in compliance with the License. >> +# You may obtain a copy of the License at: >> +# >> +# http://www.apache.org/licenses/LICENSE-2.0 >> +# >> +# Unless required by applicable law or agreed to in writing, software >> +# distributed under the License is distributed on an "AS IS" BASIS, >> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> +# See the License for the specific language governing permissions and >> +# limitations under the License. >> +# >> +# >> +# Script information: >> +# ------------------- >> +# This script allows a developer to quickly identify if the issue at hand >> +# might be related to the kernel running out of resources or if it really is >> +# an Open vSwitch issue. >> +# >> +# For documentation see the kernel_delay.rst file. 
>> +# >> +# >> +# Dependencies: >> +# ------------- >> +# You need to install the BCC package for your specific platform or build it >> +# yourself using the following instructions: >> +# https://raw.githubusercontent.com/iovisor/bcc/master/INSTALL.md >> +# >> +# Python needs the following additional packages installed: >> +# - pytz >> +# - psutil >> +# >> +# You can either install your distribution specific package or use pip: >> +# pip install pytz psutil >> +# >> +import argparse >> +import datetime >> +import os >> +import pytz >> +import psutil >> +import re >> +import sys >> +import time >> + >> +import ctypes as ct >> + >> +try: >> + from bcc import BPF, USDT, USDTException >> + from bcc.syscall import syscalls, syscall_name >> +except ModuleNotFoundError: >> + print("ERROR: Can't find the BPF Compiler Collection (BCC) tools!") >> + sys.exit(os.EX_OSFILE) >> + >> +from enum import IntEnum >> + >> + >> +# >> +# Actual eBPF source code >> +# >> +EBPF_SOURCE = """ >> +#include <linux/irq.h> >> +#include <linux/sched.h> >> + >> +#define MONITOR_PID <MONITOR_PID> >> + >> +enum { >> +<EVENT_ENUM> >> +}; >> + >> +struct event_t { >> + u64 ts; >> + u32 tid; >> + u32 id; >> + >> + int user_stack_id; >> + int kernel_stack_id; >> + >> + u32 syscall; >> + u64 entry_ts; >> + >> +}; >> + >> +BPF_RINGBUF_OUTPUT(events, <BUFFER_PAGE_CNT>); >> +BPF_STACK_TRACE(stack_traces, <STACK_TRACE_SIZE>); >> +BPF_TABLE("percpu_array", uint32_t, uint64_t, dropcnt, 1); >> +BPF_TABLE("percpu_array", uint32_t, uint64_t, trigger_miss, 1); >> + >> +BPF_ARRAY(capture_on, u64, 1); >> +static inline bool capture_enabled(u64 pid_tgid) { >> + int key = 0; >> + u64 *ret; >> + >> + if ((pid_tgid >> 32) != MONITOR_PID) >> + return false; >> + >> + ret = capture_on.lookup(&key); >> + return ret && *ret == 1; >> +} >> + >> +static inline bool capture_enabled__() { >> + int key = 0; >> + u64 *ret; >> + >> + ret = capture_on.lookup(&key); >> + return ret && *ret == 1; >> +} >> + >> +static struct 
event_t *get_event(uint32_t id) { >> + struct event_t *event = events.ringbuf_reserve(sizeof(struct event_t)); >> + >> + if (!event) { >> + dropcnt.increment(0); >> + return NULL; >> + } >> + >> + event->id = id; >> + event->ts = bpf_ktime_get_ns(); >> + event->tid = bpf_get_current_pid_tgid(); >> + >> + return event; >> +} >> + >> +static int start_trigger() { >> + int key = 0; >> + u64 *val = capture_on.lookup(&key); >> + >> + /* If the value is -1 we can't start as we are still processing the >> + * results in userspace. */ >> + if (!val || *val != 0) { >> + trigger_miss.increment(0); >> + return 0; >> + } >> + >> + struct event_t *event = get_event(EVENT_START_TRIGGER); >> + if (event) { >> + events.ringbuf_submit(event, 0); >> + *val = 1; >> + } else { >> + trigger_miss.increment(0); >> + } >> + return 0; >> +} >> + >> +static int stop_trigger() { >> + int key = 0; >> + u64 *val = capture_on.lookup(&key); >> + >> + if (!val || *val != 1) >> + return 0; >> + >> + struct event_t *event = get_event(EVENT_STOP_TRIGGER); >> + >> + if (event) >> + events.ringbuf_submit(event, 0); >> + >> + if (val) >> + *val = -1; >> + >> + return 0; >> +} >> + >> +<START_TRIGGER> >> +<STOP_TRIGGER> >> + >> + >> +/* >> + * For the syscall monitor the following probes get installed. 
>> + */ >> +struct syscall_data_t { >> + u64 count; >> + u64 total_ns; >> + u64 worst_ns; >> +}; >> + >> +struct syscall_data_key_t { >> + u32 pid; >> + u32 tid; >> + u32 syscall; >> +}; >> + >> +BPF_HASH(syscall_start, u64, u64); >> +BPF_HASH(syscall_data, struct syscall_data_key_t, struct syscall_data_t); >> + >> +TRACEPOINT_PROBE(raw_syscalls, sys_enter) { >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + u64 t = bpf_ktime_get_ns(); >> + syscall_start.update(&pid_tgid, &t); >> + >> + return 0; >> +} >> + >> +TRACEPOINT_PROBE(raw_syscalls, sys_exit) { >> + struct syscall_data_t *val, zero = {}; >> + struct syscall_data_key_t key; >> + >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + key.pid = pid_tgid >> 32; >> + key.tid = (u32)pid_tgid; >> + key.syscall = args->id; >> + >> + u64 *start_ns = syscall_start.lookup(&pid_tgid); >> + >> + if (!start_ns) >> + return 0; >> + >> + val = syscall_data.lookup_or_try_init(&key, &zero); >> + if (val) { >> + u64 delta = bpf_ktime_get_ns() - *start_ns; >> + val->count++; >> + val->total_ns += delta; >> + if (val->worst_ns == 0 || delta > val->worst_ns) >> + val->worst_ns = delta; >> + >> + if (<SYSCALL_TRACE_EVENTS>) { >> + struct event_t *event = get_event(EVENT_SYSCALL); >> + if (event) { >> + event->syscall = args->id; >> + event->entry_ts = *start_ns; >> + if (<STACK_TRACE_ENABLED>) { >> + event->user_stack_id = stack_traces.get_stackid( >> + args, BPF_F_USER_STACK); >> + event->kernel_stack_id = stack_traces.get_stackid( >> + args, 0); >> + } >> + events.ringbuf_submit(event, 0); >> + } >> + } >> + } >> + return 0; >> +} >> + >> + >> +/* >> + * For measuring the thread run time, we need the following. 
>> + */ >> +struct run_time_data_t { >> + u64 count; >> + u64 total_ns; >> + u64 max_ns; >> + u64 min_ns; >> +}; >> + >> +struct pid_tid_key_t { >> + u32 pid; >> + u32 tid; >> +}; >> + >> +BPF_HASH(run_start, u64, u64); >> +BPF_HASH(run_data, struct pid_tid_key_t, struct run_time_data_t); >> + >> +static inline void thread_start_run(u64 pid_tgid, u64 ktime) >> +{ >> + run_start.update(&pid_tgid, &ktime); >> +} >> + >> +static inline void thread_stop_run(u32 pid, u32 tgid, u64 ktime) >> +{ >> + u64 pid_tgid = (u64) tgid << 32 | pid; >> + u64 *start_ns = run_start.lookup(&pid_tgid); >> + >> + if (!start_ns || *start_ns == 0) >> + return; >> + >> + struct run_time_data_t *val, zero = {}; >> + struct pid_tid_key_t key = { .pid = tgid, >> + .tid = pid }; >> + >> + val = run_data.lookup_or_try_init(&key, &zero); >> + if (val) { >> + u64 delta = ktime - *start_ns; >> + val->count++; >> + val->total_ns += delta; >> + if (val->max_ns == 0 || delta > val->max_ns) >> + val->max_ns = delta; >> + if (val->min_ns == 0 || delta < val->min_ns) >> + val->min_ns = delta; >> + } >> + *start_ns = 0; >> +} >> + >> + >> +/* >> + * For measuring the thread-ready delay, we need the following. 
>> + */ >> +struct ready_data_t { >> + u64 count; >> + u64 total_ns; >> + u64 worst_ns; >> +}; >> + >> +BPF_HASH(ready_start, u64, u64); >> +BPF_HASH(ready_data, struct pid_tid_key_t, struct ready_data_t); >> + >> +static inline int sched_wakeup__(u32 pid, u32 tgid) >> +{ >> + u64 pid_tgid = (u64) tgid << 32 | pid; >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + u64 t = bpf_ktime_get_ns(); >> + ready_start.update(&pid_tgid, &t); >> + return 0; >> +} >> + >> +RAW_TRACEPOINT_PROBE(sched_wakeup) >> +{ >> + struct task_struct *t = (struct task_struct *)ctx->args[0]; >> + return sched_wakeup__(t->pid, t->tgid); >> +} >> + >> +RAW_TRACEPOINT_PROBE(sched_wakeup_new) >> +{ >> + struct task_struct *t = (struct task_struct *)ctx->args[0]; >> + return sched_wakeup__(t->pid, t->tgid); >> +} >> + >> +RAW_TRACEPOINT_PROBE(sched_switch) >> +{ >> + struct task_struct *prev = (struct task_struct *)ctx->args[1]; >> + struct task_struct *next= (struct task_struct *)ctx->args[2]; >> + u64 ktime = 0; >> + >> + if (!capture_enabled__()) >> + return 0; >> + >> + if (prev-><STATE_FIELD> == TASK_RUNNING && prev->tgid == MONITOR_PID) >> + sched_wakeup__(prev->pid, prev->tgid); >> + >> + if (prev->tgid == MONITOR_PID) { >> + ktime = bpf_ktime_get_ns(); >> + thread_stop_run(prev->pid, prev->tgid, ktime); >> + } >> + >> + u64 pid_tgid = (u64)next->tgid << 32 | next->pid; >> + >> + if (next->tgid != MONITOR_PID) >> + return 0; >> + >> + if (ktime == 0) >> + ktime = bpf_ktime_get_ns(); >> + >> + u64 *start_ns = ready_start.lookup(&pid_tgid); >> + >> + if (start_ns && *start_ns != 0) { >> + >> + struct ready_data_t *val, zero = {}; >> + struct pid_tid_key_t key = { .pid = next->tgid, >> + .tid = next->pid }; >> + >> + val = ready_data.lookup_or_try_init(&key, &zero); >> + if (val) { >> + u64 delta = ktime - *start_ns; >> + val->count++; >> + val->total_ns += delta; >> + if (val->worst_ns == 0 || delta > val->worst_ns) >> + val->worst_ns = delta; >> + } >> + *start_ns = 0; >> + 
} >> + >> + thread_start_run(pid_tgid, ktime); >> + return 0; >> +} >> + >> + >> +/* >> + * For measuring the hard irq time, we need the following. >> + */ >> +struct hardirq_start_data_t { >> + u64 start_ns; >> + char irq_name[32]; >> +}; >> + >> +struct hardirq_data_t { >> + u64 count; >> + u64 total_ns; >> + u64 worst_ns; >> +}; >> + >> +struct hardirq_data_key_t { >> + u32 pid; >> + u32 tid; >> + char irq_name[32]; >> +}; >> + >> +BPF_HASH(hardirq_start, u64, struct hardirq_start_data_t); >> +BPF_HASH(hardirq_data, struct hardirq_data_key_t, struct hardirq_data_t); >> + >> +TRACEPOINT_PROBE(irq, irq_handler_entry) >> +{ >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + struct hardirq_start_data_t data = {}; >> + >> + data.start_ns = bpf_ktime_get_ns(); >> + TP_DATA_LOC_READ_STR(&data.irq_name, name, sizeof(data.irq_name)); >> + hardirq_start.update(&pid_tgid, &data); >> + return 0; >> +} >> + >> +TRACEPOINT_PROBE(irq, irq_handler_exit) >> +{ >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + struct hardirq_start_data_t *data; >> + data = hardirq_start.lookup(&pid_tgid); >> + if (!data || data->start_ns == 0) >> + return 0; >> + >> + if (args->ret != IRQ_NONE) { >> + struct hardirq_data_t *val, zero = {}; >> + struct hardirq_data_key_t key = { .pid = pid_tgid >> 32, >> + .tid = (u32)pid_tgid }; >> + >> + bpf_probe_read_kernel(&key.irq_name, sizeof(key.irq_name), >> + data->irq_name); >> + val = hardirq_data.lookup_or_try_init(&key, &zero); >> + if (val) { >> + u64 delta = bpf_ktime_get_ns() - data->start_ns; >> + val->count++; >> + val->total_ns += delta; >> + if (val->worst_ns == 0 || delta > val->worst_ns) >> + val->worst_ns = delta; >> + } >> + } >> + >> + data->start_ns = 0; >> + return 0; >> +} >> + >> + >> +/* >> + * For measuring the soft irq time, we need the following. 
>> + */ >> +struct softirq_start_data_t { >> + u64 start_ns; >> + u32 vec_nr; >> +}; >> + >> +struct softirq_data_t { >> + u64 count; >> + u64 total_ns; >> + u64 worst_ns; >> +}; >> + >> +struct softirq_data_key_t { >> + u32 pid; >> + u32 tid; >> + u32 vec_nr; >> +}; >> + >> +BPF_HASH(softirq_start, u64, struct softirq_start_data_t); >> +BPF_HASH(softirq_data, struct softirq_data_key_t, struct softirq_data_t); >> + >> +TRACEPOINT_PROBE(irq, softirq_entry) >> +{ >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + struct softirq_start_data_t data = {}; >> + >> + data.start_ns = bpf_ktime_get_ns(); >> + data.vec_nr = args->vec; >> + softirq_start.update(&pid_tgid, &data); >> + return 0; >> +} >> + >> +TRACEPOINT_PROBE(irq, softirq_exit) >> +{ >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + struct softirq_start_data_t *data; >> + data = softirq_start.lookup(&pid_tgid); >> + if (!data || data->start_ns == 0) >> + return 0; >> + >> + struct softirq_data_t *val, zero = {}; >> + struct softirq_data_key_t key = { .pid = pid_tgid >> 32, >> + .tid = (u32)pid_tgid, >> + .vec_nr = data->vec_nr}; >> + >> + val = softirq_data.lookup_or_try_init(&key, &zero); >> + if (val) { >> + u64 delta = bpf_ktime_get_ns() - data->start_ns; >> + val->count++; >> + val->total_ns += delta; >> + if (val->worst_ns == 0 || delta > val->worst_ns) >> + val->worst_ns = delta; >> + } >> + >> + data->start_ns = 0; >> + return 0; >> +} >> +""" >> + >> + >> +# >> +# time_ns() >> +# >> +try: >> + from time import time_ns >> +except ImportError: >> + # For compatibility with Python <= v3.6. 
>> + def time_ns(): >> + now = datetime.datetime.now() >> + return int(now.timestamp() * 1e9) >> + >> + >> +# >> +# Probe class to use for the start/stop triggers >> +# >> +class Probe(object): >> + ''' >> + The goal of this object is to support as many probes/events >> + as possible, as supported by BCC. See >> + https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments >> + ''' >> + def __init__(self, probe, pid=None): >> + self.pid = pid >> + self.text_probe = probe >> + self._parse_text_probe() >> + >> + def __str__(self): >> + if self.probe_type == "usdt": >> + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, >> + self.usdt_provider, self.usdt_probe) >> + elif self.probe_type == "trace": >> + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, >> + self.trace_system, self.trace_event) >> + elif self.probe_type == "kprobe" or self.probe_type == "kretprobe": >> + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, >> + self.kprobe_function) >> + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": >> + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, >> + self.uprobe_function) >> + else: >> + return "[{}] <{}:unknown probe>".format(self.text_probe, >> + self.probe_type) >> + >> + def _raise(self, error): >> + raise ValueError("[{}]; {}".format(self.text_probe, error)) >> + >> + def _verify_kprobe_probe(self): >> + # Nothing to verify for now, just return. >> + return >> + >> + def _verify_trace_probe(self): >> + # Nothing to verify for now, just return. >> + return >> + >> + def _verify_uprobe_probe(self): >> + # Nothing to verify for now, just return.
>> + return >> + >> + def _verify_usdt_probe(self): >> + if not self.pid: >> + self._raise("USDT probes need a valid PID.") >> + >> + usdt = USDT(pid=self.pid) >> + >> + for probe in usdt.enumerate_probes(): >> + if probe.provider.decode('utf-8') == self.usdt_provider and \ >> + probe.name.decode('utf-8') == self.usdt_probe: >> + return >> + >> + self._raise("Can't find USDT probe '{}:{}'".format(self.usdt_provider, >> + self.usdt_probe)) >> + >> + def _parse_text_probe(self): >> + ''' >> + The text probe format is defined as follows: >> + <probe_type>:<probe_specific> >> + >> + Types: >> + USDT: u|usdt:<provider>:<probe> >> + TRACE: t|trace:<system>:<event> >> + KPROBE: k|kprobe:<kernel_function> >> + KRETPROBE: kr|kretprobe:<kernel_function> >> + UPROBE: up|uprobe:<function> >> + URETPROBE: ur|uretprobe:<function> >> + ''' >> + args = self.text_probe.split(":") >> + if len(args) <= 1: >> + self._raise("Can't extract probe type.") >> + >> + if args[0] not in ["k", "kprobe", "kr", "kretprobe", "t", "trace", >> + "u", "usdt", "up", "uprobe", "ur", "uretprobe"]: >> + self._raise("Invalid probe type '{}'".format(args[0])) >> + >> + self.probe_type = "kprobe" if args[0] == "k" else args[0] >> + self.probe_type = "kretprobe" if args[0] == "kr" else self.probe_type >> + self.probe_type = "trace" if args[0] == "t" else self.probe_type >> + self.probe_type = "usdt" if args[0] == "u" else self.probe_type >> + self.probe_type = "uprobe" if args[0] == "up" else self.probe_type >> + self.probe_type = "uretprobe" if args[0] == "ur" else self.probe_type >> + >> + if self.probe_type == "usdt": >> + if len(args) != 3: >> + self._raise("Invalid number of arguments for USDT") >> + >> + self.usdt_provider = args[1] >> + self.usdt_probe = args[2] >> + self._verify_usdt_probe() >> + >> + elif self.probe_type == "trace": >> + if len(args) != 3: >> + self._raise("Invalid number of arguments for TRACE") >> + >> + self.trace_system = args[1] >> + self.trace_event = args[2] >> +
self._verify_trace_probe() >> + >> + elif self.probe_type == "kprobe" or self.probe_type == "kretprobe": >> + if len(args) != 2: >> + self._raise("Invalid number of arguments for K(RET)PROBE") >> + self.kprobe_function = args[1] >> + self._verify_kprobe_probe() >> + >> + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": >> + if len(args) != 2: >> + self._raise("Invalid number of arguments for U(RET)PROBE") >> + self.uprobe_function = args[1] >> + self._verify_uprobe_probe() >> + >> + def _get_kprobe_c_code(self, function_name, function_content): >> + # >> + # The kprobe__* probes do not require a function name, so it's >> + # ignored in the code generation. >> + # >> + return """ >> +int {}__{}(struct pt_regs *ctx) {{ >> + {} >> +}} >> +""".format(self.probe_type, self.kprobe_function, function_content) >> + >> + def _get_trace_c_code(self, function_name, function_content): >> + # >> + # TRACEPOINT_PROBE() does not require a function name, so it's >> + # ignored in the code generation.
>> + # >> + return """ >> +TRACEPOINT_PROBE({},{}) {{ >> + {} >> +}} >> +""".format(self.trace_system, self.trace_event, function_content) >> + >> + def _get_uprobe_c_code(self, function_name, function_content): >> + return """ >> +int {}(struct pt_regs *ctx) {{ >> + {} >> +}} >> +""".format(function_name, function_content) >> + >> + def _get_usdt_c_code(self, function_name, function_content): >> + return """ >> +int {}(struct pt_regs *ctx) {{ >> + {} >> +}} >> +""".format(function_name, function_content) >> + >> + def get_c_code(self, function_name, function_content): >> + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': >> + return self._get_kprobe_c_code(function_name, function_content) >> + elif self.probe_type == 'trace': >> + return self._get_trace_c_code(function_name, function_content) >> + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': >> + return self._get_uprobe_c_code(function_name, function_content) >> + elif self.probe_type == 'usdt': >> + return self._get_usdt_c_code(function_name, function_content) >> + >> + return "" >> + >> + def probe_name(self): >> + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': >> + return "{}".format(self.kprobe_function) >> + elif self.probe_type == 'trace': >> + return "{}:{}".format(self.trace_system, >> + self.trace_event) >> + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': >> + return "{}".format(self.uprobe_function) >> + elif self.probe_type == 'usdt': >> + return "{}:{}".format(self.usdt_provider, >> + self.usdt_probe) >> + >> + return "" >> + >> + >> +# >> +# event_to_dict() >> +# >> +def event_to_dict(event): >> + return dict([(field, getattr(event, field)) >> + for (field, _) in event._fields_ >> + if isinstance(getattr(event, field), (int, bytes))]) >> + >> + >> +# >> +# Event enum >> +# >> +Event = IntEnum("Event", ["SYSCALL", "START_TRIGGER", "STOP_TRIGGER"], >> + start=0) >> + >> + >> +# >> +# process_event() >> +# >> +def 
process_event(ctx, data, size): >> + global start_trigger_ts >> + global stop_trigger_ts >> + >> + event = bpf['events'].event(data) >> + if event.id == Event.SYSCALL: >> + syscall_events.append({"tid": event.tid, >> + "ts_entry": event.entry_ts, >> + "ts_exit": event.ts, >> + "syscall": event.syscall, >> + "user_stack_id": event.user_stack_id, >> + "kernel_stack_id": event.kernel_stack_id}) >> + elif event.id == Event.START_TRIGGER: >> + # >> + # This event would have started the trigger already, so all we need to >> + # do is record the start timestamp. >> + # >> + start_trigger_ts = event.ts >> + >> + elif event.id == Event.STOP_TRIGGER: >> + # >> + # This event would have stopped the trigger already, so all we need to >> + # do is record the stop timestamp. >> + stop_trigger_ts = event.ts >> + >> + >> +# >> +# next_power_of_two() >> +# >> +def next_power_of_two(val): >> + np = 1 >> + while np < val: >> + np *= 2 >> + return np >> + >> + >> +# >> +# unsigned_int() >> +# >> +def unsigned_int(value): >> + try: >> + value = int(value) >> + except ValueError: >> + raise argparse.ArgumentTypeError("must be an integer") >> + >> + if value < 0: >> + raise argparse.ArgumentTypeError("must be positive") >> + return value >> + >> + >> +# >> +# unsigned_nonzero_int() >> +# >> +def unsigned_nonzero_int(value): >> + value = unsigned_int(value) >> + if value == 0: >> + raise argparse.ArgumentTypeError("must be nonzero") >> + return value >> + >> + >> +# >> +# get_thread_name() >> +# >> +def get_thread_name(pid, tid): >> + try: >> + with open(f"/proc/{pid}/task/{tid}/comm", encoding="utf8") as f: >> + return f.readline().strip("\n") >> + except FileNotFoundError: >> + pass >> + >> + return f"<unknown:{pid}/{tid}>" >> + >> + >> +# >> +# get_vec_nr_name() >> +# >> +def get_vec_nr_name(vec_nr): >> + known_vec_nr = ["hi", "timer", "net_tx", "net_rx", "block", "irq_poll", >> + "tasklet", "sched", "hrtimer", "rcu"] >> + >> + if vec_nr < 0 or vec_nr >= len(known_vec_nr): >> + return
f"<unknown:{vec_nr}>" >> + >> + return known_vec_nr[vec_nr] >> + >> + >> +# >> +# start/stop/reset capture >> +# >> +def start_capture(): >> + bpf["capture_on"][ct.c_int(0)] = ct.c_int(1) >> + >> + >> +def stop_capture(force=False): >> + if force: >> + bpf["capture_on"][ct.c_int(0)] = ct.c_int(0xffff) >> + else: >> + bpf["capture_on"][ct.c_int(0)] = ct.c_int(0) >> + >> + >> +def capture_running(): >> + return bpf["capture_on"][ct.c_int(0)].value == 1 >> + >> + >> +def reset_capture(): >> + bpf["syscall_start"].clear() >> + bpf["syscall_data"].clear() >> + bpf["run_start"].clear() >> + bpf["run_data"].clear() >> + bpf["ready_start"].clear() >> + bpf["ready_data"].clear() >> + bpf["hardirq_start"].clear() >> + bpf["hardirq_data"].clear() >> + bpf["softirq_start"].clear() >> + bpf["softirq_data"].clear() >> + bpf["stack_traces"].clear() >> + >> + >> +# >> +# Display timestamp >> +# >> +def print_timestamp(msg): >> + ltz = datetime.datetime.now() >> + utc = ltz.astimezone(pytz.utc) >> + time_string = "{} @{} ({} UTC)".format( >> + msg, ltz.isoformat(), utc.strftime("%H:%M:%S")) >> + print(time_string) >> + >> + >> +# >> +# process_results() >> +# >> +def process_results(syscall_events=None, trigger_delta=None): >> + if trigger_delta: >> + print_timestamp("# Triggered sample dump, stop-start delta {:,} ns". >> + format(trigger_delta)) >> + else: >> + print_timestamp("# Sample dump") >> + >> + # >> + # First get a list of all threads we need to report on. 
>> + # >> + threads_syscall = {k.tid for k, _ in bpf["syscall_data"].items() >> + if k.syscall != 0xffffffff} >> + >> + threads_run = {k.tid for k, _ in bpf["run_data"].items() >> + if k.pid != 0xffffffff} >> + >> + threads_ready = {k.tid for k, _ in bpf["ready_data"].items() >> + if k.pid != 0xffffffff} >> + >> + threads_hardirq = {k.tid for k, _ in bpf["hardirq_data"].items() >> + if k.pid != 0xffffffff} >> + >> + threads_softirq = {k.tid for k, _ in bpf["softirq_data"].items() >> + if k.pid != 0xffffffff} >> + >> + threads = sorted(threads_syscall | threads_run | threads_ready | >> + threads_hardirq | threads_softirq, >> + key=lambda x: get_thread_name(options.pid, x)) >> + >> + # >> + # Print header... >> + # >> + print("{:10} {:16} {}".format("TID", "THREAD", "<RESOURCE SPECIFIC>")) >> + print("{:10} {:16} {}".format("-" * 10, "-" * 16, "-" * 76)) >> + indent = 28 * " " >> + >> + # >> + # Print all events/statistics per threads. >> + # >> + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0] >> + for thread in threads: >> + >> + if thread != threads[0]: >> + print("") >> + >> + # >> + # SYSCALL_STATISTICS >> + # >> + print("{:10} {:16} {}\n{}{:20} {:>6} {:>10} {:>16} {:>16}".format( >> + thread, get_thread_name(options.pid, thread), >> + "[SYSCALL STATISTICS]", indent, >> + "NAME", "NUMBER", "COUNT", "TOTAL ns", "MAX ns")) >> + >> + total_count = 0 >> + total_ns = 0 >> + for k, v in sorted(filter(lambda t: t[0].tid == thread, >> + bpf["syscall_data"].items()), >> + key=lambda kv: -kv[1].total_ns): >> + >> + print("{}{:20.20} {:6} {:10} {:16,} {:16,}".format( >> + indent, syscall_name(k.syscall).decode('utf-8'), k.syscall, >> + v.count, v.total_ns, v.worst_ns)) >> + if k.syscall != poll_id: >> + total_count += v.count >> + total_ns += v.total_ns >> + >> + if total_count > 0: >> + print("{}{:20.20} {:6} {:10} {:16,}".format( >> + indent, "TOTAL( - poll):", "", total_count, total_ns)) >> + >> + # >> + # THREAD RUN STATISTICS >> + # >> + print("\n{:10} 
{:16} {}\n{}{:10} {:>16} {:>16} {:>16}".format( >> + "", "", "[THREAD RUN STATISTICS]", indent, >> + "SCHED_CNT", "TOTAL ns", "MIN ns", "MAX ns")) >> + >> + for k, v in filter(lambda t: t[0].tid == thread, >> + bpf["run_data"].items()): >> + >> + print("{}{:10} {:16,} {:16,} {:16,}".format( >> + indent, v.count, v.total_ns, v.min_ns, v.max_ns)) >> + >> + # >> + # THREAD READY STATISTICS >> + # >> + print("\n{:10} {:16} {}\n{}{:10} {:>16} {:>16}".format( >> + "", "", "[THREAD READY STATISTICS]", indent, >> + "SCHED_CNT", "TOTAL ns", "MAX ns")) >> + >> + for k, v in filter(lambda t: t[0].tid == thread, >> + bpf["ready_data"].items()): >> + >> + print("{}{:10} {:16,} {:16,}".format( >> + indent, v.count, v.total_ns, v.worst_ns)) >> + >> + # >> + # HARD IRQ STATISTICS >> + # >> + total_ns = 0 >> + total_count = 0 >> + header_printed = False >> + for k, v in sorted(filter(lambda t: t[0].tid == thread, >> + bpf["hardirq_data"].items()), >> + key=lambda kv: -kv[1].total_ns): >> + >> + if not header_printed: >> + print("\n{:10} {:16} {}\n{}{:20} {:>10} {:>16} {:>16}". >> + format("", "", "[HARD IRQ STATISTICS]", indent, >> + "NAME", "COUNT", "TOTAL ns", "MAX ns")) >> + header_printed = True >> + >> + print("{}{:20.20} {:10} {:16,} {:16,}".format( >> + indent, k.irq_name.decode('utf-8'), >> + v.count, v.total_ns, v.worst_ns)) >> + >> + total_count += v.count >> + total_ns += v.total_ns >> + >> + if total_count > 0: >> + print("{}{:20.20} {:10} {:16,}".format( >> + indent, "TOTAL:", total_count, total_ns)) >> + >> + # >> + # SOFT IRQ STATISTICS >> + # >> + total_ns = 0 >> + total_count = 0 >> + header_printed = False >> + for k, v in sorted(filter(lambda t: t[0].tid == thread, >> + bpf["softirq_data"].items()), >> + key=lambda kv: -kv[1].total_ns): >> + >> + if not header_printed: >> + print("\n{:10} {:16} {}\n" >> + "{}{:20} {:>7} {:>10} {:>16} {:>16}". 
>> + format("", "", "[SOFT IRQ STATISTICS]", indent, >> + "NAME", "VECT_NR", "COUNT", "TOTAL ns", "MAX ns")) >> + header_printed = True >> + >> + print("{}{:20.20} {:>7} {:10} {:16,} {:16,}".format( >> + indent, get_vec_nr_name(k.vec_nr), k.vec_nr, >> + v.count, v.total_ns, v.worst_ns)) >> + >> + total_count += v.count >> + total_ns += v.total_ns >> + >> + if total_count > 0: >> + print("{}{:20.20} {:7} {:10} {:16,}".format( >> + indent, "TOTAL:", "", total_count, total_ns)) >> + >> + # >> + # Print events >> + # >> + lost_stack_traces = 0 >> + if syscall_events: >> + stack_traces = bpf.get_table("stack_traces") >> + >> + print("\n\n# SYSCALL EVENTS:" >> + "\n{}{:>19} {:>19} {:>10} {:16} {:>10} {}".format( >> + 2 * " ", "ENTRY (ns)", "EXIT (ns)", "TID", "COMM", >> + "DELTA (us)", "SYSCALL")) >> + print("{}{:19} {:19} {:10} {:16} {:10} {}".format( >> + 2 * " ", "-" * 19, "-" * 19, "-" * 10, "-" * 16, >> + "-" * 10, "-" * 16)) >> + for event in syscall_events: >> + print("{}{:19} {:19} {:10} {:16} {:10,} {}".format( >> + " " * 2, >> + event["ts_entry"], event["ts_exit"], event["tid"], >> + get_thread_name(options.pid, event["tid"]), >> + int((event["ts_exit"] - event["ts_entry"]) / 1000), >> + syscall_name(event["syscall"]).decode('utf-8'))) >> + # >> + # Not sure where to put this, but I'll add some info on stack >> + # traces here... Userspace stack traces are very limited due to >> + # the fact that bcc does not support dwarf backtraces. As OVS >> + # gets compiled without frame pointers we will not see much. >> + # If however, OVS does get built with frame pointers, we should not >> + # use the BPF_STACK_TRACE_BUILDID as it does not seem to handle >> + # the debug symbols correctly. Also, note that for kernel >> + # traces you should not use BPF_STACK_TRACE_BUILDID, so two >> + # buffers are needed. 
>> + # >> + # Some info on manual dwarf walk support: >> + # https://github.com/iovisor/bcc/issues/3515 >> + # https://github.com/iovisor/bcc/pull/4463 >> + # >> + if options.stack_trace_size == 0: >> + continue >> + >> + if event['kernel_stack_id'] < 0 or event['user_stack_id'] < 0: >> + lost_stack_traces += 1 >> + >> + kernel_stack = stack_traces.walk(event['kernel_stack_id']) \ >> + if event['kernel_stack_id'] >= 0 else [] >> + user_stack = stack_traces.walk(event['user_stack_id']) \ >> + if event['user_stack_id'] >= 0 else [] >> + >> + for addr in kernel_stack: >> + print("{}{}".format( >> + " " * 10, >> + bpf.ksym(addr, show_module=True, >> + show_offset=True).decode('utf-8', 'replace'))) >> + >> + for addr in user_stack: >> + addr_str = bpf.sym(addr, options.pid, show_module=True, >> + show_offset=True).decode('utf-8', 'replace') >> + >> + if addr_str == "[unknown]": >> + addr_str += " 0x{:x}".format(addr) >> + >> + print("{}{}".format(" " * 10, addr_str)) >> + >> + # >> + # Print any footer messages. 
>> + #
>> + if lost_stack_traces > 0:
>> + print("\n# WARNING: We were not able to display {} stack traces!\n"
>> + "# Consider increasing the stack trace size using\n"
>> + "# the '--stack-trace-size' option.\n"
>> + "# Note that this can also happen due to a stack id\n"
>> + "# collision.".format(lost_stack_traces))
>> +
>> +
>> +#
>> +# main()
>> +#
>> +def main():
>> + #
>> + # Don't like these globals, but ctx passing does not seem to work with the
>> + # existing open_ring_buffer() API :(
>> + #
>> + global bpf
>> + global options
>> + global syscall_events
>> + global start_trigger_ts
>> + global stop_trigger_ts
>> +
>> + start_trigger_ts = 0
>> + stop_trigger_ts = 0
>> +
>> + #
>> + # Argument parsing
>> + #
>> + parser = argparse.ArgumentParser()
>> +
>> + parser.add_argument("-D", "--debug",
>> + help="Enable eBPF debugging",
>> + type=int, const=0x3f, default=0, nargs='?')
>> + parser.add_argument("-p", "--pid", metavar="VSWITCHD_PID",
>> + help="ovs-vswitchd's PID",
>> + type=unsigned_int, default=None)
>> + parser.add_argument("-s", "--syscall-events", metavar="DURATION_NS",
>> + help="Record syscall events that take longer than "
>> + "DURATION_NS. 
Omit the duration value to record all " >> + "syscall events", >> + type=unsigned_int, const=0, default=None, nargs='?') >> + parser.add_argument("--buffer-page-count", >> + help="Number of BPF ring buffer pages, default 1024", >> + type=unsigned_int, default=1024, metavar="NUMBER") >> + parser.add_argument("--sample-count", >> + help="Number of sample runs, default 1", >> + type=unsigned_nonzero_int, default=1, metavar="RUNS") >> + parser.add_argument("--sample-interval", >> + help="Delay between sample runs, default 0", >> + type=float, default=0, metavar="SECONDS") >> + parser.add_argument("--sample-time", >> + help="Sample time, default 0.5 seconds", >> + type=float, default=0.5, metavar="SECONDS") >> + parser.add_argument("--skip-syscall-poll-events", >> + help="Skip poll() syscalls with --syscall-events", >> + action="store_true") >> + parser.add_argument("--stack-trace-size", >> + help="Number of unique stack traces that can be " >> + "recorded, default 4096. 0 to disable", >> + type=unsigned_int, default=4096) >> + parser.add_argument("--start-trigger", metavar="TRIGGER", >> + help="Start trigger, see documentation for details", >> + type=str, default=None) >> + parser.add_argument("--stop-trigger", metavar="TRIGGER", >> + help="Stop trigger, see documentation for details", >> + type=str, default=None) >> + parser.add_argument("--trigger-delta", metavar="DURATION_NS", >> + help="Only report event when the trigger duration > " >> + "DURATION_NS, default 0 (all events)", >> + type=unsigned_int, const=0, default=0, nargs='?') >> + >> + options = parser.parse_args() >> + >> + # >> + # Find the PID of the ovs-vswitchd daemon if not specified. 
>> + #
>> + if not options.pid:
>> + for proc in psutil.process_iter():
>> + if 'ovs-vswitchd' in proc.name():
>> + if options.pid:
>> + print("ERROR: Multiple ovs-vswitchd daemons running, "
>> + "use the -p option!")
>> + sys.exit(os.EX_NOINPUT)
>> +
>> + options.pid = proc.pid
>> +
>> + #
>> + # Error checking on input parameters.
>> + #
>> + if not options.pid:
>> + print("ERROR: Failed to find ovs-vswitchd's PID!")
>> + sys.exit(os.EX_UNAVAILABLE)
>> +
>> + options.buffer_page_count = next_power_of_two(options.buffer_page_count)
>> +
>> + #
>> + # Make sure we are running as root, or else we can not attach the probes.
>> + #
>> + if os.geteuid() != 0:
>> + print("ERROR: We need to run as root to attach probes!")
>> + sys.exit(os.EX_NOPERM)
>> +
>> + #
>> + # Set up any of the start/stop triggers
>> + #
>> + if options.start_trigger is not None:
>> + try:
>> + start_trigger = Probe(options.start_trigger, pid=options.pid)
>> + except ValueError as e:
>> + print(f"ERROR: Invalid start trigger {str(e)}")
>> + sys.exit(os.EX_CONFIG)
>> + else:
>> + start_trigger = None
>> +
>> + if options.stop_trigger is not None:
>> + try:
>> + stop_trigger = Probe(options.stop_trigger, pid=options.pid)
>> + except ValueError as e:
>> + print(f"ERROR: Invalid stop trigger {str(e)}")
>> + sys.exit(os.EX_CONFIG)
>> + else:
>> + stop_trigger = None
>> +
>> + #
>> + # Attach probe to running process. 
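[Reviewer note] `next_power_of_two()` is a helper defined earlier in the patch, outside this hunk. For reference, rounding the ring-buffer page count up to a power of two can be sketched as below; the function name matches the call above, but the body is an illustrative assumption, not necessarily the patch's exact implementation:

```python
def next_power_of_two(val):
    # Round val up to the nearest power of two; BPF ring buffers
    # require a power-of-two page count.
    np = 1
    while np < val:
        np *= 2
    return np
```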
>> + #
>> + source = EBPF_SOURCE.replace("<EVENT_ENUM>", "\n".join(
>> + [" EVENT_{} = {},".format(
>> + event.name, event.value) for event in Event]))
>> + source = source.replace("<BUFFER_PAGE_CNT>",
>> + str(options.buffer_page_count))
>> + source = source.replace("<MONITOR_PID>", str(options.pid))
>> +
>> + if BPF.kernel_struct_has_field(b'task_struct', b'state') == 1:
>> + source = source.replace('<STATE_FIELD>', 'state')
>> + else:
>> + source = source.replace('<STATE_FIELD>', '__state')
>> +
>> + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0]
>> + if options.syscall_events is None:
>> + syscall_trace_events = "false"
>> + elif options.syscall_events == 0:
>> + if not options.skip_syscall_poll_events:
>> + syscall_trace_events = "true"
>> + else:
>> + syscall_trace_events = f"args->id != {poll_id}"
>> + else:
>> + syscall_trace_events = "delta > {}".format(options.syscall_events)
>> + if options.skip_syscall_poll_events:
>> + syscall_trace_events += f" && args->id != {poll_id}"
>> +
>> + source = source.replace("<SYSCALL_TRACE_EVENTS>",
>> + syscall_trace_events)
>> +
>> + source = source.replace("<STACK_TRACE_SIZE>",
>> + str(options.stack_trace_size))
>> +
>> + source = source.replace("<STACK_TRACE_ENABLED>", "true"
>> + if options.stack_trace_size > 0 else "false")
>> +
>> + #
>> + # Handle start/stop probes
>> + #
>> + if start_trigger:
>> + source = source.replace("<START_TRIGGER>",
>> + start_trigger.get_c_code(
>> + "start_trigger_probe",
>> + "return start_trigger();"))
>> + else:
>> + source = source.replace("<START_TRIGGER>", "")
>> +
>> + if stop_trigger:
>> + source = source.replace("<STOP_TRIGGER>",
>> + stop_trigger.get_c_code(
>> + "stop_trigger_probe",
>> + "return stop_trigger();"))
>> + else:
>> + source = source.replace("<STOP_TRIGGER>", "")
>> +
>> + #
>> + # Set up USDT or other probes that need handling through the BPF class. 
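[Reviewer note] To summarize the `<SYSCALL_TRACE_EVENTS>` substitution in the hunk above, the generated C filter expression follows this logic. This is a pure-Python restatement for review purposes only; the default `poll_id` of 7 is illustrative, as the script looks the real value up in the syscall table:

```python
def build_syscall_filter(syscall_events, skip_poll, poll_id=7):
    # None  -> do not trace individual syscall events at all.
    # 0     -> trace every syscall (optionally excluding poll()).
    # N > 0 -> trace only syscalls that took longer than N ns.
    if syscall_events is None:
        return "false"
    if syscall_events == 0:
        return f"args->id != {poll_id}" if skip_poll else "true"
    expr = f"delta > {syscall_events}"
    if skip_poll:
        expr += f" && args->id != {poll_id}"
    return expr
```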
>> + #
>> + usdt = USDT(pid=int(options.pid))
>> + try:
>> + if start_trigger and start_trigger.probe_type == 'usdt':
>> + usdt.enable_probe(probe=start_trigger.probe_name(),
>> + fn_name="start_trigger_probe")
>> + if stop_trigger and stop_trigger.probe_type == 'usdt':
>> + usdt.enable_probe(probe=stop_trigger.probe_name(),
>> + fn_name="stop_trigger_probe")
>> +
>> + except USDTException as e:
>> + print("ERROR: {}".format(
>> + (re.sub('^', ' ' * 7, str(e), flags=re.MULTILINE)).strip().
>> + replace("--with-dtrace or --enable-dtrace",
>> + "--enable-usdt-probes")))
>> + sys.exit(os.EX_OSERR)
>> +
>> + bpf = BPF(text=source, usdt_contexts=[usdt], debug=options.debug)
>> +
>> + if start_trigger:
>> + try:
>> + if start_trigger.probe_type == "uprobe":
>> + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe",
>> + sym=start_trigger.probe_name(),
>> + fn_name="start_trigger_probe",
>> + pid=options.pid)
>> +
>> + if start_trigger.probe_type == "uretprobe":
>> + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe",
>> + sym=start_trigger.probe_name(),
>> + fn_name="start_trigger_probe",
>> + pid=options.pid)
>> + except Exception as e:
>> + print("ERROR: Failed attaching uprobe start trigger "
>> + f"'{start_trigger.probe_name()}';\n {str(e)}")
>> + sys.exit(os.EX_OSERR)
>> +
>> + if stop_trigger:
>> + try:
>> + if stop_trigger.probe_type == "uprobe":
>> + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe",
>> + sym=stop_trigger.probe_name(),
>> + fn_name="stop_trigger_probe",
>> + pid=options.pid)
>> +
>> + if stop_trigger.probe_type == "uretprobe":
>> + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe",
>> + sym=stop_trigger.probe_name(),
>> + fn_name="stop_trigger_probe",
>> + pid=options.pid)
>> + except Exception as e:
>> + print("ERROR: Failed attaching uprobe stop trigger "
>> + f"'{stop_trigger.probe_name()}';\n {str(e)}")
>> + sys.exit(os.EX_OSERR)
>> +
>> + #
>> + # If no triggers are configured, use the delay configuration
>> + #
>> + 
bpf['events'].open_ring_buffer(process_event) >> + >> + sample_count = 0 >> + while sample_count < options.sample_count: >> + sample_count += 1 >> + syscall_events = [] >> + >> + if not options.start_trigger: >> + print_timestamp("# Start sampling") >> + start_capture() >> + stop_time = -1 if options.stop_trigger else \ >> + time_ns() + options.sample_time * 1000000000 >> + else: >> + # For start triggers the stop time depends on the start trigger >> + # time, or depends on the stop trigger if configured. >> + stop_time = -1 if options.stop_trigger else 0 >> + >> + while True: >> + keyboard_interrupt = False >> + try: >> + last_start_ts = start_trigger_ts >> + last_stop_ts = stop_trigger_ts >> + >> + if stop_time > 0: >> + delay = int((stop_time - time_ns()) / 1000000) >> + if delay <= 0: >> + break >> + else: >> + delay = -1 >> + >> + bpf.ring_buffer_poll(timeout=delay) >> + >> + if stop_time <= 0 and last_start_ts != start_trigger_ts: >> + print_timestamp( >> + "# Start sampling (trigger@{})".format( >> + start_trigger_ts)) >> + >> + if not options.stop_trigger: >> + stop_time = time_ns() + \ >> + options.sample_time * 1000000000 >> + >> + if last_stop_ts != stop_trigger_ts: >> + break >> + >> + except KeyboardInterrupt: >> + keyboard_interrupt = True >> + break >> + >> + if options.stop_trigger and not capture_running(): >> + print_timestamp("# Stop sampling (trigger@{})".format( >> + stop_trigger_ts)) >> + else: >> + print_timestamp("# Stop sampling") >> + >> + if stop_trigger_ts != 0 and start_trigger_ts != 0: >> + trigger_delta = stop_trigger_ts - start_trigger_ts >> + else: >> + trigger_delta = None >> + >> + if not trigger_delta or trigger_delta >= options.trigger_delta: >> + stop_capture(force=True) # Prevent a new trigger to start. 
>> + process_results(syscall_events=syscall_events, >> + trigger_delta=trigger_delta) >> + elif trigger_delta: >> + sample_count -= 1 >> + print_timestamp("# Sample dump skipped, delta {:,} ns".format( >> + trigger_delta)) >> + >> + reset_capture() >> + stop_capture() >> + >> + if keyboard_interrupt: >> + break >> + >> + if options.sample_interval > 0: >> + time.sleep(options.sample_interval) >> + >> + # >> + # Report lost events. >> + # >> + dropcnt = bpf.get_table("dropcnt") >> + for k in dropcnt.keys(): >> + count = dropcnt.sum(k).value >> + if k.value == 0 and count > 0: >> + print("\n# WARNING: Not all events were captured, {} were " >> + "dropped!\n# Increase the BPF ring buffer size " >> + "with the --buffer-page-count option.".format(count)) >> + >> + if (options.sample_count > 1): >> + trigger_miss = bpf.get_table("trigger_miss") >> + for k in trigger_miss.keys(): >> + count = trigger_miss.sum(k).value >> + if k.value == 0 and count > 0: >> + print("\n# WARNING: Not all start triggers were successful. " >> + "{} were missed due to\n# slow userspace " >> + "processing!".format(count)) >> + >> + >> +# >> +# Start main() as the default entry point... >> +# >> +if __name__ == '__main__': >> + main() >> diff --git a/utilities/usdt-scripts/kernel_delay.rst b/utilities/usdt-scripts/kernel_delay.rst >> new file mode 100644 >> index 000000000..0ebd30afb >> --- /dev/null >> +++ b/utilities/usdt-scripts/kernel_delay.rst >> @@ -0,0 +1,596 @@ >> +Troubleshooting Open vSwitch: Is the kernel to blame? >> +===================================================== >> +Often, when troubleshooting Open vSwitch (OVS) in the field, you might be left >> +wondering if the issue is really OVS-related, or if it's a problem with the >> +kernel being overloaded. Messages in the log like >> +``Unreasonably long XXXXms poll interval`` might suggest it's OVS, but from >> +experience, these are mostly related to an overloaded Linux Kernel. 
>> +The kernel_delay.py tool can help you quickly identify if the focus of your
>> +investigation should be OVS or the Linux kernel.
>> +
>> +
>> +Introduction
>> +------------
>> +``kernel_delay.py`` consists of a Python script that uses the BCC [#BCC]_
>> +framework to install eBPF probes. The data the eBPF probes collect will be
>> +analyzed and presented to the user by the Python script. Some of the presented
>> +data can also be captured by the individual scripts included in the BCC [#BCC]_
>> +framework.
>> +
>> +kernel_delay.py has two modes of operation:
>> +
>> +- In **time mode**, the tool runs for a specific time and collects the
>> + information.
>> +- In **trigger mode**, event collection can be started and/or stopped based on
>> + a specific eBPF probe. Currently, the following probes are supported:
>> + - USDT probes
>> + - Kernel tracepoints
>> + - kprobe
>> + - kretprobe
>> + - uprobe
>> + - uretprobe
>> +
>> +
>> +In addition, the ``--sample-count`` option specifies how many sample
>> +iterations to run. When using triggers, you can also ignore samples shorter
>> +than a given number of nanoseconds with the
>> +``--trigger-delta`` option. The latter might be useful when debugging Linux
>> +syscalls which take a long time to complete. More on this later. Finally, you
>> +can configure the delay between two sample runs with the ``--sample-interval``
>> +option.
>> +
>> +Before getting into more details, you can run the tool without any options
>> +to see what the output looks like. Notice that it will try to automatically
>> +get the process ID of the running ``ovs-vswitchd``. You can override this
>> +with the ``--pid`` option.
>> +
>> +.. 
code-block:: console >> + >> + $ sudo ./kernel_delay.py >> + # Start sampling @2023-06-08T12:17:22.725127 (10:17:22 UTC) >> + # Stop sampling @2023-06-08T12:17:23.224781 (10:17:23 UTC) >> + # Sample dump @2023-06-08T12:17:23.224855 (10:17:23 UTC) >> + TID THREAD <RESOURCE SPECIFIC> >> + ---------- ---------------- ---------------------------------------------------------------------------- >> + 27090 ovs-vswitchd [SYSCALL STATISTICS] >> + <EDIT: REMOVED DATA FOR ovs-vswitchd THREAD> >> + >> + 31741 revalidator122 [SYSCALL STATISTICS] >> + NAME NUMBER COUNT TOTAL ns MAX ns >> + poll 7 5 184,193,176 184,191,520 >> + recvmsg 47 494 125,208,756 310,331 >> + futex 202 8 18,768,758 4,023,039 >> + sendto 44 10 375,861 266,867 >> + sendmsg 46 4 43,294 11,213 >> + write 1 1 5,949 5,949 >> + getrusage 98 1 1,424 1,424 >> + read 0 1 1,292 1,292 >> + TOTAL( - poll): 519 144,405,334 >> + >> + [THREAD RUN STATISTICS] >> + SCHED_CNT TOTAL ns MIN ns MAX ns >> + 6 136,764,071 1,480 115,146,424 >> + >> + [THREAD READY STATISTICS] >> + SCHED_CNT TOTAL ns MAX ns >> + 7 11,334 6,636 >> + >> + [HARD IRQ STATISTICS] >> + NAME COUNT TOTAL ns MAX ns >> + eno8303-rx-1 1 3,586 3,586 >> + TOTAL: 1 3,586 >> + >> + [SOFT IRQ STATISTICS] >> + NAME VECT_NR COUNT TOTAL ns MAX ns >> + net_rx 3 1 17,699 17,699 >> + sched 7 6 13,820 3,226 >> + rcu 9 16 13,586 1,554 >> + timer 1 3 10,259 3,815 >> + TOTAL: 26 55,364 >> + >> + >> +By default, the tool will run for half a second in `time mode`. To extend this >> +you can use the ``--sample-time`` option. >> + >> + >> +What will it report >> +------------------- >> +The above sample output separates the captured data on a per-thread basis. >> +For this, it displays the thread's id (``TID``) and name (``THREAD``), >> +followed by resource-specific data. 
These are:
>> +
>> +- ``SYSCALL STATISTICS``
>> +- ``THREAD RUN STATISTICS``
>> +- ``THREAD READY STATISTICS``
>> +- ``HARD IRQ STATISTICS``
>> +- ``SOFT IRQ STATISTICS``
>> +
>> +The following sections will describe in detail what statistics they report.
>> +
>> +
>> +``SYSCALL STATISTICS``
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +``SYSCALL STATISTICS`` tell you which Linux system calls got executed during
>> +the measurement interval. This includes the number of times the syscall was
>> +called (``COUNT``), the total time spent in the system calls (``TOTAL ns``),
>> +and the worst-case duration of a single call (``MAX ns``).
>> +
>> +It also shows the total of all system calls, but it excludes the poll system
>> +call, as the purpose of this call is to wait for activity on a set of sockets,
>> +and usually, the thread gets swapped out.
>> +
>> +Note that it only counts calls that started and stopped during the
>> +measurement interval!
>> +
>> +
>> +``THREAD RUN STATISTICS``
>> +~~~~~~~~~~~~~~~~~~~~~~~~~
>> +``THREAD RUN STATISTICS`` tell you how long the thread was running on a CPU
>> +during the measurement interval.
>> +
>> +Note that these statistics only count events where the thread started and
>> +stopped running on a CPU during the measurement interval. For example, if
>> +this was a PMD thread, you should see zero ``SCHED_CNT`` and ``TOTAL ns``.
>> +If not, there might be a misconfiguration.
>> +
>> +
>> +``THREAD READY STATISTICS``
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +``THREAD READY STATISTICS`` tell you the time between the thread being ready
>> +to run and it actually running on the CPU.
>> +
>> +Note that these statistics only count events where the thread was getting
>> +ready to run and started running during the measurement interval.
>> +
>> +
>> +``HARD IRQ STATISTICS``
>> +~~~~~~~~~~~~~~~~~~~~~~~
>> +``HARD IRQ STATISTICS`` tell you how much time was spent servicing hard
>> +interrupts during the thread's run time. 
>> +
>> +It shows the interrupt name (``NAME``), the number of interrupts (``COUNT``),
>> +the total time spent in the interrupt handler (``TOTAL ns``), and the
>> +worst-case duration (``MAX ns``).
>> +
>> +
>> +``SOFT IRQ STATISTICS``
>> +~~~~~~~~~~~~~~~~~~~~~~~
>> +``SOFT IRQ STATISTICS`` tell you how much time was spent servicing soft
>> +interrupts during the thread's run time.
>> +
>> +It shows the interrupt name (``NAME``), vector number (``VECT_NR``), the
>> +number of interrupts (``COUNT``), the total time spent in the interrupt
>> +handler (``TOTAL ns``), and the worst-case duration (``MAX ns``).
>> +
>> +
>> +The ``--syscall-events`` option
>> +-------------------------------
>> +In addition to reporting global syscall statistics in ``SYSCALL_STATISTICS``,
>> +the tool can also report each individual syscall. This can be a useful
>> +second step if the ``SYSCALL_STATISTICS`` show high latency numbers.
>> +
>> +All you need to do is add the ``--syscall-events`` option, with or without
>> +the additional ``DURATION_NS`` parameter. The ``DURATION_NS`` parameter
>> +allows you to exclude events that take less than the supplied time.
>> +
>> +The ``--skip-syscall-poll-events`` option allows you to exclude poll
>> +syscalls from the report.
>> +
>> +Below is an example run; note that the resource-specific data is removed
>> +to highlight the syscall events:
>> +
>> +.. code-block:: console
>> +
>> + $ sudo ./kernel_delay.py --syscall-events 50000 --skip-syscall-poll-events
>> + # Start sampling @2023-06-13T17:10:46.460874 (15:10:46 UTC)
>> + # Stop sampling @2023-06-13T17:10:46.960727 (15:10:46 UTC)
>> + # Sample dump @2023-06-13T17:10:46.961033 (15:10:46 UTC)
>> + TID THREAD <RESOURCE SPECIFIC>
>> + ---------- ---------------- ----------------------------------------------------------------------------
>> + 3359686 ipf_clean2 [SYSCALL STATISTICS]
>> + ...
>> + 3359635 ovs-vswitchd [SYSCALL STATISTICS]
>> + ... 
>> + 3359697 revalidator12 [SYSCALL STATISTICS] >> + ... >> + 3359698 revalidator13 [SYSCALL STATISTICS] >> + ... >> + 3359699 revalidator14 [SYSCALL STATISTICS] >> + ... >> + 3359700 revalidator15 [SYSCALL STATISTICS] >> + ... >> + >> + # SYSCALL EVENTS: >> + ENTRY (ns) EXIT (ns) TID COMM DELTA (us) SYSCALL >> + ------------------- ------------------- ---------- ---------------- ---------- ---------------- >> + 2161821694935486 2161821695031201 3359699 revalidator14 95 futex >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode+0x9 [kernel] >> + do_syscall_64+0x68 [kernel] >> + entry_SYSCALL_64_after_hwframe+0x72 [kernel] >> + __GI___lll_lock_wait+0x30 [libc.so.6] >> + ovs_mutex_lock_at+0x18 [ovs-vswitchd] >> + [unknown] 0x696c003936313a63 >> + 2161821695276882 2161821695333687 3359698 revalidator13 56 futex >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode+0x9 [kernel] >> + do_syscall_64+0x68 [kernel] >> + entry_SYSCALL_64_after_hwframe+0x72 [kernel] >> + __GI___lll_lock_wait+0x30 [libc.so.6] >> + ovs_mutex_lock_at+0x18 [ovs-vswitchd] >> + [unknown] 0x696c003134313a63 >> + 2161821695275820 2161821695405733 3359700 revalidator15 129 futex >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode+0x9 [kernel] >> + do_syscall_64+0x68 [kernel] >> + entry_SYSCALL_64_after_hwframe+0x72 [kernel] >> + __GI___lll_lock_wait+0x30 [libc.so.6] >> + ovs_mutex_lock_at+0x18 [ovs-vswitchd] >> + [unknown] 0x696c003936313a63 >> + 2161821695964969 2161821696052021 3359635 ovs-vswitchd 87 accept >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode+0x9 [kernel] >> + do_syscall_64+0x68 [kernel] >> + entry_SYSCALL_64_after_hwframe+0x72 [kernel] >> + 
__GI_accept+0x4d [libc.so.6]
>> + pfd_accept+0x3a [ovs-vswitchd]
>> + [unknown] 0x7fff19f2bd00
>> + [unknown] 0xe4b8001f0f
>> +
>> +As you can see above, the output also shows the stack backtrace. You can
>> +disable this using the ``--stack-trace-size 0`` option.
>> +
>> +However, the backtrace does not show a lot of useful information
>> +due to the BCC [#BCC]_ toolkit not supporting DWARF decoding. So to further
>> +analyze system call backtraces, you could use perf. The following perf
>> +script can do this for you (refer to the embedded instructions):
>> +
>> +https://github.com/chaudron/perf_scripts/blob/master/analyze_perf_pmd_syscall.py
>> +
>> +
>> +Using triggers
>> +--------------
>> +The tool supports start and/or stop triggers. This allows you to capture
>> +statistics triggered by a specific event. The following combinations of
>> +stop-and-start triggers can be used.
>> +
>> +If you only use ``--start-trigger``, the inspection starts when the trigger
>> +happens and runs until the ``--sample-time`` number of seconds has passed.
>> +The example below shows all the supported options in this scenario.
>> +
>> +.. code-block:: console
>> +
>> + $ sudo ./kernel_delay.py --start-trigger up:bridge_run --sample-time 4 \
>> + --sample-count 4 --sample-interval 1
>> +
>> +
>> +If you only use ``--stop-trigger``, the inspection starts immediately and
>> +stops when the trigger happens. The example below shows all the supported
>> +options in this scenario.
>> +
>> +.. code-block:: console
>> +
>> + $ sudo ./kernel_delay.py --stop-trigger upr:bridge_run \
>> + --sample-count 4 --sample-interval 1
>> +
>> +
>> +If you use both ``--start-trigger`` and ``--stop-trigger`` triggers, the
>> +statistics are captured between the first occurrences of these events.
>> +The example below shows all the supported options in this scenario.
>> +
>> +.. 
code-block:: console
>> +
>> + $ sudo ./kernel_delay.py --start-trigger up:bridge_run \
>> + --stop-trigger upr:bridge_run \
>> + --sample-count 4 --sample-interval 1 \
>> + --trigger-delta 50000
>> +
>> +What triggers are supported? Note that what ``kernel_delay.py`` calls triggers,
>> +BCC [#BCC]_ calls events; these are eBPF tracepoints you can attach to.
>> +For more details on the supported tracepoints, check out the BCC
>> +documentation [#BCC_EVENT]_.
>> +
>> +The list below shows the supported triggers and their argument format:
>> +
>> +**USDT probes:**
>> + [u|usdt]:{provider}:{probe}
>> +**Kernel tracepoint:**
>> + [t|trace]:{system}:{event}
>> +**kprobe:**
>> + [k|kprobe]:{kernel_function}
>> +**kretprobe:**
>> + [kr|kretprobe]:{kernel_function}
>> +**uprobe:**
>> + [up|uprobe]:{function}
>> +**uretprobe:**
>> + [upr|uretprobe]:{function}
>> +
>> +Here are a couple of trigger examples; more use-case-specific examples can be
>> +found in the *Examples* section.
>> +
>> +.. code-block:: console
>> +
>> + --start|stop-trigger u:udpif_revalidator:start_dump
>> + --start|stop-trigger t:openvswitch:ovs_dp_upcall
>> + --start|stop-trigger k:ovs_dp_process_packet
>> + --start|stop-trigger kr:ovs_dp_process_packet
>> + --start|stop-trigger up:bridge_run
>> + --start|stop-trigger upr:bridge_run
>> +
>> +
>> +Examples
>> +--------
>> +This section will give some examples of how to use this tool in real-world
>> +scenarios. Let's start with the issue where Open vSwitch reports
>> +``Unreasonably long XXXXms poll interval`` on your revalidator threads. Note
>> +that there is a blog available explaining how the revalidator process works
>> +in OVS [#REVAL_BLOG]_.
>> +
>> +First, let me explain this log message. It gets logged if the time delta
>> +between two ``poll_block()`` calls is more than 1 second. In other words,
>> +the process was spending a lot of time processing the work made
>> +available by the return of the ``poll_block()`` function. 
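[Reviewer note] Conceptually, the check behind that log message can be modeled as follows. This is a minimal sketch with the 1-second threshold taken from the text above; the class and method names are illustrative, not OVS's actual implementation:

```python
class PollIntervalWatch:
    """Flag an unreasonably long delta between poll_block() calls."""

    def __init__(self, threshold_ns=1_000_000_000):
        self.threshold_ns = threshold_ns
        self.last_ns = None

    def poll_block(self, now_ns):
        # Return True when the time since the previous poll_block()
        # call exceeds the threshold, i.e. when OVS would log the
        # "Unreasonably long ... poll interval" message.
        warn = (self.last_ns is not None
                and now_ns - self.last_ns > self.threshold_ns)
        self.last_ns = now_ns
        return warn
```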
>> +
>> +Do a run with the tool using the existing USDT revalidator probes as a start
>> +and stop trigger (note that the resource-specific data is removed from the
>> +non-revalidator threads):
>> +
>> +.. code-block:: console
>> +
>> + $ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done
>> + # Start sampling (trigger@791777093512008) @2023-06-14T14:52:00.110303 (12:52:00 UTC)
>> + # Stop sampling (trigger@791778281498462) @2023-06-14T14:52:01.297975 (12:52:01 UTC)
>> + # Triggered sample dump, stop-start delta 1,187,986,454 ns @2023-06-14T14:52:01.298021 (12:52:01 UTC)
>> + TID THREAD <RESOURCE SPECIFIC>
>> + ---------- ---------------- ----------------------------------------------------------------------------
>> + 1457761 handler24 [SYSCALL STATISTICS]
>> + NAME NUMBER COUNT TOTAL ns MAX ns
>> + sendmsg 46 6110 123,274,761 41,776
>> + recvmsg 47 136299 99,397,508 49,896
>> + futex 202 51 7,655,832 7,536,776
>> + poll 7 4068 1,202,883 2,907
>> + getrusage 98 2034 586,602 1,398
>> + sendto 44 9 213,682 27,417
>> + TOTAL( - poll): 144503 231,128,385
>> +
>> + [THREAD RUN STATISTICS]
>> + SCHED_CNT TOTAL ns MIN ns MAX ns
>> +
>> + [THREAD READY STATISTICS]
>> + SCHED_CNT TOTAL ns MAX ns
>> + 1 1,438 1,438
>> +
>> + [SOFT IRQ STATISTICS]
>> + NAME VECT_NR COUNT TOTAL ns MAX ns
>> + sched 7 21 59,145 3,769
>> + rcu 9 50 42,917 2,234
>> + TOTAL: 71 102,062
>> + 1457733 ovs-vswitchd [SYSCALL STATISTICS]
>> + ... 
>> + 1457792 revalidator55 [SYSCALL STATISTICS]
>> + NAME NUMBER COUNT TOTAL ns MAX ns
>> + futex 202 73 572,576,329 19,621,600
>> + recvmsg 47 815 296,697,618 405,338
>> + sendto 44 3 78,302 26,837
>> + sendmsg 46 3 38,712 13,250
>> + write 1 1 5,073 5,073
>> + TOTAL( - poll): 895 869,396,034
>> +
>> + [THREAD RUN STATISTICS]
>> + SCHED_CNT TOTAL ns MIN ns MAX ns
>> + 48 394,350,393 1,729 140,455,796
>> +
>> + [THREAD READY STATISTICS]
>> + SCHED_CNT TOTAL ns MAX ns
>> + 49 23,650 1,559
>> +
>> + [SOFT IRQ STATISTICS]
>> + NAME VECT_NR COUNT TOTAL ns MAX ns
>> + sched 7 14 26,889 3,041
>> + rcu 9 28 23,024 1,600
>> + TOTAL: 42 49,913
>> +
>> +
>> +From the start of the output above, you can see that the trigger took more
>> +than a second (1,187,986,454 ns), which was already known from looking at the
>> +output of the ``ovs-appctl upcall/show`` command.
>> +
>> +From *revalidator55*'s ``SYSCALL STATISTICS`` you can see it
>> +spent almost 870ms handling syscalls, and there were no poll() calls being
>> +executed. The ``THREAD RUN STATISTICS`` here are a bit misleading,
>> +as it looks like OVS only spent 394ms on the CPU. But earlier, it was mentioned
>> +that this time does not include the time being on the CPU at the start or stop
>> +of an event. Which is exactly the case here, because USDT probes were used.
>> +
>> +From the above data and maybe some ``top`` output, it can be determined that
>> +the *revalidator55* thread is taking a lot of CPU time, probably because it
>> +has to do a lot of revalidator work by itself. The solution here is to increase
>> +the number of revalidator threads, so more work can be done in parallel.
>> +
>> +Here is another run of the same command in another scenario:
>> +
>> +.. 
code-block:: console
>> +
>> + $ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done
>> + # Start sampling (trigger@795160501758971) @2023-06-14T15:48:23.518512 (13:48:23 UTC)
>> + # Stop sampling (trigger@795160764940201) @2023-06-14T15:48:23.781381 (13:48:23 UTC)
>> + # Triggered sample dump, stop-start delta 263,181,230 ns @2023-06-14T15:48:23.781414 (13:48:23 UTC)
>> + TID THREAD <RESOURCE SPECIFIC>
>> + ---------- ---------------- ----------------------------------------------------------------------------
>> + 1457733 ovs-vswitchd [SYSCALL STATISTICS]
>> + ...
>> + 1457792 revalidator55 [SYSCALL STATISTICS]
>> + NAME NUMBER COUNT TOTAL ns MAX ns
>> + recvmsg 47 284 193,422,110 46,248,418
>> + sendto 44 2 46,685 23,665
>> + sendmsg 46 2 24,916 12,703
>> + write 1 1 6,534 6,534
>> + TOTAL( - poll): 289 193,500,245
>> +
>> + [THREAD RUN STATISTICS]
>> + SCHED_CNT TOTAL ns MIN ns MAX ns
>> + 2 47,333,558 331,516 47,002,042
>> +
>> + [THREAD READY STATISTICS]
>> + SCHED_CNT TOTAL ns MAX ns
>> + 3 87,000,403 45,999,712
>> +
>> + [SOFT IRQ STATISTICS]
>> + NAME VECT_NR COUNT TOTAL ns MAX ns
>> + sched 7 2 9,504 5,109
>> + TOTAL: 2 9,504
>> +
>> +
>> +Here you can see the revalidator run took about 263ms, which does not look
>> +odd; however, the ``THREAD READY STATISTICS`` show that OVS was
>> +waiting 87ms for a CPU to run on. This means the revalidator process could
>> +have finished 87ms faster. Looking at the ``MAX ns`` value, a worst-case delay
>> +of almost 46ms can be seen, which hints at an overloaded system.
>> +
>> +One final example uses a ``uprobe`` to get some statistics on a
>> +``bridge_run()`` execution that takes more than 1ms.
>> +
>> +.. 
code-block:: console >> + >> + $ sudo ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run --trigger-delta 1000000 >> + # Start sampling (trigger@2245245432101270) @2023-06-14T16:21:10.467919 (14:21:10 UTC) >> + # Stop sampling (trigger@2245245432414656) @2023-06-14T16:21:10.468296 (14:21:10 UTC) >> + # Sample dump skipped, delta 313,386 ns @2023-06-14T16:21:10.468419 (14:21:10 UTC) >> + # Start sampling (trigger@2245245505301745) @2023-06-14T16:21:10.540970 (14:21:10 UTC) >> + # Stop sampling (trigger@2245245506911119) @2023-06-14T16:21:10.542499 (14:21:10 UTC) >> + # Triggered sample dump, stop-start delta 1,609,374 ns @2023-06-14T16:21:10.542565 (14:21:10 UTC) >> + TID THREAD <RESOURCE SPECIFIC> >> + ---------- ---------------- ---------------------------------------------------------------------------- >> + 3371035 <unknown:3366258/3371035> [SYSCALL STATISTICS] >> + ... <REMOVED 7 MORE unknown THREADS> >> + 3371102 handler66 [SYSCALL STATISTICS] >> + ... <REMOVED 7 MORE HANDLER THREADS> >> + 3366258 ovs-vswitchd [SYSCALL STATISTICS] >> + NAME NUMBER COUNT TOTAL ns MAX ns >> + futex 202 43 403,469 199,312 >> + clone3 435 13 174,394 30,731 >> + munmap 11 8 115,774 21,861 >> + poll 7 5 92,969 38,307 >> + unlink 87 2 49,918 35,741 >> + mprotect 10 8 47,618 13,201 >> + accept 43 10 31,360 6,976 >> + mmap 9 8 30,279 5,776 >> + write 1 6 27,720 11,774 >> + rt_sigprocmask 14 28 12,281 970 >> + read 0 6 9,478 2,318 >> + recvfrom 45 3 7,024 4,024 >> + sendto 44 1 4,684 4,684 >> + getrusage 98 5 4,594 1,342 >> + close 3 2 2,918 1,627 >> + recvmsg 47 1 2,722 2,722 >> + TOTAL( - poll): 144 924,233 >> + >> + [THREAD RUN STATISTICS] >> + SCHED_CNT TOTAL ns MIN ns MAX ns >> + 13 817,605 5,433 524,376 >> + >> + [THREAD READY STATISTICS] >> + SCHED_CNT TOTAL ns MAX ns >> + 14 28,646 11,566 >> + >> + [SOFT IRQ STATISTICS] >> + NAME VECT_NR COUNT TOTAL ns MAX ns >> + rcu 9 1 2,838 2,838 >> + TOTAL: 1 2,838 >> + >> + 3371110 revalidator74 [SYSCALL 
STATISTICS] >> + ... <REMOVED 7 MORE NEW revalidator THREADS> >> + 3366311 urcu3 [SYSCALL STATISTICS] >> + ... >> + >> + >> +Some of the threads and their resource-specific data were removed from the >> +output, but based on the ``<unknown:3366258/3371035>`` thread name, you can >> +determine that some threads no longer exist. In the ``ovs-vswitchd`` thread, >> +you can see some ``clone3`` syscalls, indicating threads were created. In this >> +example, it was due to the deletion of a bridge, which resulted in the >> +recreation of the revalidator and handler threads. >> + >> + >> +Use with OpenShift >> +------------------ >> +This section describes how you would use the tool on a node in an OpenShift >> +cluster. It assumes you have console access to the node, either directly or >> +through a debug container. >> + >> +A base fedora38 container will be used through podman, as this will allow the >> +use of some additional tools and packages needed. >> + >> +First, the container needs to be started: >> + >> +.. code-block:: console >> + >> + [core@sno-master ~]$ sudo podman run -it --rm \ >> + -e PS1='[(DEBUG)\u@\h \W]\$ ' \ >> + --privileged --network=host --pid=host \ >> + -v /lib/modules:/lib/modules:ro \ >> + -v /sys/kernel/debug:/sys/kernel/debug \ >> + -v /proc:/proc \ >> + -v /:/mnt/rootdir \ >> + quay.io/fedora/fedora:38-x86_64 >> + >> + [(DEBUG)root@sno-master /]# >> + >> + >> +Next, add the ``kernel_delay.py`` dependencies: >> + >> +.. code-block:: console >> + >> + [(DEBUG)root@sno-master /]# dnf install -y bcc-tools perl-interpreter \ >> + python3-pytz python3-psutil >> + >> + >> +You need to install the devel, debug and source RPMs for your OVS and kernel >> +version: >> + >> +.. code-block:: console >> + >> + [(DEBUG)root@sno-master home]# rpm -i \ >> + openvswitch2.17-debuginfo-2.17.0-67.el8fdp.x86_64.rpm \ >> + openvswitch2.17-debugsource-2.17.0-67.el8fdp.x86_64.rpm \ >> + kernel-devel-4.18.0-372.41.1.el8_6.x86_64.rpm >> + >> + >> +Now the tool can be started. 
Here the above ``bridge_run()`` example is used: >> + >> +.. code-block:: console >> + >> + [(DEBUG)root@sno-master home]# ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run >> + # Start sampling (trigger@75279117343513) @2023-06-15T11:44:07.628372 (11:44:07 UTC) >> + # Stop sampling (trigger@75279117443980) @2023-06-15T11:44:07.628529 (11:44:07 UTC) >> + # Triggered sample dump, stop-start delta 100,467 ns @2023-06-15T11:44:07.628569 (11:44:07 UTC) >> + TID THREAD <RESOURCE SPECIFIC> >> + ---------- ---------------- ---------------------------------------------------------------------------- >> + 1246 ovs-vswitchd [SYSCALL STATISTICS] >> + NAME NUMBER COUNT TOTAL ns MAX ns >> + getdents64 217 2 8,560 8,162 >> + openat 257 1 6,951 6,951 >> + accept 43 4 6,942 3,763 >> + recvfrom 45 1 3,726 3,726 >> + recvmsg 47 2 2,880 2,188 >> + stat 4 2 1,946 1,384 >> + close 3 1 1,393 1,393 >> + fstat 5 1 1,324 1,324 >> + TOTAL( - poll): 14 33,722 >> + >> + [THREAD RUN STATISTICS] >> + SCHED_CNT TOTAL ns MIN ns MAX ns >> + >> + [THREAD READY STATISTICS] >> + SCHED_CNT TOTAL ns MAX ns >> + >> + >> +.. rubric:: Footnotes >> + >> +.. [#BCC] https://github.com/iovisor/bcc >> +.. [#BCC_EVENT] https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments >> +.. [#REVAL_BLOG] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained >> > > -- > Adrián Moreno
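The ``[SYSCALL STATISTICS]`` tables shown in the examples above are, at their core, a count/total/max rollup per syscall, with a ``TOTAL( - poll)`` line that deliberately excludes ``poll`` (time spent waiting for work). A minimal pure-Python sketch of that rollup follows; all names here are illustrative, not taken from the tool — ``kernel_delay.py`` itself keeps these aggregates in eBPF hash maps keyed by thread and syscall.

```python
# Illustrative rollup of the [SYSCALL STATISTICS] tables: per syscall,
# keep a (count, total_ns, worst_ns) aggregate, then report a total that
# excludes poll, as kernel_delay.py's "TOTAL( - poll)" line does.

def aggregate_syscalls(samples):
    """samples: iterable of (syscall_name, delta_ns) tuples."""
    stats = {}
    for name, delta_ns in samples:
        count, total, worst = stats.get(name, (0, 0, 0))
        stats[name] = (count + 1, total + delta_ns, max(worst, delta_ns))
    return stats

def totals_minus_poll(stats):
    """Mimics the 'TOTAL( - poll)' line: sum everything except poll."""
    count = sum(c for n, (c, t, w) in stats.items() if n != "poll")
    total = sum(t for n, (c, t, w) in stats.items() if n != "poll")
    return count, total

samples = [("recvmsg", 120), ("recvmsg", 80), ("poll", 1000), ("sendto", 50)]
stats = aggregate_syscalls(samples)
print(stats["recvmsg"])          # (2, 200, 120)
print(totals_minus_poll(stats))  # (3, 250)
```

Excluding ``poll`` is what makes the remaining total a useful proxy for time a thread actually spent working inside syscalls rather than idling.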
Eelco Chaudron <echaudro@redhat.com> writes: > On 25 Sep 2023, at 13:49, Adrian Moreno wrote: > >> On 9/12/23 12:36, Eelco Chaudron wrote: >>> This patch adds an utility that can be used to determine if >>> an issue is related to a lack of Linux kernel resources. >>> >>> This tool is also featured in a Red Hat developers blog article: >>> >>> https://developers.redhat.com/articles/2023/07/24/troubleshooting-open-vswitch-kernel-blame >>> >>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> >>> >> >> Reviewed-by: Adrian Moreno <amorenoz@redhat.com> > > Thanks Adrian! Aaron, are you still planning to look at this after the > changes? If not I’ll go ahead an apply it. Sorry - just back from PTO. I'll take a look tomorrow if you can hold off. If you don't hear about it from me within 24 hours, please move on without me. > Cheers, > > Eelco > >>> --- >>> v2: Addressed review comments from Aaron. >>> v3: Changed wording in documentation. >>> v4: Addressed review comments from Adrian. >>> >>> utilities/automake.mk | 4 >>> utilities/usdt-scripts/kernel_delay.py | 1420 +++++++++++++++++++++++++++++++ >>> utilities/usdt-scripts/kernel_delay.rst | 596 +++++++++++++ >>> 3 files changed, 2020 insertions(+) >>> create mode 100755 utilities/usdt-scripts/kernel_delay.py >>> create mode 100644 utilities/usdt-scripts/kernel_delay.rst >>> >>> diff --git a/utilities/automake.mk b/utilities/automake.mk >>> index 37d679f82..9a2114df4 100644 >>> --- a/utilities/automake.mk >>> +++ b/utilities/automake.mk >>> @@ -23,6 +23,8 @@ scripts_DATA += utilities/ovs-lib >>> usdt_SCRIPTS += \ >>> utilities/usdt-scripts/bridge_loop.bt \ >>> utilities/usdt-scripts/dpif_nl_exec_monitor.py \ >>> + utilities/usdt-scripts/kernel_delay.py \ >>> + utilities/usdt-scripts/kernel_delay.rst \ >>> utilities/usdt-scripts/reval_monitor.py \ >>> utilities/usdt-scripts/upcall_cost.py \ >>> utilities/usdt-scripts/upcall_monitor.py >>> @@ -70,6 +72,8 @@ EXTRA_DIST += \ >>> 
utilities/docker/debian/build-kernel-modules.sh \ >>> utilities/usdt-scripts/bridge_loop.bt \ >>> utilities/usdt-scripts/dpif_nl_exec_monitor.py \ >>> + utilities/usdt-scripts/kernel_delay.py \ >>> + utilities/usdt-scripts/kernel_delay.rst \ >>> utilities/usdt-scripts/reval_monitor.py \ >>> utilities/usdt-scripts/upcall_cost.py \ >>> utilities/usdt-scripts/upcall_monitor.py >>> diff --git a/utilities/usdt-scripts/kernel_delay.py b/utilities/usdt-scripts/kernel_delay.py >>> new file mode 100755 >>> index 000000000..636e108be >>> --- /dev/null >>> +++ b/utilities/usdt-scripts/kernel_delay.py >>> @@ -0,0 +1,1420 @@ >>> +#!/usr/bin/env python3 >>> +# >>> +# Copyright (c) 2022,2023 Red Hat, Inc. >>> +# >>> +# Licensed under the Apache License, Version 2.0 (the "License"); >>> +# you may not use this file except in compliance with the License. >>> +# You may obtain a copy of the License at: >>> +# >>> +# http://www.apache.org/licenses/LICENSE-2.0 >>> +# >>> +# Unless required by applicable law or agreed to in writing, software >>> +# distributed under the License is distributed on an "AS IS" BASIS, >>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >>> +# See the License for the specific language governing permissions and >>> +# limitations under the License. >>> +# >>> +# >>> +# Script information: >>> +# ------------------- >>> +# This script allows a developer to quickly identify if the issue at hand >>> +# might be related to the kernel running out of resources or if it really is >>> +# an Open vSwitch issue. >>> +# >>> +# For documentation see the kernel_delay.rst file. 
>>> +# >>> +# >>> +# Dependencies: >>> +# ------------- >>> +# You need to install the BCC package for your specific platform or build it >>> +# yourself using the following instructions: >>> +# https://raw.githubusercontent.com/iovisor/bcc/master/INSTALL.md >>> +# >>> +# Python needs the following additional packages installed: >>> +# - pytz >>> +# - psutil >>> +# >>> +# You can either install your distribution specific package or use pip: >>> +# pip install pytz psutil >>> +# >>> +import argparse >>> +import datetime >>> +import os >>> +import pytz >>> +import psutil >>> +import re >>> +import sys >>> +import time >>> + >>> +import ctypes as ct >>> + >>> +try: >>> + from bcc import BPF, USDT, USDTException >>> + from bcc.syscall import syscalls, syscall_name >>> +except ModuleNotFoundError: >>> + print("ERROR: Can't find the BPF Compiler Collection (BCC) tools!") >>> + sys.exit(os.EX_OSFILE) >>> + >>> +from enum import IntEnum >>> + >>> + >>> +# >>> +# Actual eBPF source code >>> +# >>> +EBPF_SOURCE = """ >>> +#include <linux/irq.h> >>> +#include <linux/sched.h> >>> + >>> +#define MONITOR_PID <MONITOR_PID> >>> + >>> +enum { >>> +<EVENT_ENUM> >>> +}; >>> + >>> +struct event_t { >>> + u64 ts; >>> + u32 tid; >>> + u32 id; >>> + >>> + int user_stack_id; >>> + int kernel_stack_id; >>> + >>> + u32 syscall; >>> + u64 entry_ts; >>> + >>> +}; >>> + >>> +BPF_RINGBUF_OUTPUT(events, <BUFFER_PAGE_CNT>); >>> +BPF_STACK_TRACE(stack_traces, <STACK_TRACE_SIZE>); >>> +BPF_TABLE("percpu_array", uint32_t, uint64_t, dropcnt, 1); >>> +BPF_TABLE("percpu_array", uint32_t, uint64_t, trigger_miss, 1); >>> + >>> +BPF_ARRAY(capture_on, u64, 1); >>> +static inline bool capture_enabled(u64 pid_tgid) { >>> + int key = 0; >>> + u64 *ret; >>> + >>> + if ((pid_tgid >> 32) != MONITOR_PID) >>> + return false; >>> + >>> + ret = capture_on.lookup(&key); >>> + return ret && *ret == 1; >>> +} >>> + >>> +static inline bool capture_enabled__() { >>> + int key = 0; >>> + u64 *ret; >>> + >>> + ret = 
capture_on.lookup(&key); >>> + return ret && *ret == 1; >>> +} >>> + >>> +static struct event_t *get_event(uint32_t id) { >>> + struct event_t *event = events.ringbuf_reserve(sizeof(struct event_t)); >>> + >>> + if (!event) { >>> + dropcnt.increment(0); >>> + return NULL; >>> + } >>> + >>> + event->id = id; >>> + event->ts = bpf_ktime_get_ns(); >>> + event->tid = bpf_get_current_pid_tgid(); >>> + >>> + return event; >>> +} >>> + >>> +static int start_trigger() { >>> + int key = 0; >>> + u64 *val = capture_on.lookup(&key); >>> + >>> + /* If the value is -1 we can't start as we are still processing the >>> + * results in userspace. */ >>> + if (!val || *val != 0) { >>> + trigger_miss.increment(0); >>> + return 0; >>> + } >>> + >>> + struct event_t *event = get_event(EVENT_START_TRIGGER); >>> + if (event) { >>> + events.ringbuf_submit(event, 0); >>> + *val = 1; >>> + } else { >>> + trigger_miss.increment(0); >>> + } >>> + return 0; >>> +} >>> + >>> +static int stop_trigger() { >>> + int key = 0; >>> + u64 *val = capture_on.lookup(&key); >>> + >>> + if (!val || *val != 1) >>> + return 0; >>> + >>> + struct event_t *event = get_event(EVENT_STOP_TRIGGER); >>> + >>> + if (event) >>> + events.ringbuf_submit(event, 0); >>> + >>> + if (val) >>> + *val = -1; >>> + >>> + return 0; >>> +} >>> + >>> +<START_TRIGGER> >>> +<STOP_TRIGGER> >>> + >>> + >>> +/* >>> + * For the syscall monitor the following probes get installed. 
>>> + */ >>> +struct syscall_data_t { >>> + u64 count; >>> + u64 total_ns; >>> + u64 worst_ns; >>> +}; >>> + >>> +struct syscall_data_key_t { >>> + u32 pid; >>> + u32 tid; >>> + u32 syscall; >>> +}; >>> + >>> +BPF_HASH(syscall_start, u64, u64); >>> +BPF_HASH(syscall_data, struct syscall_data_key_t, struct syscall_data_t); >>> + >>> +TRACEPOINT_PROBE(raw_syscalls, sys_enter) { >>> + u64 pid_tgid = bpf_get_current_pid_tgid(); >>> + >>> + if (!capture_enabled(pid_tgid)) >>> + return 0; >>> + >>> + u64 t = bpf_ktime_get_ns(); >>> + syscall_start.update(&pid_tgid, &t); >>> + >>> + return 0; >>> +} >>> + >>> +TRACEPOINT_PROBE(raw_syscalls, sys_exit) { >>> + struct syscall_data_t *val, zero = {}; >>> + struct syscall_data_key_t key; >>> + >>> + u64 pid_tgid = bpf_get_current_pid_tgid(); >>> + >>> + if (!capture_enabled(pid_tgid)) >>> + return 0; >>> + >>> + key.pid = pid_tgid >> 32; >>> + key.tid = (u32)pid_tgid; >>> + key.syscall = args->id; >>> + >>> + u64 *start_ns = syscall_start.lookup(&pid_tgid); >>> + >>> + if (!start_ns) >>> + return 0; >>> + >>> + val = syscall_data.lookup_or_try_init(&key, &zero); >>> + if (val) { >>> + u64 delta = bpf_ktime_get_ns() - *start_ns; >>> + val->count++; >>> + val->total_ns += delta; >>> + if (val->worst_ns == 0 || delta > val->worst_ns) >>> + val->worst_ns = delta; >>> + >>> + if (<SYSCALL_TRACE_EVENTS>) { >>> + struct event_t *event = get_event(EVENT_SYSCALL); >>> + if (event) { >>> + event->syscall = args->id; >>> + event->entry_ts = *start_ns; >>> + if (<STACK_TRACE_ENABLED>) { >>> + event->user_stack_id = stack_traces.get_stackid( >>> + args, BPF_F_USER_STACK); >>> + event->kernel_stack_id = stack_traces.get_stackid( >>> + args, 0); >>> + } >>> + events.ringbuf_submit(event, 0); >>> + } >>> + } >>> + } >>> + return 0; >>> +} >>> + >>> + >>> +/* >>> + * For measuring the thread run time, we need the following. 
>>> + */ >>> +struct run_time_data_t { >>> + u64 count; >>> + u64 total_ns; >>> + u64 max_ns; >>> + u64 min_ns; >>> +}; >>> + >>> +struct pid_tid_key_t { >>> + u32 pid; >>> + u32 tid; >>> +}; >>> + >>> +BPF_HASH(run_start, u64, u64); >>> +BPF_HASH(run_data, struct pid_tid_key_t, struct run_time_data_t); >>> + >>> +static inline void thread_start_run(u64 pid_tgid, u64 ktime) >>> +{ >>> + run_start.update(&pid_tgid, &ktime); >>> +} >>> + >>> +static inline void thread_stop_run(u32 pid, u32 tgid, u64 ktime) >>> +{ >>> + u64 pid_tgid = (u64) tgid << 32 | pid; >>> + u64 *start_ns = run_start.lookup(&pid_tgid); >>> + >>> + if (!start_ns || *start_ns == 0) >>> + return; >>> + >>> + struct run_time_data_t *val, zero = {}; >>> + struct pid_tid_key_t key = { .pid = tgid, >>> + .tid = pid }; >>> + >>> + val = run_data.lookup_or_try_init(&key, &zero); >>> + if (val) { >>> + u64 delta = ktime - *start_ns; >>> + val->count++; >>> + val->total_ns += delta; >>> + if (val->max_ns == 0 || delta > val->max_ns) >>> + val->max_ns = delta; >>> + if (val->min_ns == 0 || delta < val->min_ns) >>> + val->min_ns = delta; >>> + } >>> + *start_ns = 0; >>> +} >>> + >>> + >>> +/* >>> + * For measuring the thread-ready delay, we need the following. 
>>> + */ >>> +struct ready_data_t { >>> + u64 count; >>> + u64 total_ns; >>> + u64 worst_ns; >>> +}; >>> + >>> +BPF_HASH(ready_start, u64, u64); >>> +BPF_HASH(ready_data, struct pid_tid_key_t, struct ready_data_t); >>> + >>> +static inline int sched_wakeup__(u32 pid, u32 tgid) >>> +{ >>> + u64 pid_tgid = (u64) tgid << 32 | pid; >>> + >>> + if (!capture_enabled(pid_tgid)) >>> + return 0; >>> + >>> + u64 t = bpf_ktime_get_ns(); >>> + ready_start.update(&pid_tgid, &t); >>> + return 0; >>> +} >>> + >>> +RAW_TRACEPOINT_PROBE(sched_wakeup) >>> +{ >>> + struct task_struct *t = (struct task_struct *)ctx->args[0]; >>> + return sched_wakeup__(t->pid, t->tgid); >>> +} >>> + >>> +RAW_TRACEPOINT_PROBE(sched_wakeup_new) >>> +{ >>> + struct task_struct *t = (struct task_struct *)ctx->args[0]; >>> + return sched_wakeup__(t->pid, t->tgid); >>> +} >>> + >>> +RAW_TRACEPOINT_PROBE(sched_switch) >>> +{ >>> + struct task_struct *prev = (struct task_struct *)ctx->args[1]; >>> + struct task_struct *next= (struct task_struct *)ctx->args[2]; >>> + u64 ktime = 0; >>> + >>> + if (!capture_enabled__()) >>> + return 0; >>> + >>> + if (prev-><STATE_FIELD> == TASK_RUNNING && prev->tgid == MONITOR_PID) >>> + sched_wakeup__(prev->pid, prev->tgid); >>> + >>> + if (prev->tgid == MONITOR_PID) { >>> + ktime = bpf_ktime_get_ns(); >>> + thread_stop_run(prev->pid, prev->tgid, ktime); >>> + } >>> + >>> + u64 pid_tgid = (u64)next->tgid << 32 | next->pid; >>> + >>> + if (next->tgid != MONITOR_PID) >>> + return 0; >>> + >>> + if (ktime == 0) >>> + ktime = bpf_ktime_get_ns(); >>> + >>> + u64 *start_ns = ready_start.lookup(&pid_tgid); >>> + >>> + if (start_ns && *start_ns != 0) { >>> + >>> + struct ready_data_t *val, zero = {}; >>> + struct pid_tid_key_t key = { .pid = next->tgid, >>> + .tid = next->pid }; >>> + >>> + val = ready_data.lookup_or_try_init(&key, &zero); >>> + if (val) { >>> + u64 delta = ktime - *start_ns; >>> + val->count++; >>> + val->total_ns += delta; >>> + if (val->worst_ns == 0 || delta > 
val->worst_ns) >>> + val->worst_ns = delta; >>> + } >>> + *start_ns = 0; >>> + } >>> + >>> + thread_start_run(pid_tgid, ktime); >>> + return 0; >>> +} >>> + >>> + >>> +/* >>> + * For measuring the hard irq time, we need the following. >>> + */ >>> +struct hardirq_start_data_t { >>> + u64 start_ns; >>> + char irq_name[32]; >>> +}; >>> + >>> +struct hardirq_data_t { >>> + u64 count; >>> + u64 total_ns; >>> + u64 worst_ns; >>> +}; >>> + >>> +struct hardirq_data_key_t { >>> + u32 pid; >>> + u32 tid; >>> + char irq_name[32]; >>> +}; >>> + >>> +BPF_HASH(hardirq_start, u64, struct hardirq_start_data_t); >>> +BPF_HASH(hardirq_data, struct hardirq_data_key_t, struct hardirq_data_t); >>> + >>> +TRACEPOINT_PROBE(irq, irq_handler_entry) >>> +{ >>> + u64 pid_tgid = bpf_get_current_pid_tgid(); >>> + >>> + if (!capture_enabled(pid_tgid)) >>> + return 0; >>> + >>> + struct hardirq_start_data_t data = {}; >>> + >>> + data.start_ns = bpf_ktime_get_ns(); >>> + TP_DATA_LOC_READ_STR(&data.irq_name, name, sizeof(data.irq_name)); >>> + hardirq_start.update(&pid_tgid, &data); >>> + return 0; >>> +} >>> + >>> +TRACEPOINT_PROBE(irq, irq_handler_exit) >>> +{ >>> + u64 pid_tgid = bpf_get_current_pid_tgid(); >>> + >>> + if (!capture_enabled(pid_tgid)) >>> + return 0; >>> + >>> + struct hardirq_start_data_t *data; >>> + data = hardirq_start.lookup(&pid_tgid); >>> + if (!data || data->start_ns == 0) >>> + return 0; >>> + >>> + if (args->ret != IRQ_NONE) { >>> + struct hardirq_data_t *val, zero = {}; >>> + struct hardirq_data_key_t key = { .pid = pid_tgid >> 32, >>> + .tid = (u32)pid_tgid }; >>> + >>> + bpf_probe_read_kernel(&key.irq_name, sizeof(key.irq_name), >>> + data->irq_name); >>> + val = hardirq_data.lookup_or_try_init(&key, &zero); >>> + if (val) { >>> + u64 delta = bpf_ktime_get_ns() - data->start_ns; >>> + val->count++; >>> + val->total_ns += delta; >>> + if (val->worst_ns == 0 || delta > val->worst_ns) >>> + val->worst_ns = delta; >>> + } >>> + } >>> + >>> + data->start_ns = 0; >>> + 
return 0; >>> +} >>> + >>> + >>> +/* >>> + * For measuring the soft irq time, we need the following. >>> + */ >>> +struct softirq_start_data_t { >>> + u64 start_ns; >>> + u32 vec_nr; >>> +}; >>> + >>> +struct softirq_data_t { >>> + u64 count; >>> + u64 total_ns; >>> + u64 worst_ns; >>> +}; >>> + >>> +struct softirq_data_key_t { >>> + u32 pid; >>> + u32 tid; >>> + u32 vec_nr; >>> +}; >>> + >>> +BPF_HASH(softirq_start, u64, struct softirq_start_data_t); >>> +BPF_HASH(softirq_data, struct softirq_data_key_t, struct softirq_data_t); >>> + >>> +TRACEPOINT_PROBE(irq, softirq_entry) >>> +{ >>> + u64 pid_tgid = bpf_get_current_pid_tgid(); >>> + >>> + if (!capture_enabled(pid_tgid)) >>> + return 0; >>> + >>> + struct softirq_start_data_t data = {}; >>> + >>> + data.start_ns = bpf_ktime_get_ns(); >>> + data.vec_nr = args->vec; >>> + softirq_start.update(&pid_tgid, &data); >>> + return 0; >>> +} >>> + >>> +TRACEPOINT_PROBE(irq, softirq_exit) >>> +{ >>> + u64 pid_tgid = bpf_get_current_pid_tgid(); >>> + >>> + if (!capture_enabled(pid_tgid)) >>> + return 0; >>> + >>> + struct softirq_start_data_t *data; >>> + data = softirq_start.lookup(&pid_tgid); >>> + if (!data || data->start_ns == 0) >>> + return 0; >>> + >>> + struct softirq_data_t *val, zero = {}; >>> + struct softirq_data_key_t key = { .pid = pid_tgid >> 32, >>> + .tid = (u32)pid_tgid, >>> + .vec_nr = data->vec_nr}; >>> + >>> + val = softirq_data.lookup_or_try_init(&key, &zero); >>> + if (val) { >>> + u64 delta = bpf_ktime_get_ns() - data->start_ns; >>> + val->count++; >>> + val->total_ns += delta; >>> + if (val->worst_ns == 0 || delta > val->worst_ns) >>> + val->worst_ns = delta; >>> + } >>> + >>> + data->start_ns = 0; >>> + return 0; >>> +} >>> +""" >>> + >>> + >>> +# >>> +# time_ns() >>> +# >>> +try: >>> + from time import time_ns >>> +except ImportError: >>> + # For compatibility with Python <= v3.6. 
>>> + def time_ns(): >>> + now = datetime.datetime.now() >>> + return int(now.timestamp() * 1e9) >>> + >>> + >>> +# >>> +# Probe class to use for the start/stop triggers >>> +# >>> +class Probe(object): >>> + ''' >>> + The goal for this object is to support as many as possible >>> + probe/events as supported by BCC. See >>> + https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments >>> + ''' >>> + def __init__(self, probe, pid=None): >>> + self.pid = pid >>> + self.text_probe = probe >>> + self._parse_text_probe() >>> + >>> + def __str__(self): >>> + if self.probe_type == "usdt": >>> + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, >>> + self.usdt_provider, self.usdt_probe) >>> + elif self.probe_type == "trace": >>> + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, >>> + self.trace_system, self.trace_event) >>> + elif self.probe_type == "kprobe" or self.probe_type == "kretprobe": >>> + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, >>> + self.kprobe_function) >>> + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": >>> + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, >>> + self.uprobe_function) >>> + else: >>> + return "[{}] <{}:unknown probe>".format(self.text_probe, >>> + self.probe_type) >>> + >>> + def _raise(self, error): >>> + raise ValueError("[{}]; {}".format(self.text_probe, error)) >>> + >>> + def _verify_kprobe_probe(self): >>> + # Nothing to verify for now, just return. >>> + return >>> + >>> + def _verify_trace_probe(self): >>> + # Nothing to verify for now, just return. >>> + return >>> + >>> + def _verify_uprobe_probe(self): >>> + # Nothing to verify for now, just return. 
>>> + return >>> + >>> + def _verify_usdt_probe(self): >>> + if not self.pid: >>> + self._raise("USDT probes need a valid PID.") >>> + >>> + usdt = USDT(pid=self.pid) >>> + >>> + for probe in usdt.enumerate_probes(): >>> + if probe.provider.decode('utf-8') == self.usdt_provider and \ >>> + probe.name.decode('utf-8') == self.usdt_probe: >>> + return >>> + >>> + self._raise("Can't find USDT probe '{}:{}'".format(self.usdt_provider, >>> + self.usdt_probe)) >>> + >>> + def _parse_text_probe(self): >>> + ''' >>> + The text probe format is defined as follows: >>> + <probe_type>:<probe_specific> >>> + >>> + Types: >>> + USDT: u|usdt:<provider>:<probe> >>> + TRACE: t|trace:<system>:<event> >>> + KPROBE: k|kprobe:<kernel_function> >>> + KRETPROBE: kr|kretprobe:<kernel_function> >>> + UPROBE: up|uprobe:<function> >>> + URETPROBE: ur|uretprobe:<function> >>> + ''' >>> + args = self.text_probe.split(":") >>> + if len(args) <= 1: >>> + self._raise("Can't extract probe type.") >>> + >>> + if args[0] not in ["k", "kprobe", "kr", "kretprobe", "t", "trace", >>> + "u", "usdt", "up", "uprobe", "ur", "uretprobe"]: >>> + self._raise("Invalid probe type '{}'".format(args[0])) >>> + >>> + self.probe_type = "kprobe" if args[0] == "k" else args[0] >>> + self.probe_type = "kretprobe" if args[0] == "kr" else self.probe_type >>> + self.probe_type = "trace" if args[0] == "t" else self.probe_type >>> + self.probe_type = "usdt" if args[0] == "u" else self.probe_type >>> + self.probe_type = "uprobe" if args[0] == "up" else self.probe_type >>> + self.probe_type = "uretprobe" if args[0] == "ur" else self.probe_type >>> + >>> + if self.probe_type == "usdt": >>> + if len(args) != 3: >>> + self._raise("Invalid number of arguments for USDT") >>> + >>> + self.usdt_provider = args[1] >>> + self.usdt_probe = args[2] >>> + self._verify_usdt_probe() >>> + >>> + elif self.probe_type == "trace": >>> + if len(args) != 3: >>> + self._raise("Invalid number of arguments for TRACE") >>> + >>> + self.trace_system = 
args[1] >>> + self.trace_event = args[2] >>> + self._verify_trace_probe() >>> + >>> + elif self.probe_type == "kprobe" or self.probe_type == "kretprobe": >>> + if len(args) != 2: >>> + self._raise("Invalid number of arguments for K(RET)PROBE") >>> + self.kprobe_function = args[1] >>> + self._verify_kprobe_probe() >>> + >>> + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": >>> + if len(args) != 2: >>> + self._raise("Invalid number of arguments for U(RET)PROBE") >>> + self.uprobe_function = args[1] >>> + self._verify_uprobe_probe() >>> + >>> + def _get_kprobe_c_code(self, function_name, function_content): >>> + # >>> + # The kprobe__* do not require a function name, so it's >>> + # ignored in the code generation. >>> + # >>> + return """ >>> +int {}__{}(struct pt_regs *ctx) {{ >>> + {} >>> +}} >>> +""".format(self.probe_type, self.kprobe_function, function_content) >>> + >>> + def _get_trace_c_code(self, function_name, function_content): >>> + # >>> + # The TRACEPOINT_PROBE() do not require a function name, so it's >>> + # ignored in the code generation. 
>>> + # >>> + return """ >>> +TRACEPOINT_PROBE({},{}) {{ >>> + {} >>> +}} >>> +""".format(self.trace_system, self.trace_event, function_content) >>> + >>> + def _get_uprobe_c_code(self, function_name, function_content): >>> + return """ >>> +int {}(struct pt_regs *ctx) {{ >>> + {} >>> +}} >>> +""".format(function_name, function_content) >>> + >>> + def _get_usdt_c_code(self, function_name, function_content): >>> + return """ >>> +int {}(struct pt_regs *ctx) {{ >>> + {} >>> +}} >>> +""".format(function_name, function_content) >>> + >>> + def get_c_code(self, function_name, function_content): >>> + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': >>> + return self._get_kprobe_c_code(function_name, function_content) >>> + elif self.probe_type == 'trace': >>> + return self._get_trace_c_code(function_name, function_content) >>> + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': >>> + return self._get_uprobe_c_code(function_name, function_content) >>> + elif self.probe_type == 'usdt': >>> + return self._get_usdt_c_code(function_name, function_content) >>> + >>> + return "" >>> + >>> + def probe_name(self): >>> + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': >>> + return "{}".format(self.kprobe_function) >>> + elif self.probe_type == 'trace': >>> + return "{}:{}".format(self.trace_system, >>> + self.trace_event) >>> + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': >>> + return "{}".format(self.uprobe_function) >>> + elif self.probe_type == 'usdt': >>> + return "{}:{}".format(self.usdt_provider, >>> + self.usdt_probe) >>> + >>> + return "" >>> + >>> + >>> +# >>> +# event_to_dict() >>> +# >>> +def event_to_dict(event): >>> + return dict([(field, getattr(event, field)) >>> + for (field, _) in event._fields_ >>> + if isinstance(getattr(event, field), (int, bytes))]) >>> + >>> + >>> +# >>> +# Event enum >>> +# >>> +Event = IntEnum("Event", ["SYSCALL", "START_TRIGGER", "STOP_TRIGGER"], >>> + start=0) 
>>> + >>> + >>> +# >>> +# process_event() >>> +# >>> +def process_event(ctx, data, size): >>> + global start_trigger_ts >>> + global stop_trigger_ts >>> + >>> + event = bpf['events'].event(data) >>> + if event.id == Event.SYSCALL: >>> + syscall_events.append({"tid": event.tid, >>> + "ts_entry": event.entry_ts, >>> + "ts_exit": event.ts, >>> + "syscall": event.syscall, >>> + "user_stack_id": event.user_stack_id, >>> + "kernel_stack_id": event.kernel_stack_id}) >>> + elif event.id == Event.START_TRIGGER: >>> + # >>> + # This event would have started the trigger already, so all we need to >>> + # do is record the start timestamp. >>> + # >>> + start_trigger_ts = event.ts >>> + >>> + elif event.id == Event.STOP_TRIGGER: >>> + # >>> + # This event would have stopped the trigger already, so all we need to >>> + # do is record the stop timestamp. >>> + stop_trigger_ts = event.ts >>> + >>> + >>> +# >>> +# next_power_of_two() >>> +# >>> +def next_power_of_two(val): >>> + np = 1 >>> + while np < val: >>> + np *= 2 >>> + return np >>> + >>> + >>> +# >>> +# unsigned_int() >>> +# >>> +def unsigned_int(value): >>> + try: >>> + value = int(value) >>> + except ValueError: >>> + raise argparse.ArgumentTypeError("must be an integer") >>> + >>> + if value < 0: >>> + raise argparse.ArgumentTypeError("must be positive") >>> + return value >>> + >>> + >>> +# >>> +# unsigned_nonzero_int() >>> +# >>> +def unsigned_nonzero_int(value): >>> + value = unsigned_int(value) >>> + if value == 0: >>> + raise argparse.ArgumentTypeError("must be nonzero") >>> + return value >>> + >>> + >>> +# >>> +# get_thread_name() >>> +# >>> +def get_thread_name(pid, tid): >>> + try: >>> + with open(f"/proc/{pid}/task/{tid}/comm", encoding="utf8") as f: >>> + return f.readline().strip("\n") >>> + except FileNotFoundError: >>> + pass >>> + >>> + return f"<unknown:{pid}/{tid}>" >>> + >>> + >>> +# >>> +# get_vec_nr_name() >>> +# >>> +def get_vec_nr_name(vec_nr): >>> + known_vec_nr = ["hi", "timer", "net_tx", 
"net_rx", "block", "irq_poll", >>> + "tasklet", "sched", "hrtimer", "rcu"] >>> + >>> + if vec_nr < 0 or vec_nr > len(known_vec_nr): >>> + return f"<unknown:{vec_nr}>" >>> + >>> + return known_vec_nr[vec_nr] >>> + >>> + >>> +# >>> +# start/stop/reset capture >>> +# >>> +def start_capture(): >>> + bpf["capture_on"][ct.c_int(0)] = ct.c_int(1) >>> + >>> + >>> +def stop_capture(force=False): >>> + if force: >>> + bpf["capture_on"][ct.c_int(0)] = ct.c_int(0xffff) >>> + else: >>> + bpf["capture_on"][ct.c_int(0)] = ct.c_int(0) >>> + >>> + >>> +def capture_running(): >>> + return bpf["capture_on"][ct.c_int(0)].value == 1 >>> + >>> + >>> +def reset_capture(): >>> + bpf["syscall_start"].clear() >>> + bpf["syscall_data"].clear() >>> + bpf["run_start"].clear() >>> + bpf["run_data"].clear() >>> + bpf["ready_start"].clear() >>> + bpf["ready_data"].clear() >>> + bpf["hardirq_start"].clear() >>> + bpf["hardirq_data"].clear() >>> + bpf["softirq_start"].clear() >>> + bpf["softirq_data"].clear() >>> + bpf["stack_traces"].clear() >>> + >>> + >>> +# >>> +# Display timestamp >>> +# >>> +def print_timestamp(msg): >>> + ltz = datetime.datetime.now() >>> + utc = ltz.astimezone(pytz.utc) >>> + time_string = "{} @{} ({} UTC)".format( >>> + msg, ltz.isoformat(), utc.strftime("%H:%M:%S")) >>> + print(time_string) >>> + >>> + >>> +# >>> +# process_results() >>> +# >>> +def process_results(syscall_events=None, trigger_delta=None): >>> + if trigger_delta: >>> + print_timestamp("# Triggered sample dump, stop-start delta {:,} ns". >>> + format(trigger_delta)) >>> + else: >>> + print_timestamp("# Sample dump") >>> + >>> + # >>> + # First get a list of all threads we need to report on. 
>>> + # >>> + threads_syscall = {k.tid for k, _ in bpf["syscall_data"].items() >>> + if k.syscall != 0xffffffff} >>> + >>> + threads_run = {k.tid for k, _ in bpf["run_data"].items() >>> + if k.pid != 0xffffffff} >>> + >>> + threads_ready = {k.tid for k, _ in bpf["ready_data"].items() >>> + if k.pid != 0xffffffff} >>> + >>> + threads_hardirq = {k.tid for k, _ in bpf["hardirq_data"].items() >>> + if k.pid != 0xffffffff} >>> + >>> + threads_softirq = {k.tid for k, _ in bpf["softirq_data"].items() >>> + if k.pid != 0xffffffff} >>> + >>> + threads = sorted(threads_syscall | threads_run | threads_ready | >>> + threads_hardirq | threads_softirq, >>> + key=lambda x: get_thread_name(options.pid, x)) >>> + >>> + # >>> + # Print header... >>> + # >>> + print("{:10} {:16} {}".format("TID", "THREAD", "<RESOURCE SPECIFIC>")) >>> + print("{:10} {:16} {}".format("-" * 10, "-" * 16, "-" * 76)) >>> + indent = 28 * " " >>> + >>> + # >>> + # Print all events/statistics per threads. >>> + # >>> + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0] >>> + for thread in threads: >>> + >>> + if thread != threads[0]: >>> + print("") >>> + >>> + # >>> + # SYSCALL_STATISTICS >>> + # >>> + print("{:10} {:16} {}\n{}{:20} {:>6} {:>10} {:>16} {:>16}".format( >>> + thread, get_thread_name(options.pid, thread), >>> + "[SYSCALL STATISTICS]", indent, >>> + "NAME", "NUMBER", "COUNT", "TOTAL ns", "MAX ns")) >>> + >>> + total_count = 0 >>> + total_ns = 0 >>> + for k, v in sorted(filter(lambda t: t[0].tid == thread, >>> + bpf["syscall_data"].items()), >>> + key=lambda kv: -kv[1].total_ns): >>> + >>> + print("{}{:20.20} {:6} {:10} {:16,} {:16,}".format( >>> + indent, syscall_name(k.syscall).decode('utf-8'), k.syscall, >>> + v.count, v.total_ns, v.worst_ns)) >>> + if k.syscall != poll_id: >>> + total_count += v.count >>> + total_ns += v.total_ns >>> + >>> + if total_count > 0: >>> + print("{}{:20.20} {:6} {:10} {:16,}".format( >>> + indent, "TOTAL( - poll):", "", total_count, total_ns)) >>> + >>> 
+ # >>> + # THREAD RUN STATISTICS >>> + # >>> + print("\n{:10} {:16} {}\n{}{:10} {:>16} {:>16} {:>16}".format( >>> + "", "", "[THREAD RUN STATISTICS]", indent, >>> + "SCHED_CNT", "TOTAL ns", "MIN ns", "MAX ns")) >>> + >>> + for k, v in filter(lambda t: t[0].tid == thread, >>> + bpf["run_data"].items()): >>> + >>> + print("{}{:10} {:16,} {:16,} {:16,}".format( >>> + indent, v.count, v.total_ns, v.min_ns, v.max_ns)) >>> + >>> + # >>> + # THREAD READY STATISTICS >>> + # >>> + print("\n{:10} {:16} {}\n{}{:10} {:>16} {:>16}".format( >>> + "", "", "[THREAD READY STATISTICS]", indent, >>> + "SCHED_CNT", "TOTAL ns", "MAX ns")) >>> + >>> + for k, v in filter(lambda t: t[0].tid == thread, >>> + bpf["ready_data"].items()): >>> + >>> + print("{}{:10} {:16,} {:16,}".format( >>> + indent, v.count, v.total_ns, v.worst_ns)) >>> + >>> + # >>> + # HARD IRQ STATISTICS >>> + # >>> + total_ns = 0 >>> + total_count = 0 >>> + header_printed = False >>> + for k, v in sorted(filter(lambda t: t[0].tid == thread, >>> + bpf["hardirq_data"].items()), >>> + key=lambda kv: -kv[1].total_ns): >>> + >>> + if not header_printed: >>> + print("\n{:10} {:16} {}\n{}{:20} {:>10} {:>16} {:>16}". 
>>> + format("", "", "[HARD IRQ STATISTICS]", indent, >>> + "NAME", "COUNT", "TOTAL ns", "MAX ns")) >>> + header_printed = True >>> + >>> + print("{}{:20.20} {:10} {:16,} {:16,}".format( >>> + indent, k.irq_name.decode('utf-8'), >>> + v.count, v.total_ns, v.worst_ns)) >>> + >>> + total_count += v.count >>> + total_ns += v.total_ns >>> + >>> + if total_count > 0: >>> + print("{}{:20.20} {:10} {:16,}".format( >>> + indent, "TOTAL:", total_count, total_ns)) >>> + >>> + # >>> + # SOFT IRQ STATISTICS >>> + # >>> + total_ns = 0 >>> + total_count = 0 >>> + header_printed = False >>> + for k, v in sorted(filter(lambda t: t[0].tid == thread, >>> + bpf["softirq_data"].items()), >>> + key=lambda kv: -kv[1].total_ns): >>> + >>> + if not header_printed: >>> + print("\n{:10} {:16} {}\n" >>> + "{}{:20} {:>7} {:>10} {:>16} {:>16}". >>> + format("", "", "[SOFT IRQ STATISTICS]", indent, >>> + "NAME", "VECT_NR", "COUNT", "TOTAL ns", "MAX ns")) >>> + header_printed = True >>> + >>> + print("{}{:20.20} {:>7} {:10} {:16,} {:16,}".format( >>> + indent, get_vec_nr_name(k.vec_nr), k.vec_nr, >>> + v.count, v.total_ns, v.worst_ns)) >>> + >>> + total_count += v.count >>> + total_ns += v.total_ns >>> + >>> + if total_count > 0: >>> + print("{}{:20.20} {:7} {:10} {:16,}".format( >>> + indent, "TOTAL:", "", total_count, total_ns)) >>> + >>> + # >>> + # Print events >>> + # >>> + lost_stack_traces = 0 >>> + if syscall_events: >>> + stack_traces = bpf.get_table("stack_traces") >>> + >>> + print("\n\n# SYSCALL EVENTS:" >>> + "\n{}{:>19} {:>19} {:>10} {:16} {:>10} {}".format( >>> + 2 * " ", "ENTRY (ns)", "EXIT (ns)", "TID", "COMM", >>> + "DELTA (us)", "SYSCALL")) >>> + print("{}{:19} {:19} {:10} {:16} {:10} {}".format( >>> + 2 * " ", "-" * 19, "-" * 19, "-" * 10, "-" * 16, >>> + "-" * 10, "-" * 16)) >>> + for event in syscall_events: >>> + print("{}{:19} {:19} {:10} {:16} {:10,} {}".format( >>> + " " * 2, >>> + event["ts_entry"], event["ts_exit"], event["tid"], >>> + get_thread_name(options.pid, 
event["tid"]), >>> + int((event["ts_exit"] - event["ts_entry"]) / 1000), >>> + syscall_name(event["syscall"]).decode('utf-8'))) >>> + # >>> + # Not sure where to put this, but I'll add some info on stack >>> + # traces here... Userspace stack traces are very limited due to >>> + # the fact that bcc does not support dwarf backtraces. As OVS >>> + # gets compiled without frame pointers we will not see much. >>> + # If however, OVS does get built with frame pointers, we should not >>> + # use the BPF_STACK_TRACE_BUILDID as it does not seem to handle >>> + # the debug symbols correctly. Also, note that for kernel >>> + # traces you should not use BPF_STACK_TRACE_BUILDID, so two >>> + # buffers are needed. >>> + # >>> + # Some info on manual dwarf walk support: >>> + # https://github.com/iovisor/bcc/issues/3515 >>> + # https://github.com/iovisor/bcc/pull/4463 >>> + # >>> + if options.stack_trace_size == 0: >>> + continue >>> + >>> + if event['kernel_stack_id'] < 0 or event['user_stack_id'] < 0: >>> + lost_stack_traces += 1 >>> + >>> + kernel_stack = stack_traces.walk(event['kernel_stack_id']) \ >>> + if event['kernel_stack_id'] >= 0 else [] >>> + user_stack = stack_traces.walk(event['user_stack_id']) \ >>> + if event['user_stack_id'] >= 0 else [] >>> + >>> + for addr in kernel_stack: >>> + print("{}{}".format( >>> + " " * 10, >>> + bpf.ksym(addr, show_module=True, >>> + show_offset=True).decode('utf-8', 'replace'))) >>> + >>> + for addr in user_stack: >>> + addr_str = bpf.sym(addr, options.pid, show_module=True, >>> + show_offset=True).decode('utf-8', 'replace') >>> + >>> + if addr_str == "[unknown]": >>> + addr_str += " 0x{:x}".format(addr) >>> + >>> + print("{}{}".format(" " * 10, addr_str)) >>> + >>> + # >>> + # Print any footer messages. 
>>> + #
>>> + if lost_stack_traces > 0:
>>> + print("\n# WARNING: We were not able to display {} stack traces!\n"
>>> + "# Consider increasing the stack trace size using\n"
>>> + "# the '--stack-trace-size' option.\n"
>>> + "# Note that this can also happen due to a stack id\n"
>>> + "# collision.".format(lost_stack_traces))
>>> +
>>> +
>>> +#
>>> +# main()
>>> +#
>>> +def main():
>>> + #
>>> + # Don't like these globals, but ctx passing does not seem to work with the
>>> + # existing open_ring_buffer() API :(
>>> + #
>>> + global bpf
>>> + global options
>>> + global syscall_events
>>> + global start_trigger_ts
>>> + global stop_trigger_ts
>>> +
>>> + start_trigger_ts = 0
>>> + stop_trigger_ts = 0
>>> +
>>> + #
>>> + # Argument parsing
>>> + #
>>> + parser = argparse.ArgumentParser()
>>> +
>>> + parser.add_argument("-D", "--debug",
>>> + help="Enable eBPF debugging",
>>> + type=int, const=0x3f, default=0, nargs='?')
>>> + parser.add_argument("-p", "--pid", metavar="VSWITCHD_PID",
>>> + help="ovs-vswitchd's PID",
>>> + type=unsigned_int, default=None)
>>> + parser.add_argument("-s", "--syscall-events", metavar="DURATION_NS",
>>> + help="Record syscall events that take longer than "
>>> + "DURATION_NS. 
Omit the duration value to record all " >>> + "syscall events", >>> + type=unsigned_int, const=0, default=None, nargs='?') >>> + parser.add_argument("--buffer-page-count", >>> + help="Number of BPF ring buffer pages, default 1024", >>> + type=unsigned_int, default=1024, metavar="NUMBER") >>> + parser.add_argument("--sample-count", >>> + help="Number of sample runs, default 1", >>> + type=unsigned_nonzero_int, default=1, metavar="RUNS") >>> + parser.add_argument("--sample-interval", >>> + help="Delay between sample runs, default 0", >>> + type=float, default=0, metavar="SECONDS") >>> + parser.add_argument("--sample-time", >>> + help="Sample time, default 0.5 seconds", >>> + type=float, default=0.5, metavar="SECONDS") >>> + parser.add_argument("--skip-syscall-poll-events", >>> + help="Skip poll() syscalls with --syscall-events", >>> + action="store_true") >>> + parser.add_argument("--stack-trace-size", >>> + help="Number of unique stack traces that can be " >>> + "recorded, default 4096. 0 to disable", >>> + type=unsigned_int, default=4096) >>> + parser.add_argument("--start-trigger", metavar="TRIGGER", >>> + help="Start trigger, see documentation for details", >>> + type=str, default=None) >>> + parser.add_argument("--stop-trigger", metavar="TRIGGER", >>> + help="Stop trigger, see documentation for details", >>> + type=str, default=None) >>> + parser.add_argument("--trigger-delta", metavar="DURATION_NS", >>> + help="Only report event when the trigger duration > " >>> + "DURATION_NS, default 0 (all events)", >>> + type=unsigned_int, const=0, default=0, nargs='?') >>> + >>> + options = parser.parse_args() >>> + >>> + # >>> + # Find the PID of the ovs-vswitchd daemon if not specified. 
>>> + #
>>> + if not options.pid:
>>> + for proc in psutil.process_iter():
>>> + if 'ovs-vswitchd' in proc.name():
>>> + if options.pid:
>>> + print("ERROR: Multiple ovs-vswitchd daemons running, "
>>> + "use the -p option!")
>>> + sys.exit(os.EX_NOINPUT)
>>> +
>>> + options.pid = proc.pid
>>> +
>>> + #
>>> + # Error checking on input parameters.
>>> + #
>>> + if not options.pid:
>>> + print("ERROR: Failed to find ovs-vswitchd's PID!")
>>> + sys.exit(os.EX_UNAVAILABLE)
>>> +
>>> + options.buffer_page_count = next_power_of_two(options.buffer_page_count)
>>> +
>>> + #
>>> + # Make sure we are running as root, or else we can not attach the probes.
>>> + #
>>> + if os.geteuid() != 0:
>>> + print("ERROR: We need to run as root to attach probes!")
>>> + sys.exit(os.EX_NOPERM)
>>> +
>>> + #
>>> + # Set up the start/stop triggers.
>>> + #
>>> + if options.start_trigger is not None:
>>> + try:
>>> + start_trigger = Probe(options.start_trigger, pid=options.pid)
>>> + except ValueError as e:
>>> + print(f"ERROR: Invalid start trigger {str(e)}")
>>> + sys.exit(os.EX_CONFIG)
>>> + else:
>>> + start_trigger = None
>>> +
>>> + if options.stop_trigger is not None:
>>> + try:
>>> + stop_trigger = Probe(options.stop_trigger, pid=options.pid)
>>> + except ValueError as e:
>>> + print(f"ERROR: Invalid stop trigger {str(e)}")
>>> + sys.exit(os.EX_CONFIG)
>>> + else:
>>> + stop_trigger = None
>>> +
>>> + #
>>> + # Attach probe to running process. 
>>> + #
>>> + source = EBPF_SOURCE.replace("<EVENT_ENUM>", "\n".join(
>>> + [" EVENT_{} = {},".format(
>>> + event.name, event.value) for event in Event]))
>>> + source = source.replace("<BUFFER_PAGE_CNT>",
>>> + str(options.buffer_page_count))
>>> + source = source.replace("<MONITOR_PID>", str(options.pid))
>>> +
>>> + if BPF.kernel_struct_has_field(b'task_struct', b'state') == 1:
>>> + source = source.replace('<STATE_FIELD>', 'state')
>>> + else:
>>> + source = source.replace('<STATE_FIELD>', '__state')
>>> +
>>> + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0]
>>> + if options.syscall_events is None:
>>> + syscall_trace_events = "false"
>>> + elif options.syscall_events == 0:
>>> + if not options.skip_syscall_poll_events:
>>> + syscall_trace_events = "true"
>>> + else:
>>> + syscall_trace_events = f"args->id != {poll_id}"
>>> + else:
>>> + syscall_trace_events = "delta > {}".format(options.syscall_events)
>>> + if options.skip_syscall_poll_events:
>>> + syscall_trace_events += f" && args->id != {poll_id}"
>>> +
>>> + source = source.replace("<SYSCALL_TRACE_EVENTS>",
>>> + syscall_trace_events)
>>> +
>>> + source = source.replace("<STACK_TRACE_SIZE>",
>>> + str(options.stack_trace_size))
>>> +
>>> + source = source.replace("<STACK_TRACE_ENABLED>", "true"
>>> + if options.stack_trace_size > 0 else "false")
>>> +
>>> + #
>>> + # Handle start/stop probes
>>> + #
>>> + if start_trigger:
>>> + source = source.replace("<START_TRIGGER>",
>>> + start_trigger.get_c_code(
>>> + "start_trigger_probe",
>>> + "return start_trigger();"))
>>> + else:
>>> + source = source.replace("<START_TRIGGER>", "")
>>> +
>>> + if stop_trigger:
>>> + source = source.replace("<STOP_TRIGGER>",
>>> + stop_trigger.get_c_code(
>>> + "stop_trigger_probe",
>>> + "return stop_trigger();"))
>>> + else:
>>> + source = source.replace("<STOP_TRIGGER>", "")
>>> +
>>> + #
>>> + # Set up USDT or other probes that need handling through the BPF class. 
>>> + #
>>> + usdt = USDT(pid=int(options.pid))
>>> + try:
>>> + if start_trigger and start_trigger.probe_type == 'usdt':
>>> + usdt.enable_probe(probe=start_trigger.probe_name(),
>>> + fn_name="start_trigger_probe")
>>> + if stop_trigger and stop_trigger.probe_type == 'usdt':
>>> + usdt.enable_probe(probe=stop_trigger.probe_name(),
>>> + fn_name="stop_trigger_probe")
>>> +
>>> + except USDTException as e:
>>> + print("ERROR: {}".format(
>>> + (re.sub('^', ' ' * 7, str(e), flags=re.MULTILINE)).strip().
>>> + replace("--with-dtrace or --enable-dtrace",
>>> + "--enable-usdt-probes")))
>>> + sys.exit(os.EX_OSERR)
>>> +
>>> + bpf = BPF(text=source, usdt_contexts=[usdt], debug=options.debug)
>>> +
>>> + if start_trigger:
>>> + try:
>>> + if start_trigger.probe_type == "uprobe":
>>> + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe",
>>> + sym=start_trigger.probe_name(),
>>> + fn_name="start_trigger_probe",
>>> + pid=options.pid)
>>> +
>>> + if start_trigger.probe_type == "uretprobe":
>>> + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe",
>>> + sym=start_trigger.probe_name(),
>>> + fn_name="start_trigger_probe",
>>> + pid=options.pid)
>>> + except Exception as e:
>>> + print("ERROR: Failed attaching uprobe start trigger "
>>> + f"'{start_trigger.probe_name()}';\n {str(e)}")
>>> + sys.exit(os.EX_OSERR)
>>> +
>>> + if stop_trigger:
>>> + try:
>>> + if stop_trigger.probe_type == "uprobe":
>>> + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe",
>>> + sym=stop_trigger.probe_name(),
>>> + fn_name="stop_trigger_probe",
>>> + pid=options.pid)
>>> +
>>> + if stop_trigger.probe_type == "uretprobe":
>>> + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe",
>>> + sym=stop_trigger.probe_name(),
>>> + fn_name="stop_trigger_probe",
>>> + pid=options.pid)
>>> + except Exception as e:
>>> + print("ERROR: Failed attaching uprobe stop trigger "
>>> + f"'{stop_trigger.probe_name()}';\n {str(e)}")
>>> + sys.exit(os.EX_OSERR)
>>> +
>>> + #
>>> + # If no triggers are configured use the 
delay configuration >>> + # >>> + bpf['events'].open_ring_buffer(process_event) >>> + >>> + sample_count = 0 >>> + while sample_count < options.sample_count: >>> + sample_count += 1 >>> + syscall_events = [] >>> + >>> + if not options.start_trigger: >>> + print_timestamp("# Start sampling") >>> + start_capture() >>> + stop_time = -1 if options.stop_trigger else \ >>> + time_ns() + options.sample_time * 1000000000 >>> + else: >>> + # For start triggers the stop time depends on the start trigger >>> + # time, or depends on the stop trigger if configured. >>> + stop_time = -1 if options.stop_trigger else 0 >>> + >>> + while True: >>> + keyboard_interrupt = False >>> + try: >>> + last_start_ts = start_trigger_ts >>> + last_stop_ts = stop_trigger_ts >>> + >>> + if stop_time > 0: >>> + delay = int((stop_time - time_ns()) / 1000000) >>> + if delay <= 0: >>> + break >>> + else: >>> + delay = -1 >>> + >>> + bpf.ring_buffer_poll(timeout=delay) >>> + >>> + if stop_time <= 0 and last_start_ts != start_trigger_ts: >>> + print_timestamp( >>> + "# Start sampling (trigger@{})".format( >>> + start_trigger_ts)) >>> + >>> + if not options.stop_trigger: >>> + stop_time = time_ns() + \ >>> + options.sample_time * 1000000000 >>> + >>> + if last_stop_ts != stop_trigger_ts: >>> + break >>> + >>> + except KeyboardInterrupt: >>> + keyboard_interrupt = True >>> + break >>> + >>> + if options.stop_trigger and not capture_running(): >>> + print_timestamp("# Stop sampling (trigger@{})".format( >>> + stop_trigger_ts)) >>> + else: >>> + print_timestamp("# Stop sampling") >>> + >>> + if stop_trigger_ts != 0 and start_trigger_ts != 0: >>> + trigger_delta = stop_trigger_ts - start_trigger_ts >>> + else: >>> + trigger_delta = None >>> + >>> + if not trigger_delta or trigger_delta >= options.trigger_delta: >>> + stop_capture(force=True) # Prevent a new trigger to start. 
>>> + process_results(syscall_events=syscall_events, >>> + trigger_delta=trigger_delta) >>> + elif trigger_delta: >>> + sample_count -= 1 >>> + print_timestamp("# Sample dump skipped, delta {:,} ns".format( >>> + trigger_delta)) >>> + >>> + reset_capture() >>> + stop_capture() >>> + >>> + if keyboard_interrupt: >>> + break >>> + >>> + if options.sample_interval > 0: >>> + time.sleep(options.sample_interval) >>> + >>> + # >>> + # Report lost events. >>> + # >>> + dropcnt = bpf.get_table("dropcnt") >>> + for k in dropcnt.keys(): >>> + count = dropcnt.sum(k).value >>> + if k.value == 0 and count > 0: >>> + print("\n# WARNING: Not all events were captured, {} were " >>> + "dropped!\n# Increase the BPF ring buffer size " >>> + "with the --buffer-page-count option.".format(count)) >>> + >>> + if (options.sample_count > 1): >>> + trigger_miss = bpf.get_table("trigger_miss") >>> + for k in trigger_miss.keys(): >>> + count = trigger_miss.sum(k).value >>> + if k.value == 0 and count > 0: >>> + print("\n# WARNING: Not all start triggers were successful. " >>> + "{} were missed due to\n# slow userspace " >>> + "processing!".format(count)) >>> + >>> + >>> +# >>> +# Start main() as the default entry point... >>> +# >>> +if __name__ == '__main__': >>> + main() >>> diff --git a/utilities/usdt-scripts/kernel_delay.rst > b/utilities/usdt-scripts/kernel_delay.rst >>> new file mode 100644 >>> index 000000000..0ebd30afb >>> --- /dev/null >>> +++ b/utilities/usdt-scripts/kernel_delay.rst >>> @@ -0,0 +1,596 @@ >>> +Troubleshooting Open vSwitch: Is the kernel to blame? >>> +===================================================== >>> +Often, when troubleshooting Open vSwitch (OVS) in the field, you might be left >>> +wondering if the issue is really OVS-related, or if it's a problem with the >>> +kernel being overloaded. 
Messages in the log like
>>> +``Unreasonably long XXXXms poll interval`` might suggest it's OVS, but from
>>> +experience, these are mostly related to an overloaded Linux kernel.
>>> +The kernel_delay.py tool can help you quickly identify if the focus of your
>>> +investigation should be OVS or the Linux kernel.
>>> +
>>> +
>>> +Introduction
>>> +------------
>>> +``kernel_delay.py`` consists of a Python script that uses the BCC [#BCC]_
>>> +framework to install eBPF probes. The data the eBPF probes collect will be
>>> +analyzed and presented to the user by the Python script. Some of the presented
>>> +data can also be captured by the individual scripts included in the BCC [#BCC]_
>>> +framework.
>>> +
>>> +kernel_delay.py has two modes of operation:
>>> +
>>> +- In **time mode**, the tool runs for a specific time and collects the
>>> + information.
>>> +- In **trigger mode**, event collection can be started and/or stopped based on
>>> + a specific eBPF probe. Currently, the following probes are supported:
>>> + - USDT probes
>>> + - Kernel tracepoints
>>> + - kprobe
>>> + - kretprobe
>>> + - uprobe
>>> + - uretprobe
>>> +
>>> +
>>> +In addition, the ``--sample-count`` option specifies how many
>>> +iterations you would like to do. When using triggers, you can also ignore
>>> +samples if they are less than a number of nanoseconds with the
>>> +``--trigger-delta`` option. The latter might be useful when debugging Linux
>>> +syscalls which take a long time to complete. More on this later. Finally, you
>>> +can configure the delay between two sample runs with the ``--sample-interval``
>>> +option.
>>> +
>>> +Before getting into more details, you can run the tool without any options
>>> +to see what the output looks like. Notice that it will try to automatically
>>> +get the process ID of the running ``ovs-vswitchd``. You can override this
>>> +with the ``--pid`` option.
>>> +
>>> +.. 
code-block:: console >>> + >>> + $ sudo ./kernel_delay.py >>> + # Start sampling @2023-06-08T12:17:22.725127 (10:17:22 UTC) >>> + # Stop sampling @2023-06-08T12:17:23.224781 (10:17:23 UTC) >>> + # Sample dump @2023-06-08T12:17:23.224855 (10:17:23 UTC) >>> + TID THREAD <RESOURCE SPECIFIC> >>> + ---------- ---------------- > ---------------------------------------------------------------------------- >>> + 27090 ovs-vswitchd [SYSCALL STATISTICS] >>> + <EDIT: REMOVED DATA FOR ovs-vswitchd THREAD> >>> + >>> + 31741 revalidator122 [SYSCALL STATISTICS] >>> + NAME NUMBER COUNT TOTAL ns MAX ns >>> + poll 7 5 184,193,176 184,191,520 >>> + recvmsg 47 494 125,208,756 310,331 >>> + futex 202 8 18,768,758 4,023,039 >>> + sendto 44 10 375,861 266,867 >>> + sendmsg 46 4 43,294 11,213 >>> + write 1 1 5,949 5,949 >>> + getrusage 98 1 1,424 1,424 >>> + read 0 1 1,292 1,292 >>> + TOTAL( - poll): 519 144,405,334 >>> + >>> + [THREAD RUN STATISTICS] >>> + SCHED_CNT TOTAL ns MIN ns MAX ns >>> + 6 136,764,071 1,480 115,146,424 >>> + >>> + [THREAD READY STATISTICS] >>> + SCHED_CNT TOTAL ns MAX ns >>> + 7 11,334 6,636 >>> + >>> + [HARD IRQ STATISTICS] >>> + NAME COUNT TOTAL ns MAX ns >>> + eno8303-rx-1 1 3,586 3,586 >>> + TOTAL: 1 3,586 >>> + >>> + [SOFT IRQ STATISTICS] >>> + NAME VECT_NR COUNT TOTAL ns MAX ns >>> + net_rx 3 1 17,699 17,699 >>> + sched 7 6 13,820 3,226 >>> + rcu 9 16 13,586 1,554 >>> + timer 1 3 10,259 3,815 >>> + TOTAL: 26 55,364 >>> + >>> + >>> +By default, the tool will run for half a second in `time mode`. To extend this >>> +you can use the ``--sample-time`` option. >>> + >>> + >>> +What will it report >>> +------------------- >>> +The above sample output separates the captured data on a per-thread basis. >>> +For this, it displays the thread's id (``TID``) and name (``THREAD``), >>> +followed by resource-specific data. 
These are:
>>> +
>>> +- ``SYSCALL STATISTICS``
>>> +- ``THREAD RUN STATISTICS``
>>> +- ``THREAD READY STATISTICS``
>>> +- ``HARD IRQ STATISTICS``
>>> +- ``SOFT IRQ STATISTICS``
>>> +
>>> +The following sections will describe in detail what statistics they report.
>>> +
>>> +
>>> +``SYSCALL STATISTICS``
>>> +~~~~~~~~~~~~~~~~~~~~~~
>>> +``SYSCALL STATISTICS`` tell you which Linux system calls got executed during
>>> +the measurement interval. This includes the number of times the syscall was
>>> +called (``COUNT``), the total time spent in the system calls (``TOTAL ns``),
>>> +and the worst-case duration of a single call (``MAX ns``).
>>> +
>>> +It also shows the total of all system calls, but it excludes the poll system
>>> +call, as the purpose of this call is to wait for activity on a set of sockets,
>>> +and usually, the thread gets swapped out.
>>> +
>>> +Note that it only counts calls that started and stopped during the
>>> +measurement interval!
>>> +
>>> +
>>> +``THREAD RUN STATISTICS``
>>> +~~~~~~~~~~~~~~~~~~~~~~~~~
>>> +``THREAD RUN STATISTICS`` tell you how long the thread was running on a CPU
>>> +during the measurement interval.
>>> +
>>> +Note that these statistics only count events where the thread started and
>>> +stopped running on a CPU during the measurement interval. For example, if
>>> +this was a PMD thread, you should see zero ``SCHED_CNT`` and ``TOTAL_ns``.
>>> +If not, there might be a misconfiguration.
>>> +
>>> +
>>> +``THREAD READY STATISTICS``
>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> +``THREAD READY STATISTICS`` tell you the time between the thread being ready
>>> +to run and it actually running on the CPU.
>>> +
>>> +Note that these statistics only count events where the thread was getting
>>> +ready to run and started running during the measurement interval. 
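The ready-time accounting described above can be illustrated with a small standalone sketch: the eBPF side effectively timestamps the wakeup (sched_wakeup) and computes the delta when the thread is switched onto a CPU (sched_switch). The event stream below uses hypothetical nanosecond values, not real trace data:

```python
# Hypothetical event stream: (event, timestamp in ns). The real tool
# records these from the scheduler tracepoints; here they are
# hand-picked values for illustration.
events = [
    ("wakeup", 1_000),
    ("switch_in", 2_480),
    ("wakeup", 10_000),
    ("switch_in", 16_636),
]

count = 0
total_ns = 0
worst_ns = 0
ready_since = None
for kind, ts in events:
    if kind == "wakeup":
        ready_since = ts          # thread became runnable
    elif kind == "switch_in" and ready_since is not None:
        delta = ts - ready_since  # time spent waiting for a CPU
        count += 1
        total_ns += delta
        worst_ns = max(worst_ns, delta)
        ready_since = None
```

A wakeup without a matching switch-in inside the interval is simply dropped, which matches the note above that only complete ready/run pairs are counted.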
>>> +
>>> +
>>> +``HARD IRQ STATISTICS``
>>> +~~~~~~~~~~~~~~~~~~~~~~~
>>> +``HARD IRQ STATISTICS`` tell you how much time was spent servicing hard
>>> +interrupts during the thread's run time.
>>> +
>>> +It shows the interrupt name (``NAME``), the number of interrupts (``COUNT``),
>>> +the total time spent in the interrupt handler (``TOTAL ns``), and the
>>> +worst-case duration (``MAX ns``).
>>> +
>>> +
>>> +``SOFT IRQ STATISTICS``
>>> +~~~~~~~~~~~~~~~~~~~~~~~
>>> +``SOFT IRQ STATISTICS`` tell you how much time was spent servicing soft
>>> +interrupts during the thread's run time.
>>> +
>>> +It shows the interrupt name (``NAME``), vector number (``VECT_NR``), the
>>> +number of interrupts (``COUNT``), the total time spent in the interrupt
>>> +handler (``TOTAL ns``), and the worst-case duration (``MAX ns``).
>>> +
>>> +
>>> +The ``--syscall-events`` option
>>> +-------------------------------
>>> +In addition to reporting global syscall statistics in ``SYSCALL_STATISTICS``,
>>> +the tool can also report each individual syscall. This can be a useful
>>> +second step if the ``SYSCALL_STATISTICS`` show high latency numbers.
>>> +
>>> +All you need to do is add the ``--syscall-events`` option, with or without
>>> +the additional ``DURATION_NS`` parameter. The ``DURATION_NS`` parameter
>>> +allows you to exclude events that take less than the supplied time.
>>> +
>>> +The ``--skip-syscall-poll-events`` option allows you to exclude poll
>>> +syscalls from the report.
>>> +
>>> +Below is an example run, note that the resource-specific data is removed
>>> +to highlight the syscall events:
>>> +
>>> +.. 
code-block:: console >>> + >>> + $ sudo ./kernel_delay.py --syscall-events 50000 --skip-syscall-poll-events >>> + # Start sampling @2023-06-13T17:10:46.460874 (15:10:46 UTC) >>> + # Stop sampling @2023-06-13T17:10:46.960727 (15:10:46 UTC) >>> + # Sample dump @2023-06-13T17:10:46.961033 (15:10:46 UTC) >>> + TID THREAD <RESOURCE SPECIFIC> >>> + ---------- ---------------- > ---------------------------------------------------------------------------- >>> + 3359686 ipf_clean2 [SYSCALL STATISTICS] >>> + ... >>> + 3359635 ovs-vswitchd [SYSCALL STATISTICS] >>> + ... >>> + 3359697 revalidator12 [SYSCALL STATISTICS] >>> + ... >>> + 3359698 revalidator13 [SYSCALL STATISTICS] >>> + ... >>> + 3359699 revalidator14 [SYSCALL STATISTICS] >>> + ... >>> + 3359700 revalidator15 [SYSCALL STATISTICS] >>> + ... >>> + >>> + # SYSCALL EVENTS: >>> + ENTRY (ns) EXIT (ns) TID COMM DELTA (us) SYSCALL >>> + ------------------- ------------------- ---------- > ---------------- ---------- ---------------- >>> + 2161821694935486 2161821695031201 3359699 revalidator14 95 futex >>> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >>> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >>> + syscall_exit_to_user_mode+0x9 [kernel] >>> + do_syscall_64+0x68 [kernel] >>> + entry_SYSCALL_64_after_hwframe+0x72 [kernel] >>> + __GI___lll_lock_wait+0x30 [libc.so.6] >>> + ovs_mutex_lock_at+0x18 [ovs-vswitchd] >>> + [unknown] 0x696c003936313a63 >>> + 2161821695276882 2161821695333687 3359698 revalidator13 56 futex >>> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >>> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >>> + syscall_exit_to_user_mode+0x9 [kernel] >>> + do_syscall_64+0x68 [kernel] >>> + entry_SYSCALL_64_after_hwframe+0x72 [kernel] >>> + __GI___lll_lock_wait+0x30 [libc.so.6] >>> + ovs_mutex_lock_at+0x18 [ovs-vswitchd] >>> + [unknown] 0x696c003134313a63 >>> + 2161821695275820 2161821695405733 3359700 revalidator15 129 futex >>> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >>> + 
syscall_exit_to_user_mode_prepare+0x161 [kernel]
>>> + syscall_exit_to_user_mode+0x9 [kernel]
>>> + do_syscall_64+0x68 [kernel]
>>> + entry_SYSCALL_64_after_hwframe+0x72 [kernel]
>>> + __GI___lll_lock_wait+0x30 [libc.so.6]
>>> + ovs_mutex_lock_at+0x18 [ovs-vswitchd]
>>> + [unknown] 0x696c003936313a63
>>> + 2161821695964969 2161821696052021 3359635 ovs-vswitchd 87 accept
>>> + syscall_exit_to_user_mode_prepare+0x161 [kernel]
>>> + syscall_exit_to_user_mode_prepare+0x161 [kernel]
>>> + syscall_exit_to_user_mode+0x9 [kernel]
>>> + do_syscall_64+0x68 [kernel]
>>> + entry_SYSCALL_64_after_hwframe+0x72 [kernel]
>>> + __GI_accept+0x4d [libc.so.6]
>>> + pfd_accept+0x3a [ovs-vswitchd]
>>> + [unknown] 0x7fff19f2bd00
>>> + [unknown] 0xe4b8001f0f
>>> +
>>> +As you can see above, the output also shows the backtrace. You can
>>> +disable this using the ``--stack-trace-size 0`` option.
>>> +
>>> +However, the backtrace does not show a lot of useful information
>>> +due to the BCC [#BCC]_ toolkit not supporting dwarf decoding. So to further
>>> +analyze system call backtraces, you could use perf. The following perf
>>> +script can do this for you (refer to the embedded instructions):
>>> +
>>> +https://github.com/chaudron/perf_scripts/blob/master/analyze_perf_pmd_syscall.py
>>> +
>>> +
>>> +Using triggers
>>> +--------------
>>> +The tool supports start and/or stop triggers. This allows you to capture
>>> +statistics triggered by a specific event. The following combinations of
>>> +start and stop triggers can be used.
>>> +
>>> +If you only use ``--start-trigger``, the inspection starts when the trigger
>>> +happens and runs until the ``--sample-time`` number of seconds has passed.
>>> +The example below shows all the supported options in this scenario.
>>> +
>>> +.. 
code-block:: console
>>> +
>>> + $ sudo ./kernel_delay.py --start-trigger up:bridge_run --sample-time 4 \
>>> + --sample-count 4 --sample-interval 1
>>> +
>>> +
>>> +If you only use ``--stop-trigger``, the inspection starts immediately and
>>> +stops when the trigger happens. The example below shows all the supported
>>> +options in this scenario.
>>> +
>>> +.. code-block:: console
>>> +
>>> + $ sudo ./kernel_delay.py --stop-trigger upr:bridge_run \
>>> + --sample-count 4 --sample-interval 1
>>> +
>>> +
>>> +If you use both ``--start-trigger`` and ``--stop-trigger`` triggers, the
>>> +statistics are captured between the first two occurrences of these events.
>>> +The example below shows all the supported options in this scenario.
>>> +
>>> +.. code-block:: console
>>> +
>>> + $ sudo ./kernel_delay.py --start-trigger up:bridge_run \
>>> + --stop-trigger upr:bridge_run \
>>> + --sample-count 4 --sample-interval 1 \
>>> + --trigger-delta 50000
>>> +
>>> +What triggers are supported? Note that what ``kernel_delay.py`` calls triggers,
>>> +BCC [#BCC]_ calls events; these are eBPF tracepoints you can attach to.
>>> +For more details on the supported tracepoints, check out the BCC
>>> +documentation [#BCC_EVENT]_.
>>> +
>>> +The list below shows the supported triggers and their argument format:
>>> +
>>> +**USDT probes:**
>>> + [u|usdt]:{provider}:{probe}
>>> +**Kernel tracepoint:**
>>> + [t|trace]:{system}:{event}
>>> +**kprobe:**
>>> + [k|kprobe]:{kernel_function}
>>> +**kretprobe:**
>>> + [kr|kretprobe]:{kernel_function}
>>> +**uprobe:**
>>> + [up|uprobe]:{function}
>>> +**uretprobe:**
>>> + [upr|uretprobe]:{function}
>>> +
>>> +Here are a couple of trigger examples; more use-case-specific examples can be
>>> +found in the *Examples* section.
>>> +
>>> +.. 
code-block:: console
>>> +
>>> + --start|stop-trigger u:udpif_revalidator:start_dump
>>> + --start|stop-trigger t:openvswitch:ovs_dp_upcall
>>> + --start|stop-trigger k:ovs_dp_process_packet
>>> + --start|stop-trigger kr:ovs_dp_process_packet
>>> + --start|stop-trigger up:bridge_run
>>> + --start|stop-trigger upr:bridge_run
>>> +
>>> +
>>> +Examples
>>> +--------
>>> +This section will give some examples of how to use this tool in real-world
>>> +scenarios. Let's start with the issue where Open vSwitch reports
>>> +``Unreasonably long XXXXms poll interval`` on your revalidator threads. Note
>>> +that there is a blog available explaining how the revalidator process works
>>> +in OVS [#REVAL_BLOG]_.
>>> +
>>> +First, let me explain this log message. It gets logged if the time delta
>>> +between two ``poll_block()`` calls is more than 1 second. In other words,
>>> +the process was spending a lot of time processing stuff that was made
>>> +available by the return of the ``poll_block()`` function.
>>> +
>>> +Do a run with the tool using the existing USDT revalidator probes as a start
>>> +and stop trigger (note that the resource-specific data is removed from the
>>> +non-revalidator threads):
>>> +
>>> +.. 
code-block:: console >>> + >>> + $ sudo ./kernel_delay.py --start-trigger > u:udpif_revalidator:start_dump --stop-trigger > u:udpif_revalidator:sweep_done >>> + # Start sampling (trigger@791777093512008) @2023-06-14T14:52:00.110303 (12:52:00 UTC) >>> + # Stop sampling (trigger@791778281498462) @2023-06-14T14:52:01.297975 (12:52:01 UTC) >>> + # Triggered sample dump, stop-start delta 1,187,986,454 ns > @2023-06-14T14:52:01.298021 (12:52:01 UTC) >>> + TID THREAD <RESOURCE SPECIFIC> >>> + ---------- ---------------- > ---------------------------------------------------------------------------- >>> + 1457761 handler24 [SYSCALL STATISTICS] >>> + NAME NUMBER COUNT TOTAL ns MAX ns >>> + sendmsg 46 6110 123,274,761 41,776 >>> + recvmsg 47 136299 99,397,508 49,896 >>> + futex 202 51 7,655,832 7,536,776 >>> + poll 7 4068 1,202,883 2,907 >>> + getrusage 98 2034 586,602 1,398 >>> + sendto 44 9 213,682 27,417 >>> + TOTAL( - poll): 144503 231,128,385 >>> + >>> + [THREAD RUN STATISTICS] >>> + SCHED_CNT TOTAL ns MIN ns MAX ns >>> + >>> + [THREAD READY STATISTICS] >>> + SCHED_CNT TOTAL ns MAX ns >>> + 1 1,438 1,438 >>> + >>> + [SOFT IRQ STATISTICS] >>> + NAME VECT_NR COUNT TOTAL ns MAX ns >>> + sched 7 21 59,145 3,769 >>> + rcu 9 50 42,917 2,234 >>> + TOTAL: 71 102,062 >>> + 1457733 ovs-vswitchd [SYSCALL STATISTICS] >>> + ... 
>>> + 1457792 revalidator55 [SYSCALL STATISTICS]
>>> + NAME NUMBER COUNT TOTAL ns MAX ns
>>> + futex 202 73 572,576,329 19,621,600
>>> + recvmsg 47 815 296,697,618 405,338
>>> + sendto 44 3 78,302 26,837
>>> + sendmsg 46 3 38,712 13,250
>>> + write 1 1 5,073 5,073
>>> + TOTAL( - poll): 895 869,396,034
>>> +
>>> + [THREAD RUN STATISTICS]
>>> + SCHED_CNT TOTAL ns MIN ns MAX ns
>>> + 48 394,350,393 1,729 140,455,796
>>> +
>>> + [THREAD READY STATISTICS]
>>> + SCHED_CNT TOTAL ns MAX ns
>>> + 49 23,650 1,559
>>> +
>>> + [SOFT IRQ STATISTICS]
>>> + NAME VECT_NR COUNT TOTAL ns MAX ns
>>> + sched 7 14 26,889 3,041
>>> + rcu 9 28 23,024 1,600
>>> + TOTAL: 42 49,913
>>> +
>>> +
>>> +Above you see from the start of the output that the trigger took more than a
>>> +second (1,187,986,454 ns), which was already known from the output of
>>> +the ``ovs-appctl upcall/show`` command.
>>> +
>>> +From the *revalidator55*'s ``SYSCALL STATISTICS`` statistics you can see it
>>> +spent almost 870ms handling syscalls, and there were no poll() calls being
>>> +executed. The ``THREAD RUN STATISTICS`` statistics here are a bit misleading,
>>> +as it looks like OVS only spent 394ms on the CPU. But earlier, it was mentioned
>>> +that this time does not include the time being on the CPU at the start or stop
>>> +of an event. Which is exactly the case here, because USDT probes were used.
>>> +
>>> +From the above data and maybe some ``top`` output, it can be determined that
>>> +the *revalidator55* thread is taking a lot of CPU time, probably because it
>>> +has to do a lot of revalidator work by itself. The solution here is to increase
>>> +the number of revalidator threads, so more work can be done in parallel.
>>> +
>>> +Here is another run of the same command in another scenario:
>>> +
>>> +.. 
code-block:: console >>> + >>> + $ sudo ./kernel_delay.py --start-trigger > u:udpif_revalidator:start_dump --stop-trigger > u:udpif_revalidator:sweep_done >>> + # Start sampling (trigger@795160501758971) @2023-06-14T15:48:23.518512 (13:48:23 UTC) >>> + # Stop sampling (trigger@795160764940201) @2023-06-14T15:48:23.781381 (13:48:23 UTC) >>> + # Triggered sample dump, stop-start delta 263,181,230 ns > @2023-06-14T15:48:23.781414 (13:48:23 UTC) >>> + TID THREAD <RESOURCE SPECIFIC> >>> + ---------- ---------------- > ---------------------------------------------------------------------------- >>> + 1457733 ovs-vswitchd [SYSCALL STATISTICS] >>> + ... >>> + 1457792 revalidator55 [SYSCALL STATISTICS] >>> + NAME NUMBER COUNT TOTAL ns MAX ns >>> + recvmsg 47 284 193,422,110 46,248,418 >>> + sendto 44 2 46,685 23,665 >>> + sendmsg 46 2 24,916 12,703 >>> + write 1 1 6,534 6,534 >>> + TOTAL( - poll): 289 193,500,245 >>> + >>> + [THREAD RUN STATISTICS] >>> + SCHED_CNT TOTAL ns MIN ns MAX ns >>> + 2 47,333,558 331,516 47,002,042 >>> + >>> + [THREAD READY STATISTICS] >>> + SCHED_CNT TOTAL ns MAX ns >>> + 3 87,000,403 45,999,712 >>> + >>> + [SOFT IRQ STATISTICS] >>> + NAME VECT_NR COUNT TOTAL ns MAX ns >>> + sched 7 2 9,504 5,109 >>> + TOTAL: 2 9,504 >>> + >>> + >>> +Here you can see the revalidator run took about 263ms, which does not look >>> +odd, however, the ``THREAD READY STATISTICS`` information shows that OVS was >>> +waiting 87ms for a CPU to be run on. This means the revalidator process could >>> +have finished 87ms faster. Looking at the ``MAX ns`` value, a worst-case delay >>> +of almost 46ms can be seen, which hints at an overloaded system. >>> + >>> +One final example that uses a ``uprobe`` to get some statistics on a >>> +``bridge_run()`` execution that takes more than 1ms. >>> + >>> +.. 
code-block:: console >>> + >>> + $ sudo ./kernel_delay.py --start-trigger up:bridge_run > --stop-trigger ur:bridge_run --trigger-delta 1000000 >>> + # Start sampling (trigger@2245245432101270) @2023-06-14T16:21:10.467919 (14:21:10 UTC) >>> + # Stop sampling (trigger@2245245432414656) @2023-06-14T16:21:10.468296 (14:21:10 UTC) >>> + # Sample dump skipped, delta 313,386 ns @2023-06-14T16:21:10.468419 (14:21:10 UTC) >>> + # Start sampling (trigger@2245245505301745) @2023-06-14T16:21:10.540970 (14:21:10 UTC) >>> + # Stop sampling (trigger@2245245506911119) @2023-06-14T16:21:10.542499 (14:21:10 UTC) >>> + # Triggered sample dump, stop-start delta 1,609,374 ns > @2023-06-14T16:21:10.542565 (14:21:10 UTC) >>> + TID THREAD <RESOURCE SPECIFIC> >>> + ---------- ---------------- > ---------------------------------------------------------------------------- >>> + 3371035 <unknown:3366258/3371035> [SYSCALL STATISTICS] >>> + ... <REMOVED 7 MORE unknown THREADS> >>> + 3371102 handler66 [SYSCALL STATISTICS] >>> + ... 
<REMOVED 7 MORE HANDLER THREADS> >>> + 3366258 ovs-vswitchd [SYSCALL STATISTICS] >>> + NAME NUMBER COUNT TOTAL ns MAX ns >>> + futex 202 43 403,469 199,312 >>> + clone3 435 13 174,394 30,731 >>> + munmap 11 8 115,774 21,861 >>> + poll 7 5 92,969 38,307 >>> + unlink 87 2 49,918 35,741 >>> + mprotect 10 8 47,618 13,201 >>> + accept 43 10 31,360 6,976 >>> + mmap 9 8 30,279 5,776 >>> + write 1 6 27,720 11,774 >>> + rt_sigprocmask 14 28 12,281 970 >>> + read 0 6 9,478 2,318 >>> + recvfrom 45 3 7,024 4,024 >>> + sendto 44 1 4,684 4,684 >>> + getrusage 98 5 4,594 1,342 >>> + close 3 2 2,918 1,627 >>> + recvmsg 47 1 2,722 2,722 >>> + TOTAL( - poll): 144 924,233 >>> + >>> + [THREAD RUN STATISTICS] >>> + SCHED_CNT TOTAL ns MIN ns MAX ns >>> + 13 817,605 5,433 524,376 >>> + >>> + [THREAD READY STATISTICS] >>> + SCHED_CNT TOTAL ns MAX ns >>> + 14 28,646 11,566 >>> + >>> + [SOFT IRQ STATISTICS] >>> + NAME VECT_NR COUNT TOTAL ns MAX ns >>> + rcu 9 1 2,838 2,838 >>> + TOTAL: 1 2,838 >>> + >>> + 3371110 revalidator74 [SYSCALL STATISTICS] >>> + ... <REMOVED 7 MORE NEW revalidator THREADS> >>> + 3366311 urcu3 [SYSCALL STATISTICS] >>> + ... >>> + >>> + >>> +OVS removed some of the threads and their resource-specific data, but based >>> +on the ``<unknown:3366258/3371035>`` thread name, you can determine that some >>> +threads no longer exist. In the ``ovs-vswitchd`` thread, you can see some >>> +``clone3`` syscalls, indicating threads were created. In this example, it was >>> +due to the deletion of a bridge, which resulted in the recreation of the >>> +revalidator and handler threads. >>> + >>> + >>> +Use with Openshift >>> +------------------ >>> +This section describes how you would use the tool on a node in an OpenShift >>> +cluster. It assumes you have console access to the node, either directly or >>> +through a debug container. >>> + >>> +A base fedora38 container will be used through podman, as this will allow the >>> +use of some additional tools and packages needed. 
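Before running the tool inside such a container, it can help to verify that its Python dependencies are importable. A minimal sketch (the module list mirrors the script's own imports; the helper name is illustrative):

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ModuleNotFoundError:
            missing.append(name)
    return missing


# kernel_delay.py needs pytz and psutil, plus the BCC bindings ("bcc").
print(missing_modules(["pytz", "psutil", "bcc"]))
```

An empty list means the ``dnf`` step described below provided everything the script imports; anything listed still needs installing.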
>>> +
>>> +First the containers need to be started:
>>> +
>>> +.. code-block:: console
>>> +
>>> +  [core@sno-master ~]$ sudo podman run -it --rm \
>>> +      -e PS1='[(DEBUG)\u@\h \W]\$ ' \
>>> +      --privileged --network=host --pid=host \
>>> +      -v /lib/modules:/lib/modules:ro \
>>> +      -v /sys/kernel/debug:/sys/kernel/debug \
>>> +      -v /proc:/proc \
>>> +      -v /:/mnt/rootdir \
>>> +      quay.io/fedora/fedora:38-x86_64
>>> +
>>> +  [(DEBUG)root@sno-master /]#
>>> +
>>> +
>>> +Next add the ``kernel_delay.py`` dependencies:
>>> +
>>> +.. code-block:: console
>>> +
>>> +  [(DEBUG)root@sno-master /]# dnf install -y bcc-tools perl-interpreter \
>>> +      python3-pytz python3-psutil
>>> +
>>> +
>>> +You need to install the devel, debug, and source RPMs matching your OVS
>>> +and kernel versions:
>>> +
>>> +.. code-block:: console
>>> +
>>> +  [(DEBUG)root@sno-master home]# rpm -i \
>>> +      openvswitch2.17-debuginfo-2.17.0-67.el8fdp.x86_64.rpm \
>>> +      openvswitch2.17-debugsource-2.17.0-67.el8fdp.x86_64.rpm \
>>> +      kernel-devel-4.18.0-372.41.1.el8_6.x86_64.rpm
>>> +
>>> +
>>> +Now the tool can be started. Here the above ``bridge_run()`` example is used:
>>> +
>>> +.. code-block:: console
>>> +
>>> +  [(DEBUG)root@sno-master home]# ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run
>>> +  # Start sampling (trigger@75279117343513) @2023-06-15T11:44:07.628372 (11:44:07 UTC)
>>> +  # Stop sampling (trigger@75279117443980) @2023-06-15T11:44:07.628529 (11:44:07 UTC)
>>> +  # Triggered sample dump, stop-start delta 100,467 ns @2023-06-15T11:44:07.628569 (11:44:07 UTC)
>>> +  TID        THREAD           <RESOURCE SPECIFIC>
>>> +  ---------- ---------------- ----------------------------------------------------------------------------
>>> +  1246       ovs-vswitchd     [SYSCALL STATISTICS]
>>> +                              NAME             NUMBER      COUNT      TOTAL ns        MAX ns
>>> +                              getdents64          217          2         8,560         8,162
>>> +                              openat              257          1         6,951         6,951
>>> +                              accept               43          4         6,942         3,763
>>> +                              recvfrom             45          1         3,726         3,726
>>> +                              recvmsg              47          2         2,880         2,188
>>> +                              stat                  4          2         1,946         1,384
>>> +                              close                 3          1         1,393         1,393
>>> +                              fstat                 5          1         1,324         1,324
>>> +                              TOTAL( - poll):                 14        33,722
>>> +
>>> +                              [THREAD RUN STATISTICS]
>>> +                              SCHED_CNT      TOTAL ns        MIN ns        MAX ns
>>> +
>>> +                              [THREAD READY STATISTICS]
>>> +                              SCHED_CNT      TOTAL ns        MAX ns
>>> +
>>> +
>>> +.. rubric:: Footnotes
>>> +
>>> +.. [#BCC] https://github.com/iovisor/bcc
>>> +.. [#BCC_EVENT] https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments
>>> +.. [#REVAL_BLOG] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained
>>>
>>
>> --
>> Adrián Moreno
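As an aside on the examples quoted above: the ``--trigger-delta`` skip/dump decision seen in the ``bridge_run()`` output can be modeled in a few lines of Python. This is an illustrative sketch only (the real script makes this decision around its BPF start/stop trigger events), reusing the timestamps from the sample output:

```python
def should_dump(start_ns, stop_ns, trigger_delta_ns=0):
    """Return (dump?, delta) for one completed start/stop trigger pair."""
    delta = stop_ns - start_ns
    return delta >= trigger_delta_ns, delta


# First bridge_run() pair: below the 1,000,000 ns threshold -> "Sample dump
# skipped" in the output above.
print(should_dump(2245245432101270, 2245245432414656, 1000000))  # (False, 313386)
# Second pair: above the threshold -> "Triggered sample dump".
print(should_dump(2245245505301745, 2245245506911119, 1000000))  # (True, 1609374)
```

The deltas match the 313,386 ns and 1,609,374 ns values printed by the tool.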
Eelco Chaudron <echaudro@redhat.com> writes: > This patch adds an utility that can be used to determine if > an issue is related to a lack of Linux kernel resources. > > This tool is also featured in a Red Hat developers blog article: > > https://developers.redhat.com/articles/2023/07/24/troubleshooting-open-vswitch-kernel-blame > > Signed-off-by: Eelco Chaudron <echaudro@redhat.com> > > --- Nits below, with those addressed Acked-by: Aaron Conole <aconole@redhat.com> > v2: Addressed review comments from Aaron. > v3: Changed wording in documentation. > v4: Addressed review comments from Adrian. > > utilities/automake.mk | 4 > utilities/usdt-scripts/kernel_delay.py | 1420 +++++++++++++++++++++++++++++++ > utilities/usdt-scripts/kernel_delay.rst | 596 +++++++++++++ > 3 files changed, 2020 insertions(+) > create mode 100755 utilities/usdt-scripts/kernel_delay.py > create mode 100644 utilities/usdt-scripts/kernel_delay.rst General: This code has a mix of " and ' (see for example, used vs fn_name in some blocks). I think it is preferable to use " everywhere. It is a bit ambiguous in the python coding style, and I see it used interchangeably. It would be preferable to keep it consistent here. 
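To make the quote-style nit concrete, a throwaway heuristic can locate lines still using single-quoted literals in a file that standardizes on double quotes. A sketch only (apostrophes inside comments or double-quoted strings will give false positives; a quote-checking linter plugin is the proper tool):

```python
import re


def flag_single_quotes(source):
    """Return 1-based line numbers containing a single-quoted literal."""
    return [lineno + 1 for lineno, line in enumerate(source.splitlines())
            if re.search(r"'[^']*'", line)]


# Names echo the mixed examples mentioned in the review.
sample = """used = "yes"
fn_name = 'probe'
"""
print(flag_single_quotes(sample))  # [2]
```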
> diff --git a/utilities/automake.mk b/utilities/automake.mk > index 37d679f82..9a2114df4 100644 > --- a/utilities/automake.mk > +++ b/utilities/automake.mk > @@ -23,6 +23,8 @@ scripts_DATA += utilities/ovs-lib > usdt_SCRIPTS += \ > utilities/usdt-scripts/bridge_loop.bt \ > utilities/usdt-scripts/dpif_nl_exec_monitor.py \ > + utilities/usdt-scripts/kernel_delay.py \ > + utilities/usdt-scripts/kernel_delay.rst \ > utilities/usdt-scripts/reval_monitor.py \ > utilities/usdt-scripts/upcall_cost.py \ > utilities/usdt-scripts/upcall_monitor.py > @@ -70,6 +72,8 @@ EXTRA_DIST += \ > utilities/docker/debian/build-kernel-modules.sh \ > utilities/usdt-scripts/bridge_loop.bt \ > utilities/usdt-scripts/dpif_nl_exec_monitor.py \ > + utilities/usdt-scripts/kernel_delay.py \ > + utilities/usdt-scripts/kernel_delay.rst \ > utilities/usdt-scripts/reval_monitor.py \ > utilities/usdt-scripts/upcall_cost.py \ > utilities/usdt-scripts/upcall_monitor.py > diff --git a/utilities/usdt-scripts/kernel_delay.py b/utilities/usdt-scripts/kernel_delay.py > new file mode 100755 > index 000000000..636e108be > --- /dev/null > +++ b/utilities/usdt-scripts/kernel_delay.py > @@ -0,0 +1,1420 @@ > +#!/usr/bin/env python3 > +# > +# Copyright (c) 2022,2023 Red Hat, Inc. > +# > +# Licensed under the Apache License, Version 2.0 (the "License"); > +# you may not use this file except in compliance with the License. > +# You may obtain a copy of the License at: > +# > +# http://www.apache.org/licenses/LICENSE-2.0 > +# > +# Unless required by applicable law or agreed to in writing, software > +# distributed under the License is distributed on an "AS IS" BASIS, > +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > +# See the License for the specific language governing permissions and > +# limitations under the License. 
> +# > +# > +# Script information: > +# ------------------- > +# This script allows a developer to quickly identify if the issue at hand > +# might be related to the kernel running out of resources or if it really is > +# an Open vSwitch issue. > +# > +# For documentation see the kernel_delay.rst file. > +# > +# > +# Dependencies: > +# ------------- > +# You need to install the BCC package for your specific platform or build it > +# yourself using the following instructions: > +# https://raw.githubusercontent.com/iovisor/bcc/master/INSTALL.md > +# > +# Python needs the following additional packages installed: > +# - pytz > +# - psutil > +# > +# You can either install your distribution specific package or use pip: > +# pip install pytz psutil > +# > +import argparse > +import datetime > +import os > +import pytz > +import psutil > +import re > +import sys > +import time > + > +import ctypes as ct > + > +try: > + from bcc import BPF, USDT, USDTException > + from bcc.syscall import syscalls, syscall_name > +except ModuleNotFoundError: > + print("ERROR: Can't find the BPF Compiler Collection (BCC) tools!") > + sys.exit(os.EX_OSFILE) > + > +from enum import IntEnum > + > + > +# > +# Actual eBPF source code > +# > +EBPF_SOURCE = """ > +#include <linux/irq.h> > +#include <linux/sched.h> > + > +#define MONITOR_PID <MONITOR_PID> > + > +enum { > +<EVENT_ENUM> > +}; > + > +struct event_t { > + u64 ts; > + u32 tid; > + u32 id; > + > + int user_stack_id; > + int kernel_stack_id; > + > + u32 syscall; > + u64 entry_ts; > + > +}; > + > +BPF_RINGBUF_OUTPUT(events, <BUFFER_PAGE_CNT>); > +BPF_STACK_TRACE(stack_traces, <STACK_TRACE_SIZE>); > +BPF_TABLE("percpu_array", uint32_t, uint64_t, dropcnt, 1); > +BPF_TABLE("percpu_array", uint32_t, uint64_t, trigger_miss, 1); > + > +BPF_ARRAY(capture_on, u64, 1); > +static inline bool capture_enabled(u64 pid_tgid) { > + int key = 0; > + u64 *ret; > + > + if ((pid_tgid >> 32) != MONITOR_PID) > + return false; > + > + ret = 
capture_on.lookup(&key); > + return ret && *ret == 1; > +} > + > +static inline bool capture_enabled__() { > + int key = 0; > + u64 *ret; > + > + ret = capture_on.lookup(&key); > + return ret && *ret == 1; > +} > + > +static struct event_t *get_event(uint32_t id) { > + struct event_t *event = events.ringbuf_reserve(sizeof(struct event_t)); > + > + if (!event) { > + dropcnt.increment(0); > + return NULL; > + } > + > + event->id = id; > + event->ts = bpf_ktime_get_ns(); > + event->tid = bpf_get_current_pid_tgid(); > + > + return event; > +} > + > +static int start_trigger() { > + int key = 0; > + u64 *val = capture_on.lookup(&key); > + > + /* If the value is -1 we can't start as we are still processing the > + * results in userspace. */ > + if (!val || *val != 0) { > + trigger_miss.increment(0); > + return 0; > + } > + > + struct event_t *event = get_event(EVENT_START_TRIGGER); > + if (event) { > + events.ringbuf_submit(event, 0); > + *val = 1; > + } else { > + trigger_miss.increment(0); > + } > + return 0; > +} > + > +static int stop_trigger() { > + int key = 0; > + u64 *val = capture_on.lookup(&key); > + > + if (!val || *val != 1) > + return 0; > + > + struct event_t *event = get_event(EVENT_STOP_TRIGGER); > + > + if (event) > + events.ringbuf_submit(event, 0); > + > + if (val) > + *val = -1; > + > + return 0; > +} > + > +<START_TRIGGER> > +<STOP_TRIGGER> > + > + > +/* > + * For the syscall monitor the following probes get installed. 
> + */ > +struct syscall_data_t { > + u64 count; > + u64 total_ns; > + u64 worst_ns; > +}; > + > +struct syscall_data_key_t { > + u32 pid; > + u32 tid; > + u32 syscall; > +}; > + > +BPF_HASH(syscall_start, u64, u64); > +BPF_HASH(syscall_data, struct syscall_data_key_t, struct syscall_data_t); > + > +TRACEPOINT_PROBE(raw_syscalls, sys_enter) { > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + u64 t = bpf_ktime_get_ns(); > + syscall_start.update(&pid_tgid, &t); > + > + return 0; > +} > + > +TRACEPOINT_PROBE(raw_syscalls, sys_exit) { > + struct syscall_data_t *val, zero = {}; > + struct syscall_data_key_t key; > + > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + key.pid = pid_tgid >> 32; > + key.tid = (u32)pid_tgid; > + key.syscall = args->id; > + > + u64 *start_ns = syscall_start.lookup(&pid_tgid); > + > + if (!start_ns) > + return 0; > + > + val = syscall_data.lookup_or_try_init(&key, &zero); > + if (val) { > + u64 delta = bpf_ktime_get_ns() - *start_ns; > + val->count++; > + val->total_ns += delta; > + if (val->worst_ns == 0 || delta > val->worst_ns) > + val->worst_ns = delta; > + > + if (<SYSCALL_TRACE_EVENTS>) { > + struct event_t *event = get_event(EVENT_SYSCALL); > + if (event) { > + event->syscall = args->id; > + event->entry_ts = *start_ns; > + if (<STACK_TRACE_ENABLED>) { > + event->user_stack_id = stack_traces.get_stackid( > + args, BPF_F_USER_STACK); > + event->kernel_stack_id = stack_traces.get_stackid( > + args, 0); > + } > + events.ringbuf_submit(event, 0); > + } > + } > + } > + return 0; > +} > + > + > +/* > + * For measuring the thread run time, we need the following. 
> + */ > +struct run_time_data_t { > + u64 count; > + u64 total_ns; > + u64 max_ns; > + u64 min_ns; > +}; > + > +struct pid_tid_key_t { > + u32 pid; > + u32 tid; > +}; > + > +BPF_HASH(run_start, u64, u64); > +BPF_HASH(run_data, struct pid_tid_key_t, struct run_time_data_t); > + > +static inline void thread_start_run(u64 pid_tgid, u64 ktime) > +{ > + run_start.update(&pid_tgid, &ktime); > +} > + > +static inline void thread_stop_run(u32 pid, u32 tgid, u64 ktime) > +{ > + u64 pid_tgid = (u64) tgid << 32 | pid; > + u64 *start_ns = run_start.lookup(&pid_tgid); > + > + if (!start_ns || *start_ns == 0) > + return; > + > + struct run_time_data_t *val, zero = {}; > + struct pid_tid_key_t key = { .pid = tgid, > + .tid = pid }; > + > + val = run_data.lookup_or_try_init(&key, &zero); > + if (val) { > + u64 delta = ktime - *start_ns; > + val->count++; > + val->total_ns += delta; > + if (val->max_ns == 0 || delta > val->max_ns) > + val->max_ns = delta; > + if (val->min_ns == 0 || delta < val->min_ns) > + val->min_ns = delta; > + } > + *start_ns = 0; > +} > + > + > +/* > + * For measuring the thread-ready delay, we need the following. 
> + */ > +struct ready_data_t { > + u64 count; > + u64 total_ns; > + u64 worst_ns; > +}; > + > +BPF_HASH(ready_start, u64, u64); > +BPF_HASH(ready_data, struct pid_tid_key_t, struct ready_data_t); > + > +static inline int sched_wakeup__(u32 pid, u32 tgid) > +{ > + u64 pid_tgid = (u64) tgid << 32 | pid; > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + u64 t = bpf_ktime_get_ns(); > + ready_start.update(&pid_tgid, &t); > + return 0; > +} > + > +RAW_TRACEPOINT_PROBE(sched_wakeup) > +{ > + struct task_struct *t = (struct task_struct *)ctx->args[0]; > + return sched_wakeup__(t->pid, t->tgid); > +} > + > +RAW_TRACEPOINT_PROBE(sched_wakeup_new) > +{ > + struct task_struct *t = (struct task_struct *)ctx->args[0]; > + return sched_wakeup__(t->pid, t->tgid); > +} > + > +RAW_TRACEPOINT_PROBE(sched_switch) > +{ > + struct task_struct *prev = (struct task_struct *)ctx->args[1]; > + struct task_struct *next= (struct task_struct *)ctx->args[2]; > + u64 ktime = 0; > + > + if (!capture_enabled__()) > + return 0; > + > + if (prev-><STATE_FIELD> == TASK_RUNNING && prev->tgid == MONITOR_PID) > + sched_wakeup__(prev->pid, prev->tgid); > + > + if (prev->tgid == MONITOR_PID) { > + ktime = bpf_ktime_get_ns(); > + thread_stop_run(prev->pid, prev->tgid, ktime); > + } > + > + u64 pid_tgid = (u64)next->tgid << 32 | next->pid; > + > + if (next->tgid != MONITOR_PID) > + return 0; > + > + if (ktime == 0) > + ktime = bpf_ktime_get_ns(); > + > + u64 *start_ns = ready_start.lookup(&pid_tgid); > + > + if (start_ns && *start_ns != 0) { > + > + struct ready_data_t *val, zero = {}; > + struct pid_tid_key_t key = { .pid = next->tgid, > + .tid = next->pid }; > + > + val = ready_data.lookup_or_try_init(&key, &zero); > + if (val) { > + u64 delta = ktime - *start_ns; > + val->count++; > + val->total_ns += delta; > + if (val->worst_ns == 0 || delta > val->worst_ns) > + val->worst_ns = delta; > + } > + *start_ns = 0; > + } > + > + thread_start_run(pid_tgid, ktime); > + return 0; > +} > + > + > +/* 
> + * For measuring the hard irq time, we need the following. > + */ > +struct hardirq_start_data_t { > + u64 start_ns; > + char irq_name[32]; > +}; > + > +struct hardirq_data_t { > + u64 count; > + u64 total_ns; > + u64 worst_ns; > +}; > + > +struct hardirq_data_key_t { > + u32 pid; > + u32 tid; > + char irq_name[32]; > +}; > + > +BPF_HASH(hardirq_start, u64, struct hardirq_start_data_t); > +BPF_HASH(hardirq_data, struct hardirq_data_key_t, struct hardirq_data_t); > + > +TRACEPOINT_PROBE(irq, irq_handler_entry) > +{ > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + struct hardirq_start_data_t data = {}; > + > + data.start_ns = bpf_ktime_get_ns(); > + TP_DATA_LOC_READ_STR(&data.irq_name, name, sizeof(data.irq_name)); > + hardirq_start.update(&pid_tgid, &data); > + return 0; > +} > + > +TRACEPOINT_PROBE(irq, irq_handler_exit) > +{ > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + struct hardirq_start_data_t *data; > + data = hardirq_start.lookup(&pid_tgid); > + if (!data || data->start_ns == 0) > + return 0; > + > + if (args->ret != IRQ_NONE) { > + struct hardirq_data_t *val, zero = {}; > + struct hardirq_data_key_t key = { .pid = pid_tgid >> 32, > + .tid = (u32)pid_tgid }; > + > + bpf_probe_read_kernel(&key.irq_name, sizeof(key.irq_name), > + data->irq_name); > + val = hardirq_data.lookup_or_try_init(&key, &zero); > + if (val) { > + u64 delta = bpf_ktime_get_ns() - data->start_ns; > + val->count++; > + val->total_ns += delta; > + if (val->worst_ns == 0 || delta > val->worst_ns) > + val->worst_ns = delta; > + } > + } > + > + data->start_ns = 0; > + return 0; > +} > + > + > +/* > + * For measuring the soft irq time, we need the following. 
> + */ > +struct softirq_start_data_t { > + u64 start_ns; > + u32 vec_nr; > +}; > + > +struct softirq_data_t { > + u64 count; > + u64 total_ns; > + u64 worst_ns; > +}; > + > +struct softirq_data_key_t { > + u32 pid; > + u32 tid; > + u32 vec_nr; > +}; > + > +BPF_HASH(softirq_start, u64, struct softirq_start_data_t); > +BPF_HASH(softirq_data, struct softirq_data_key_t, struct softirq_data_t); > + > +TRACEPOINT_PROBE(irq, softirq_entry) > +{ > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + struct softirq_start_data_t data = {}; > + > + data.start_ns = bpf_ktime_get_ns(); > + data.vec_nr = args->vec; > + softirq_start.update(&pid_tgid, &data); > + return 0; > +} > + > +TRACEPOINT_PROBE(irq, softirq_exit) > +{ > + u64 pid_tgid = bpf_get_current_pid_tgid(); > + > + if (!capture_enabled(pid_tgid)) > + return 0; > + > + struct softirq_start_data_t *data; > + data = softirq_start.lookup(&pid_tgid); > + if (!data || data->start_ns == 0) > + return 0; > + > + struct softirq_data_t *val, zero = {}; > + struct softirq_data_key_t key = { .pid = pid_tgid >> 32, > + .tid = (u32)pid_tgid, > + .vec_nr = data->vec_nr}; > + > + val = softirq_data.lookup_or_try_init(&key, &zero); > + if (val) { > + u64 delta = bpf_ktime_get_ns() - data->start_ns; > + val->count++; > + val->total_ns += delta; > + if (val->worst_ns == 0 || delta > val->worst_ns) > + val->worst_ns = delta; > + } > + > + data->start_ns = 0; > + return 0; > +} > +""" > + > + > +# > +# time_ns() > +# > +try: > + from time import time_ns > +except ImportError: > + # For compatibility with Python <= v3.6. > + def time_ns(): > + now = datetime.datetime.now() > + return int(now.timestamp() * 1e9) > + > + > +# > +# Probe class to use for the start/stop triggers > +# > +class Probe(object): > + ''' > + The goal for this object is to support as many as possible > + probe/events as supported by BCC. 
See > + https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments > + ''' > + def __init__(self, probe, pid=None): > + self.pid = pid > + self.text_probe = probe > + self._parse_text_probe() > + > + def __str__(self): > + if self.probe_type == "usdt": > + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, > + self.usdt_provider, self.usdt_probe) > + elif self.probe_type == "trace": > + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, > + self.trace_system, self.trace_event) > + elif self.probe_type == "kprobe" or self.probe_type == "kretprobe": > + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, > + self.kprobe_function) > + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": > + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, > + self.uprobe_function) > + else: > + return "[{}] <{}:unknown probe>".format(self.text_probe, > + self.probe_type) > + > + def _raise(self, error): > + raise ValueError("[{}]; {}".format(self.text_probe, error)) > + > + def _verify_kprobe_probe(self): > + # Nothing to verify for now, just return. > + return > + > + def _verify_trace_probe(self): > + # Nothing to verify for now, just return. > + return > + > + def _verify_uprobe_probe(self): > + # Nothing to verify for now, just return. 
> + return > + > + def _verify_usdt_probe(self): > + if not self.pid: > + self._raise("USDT probes need a valid PID.") > + > + usdt = USDT(pid=self.pid) > + > + for probe in usdt.enumerate_probes(): > + if probe.provider.decode('utf-8') == self.usdt_provider and \ > + probe.name.decode('utf-8') == self.usdt_probe: > + return > + > + self._raise("Can't find UDST probe '{}:{}'".format(self.usdt_provider, > + self.usdt_probe)) > + > + def _parse_text_probe(self): > + ''' > + The text probe format is defined as follows: > + <probe_type>:<probe_specific> > + > + Types: > + USDT: u|usdt:<provider>:<probe> > + TRACE: t|trace:<system>:<event> > + KPROBE: k|kprobe:<kernel_function> > + KRETPROBE: kr|kretprobe:<kernel_function> > + UPROBE: up|uprobe:<function> > + URETPROBE: ur|uretprobe:<function> > + ''' > + args = self.text_probe.split(":") > + if len(args) <= 1: > + self._raise("Can't extract probe type.") > + > + if args[0] not in ["k", "kprobe", "kr", "kretprobe", "t", "trace", > + "u", "usdt", "up", "uprobe", "ur", "uretprobe"]: > + self._raise("Invalid probe type '{}'".format(args[0])) > + > + self.probe_type = "kprobe" if args[0] == "k" else args[0] > + self.probe_type = "kretprobe" if args[0] == "kr" else self.probe_type > + self.probe_type = "trace" if args[0] == "t" else self.probe_type > + self.probe_type = "usdt" if args[0] == "u" else self.probe_type > + self.probe_type = "uprobe" if args[0] == "up" else self.probe_type > + self.probe_type = "uretprobe" if args[0] == "ur" else self.probe_type > + > + if self.probe_type == "usdt": > + if len(args) != 3: > + self._raise("Invalid number of arguments for USDT") > + > + self.usdt_provider = args[1] > + self.usdt_probe = args[2] > + self._verify_usdt_probe() > + > + elif self.probe_type == "trace": > + if len(args) != 3: > + self._raise("Invalid number of arguments for TRACE") > + > + self.trace_system = args[1] > + self.trace_event = args[2] > + self._verify_trace_probe() > + > + elif self.probe_type == "kprobe" or 
self.probe_type == "kretprobe": > + if len(args) != 2: > + self._raise("Invalid number of arguments for K(RET)PROBE") > + self.kprobe_function = args[1] > + self._verify_kprobe_probe() > + > + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": > + if len(args) != 2: > + self._raise("Invalid number of arguments for U(RET)PROBE") > + self.uprobe_function = args[1] > + self._verify_uprobe_probe() > + > + def _get_kprobe_c_code(self, function_name, function_content): > + # > + # The kprobe__* do not require a function name, so it's > + # ignored in the code generation. > + # > + return """ > +int {}__{}(struct pt_regs *ctx) {{ > + {} > +}} > +""".format(self.probe_type, self.kprobe_function, function_content) > + > + def _get_trace_c_code(self, function_name, function_content): > + # > + # The TRACEPOINT_PROBE() do not require a function name, so it's > + # ignored in the code generation. > + # > + return """ > +TRACEPOINT_PROBE({},{}) {{ > + {} > +}} > +""".format(self.trace_system, self.trace_event, function_content) > + > + def _get_uprobe_c_code(self, function_name, function_content): > + return """ > +int {}(struct pt_regs *ctx) {{ > + {} > +}} > +""".format(function_name, function_content) > + > + def _get_usdt_c_code(self, function_name, function_content): > + return """ > +int {}(struct pt_regs *ctx) {{ > + {} > +}} > +""".format(function_name, function_content) > + > + def get_c_code(self, function_name, function_content): > + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': > + return self._get_kprobe_c_code(function_name, function_content) > + elif self.probe_type == 'trace': > + return self._get_trace_c_code(function_name, function_content) > + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': > + return self._get_uprobe_c_code(function_name, function_content) > + elif self.probe_type == 'usdt': > + return self._get_usdt_c_code(function_name, function_content) > + > + return "" > + > + def 
probe_name(self): > + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': > + return "{}".format(self.kprobe_function) > + elif self.probe_type == 'trace': > + return "{}:{}".format(self.trace_system, > + self.trace_event) > + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': > + return "{}".format(self.uprobe_function) > + elif self.probe_type == 'usdt': > + return "{}:{}".format(self.usdt_provider, > + self.usdt_probe) > + > + return "" > + > + > +# > +# event_to_dict() > +# > +def event_to_dict(event): > + return dict([(field, getattr(event, field)) > + for (field, _) in event._fields_ > + if isinstance(getattr(event, field), (int, bytes))]) > + > + > +# > +# Event enum > +# > +Event = IntEnum("Event", ["SYSCALL", "START_TRIGGER", "STOP_TRIGGER"], > + start=0) > + > + > +# > +# process_event() > +# > +def process_event(ctx, data, size): > + global start_trigger_ts > + global stop_trigger_ts > + > + event = bpf['events'].event(data) > + if event.id == Event.SYSCALL: > + syscall_events.append({"tid": event.tid, > + "ts_entry": event.entry_ts, > + "ts_exit": event.ts, > + "syscall": event.syscall, > + "user_stack_id": event.user_stack_id, > + "kernel_stack_id": event.kernel_stack_id}) > + elif event.id == Event.START_TRIGGER: > + # > + # This event would have started the trigger already, so all we need to > + # do is record the start timestamp. > + # > + start_trigger_ts = event.ts > + > + elif event.id == Event.STOP_TRIGGER: > + # > + # This event would have stopped the trigger already, so all we need to > + # do is record the start timestamp. 
> + stop_trigger_ts = event.ts > + > + > +# > +# next_power_of_two() > +# > +def next_power_of_two(val): > + np = 1 > + while np < val: > + np *= 2 > + return np > + > + > +# > +# unsigned_int() > +# > +def unsigned_int(value): > + try: > + value = int(value) > + except ValueError: > + raise argparse.ArgumentTypeError("must be an integer") > + > + if value < 0: > + raise argparse.ArgumentTypeError("must be positive") > + return value > + > + > +# > +# unsigned_nonzero_int() > +# > +def unsigned_nonzero_int(value): > + value = unsigned_int(value) > + if value == 0: > + raise argparse.ArgumentTypeError("must be nonzero") > + return value > + > + > +# > +# get_thread_name() > +# > +def get_thread_name(pid, tid): > + try: > + with open(f"/proc/{pid}/task/{tid}/comm", encoding="utf8") as f: > + return f.readline().strip("\n") > + except FileNotFoundError: > + pass > + > + return f"<unknown:{pid}/{tid}>" > + > + > +# > +# get_vec_nr_name() > +# > +def get_vec_nr_name(vec_nr): > + known_vec_nr = ["hi", "timer", "net_tx", "net_rx", "block", "irq_poll", > + "tasklet", "sched", "hrtimer", "rcu"] > + > + if vec_nr < 0 or vec_nr > len(known_vec_nr): > + return f"<unknown:{vec_nr}>" > + > + return known_vec_nr[vec_nr] > + > + > +# > +# start/stop/reset capture > +# > +def start_capture(): > + bpf["capture_on"][ct.c_int(0)] = ct.c_int(1) > + > + > +def stop_capture(force=False): > + if force: > + bpf["capture_on"][ct.c_int(0)] = ct.c_int(0xffff) > + else: > + bpf["capture_on"][ct.c_int(0)] = ct.c_int(0) > + > + > +def capture_running(): > + return bpf["capture_on"][ct.c_int(0)].value == 1 > + > + > +def reset_capture(): > + bpf["syscall_start"].clear() > + bpf["syscall_data"].clear() > + bpf["run_start"].clear() > + bpf["run_data"].clear() > + bpf["ready_start"].clear() > + bpf["ready_data"].clear() > + bpf["hardirq_start"].clear() > + bpf["hardirq_data"].clear() > + bpf["softirq_start"].clear() > + bpf["softirq_data"].clear() > + bpf["stack_traces"].clear() > + > + > +# > +# 
Display timestamp > +# > +def print_timestamp(msg): > + ltz = datetime.datetime.now() > + utc = ltz.astimezone(pytz.utc) > + time_string = "{} @{} ({} UTC)".format( > + msg, ltz.isoformat(), utc.strftime("%H:%M:%S")) > + print(time_string) > + > + > +# > +# process_results() > +# > +def process_results(syscall_events=None, trigger_delta=None): > + if trigger_delta: > + print_timestamp("# Triggered sample dump, stop-start delta {:,} ns". > + format(trigger_delta)) > + else: > + print_timestamp("# Sample dump") > + > + # > + # First get a list of all threads we need to report on. > + # > + threads_syscall = {k.tid for k, _ in bpf["syscall_data"].items() > + if k.syscall != 0xffffffff} > + > + threads_run = {k.tid for k, _ in bpf["run_data"].items() > + if k.pid != 0xffffffff} > + > + threads_ready = {k.tid for k, _ in bpf["ready_data"].items() > + if k.pid != 0xffffffff} > + > + threads_hardirq = {k.tid for k, _ in bpf["hardirq_data"].items() > + if k.pid != 0xffffffff} > + > + threads_softirq = {k.tid for k, _ in bpf["softirq_data"].items() > + if k.pid != 0xffffffff} > + > + threads = sorted(threads_syscall | threads_run | threads_ready | > + threads_hardirq | threads_softirq, > + key=lambda x: get_thread_name(options.pid, x)) > + > + # > + # Print header... > + # > + print("{:10} {:16} {}".format("TID", "THREAD", "<RESOURCE SPECIFIC>")) > + print("{:10} {:16} {}".format("-" * 10, "-" * 16, "-" * 76)) > + indent = 28 * " " > + > + # > + # Print all events/statistics per threads. 
> + # > + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0] > + for thread in threads: > + > + if thread != threads[0]: > + print("") > + > + # > + # SYSCALL_STATISTICS > + # > + print("{:10} {:16} {}\n{}{:20} {:>6} {:>10} {:>16} {:>16}".format( > + thread, get_thread_name(options.pid, thread), > + "[SYSCALL STATISTICS]", indent, > + "NAME", "NUMBER", "COUNT", "TOTAL ns", "MAX ns")) > + > + total_count = 0 > + total_ns = 0 > + for k, v in sorted(filter(lambda t: t[0].tid == thread, > + bpf["syscall_data"].items()), > + key=lambda kv: -kv[1].total_ns): > + > + print("{}{:20.20} {:6} {:10} {:16,} {:16,}".format( > + indent, syscall_name(k.syscall).decode('utf-8'), k.syscall, > + v.count, v.total_ns, v.worst_ns)) > + if k.syscall != poll_id: > + total_count += v.count > + total_ns += v.total_ns > + > + if total_count > 0: > + print("{}{:20.20} {:6} {:10} {:16,}".format( > + indent, "TOTAL( - poll):", "", total_count, total_ns)) > + > + # > + # THREAD RUN STATISTICS > + # > + print("\n{:10} {:16} {}\n{}{:10} {:>16} {:>16} {:>16}".format( > + "", "", "[THREAD RUN STATISTICS]", indent, > + "SCHED_CNT", "TOTAL ns", "MIN ns", "MAX ns")) > + > + for k, v in filter(lambda t: t[0].tid == thread, > + bpf["run_data"].items()): > + > + print("{}{:10} {:16,} {:16,} {:16,}".format( > + indent, v.count, v.total_ns, v.min_ns, v.max_ns)) > + > + # > + # THREAD READY STATISTICS > + # > + print("\n{:10} {:16} {}\n{}{:10} {:>16} {:>16}".format( > + "", "", "[THREAD READY STATISTICS]", indent, > + "SCHED_CNT", "TOTAL ns", "MAX ns")) > + > + for k, v in filter(lambda t: t[0].tid == thread, > + bpf["ready_data"].items()): > + > + print("{}{:10} {:16,} {:16,}".format( > + indent, v.count, v.total_ns, v.worst_ns)) > + > + # > + # HARD IRQ STATISTICS > + # > + total_ns = 0 > + total_count = 0 > + header_printed = False > + for k, v in sorted(filter(lambda t: t[0].tid == thread, > + bpf["hardirq_data"].items()), > + key=lambda kv: -kv[1].total_ns): > + > + if not header_printed: > 
+ print("\n{:10} {:16} {}\n{}{:20} {:>10} {:>16} {:>16}". > + format("", "", "[HARD IRQ STATISTICS]", indent, > + "NAME", "COUNT", "TOTAL ns", "MAX ns")) > + header_printed = True > + > + print("{}{:20.20} {:10} {:16,} {:16,}".format( > + indent, k.irq_name.decode('utf-8'), > + v.count, v.total_ns, v.worst_ns)) > + > + total_count += v.count > + total_ns += v.total_ns > + > + if total_count > 0: > + print("{}{:20.20} {:10} {:16,}".format( > + indent, "TOTAL:", total_count, total_ns)) > + > + # > + # SOFT IRQ STATISTICS > + # > + total_ns = 0 > + total_count = 0 > + header_printed = False > + for k, v in sorted(filter(lambda t: t[0].tid == thread, > + bpf["softirq_data"].items()), > + key=lambda kv: -kv[1].total_ns): > + > + if not header_printed: > + print("\n{:10} {:16} {}\n" > + "{}{:20} {:>7} {:>10} {:>16} {:>16}". > + format("", "", "[SOFT IRQ STATISTICS]", indent, > + "NAME", "VECT_NR", "COUNT", "TOTAL ns", "MAX ns")) > + header_printed = True > + > + print("{}{:20.20} {:>7} {:10} {:16,} {:16,}".format( > + indent, get_vec_nr_name(k.vec_nr), k.vec_nr, > + v.count, v.total_ns, v.worst_ns)) > + > + total_count += v.count > + total_ns += v.total_ns > + > + if total_count > 0: > + print("{}{:20.20} {:7} {:10} {:16,}".format( > + indent, "TOTAL:", "", total_count, total_ns)) > + > + # > + # Print events > + # > + lost_stack_traces = 0 > + if syscall_events: > + stack_traces = bpf.get_table("stack_traces") > + > + print("\n\n# SYSCALL EVENTS:" > + "\n{}{:>19} {:>19} {:>10} {:16} {:>10} {}".format( > + 2 * " ", "ENTRY (ns)", "EXIT (ns)", "TID", "COMM", > + "DELTA (us)", "SYSCALL")) > + print("{}{:19} {:19} {:10} {:16} {:10} {}".format( > + 2 * " ", "-" * 19, "-" * 19, "-" * 10, "-" * 16, > + "-" * 10, "-" * 16)) > + for event in syscall_events: > + print("{}{:19} {:19} {:10} {:16} {:10,} {}".format( > + " " * 2, > + event["ts_entry"], event["ts_exit"], event["tid"], > + get_thread_name(options.pid, event["tid"]), > + int((event["ts_exit"] - event["ts_entry"]) / 
1000), > + syscall_name(event["syscall"]).decode('utf-8'))) > + # > + # Not sure where to put this, but I'll add some info on stack > + # traces here... Userspace stack traces are very limited due to > + # the fact that bcc does not support dwarf backtraces. As OVS > + # gets compiled without frame pointers we will not see much. > + # If however, OVS does get built with frame pointers, we should not > + # use the BPF_STACK_TRACE_BUILDID as it does not seem to handle > + # the debug symbols correctly. Also, note that for kernel > + # traces you should not use BPF_STACK_TRACE_BUILDID, so two > + # buffers are needed. > + # > + # Some info on manual dwarf walk support: > + # https://github.com/iovisor/bcc/issues/3515 > + # https://github.com/iovisor/bcc/pull/4463 > + # > + if options.stack_trace_size == 0: > + continue > + > + if event['kernel_stack_id'] < 0 or event['user_stack_id'] < 0: > + lost_stack_traces += 1 > + > + kernel_stack = stack_traces.walk(event['kernel_stack_id']) \ > + if event['kernel_stack_id'] >= 0 else [] > + user_stack = stack_traces.walk(event['user_stack_id']) \ > + if event['user_stack_id'] >= 0 else [] > + > + for addr in kernel_stack: > + print("{}{}".format( > + " " * 10, > + bpf.ksym(addr, show_module=True, > + show_offset=True).decode('utf-8', 'replace'))) > + > + for addr in user_stack: > + addr_str = bpf.sym(addr, options.pid, show_module=True, > + show_offset=True).decode('utf-8', 'replace') > + > + if addr_str == "[unknown]": > + addr_str += " 0x{:x}".format(addr) > + > + print("{}{}".format(" " * 10, addr_str)) > + > + # > + # Print any footer messages. 
> + #
> + if lost_stack_traces > 0:
> + print("\n# WARNING: We were not able to display {} stack traces!\n"
> + "# Consider increasing the stack trace size using\n"
> + "# the '--stack-trace-size' option.\n"
> + "# Note that this can also happen due to a stack id\n"
> + "# collision.".format(lost_stack_traces))
> +
> +
> +#
> +# main()
> +#
> +def main():
> + #
> + # Don't like these globals, but ctx passing does not seem to work with the
> + # existing open_ring_buffer() API :(
> + #
> + global bpf
> + global options
> + global syscall_events
> + global start_trigger_ts
> + global stop_trigger_ts
> +
> + start_trigger_ts = 0
> + stop_trigger_ts = 0
> +
> + #
> + # Argument parsing
> + #
> + parser = argparse.ArgumentParser()
> +
> + parser.add_argument("-D", "--debug",
> + help="Enable eBPF debugging",
> + type=int, const=0x3f, default=0, nargs='?')
> + parser.add_argument("-p", "--pid", metavar="VSWITCHD_PID",
> + help="ovs-vswitchd's PID",
> + type=unsigned_int, default=None)
> + parser.add_argument("-s", "--syscall-events", metavar="DURATION_NS",
> + help="Record syscall events that take longer than "
> + "DURATION_NS. 
Omit the duration value to record all " > + "syscall events", > + type=unsigned_int, const=0, default=None, nargs='?') > + parser.add_argument("--buffer-page-count", > + help="Number of BPF ring buffer pages, default 1024", > + type=unsigned_int, default=1024, metavar="NUMBER") > + parser.add_argument("--sample-count", > + help="Number of sample runs, default 1", > + type=unsigned_nonzero_int, default=1, metavar="RUNS") > + parser.add_argument("--sample-interval", > + help="Delay between sample runs, default 0", > + type=float, default=0, metavar="SECONDS") > + parser.add_argument("--sample-time", > + help="Sample time, default 0.5 seconds", > + type=float, default=0.5, metavar="SECONDS") > + parser.add_argument("--skip-syscall-poll-events", > + help="Skip poll() syscalls with --syscall-events", > + action="store_true") > + parser.add_argument("--stack-trace-size", > + help="Number of unique stack traces that can be " > + "recorded, default 4096. 0 to disable", > + type=unsigned_int, default=4096) > + parser.add_argument("--start-trigger", metavar="TRIGGER", > + help="Start trigger, see documentation for details", > + type=str, default=None) > + parser.add_argument("--stop-trigger", metavar="TRIGGER", > + help="Stop trigger, see documentation for details", > + type=str, default=None) > + parser.add_argument("--trigger-delta", metavar="DURATION_NS", > + help="Only report event when the trigger duration > " > + "DURATION_NS, default 0 (all events)", > + type=unsigned_int, const=0, default=0, nargs='?') > + > + options = parser.parse_args() > + > + # > + # Find the PID of the ovs-vswitchd daemon if not specified. > + # > + if not options.pid: > + for proc in psutil.process_iter(): > + if 'ovs-vswitchd' in proc.name(): > + if options.pid: > + print("ERROR: Multiple ovs-vswitchd daemons running, " > + "use the -p option!") > + sys.exit(os.EX_NOINPUT) > + > + options.pid = proc.pid > + > + # > + # Error checking on input parameters. 
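Small race in the PID autodetection above: proc.name() can raise psutil.NoSuchProcess for a process that exits while we iterate, which would abort the script. Not a blocker; for reference, an exception-safe stdlib-only sketch that scans /proc the same way get_thread_name() does later in the script (the function name is mine, untested against a real vswitchd):

```python
import os


def find_vswitchd_pid():
    # Scan /proc/<pid>/comm for "ovs-vswitchd"; processes that exit
    # while we scan raise FileNotFoundError and are simply skipped.
    pid = None
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm", encoding="utf8") as f:
                name = f.readline().strip()
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue
        if "ovs-vswitchd" in name:
            if pid is not None:
                raise RuntimeError("Multiple ovs-vswitchd daemons running")
            pid = int(entry)
    return pid
```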
> + #
> + if not options.pid:
> + print("ERROR: Failed to find ovs-vswitchd's PID!")
> + sys.exit(os.EX_UNAVAILABLE)
> +
> + options.buffer_page_count = next_power_of_two(options.buffer_page_count)
> +
> + #
> + # Make sure we are running as root, or else we can not attach the probes.
> + #
> + if os.geteuid() != 0:
> + print("ERROR: We need to run as root to attach probes!")
> + sys.exit(os.EX_NOPERM)
> +
> + #
> + # Setup any of the start/stop triggers
> + #
> + if options.start_trigger is not None:
> + try:
> + start_trigger = Probe(options.start_trigger, pid=options.pid)
> + except ValueError as e:
> + print(f"ERROR: Invalid start trigger {str(e)}")
> + sys.exit(os.EX_CONFIG)
> + else:
> + start_trigger = None
> +
> + if options.stop_trigger is not None:
> + try:
> + stop_trigger = Probe(options.stop_trigger, pid=options.pid)
> + except ValueError as e:
> + print(f"ERROR: Invalid stop trigger {str(e)}")
> + sys.exit(os.EX_CONFIG)
> + else:
> + stop_trigger = None
> +
> + #
> + # Attach probe to running process. 
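Worth a note in the docs perhaps: --buffer-page-count is rounded up by next_power_of_two() earlier in this hunk (BPF ring buffers want a power-of-two page count), so for example 1000 pages silently becomes 1024. Quick check of the helper as written at the top of the script:

```python
def next_power_of_two(val):
    # Copied from the patch: smallest power of two >= val.
    np = 1
    while np < val:
        np *= 2
    return np
```

So values that are already a power of two are left alone, anything else is bumped to the next one.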
> + #
> + source = EBPF_SOURCE.replace("<EVENT_ENUM>", "\n".join(
> + [" EVENT_{} = {},".format(
> + event.name, event.value) for event in Event]))
> + source = source.replace("<BUFFER_PAGE_CNT>",
> + str(options.buffer_page_count))
> + source = source.replace("<MONITOR_PID>", str(options.pid))
> +
> + if BPF.kernel_struct_has_field(b'task_struct', b'state') == 1:
> + source = source.replace('<STATE_FIELD>', 'state')
> + else:
> + source = source.replace('<STATE_FIELD>', '__state')
> +
> + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0]
> + if options.syscall_events is None:
> + syscall_trace_events = "false"
> + elif options.syscall_events == 0:
> + if not options.skip_syscall_poll_events:
> + syscall_trace_events = "true"
> + else:
> + syscall_trace_events = f"args->id != {poll_id}"
> + else:
> + syscall_trace_events = "delta > {}".format(options.syscall_events)
> + if options.skip_syscall_poll_events:
> + syscall_trace_events += f" && args->id != {poll_id}"
> +
> + source = source.replace("<SYSCALL_TRACE_EVENTS>",
> + syscall_trace_events)
> +
> + source = source.replace("<STACK_TRACE_SIZE>",
> + str(options.stack_trace_size))
> +
> + source = source.replace("<STACK_TRACE_ENABLED>", "true"
> + if options.stack_trace_size > 0 else "false")
> +
> + #
> + # Handle start/stop probes
> + #
> + if start_trigger:
> + source = source.replace("<START_TRIGGER>",
> + start_trigger.get_c_code(
> + "start_trigger_probe",
> + "return start_trigger();"))
> + else:
> + source = source.replace("<START_TRIGGER>", "")
> +
> + if stop_trigger:
> + source = source.replace("<STOP_TRIGGER>",
> + stop_trigger.get_c_code(
> + "stop_trigger_probe",
> + "return stop_trigger();"))
> + else:
> + source = source.replace("<STOP_TRIGGER>", "")
> +
> + #
> + # Setup USDT or other probes that need handling through the BPF class. 
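For readers following the <SYSCALL_TRACE_EVENTS> substitution above, the generated C predicate condenses to the branches below. This is just an illustrative Python sketch of the same logic; the poll syscall number is looked up from the syscall table at runtime, 7 is merely its x86_64 value:

```python
def build_syscall_filter(syscall_events, skip_poll, poll_id=7):
    # Mirrors the substitution: None disables event recording, 0 records
    # every syscall, and any other value acts as a minimum duration (ns)
    # for the C-side "delta" check.
    if syscall_events is None:
        return "false"
    if syscall_events == 0:
        return f"args->id != {poll_id}" if skip_poll else "true"
    expr = f"delta > {syscall_events}"
    if skip_poll:
        expr += f" && args->id != {poll_id}"
    return expr
```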
> + # > + usdt = USDT(pid=int(options.pid)) > + try: > + if start_trigger and start_trigger.probe_type == 'usdt': > + usdt.enable_probe(probe=start_trigger.probe_name(), > + fn_name="start_trigger_probe") > + if stop_trigger and stop_trigger.probe_type == 'usdt': > + usdt.enable_probe(probe=stop_trigger.probe_name(), > + fn_name="stop_trigger_probe") > + > + except USDTException as e: > + print("ERROR: {}".format( > + (re.sub('^', ' ' * 7, str(e), flags=re.MULTILINE)).strip(). > + replace("--with-dtrace or --enable-dtrace", > + "--enable-usdt-probes"))) > + sys.exit(os.EX_OSERR) > + > + bpf = BPF(text=source, usdt_contexts=[usdt], debug=options.debug) > + > + if start_trigger: > + try: > + if start_trigger.probe_type == "uprobe": > + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe", > + sym=start_trigger.probe_name(), > + fn_name="start_trigger_probe", > + pid=options.pid) > + > + if start_trigger.probe_type == "uretprobe": > + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe", > + sym=start_trigger.probe_name(), > + fn_name="start_trigger_probe", > + pid=options.pid) > + except Exception as e: > + print("ERROR: Failed attaching uprobe start trigger " > + f"'{start_trigger.probe_name()}';\n {str(e)}") > + sys.exit(os.EX_OSERR) > + > + if stop_trigger: > + try: > + if stop_trigger.probe_type == "uprobe": > + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe", > + sym=stop_trigger.probe_name(), > + fn_name="stop_trigger_probe", > + pid=options.pid) > + > + if stop_trigger.probe_type == "uretprobe": > + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe", > + sym=stop_trigger.probe_name(), > + fn_name="stop_trigger_probe", > + pid=options.pid) > + except Exception as e: > + print("ERROR: Failed attaching uprobe stop trigger" > + f"'{stop_trigger.probe_name()}';\n {str(e)}") > + sys.exit(os.EX_OSERR) > + > + # > + # If no triggers are configured use the delay configuration > + # > + bpf['events'].open_ring_buffer(process_event) > + > + sample_count = 0 > + while 
sample_count < options.sample_count: > + sample_count += 1 > + syscall_events = [] > + > + if not options.start_trigger: > + print_timestamp("# Start sampling") > + start_capture() > + stop_time = -1 if options.stop_trigger else \ > + time_ns() + options.sample_time * 1000000000 > + else: > + # For start triggers the stop time depends on the start trigger > + # time, or depends on the stop trigger if configured. > + stop_time = -1 if options.stop_trigger else 0 > + > + while True: > + keyboard_interrupt = False > + try: > + last_start_ts = start_trigger_ts > + last_stop_ts = stop_trigger_ts > + > + if stop_time > 0: > + delay = int((stop_time - time_ns()) / 1000000) > + if delay <= 0: > + break > + else: > + delay = -1 > + > + bpf.ring_buffer_poll(timeout=delay) > + > + if stop_time <= 0 and last_start_ts != start_trigger_ts: > + print_timestamp( > + "# Start sampling (trigger@{})".format( > + start_trigger_ts)) > + > + if not options.stop_trigger: > + stop_time = time_ns() + \ > + options.sample_time * 1000000000 > + > + if last_stop_ts != stop_trigger_ts: > + break > + > + except KeyboardInterrupt: > + keyboard_interrupt = True > + break > + > + if options.stop_trigger and not capture_running(): > + print_timestamp("# Stop sampling (trigger@{})".format( > + stop_trigger_ts)) > + else: > + print_timestamp("# Stop sampling") > + > + if stop_trigger_ts != 0 and start_trigger_ts != 0: > + trigger_delta = stop_trigger_ts - start_trigger_ts > + else: > + trigger_delta = None > + > + if not trigger_delta or trigger_delta >= options.trigger_delta: > + stop_capture(force=True) # Prevent a new trigger to start. 
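Tiny nit on the check above: the --trigger-delta help text says "duration > DURATION_NS", but the comparison is >=, so a delta exactly equal to DURATION_NS is still reported. Condensed for illustration (the function name is mine):

```python
def should_report(trigger_delta, option_delta):
    # Condensed from the check in the sampling loop: no trigger delta
    # (None, i.e. no triggers configured) always reports.
    return not trigger_delta or trigger_delta >= option_delta

# A delta exactly equal to --trigger-delta is reported ("=" included);
# only strictly smaller deltas are skipped.
assert should_report(50000, 50000)
assert not should_report(49999, 50000)
```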
> + process_results(syscall_events=syscall_events, > + trigger_delta=trigger_delta) > + elif trigger_delta: > + sample_count -= 1 > + print_timestamp("# Sample dump skipped, delta {:,} ns".format( > + trigger_delta)) > + > + reset_capture() > + stop_capture() > + > + if keyboard_interrupt: > + break > + > + if options.sample_interval > 0: > + time.sleep(options.sample_interval) > + > + # > + # Report lost events. > + # > + dropcnt = bpf.get_table("dropcnt") > + for k in dropcnt.keys(): > + count = dropcnt.sum(k).value > + if k.value == 0 and count > 0: > + print("\n# WARNING: Not all events were captured, {} were " > + "dropped!\n# Increase the BPF ring buffer size " > + "with the --buffer-page-count option.".format(count)) > + > + if (options.sample_count > 1): I think it is more pythonic to omit the parens here. > + trigger_miss = bpf.get_table("trigger_miss") > + for k in trigger_miss.keys(): > + count = trigger_miss.sum(k).value > + if k.value == 0 and count > 0: > + print("\n# WARNING: Not all start triggers were successful. " > + "{} were missed due to\n# slow userspace " > + "processing!".format(count)) > + > + > +# > +# Start main() as the default entry point... > +# > +if __name__ == '__main__': > + main() > diff --git a/utilities/usdt-scripts/kernel_delay.rst b/utilities/usdt-scripts/kernel_delay.rst > new file mode 100644 > index 000000000..0ebd30afb > --- /dev/null > +++ b/utilities/usdt-scripts/kernel_delay.rst > @@ -0,0 +1,596 @@ > +Troubleshooting Open vSwitch: Is the kernel to blame? > +===================================================== > +Often, when troubleshooting Open vSwitch (OVS) in the field, you might be left > +wondering if the issue is really OVS-related, or if it's a problem with the > +kernel being overloaded. Messages in the log like > +``Unreasonably long XXXXms poll interval`` might suggest it's OVS, but from > +experience, these are mostly related to an overloaded Linux Kernel. 
> +The kernel_delay.py tool can help you quickly identify if the focus of your
> +investigation should be OVS or the Linux kernel.
> +
> +
> +Introduction
> +------------
> +``kernel_delay.py`` consists of a Python script that uses the BCC [#BCC]_
> +framework to install eBPF probes. The data the eBPF probes collect will be
> +analyzed and presented to the user by the Python script. Some of the presented
> +data can also be captured by the individual scripts included in the BCC [#BCC]_
> +framework.
> +
> +``kernel_delay.py`` has two modes of operation:
> +
> +- In **time mode**, the tool runs for a specific time and collects the
> +  information.
> +- In **trigger mode**, event collection can be started and/or stopped based on
> +  a specific eBPF probe. Currently, the following probes are supported:
> +  - USDT probes
> +  - Kernel tracepoints
> +  - kprobe
> +  - kretprobe
> +  - uprobe
> +  - uretprobe
> +
> +
> +In addition, the ``--sample-count`` option lets you specify how many
> +iterations you would like to do. When using triggers, you can also ignore
> +samples shorter than a given number of nanoseconds with the
> +``--trigger-delta`` option. The latter might be useful when debugging Linux
> +syscalls which take a long time to complete. More on this later. Finally, you
> +can configure the delay between two sample runs with the ``--sample-interval``
> +option.
> +
> +Before getting into more details, you can run the tool without any options
> +to see what the output looks like. Notice that it will try to automatically
> +get the process ID of the running ``ovs-vswitchd``. You can override this
> +with the ``--pid`` option.
> +
> +.. 
code-block:: console > + > + $ sudo ./kernel_delay.py > + # Start sampling @2023-06-08T12:17:22.725127 (10:17:22 UTC) > + # Stop sampling @2023-06-08T12:17:23.224781 (10:17:23 UTC) > + # Sample dump @2023-06-08T12:17:23.224855 (10:17:23 UTC) > + TID THREAD <RESOURCE SPECIFIC> > + ---------- ---------------- ---------------------------------------------------------------------------- > + 27090 ovs-vswitchd [SYSCALL STATISTICS] > + <EDIT: REMOVED DATA FOR ovs-vswitchd THREAD> > + > + 31741 revalidator122 [SYSCALL STATISTICS] > + NAME NUMBER COUNT TOTAL ns MAX ns > + poll 7 5 184,193,176 184,191,520 > + recvmsg 47 494 125,208,756 310,331 > + futex 202 8 18,768,758 4,023,039 > + sendto 44 10 375,861 266,867 > + sendmsg 46 4 43,294 11,213 > + write 1 1 5,949 5,949 > + getrusage 98 1 1,424 1,424 > + read 0 1 1,292 1,292 > + TOTAL( - poll): 519 144,405,334 > + > + [THREAD RUN STATISTICS] > + SCHED_CNT TOTAL ns MIN ns MAX ns > + 6 136,764,071 1,480 115,146,424 > + > + [THREAD READY STATISTICS] > + SCHED_CNT TOTAL ns MAX ns > + 7 11,334 6,636 > + > + [HARD IRQ STATISTICS] > + NAME COUNT TOTAL ns MAX ns > + eno8303-rx-1 1 3,586 3,586 > + TOTAL: 1 3,586 > + > + [SOFT IRQ STATISTICS] > + NAME VECT_NR COUNT TOTAL ns MAX ns > + net_rx 3 1 17,699 17,699 > + sched 7 6 13,820 3,226 > + rcu 9 16 13,586 1,554 > + timer 1 3 10,259 3,815 > + TOTAL: 26 55,364 > + > + > +By default, the tool will run for half a second in `time mode`. To extend this > +you can use the ``--sample-time`` option. > + > + > +What will it report > +------------------- > +The above sample output separates the captured data on a per-thread basis. > +For this, it displays the thread's id (``TID``) and name (``THREAD``), > +followed by resource-specific data. Which are: > + > +- ``SYSCALL STATISTICS`` > +- ``THREAD RUN STATISTICS`` > +- ``THREAD READY STATISTICS`` > +- ``HARD IRQ STATISTICS`` > +- ``SOFT IRQ STATISTICS`` > + > +The following sections will describe in detail what statistics they report. 
> +
> +
> +``SYSCALL STATISTICS``
> +~~~~~~~~~~~~~~~~~~~~~~
> +``SYSCALL STATISTICS`` tell you which Linux system calls were executed during
> +the measurement interval. This includes the number of times the syscall was
> +called (``COUNT``), the total time spent in the system calls (``TOTAL ns``),
> +and the worst-case duration of a single call (``MAX ns``).
> +
> +It also shows the total of all system calls, but it excludes the poll system
> +call, as the purpose of this call is to wait for activity on a set of sockets,
> +and usually, the thread gets swapped out.
> +
> +Note that it only counts calls that started and stopped during the
> +measurement interval!
> +
> +
> +``THREAD RUN STATISTICS``
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +``THREAD RUN STATISTICS`` tell you how long the thread was running on a CPU
> +during the measurement interval.
> +
> +Note that these statistics only count events where the thread started and
> +stopped running on a CPU during the measurement interval. For example, if
> +this was a PMD thread, you should see zero ``SCHED_CNT`` and ``TOTAL_ns``.
> +If not, there might be a misconfiguration.
> +
> +
> +``THREAD READY STATISTICS``
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +``THREAD READY STATISTICS`` tell you the time between the thread being ready
> +to run and it actually running on the CPU.
> +
> +Note that these statistics only count events where the thread was getting
> +ready to run and started running during the measurement interval.
> +
> +
> +``HARD IRQ STATISTICS``
> +~~~~~~~~~~~~~~~~~~~~~~~
> +``HARD IRQ STATISTICS`` tell you how much time was spent servicing hard
> +interrupts during the thread's run time.
> +
> +It shows the interrupt name (``NAME``), the number of interrupts (``COUNT``),
> +the total time spent in the interrupt handler (``TOTAL ns``), and the
> +worst-case duration (``MAX ns``). 
> +
> +
> +``SOFT IRQ STATISTICS``
> +~~~~~~~~~~~~~~~~~~~~~~~
> +``SOFT IRQ STATISTICS`` tell you how much time was spent servicing soft
> +interrupts during the thread's run time.
> +
> +It shows the interrupt name (``NAME``), vector number (``VECT_NR``), the
> +number of interrupts (``COUNT``), the total time spent in the interrupt
> +handler (``TOTAL ns``), and the worst-case duration (``MAX ns``).
> +
> +
> +The ``--syscall-events`` option
> +-------------------------------
> +In addition to reporting global syscall statistics in ``SYSCALL_STATISTICS``,
> +the tool can also report each individual syscall. This can be a useful
> +second step if the ``SYSCALL_STATISTICS`` show high latency numbers.
> +
> +All you need to do is add the ``--syscall-events`` option, with or without
> +the additional ``DURATION_NS`` parameter. The ``DURATION_NS`` parameter
> +allows you to exclude events that take less than the supplied time.
> +
> +The ``--skip-syscall-poll-events`` option allows you to exclude poll
> +syscalls from the report.
> +
> +Below is an example run; note that the resource-specific data is removed
> +to highlight the syscall events:
> +
> +.. code-block:: console
> +
> + $ sudo ./kernel_delay.py --syscall-events 50000 --skip-syscall-poll-events
> + # Start sampling @2023-06-13T17:10:46.460874 (15:10:46 UTC)
> + # Stop sampling @2023-06-13T17:10:46.960727 (15:10:46 UTC)
> + # Sample dump @2023-06-13T17:10:46.961033 (15:10:46 UTC)
> + TID THREAD <RESOURCE SPECIFIC>
> + ---------- ---------------- ----------------------------------------------------------------------------
> + 3359686 ipf_clean2 [SYSCALL STATISTICS]
> + ...
> + 3359635 ovs-vswitchd [SYSCALL STATISTICS]
> + ...
> + 3359697 revalidator12 [SYSCALL STATISTICS]
> + ...
> + 3359698 revalidator13 [SYSCALL STATISTICS]
> + ...
> + 3359699 revalidator14 [SYSCALL STATISTICS]
> + ...
> + 3359700 revalidator15 [SYSCALL STATISTICS]
> + ... 
> + > + # SYSCALL EVENTS: > + ENTRY (ns) EXIT (ns) TID COMM DELTA (us) SYSCALL > + ------------------- ------------------- ---------- ---------------- ---------- ---------------- > + 2161821694935486 2161821695031201 3359699 revalidator14 95 futex > + syscall_exit_to_user_mode_prepare+0x161 [kernel] > + syscall_exit_to_user_mode_prepare+0x161 [kernel] > + syscall_exit_to_user_mode+0x9 [kernel] > + do_syscall_64+0x68 [kernel] > + entry_SYSCALL_64_after_hwframe+0x72 [kernel] > + __GI___lll_lock_wait+0x30 [libc.so.6] > + ovs_mutex_lock_at+0x18 [ovs-vswitchd] > + [unknown] 0x696c003936313a63 > + 2161821695276882 2161821695333687 3359698 revalidator13 56 futex > + syscall_exit_to_user_mode_prepare+0x161 [kernel] > + syscall_exit_to_user_mode_prepare+0x161 [kernel] > + syscall_exit_to_user_mode+0x9 [kernel] > + do_syscall_64+0x68 [kernel] > + entry_SYSCALL_64_after_hwframe+0x72 [kernel] > + __GI___lll_lock_wait+0x30 [libc.so.6] > + ovs_mutex_lock_at+0x18 [ovs-vswitchd] > + [unknown] 0x696c003134313a63 > + 2161821695275820 2161821695405733 3359700 revalidator15 129 futex > + syscall_exit_to_user_mode_prepare+0x161 [kernel] > + syscall_exit_to_user_mode_prepare+0x161 [kernel] > + syscall_exit_to_user_mode+0x9 [kernel] > + do_syscall_64+0x68 [kernel] > + entry_SYSCALL_64_after_hwframe+0x72 [kernel] > + __GI___lll_lock_wait+0x30 [libc.so.6] > + ovs_mutex_lock_at+0x18 [ovs-vswitchd] > + [unknown] 0x696c003936313a63 > + 2161821695964969 2161821696052021 3359635 ovs-vswitchd 87 accept > + syscall_exit_to_user_mode_prepare+0x161 [kernel] > + syscall_exit_to_user_mode_prepare+0x161 [kernel] > + syscall_exit_to_user_mode+0x9 [kernel] > + do_syscall_64+0x68 [kernel] > + entry_SYSCALL_64_after_hwframe+0x72 [kernel] > + __GI_accept+0x4d [libc.so.6] > + pfd_accept+0x3a [ovs-vswitchd] > + [unknown] 0x7fff19f2bd00 > + [unknown] 0xe4b8001f0f > + > +As you can see above, the output also shows the stackback trace. You can > +disable this using the ``--stack-trace-size 0`` option. 
> +
> +Unfortunately, the backtrace does not show a lot of useful information
> +due to the BCC [#BCC]_ toolkit not supporting DWARF decoding. So to further
> +analyze system call backtraces, you could use perf. The following perf
> +script can do this for you (refer to the embedded instructions):
> +
> +https://github.com/chaudron/perf_scripts/blob/master/analyze_perf_pmd_syscall.py
> +
> +
> +Using triggers
> +--------------
> +The tool supports start and/or stop triggers. This will allow you to capture
> +statistics triggered by a specific event. The following combinations of
> +start and stop triggers can be used.
> +
> +If you only use ``--start-trigger``, the inspection starts when the trigger
> +happens and runs until the ``--sample-time`` number of seconds has passed.
> +The example below shows all the supported options in this scenario.
> +
> +.. code-block:: console
> +
> + $ sudo ./kernel_delay.py --start-trigger up:bridge_run --sample-time 4 \
> + --sample-count 4 --sample-interval 1
> +
> +
> +If you only use ``--stop-trigger``, the inspection starts immediately and
> +stops when the trigger happens. The example below shows all the supported
> +options in this scenario.
> +
> +.. code-block:: console
> +
> + $ sudo ./kernel_delay.py --stop-trigger upr:bridge_run \
> + --sample-count 4 --sample-interval 1
> +
> +
> +If you use both ``--start-trigger`` and ``--stop-trigger`` triggers, the
> +statistics are captured between the first occurrences of these events.
> +The example below shows all the supported options in this scenario.
> +
> +.. code-block:: console
> +
> + $ sudo ./kernel_delay.py --start-trigger up:bridge_run \
> + --stop-trigger upr:bridge_run \
> + --sample-count 4 --sample-interval 1 \
> + --trigger-delta 50000
> +
> +What triggers are supported? Note that what ``kernel_delay.py`` calls triggers,
> +BCC [#BCC]_ calls events; these are eBPF tracepoints you can attach to. 
> +For more details on the supported tracepoints, check out the BCC
> +documentation [#BCC_EVENT]_.
> +
> +The list below shows the supported triggers and their argument format:
> +
> +**USDT probes:**
> + [u|usdt]:{provider}:{probe}
> +**Kernel tracepoint:**
> + [t|trace]:{system}:{event}
> +**kprobe:**
> + [k|kprobe]:{kernel_function}
> +**kretprobe:**
> + [kr|kretprobe]:{kernel_function}
> +**uprobe:**
> + [up|uprobe]:{function}
> +**uretprobe:**
> + [upr|uretprobe]:{function}
> +
> +Here are a couple of trigger examples; more use-case-specific examples can be
> +found in the *Examples* section.
> +
> +.. code-block:: console
> +
> + --start|stop-trigger u:udpif_revalidator:start_dump
> + --start|stop-trigger t:openvswitch:ovs_dp_upcall
> + --start|stop-trigger k:ovs_dp_process_packet
> + --start|stop-trigger kr:ovs_dp_process_packet
> + --start|stop-trigger up:bridge_run
> + --start|stop-trigger upr:bridge_run
> +
> +
> +Examples
> +--------
> +This section will give some examples of how to use this tool in real-world
> +scenarios. Let's start with the issue where Open vSwitch reports
> +``Unreasonably long XXXXms poll interval`` on your revalidator threads. Note
> +that there is a blog available explaining how the revalidator process works
> +in OVS [#REVAL_BLOG]_.
> +
> +First, let me explain this log message. It gets logged if the time delta
> +between two ``poll_block()`` calls is more than 1 second. In other words,
> +the process was spending a lot of time processing stuff that was made
> +available by the return of the ``poll_block()`` function.
> +
> +Do a run with the tool using the existing USDT revalidator probes as a start
> +and stop trigger (note that the resource-specific data is removed from the
> +non-revalidator threads):
> +
> +.. 
code-block:: console > + > + $ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done > + # Start sampling (trigger@791777093512008) @2023-06-14T14:52:00.110303 (12:52:00 UTC) > + # Stop sampling (trigger@791778281498462) @2023-06-14T14:52:01.297975 (12:52:01 UTC) > + # Triggered sample dump, stop-start delta 1,187,986,454 ns @2023-06-14T14:52:01.298021 (12:52:01 UTC) > + TID THREAD <RESOURCE SPECIFIC> > + ---------- ---------------- ---------------------------------------------------------------------------- > + 1457761 handler24 [SYSCALL STATISTICS] > + NAME NUMBER COUNT TOTAL ns MAX ns > + sendmsg 46 6110 123,274,761 41,776 > + recvmsg 47 136299 99,397,508 49,896 > + futex 202 51 7,655,832 7,536,776 > + poll 7 4068 1,202,883 2,907 > + getrusage 98 2034 586,602 1,398 > + sendto 44 9 213,682 27,417 > + TOTAL( - poll): 144503 231,128,385 > + > + [THREAD RUN STATISTICS] > + SCHED_CNT TOTAL ns MIN ns MAX ns > + > + [THREAD READY STATISTICS] > + SCHED_CNT TOTAL ns MAX ns > + 1 1,438 1,438 > + > + [SOFT IRQ STATISTICS] > + NAME VECT_NR COUNT TOTAL ns MAX ns > + sched 7 21 59,145 3,769 > + rcu 9 50 42,917 2,234 > + TOTAL: 71 102,062 > + 1457733 ovs-vswitchd [SYSCALL STATISTICS] > + ... 
> +
> + 1457792 revalidator55 [SYSCALL STATISTICS]
> + NAME NUMBER COUNT TOTAL ns MAX ns
> + futex 202 73 572,576,329 19,621,600
> + recvmsg 47 815 296,697,618 405,338
> + sendto 44 3 78,302 26,837
> + sendmsg 46 3 38,712 13,250
> + write 1 1 5,073 5,073
> + TOTAL( - poll): 895 869,396,034
> +
> + [THREAD RUN STATISTICS]
> + SCHED_CNT TOTAL ns MIN ns MAX ns
> + 48 394,350,393 1,729 140,455,796
> +
> + [THREAD READY STATISTICS]
> + SCHED_CNT TOTAL ns MAX ns
> + 49 23,650 1,559
> +
> + [SOFT IRQ STATISTICS]
> + NAME VECT_NR COUNT TOTAL ns MAX ns
> + sched 7 14 26,889 3,041
> + rcu 9 28 23,024 1,600
> + TOTAL: 42 49,913
> +
> +
> +From the start of the output above, you can see that the trigger took more
> +than a second (1,187,986,454 ns), which we already knew from the output of
> +the ``ovs-appctl upcall/show`` command.
> +
> +From *revalidator55*'s ``SYSCALL STATISTICS`` you can see it
> +spent almost 870ms handling syscalls, and there were no poll() calls being
> +executed. The ``THREAD RUN STATISTICS`` statistics here are a bit misleading,
> +as it looks like OVS only spent 394ms on the CPU. But as mentioned earlier,
> +this time does not include the time being on the CPU at the start or stop
> +of an event, which is exactly the case here, because USDT probes were used.
> +
> +From the above data and maybe some ``top`` output, it can be determined that
> +the *revalidator55* thread is taking a lot of CPU time, probably because it
> +has to do a lot of revalidator work by itself. The solution here is to increase
> +the number of revalidator threads, so more work can be done in parallel.
> +
> +Here is another run of the same command in another scenario:
> +
> +.. 
code-block:: console
> +
> + $ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done
> + # Start sampling (trigger@795160501758971) @2023-06-14T15:48:23.518512 (13:48:23 UTC)
> + # Stop sampling (trigger@795160764940201) @2023-06-14T15:48:23.781381 (13:48:23 UTC)
> + # Triggered sample dump, stop-start delta 263,181,230 ns @2023-06-14T15:48:23.781414 (13:48:23 UTC)
> + TID THREAD <RESOURCE SPECIFIC>
> + ---------- ---------------- ----------------------------------------------------------------------------
> + 1457733 ovs-vswitchd [SYSCALL STATISTICS]
> + ...
> + 1457792 revalidator55 [SYSCALL STATISTICS]
> + NAME NUMBER COUNT TOTAL ns MAX ns
> + recvmsg 47 284 193,422,110 46,248,418
> + sendto 44 2 46,685 23,665
> + sendmsg 46 2 24,916 12,703
> + write 1 1 6,534 6,534
> + TOTAL( - poll): 289 193,500,245
> +
> + [THREAD RUN STATISTICS]
> + SCHED_CNT TOTAL ns MIN ns MAX ns
> + 2 47,333,558 331,516 47,002,042
> +
> + [THREAD READY STATISTICS]
> + SCHED_CNT TOTAL ns MAX ns
> + 3 87,000,403 45,999,712
> +
> + [SOFT IRQ STATISTICS]
> + NAME VECT_NR COUNT TOTAL ns MAX ns
> + sched 7 2 9,504 5,109
> + TOTAL: 2 9,504
> +
> +
> +Here you can see that the revalidator run took about 263ms, which does not
> +look odd. However, the ``THREAD READY STATISTICS`` show that OVS waited 87ms
> +for a CPU to run on. This means the revalidator process could have finished
> +87ms faster. Looking at the ``MAX ns`` value, a worst-case delay of almost
> +46ms can be seen, which hints at an overloaded system.
> +
> +One final example uses a ``uprobe`` to get some statistics on a
> +``bridge_run()`` execution that takes more than 1ms.
> +
> +..
code-block:: console > + > + $ sudo ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run --trigger-delta 1000000 > + # Start sampling (trigger@2245245432101270) @2023-06-14T16:21:10.467919 (14:21:10 UTC) > + # Stop sampling (trigger@2245245432414656) @2023-06-14T16:21:10.468296 (14:21:10 UTC) > + # Sample dump skipped, delta 313,386 ns @2023-06-14T16:21:10.468419 (14:21:10 UTC) > + # Start sampling (trigger@2245245505301745) @2023-06-14T16:21:10.540970 (14:21:10 UTC) > + # Stop sampling (trigger@2245245506911119) @2023-06-14T16:21:10.542499 (14:21:10 UTC) > + # Triggered sample dump, stop-start delta 1,609,374 ns @2023-06-14T16:21:10.542565 (14:21:10 UTC) > + TID THREAD <RESOURCE SPECIFIC> > + ---------- ---------------- ---------------------------------------------------------------------------- > + 3371035 <unknown:3366258/3371035> [SYSCALL STATISTICS] > + ... <REMOVED 7 MORE unknown THREADS> > + 3371102 handler66 [SYSCALL STATISTICS] > + ... <REMOVED 7 MORE HANDLER THREADS> > + 3366258 ovs-vswitchd [SYSCALL STATISTICS] > + NAME NUMBER COUNT TOTAL ns MAX ns > + futex 202 43 403,469 199,312 > + clone3 435 13 174,394 30,731 > + munmap 11 8 115,774 21,861 > + poll 7 5 92,969 38,307 > + unlink 87 2 49,918 35,741 > + mprotect 10 8 47,618 13,201 > + accept 43 10 31,360 6,976 > + mmap 9 8 30,279 5,776 > + write 1 6 27,720 11,774 > + rt_sigprocmask 14 28 12,281 970 > + read 0 6 9,478 2,318 > + recvfrom 45 3 7,024 4,024 > + sendto 44 1 4,684 4,684 > + getrusage 98 5 4,594 1,342 > + close 3 2 2,918 1,627 > + recvmsg 47 1 2,722 2,722 > + TOTAL( - poll): 144 924,233 > + > + [THREAD RUN STATISTICS] > + SCHED_CNT TOTAL ns MIN ns MAX ns > + 13 817,605 5,433 524,376 > + > + [THREAD READY STATISTICS] > + SCHED_CNT TOTAL ns MAX ns > + 14 28,646 11,566 > + > + [SOFT IRQ STATISTICS] > + NAME VECT_NR COUNT TOTAL ns MAX ns > + rcu 9 1 2,838 2,838 > + TOTAL: 1 2,838 > + > + 3371110 revalidator74 [SYSCALL STATISTICS] > + ... 
<REMOVED 7 MORE NEW revalidator THREADS>
> + 3366311 urcu3 [SYSCALL STATISTICS]
> + ...
> +
> +
> +Some of the threads and their resource-specific data were removed from the
> +output above, but based on the ``<unknown:3366258/3371035>`` thread name, you
> +can determine that some threads no longer exist. In the ``ovs-vswitchd``
> +thread, you can see some ``clone3`` syscalls, indicating threads were
> +created. In this example, it was due to the deletion of a bridge, which
> +resulted in the recreation of the revalidator and handler threads.
> +
> +
> +Use with OpenShift
> +------------------
> +This section describes how you would use the tool on a node in an OpenShift
> +cluster. It assumes you have console access to the node, either directly or
> +through a debug container.
> +
> +A base Fedora 38 container is used through podman, as it allows installing
> +the additional tools and packages needed.
> +
> +First, the container needs to be started:
> +
> +.. code-block:: console
> +
> + [core@sno-master ~]$ sudo podman run -it --rm \
> + -e PS1='[(DEBUG)\u@\h \W]\$ ' \
> + --privileged --network=host --pid=host \
> + -v /lib/modules:/lib/modules:ro \
> + -v /sys/kernel/debug:/sys/kernel/debug \
> + -v /proc:/proc \
> + -v /:/mnt/rootdir \
> + quay.io/fedora/fedora:38-x86_64
> +
> + [(DEBUG)root@sno-master /]#
> +
> +
> +Next, install the ``kernel_delay.py`` dependencies:
> +
> +.. code-block:: console
> +
> + [(DEBUG)root@sno-master /]# dnf install -y bcc-tools perl-interpreter \
> + python3-pytz python3-psutil
> +
> +
> +You also need to install the devel, debuginfo, and debugsource RPMs matching
> +your OVS and kernel versions:
> +
> +.. code-block:: console
> +
> + [(DEBUG)root@sno-master home]# rpm -i \
> + openvswitch2.17-debuginfo-2.17.0-67.el8fdp.x86_64.rpm \
> + openvswitch2.17-debugsource-2.17.0-67.el8fdp.x86_64.rpm \
> + kernel-devel-4.18.0-372.41.1.el8_6.x86_64.rpm
> +
> +
> +Now the tool can be started. Here the above ``bridge_run()`` example is used:
> +
> +..
code-block:: console > + > + [(DEBUG)root@sno-master home]# ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run > + # Start sampling (trigger@75279117343513) @2023-06-15T11:44:07.628372 (11:44:07 UTC) > + # Stop sampling (trigger@75279117443980) @2023-06-15T11:44:07.628529 (11:44:07 UTC) > + # Triggered sample dump, stop-start delta 100,467 ns @2023-06-15T11:44:07.628569 (11:44:07 UTC) > + TID THREAD <RESOURCE SPECIFIC> > + ---------- ---------------- ---------------------------------------------------------------------------- > + 1246 ovs-vswitchd [SYSCALL STATISTICS] > + NAME NUMBER COUNT TOTAL ns MAX ns > + getdents64 217 2 8,560 8,162 > + openat 257 1 6,951 6,951 > + accept 43 4 6,942 3,763 > + recvfrom 45 1 3,726 3,726 > + recvmsg 47 2 2,880 2,188 > + stat 4 2 1,946 1,384 > + close 3 1 1,393 1,393 > + fstat 5 1 1,324 1,324 > + TOTAL( - poll): 14 33,722 > + > + [THREAD RUN STATISTICS] > + SCHED_CNT TOTAL ns MIN ns MAX ns > + > + [THREAD READY STATISTICS] > + SCHED_CNT TOTAL ns MAX ns > + > + > +.. rubric:: Footnotes > + > +.. [#BCC] https://github.com/iovisor/bcc > +.. [#BCC_EVENT] https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments > +.. [#REVAL_BLOG] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained
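[Editor's note] The per-thread analysis done by hand in the examples above (comparing a thread's syscall, run, and ready totals against the trigger's stop-start delta) can be sketched as a small helper. The numbers below are copied from the first ``revalidator55`` sample; the helper itself is illustrative and is not part of ``kernel_delay.py``:

```python
# Illustrative helper (not part of kernel_delay.py): express the totals
# reported by the tool as a share of the trigger's stop-start window.
# Note the categories overlap: syscall time accrues whether or not the
# thread is currently on the CPU.

def time_budget(delta_ns, **totals_ns):
    """Return each measured total as a percentage of the trigger window."""
    return {name: round(100.0 * ns / delta_ns, 1)
            for name, ns in totals_ns.items()}

# Numbers from the first triggered sample dump (thread revalidator55).
budget = time_budget(1_187_986_454,        # stop-start delta
                     syscall=869_396_034,  # TOTAL( - poll)
                     run=394_350_393,      # THREAD RUN STATISTICS
                     ready=23_650)         # THREAD READY STATISTICS

for name, pct in budget.items():
    print(f"{name:8} {pct:5.1f}% of the trigger window")
```

For that sample this reports roughly 73% of the window spent in syscalls and 33% on the CPU, with a negligible ready time, matching the reasoning above that the thread's own workload, not scheduling delay, is the bottleneck.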
On 26 Sep 2023, at 16:57, Aaron Conole wrote: > Eelco Chaudron <echaudro@redhat.com> writes: > >> This patch adds an utility that can be used to determine if >> an issue is related to a lack of Linux kernel resources. >> >> This tool is also featured in a Red Hat developers blog article: >> >> https://developers.redhat.com/articles/2023/07/24/troubleshooting-open-vswitch-kernel-blame >> >> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> >> >> --- > > Nits below, with those addressed > > Acked-by: Aaron Conole <aconole@redhat.com> Thanks Aaron, I’ve applied the patch with your below nits addressed ;) >> v2: Addressed review comments from Aaron. >> v3: Changed wording in documentation. >> v4: Addressed review comments from Adrian. >> >> utilities/automake.mk | 4 >> utilities/usdt-scripts/kernel_delay.py | 1420 +++++++++++++++++++++++++++++++ >> utilities/usdt-scripts/kernel_delay.rst | 596 +++++++++++++ >> 3 files changed, 2020 insertions(+) >> create mode 100755 utilities/usdt-scripts/kernel_delay.py >> create mode 100644 utilities/usdt-scripts/kernel_delay.rst > > General: > > This code has a mix of " and ' (see for example, used vs fn_name in some > blocks). I think it is preferable to use " everywhere. It is a bit > ambiguous in the python coding style, and I see it used > interchangeably. It would be preferable to keep it consistent here. Thanks for reminding me, I thought I did it for this script already but you proved me wrong... 
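[Editor's note] Aaron's quoting nit above can also be checked mechanically. As a rough sketch (this helper is hypothetical and not part of the patch; in practice a linter plugin such as flake8-quotes is the usual way to enforce a single quote style), Python's tokenize module can list the single-quoted string literals in a source file:

```python
# Hypothetical helper (not part of the patch): list single-quoted string
# literals so a mixed " / ' style can be spotted mechanically.
import io
import tokenize

def single_quoted_strings(source):
    """Return (line, token) pairs for '...'-quoted string literals."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            body = tok.string.lstrip("rbufRBUF")  # drop any string prefix
            if body.startswith("'") and not body.startswith("'''"):
                hits.append((tok.start[0], tok.string))
    return hits

# Echoes the "used vs fn_name" mix Aaron points out.
sample = 'used = "events"\nfn_name = \'do_trace\'\n'
print(single_quoted_strings(sample))  # prints [(2, "'do_trace'")]
```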
>> diff --git a/utilities/automake.mk b/utilities/automake.mk >> index 37d679f82..9a2114df4 100644 >> --- a/utilities/automake.mk >> +++ b/utilities/automake.mk >> @@ -23,6 +23,8 @@ scripts_DATA += utilities/ovs-lib >> usdt_SCRIPTS += \ >> utilities/usdt-scripts/bridge_loop.bt \ >> utilities/usdt-scripts/dpif_nl_exec_monitor.py \ >> + utilities/usdt-scripts/kernel_delay.py \ >> + utilities/usdt-scripts/kernel_delay.rst \ >> utilities/usdt-scripts/reval_monitor.py \ >> utilities/usdt-scripts/upcall_cost.py \ >> utilities/usdt-scripts/upcall_monitor.py >> @@ -70,6 +72,8 @@ EXTRA_DIST += \ >> utilities/docker/debian/build-kernel-modules.sh \ >> utilities/usdt-scripts/bridge_loop.bt \ >> utilities/usdt-scripts/dpif_nl_exec_monitor.py \ >> + utilities/usdt-scripts/kernel_delay.py \ >> + utilities/usdt-scripts/kernel_delay.rst \ >> utilities/usdt-scripts/reval_monitor.py \ >> utilities/usdt-scripts/upcall_cost.py \ >> utilities/usdt-scripts/upcall_monitor.py >> diff --git a/utilities/usdt-scripts/kernel_delay.py b/utilities/usdt-scripts/kernel_delay.py >> new file mode 100755 >> index 000000000..636e108be >> --- /dev/null >> +++ b/utilities/usdt-scripts/kernel_delay.py >> @@ -0,0 +1,1420 @@ >> +#!/usr/bin/env python3 >> +# >> +# Copyright (c) 2022,2023 Red Hat, Inc. >> +# >> +# Licensed under the Apache License, Version 2.0 (the "License"); >> +# you may not use this file except in compliance with the License. >> +# You may obtain a copy of the License at: >> +# >> +# http://www.apache.org/licenses/LICENSE-2.0 >> +# >> +# Unless required by applicable law or agreed to in writing, software >> +# distributed under the License is distributed on an "AS IS" BASIS, >> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> +# See the License for the specific language governing permissions and >> +# limitations under the License. 
>> +# >> +# >> +# Script information: >> +# ------------------- >> +# This script allows a developer to quickly identify if the issue at hand >> +# might be related to the kernel running out of resources or if it really is >> +# an Open vSwitch issue. >> +# >> +# For documentation see the kernel_delay.rst file. >> +# >> +# >> +# Dependencies: >> +# ------------- >> +# You need to install the BCC package for your specific platform or build it >> +# yourself using the following instructions: >> +# https://raw.githubusercontent.com/iovisor/bcc/master/INSTALL.md >> +# >> +# Python needs the following additional packages installed: >> +# - pytz >> +# - psutil >> +# >> +# You can either install your distribution specific package or use pip: >> +# pip install pytz psutil >> +# >> +import argparse >> +import datetime >> +import os >> +import pytz >> +import psutil >> +import re >> +import sys >> +import time >> + >> +import ctypes as ct >> + >> +try: >> + from bcc import BPF, USDT, USDTException >> + from bcc.syscall import syscalls, syscall_name >> +except ModuleNotFoundError: >> + print("ERROR: Can't find the BPF Compiler Collection (BCC) tools!") >> + sys.exit(os.EX_OSFILE) >> + >> +from enum import IntEnum >> + >> + >> +# >> +# Actual eBPF source code >> +# >> +EBPF_SOURCE = """ >> +#include <linux/irq.h> >> +#include <linux/sched.h> >> + >> +#define MONITOR_PID <MONITOR_PID> >> + >> +enum { >> +<EVENT_ENUM> >> +}; >> + >> +struct event_t { >> + u64 ts; >> + u32 tid; >> + u32 id; >> + >> + int user_stack_id; >> + int kernel_stack_id; >> + >> + u32 syscall; >> + u64 entry_ts; >> + >> +}; >> + >> +BPF_RINGBUF_OUTPUT(events, <BUFFER_PAGE_CNT>); >> +BPF_STACK_TRACE(stack_traces, <STACK_TRACE_SIZE>); >> +BPF_TABLE("percpu_array", uint32_t, uint64_t, dropcnt, 1); >> +BPF_TABLE("percpu_array", uint32_t, uint64_t, trigger_miss, 1); >> + >> +BPF_ARRAY(capture_on, u64, 1); >> +static inline bool capture_enabled(u64 pid_tgid) { >> + int key = 0; >> + u64 *ret; >> + >> + if 
((pid_tgid >> 32) != MONITOR_PID) >> + return false; >> + >> + ret = capture_on.lookup(&key); >> + return ret && *ret == 1; >> +} >> + >> +static inline bool capture_enabled__() { >> + int key = 0; >> + u64 *ret; >> + >> + ret = capture_on.lookup(&key); >> + return ret && *ret == 1; >> +} >> + >> +static struct event_t *get_event(uint32_t id) { >> + struct event_t *event = events.ringbuf_reserve(sizeof(struct event_t)); >> + >> + if (!event) { >> + dropcnt.increment(0); >> + return NULL; >> + } >> + >> + event->id = id; >> + event->ts = bpf_ktime_get_ns(); >> + event->tid = bpf_get_current_pid_tgid(); >> + >> + return event; >> +} >> + >> +static int start_trigger() { >> + int key = 0; >> + u64 *val = capture_on.lookup(&key); >> + >> + /* If the value is -1 we can't start as we are still processing the >> + * results in userspace. */ >> + if (!val || *val != 0) { >> + trigger_miss.increment(0); >> + return 0; >> + } >> + >> + struct event_t *event = get_event(EVENT_START_TRIGGER); >> + if (event) { >> + events.ringbuf_submit(event, 0); >> + *val = 1; >> + } else { >> + trigger_miss.increment(0); >> + } >> + return 0; >> +} >> + >> +static int stop_trigger() { >> + int key = 0; >> + u64 *val = capture_on.lookup(&key); >> + >> + if (!val || *val != 1) >> + return 0; >> + >> + struct event_t *event = get_event(EVENT_STOP_TRIGGER); >> + >> + if (event) >> + events.ringbuf_submit(event, 0); >> + >> + if (val) >> + *val = -1; >> + >> + return 0; >> +} >> + >> +<START_TRIGGER> >> +<STOP_TRIGGER> >> + >> + >> +/* >> + * For the syscall monitor the following probes get installed. 
>> + */ >> +struct syscall_data_t { >> + u64 count; >> + u64 total_ns; >> + u64 worst_ns; >> +}; >> + >> +struct syscall_data_key_t { >> + u32 pid; >> + u32 tid; >> + u32 syscall; >> +}; >> + >> +BPF_HASH(syscall_start, u64, u64); >> +BPF_HASH(syscall_data, struct syscall_data_key_t, struct syscall_data_t); >> + >> +TRACEPOINT_PROBE(raw_syscalls, sys_enter) { >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + u64 t = bpf_ktime_get_ns(); >> + syscall_start.update(&pid_tgid, &t); >> + >> + return 0; >> +} >> + >> +TRACEPOINT_PROBE(raw_syscalls, sys_exit) { >> + struct syscall_data_t *val, zero = {}; >> + struct syscall_data_key_t key; >> + >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + key.pid = pid_tgid >> 32; >> + key.tid = (u32)pid_tgid; >> + key.syscall = args->id; >> + >> + u64 *start_ns = syscall_start.lookup(&pid_tgid); >> + >> + if (!start_ns) >> + return 0; >> + >> + val = syscall_data.lookup_or_try_init(&key, &zero); >> + if (val) { >> + u64 delta = bpf_ktime_get_ns() - *start_ns; >> + val->count++; >> + val->total_ns += delta; >> + if (val->worst_ns == 0 || delta > val->worst_ns) >> + val->worst_ns = delta; >> + >> + if (<SYSCALL_TRACE_EVENTS>) { >> + struct event_t *event = get_event(EVENT_SYSCALL); >> + if (event) { >> + event->syscall = args->id; >> + event->entry_ts = *start_ns; >> + if (<STACK_TRACE_ENABLED>) { >> + event->user_stack_id = stack_traces.get_stackid( >> + args, BPF_F_USER_STACK); >> + event->kernel_stack_id = stack_traces.get_stackid( >> + args, 0); >> + } >> + events.ringbuf_submit(event, 0); >> + } >> + } >> + } >> + return 0; >> +} >> + >> + >> +/* >> + * For measuring the thread run time, we need the following. 
>> + */ >> +struct run_time_data_t { >> + u64 count; >> + u64 total_ns; >> + u64 max_ns; >> + u64 min_ns; >> +}; >> + >> +struct pid_tid_key_t { >> + u32 pid; >> + u32 tid; >> +}; >> + >> +BPF_HASH(run_start, u64, u64); >> +BPF_HASH(run_data, struct pid_tid_key_t, struct run_time_data_t); >> + >> +static inline void thread_start_run(u64 pid_tgid, u64 ktime) >> +{ >> + run_start.update(&pid_tgid, &ktime); >> +} >> + >> +static inline void thread_stop_run(u32 pid, u32 tgid, u64 ktime) >> +{ >> + u64 pid_tgid = (u64) tgid << 32 | pid; >> + u64 *start_ns = run_start.lookup(&pid_tgid); >> + >> + if (!start_ns || *start_ns == 0) >> + return; >> + >> + struct run_time_data_t *val, zero = {}; >> + struct pid_tid_key_t key = { .pid = tgid, >> + .tid = pid }; >> + >> + val = run_data.lookup_or_try_init(&key, &zero); >> + if (val) { >> + u64 delta = ktime - *start_ns; >> + val->count++; >> + val->total_ns += delta; >> + if (val->max_ns == 0 || delta > val->max_ns) >> + val->max_ns = delta; >> + if (val->min_ns == 0 || delta < val->min_ns) >> + val->min_ns = delta; >> + } >> + *start_ns = 0; >> +} >> + >> + >> +/* >> + * For measuring the thread-ready delay, we need the following. 
>> + */ >> +struct ready_data_t { >> + u64 count; >> + u64 total_ns; >> + u64 worst_ns; >> +}; >> + >> +BPF_HASH(ready_start, u64, u64); >> +BPF_HASH(ready_data, struct pid_tid_key_t, struct ready_data_t); >> + >> +static inline int sched_wakeup__(u32 pid, u32 tgid) >> +{ >> + u64 pid_tgid = (u64) tgid << 32 | pid; >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + u64 t = bpf_ktime_get_ns(); >> + ready_start.update(&pid_tgid, &t); >> + return 0; >> +} >> + >> +RAW_TRACEPOINT_PROBE(sched_wakeup) >> +{ >> + struct task_struct *t = (struct task_struct *)ctx->args[0]; >> + return sched_wakeup__(t->pid, t->tgid); >> +} >> + >> +RAW_TRACEPOINT_PROBE(sched_wakeup_new) >> +{ >> + struct task_struct *t = (struct task_struct *)ctx->args[0]; >> + return sched_wakeup__(t->pid, t->tgid); >> +} >> + >> +RAW_TRACEPOINT_PROBE(sched_switch) >> +{ >> + struct task_struct *prev = (struct task_struct *)ctx->args[1]; >> + struct task_struct *next= (struct task_struct *)ctx->args[2]; >> + u64 ktime = 0; >> + >> + if (!capture_enabled__()) >> + return 0; >> + >> + if (prev-><STATE_FIELD> == TASK_RUNNING && prev->tgid == MONITOR_PID) >> + sched_wakeup__(prev->pid, prev->tgid); >> + >> + if (prev->tgid == MONITOR_PID) { >> + ktime = bpf_ktime_get_ns(); >> + thread_stop_run(prev->pid, prev->tgid, ktime); >> + } >> + >> + u64 pid_tgid = (u64)next->tgid << 32 | next->pid; >> + >> + if (next->tgid != MONITOR_PID) >> + return 0; >> + >> + if (ktime == 0) >> + ktime = bpf_ktime_get_ns(); >> + >> + u64 *start_ns = ready_start.lookup(&pid_tgid); >> + >> + if (start_ns && *start_ns != 0) { >> + >> + struct ready_data_t *val, zero = {}; >> + struct pid_tid_key_t key = { .pid = next->tgid, >> + .tid = next->pid }; >> + >> + val = ready_data.lookup_or_try_init(&key, &zero); >> + if (val) { >> + u64 delta = ktime - *start_ns; >> + val->count++; >> + val->total_ns += delta; >> + if (val->worst_ns == 0 || delta > val->worst_ns) >> + val->worst_ns = delta; >> + } >> + *start_ns = 0; >> + 
} >> + >> + thread_start_run(pid_tgid, ktime); >> + return 0; >> +} >> + >> + >> +/* >> + * For measuring the hard irq time, we need the following. >> + */ >> +struct hardirq_start_data_t { >> + u64 start_ns; >> + char irq_name[32]; >> +}; >> + >> +struct hardirq_data_t { >> + u64 count; >> + u64 total_ns; >> + u64 worst_ns; >> +}; >> + >> +struct hardirq_data_key_t { >> + u32 pid; >> + u32 tid; >> + char irq_name[32]; >> +}; >> + >> +BPF_HASH(hardirq_start, u64, struct hardirq_start_data_t); >> +BPF_HASH(hardirq_data, struct hardirq_data_key_t, struct hardirq_data_t); >> + >> +TRACEPOINT_PROBE(irq, irq_handler_entry) >> +{ >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + struct hardirq_start_data_t data = {}; >> + >> + data.start_ns = bpf_ktime_get_ns(); >> + TP_DATA_LOC_READ_STR(&data.irq_name, name, sizeof(data.irq_name)); >> + hardirq_start.update(&pid_tgid, &data); >> + return 0; >> +} >> + >> +TRACEPOINT_PROBE(irq, irq_handler_exit) >> +{ >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + struct hardirq_start_data_t *data; >> + data = hardirq_start.lookup(&pid_tgid); >> + if (!data || data->start_ns == 0) >> + return 0; >> + >> + if (args->ret != IRQ_NONE) { >> + struct hardirq_data_t *val, zero = {}; >> + struct hardirq_data_key_t key = { .pid = pid_tgid >> 32, >> + .tid = (u32)pid_tgid }; >> + >> + bpf_probe_read_kernel(&key.irq_name, sizeof(key.irq_name), >> + data->irq_name); >> + val = hardirq_data.lookup_or_try_init(&key, &zero); >> + if (val) { >> + u64 delta = bpf_ktime_get_ns() - data->start_ns; >> + val->count++; >> + val->total_ns += delta; >> + if (val->worst_ns == 0 || delta > val->worst_ns) >> + val->worst_ns = delta; >> + } >> + } >> + >> + data->start_ns = 0; >> + return 0; >> +} >> + >> + >> +/* >> + * For measuring the soft irq time, we need the following. 
>> + */ >> +struct softirq_start_data_t { >> + u64 start_ns; >> + u32 vec_nr; >> +}; >> + >> +struct softirq_data_t { >> + u64 count; >> + u64 total_ns; >> + u64 worst_ns; >> +}; >> + >> +struct softirq_data_key_t { >> + u32 pid; >> + u32 tid; >> + u32 vec_nr; >> +}; >> + >> +BPF_HASH(softirq_start, u64, struct softirq_start_data_t); >> +BPF_HASH(softirq_data, struct softirq_data_key_t, struct softirq_data_t); >> + >> +TRACEPOINT_PROBE(irq, softirq_entry) >> +{ >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + struct softirq_start_data_t data = {}; >> + >> + data.start_ns = bpf_ktime_get_ns(); >> + data.vec_nr = args->vec; >> + softirq_start.update(&pid_tgid, &data); >> + return 0; >> +} >> + >> +TRACEPOINT_PROBE(irq, softirq_exit) >> +{ >> + u64 pid_tgid = bpf_get_current_pid_tgid(); >> + >> + if (!capture_enabled(pid_tgid)) >> + return 0; >> + >> + struct softirq_start_data_t *data; >> + data = softirq_start.lookup(&pid_tgid); >> + if (!data || data->start_ns == 0) >> + return 0; >> + >> + struct softirq_data_t *val, zero = {}; >> + struct softirq_data_key_t key = { .pid = pid_tgid >> 32, >> + .tid = (u32)pid_tgid, >> + .vec_nr = data->vec_nr}; >> + >> + val = softirq_data.lookup_or_try_init(&key, &zero); >> + if (val) { >> + u64 delta = bpf_ktime_get_ns() - data->start_ns; >> + val->count++; >> + val->total_ns += delta; >> + if (val->worst_ns == 0 || delta > val->worst_ns) >> + val->worst_ns = delta; >> + } >> + >> + data->start_ns = 0; >> + return 0; >> +} >> +""" >> + >> + >> +# >> +# time_ns() >> +# >> +try: >> + from time import time_ns >> +except ImportError: >> + # For compatibility with Python <= v3.6. 
>> + def time_ns(): >> + now = datetime.datetime.now() >> + return int(now.timestamp() * 1e9) >> + >> + >> +# >> +# Probe class to use for the start/stop triggers >> +# >> +class Probe(object): >> + ''' >> + The goal for this object is to support as many as possible >> + probe/events as supported by BCC. See >> + https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments >> + ''' >> + def __init__(self, probe, pid=None): >> + self.pid = pid >> + self.text_probe = probe >> + self._parse_text_probe() >> + >> + def __str__(self): >> + if self.probe_type == "usdt": >> + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, >> + self.usdt_provider, self.usdt_probe) >> + elif self.probe_type == "trace": >> + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, >> + self.trace_system, self.trace_event) >> + elif self.probe_type == "kprobe" or self.probe_type == "kretprobe": >> + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, >> + self.kprobe_function) >> + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": >> + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, >> + self.uprobe_function) >> + else: >> + return "[{}] <{}:unknown probe>".format(self.text_probe, >> + self.probe_type) >> + >> + def _raise(self, error): >> + raise ValueError("[{}]; {}".format(self.text_probe, error)) >> + >> + def _verify_kprobe_probe(self): >> + # Nothing to verify for now, just return. >> + return >> + >> + def _verify_trace_probe(self): >> + # Nothing to verify for now, just return. >> + return >> + >> + def _verify_uprobe_probe(self): >> + # Nothing to verify for now, just return. 
>> + return >> + >> + def _verify_usdt_probe(self): >> + if not self.pid: >> + self._raise("USDT probes need a valid PID.") >> + >> + usdt = USDT(pid=self.pid) >> + >> + for probe in usdt.enumerate_probes(): >> + if probe.provider.decode('utf-8') == self.usdt_provider and \ >> + probe.name.decode('utf-8') == self.usdt_probe: >> + return >> + >> + self._raise("Can't find UDST probe '{}:{}'".format(self.usdt_provider, >> + self.usdt_probe)) >> + >> + def _parse_text_probe(self): >> + ''' >> + The text probe format is defined as follows: >> + <probe_type>:<probe_specific> >> + >> + Types: >> + USDT: u|usdt:<provider>:<probe> >> + TRACE: t|trace:<system>:<event> >> + KPROBE: k|kprobe:<kernel_function> >> + KRETPROBE: kr|kretprobe:<kernel_function> >> + UPROBE: up|uprobe:<function> >> + URETPROBE: ur|uretprobe:<function> >> + ''' >> + args = self.text_probe.split(":") >> + if len(args) <= 1: >> + self._raise("Can't extract probe type.") >> + >> + if args[0] not in ["k", "kprobe", "kr", "kretprobe", "t", "trace", >> + "u", "usdt", "up", "uprobe", "ur", "uretprobe"]: >> + self._raise("Invalid probe type '{}'".format(args[0])) >> + >> + self.probe_type = "kprobe" if args[0] == "k" else args[0] >> + self.probe_type = "kretprobe" if args[0] == "kr" else self.probe_type >> + self.probe_type = "trace" if args[0] == "t" else self.probe_type >> + self.probe_type = "usdt" if args[0] == "u" else self.probe_type >> + self.probe_type = "uprobe" if args[0] == "up" else self.probe_type >> + self.probe_type = "uretprobe" if args[0] == "ur" else self.probe_type >> + >> + if self.probe_type == "usdt": >> + if len(args) != 3: >> + self._raise("Invalid number of arguments for USDT") >> + >> + self.usdt_provider = args[1] >> + self.usdt_probe = args[2] >> + self._verify_usdt_probe() >> + >> + elif self.probe_type == "trace": >> + if len(args) != 3: >> + self._raise("Invalid number of arguments for TRACE") >> + >> + self.trace_system = args[1] >> + self.trace_event = args[2] >> + 
self._verify_trace_probe() >> + >> + elif self.probe_type == "kprobe" or self.probe_type == "kretprobe": >> + if len(args) != 2: >> + self._raise("Invalid number of arguments for K(RET)PROBE") >> + self.kprobe_function = args[1] >> + self._verify_kprobe_probe() >> + >> + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": >> + if len(args) != 2: >> + self._raise("Invalid number of arguments for U(RET)PROBE") >> + self.uprobe_function = args[1] >> + self._verify_uprobe_probe() >> + >> + def _get_kprobe_c_code(self, function_name, function_content): >> + # >> + # The kprobe__* do not require a function name, so it's >> + # ignored in the code generation. >> + # >> + return """ >> +int {}__{}(struct pt_regs *ctx) {{ >> + {} >> +}} >> +""".format(self.probe_type, self.kprobe_function, function_content) >> + >> + def _get_trace_c_code(self, function_name, function_content): >> + # >> + # The TRACEPOINT_PROBE() do not require a function name, so it's >> + # ignored in the code generation. 
>> + # >> + return """ >> +TRACEPOINT_PROBE({},{}) {{ >> + {} >> +}} >> +""".format(self.trace_system, self.trace_event, function_content) >> + >> + def _get_uprobe_c_code(self, function_name, function_content): >> + return """ >> +int {}(struct pt_regs *ctx) {{ >> + {} >> +}} >> +""".format(function_name, function_content) >> + >> + def _get_usdt_c_code(self, function_name, function_content): >> + return """ >> +int {}(struct pt_regs *ctx) {{ >> + {} >> +}} >> +""".format(function_name, function_content) >> + >> + def get_c_code(self, function_name, function_content): >> + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': >> + return self._get_kprobe_c_code(function_name, function_content) >> + elif self.probe_type == 'trace': >> + return self._get_trace_c_code(function_name, function_content) >> + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': >> + return self._get_uprobe_c_code(function_name, function_content) >> + elif self.probe_type == 'usdt': >> + return self._get_usdt_c_code(function_name, function_content) >> + >> + return "" >> + >> + def probe_name(self): >> + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': >> + return "{}".format(self.kprobe_function) >> + elif self.probe_type == 'trace': >> + return "{}:{}".format(self.trace_system, >> + self.trace_event) >> + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': >> + return "{}".format(self.uprobe_function) >> + elif self.probe_type == 'usdt': >> + return "{}:{}".format(self.usdt_provider, >> + self.usdt_probe) >> + >> + return "" >> + >> + >> +# >> +# event_to_dict() >> +# >> +def event_to_dict(event): >> + return dict([(field, getattr(event, field)) >> + for (field, _) in event._fields_ >> + if isinstance(getattr(event, field), (int, bytes))]) >> + >> + >> +# >> +# Event enum >> +# >> +Event = IntEnum("Event", ["SYSCALL", "START_TRIGGER", "STOP_TRIGGER"], >> + start=0) >> + >> + >> +# >> +# process_event() >> +# >> +def 
process_event(ctx, data, size): >> + global start_trigger_ts >> + global stop_trigger_ts >> + >> + event = bpf['events'].event(data) >> + if event.id == Event.SYSCALL: >> + syscall_events.append({"tid": event.tid, >> + "ts_entry": event.entry_ts, >> + "ts_exit": event.ts, >> + "syscall": event.syscall, >> + "user_stack_id": event.user_stack_id, >> + "kernel_stack_id": event.kernel_stack_id}) >> + elif event.id == Event.START_TRIGGER: >> + # >> + # This event would have started the trigger already, so all we need to >> + # do is record the start timestamp. >> + # >> + start_trigger_ts = event.ts >> + >> + elif event.id == Event.STOP_TRIGGER: >> + # >> + # This event would have stopped the trigger already, so all we need to >> + # do is record the start timestamp. >> + stop_trigger_ts = event.ts >> + >> + >> +# >> +# next_power_of_two() >> +# >> +def next_power_of_two(val): >> + np = 1 >> + while np < val: >> + np *= 2 >> + return np >> + >> + >> +# >> +# unsigned_int() >> +# >> +def unsigned_int(value): >> + try: >> + value = int(value) >> + except ValueError: >> + raise argparse.ArgumentTypeError("must be an integer") >> + >> + if value < 0: >> + raise argparse.ArgumentTypeError("must be positive") >> + return value >> + >> + >> +# >> +# unsigned_nonzero_int() >> +# >> +def unsigned_nonzero_int(value): >> + value = unsigned_int(value) >> + if value == 0: >> + raise argparse.ArgumentTypeError("must be nonzero") >> + return value >> + >> + >> +# >> +# get_thread_name() >> +# >> +def get_thread_name(pid, tid): >> + try: >> + with open(f"/proc/{pid}/task/{tid}/comm", encoding="utf8") as f: >> + return f.readline().strip("\n") >> + except FileNotFoundError: >> + pass >> + >> + return f"<unknown:{pid}/{tid}>" >> + >> + >> +# >> +# get_vec_nr_name() >> +# >> +def get_vec_nr_name(vec_nr): >> + known_vec_nr = ["hi", "timer", "net_tx", "net_rx", "block", "irq_poll", >> + "tasklet", "sched", "hrtimer", "rcu"] >> + >> + if vec_nr < 0 or vec_nr > len(known_vec_nr): >> + return 
f"<unknown:{vec_nr}>" >> + >> + return known_vec_nr[vec_nr] >> + >> + >> +# >> +# start/stop/reset capture >> +# >> +def start_capture(): >> + bpf["capture_on"][ct.c_int(0)] = ct.c_int(1) >> + >> + >> +def stop_capture(force=False): >> + if force: >> + bpf["capture_on"][ct.c_int(0)] = ct.c_int(0xffff) >> + else: >> + bpf["capture_on"][ct.c_int(0)] = ct.c_int(0) >> + >> + >> +def capture_running(): >> + return bpf["capture_on"][ct.c_int(0)].value == 1 >> + >> + >> +def reset_capture(): >> + bpf["syscall_start"].clear() >> + bpf["syscall_data"].clear() >> + bpf["run_start"].clear() >> + bpf["run_data"].clear() >> + bpf["ready_start"].clear() >> + bpf["ready_data"].clear() >> + bpf["hardirq_start"].clear() >> + bpf["hardirq_data"].clear() >> + bpf["softirq_start"].clear() >> + bpf["softirq_data"].clear() >> + bpf["stack_traces"].clear() >> + >> + >> +# >> +# Display timestamp >> +# >> +def print_timestamp(msg): >> + ltz = datetime.datetime.now() >> + utc = ltz.astimezone(pytz.utc) >> + time_string = "{} @{} ({} UTC)".format( >> + msg, ltz.isoformat(), utc.strftime("%H:%M:%S")) >> + print(time_string) >> + >> + >> +# >> +# process_results() >> +# >> +def process_results(syscall_events=None, trigger_delta=None): >> + if trigger_delta: >> + print_timestamp("# Triggered sample dump, stop-start delta {:,} ns". >> + format(trigger_delta)) >> + else: >> + print_timestamp("# Sample dump") >> + >> + # >> + # First get a list of all threads we need to report on. 
>> + # >> + threads_syscall = {k.tid for k, _ in bpf["syscall_data"].items() >> + if k.syscall != 0xffffffff} >> + >> + threads_run = {k.tid for k, _ in bpf["run_data"].items() >> + if k.pid != 0xffffffff} >> + >> + threads_ready = {k.tid for k, _ in bpf["ready_data"].items() >> + if k.pid != 0xffffffff} >> + >> + threads_hardirq = {k.tid for k, _ in bpf["hardirq_data"].items() >> + if k.pid != 0xffffffff} >> + >> + threads_softirq = {k.tid for k, _ in bpf["softirq_data"].items() >> + if k.pid != 0xffffffff} >> + >> + threads = sorted(threads_syscall | threads_run | threads_ready | >> + threads_hardirq | threads_softirq, >> + key=lambda x: get_thread_name(options.pid, x)) >> + >> + # >> + # Print header... >> + # >> + print("{:10} {:16} {}".format("TID", "THREAD", "<RESOURCE SPECIFIC>")) >> + print("{:10} {:16} {}".format("-" * 10, "-" * 16, "-" * 76)) >> + indent = 28 * " " >> + >> + # >> + # Print all events/statistics per threads. >> + # >> + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0] >> + for thread in threads: >> + >> + if thread != threads[0]: >> + print("") >> + >> + # >> + # SYSCALL_STATISTICS >> + # >> + print("{:10} {:16} {}\n{}{:20} {:>6} {:>10} {:>16} {:>16}".format( >> + thread, get_thread_name(options.pid, thread), >> + "[SYSCALL STATISTICS]", indent, >> + "NAME", "NUMBER", "COUNT", "TOTAL ns", "MAX ns")) >> + >> + total_count = 0 >> + total_ns = 0 >> + for k, v in sorted(filter(lambda t: t[0].tid == thread, >> + bpf["syscall_data"].items()), >> + key=lambda kv: -kv[1].total_ns): >> + >> + print("{}{:20.20} {:6} {:10} {:16,} {:16,}".format( >> + indent, syscall_name(k.syscall).decode('utf-8'), k.syscall, >> + v.count, v.total_ns, v.worst_ns)) >> + if k.syscall != poll_id: >> + total_count += v.count >> + total_ns += v.total_ns >> + >> + if total_count > 0: >> + print("{}{:20.20} {:6} {:10} {:16,}".format( >> + indent, "TOTAL( - poll):", "", total_count, total_ns)) >> + >> + # >> + # THREAD RUN STATISTICS >> + # >> + print("\n{:10} 
{:16} {}\n{}{:10} {:>16} {:>16} {:>16}".format( >> + "", "", "[THREAD RUN STATISTICS]", indent, >> + "SCHED_CNT", "TOTAL ns", "MIN ns", "MAX ns")) >> + >> + for k, v in filter(lambda t: t[0].tid == thread, >> + bpf["run_data"].items()): >> + >> + print("{}{:10} {:16,} {:16,} {:16,}".format( >> + indent, v.count, v.total_ns, v.min_ns, v.max_ns)) >> + >> + # >> + # THREAD READY STATISTICS >> + # >> + print("\n{:10} {:16} {}\n{}{:10} {:>16} {:>16}".format( >> + "", "", "[THREAD READY STATISTICS]", indent, >> + "SCHED_CNT", "TOTAL ns", "MAX ns")) >> + >> + for k, v in filter(lambda t: t[0].tid == thread, >> + bpf["ready_data"].items()): >> + >> + print("{}{:10} {:16,} {:16,}".format( >> + indent, v.count, v.total_ns, v.worst_ns)) >> + >> + # >> + # HARD IRQ STATISTICS >> + # >> + total_ns = 0 >> + total_count = 0 >> + header_printed = False >> + for k, v in sorted(filter(lambda t: t[0].tid == thread, >> + bpf["hardirq_data"].items()), >> + key=lambda kv: -kv[1].total_ns): >> + >> + if not header_printed: >> + print("\n{:10} {:16} {}\n{}{:20} {:>10} {:>16} {:>16}". >> + format("", "", "[HARD IRQ STATISTICS]", indent, >> + "NAME", "COUNT", "TOTAL ns", "MAX ns")) >> + header_printed = True >> + >> + print("{}{:20.20} {:10} {:16,} {:16,}".format( >> + indent, k.irq_name.decode('utf-8'), >> + v.count, v.total_ns, v.worst_ns)) >> + >> + total_count += v.count >> + total_ns += v.total_ns >> + >> + if total_count > 0: >> + print("{}{:20.20} {:10} {:16,}".format( >> + indent, "TOTAL:", total_count, total_ns)) >> + >> + # >> + # SOFT IRQ STATISTICS >> + # >> + total_ns = 0 >> + total_count = 0 >> + header_printed = False >> + for k, v in sorted(filter(lambda t: t[0].tid == thread, >> + bpf["softirq_data"].items()), >> + key=lambda kv: -kv[1].total_ns): >> + >> + if not header_printed: >> + print("\n{:10} {:16} {}\n" >> + "{}{:20} {:>7} {:>10} {:>16} {:>16}". 
>> + format("", "", "[SOFT IRQ STATISTICS]", indent, >> + "NAME", "VECT_NR", "COUNT", "TOTAL ns", "MAX ns")) >> + header_printed = True >> + >> + print("{}{:20.20} {:>7} {:10} {:16,} {:16,}".format( >> + indent, get_vec_nr_name(k.vec_nr), k.vec_nr, >> + v.count, v.total_ns, v.worst_ns)) >> + >> + total_count += v.count >> + total_ns += v.total_ns >> + >> + if total_count > 0: >> + print("{}{:20.20} {:7} {:10} {:16,}".format( >> + indent, "TOTAL:", "", total_count, total_ns)) >> + >> + # >> + # Print events >> + # >> + lost_stack_traces = 0 >> + if syscall_events: >> + stack_traces = bpf.get_table("stack_traces") >> + >> + print("\n\n# SYSCALL EVENTS:" >> + "\n{}{:>19} {:>19} {:>10} {:16} {:>10} {}".format( >> + 2 * " ", "ENTRY (ns)", "EXIT (ns)", "TID", "COMM", >> + "DELTA (us)", "SYSCALL")) >> + print("{}{:19} {:19} {:10} {:16} {:10} {}".format( >> + 2 * " ", "-" * 19, "-" * 19, "-" * 10, "-" * 16, >> + "-" * 10, "-" * 16)) >> + for event in syscall_events: >> + print("{}{:19} {:19} {:10} {:16} {:10,} {}".format( >> + " " * 2, >> + event["ts_entry"], event["ts_exit"], event["tid"], >> + get_thread_name(options.pid, event["tid"]), >> + int((event["ts_exit"] - event["ts_entry"]) / 1000), >> + syscall_name(event["syscall"]).decode('utf-8'))) >> + # >> + # Not sure where to put this, but I'll add some info on stack >> + # traces here... Userspace stack traces are very limited due to >> + # the fact that bcc does not support dwarf backtraces. As OVS >> + # gets compiled without frame pointers we will not see much. >> + # If however, OVS does get built with frame pointers, we should not >> + # use the BPF_STACK_TRACE_BUILDID as it does not seem to handle >> + # the debug symbols correctly. Also, note that for kernel >> + # traces you should not use BPF_STACK_TRACE_BUILDID, so two >> + # buffers are needed. 
>> + # >> + # Some info on manual dwarf walk support: >> + # https://github.com/iovisor/bcc/issues/3515 >> + # https://github.com/iovisor/bcc/pull/4463 >> + # >> + if options.stack_trace_size == 0: >> + continue >> + >> + if event['kernel_stack_id'] < 0 or event['user_stack_id'] < 0: >> + lost_stack_traces += 1 >> + >> + kernel_stack = stack_traces.walk(event['kernel_stack_id']) \ >> + if event['kernel_stack_id'] >= 0 else [] >> + user_stack = stack_traces.walk(event['user_stack_id']) \ >> + if event['user_stack_id'] >= 0 else [] >> + >> + for addr in kernel_stack: >> + print("{}{}".format( >> + " " * 10, >> + bpf.ksym(addr, show_module=True, >> + show_offset=True).decode('utf-8', 'replace'))) >> + >> + for addr in user_stack: >> + addr_str = bpf.sym(addr, options.pid, show_module=True, >> + show_offset=True).decode('utf-8', 'replace') >> + >> + if addr_str == "[unknown]": >> + addr_str += " 0x{:x}".format(addr) >> + >> + print("{}{}".format(" " * 10, addr_str)) >> + >> + # >> + # Print any footer messages. 
>> + # >> + if lost_stack_traces > 0: >> + print("\n# WARNING: We were not able to display {} stack traces!\n" >> + "# Consider increasing the stack trace size using\n" >> + "# the '--stack-trace-size' option.\n" >> + "# Note that this can also happen due to a stack id\n" >> + "# collision.".format(lost_stack_traces)) >> + >> + >> +# >> +# main() >> +# >> +def main(): >> + # >> + # Don't like these globals, but ctx passing does not seem to work with the >> + # existing open_ring_buffer() API :( >> + # >> + global bpf >> + global options >> + global syscall_events >> + global start_trigger_ts >> + global stop_trigger_ts >> + >> + start_trigger_ts = 0 >> + stop_trigger_ts = 0 >> + >> + # >> + # Argument parsing >> + # >> + parser = argparse.ArgumentParser() >> + >> + parser.add_argument("-D", "--debug", >> + help="Enable eBPF debugging", >> + type=int, const=0x3f, default=0, nargs='?') >> + parser.add_argument("-p", "--pid", metavar="VSWITCHD_PID", >> + help="ovs-vswitchd's PID", >> + type=unsigned_int, default=None) >> + parser.add_argument("-s", "--syscall-events", metavar="DURATION_NS", >> + help="Record syscall events that take longer than " >> + "DURATION_NS. 
Omit the duration value to record all " >> + "syscall events", >> + type=unsigned_int, const=0, default=None, nargs='?') >> + parser.add_argument("--buffer-page-count", >> + help="Number of BPF ring buffer pages, default 1024", >> + type=unsigned_int, default=1024, metavar="NUMBER") >> + parser.add_argument("--sample-count", >> + help="Number of sample runs, default 1", >> + type=unsigned_nonzero_int, default=1, metavar="RUNS") >> + parser.add_argument("--sample-interval", >> + help="Delay between sample runs, default 0", >> + type=float, default=0, metavar="SECONDS") >> + parser.add_argument("--sample-time", >> + help="Sample time, default 0.5 seconds", >> + type=float, default=0.5, metavar="SECONDS") >> + parser.add_argument("--skip-syscall-poll-events", >> + help="Skip poll() syscalls with --syscall-events", >> + action="store_true") >> + parser.add_argument("--stack-trace-size", >> + help="Number of unique stack traces that can be " >> + "recorded, default 4096. 0 to disable", >> + type=unsigned_int, default=4096) >> + parser.add_argument("--start-trigger", metavar="TRIGGER", >> + help="Start trigger, see documentation for details", >> + type=str, default=None) >> + parser.add_argument("--stop-trigger", metavar="TRIGGER", >> + help="Stop trigger, see documentation for details", >> + type=str, default=None) >> + parser.add_argument("--trigger-delta", metavar="DURATION_NS", >> + help="Only report event when the trigger duration > " >> + "DURATION_NS, default 0 (all events)", >> + type=unsigned_int, const=0, default=0, nargs='?') >> + >> + options = parser.parse_args() >> + >> + # >> + # Find the PID of the ovs-vswitchd daemon if not specified. 
>> + # >> + if not options.pid: >> + for proc in psutil.process_iter(): >> + if 'ovs-vswitchd' in proc.name(): >> + if options.pid: >> + print("ERROR: Multiple ovs-vswitchd daemons running, " >> + "use the -p option!") >> + sys.exit(os.EX_NOINPUT) >> + >> + options.pid = proc.pid >> + >> + # >> + # Error checking on input parameters. >> + # >> + if not options.pid: >> + print("ERROR: Failed to find ovs-vswitchd's PID!") >> + sys.exit(os.EX_UNAVAILABLE) >> + >> + options.buffer_page_count = next_power_of_two(options.buffer_page_count) >> + >> + # >> + # Make sure we are running as root, or else we can not attach the probes. >> + # >> + if os.geteuid() != 0: >> + print("ERROR: We need to run as root to attach probes!") >> + sys.exit(os.EX_NOPERM) >> + >> + # >> + # Set up any of the start/stop triggers >> + # >> + if options.start_trigger is not None: >> + try: >> + start_trigger = Probe(options.start_trigger, pid=options.pid) >> + except ValueError as e: >> + print(f"ERROR: Invalid start trigger {str(e)}") >> + sys.exit(os.EX_CONFIG) >> + else: >> + start_trigger = None >> + >> + if options.stop_trigger is not None: >> + try: >> + stop_trigger = Probe(options.stop_trigger, pid=options.pid) >> + except ValueError as e: >> + print(f"ERROR: Invalid stop trigger {str(e)}") >> + sys.exit(os.EX_CONFIG) >> + else: >> + stop_trigger = None >> + >> + # >> + # Attach probes to the running process. 
>> + # >> + source = EBPF_SOURCE.replace("<EVENT_ENUM>", "\n".join( >> + [" EVENT_{} = {},".format( >> + event.name, event.value) for event in Event])) >> + source = source.replace("<BUFFER_PAGE_CNT>", >> + str(options.buffer_page_count)) >> + source = source.replace("<MONITOR_PID>", str(options.pid)) >> + >> + if BPF.kernel_struct_has_field(b'task_struct', b'state') == 1: >> + source = source.replace('<STATE_FIELD>', 'state') >> + else: >> + source = source.replace('<STATE_FIELD>', '__state') >> + >> + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0] >> + if options.syscall_events is None: >> + syscall_trace_events = "false" >> + elif options.syscall_events == 0: >> + if not options.skip_syscall_poll_events: >> + syscall_trace_events = "true" >> + else: >> + syscall_trace_events = f"args->id != {poll_id}" >> + else: >> + syscall_trace_events = "delta > {}".format(options.syscall_events) >> + if options.skip_syscall_poll_events: >> + syscall_trace_events += f" && args->id != {poll_id}" >> + >> + source = source.replace("<SYSCALL_TRACE_EVENTS>", >> + syscall_trace_events) >> + >> + source = source.replace("<STACK_TRACE_SIZE>", >> + str(options.stack_trace_size)) >> + >> + source = source.replace("<STACK_TRACE_ENABLED>", "true" >> + if options.stack_trace_size > 0 else "false") >> + >> + # >> + # Handle start/stop probes >> + # >> + if start_trigger: >> + source = source.replace("<START_TRIGGER>", >> + start_trigger.get_c_code( >> + "start_trigger_probe", >> + "return start_trigger();")) >> + else: >> + source = source.replace("<START_TRIGGER>", "") >> + >> + if stop_trigger: >> + source = source.replace("<STOP_TRIGGER>", >> + stop_trigger.get_c_code( >> + "stop_trigger_probe", >> + "return stop_trigger();")) >> + else: >> + source = source.replace("<STOP_TRIGGER>", "") >> + >> + # >> + # Set up usdt or other probes that need handling through the BPF class. 
>> + # >> + usdt = USDT(pid=int(options.pid)) >> + try: >> + if start_trigger and start_trigger.probe_type == 'usdt': >> + usdt.enable_probe(probe=start_trigger.probe_name(), >> + fn_name="start_trigger_probe") >> + if stop_trigger and stop_trigger.probe_type == 'usdt': >> + usdt.enable_probe(probe=stop_trigger.probe_name(), >> + fn_name="stop_trigger_probe") >> + >> + except USDTException as e: >> + print("ERROR: {}".format( >> + (re.sub('^', ' ' * 7, str(e), flags=re.MULTILINE)).strip(). >> + replace("--with-dtrace or --enable-dtrace", >> + "--enable-usdt-probes"))) >> + sys.exit(os.EX_OSERR) >> + >> + bpf = BPF(text=source, usdt_contexts=[usdt], debug=options.debug) >> + >> + if start_trigger: >> + try: >> + if start_trigger.probe_type == "uprobe": >> + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe", >> + sym=start_trigger.probe_name(), >> + fn_name="start_trigger_probe", >> + pid=options.pid) >> + >> + if start_trigger.probe_type == "uretprobe": >> + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe", >> + sym=start_trigger.probe_name(), >> + fn_name="start_trigger_probe", >> + pid=options.pid) >> + except Exception as e: >> + print("ERROR: Failed attaching uprobe start trigger " >> + f"'{start_trigger.probe_name()}';\n {str(e)}") >> + sys.exit(os.EX_OSERR) >> + >> + if stop_trigger: >> + try: >> + if stop_trigger.probe_type == "uprobe": >> + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe", >> + sym=stop_trigger.probe_name(), >> + fn_name="stop_trigger_probe", >> + pid=options.pid) >> + >> + if stop_trigger.probe_type == "uretprobe": >> + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe", >> + sym=stop_trigger.probe_name(), >> + fn_name="stop_trigger_probe", >> + pid=options.pid) >> + except Exception as e: >> + print("ERROR: Failed attaching uprobe stop trigger" >> + f"'{stop_trigger.probe_name()}';\n {str(e)}") >> + sys.exit(os.EX_OSERR) >> + >> + # >> + # If no triggers are configured use the delay configuration >> + # >> + 
bpf['events'].open_ring_buffer(process_event) >> + >> + sample_count = 0 >> + while sample_count < options.sample_count: >> + sample_count += 1 >> + syscall_events = [] >> + >> + if not options.start_trigger: >> + print_timestamp("# Start sampling") >> + start_capture() >> + stop_time = -1 if options.stop_trigger else \ >> + time_ns() + options.sample_time * 1000000000 >> + else: >> + # For start triggers the stop time depends on the start trigger >> + # time, or depends on the stop trigger if configured. >> + stop_time = -1 if options.stop_trigger else 0 >> + >> + while True: >> + keyboard_interrupt = False >> + try: >> + last_start_ts = start_trigger_ts >> + last_stop_ts = stop_trigger_ts >> + >> + if stop_time > 0: >> + delay = int((stop_time - time_ns()) / 1000000) >> + if delay <= 0: >> + break >> + else: >> + delay = -1 >> + >> + bpf.ring_buffer_poll(timeout=delay) >> + >> + if stop_time <= 0 and last_start_ts != start_trigger_ts: >> + print_timestamp( >> + "# Start sampling (trigger@{})".format( >> + start_trigger_ts)) >> + >> + if not options.stop_trigger: >> + stop_time = time_ns() + \ >> + options.sample_time * 1000000000 >> + >> + if last_stop_ts != stop_trigger_ts: >> + break >> + >> + except KeyboardInterrupt: >> + keyboard_interrupt = True >> + break >> + >> + if options.stop_trigger and not capture_running(): >> + print_timestamp("# Stop sampling (trigger@{})".format( >> + stop_trigger_ts)) >> + else: >> + print_timestamp("# Stop sampling") >> + >> + if stop_trigger_ts != 0 and start_trigger_ts != 0: >> + trigger_delta = stop_trigger_ts - start_trigger_ts >> + else: >> + trigger_delta = None >> + >> + if not trigger_delta or trigger_delta >= options.trigger_delta: >> + stop_capture(force=True) # Prevent a new trigger to start. 
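The timeout handed to ring_buffer_poll() in the loop above can be modeled like this; poll_timeout_ms is a hypothetical helper name for illustration, not code from the patch:

```python
import time

# Hypothetical helper mirroring the sampling loop above: -1 blocks until
# a trigger event arrives; otherwise poll only for the remainder of the
# sample window (a value <= 0 means the window has already ended).
def poll_timeout_ms(stop_time_ns, now_ns=None):
    if stop_time_ns <= 0:
        return -1  # trigger mode: wait indefinitely for an event
    now_ns = time.time_ns() if now_ns is None else now_ns
    return int((stop_time_ns - now_ns) / 1000000)

# A 0.5 s sample window leaves a 500 ms poll timeout.
assert poll_timeout_ms(500_000_000, now_ns=0) == 500
assert poll_timeout_ms(-1) == -1
assert poll_timeout_ms(400, now_ns=500_000_000) < 0  # window already over
```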
>> + process_results(syscall_events=syscall_events, >> + trigger_delta=trigger_delta) >> + elif trigger_delta: >> + sample_count -= 1 >> + print_timestamp("# Sample dump skipped, delta {:,} ns".format( >> + trigger_delta)) >> + >> + reset_capture() >> + stop_capture() >> + >> + if keyboard_interrupt: >> + break >> + >> + if options.sample_interval > 0: >> + time.sleep(options.sample_interval) >> + >> + # >> + # Report lost events. >> + # >> + dropcnt = bpf.get_table("dropcnt") >> + for k in dropcnt.keys(): >> + count = dropcnt.sum(k).value >> + if k.value == 0 and count > 0: >> + print("\n# WARNING: Not all events were captured, {} were " >> + "dropped!\n# Increase the BPF ring buffer size " >> + "with the --buffer-page-count option.".format(count)) >> + >> + if (options.sample_count > 1): > > I think it is more pythonic to omit the parens here. ACK, too much C and Python code in one file. >> + trigger_miss = bpf.get_table("trigger_miss") >> + for k in trigger_miss.keys(): >> + count = trigger_miss.sum(k).value >> + if k.value == 0 and count > 0: >> + print("\n# WARNING: Not all start triggers were successful. " >> + "{} were missed due to\n# slow userspace " >> + "processing!".format(count)) >> + >> + >> +# >> +# Start main() as the default entry point... >> +# >> +if __name__ == '__main__': >> + main() >> diff --git a/utilities/usdt-scripts/kernel_delay.rst b/utilities/usdt-scripts/kernel_delay.rst >> new file mode 100644 >> index 000000000..0ebd30afb >> --- /dev/null >> +++ b/utilities/usdt-scripts/kernel_delay.rst >> @@ -0,0 +1,596 @@ >> +Troubleshooting Open vSwitch: Is the kernel to blame? >> +===================================================== >> +Often, when troubleshooting Open vSwitch (OVS) in the field, you might be left >> +wondering if the issue is really OVS-related, or if it's a problem with the >> +kernel being overloaded. 
Messages in the log like >> +``Unreasonably long XXXXms poll interval`` might suggest it's OVS, but from >> +experience, these are mostly related to an overloaded Linux kernel. >> +The kernel_delay.py tool can help you quickly identify if the focus of your >> +investigation should be OVS or the Linux kernel. >> + >> + >> +Introduction >> +------------ >> +``kernel_delay.py`` consists of a Python script that uses the BCC [#BCC]_ >> +framework to install eBPF probes. The data the eBPF probes collect will be >> +analyzed and presented to the user by the Python script. Some of the presented >> +data can also be captured by the individual scripts included in the BCC [#BCC]_ >> +framework. >> + >> +kernel_delay.py has two modes of operation: >> + >> +- In **time mode**, the tool runs for a specific time and collects the >> + information. >> +- In **trigger mode**, event collection can be started and/or stopped based on >> + a specific eBPF probe. Currently, the following probes are supported: >> + - USDT probes >> + - Kernel tracepoints >> + - kprobe >> + - kretprobe >> + - uprobe >> + - uretprobe >> + >> + >> +In addition, the ``--sample-count`` option exists to specify how many >> +iterations you would like to do. When using triggers, you can also ignore >> +samples if they are shorter than a number of nanoseconds with the >> +``--trigger-delta`` option. The latter might be useful when debugging Linux >> +syscalls which take a long time to complete. More on this later. Finally, you >> +can configure the delay between two sample runs with the ``--sample-interval`` >> +option. >> + >> +Before getting into more details, you can run the tool without any options >> +to see what the output looks like. Notice that it will try to automatically >> +get the process ID of the running ``ovs-vswitchd``. You can override this >> +with the ``--pid`` option. >> + >> +.. 
code-block:: console >> + >> + $ sudo ./kernel_delay.py >> + # Start sampling @2023-06-08T12:17:22.725127 (10:17:22 UTC) >> + # Stop sampling @2023-06-08T12:17:23.224781 (10:17:23 UTC) >> + # Sample dump @2023-06-08T12:17:23.224855 (10:17:23 UTC) >> + TID THREAD <RESOURCE SPECIFIC> >> + ---------- ---------------- ---------------------------------------------------------------------------- >> + 27090 ovs-vswitchd [SYSCALL STATISTICS] >> + <EDIT: REMOVED DATA FOR ovs-vswitchd THREAD> >> + >> + 31741 revalidator122 [SYSCALL STATISTICS] >> + NAME NUMBER COUNT TOTAL ns MAX ns >> + poll 7 5 184,193,176 184,191,520 >> + recvmsg 47 494 125,208,756 310,331 >> + futex 202 8 18,768,758 4,023,039 >> + sendto 44 10 375,861 266,867 >> + sendmsg 46 4 43,294 11,213 >> + write 1 1 5,949 5,949 >> + getrusage 98 1 1,424 1,424 >> + read 0 1 1,292 1,292 >> + TOTAL( - poll): 519 144,405,334 >> + >> + [THREAD RUN STATISTICS] >> + SCHED_CNT TOTAL ns MIN ns MAX ns >> + 6 136,764,071 1,480 115,146,424 >> + >> + [THREAD READY STATISTICS] >> + SCHED_CNT TOTAL ns MAX ns >> + 7 11,334 6,636 >> + >> + [HARD IRQ STATISTICS] >> + NAME COUNT TOTAL ns MAX ns >> + eno8303-rx-1 1 3,586 3,586 >> + TOTAL: 1 3,586 >> + >> + [SOFT IRQ STATISTICS] >> + NAME VECT_NR COUNT TOTAL ns MAX ns >> + net_rx 3 1 17,699 17,699 >> + sched 7 6 13,820 3,226 >> + rcu 9 16 13,586 1,554 >> + timer 1 3 10,259 3,815 >> + TOTAL: 26 55,364 >> + >> + >> +By default, the tool will run for half a second in `time mode`. To extend this >> +you can use the ``--sample-time`` option. >> + >> + >> +What will it report >> +------------------- >> +The above sample output separates the captured data on a per-thread basis. >> +For this, it displays the thread's id (``TID``) and name (``THREAD``), >> +followed by resource-specific data. 
These are: >> + >> +- ``SYSCALL STATISTICS`` >> +- ``THREAD RUN STATISTICS`` >> +- ``THREAD READY STATISTICS`` >> +- ``HARD IRQ STATISTICS`` >> +- ``SOFT IRQ STATISTICS`` >> + >> +The following sections describe in detail what these statistics report. >> + >> + >> +``SYSCALL STATISTICS`` >> +~~~~~~~~~~~~~~~~~~~~~~ >> +``SYSCALL STATISTICS`` tell you which Linux system calls got executed during >> +the measurement interval. This includes the number of times the syscall was >> +called (``COUNT``), the total time spent in the system calls (``TOTAL ns``), >> +and the worst-case duration of a single call (``MAX ns``). >> + >> +It also shows the total of all system calls, but it excludes the poll system >> +call, as the purpose of this call is to wait for activity on a set of sockets, >> +and usually, the thread gets swapped out. >> + >> +Note that it only counts calls that started and stopped during the >> +measurement interval! >> + >> + >> +``THREAD RUN STATISTICS`` >> +~~~~~~~~~~~~~~~~~~~~~~~~~ >> +``THREAD RUN STATISTICS`` tell you how long the thread was running on a CPU >> +during the measurement interval. >> + >> +Note that these statistics only count events where the thread started and >> +stopped running on a CPU during the measurement interval. For example, if >> +this was a PMD thread, you should see zero ``SCHED_CNT`` and ``TOTAL_ns``. >> +If not, there might be a misconfiguration. >> + >> + >> +``THREAD READY STATISTICS`` >> +~~~~~~~~~~~~~~~~~~~~~~~~~~~ >> +``THREAD READY STATISTICS`` tell you the time between the thread being ready >> +to run and it actually running on the CPU. >> + >> +Note that these statistics only count events where the thread was getting >> +ready to run and started running during the measurement interval. >> + >> + >> +``HARD IRQ STATISTICS`` >> +~~~~~~~~~~~~~~~~~~~~~~~ >> +``HARD IRQ STATISTICS`` tell you how much time was spent servicing hard >> +interrupts during the thread's run time. 
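Conceptually, each SYSCALL STATISTICS row accumulates a per-syscall COUNT, TOTAL ns, and MAX ns like this; a pure-Python model of the aggregation, not the in-kernel eBPF code:

```python
from collections import defaultdict

# Pure-Python model of the per-syscall aggregation described above: for
# each completed syscall we bump COUNT, add its duration to TOTAL ns,
# and keep the worst-case duration as MAX ns.
def aggregate(samples):
    """samples: iterable of (syscall_name, duration_ns) pairs."""
    stats = defaultdict(lambda: {"count": 0, "total_ns": 0, "max_ns": 0})
    for name, duration_ns in samples:
        entry = stats[name]
        entry["count"] += 1
        entry["total_ns"] += duration_ns
        entry["max_ns"] = max(entry["max_ns"], duration_ns)
    return stats

stats = aggregate([("recvmsg", 1_000), ("recvmsg", 250), ("futex", 4_000)])
assert stats["recvmsg"] == {"count": 2, "total_ns": 1_250, "max_ns": 1_000}
assert stats["futex"]["max_ns"] == 4_000
```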
>> + >> +It shows the interrupt name (``NAME``), the number of interrupts (``COUNT``), >> +the total time spent in the interrupt handler (``TOTAL ns``), and the >> +worst-case duration (``MAX ns``). >> + >> + >> +``SOFT IRQ STATISTICS`` >> +~~~~~~~~~~~~~~~~~~~~~~~ >> +``SOFT IRQ STATISTICS`` tell you how much time was spent servicing soft >> +interrupts during the thread's run time. >> + >> +It shows the interrupt name (``NAME``), vector number (``VECT_NR``), the >> +number of interrupts (``COUNT``), the total time spent in the interrupt >> +handler (``TOTAL ns``), and the worst-case duration (``MAX ns``). >> + >> + >> +The ``--syscall-events`` option >> +------------------------------- >> +In addition to reporting global syscall statistics in ``SYSCALL_STATISTICS``, >> +the tool can also report each individual syscall. This can be a useful >> +second step if the ``SYSCALL_STATISTICS`` show high latency numbers. >> + >> +All you need to do is add the ``--syscall-events`` option, with or without >> +the additional ``DURATION_NS`` parameter. The ``DURATION_NS`` parameter >> +allows you to exclude events that take less than the supplied time. >> + >> +The ``--skip-syscall-poll-events`` option allows you to exclude poll >> +syscalls from the report. >> + >> +Below is an example run; note that the resource-specific data is removed >> +to highlight the syscall events: >> + >> +.. code-block:: console >> + >> + $ sudo ./kernel_delay.py --syscall-events 50000 --skip-syscall-poll-events >> + # Start sampling @2023-06-13T17:10:46.460874 (15:10:46 UTC) >> + # Stop sampling @2023-06-13T17:10:46.960727 (15:10:46 UTC) >> + # Sample dump @2023-06-13T17:10:46.961033 (15:10:46 UTC) >> + TID THREAD <RESOURCE SPECIFIC> >> + ---------- ---------------- ---------------------------------------------------------------------------- >> + 3359686 ipf_clean2 [SYSCALL STATISTICS] >> + ... >> + 3359635 ovs-vswitchd [SYSCALL STATISTICS] >> + ... 
>> + 3359697 revalidator12 [SYSCALL STATISTICS] >> + ... >> + 3359698 revalidator13 [SYSCALL STATISTICS] >> + ... >> + 3359699 revalidator14 [SYSCALL STATISTICS] >> + ... >> + 3359700 revalidator15 [SYSCALL STATISTICS] >> + ... >> + >> + # SYSCALL EVENTS: >> + ENTRY (ns) EXIT (ns) TID COMM DELTA (us) SYSCALL >> + ------------------- ------------------- ---------- ---------------- ---------- ---------------- >> + 2161821694935486 2161821695031201 3359699 revalidator14 95 futex >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode+0x9 [kernel] >> + do_syscall_64+0x68 [kernel] >> + entry_SYSCALL_64_after_hwframe+0x72 [kernel] >> + __GI___lll_lock_wait+0x30 [libc.so.6] >> + ovs_mutex_lock_at+0x18 [ovs-vswitchd] >> + [unknown] 0x696c003936313a63 >> + 2161821695276882 2161821695333687 3359698 revalidator13 56 futex >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode+0x9 [kernel] >> + do_syscall_64+0x68 [kernel] >> + entry_SYSCALL_64_after_hwframe+0x72 [kernel] >> + __GI___lll_lock_wait+0x30 [libc.so.6] >> + ovs_mutex_lock_at+0x18 [ovs-vswitchd] >> + [unknown] 0x696c003134313a63 >> + 2161821695275820 2161821695405733 3359700 revalidator15 129 futex >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode+0x9 [kernel] >> + do_syscall_64+0x68 [kernel] >> + entry_SYSCALL_64_after_hwframe+0x72 [kernel] >> + __GI___lll_lock_wait+0x30 [libc.so.6] >> + ovs_mutex_lock_at+0x18 [ovs-vswitchd] >> + [unknown] 0x696c003936313a63 >> + 2161821695964969 2161821696052021 3359635 ovs-vswitchd 87 accept >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode_prepare+0x161 [kernel] >> + syscall_exit_to_user_mode+0x9 [kernel] >> + do_syscall_64+0x68 [kernel] >> + entry_SYSCALL_64_after_hwframe+0x72 [kernel] >> + 
__GI_accept+0x4d [libc.so.6] >> + pfd_accept+0x3a [ovs-vswitchd] >> + [unknown] 0x7fff19f2bd00 >> + [unknown] 0xe4b8001f0f >> + >> +As you can see above, the output also shows the stack backtrace. You can >> +disable this using the ``--stack-trace-size 0`` option. >> + >> +Unfortunately, the backtrace does not show a lot of useful information, >> +as the BCC [#BCC]_ toolkit does not support dwarf decoding. To further >> +analyze system call backtraces, you could use perf. The following perf >> +script can do this for you (refer to the embedded instructions): >> + >> +https://github.com/chaudron/perf_scripts/blob/master/analyze_perf_pmd_syscall.py >> + >> + >> +Using triggers >> +-------------- >> +The tool supports start and/or stop triggers, which allow you to capture >> +statistics triggered by a specific event. The following combinations of >> +start and stop triggers can be used. >> + >> +If you only use ``--start-trigger``, the inspection starts when the trigger >> +happens and runs until the ``--sample-time`` number of seconds has passed. >> +The example below shows all the supported options in this scenario. >> + >> +.. code-block:: console >> + >> + $ sudo ./kernel_delay.py --start-trigger up:bridge_run --sample-time 4 \ >> + --sample-count 4 --sample-interval 1 >> + >> + >> +If you only use ``--stop-trigger``, the inspection starts immediately and >> +stops when the trigger happens. The example below shows all the supported >> +options in this scenario. >> + >> +.. code-block:: console >> + >> + $ sudo ./kernel_delay.py --stop-trigger upr:bridge_run \ >> + --sample-count 4 --sample-interval 1 >> + >> + >> +If you use both ``--start-trigger`` and ``--stop-trigger`` triggers, the >> +statistics are captured between the first occurrences of these two events. >> +The example below shows all the supported options in this scenario. >> + >> +.. 
code-block:: console >> + >> + $ sudo ./kernel_delay.py --start-trigger up:bridge_run \ >> + --stop-trigger upr:bridge_run \ >> + --sample-count 4 --sample-interval 1 \ >> + --trigger-delta 50000 >> + >> +What triggers are supported? Note that what ``kernel_delay.py`` calls triggers, >> +BCC [#BCC]_, calls events; these are eBPF tracepoints you can attach to. >> +For more details on the supported tracepoints, check out the BCC >> +documentation [#BCC_EVENT]_. >> + >> +The list below shows the supported triggers and their argument format: >> + >> +**USDT probes:** >> + [u|usdt]:{provider}:{probe} >> +**Kernel tracepoint:** >> + [t:trace]:{system}:{event} >> +**kprobe:** >> + [k:kprobe]:{kernel_function} >> +**kretprobe:** >> + [kr:kretprobe]:{kernel_function} >> +**uprobe:** >> + [up:uprobe]:{function} >> +**uretprobe:** >> + [upr:uretprobe]:{function} >> + >> +Here are a couple of trigger examples, more use-case-specific examples can be >> +found in the *Examples* section. >> + >> +.. code-block:: console >> + >> + --start|stop-trigger u:udpif_revalidator:start_dump >> + --start|stop-trigger t:openvswitch:ovs_dp_upcall >> + --start|stop-trigger k:ovs_dp_process_packet >> + --start|stop-trigger kr:ovs_dp_process_packet >> + --start|stop-trigger up:bridge_run >> + --start|stop-trigger upr:bridge_run >> + >> + >> +Examples >> +-------- >> +This section will give some examples of how to use this tool in real-world >> +scenarios. Let's start with the issue where Open vSwitch reports >> +``Unreasonably long XXXXms poll interval`` on your revalidator threads. Note >> +that there is a blog available explaining how the revalidator process works >> +in OVS [#REVAL_BLOG]_. >> + >> +First, let me explain this log message. It gets logged if the time delta >> +between two ``poll_block()`` calls is more than 1 second. In other words, >> +the process was spending a lot of time processing stuff that was made >> +available by the return of the ``poll_block()`` function. 
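To make the log message concrete, the condition can be modeled as below; this is an illustrative sketch with a hypothetical helper name, not the actual OVS implementation:

```python
# Illustrative model of the "Unreasonably long XXXXms poll interval"
# condition: report when more than one second elapses between two
# consecutive poll_block() calls, i.e. the thread spent over a second
# processing the work made available by the previous poll_block() return.
def long_poll_interval_ms(prev_poll_ns, this_poll_ns, threshold_ms=1000):
    delta_ms = (this_poll_ns - prev_poll_ns) // 1_000_000
    return delta_ms if delta_ms > threshold_ms else None

# A 1,187,986,454 ns gap (the trigger delta seen in the sample run below)
# would be reported as a 1187 ms poll interval; half a second would not.
assert long_poll_interval_ms(0, 1_187_986_454) == 1187
assert long_poll_interval_ms(0, 500_000_000) is None
```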
>> + >> +Do a run with the tool using the existing USDT revalidator probes as a start >> +and stop trigger (note that the resource-specific data is removed from the >> +non-revalidator threads): >> + >> +.. code-block:: console >> + >> + $ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done >> + # Start sampling (trigger@791777093512008) @2023-06-14T14:52:00.110303 (12:52:00 UTC) >> + # Stop sampling (trigger@791778281498462) @2023-06-14T14:52:01.297975 (12:52:01 UTC) >> + # Triggered sample dump, stop-start delta 1,187,986,454 ns @2023-06-14T14:52:01.298021 (12:52:01 UTC) >> + TID THREAD <RESOURCE SPECIFIC> >> + ---------- ---------------- ---------------------------------------------------------------------------- >> + 1457761 handler24 [SYSCALL STATISTICS] >> + NAME NUMBER COUNT TOTAL ns MAX ns >> + sendmsg 46 6110 123,274,761 41,776 >> + recvmsg 47 136299 99,397,508 49,896 >> + futex 202 51 7,655,832 7,536,776 >> + poll 7 4068 1,202,883 2,907 >> + getrusage 98 2034 586,602 1,398 >> + sendto 44 9 213,682 27,417 >> + TOTAL( - poll): 144503 231,128,385 >> + >> + [THREAD RUN STATISTICS] >> + SCHED_CNT TOTAL ns MIN ns MAX ns >> + >> + [THREAD READY STATISTICS] >> + SCHED_CNT TOTAL ns MAX ns >> + 1 1,438 1,438 >> + >> + [SOFT IRQ STATISTICS] >> + NAME VECT_NR COUNT TOTAL ns MAX ns >> + sched 7 21 59,145 3,769 >> + rcu 9 50 42,917 2,234 >> + TOTAL: 71 102,062 >> + 1457733 ovs-vswitchd [SYSCALL STATISTICS] >> + ... 
>> +  1457792    revalidator55    [SYSCALL STATISTICS]
>> +                              NAME                 NUMBER      COUNT         TOTAL ns           MAX ns
>> +                              futex                   202         73      572,576,329       19,621,600
>> +                              recvmsg                  47        815      296,697,618          405,338
>> +                              sendto                   44          3           78,302           26,837
>> +                              sendmsg                  46          3           38,712           13,250
>> +                              write                     1          1            5,073            5,073
>> +                              TOTAL( - poll):                    895      869,396,034
>> +
>> +                              [THREAD RUN STATISTICS]
>> +                              SCHED_CNT          TOTAL ns           MIN ns           MAX ns
>> +                                      48      394,350,393            1,729      140,455,796
>> +
>> +                              [THREAD READY STATISTICS]
>> +                              SCHED_CNT          TOTAL ns           MAX ns
>> +                                      49           23,650            1,559
>> +
>> +                              [SOFT IRQ STATISTICS]
>> +                              NAME                 VECT_NR      COUNT         TOTAL ns           MAX ns
>> +                              sched                      7         14           26,889            3,041
>> +                              rcu                        9         28           23,024            1,600
>> +                              TOTAL:                               42           49,913
>> +
>> +
>> +From the start of the output above, you can see the trigger took more than a
>> +second (1,187,986,454 ns), which was already known from the output of the
>> +``ovs-appctl upcall/show`` command.
>> +
>> +From *revalidator55*'s ``SYSCALL STATISTICS`` you can see it spent almost
>> +870ms handling syscalls, and that no ``poll()`` calls were executed. The
>> +``THREAD RUN STATISTICS`` here are a bit misleading, as they suggest OVS
>> +only spent 394ms on the CPU. But as mentioned earlier, this time does not
>> +include the time spent on the CPU at the start or stop of an event, which is
>> +exactly the case here, because USDT probes were used.
>> +
>> +From the above data, and maybe some ``top`` output, it can be determined that
>> +the *revalidator55* thread is taking a lot of CPU time, probably because it
>> +has to do a lot of revalidation work by itself. The solution here is to
>> +increase the number of revalidator threads, so more work can be done in
>> +parallel.
>> +
>> +Here is another run of the same command in another scenario:
>> +
>> +.. code-block:: console
>> +
>> +  $ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done
>> +  # Start sampling (trigger@795160501758971) @2023-06-14T15:48:23.518512 (13:48:23 UTC)
>> +  # Stop sampling (trigger@795160764940201) @2023-06-14T15:48:23.781381 (13:48:23 UTC)
>> +  # Triggered sample dump, stop-start delta 263,181,230 ns @2023-06-14T15:48:23.781414 (13:48:23 UTC)
>> +  TID        THREAD           <RESOURCE SPECIFIC>
>> +  ---------- ---------------- ----------------------------------------------------------------------------
>> +  1457733    ovs-vswitchd     [SYSCALL STATISTICS]
>> +  ...
>> +  1457792    revalidator55    [SYSCALL STATISTICS]
>> +                              NAME                 NUMBER      COUNT         TOTAL ns           MAX ns
>> +                              recvmsg                  47        284      193,422,110       46,248,418
>> +                              sendto                   44          2           46,685           23,665
>> +                              sendmsg                  46          2           24,916           12,703
>> +                              write                     1          1            6,534            6,534
>> +                              TOTAL( - poll):                    289      193,500,245
>> +
>> +                              [THREAD RUN STATISTICS]
>> +                              SCHED_CNT          TOTAL ns           MIN ns           MAX ns
>> +                                       2       47,333,558          331,516       47,002,042
>> +
>> +                              [THREAD READY STATISTICS]
>> +                              SCHED_CNT          TOTAL ns           MAX ns
>> +                                       3       87,000,403       45,999,712
>> +
>> +                              [SOFT IRQ STATISTICS]
>> +                              NAME                 VECT_NR      COUNT         TOTAL ns           MAX ns
>> +                              sched                      7          2            9,504            5,109
>> +                              TOTAL:                                2            9,504
>> +
>> +
>> +Here you can see the revalidator run took about 263ms, which does not look
>> +odd. However, the ``THREAD READY STATISTICS`` show that OVS was waiting 87ms
>> +for a CPU to run on. This means the revalidator process could have finished
>> +87ms faster. Looking at the ``MAX ns`` value, a worst-case delay of almost
>> +46ms can be seen, which hints at an overloaded system.
>> +
>> +Here is one final example that uses a ``uprobe`` to get some statistics on a
>> +``bridge_run()`` execution that takes more than 1ms.
>> +
>> +.. code-block:: console
>> +
>> +  $ sudo ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run --trigger-delta 1000000
>> +  # Start sampling (trigger@2245245432101270) @2023-06-14T16:21:10.467919 (14:21:10 UTC)
>> +  # Stop sampling (trigger@2245245432414656) @2023-06-14T16:21:10.468296 (14:21:10 UTC)
>> +  # Sample dump skipped, delta 313,386 ns @2023-06-14T16:21:10.468419 (14:21:10 UTC)
>> +  # Start sampling (trigger@2245245505301745) @2023-06-14T16:21:10.540970 (14:21:10 UTC)
>> +  # Stop sampling (trigger@2245245506911119) @2023-06-14T16:21:10.542499 (14:21:10 UTC)
>> +  # Triggered sample dump, stop-start delta 1,609,374 ns @2023-06-14T16:21:10.542565 (14:21:10 UTC)
>> +  TID        THREAD           <RESOURCE SPECIFIC>
>> +  ---------- ---------------- ----------------------------------------------------------------------------
>> +  3371035    <unknown:3366258/3371035> [SYSCALL STATISTICS]
>> +  ...        <REMOVED 7 MORE unknown THREADS>
>> +  3371102    handler66        [SYSCALL STATISTICS]
>> +  ...        <REMOVED 7 MORE HANDLER THREADS>
>> +  3366258    ovs-vswitchd     [SYSCALL STATISTICS]
>> +                              NAME                 NUMBER      COUNT         TOTAL ns           MAX ns
>> +                              futex                   202         43          403,469          199,312
>> +                              clone3                  435         13          174,394           30,731
>> +                              munmap                   11          8          115,774           21,861
>> +                              poll                      7          5           92,969           38,307
>> +                              unlink                   87          2           49,918           35,741
>> +                              mprotect                 10          8           47,618           13,201
>> +                              accept                   43         10           31,360            6,976
>> +                              mmap                      9          8           30,279            5,776
>> +                              write                     1          6           27,720           11,774
>> +                              rt_sigprocmask           14         28           12,281              970
>> +                              read                      0          6            9,478            2,318
>> +                              recvfrom                 45          3            7,024            4,024
>> +                              sendto                   44          1            4,684            4,684
>> +                              getrusage                98          5            4,594            1,342
>> +                              close                     3          2            2,918            1,627
>> +                              recvmsg                  47          1            2,722            2,722
>> +                              TOTAL( - poll):                    144          924,233
>> +
>> +                              [THREAD RUN STATISTICS]
>> +                              SCHED_CNT          TOTAL ns           MIN ns           MAX ns
>> +                                      13          817,605            5,433          524,376
>> +
>> +                              [THREAD READY STATISTICS]
>> +                              SCHED_CNT          TOTAL ns           MAX ns
>> +                                      14           28,646           11,566
>> +
>> +                              [SOFT IRQ STATISTICS]
>> +                              NAME                 VECT_NR      COUNT         TOTAL ns           MAX ns
>> +                              rcu                        9          1            2,838            2,838
>> +                              TOTAL:                                1            2,838
>> +
>> +  3371110    revalidator74    [SYSCALL STATISTICS]
>> +  ...        <REMOVED 7 MORE NEW revalidator THREADS>
>> +  3366311    urcu3            [SYSCALL STATISTICS]
>> +  ...
>> +
>> +
>> +Some threads and their resource-specific data were removed from the output
>> +above, but based on the ``<unknown:3366258/3371035>`` thread name, you can
>> +determine that some threads no longer exist. In the ``ovs-vswitchd`` thread,
>> +you can see some ``clone3`` syscalls, indicating threads were created. In
>> +this example, it was due to the deletion of a bridge, which resulted in the
>> +recreation of the revalidator and handler threads.
>> +
>> +
>> +Use with OpenShift
>> +------------------
>> +This section describes how you would use the tool on a node in an OpenShift
>> +cluster. It assumes you have console access to the node, either directly or
>> +through a debug container.
>> +
>> +A base Fedora 38 container is used through podman, as this allows the
>> +installation of the additional tools and packages needed.
>> +
>> +First, the container needs to be started:
>> +
>> +.. code-block:: console
>> +
>> +  [core@sno-master ~]$ sudo podman run -it --rm \
>> +       -e PS1='[(DEBUG)\u@\h \W]\$ ' \
>> +       --privileged --network=host --pid=host \
>> +       -v /lib/modules:/lib/modules:ro \
>> +       -v /sys/kernel/debug:/sys/kernel/debug \
>> +       -v /proc:/proc \
>> +       -v /:/mnt/rootdir \
>> +       quay.io/fedora/fedora:38-x86_64
>> +
>> +  [(DEBUG)root@sno-master /]#
>> +
>> +
>> +Next, install the ``kernel_delay.py`` dependencies:
>> +
>> +.. code-block:: console
>> +
>> +  [(DEBUG)root@sno-master /]# dnf install -y bcc-tools perl-interpreter \
>> +       python3-pytz python3-psutil
>> +
>> +
>> +You need to install the devel, debug and source RPMs for your OVS and kernel
>> +version:
>> +
>> +.. code-block:: console
>> +
>> +  [(DEBUG)root@sno-master home]# rpm -i \
>> +      openvswitch2.17-debuginfo-2.17.0-67.el8fdp.x86_64.rpm \
>> +      openvswitch2.17-debugsource-2.17.0-67.el8fdp.x86_64.rpm \
>> +      kernel-devel-4.18.0-372.41.1.el8_6.x86_64.rpm
>> +
>> +
>> +Now the tool can be started. Here the above ``bridge_run()`` example is used:
>> +
>> +.. code-block:: console
>> +
>> +  [(DEBUG)root@sno-master home]# ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run
>> +  # Start sampling (trigger@75279117343513) @2023-06-15T11:44:07.628372 (11:44:07 UTC)
>> +  # Stop sampling (trigger@75279117443980) @2023-06-15T11:44:07.628529 (11:44:07 UTC)
>> +  # Triggered sample dump, stop-start delta 100,467 ns @2023-06-15T11:44:07.628569 (11:44:07 UTC)
>> +  TID        THREAD           <RESOURCE SPECIFIC>
>> +  ---------- ---------------- ----------------------------------------------------------------------------
>> +  1246       ovs-vswitchd     [SYSCALL STATISTICS]
>> +                              NAME                 NUMBER      COUNT         TOTAL ns           MAX ns
>> +                              getdents64              217          2            8,560            8,162
>> +                              openat                  257          1            6,951            6,951
>> +                              accept                   43          4            6,942            3,763
>> +                              recvfrom                 45          1            3,726            3,726
>> +                              recvmsg                  47          2            2,880            2,188
>> +                              stat                      4          2            1,946            1,384
>> +                              close                     3          1            1,393            1,393
>> +                              fstat                     5          1            1,324            1,324
>> +                              TOTAL( - poll):                     14           33,722
>> +
>> +                              [THREAD RUN STATISTICS]
>> +                              SCHED_CNT          TOTAL ns           MIN ns           MAX ns
>> +
>> +                              [THREAD READY STATISTICS]
>> +                              SCHED_CNT          TOTAL ns           MAX ns
>> +
>> +
>> +.. rubric:: Footnotes
>> +
>> +.. [#BCC] https://github.com/iovisor/bcc
>> +.. [#BCC_EVENT] https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments
>> +.. [#REVAL_BLOG] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained
diff --git a/utilities/automake.mk b/utilities/automake.mk index 37d679f82..9a2114df4 100644 --- a/utilities/automake.mk +++ b/utilities/automake.mk @@ -23,6 +23,8 @@ scripts_DATA += utilities/ovs-lib usdt_SCRIPTS += \ utilities/usdt-scripts/bridge_loop.bt \ utilities/usdt-scripts/dpif_nl_exec_monitor.py \ + utilities/usdt-scripts/kernel_delay.py \ + utilities/usdt-scripts/kernel_delay.rst \ utilities/usdt-scripts/reval_monitor.py \ utilities/usdt-scripts/upcall_cost.py \ utilities/usdt-scripts/upcall_monitor.py @@ -70,6 +72,8 @@ EXTRA_DIST += \ utilities/docker/debian/build-kernel-modules.sh \ utilities/usdt-scripts/bridge_loop.bt \ utilities/usdt-scripts/dpif_nl_exec_monitor.py \ + utilities/usdt-scripts/kernel_delay.py \ + utilities/usdt-scripts/kernel_delay.rst \ utilities/usdt-scripts/reval_monitor.py \ utilities/usdt-scripts/upcall_cost.py \ utilities/usdt-scripts/upcall_monitor.py diff --git a/utilities/usdt-scripts/kernel_delay.py b/utilities/usdt-scripts/kernel_delay.py new file mode 100755 index 000000000..636e108be --- /dev/null +++ b/utilities/usdt-scripts/kernel_delay.py @@ -0,0 +1,1420 @@ +#!/usr/bin/env python3 +# +# Copyright (c) 2022,2023 Red Hat, Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at: +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# +# Script information: +# ------------------- +# This script allows a developer to quickly identify if the issue at hand +# might be related to the kernel running out of resources or if it really is +# an Open vSwitch issue. 
+# +# For documentation see the kernel_delay.rst file. +# +# +# Dependencies: +# ------------- +# You need to install the BCC package for your specific platform or build it +# yourself using the following instructions: +# https://raw.githubusercontent.com/iovisor/bcc/master/INSTALL.md +# +# Python needs the following additional packages installed: +# - pytz +# - psutil +# +# You can either install your distribution specific package or use pip: +# pip install pytz psutil +# +import argparse +import datetime +import os +import pytz +import psutil +import re +import sys +import time + +import ctypes as ct + +try: + from bcc import BPF, USDT, USDTException + from bcc.syscall import syscalls, syscall_name +except ModuleNotFoundError: + print("ERROR: Can't find the BPF Compiler Collection (BCC) tools!") + sys.exit(os.EX_OSFILE) + +from enum import IntEnum + + +# +# Actual eBPF source code +# +EBPF_SOURCE = """ +#include <linux/irq.h> +#include <linux/sched.h> + +#define MONITOR_PID <MONITOR_PID> + +enum { +<EVENT_ENUM> +}; + +struct event_t { + u64 ts; + u32 tid; + u32 id; + + int user_stack_id; + int kernel_stack_id; + + u32 syscall; + u64 entry_ts; + +}; + +BPF_RINGBUF_OUTPUT(events, <BUFFER_PAGE_CNT>); +BPF_STACK_TRACE(stack_traces, <STACK_TRACE_SIZE>); +BPF_TABLE("percpu_array", uint32_t, uint64_t, dropcnt, 1); +BPF_TABLE("percpu_array", uint32_t, uint64_t, trigger_miss, 1); + +BPF_ARRAY(capture_on, u64, 1); +static inline bool capture_enabled(u64 pid_tgid) { + int key = 0; + u64 *ret; + + if ((pid_tgid >> 32) != MONITOR_PID) + return false; + + ret = capture_on.lookup(&key); + return ret && *ret == 1; +} + +static inline bool capture_enabled__() { + int key = 0; + u64 *ret; + + ret = capture_on.lookup(&key); + return ret && *ret == 1; +} + +static struct event_t *get_event(uint32_t id) { + struct event_t *event = events.ringbuf_reserve(sizeof(struct event_t)); + + if (!event) { + dropcnt.increment(0); + return NULL; + } + + event->id = id; + event->ts = 
bpf_ktime_get_ns(); + event->tid = bpf_get_current_pid_tgid(); + + return event; +} + +static int start_trigger() { + int key = 0; + u64 *val = capture_on.lookup(&key); + + /* If the value is -1 we can't start as we are still processing the + * results in userspace. */ + if (!val || *val != 0) { + trigger_miss.increment(0); + return 0; + } + + struct event_t *event = get_event(EVENT_START_TRIGGER); + if (event) { + events.ringbuf_submit(event, 0); + *val = 1; + } else { + trigger_miss.increment(0); + } + return 0; +} + +static int stop_trigger() { + int key = 0; + u64 *val = capture_on.lookup(&key); + + if (!val || *val != 1) + return 0; + + struct event_t *event = get_event(EVENT_STOP_TRIGGER); + + if (event) + events.ringbuf_submit(event, 0); + + if (val) + *val = -1; + + return 0; +} + +<START_TRIGGER> +<STOP_TRIGGER> + + +/* + * For the syscall monitor the following probes get installed. + */ +struct syscall_data_t { + u64 count; + u64 total_ns; + u64 worst_ns; +}; + +struct syscall_data_key_t { + u32 pid; + u32 tid; + u32 syscall; +}; + +BPF_HASH(syscall_start, u64, u64); +BPF_HASH(syscall_data, struct syscall_data_key_t, struct syscall_data_t); + +TRACEPOINT_PROBE(raw_syscalls, sys_enter) { + u64 pid_tgid = bpf_get_current_pid_tgid(); + + if (!capture_enabled(pid_tgid)) + return 0; + + u64 t = bpf_ktime_get_ns(); + syscall_start.update(&pid_tgid, &t); + + return 0; +} + +TRACEPOINT_PROBE(raw_syscalls, sys_exit) { + struct syscall_data_t *val, zero = {}; + struct syscall_data_key_t key; + + u64 pid_tgid = bpf_get_current_pid_tgid(); + + if (!capture_enabled(pid_tgid)) + return 0; + + key.pid = pid_tgid >> 32; + key.tid = (u32)pid_tgid; + key.syscall = args->id; + + u64 *start_ns = syscall_start.lookup(&pid_tgid); + + if (!start_ns) + return 0; + + val = syscall_data.lookup_or_try_init(&key, &zero); + if (val) { + u64 delta = bpf_ktime_get_ns() - *start_ns; + val->count++; + val->total_ns += delta; + if (val->worst_ns == 0 || delta > val->worst_ns) + 
val->worst_ns = delta; + + if (<SYSCALL_TRACE_EVENTS>) { + struct event_t *event = get_event(EVENT_SYSCALL); + if (event) { + event->syscall = args->id; + event->entry_ts = *start_ns; + if (<STACK_TRACE_ENABLED>) { + event->user_stack_id = stack_traces.get_stackid( + args, BPF_F_USER_STACK); + event->kernel_stack_id = stack_traces.get_stackid( + args, 0); + } + events.ringbuf_submit(event, 0); + } + } + } + return 0; +} + + +/* + * For measuring the thread run time, we need the following. + */ +struct run_time_data_t { + u64 count; + u64 total_ns; + u64 max_ns; + u64 min_ns; +}; + +struct pid_tid_key_t { + u32 pid; + u32 tid; +}; + +BPF_HASH(run_start, u64, u64); +BPF_HASH(run_data, struct pid_tid_key_t, struct run_time_data_t); + +static inline void thread_start_run(u64 pid_tgid, u64 ktime) +{ + run_start.update(&pid_tgid, &ktime); +} + +static inline void thread_stop_run(u32 pid, u32 tgid, u64 ktime) +{ + u64 pid_tgid = (u64) tgid << 32 | pid; + u64 *start_ns = run_start.lookup(&pid_tgid); + + if (!start_ns || *start_ns == 0) + return; + + struct run_time_data_t *val, zero = {}; + struct pid_tid_key_t key = { .pid = tgid, + .tid = pid }; + + val = run_data.lookup_or_try_init(&key, &zero); + if (val) { + u64 delta = ktime - *start_ns; + val->count++; + val->total_ns += delta; + if (val->max_ns == 0 || delta > val->max_ns) + val->max_ns = delta; + if (val->min_ns == 0 || delta < val->min_ns) + val->min_ns = delta; + } + *start_ns = 0; +} + + +/* + * For measuring the thread-ready delay, we need the following. 
+ */ +struct ready_data_t { + u64 count; + u64 total_ns; + u64 worst_ns; +}; + +BPF_HASH(ready_start, u64, u64); +BPF_HASH(ready_data, struct pid_tid_key_t, struct ready_data_t); + +static inline int sched_wakeup__(u32 pid, u32 tgid) +{ + u64 pid_tgid = (u64) tgid << 32 | pid; + + if (!capture_enabled(pid_tgid)) + return 0; + + u64 t = bpf_ktime_get_ns(); + ready_start.update(&pid_tgid, &t); + return 0; +} + +RAW_TRACEPOINT_PROBE(sched_wakeup) +{ + struct task_struct *t = (struct task_struct *)ctx->args[0]; + return sched_wakeup__(t->pid, t->tgid); +} + +RAW_TRACEPOINT_PROBE(sched_wakeup_new) +{ + struct task_struct *t = (struct task_struct *)ctx->args[0]; + return sched_wakeup__(t->pid, t->tgid); +} + +RAW_TRACEPOINT_PROBE(sched_switch) +{ + struct task_struct *prev = (struct task_struct *)ctx->args[1]; + struct task_struct *next= (struct task_struct *)ctx->args[2]; + u64 ktime = 0; + + if (!capture_enabled__()) + return 0; + + if (prev-><STATE_FIELD> == TASK_RUNNING && prev->tgid == MONITOR_PID) + sched_wakeup__(prev->pid, prev->tgid); + + if (prev->tgid == MONITOR_PID) { + ktime = bpf_ktime_get_ns(); + thread_stop_run(prev->pid, prev->tgid, ktime); + } + + u64 pid_tgid = (u64)next->tgid << 32 | next->pid; + + if (next->tgid != MONITOR_PID) + return 0; + + if (ktime == 0) + ktime = bpf_ktime_get_ns(); + + u64 *start_ns = ready_start.lookup(&pid_tgid); + + if (start_ns && *start_ns != 0) { + + struct ready_data_t *val, zero = {}; + struct pid_tid_key_t key = { .pid = next->tgid, + .tid = next->pid }; + + val = ready_data.lookup_or_try_init(&key, &zero); + if (val) { + u64 delta = ktime - *start_ns; + val->count++; + val->total_ns += delta; + if (val->worst_ns == 0 || delta > val->worst_ns) + val->worst_ns = delta; + } + *start_ns = 0; + } + + thread_start_run(pid_tgid, ktime); + return 0; +} + + +/* + * For measuring the hard irq time, we need the following. 
+ */ +struct hardirq_start_data_t { + u64 start_ns; + char irq_name[32]; +}; + +struct hardirq_data_t { + u64 count; + u64 total_ns; + u64 worst_ns; +}; + +struct hardirq_data_key_t { + u32 pid; + u32 tid; + char irq_name[32]; +}; + +BPF_HASH(hardirq_start, u64, struct hardirq_start_data_t); +BPF_HASH(hardirq_data, struct hardirq_data_key_t, struct hardirq_data_t); + +TRACEPOINT_PROBE(irq, irq_handler_entry) +{ + u64 pid_tgid = bpf_get_current_pid_tgid(); + + if (!capture_enabled(pid_tgid)) + return 0; + + struct hardirq_start_data_t data = {}; + + data.start_ns = bpf_ktime_get_ns(); + TP_DATA_LOC_READ_STR(&data.irq_name, name, sizeof(data.irq_name)); + hardirq_start.update(&pid_tgid, &data); + return 0; +} + +TRACEPOINT_PROBE(irq, irq_handler_exit) +{ + u64 pid_tgid = bpf_get_current_pid_tgid(); + + if (!capture_enabled(pid_tgid)) + return 0; + + struct hardirq_start_data_t *data; + data = hardirq_start.lookup(&pid_tgid); + if (!data || data->start_ns == 0) + return 0; + + if (args->ret != IRQ_NONE) { + struct hardirq_data_t *val, zero = {}; + struct hardirq_data_key_t key = { .pid = pid_tgid >> 32, + .tid = (u32)pid_tgid }; + + bpf_probe_read_kernel(&key.irq_name, sizeof(key.irq_name), + data->irq_name); + val = hardirq_data.lookup_or_try_init(&key, &zero); + if (val) { + u64 delta = bpf_ktime_get_ns() - data->start_ns; + val->count++; + val->total_ns += delta; + if (val->worst_ns == 0 || delta > val->worst_ns) + val->worst_ns = delta; + } + } + + data->start_ns = 0; + return 0; +} + + +/* + * For measuring the soft irq time, we need the following. 
+ */ +struct softirq_start_data_t { + u64 start_ns; + u32 vec_nr; +}; + +struct softirq_data_t { + u64 count; + u64 total_ns; + u64 worst_ns; +}; + +struct softirq_data_key_t { + u32 pid; + u32 tid; + u32 vec_nr; +}; + +BPF_HASH(softirq_start, u64, struct softirq_start_data_t); +BPF_HASH(softirq_data, struct softirq_data_key_t, struct softirq_data_t); + +TRACEPOINT_PROBE(irq, softirq_entry) +{ + u64 pid_tgid = bpf_get_current_pid_tgid(); + + if (!capture_enabled(pid_tgid)) + return 0; + + struct softirq_start_data_t data = {}; + + data.start_ns = bpf_ktime_get_ns(); + data.vec_nr = args->vec; + softirq_start.update(&pid_tgid, &data); + return 0; +} + +TRACEPOINT_PROBE(irq, softirq_exit) +{ + u64 pid_tgid = bpf_get_current_pid_tgid(); + + if (!capture_enabled(pid_tgid)) + return 0; + + struct softirq_start_data_t *data; + data = softirq_start.lookup(&pid_tgid); + if (!data || data->start_ns == 0) + return 0; + + struct softirq_data_t *val, zero = {}; + struct softirq_data_key_t key = { .pid = pid_tgid >> 32, + .tid = (u32)pid_tgid, + .vec_nr = data->vec_nr}; + + val = softirq_data.lookup_or_try_init(&key, &zero); + if (val) { + u64 delta = bpf_ktime_get_ns() - data->start_ns; + val->count++; + val->total_ns += delta; + if (val->worst_ns == 0 || delta > val->worst_ns) + val->worst_ns = delta; + } + + data->start_ns = 0; + return 0; +} +""" + + +# +# time_ns() +# +try: + from time import time_ns +except ImportError: + # For compatibility with Python <= v3.6. + def time_ns(): + now = datetime.datetime.now() + return int(now.timestamp() * 1e9) + + +# +# Probe class to use for the start/stop triggers +# +class Probe(object): + ''' + The goal for this object is to support as many as possible + probe/events as supported by BCC. 
See + https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments + ''' + def __init__(self, probe, pid=None): + self.pid = pid + self.text_probe = probe + self._parse_text_probe() + + def __str__(self): + if self.probe_type == "usdt": + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, + self.usdt_provider, self.usdt_probe) + elif self.probe_type == "trace": + return "[{}]; {}:{}:{}".format(self.text_probe, self.probe_type, + self.trace_system, self.trace_event) + elif self.probe_type == "kprobe" or self.probe_type == "kretprobe": + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, + self.kprobe_function) + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": + return "[{}]; {}:{}".format(self.text_probe, self.probe_type, + self.uprobe_function) + else: + return "[{}] <{}:unknown probe>".format(self.text_probe, + self.probe_type) + + def _raise(self, error): + raise ValueError("[{}]; {}".format(self.text_probe, error)) + + def _verify_kprobe_probe(self): + # Nothing to verify for now, just return. + return + + def _verify_trace_probe(self): + # Nothing to verify for now, just return. + return + + def _verify_uprobe_probe(self): + # Nothing to verify for now, just return. 
+ return + + def _verify_usdt_probe(self): + if not self.pid: + self._raise("USDT probes need a valid PID.") + + usdt = USDT(pid=self.pid) + + for probe in usdt.enumerate_probes(): + if probe.provider.decode('utf-8') == self.usdt_provider and \ + probe.name.decode('utf-8') == self.usdt_probe: + return + + self._raise("Can't find UDST probe '{}:{}'".format(self.usdt_provider, + self.usdt_probe)) + + def _parse_text_probe(self): + ''' + The text probe format is defined as follows: + <probe_type>:<probe_specific> + + Types: + USDT: u|usdt:<provider>:<probe> + TRACE: t|trace:<system>:<event> + KPROBE: k|kprobe:<kernel_function> + KRETPROBE: kr|kretprobe:<kernel_function> + UPROBE: up|uprobe:<function> + URETPROBE: ur|uretprobe:<function> + ''' + args = self.text_probe.split(":") + if len(args) <= 1: + self._raise("Can't extract probe type.") + + if args[0] not in ["k", "kprobe", "kr", "kretprobe", "t", "trace", + "u", "usdt", "up", "uprobe", "ur", "uretprobe"]: + self._raise("Invalid probe type '{}'".format(args[0])) + + self.probe_type = "kprobe" if args[0] == "k" else args[0] + self.probe_type = "kretprobe" if args[0] == "kr" else self.probe_type + self.probe_type = "trace" if args[0] == "t" else self.probe_type + self.probe_type = "usdt" if args[0] == "u" else self.probe_type + self.probe_type = "uprobe" if args[0] == "up" else self.probe_type + self.probe_type = "uretprobe" if args[0] == "ur" else self.probe_type + + if self.probe_type == "usdt": + if len(args) != 3: + self._raise("Invalid number of arguments for USDT") + + self.usdt_provider = args[1] + self.usdt_probe = args[2] + self._verify_usdt_probe() + + elif self.probe_type == "trace": + if len(args) != 3: + self._raise("Invalid number of arguments for TRACE") + + self.trace_system = args[1] + self.trace_event = args[2] + self._verify_trace_probe() + + elif self.probe_type == "kprobe" or self.probe_type == "kretprobe": + if len(args) != 2: + self._raise("Invalid number of arguments for K(RET)PROBE") + 
self.kprobe_function = args[1] + self._verify_kprobe_probe() + + elif self.probe_type == "uprobe" or self.probe_type == "uretprobe": + if len(args) != 2: + self._raise("Invalid number of arguments for U(RET)PROBE") + self.uprobe_function = args[1] + self._verify_uprobe_probe() + + def _get_kprobe_c_code(self, function_name, function_content): + # + # The kprobe__* do not require a function name, so it's + # ignored in the code generation. + # + return """ +int {}__{}(struct pt_regs *ctx) {{ + {} +}} +""".format(self.probe_type, self.kprobe_function, function_content) + + def _get_trace_c_code(self, function_name, function_content): + # + # The TRACEPOINT_PROBE() do not require a function name, so it's + # ignored in the code generation. + # + return """ +TRACEPOINT_PROBE({},{}) {{ + {} +}} +""".format(self.trace_system, self.trace_event, function_content) + + def _get_uprobe_c_code(self, function_name, function_content): + return """ +int {}(struct pt_regs *ctx) {{ + {} +}} +""".format(function_name, function_content) + + def _get_usdt_c_code(self, function_name, function_content): + return """ +int {}(struct pt_regs *ctx) {{ + {} +}} +""".format(function_name, function_content) + + def get_c_code(self, function_name, function_content): + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': + return self._get_kprobe_c_code(function_name, function_content) + elif self.probe_type == 'trace': + return self._get_trace_c_code(function_name, function_content) + elif self.probe_type == 'uprobe' or self.probe_type == 'uretprobe': + return self._get_uprobe_c_code(function_name, function_content) + elif self.probe_type == 'usdt': + return self._get_usdt_c_code(function_name, function_content) + + return "" + + def probe_name(self): + if self.probe_type == 'kprobe' or self.probe_type == 'kretprobe': + return "{}".format(self.kprobe_function) + elif self.probe_type == 'trace': + return "{}:{}".format(self.trace_system, + self.trace_event) + elif self.probe_type == 
'uprobe' or self.probe_type == 'uretprobe': + return "{}".format(self.uprobe_function) + elif self.probe_type == 'usdt': + return "{}:{}".format(self.usdt_provider, + self.usdt_probe) + + return "" + + +# +# event_to_dict() +# +def event_to_dict(event): + return dict([(field, getattr(event, field)) + for (field, _) in event._fields_ + if isinstance(getattr(event, field), (int, bytes))]) + + +# +# Event enum +# +Event = IntEnum("Event", ["SYSCALL", "START_TRIGGER", "STOP_TRIGGER"], + start=0) + + +# +# process_event() +# +def process_event(ctx, data, size): + global start_trigger_ts + global stop_trigger_ts + + event = bpf['events'].event(data) + if event.id == Event.SYSCALL: + syscall_events.append({"tid": event.tid, + "ts_entry": event.entry_ts, + "ts_exit": event.ts, + "syscall": event.syscall, + "user_stack_id": event.user_stack_id, + "kernel_stack_id": event.kernel_stack_id}) + elif event.id == Event.START_TRIGGER: + # + # This event would have started the trigger already, so all we need to + # do is record the start timestamp. + # + start_trigger_ts = event.ts + + elif event.id == Event.STOP_TRIGGER: + # + # This event would have stopped the trigger already, so all we need to + # do is record the start timestamp. 
+        stop_trigger_ts = event.ts
+
+
+#
+# next_power_of_two()
+#
+def next_power_of_two(val):
+    np = 1
+    while np < val:
+        np *= 2
+    return np
+
+
+#
+# unsigned_int()
+#
+def unsigned_int(value):
+    try:
+        value = int(value)
+    except ValueError:
+        raise argparse.ArgumentTypeError("must be an integer")
+
+    if value < 0:
+        raise argparse.ArgumentTypeError("must be positive")
+    return value
+
+
+#
+# unsigned_nonzero_int()
+#
+def unsigned_nonzero_int(value):
+    value = unsigned_int(value)
+    if value == 0:
+        raise argparse.ArgumentTypeError("must be nonzero")
+    return value
+
+
+#
+# get_thread_name()
+#
+def get_thread_name(pid, tid):
+    try:
+        with open(f"/proc/{pid}/task/{tid}/comm", encoding="utf8") as f:
+            return f.readline().strip("\n")
+    except FileNotFoundError:
+        pass
+
+    return f"<unknown:{pid}/{tid}>"
+
+
+#
+# get_vec_nr_name()
+#
+def get_vec_nr_name(vec_nr):
+    known_vec_nr = ["hi", "timer", "net_tx", "net_rx", "block", "irq_poll",
+                    "tasklet", "sched", "hrtimer", "rcu"]
+
+    if vec_nr < 0 or vec_nr >= len(known_vec_nr):
+        return f"<unknown:{vec_nr}>"
+
+    return known_vec_nr[vec_nr]
+
+
+#
+# start/stop/reset capture
+#
+def start_capture():
+    bpf["capture_on"][ct.c_int(0)] = ct.c_int(1)
+
+
+def stop_capture(force=False):
+    if force:
+        bpf["capture_on"][ct.c_int(0)] = ct.c_int(0xffff)
+    else:
+        bpf["capture_on"][ct.c_int(0)] = ct.c_int(0)
+
+
+def capture_running():
+    return bpf["capture_on"][ct.c_int(0)].value == 1
+
+
+def reset_capture():
+    bpf["syscall_start"].clear()
+    bpf["syscall_data"].clear()
+    bpf["run_start"].clear()
+    bpf["run_data"].clear()
+    bpf["ready_start"].clear()
+    bpf["ready_data"].clear()
+    bpf["hardirq_start"].clear()
+    bpf["hardirq_data"].clear()
+    bpf["softirq_start"].clear()
+    bpf["softirq_data"].clear()
+    bpf["stack_traces"].clear()
+
+
+#
+# Display timestamp
+#
+def print_timestamp(msg):
+    ltz = datetime.datetime.now()
+    utc = ltz.astimezone(pytz.utc)
+    time_string = "{} @{} ({} UTC)".format(
+        msg, ltz.isoformat(),
utc.strftime("%H:%M:%S")) + print(time_string) + + +# +# process_results() +# +def process_results(syscall_events=None, trigger_delta=None): + if trigger_delta: + print_timestamp("# Triggered sample dump, stop-start delta {:,} ns". + format(trigger_delta)) + else: + print_timestamp("# Sample dump") + + # + # First get a list of all threads we need to report on. + # + threads_syscall = {k.tid for k, _ in bpf["syscall_data"].items() + if k.syscall != 0xffffffff} + + threads_run = {k.tid for k, _ in bpf["run_data"].items() + if k.pid != 0xffffffff} + + threads_ready = {k.tid for k, _ in bpf["ready_data"].items() + if k.pid != 0xffffffff} + + threads_hardirq = {k.tid for k, _ in bpf["hardirq_data"].items() + if k.pid != 0xffffffff} + + threads_softirq = {k.tid for k, _ in bpf["softirq_data"].items() + if k.pid != 0xffffffff} + + threads = sorted(threads_syscall | threads_run | threads_ready | + threads_hardirq | threads_softirq, + key=lambda x: get_thread_name(options.pid, x)) + + # + # Print header... + # + print("{:10} {:16} {}".format("TID", "THREAD", "<RESOURCE SPECIFIC>")) + print("{:10} {:16} {}".format("-" * 10, "-" * 16, "-" * 76)) + indent = 28 * " " + + # + # Print all events/statistics per threads. 
+ # + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0] + for thread in threads: + + if thread != threads[0]: + print("") + + # + # SYSCALL_STATISTICS + # + print("{:10} {:16} {}\n{}{:20} {:>6} {:>10} {:>16} {:>16}".format( + thread, get_thread_name(options.pid, thread), + "[SYSCALL STATISTICS]", indent, + "NAME", "NUMBER", "COUNT", "TOTAL ns", "MAX ns")) + + total_count = 0 + total_ns = 0 + for k, v in sorted(filter(lambda t: t[0].tid == thread, + bpf["syscall_data"].items()), + key=lambda kv: -kv[1].total_ns): + + print("{}{:20.20} {:6} {:10} {:16,} {:16,}".format( + indent, syscall_name(k.syscall).decode('utf-8'), k.syscall, + v.count, v.total_ns, v.worst_ns)) + if k.syscall != poll_id: + total_count += v.count + total_ns += v.total_ns + + if total_count > 0: + print("{}{:20.20} {:6} {:10} {:16,}".format( + indent, "TOTAL( - poll):", "", total_count, total_ns)) + + # + # THREAD RUN STATISTICS + # + print("\n{:10} {:16} {}\n{}{:10} {:>16} {:>16} {:>16}".format( + "", "", "[THREAD RUN STATISTICS]", indent, + "SCHED_CNT", "TOTAL ns", "MIN ns", "MAX ns")) + + for k, v in filter(lambda t: t[0].tid == thread, + bpf["run_data"].items()): + + print("{}{:10} {:16,} {:16,} {:16,}".format( + indent, v.count, v.total_ns, v.min_ns, v.max_ns)) + + # + # THREAD READY STATISTICS + # + print("\n{:10} {:16} {}\n{}{:10} {:>16} {:>16}".format( + "", "", "[THREAD READY STATISTICS]", indent, + "SCHED_CNT", "TOTAL ns", "MAX ns")) + + for k, v in filter(lambda t: t[0].tid == thread, + bpf["ready_data"].items()): + + print("{}{:10} {:16,} {:16,}".format( + indent, v.count, v.total_ns, v.worst_ns)) + + # + # HARD IRQ STATISTICS + # + total_ns = 0 + total_count = 0 + header_printed = False + for k, v in sorted(filter(lambda t: t[0].tid == thread, + bpf["hardirq_data"].items()), + key=lambda kv: -kv[1].total_ns): + + if not header_printed: + print("\n{:10} {:16} {}\n{}{:20} {:>10} {:>16} {:>16}". 
+ format("", "", "[HARD IRQ STATISTICS]", indent, + "NAME", "COUNT", "TOTAL ns", "MAX ns")) + header_printed = True + + print("{}{:20.20} {:10} {:16,} {:16,}".format( + indent, k.irq_name.decode('utf-8'), + v.count, v.total_ns, v.worst_ns)) + + total_count += v.count + total_ns += v.total_ns + + if total_count > 0: + print("{}{:20.20} {:10} {:16,}".format( + indent, "TOTAL:", total_count, total_ns)) + + # + # SOFT IRQ STATISTICS + # + total_ns = 0 + total_count = 0 + header_printed = False + for k, v in sorted(filter(lambda t: t[0].tid == thread, + bpf["softirq_data"].items()), + key=lambda kv: -kv[1].total_ns): + + if not header_printed: + print("\n{:10} {:16} {}\n" + "{}{:20} {:>7} {:>10} {:>16} {:>16}". + format("", "", "[SOFT IRQ STATISTICS]", indent, + "NAME", "VECT_NR", "COUNT", "TOTAL ns", "MAX ns")) + header_printed = True + + print("{}{:20.20} {:>7} {:10} {:16,} {:16,}".format( + indent, get_vec_nr_name(k.vec_nr), k.vec_nr, + v.count, v.total_ns, v.worst_ns)) + + total_count += v.count + total_ns += v.total_ns + + if total_count > 0: + print("{}{:20.20} {:7} {:10} {:16,}".format( + indent, "TOTAL:", "", total_count, total_ns)) + + # + # Print events + # + lost_stack_traces = 0 + if syscall_events: + stack_traces = bpf.get_table("stack_traces") + + print("\n\n# SYSCALL EVENTS:" + "\n{}{:>19} {:>19} {:>10} {:16} {:>10} {}".format( + 2 * " ", "ENTRY (ns)", "EXIT (ns)", "TID", "COMM", + "DELTA (us)", "SYSCALL")) + print("{}{:19} {:19} {:10} {:16} {:10} {}".format( + 2 * " ", "-" * 19, "-" * 19, "-" * 10, "-" * 16, + "-" * 10, "-" * 16)) + for event in syscall_events: + print("{}{:19} {:19} {:10} {:16} {:10,} {}".format( + " " * 2, + event["ts_entry"], event["ts_exit"], event["tid"], + get_thread_name(options.pid, event["tid"]), + int((event["ts_exit"] - event["ts_entry"]) / 1000), + syscall_name(event["syscall"]).decode('utf-8'))) + # + # Not sure where to put this, but I'll add some info on stack + # traces here... 
Userspace stack traces are very limited due to + # the fact that bcc does not support dwarf backtraces. As OVS + # gets compiled without frame pointers we will not see much. + # If however, OVS does get built with frame pointers, we should not + # use the BPF_STACK_TRACE_BUILDID as it does not seem to handle + # the debug symbols correctly. Also, note that for kernel + # traces you should not use BPF_STACK_TRACE_BUILDID, so two + # buffers are needed. + # + # Some info on manual dwarf walk support: + # https://github.com/iovisor/bcc/issues/3515 + # https://github.com/iovisor/bcc/pull/4463 + # + if options.stack_trace_size == 0: + continue + + if event['kernel_stack_id'] < 0 or event['user_stack_id'] < 0: + lost_stack_traces += 1 + + kernel_stack = stack_traces.walk(event['kernel_stack_id']) \ + if event['kernel_stack_id'] >= 0 else [] + user_stack = stack_traces.walk(event['user_stack_id']) \ + if event['user_stack_id'] >= 0 else [] + + for addr in kernel_stack: + print("{}{}".format( + " " * 10, + bpf.ksym(addr, show_module=True, + show_offset=True).decode('utf-8', 'replace'))) + + for addr in user_stack: + addr_str = bpf.sym(addr, options.pid, show_module=True, + show_offset=True).decode('utf-8', 'replace') + + if addr_str == "[unknown]": + addr_str += " 0x{:x}".format(addr) + + print("{}{}".format(" " * 10, addr_str)) + + # + # Print any footer messages. 
+ # + if lost_stack_traces > 0: + print("\n# WARNING: We were not able to display {} stack traces!\n" + "# Consider increasing the stack trace size using\n" + "# the '--stack-trace-size' option.\n" + "# Note that this can also happen due to a stack id\n" + "# collision.".format(lost_stack_traces)) + + +# +# main() +# +def main(): + # + # Don't like these globals, but ctx passing does not seem to work with the + # existing open_ring_buffer() API :( + # + global bpf + global options + global syscall_events + global start_trigger_ts + global stop_trigger_ts + + start_trigger_ts = 0 + stop_trigger_ts = 0 + + # + # Argument parsing + # + parser = argparse.ArgumentParser() + + parser.add_argument("-D", "--debug", + help="Enable eBPF debugging", + type=int, const=0x3f, default=0, nargs='?') + parser.add_argument("-p", "--pid", metavar="VSWITCHD_PID", + help="ovs-vswitchd's PID", + type=unsigned_int, default=None) + parser.add_argument("-s", "--syscall-events", metavar="DURATION_NS", + help="Record syscall events that take longer than " + "DURATION_NS. Omit the duration value to record all " + "syscall events", + type=unsigned_int, const=0, default=None, nargs='?') + parser.add_argument("--buffer-page-count", + help="Number of BPF ring buffer pages, default 1024", + type=unsigned_int, default=1024, metavar="NUMBER") + parser.add_argument("--sample-count", + help="Number of sample runs, default 1", + type=unsigned_nonzero_int, default=1, metavar="RUNS") + parser.add_argument("--sample-interval", + help="Delay between sample runs, default 0", + type=float, default=0, metavar="SECONDS") + parser.add_argument("--sample-time", + help="Sample time, default 0.5 seconds", + type=float, default=0.5, metavar="SECONDS") + parser.add_argument("--skip-syscall-poll-events", + help="Skip poll() syscalls with --syscall-events", + action="store_true") + parser.add_argument("--stack-trace-size", + help="Number of unique stack traces that can be " + "recorded, default 4096.
0 to disable", + type=unsigned_int, default=4096) + parser.add_argument("--start-trigger", metavar="TRIGGER", + help="Start trigger, see documentation for details", + type=str, default=None) + parser.add_argument("--stop-trigger", metavar="TRIGGER", + help="Stop trigger, see documentation for details", + type=str, default=None) + parser.add_argument("--trigger-delta", metavar="DURATION_NS", + help="Only report event when the trigger duration > " + "DURATION_NS, default 0 (all events)", + type=unsigned_int, const=0, default=0, nargs='?') + + options = parser.parse_args() + + # + # Find the PID of the ovs-vswitchd daemon if not specified. + # + if not options.pid: + for proc in psutil.process_iter(): + if 'ovs-vswitchd' in proc.name(): + if options.pid: + print("ERROR: Multiple ovs-vswitchd daemons running, " + "use the -p option!") + sys.exit(os.EX_NOINPUT) + + options.pid = proc.pid + + # + # Error checking on input parameters. + # + if not options.pid: + print("ERROR: Failed to find ovs-vswitchd's PID!") + sys.exit(os.EX_UNAVAILABLE) + + options.buffer_page_count = next_power_of_two(options.buffer_page_count) + + # + # Make sure we are running as root, or else we can not attach the probes. + # + if os.geteuid() != 0: + print("ERROR: We need to run as root to attach probes!") + sys.exit(os.EX_NOPERM) + + # + # Set up any of the start/stop triggers. + # + if options.start_trigger is not None: + try: + start_trigger = Probe(options.start_trigger, pid=options.pid) + except ValueError as e: + print(f"ERROR: Invalid start trigger {str(e)}") + sys.exit(os.EX_CONFIG) + else: + start_trigger = None + + if options.stop_trigger is not None: + try: + stop_trigger = Probe(options.stop_trigger, pid=options.pid) + except ValueError as e: + print(f"ERROR: Invalid stop trigger {str(e)}") + sys.exit(os.EX_CONFIG) + else: + stop_trigger = None + + # + # Attach probe to running process.
+ # + source = EBPF_SOURCE.replace("<EVENT_ENUM>", "\n".join( + [" EVENT_{} = {},".format( + event.name, event.value) for event in Event])) + source = source.replace("<BUFFER_PAGE_CNT>", + str(options.buffer_page_count)) + source = source.replace("<MONITOR_PID>", str(options.pid)) + + if BPF.kernel_struct_has_field(b'task_struct', b'state') == 1: + source = source.replace('<STATE_FIELD>', 'state') + else: + source = source.replace('<STATE_FIELD>', '__state') + + poll_id = [k for k, v in syscalls.items() if v == b'poll'][0] + if options.syscall_events is None: + syscall_trace_events = "false" + elif options.syscall_events == 0: + if not options.skip_syscall_poll_events: + syscall_trace_events = "true" + else: + syscall_trace_events = f"args->id != {poll_id}" + else: + syscall_trace_events = "delta > {}".format(options.syscall_events) + if options.skip_syscall_poll_events: + syscall_trace_events += f" && args->id != {poll_id}" + + source = source.replace("<SYSCALL_TRACE_EVENTS>", + syscall_trace_events) + + source = source.replace("<STACK_TRACE_SIZE>", + str(options.stack_trace_size)) + + source = source.replace("<STACK_TRACE_ENABLED>", "true" + if options.stack_trace_size > 0 else "false") + + # + # Handle start/stop probes + # + if start_trigger: + source = source.replace("<START_TRIGGER>", + start_trigger.get_c_code( + "start_trigger_probe", + "return start_trigger();")) + else: + source = source.replace("<START_TRIGGER>", "") + + if stop_trigger: + source = source.replace("<STOP_TRIGGER>", + stop_trigger.get_c_code( + "stop_trigger_probe", + "return stop_trigger();")) + else: + source = source.replace("<STOP_TRIGGER>", "") + + # + # Set up USDT or other probes that need handling through the BPF class.
+ # + usdt = USDT(pid=int(options.pid)) + try: + if start_trigger and start_trigger.probe_type == 'usdt': + usdt.enable_probe(probe=start_trigger.probe_name(), + fn_name="start_trigger_probe") + if stop_trigger and stop_trigger.probe_type == 'usdt': + usdt.enable_probe(probe=stop_trigger.probe_name(), + fn_name="stop_trigger_probe") + + except USDTException as e: + print("ERROR: {}".format( + (re.sub('^', ' ' * 7, str(e), flags=re.MULTILINE)).strip(). + replace("--with-dtrace or --enable-dtrace", + "--enable-usdt-probes"))) + sys.exit(os.EX_OSERR) + + bpf = BPF(text=source, usdt_contexts=[usdt], debug=options.debug) + + if start_trigger: + try: + if start_trigger.probe_type == "uprobe": + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe", + sym=start_trigger.probe_name(), + fn_name="start_trigger_probe", + pid=options.pid) + + if start_trigger.probe_type == "uretprobe": + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe", + sym=start_trigger.probe_name(), + fn_name="start_trigger_probe", + pid=options.pid) + except Exception as e: + print("ERROR: Failed attaching uprobe start trigger " + f"'{start_trigger.probe_name()}';\n {str(e)}") + sys.exit(os.EX_OSERR) + + if stop_trigger: + try: + if stop_trigger.probe_type == "uprobe": + bpf.attach_uprobe(name=f"/proc/{options.pid}/exe", + sym=stop_trigger.probe_name(), + fn_name="stop_trigger_probe", + pid=options.pid) + + if stop_trigger.probe_type == "uretprobe": + bpf.attach_uretprobe(name=f"/proc/{options.pid}/exe", + sym=stop_trigger.probe_name(), + fn_name="stop_trigger_probe", + pid=options.pid) + except Exception as e: + print("ERROR: Failed attaching uprobe stop trigger" + f"'{stop_trigger.probe_name()}';\n {str(e)}") + sys.exit(os.EX_OSERR) + + # + # If no triggers are configured use the delay configuration + # + bpf['events'].open_ring_buffer(process_event) + + sample_count = 0 + while sample_count < options.sample_count: + sample_count += 1 + syscall_events = [] + + if not options.start_trigger: + 
print_timestamp("# Start sampling") + start_capture() + stop_time = -1 if options.stop_trigger else \ + time_ns() + options.sample_time * 1000000000 + else: + # For start triggers the stop time depends on the start trigger + # time, or depends on the stop trigger if configured. + stop_time = -1 if options.stop_trigger else 0 + + while True: + keyboard_interrupt = False + try: + last_start_ts = start_trigger_ts + last_stop_ts = stop_trigger_ts + + if stop_time > 0: + delay = int((stop_time - time_ns()) / 1000000) + if delay <= 0: + break + else: + delay = -1 + + bpf.ring_buffer_poll(timeout=delay) + + if stop_time <= 0 and last_start_ts != start_trigger_ts: + print_timestamp( + "# Start sampling (trigger@{})".format( + start_trigger_ts)) + + if not options.stop_trigger: + stop_time = time_ns() + \ + options.sample_time * 1000000000 + + if last_stop_ts != stop_trigger_ts: + break + + except KeyboardInterrupt: + keyboard_interrupt = True + break + + if options.stop_trigger and not capture_running(): + print_timestamp("# Stop sampling (trigger@{})".format( + stop_trigger_ts)) + else: + print_timestamp("# Stop sampling") + + if stop_trigger_ts != 0 and start_trigger_ts != 0: + trigger_delta = stop_trigger_ts - start_trigger_ts + else: + trigger_delta = None + + if not trigger_delta or trigger_delta >= options.trigger_delta: + stop_capture(force=True) # Prevent a new trigger to start. + process_results(syscall_events=syscall_events, + trigger_delta=trigger_delta) + elif trigger_delta: + sample_count -= 1 + print_timestamp("# Sample dump skipped, delta {:,} ns".format( + trigger_delta)) + + reset_capture() + stop_capture() + + if keyboard_interrupt: + break + + if options.sample_interval > 0: + time.sleep(options.sample_interval) + + # + # Report lost events. 
+ # + dropcnt = bpf.get_table("dropcnt") + for k in dropcnt.keys(): + count = dropcnt.sum(k).value + if k.value == 0 and count > 0: + print("\n# WARNING: Not all events were captured, {} were " + "dropped!\n# Increase the BPF ring buffer size " + "with the --buffer-page-count option.".format(count)) + + if options.sample_count > 1: + trigger_miss = bpf.get_table("trigger_miss") + for k in trigger_miss.keys(): + count = trigger_miss.sum(k).value + if k.value == 0 and count > 0: + print("\n# WARNING: Not all start triggers were successful. " + "{} were missed due to\n# slow userspace " + "processing!".format(count)) + + +# +# Start main() as the default entry point... +# +if __name__ == '__main__': + main() diff --git a/utilities/usdt-scripts/kernel_delay.rst b/utilities/usdt-scripts/kernel_delay.rst new file mode 100644 index 000000000..0ebd30afb --- /dev/null +++ b/utilities/usdt-scripts/kernel_delay.rst @@ -0,0 +1,596 @@ +Troubleshooting Open vSwitch: Is the kernel to blame? +===================================================== +Often, when troubleshooting Open vSwitch (OVS) in the field, you might be left +wondering if the issue is really OVS-related, or if it's a problem with the +kernel being overloaded. Messages in the log like +``Unreasonably long XXXXms poll interval`` might suggest it's OVS, but from +experience, these are mostly related to an overloaded Linux kernel. +The ``kernel_delay.py`` tool can help you quickly identify if the focus of your +investigation should be OVS or the Linux kernel. + + +Introduction +------------ +``kernel_delay.py`` consists of a Python script that uses the BCC [#BCC]_ +framework to install eBPF probes. The data the eBPF probes collect will be +analyzed and presented to the user by the Python script. Some of the presented +data can also be captured by the individual scripts included in the BCC [#BCC]_ +framework.
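To make the reported statistics concrete, here is a hypothetical, pure-Python sketch of the userspace side of such a tool: raw per-event durations are folded into the per-syscall ``COUNT``/``TOTAL ns``/``MAX ns`` numbers. The helper name and the sample data are illustrative only; they are not the script's actual BPF map layout.

```python
# Hypothetical, simplified model of the userspace aggregation step:
# per-event syscall durations are folded into the per-syscall
# statistics (COUNT, TOTAL ns, MAX ns) shown in the report.
from collections import defaultdict


def aggregate_syscalls(events):
    """events: iterable of (tid, syscall_name, duration_ns) tuples."""
    stats = defaultdict(lambda: {"count": 0, "total_ns": 0, "max_ns": 0})
    for tid, name, duration_ns in events:
        entry = stats[(tid, name)]
        entry["count"] += 1
        entry["total_ns"] += duration_ns
        entry["max_ns"] = max(entry["max_ns"], duration_ns)
    return dict(stats)


events = [(101, "poll", 5000), (101, "poll", 9000), (101, "recvmsg", 1200)]
stats = aggregate_syscalls(events)
print(stats[(101, "poll")])  # {'count': 2, 'total_ns': 14000, 'max_ns': 9000}
```

The real script performs this accumulation inside the eBPF probes and only reads back the finished counters, which keeps the per-event overhead in the kernel low.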
+ +kernel_delay.py has two modes of operation: + +- In **time mode**, the tool runs for a specific time and collects the + information. +- In **trigger mode**, event collection can be started and/or stopped based on + a specific eBPF probe. Currently, the following probes are supported: + + - USDT probes + - Kernel tracepoints + - kprobe + - kretprobe + - uprobe + - uretprobe + + +In addition, the ``--sample-count`` option exists to specify how many sample +iterations you would like to run. When using triggers, you can also ignore +samples shorter than a given number of nanoseconds with the +``--trigger-delta`` option. The latter might be useful when debugging Linux +syscalls which take a long time to complete. More on this later. Finally, you +can configure the delay between two sample runs with the ``--sample-interval`` +option. + +Before getting into more details, you can run the tool without any options +to see what the output looks like. Notice that it will try to automatically +get the process ID of the running ``ovs-vswitchd``. You can override this +with the ``--pid`` option. + +.. 
code-block:: console + + $ sudo ./kernel_delay.py + # Start sampling @2023-06-08T12:17:22.725127 (10:17:22 UTC) + # Stop sampling @2023-06-08T12:17:23.224781 (10:17:23 UTC) + # Sample dump @2023-06-08T12:17:23.224855 (10:17:23 UTC) + TID THREAD <RESOURCE SPECIFIC> + ---------- ---------------- ---------------------------------------------------------------------------- + 27090 ovs-vswitchd [SYSCALL STATISTICS] + <EDIT: REMOVED DATA FOR ovs-vswitchd THREAD> + + 31741 revalidator122 [SYSCALL STATISTICS] + NAME NUMBER COUNT TOTAL ns MAX ns + poll 7 5 184,193,176 184,191,520 + recvmsg 47 494 125,208,756 310,331 + futex 202 8 18,768,758 4,023,039 + sendto 44 10 375,861 266,867 + sendmsg 46 4 43,294 11,213 + write 1 1 5,949 5,949 + getrusage 98 1 1,424 1,424 + read 0 1 1,292 1,292 + TOTAL( - poll): 519 144,405,334 + + [THREAD RUN STATISTICS] + SCHED_CNT TOTAL ns MIN ns MAX ns + 6 136,764,071 1,480 115,146,424 + + [THREAD READY STATISTICS] + SCHED_CNT TOTAL ns MAX ns + 7 11,334 6,636 + + [HARD IRQ STATISTICS] + NAME COUNT TOTAL ns MAX ns + eno8303-rx-1 1 3,586 3,586 + TOTAL: 1 3,586 + + [SOFT IRQ STATISTICS] + NAME VECT_NR COUNT TOTAL ns MAX ns + net_rx 3 1 17,699 17,699 + sched 7 6 13,820 3,226 + rcu 9 16 13,586 1,554 + timer 1 3 10,259 3,815 + TOTAL: 26 55,364 + + +By default, the tool will run for half a second in `time mode`. To extend this +you can use the ``--sample-time`` option. + + +What will it report +------------------- +The above sample output separates the captured data on a per-thread basis. +For this, it displays the thread's id (``TID``) and name (``THREAD``), +followed by resource-specific data. Which are: + +- ``SYSCALL STATISTICS`` +- ``THREAD RUN STATISTICS`` +- ``THREAD READY STATISTICS`` +- ``HARD IRQ STATISTICS`` +- ``SOFT IRQ STATISTICS`` + +The following sections will describe in detail what statistics they report. 
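As a side note on how the summary line is derived: the ``TOTAL( - poll)`` row in the sample output is simply the sum over all syscalls excluding ``poll``. A minimal sketch of that computation, using a subset of the numbers from the sample above (the rationale for excluding ``poll`` is covered in the next section):

```python
# Sketch of the "TOTAL( - poll)" line: sum count and total time over all
# syscalls except poll, whose long blocking waits would dominate the totals.
syscall_stats = {
    "poll":    {"count": 5,   "total_ns": 184_193_176},
    "recvmsg": {"count": 494, "total_ns": 125_208_756},
    "futex":   {"count": 8,   "total_ns": 18_768_758},
}

total_count = sum(v["count"] for k, v in syscall_stats.items() if k != "poll")
total_ns = sum(v["total_ns"] for k, v in syscall_stats.items() if k != "poll")
print(f"TOTAL( - poll): {total_count} {total_ns:,}")
# TOTAL( - poll): 502 143,977,514
```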
+ + +``SYSCALL STATISTICS`` +~~~~~~~~~~~~~~~~~~~~~~ +``SYSCALL STATISTICS`` tell you which Linux system calls were executed during +the measurement interval. This includes the number of times the syscall was +called (``COUNT``), the total time spent in the system calls (``TOTAL ns``), +and the worst-case duration of a single call (``MAX ns``). + +It also shows the total of all system calls, but it excludes the poll system +call, as the purpose of this call is to wait for activity on a set of sockets, +and usually, the thread gets swapped out. + +Note that it only counts calls that started and stopped during the +measurement interval! + + +``THREAD RUN STATISTICS`` +~~~~~~~~~~~~~~~~~~~~~~~~~ +``THREAD RUN STATISTICS`` tell you how long the thread was running on a CPU +during the measurement interval. + +Note that these statistics only count events where the thread started and +stopped running on a CPU during the measurement interval. For example, if +this was a PMD thread, you should see zero ``SCHED_CNT`` and ``TOTAL ns``. +If not, there might be a misconfiguration. + + +``THREAD READY STATISTICS`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +``THREAD READY STATISTICS`` tell you the time between the thread being ready +to run and it actually running on the CPU. + +Note that these statistics only count events where the thread was getting +ready to run and started running during the measurement interval. + + +``HARD IRQ STATISTICS`` +~~~~~~~~~~~~~~~~~~~~~~~ +``HARD IRQ STATISTICS`` tell you how much time was spent servicing hard +interrupts during the thread's run time. + +It shows the interrupt name (``NAME``), the number of interrupts (``COUNT``), +the total time spent in the interrupt handler (``TOTAL ns``), and the +worst-case duration (``MAX ns``). + + +``SOFT IRQ STATISTICS`` +~~~~~~~~~~~~~~~~~~~~~~~ +``SOFT IRQ STATISTICS`` tell you how much time was spent servicing soft +interrupts during the thread's run time.
+ +It shows the interrupt name (``NAME``), vector number (``VECT_NR``), the +number of interrupts (``COUNT``), the total time spent in the interrupt +handler (``TOTAL ns``), and the worst-case duration (``MAX ns``). + + +The ``--syscall-events`` option +------------------------------- +In addition to reporting global syscall statistics in ``SYSCALL_STATISTICS``, +the tool can also report each individual syscall. This can be a useful +second step if the ``SYSCALL_STATISTICS`` show high latency numbers. + +All you need to do is add the ``--syscall-events`` option, with or without +the additional ``DURATION_NS`` parameter. The ``DURATION_NS`` parameter +allows you to exclude events that take less than the supplied time. + +The ``--skip-syscall-poll-events`` option allows you to exclude poll +syscalls from the report. + +Below is an example run; note that the resource-specific data is removed +to highlight the syscall events: + +.. code-block:: console + + $ sudo ./kernel_delay.py --syscall-events 50000 --skip-syscall-poll-events + # Start sampling @2023-06-13T17:10:46.460874 (15:10:46 UTC) + # Stop sampling @2023-06-13T17:10:46.960727 (15:10:46 UTC) + # Sample dump @2023-06-13T17:10:46.961033 (15:10:46 UTC) + TID THREAD <RESOURCE SPECIFIC> + ---------- ---------------- ---------------------------------------------------------------------------- + 3359686 ipf_clean2 [SYSCALL STATISTICS] + ... + 3359635 ovs-vswitchd [SYSCALL STATISTICS] + ... + 3359697 revalidator12 [SYSCALL STATISTICS] + ... + 3359698 revalidator13 [SYSCALL STATISTICS] + ... + 3359699 revalidator14 [SYSCALL STATISTICS] + ... + 3359700 revalidator15 [SYSCALL STATISTICS] + ...
+ + # SYSCALL EVENTS: + ENTRY (ns) EXIT (ns) TID COMM DELTA (us) SYSCALL + ------------------- ------------------- ---------- ---------------- ---------- ---------------- + 2161821694935486 2161821695031201 3359699 revalidator14 95 futex + syscall_exit_to_user_mode_prepare+0x161 [kernel] + syscall_exit_to_user_mode_prepare+0x161 [kernel] + syscall_exit_to_user_mode+0x9 [kernel] + do_syscall_64+0x68 [kernel] + entry_SYSCALL_64_after_hwframe+0x72 [kernel] + __GI___lll_lock_wait+0x30 [libc.so.6] + ovs_mutex_lock_at+0x18 [ovs-vswitchd] + [unknown] 0x696c003936313a63 + 2161821695276882 2161821695333687 3359698 revalidator13 56 futex + syscall_exit_to_user_mode_prepare+0x161 [kernel] + syscall_exit_to_user_mode_prepare+0x161 [kernel] + syscall_exit_to_user_mode+0x9 [kernel] + do_syscall_64+0x68 [kernel] + entry_SYSCALL_64_after_hwframe+0x72 [kernel] + __GI___lll_lock_wait+0x30 [libc.so.6] + ovs_mutex_lock_at+0x18 [ovs-vswitchd] + [unknown] 0x696c003134313a63 + 2161821695275820 2161821695405733 3359700 revalidator15 129 futex + syscall_exit_to_user_mode_prepare+0x161 [kernel] + syscall_exit_to_user_mode_prepare+0x161 [kernel] + syscall_exit_to_user_mode+0x9 [kernel] + do_syscall_64+0x68 [kernel] + entry_SYSCALL_64_after_hwframe+0x72 [kernel] + __GI___lll_lock_wait+0x30 [libc.so.6] + ovs_mutex_lock_at+0x18 [ovs-vswitchd] + [unknown] 0x696c003936313a63 + 2161821695964969 2161821696052021 3359635 ovs-vswitchd 87 accept + syscall_exit_to_user_mode_prepare+0x161 [kernel] + syscall_exit_to_user_mode_prepare+0x161 [kernel] + syscall_exit_to_user_mode+0x9 [kernel] + do_syscall_64+0x68 [kernel] + entry_SYSCALL_64_after_hwframe+0x72 [kernel] + __GI_accept+0x4d [libc.so.6] + pfd_accept+0x3a [ovs-vswitchd] + [unknown] 0x7fff19f2bd00 + [unknown] 0xe4b8001f0f + +As you can see above, the output also shows the stack backtrace. You can +disable this using the ``--stack-trace-size 0`` option.
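The ``DURATION_NS`` threshold and the poll-skipping behave like a simple filter over the captured events. The sketch below is an illustrative, pure-Python model of that filter; in the real script the equivalent check runs inside the eBPF probe so that uninteresting events never reach userspace.

```python
# Hedged sketch of the --syscall-events DURATION_NS filter: keep only
# events whose exit-to-entry delta exceeds the threshold, optionally
# skipping poll() syscalls entirely.
def filter_events(events, duration_ns=0, skip_poll=False):
    """events: list of dicts with ts_entry, ts_exit, and syscall name."""
    kept = []
    for ev in events:
        delta = ev["ts_exit"] - ev["ts_entry"]
        if delta < duration_ns:
            continue  # faster than the reporting threshold
        if skip_poll and ev["syscall"] == "poll":
            continue  # poll() mostly measures idle waiting
        kept.append(ev)
    return kept


events = [
    {"ts_entry": 1_000, "ts_exit": 96_000, "syscall": "futex"},
    {"ts_entry": 2_000, "ts_exit": 22_000, "syscall": "recvmsg"},
    {"ts_entry": 3_000, "ts_exit": 903_000, "syscall": "poll"},
]
print(len(filter_events(events, duration_ns=50_000, skip_poll=True)))  # 1
```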
+ +As you can see above, the backtrace does not show a lot of useful information +due to the BCC [#BCC]_ toolkit not supporting DWARF decoding. So to further +analyze system call backtraces, you could use perf. The following perf +script can do this for you (refer to the embedded instructions): + +https://github.com/chaudron/perf_scripts/blob/master/analyze_perf_pmd_syscall.py + + +Using triggers +-------------- +The tool supports start and/or stop triggers. This will allow you to capture +statistics triggered by a specific event. The following combinations of +stop-and-start triggers can be used. + +If you only use ``--start-trigger``, the inspection starts when the trigger +happens and runs until the ``--sample-time`` number of seconds has passed. +The example below shows all the supported options in this scenario. + +.. code-block:: console + + $ sudo ./kernel_delay.py --start-trigger up:bridge_run --sample-time 4 \ + --sample-count 4 --sample-interval 1 + + +If you only use ``--stop-trigger``, the inspection starts immediately and +stops when the trigger happens. The example below shows all the supported +options in this scenario. + +.. code-block:: console + + $ sudo ./kernel_delay.py --stop-trigger upr:bridge_run \ + --sample-count 4 --sample-interval 1 + + +If you use both ``--start-trigger`` and ``--stop-trigger`` triggers, the +statistics are captured between the first two occurrences of these events. +The example below shows all the supported options in this scenario. + +.. code-block:: console + + $ sudo ./kernel_delay.py --start-trigger up:bridge_run \ + --stop-trigger upr:bridge_run \ + --sample-count 4 --sample-interval 1 \ + --trigger-delta 50000 + +What triggers are supported? Note that what ``kernel_delay.py`` calls triggers, +BCC [#BCC]_ calls events; these are eBPF tracepoints you can attach to. +For more details on the supported tracepoints, check out the BCC +documentation [#BCC_EVENT]_.
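A trigger specification is a short prefix naming the probe type, followed by a colon-separated target. The helper below is a hypothetical, simplified parser for such strings; it mirrors the documented prefixes but is not the script's actual ``Probe`` class (note that both the ``upr`` and ``ur`` spellings appear in this document's uretprobe examples, so both are accepted here).

```python
# Illustrative parser for trigger strings such as "u:provider:probe" or
# "up:bridge_run". The prefix table mirrors the documented formats; the
# helper itself is a hypothetical simplification.
PROBE_TYPES = {
    "u": "usdt", "usdt": "usdt",
    "t": "tracepoint", "trace": "tracepoint",
    "k": "kprobe", "kprobe": "kprobe",
    "kr": "kretprobe", "kretprobe": "kretprobe",
    "up": "uprobe", "uprobe": "uprobe",
    "upr": "uretprobe", "ur": "uretprobe", "uretprobe": "uretprobe",
}


def parse_trigger(spec):
    """Split 'prefix:target' into (probe_type, target)."""
    prefix, _, target = spec.partition(":")
    if prefix not in PROBE_TYPES or not target:
        raise ValueError(f"invalid trigger: {spec!r}")
    return PROBE_TYPES[prefix], target


print(parse_trigger("u:udpif_revalidator:start_dump"))
# ('usdt', 'udpif_revalidator:start_dump')
print(parse_trigger("up:bridge_run"))
# ('uprobe', 'bridge_run')
```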
+ +The list below shows the supported triggers and their argument format: + +**USDT probes:** + [u|usdt]:{provider}:{probe} +**Kernel tracepoint:** + [t|trace]:{system}:{event} +**kprobe:** + [k|kprobe]:{kernel_function} +**kretprobe:** + [kr|kretprobe]:{kernel_function} +**uprobe:** + [up|uprobe]:{function} +**uretprobe:** + [upr|uretprobe]:{function} + +Here are a couple of trigger examples; more use-case-specific examples can be +found in the *Examples* section. + +.. code-block:: console + + --start|stop-trigger u:udpif_revalidator:start_dump + --start|stop-trigger t:openvswitch:ovs_dp_upcall + --start|stop-trigger k:ovs_dp_process_packet + --start|stop-trigger kr:ovs_dp_process_packet + --start|stop-trigger up:bridge_run + --start|stop-trigger upr:bridge_run + + +Examples +-------- +This section will give some examples of how to use this tool in real-world +scenarios. Let's start with the issue where Open vSwitch reports +``Unreasonably long XXXXms poll interval`` on your revalidator threads. Note +that there is a blog available explaining how the revalidator process works +in OVS [#REVAL_BLOG]_. + +First, let me explain this log message. It gets logged if the time delta +between two ``poll_block()`` calls is more than 1 second. In other words, +the process was spending a lot of time processing stuff that was made +available by the return of the ``poll_block()`` function. + +Do a run with the tool using the existing USDT revalidator probes as a start +and stop trigger (note that the resource-specific data is removed from the +non-revalidator threads): + +.. 
code-block:: console + + $ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done + # Start sampling (trigger@791777093512008) @2023-06-14T14:52:00.110303 (12:52:00 UTC) + # Stop sampling (trigger@791778281498462) @2023-06-14T14:52:01.297975 (12:52:01 UTC) + # Triggered sample dump, stop-start delta 1,187,986,454 ns @2023-06-14T14:52:01.298021 (12:52:01 UTC) + TID THREAD <RESOURCE SPECIFIC> + ---------- ---------------- ---------------------------------------------------------------------------- + 1457761 handler24 [SYSCALL STATISTICS] + NAME NUMBER COUNT TOTAL ns MAX ns + sendmsg 46 6110 123,274,761 41,776 + recvmsg 47 136299 99,397,508 49,896 + futex 202 51 7,655,832 7,536,776 + poll 7 4068 1,202,883 2,907 + getrusage 98 2034 586,602 1,398 + sendto 44 9 213,682 27,417 + TOTAL( - poll): 144503 231,128,385 + + [THREAD RUN STATISTICS] + SCHED_CNT TOTAL ns MIN ns MAX ns + + [THREAD READY STATISTICS] + SCHED_CNT TOTAL ns MAX ns + 1 1,438 1,438 + + [SOFT IRQ STATISTICS] + NAME VECT_NR COUNT TOTAL ns MAX ns + sched 7 21 59,145 3,769 + rcu 9 50 42,917 2,234 + TOTAL: 71 102,062 + 1457733 ovs-vswitchd [SYSCALL STATISTICS] + ... + 1457792 revalidator55 [SYSCALL STATISTICS] + NAME NUMBER COUNT TOTAL ns MAX ns + futex 202 73 572,576,329 19,621,600 + recvmsg 47 815 296,697,618 405,338 + sendto 44 3 78,302 26,837 + sendmsg 46 3 38,712 13,250 + write 1 1 5,073 5,073 + TOTAL( - poll): 895 869,396,034 + + [THREAD RUN STATISTICS] + SCHED_CNT TOTAL ns MIN ns MAX ns + 48 394,350,393 1,729 140,455,796 + + [THREAD READY STATISTICS] + SCHED_CNT TOTAL ns MAX ns + 49 23,650 1,559 + + [SOFT IRQ STATISTICS] + NAME VECT_NR COUNT TOTAL ns MAX ns + sched 7 14 26,889 3,041 + rcu 9 28 23,024 1,600 + TOTAL: 42 49,913 + + +From the start of the output above, you can see that the trigger delta was more +than a second (1,187,986,454 ns), which was already known from the output of +the ``ovs-appctl upcall/show`` command.
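As a recap of the log-message condition being investigated here: as explained earlier, OVS logs the warning when the gap between two successive ``poll_block()`` calls exceeds one second. A toy model of that check (illustrative only, not OVS's actual implementation):

```python
# Toy model of the "Unreasonably long poll interval" condition: flag any
# gap between successive poll_block() timestamps larger than one second.
def long_poll_intervals(poll_block_ts_ns, threshold_ns=1_000_000_000):
    """poll_block_ts_ns: sorted timestamps (ns) of poll_block() calls."""
    return [b - a for a, b in zip(poll_block_ts_ns, poll_block_ts_ns[1:])
            if b - a > threshold_ns]


ts = [0, 400_000_000, 1_700_000_000, 1_900_000_000]
print(long_poll_intervals(ts))  # [1300000000]
```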
+ +From *revalidator55*'s ``SYSCALL STATISTICS`` you can see it +spent almost 870ms handling syscalls, and there were no poll() calls being +executed. The ``THREAD RUN STATISTICS`` statistics here are a bit misleading, +as it looks like OVS only spent 394ms on the CPU. But earlier, it was mentioned +that this time does not include the time being on the CPU at the start or stop +of an event. Which is exactly the case here, because USDT probes were used. + +From the above data and maybe some ``top`` output, it can be determined that +the *revalidator55* thread is taking a lot of CPU time, probably because it +has to do a lot of revalidator work by itself. The solution here is to increase +the number of revalidator threads, so more work can be done in parallel. + +Here is another run of the same command in another scenario: + +.. code-block:: console + + $ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done + # Start sampling (trigger@795160501758971) @2023-06-14T15:48:23.518512 (13:48:23 UTC) + # Stop sampling (trigger@795160764940201) @2023-06-14T15:48:23.781381 (13:48:23 UTC) + # Triggered sample dump, stop-start delta 263,181,230 ns @2023-06-14T15:48:23.781414 (13:48:23 UTC) + TID THREAD <RESOURCE SPECIFIC> + ---------- ---------------- ---------------------------------------------------------------------------- + 1457733 ovs-vswitchd [SYSCALL STATISTICS] + ...
+ 1457792 revalidator55 [SYSCALL STATISTICS] + NAME NUMBER COUNT TOTAL ns MAX ns + recvmsg 47 284 193,422,110 46,248,418 + sendto 44 2 46,685 23,665 + sendmsg 46 2 24,916 12,703 + write 1 1 6,534 6,534 + TOTAL( - poll): 289 193,500,245 + + [THREAD RUN STATISTICS] + SCHED_CNT TOTAL ns MIN ns MAX ns + 2 47,333,558 331,516 47,002,042 + + [THREAD READY STATISTICS] + SCHED_CNT TOTAL ns MAX ns + 3 87,000,403 45,999,712 + + [SOFT IRQ STATISTICS] + NAME VECT_NR COUNT TOTAL ns MAX ns + sched 7 2 9,504 5,109 + TOTAL: 2 9,504 + + +Here you can see the revalidator run took about 263ms, which does not look +odd, however, the ``THREAD READY STATISTICS`` information shows that OVS was +waiting 87ms for a CPU to be run on. This means the revalidator process could +have finished 87ms faster. Looking at the ``MAX ns`` value, a worst-case delay +of almost 46ms can be seen, which hints at an overloaded system. + +One final example that uses a ``uprobe`` to get some statistics on a +``bridge_run()`` execution that takes more than 1ms. + +.. code-block:: console + + $ sudo ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run --trigger-delta 1000000 + # Start sampling (trigger@2245245432101270) @2023-06-14T16:21:10.467919 (14:21:10 UTC) + # Stop sampling (trigger@2245245432414656) @2023-06-14T16:21:10.468296 (14:21:10 UTC) + # Sample dump skipped, delta 313,386 ns @2023-06-14T16:21:10.468419 (14:21:10 UTC) + # Start sampling (trigger@2245245505301745) @2023-06-14T16:21:10.540970 (14:21:10 UTC) + # Stop sampling (trigger@2245245506911119) @2023-06-14T16:21:10.542499 (14:21:10 UTC) + # Triggered sample dump, stop-start delta 1,609,374 ns @2023-06-14T16:21:10.542565 (14:21:10 UTC) + TID THREAD <RESOURCE SPECIFIC> + ---------- ---------------- ---------------------------------------------------------------------------- + 3371035 <unknown:3366258/3371035> [SYSCALL STATISTICS] + ... <REMOVED 7 MORE unknown THREADS> + 3371102 handler66 [SYSCALL STATISTICS] + ... 
+  <REMOVED 7 MORE HANDLER THREADS>
+  3366258    ovs-vswitchd     [SYSCALL STATISTICS]
+                              NAME                 NUMBER       COUNT         TOTAL ns             MAX ns
+                              futex                   202          43          403,469            199,312
+                              clone3                  435          13          174,394             30,731
+                              munmap                   11           8          115,774             21,861
+                              poll                      7           5           92,969             38,307
+                              unlink                   87           2           49,918             35,741
+                              mprotect                 10           8           47,618             13,201
+                              accept                   43          10           31,360              6,976
+                              mmap                      9           8           30,279              5,776
+                              write                     1           6           27,720             11,774
+                              rt_sigprocmask           14          28           12,281                970
+                              read                      0           6            9,478              2,318
+                              recvfrom                 45           3            7,024              4,024
+                              sendto                   44           1            4,684              4,684
+                              getrusage                98           5            4,594              1,342
+                              close                     3           2            2,918              1,627
+                              recvmsg                  47           1            2,722              2,722
+                              TOTAL( - poll):                     144          924,233
+
+                              [THREAD RUN STATISTICS]
+                              SCHED_CNT           TOTAL ns           MIN ns             MAX ns
+                                     13            817,605            5,433            524,376
+
+                              [THREAD READY STATISTICS]
+                              SCHED_CNT           TOTAL ns             MAX ns
+                                     14             28,646             11,566
+
+                              [SOFT IRQ STATISTICS]
+                              NAME                 VECT_NR       COUNT         TOTAL ns             MAX ns
+                              rcu                        9           1            2,838              2,838
+                              TOTAL:                                 1            2,838
+
+  3371110    revalidator74    [SYSCALL STATISTICS]
+  ...
+  <REMOVED 7 MORE NEW revalidator THREADS>
+  3366311    urcu3            [SYSCALL STATISTICS]
+  ...
+
+
+Some threads and their resource-specific data were removed from the output
+above, but based on the ``<unknown:3366258/3371035>`` thread name, you can
+determine that some threads no longer exist. In the ``ovs-vswitchd`` thread,
+you can see some ``clone3`` syscalls, indicating threads were created. In this
+example, it was due to the deletion of a bridge, which resulted in the
+recreation of the revalidator and handler threads.
+
+
+Use with OpenShift
+------------------
+This section describes how you would use the tool on a node in an OpenShift
+cluster. It assumes you have console access to the node, either directly or
+through a debug container.
+
+A base Fedora 38 container is used through podman, as this allows installing
+the additional tools and packages needed.
+
+First the containers need to be started:
+
+.. code-block:: console
+
+  [core@sno-master ~]$ sudo podman run -it --rm \
+      -e PS1='[(DEBUG)\u@\h \W]\$ ' \
+      --privileged --network=host --pid=host \
+      -v /lib/modules:/lib/modules:ro \
+      -v /sys/kernel/debug:/sys/kernel/debug \
+      -v /proc:/proc \
+      -v /:/mnt/rootdir \
+      quay.io/fedora/fedora:38-x86_64
+
+  [(DEBUG)root@sno-master /]#
+
+
+Next, add the ``kernel_delay.py`` dependencies:
+
+.. code-block:: console
+
+  [(DEBUG)root@sno-master /]# dnf install -y bcc-tools perl-interpreter \
+      python3-pytz python3-psutil
+
+
+You need to install the devel, debug and source RPMs for your OVS and kernel
+version:
+
+.. code-block:: console
+
+  [(DEBUG)root@sno-master home]# rpm -i \
+      openvswitch2.17-debuginfo-2.17.0-67.el8fdp.x86_64.rpm \
+      openvswitch2.17-debugsource-2.17.0-67.el8fdp.x86_64.rpm \
+      kernel-devel-4.18.0-372.41.1.el8_6.x86_64.rpm
+
+
+Now the tool can be started. Here the above ``bridge_run()`` example is used:
+
+.. code-block:: console
+
+  [(DEBUG)root@sno-master home]# ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run
+  # Start sampling (trigger@75279117343513) @2023-06-15T11:44:07.628372 (11:44:07 UTC)
+  # Stop sampling (trigger@75279117443980) @2023-06-15T11:44:07.628529 (11:44:07 UTC)
+  # Triggered sample dump, stop-start delta 100,467 ns @2023-06-15T11:44:07.628569 (11:44:07 UTC)
+  TID        THREAD           <RESOURCE SPECIFIC>
+  ---------- ---------------- ----------------------------------------------------------------------------
+  1246       ovs-vswitchd     [SYSCALL STATISTICS]
+                              NAME                 NUMBER       COUNT         TOTAL ns             MAX ns
+                              getdents64              217           2            8,560              8,162
+                              openat                  257           1            6,951              6,951
+                              accept                   43           4            6,942              3,763
+                              recvfrom                 45           1            3,726              3,726
+                              recvmsg                  47           2            2,880              2,188
+                              stat                      4           2            1,946              1,384
+                              close                     3           1            1,393              1,393
+                              fstat                     5           1            1,324              1,324
+                              TOTAL( - poll):                      14           33,722
+
+                              [THREAD RUN STATISTICS]
+                              SCHED_CNT           TOTAL ns           MIN ns             MAX ns
+
+                              [THREAD READY STATISTICS]
+                              SCHED_CNT           TOTAL ns             MAX ns
+
+
+.. rubric:: Footnotes
+
+.. [#BCC] https://github.com/iovisor/bcc
+.. [#BCC_EVENT] https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments
+.. [#REVAL_BLOG] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained
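As an aside for readers of this thread: the ``--trigger-delta`` behavior shown
in the ``bridge_run()`` example above (the first sample is skipped at 313,386
ns, the second dumped at 1,609,374 ns, with a 1,000,000 ns threshold) can be
sketched in plain Python. ``should_dump`` is a hypothetical helper written for
illustration, not a function from the actual ``kernel_delay.py`` script, and
whether the real comparison is strict or inclusive is an assumption here:

```python
def should_dump(start_ns: int, stop_ns: int, trigger_delta_ns: int) -> bool:
    """Mimic the dump decision seen in the output above: a triggered
    sample dump is only printed when the stop-start delta reaches the
    --trigger-delta threshold; otherwise the sample is skipped."""
    return (stop_ns - start_ns) >= trigger_delta_ns


# Trigger timestamps taken from the bridge_run() example output above,
# with --trigger-delta 1000000 (1 ms).
print(should_dump(2245245432101270, 2245245432414656, 1000000))  # False (delta 313,386 ns)
print(should_dump(2245245505301745, 2245245506911119, 1000000))  # True (delta 1,609,374 ns)
```

This also shows why the first ``bridge_run()`` invocation in the output only
produced a "Sample dump skipped" line: the run completed well under the
requested 1 ms threshold.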