mbox series

[net-next,00/11] Add drop monitor for offloaded data paths

Message ID 20190707075828.3315-1-idosch@idosch.org
Headers show
Series Add drop monitor for offloaded data paths | expand

Message

Ido Schimmel July 7, 2019, 7:58 a.m. UTC
From: Ido Schimmel <idosch@mellanox.com>

Users have several ways to debug the kernel and understand why a packet
was dropped. For example, using "drop monitor" and "perf". Both
utilities trace kfree_skb(), which is the function called when a packet
is freed as part of a failure. The information provided by these tools
is invaluable when trying to understand the cause of a packet loss.

In recent years, large portions of the kernel data path were offloaded
to capable devices. Today, it is possible to perform L2 and L3
forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
Different TC classifiers and actions are also offloaded to capable
devices, at both ingress and egress.

However, when the data path is offloaded it is not possible to achieve
the same level of introspection as tools such "perf" and "drop monitor"
become irrelevant.

This patchset aims to solve this by allowing users to monitor packets
that the underlying device decided to drop along with relevant metadata
such as the drop reason and ingress port.

The above is achieved by exposing a fundamental capability of devices
capable of data path offloading - packet trapping. While the common use
case for packet trapping is the trapping of packets required for the
correct functioning of the control plane (e.g., STP, BGP packets),
packets can also be trapped due to other reasons such as exceptions
(e.g., TTL error) and drops (e.g., blackhole route).

Given this ability is not specific to a port, but rather to a device, it
is exposed using devlink. Each capable driver is expected to register
its supported packet traps with devlink and report trapped packets to
devlink as they income. devlink will perform accounting of received
packets and bytes and will potentially generate an event to user space
using a new generic netlink multicast group.

While this patchset is concerned with traps corresponding to dropped
packets, the interface itself is generic and can be used to expose traps
corresponding to control packets in the future. The API is vendor
neutral and similar to the API exposed by SAI which is implemented by
several vendors already.

The implementation in this patchset is on top of both mlxsw and
netdevsim so that people could experiment with the interface and provide
useful feedback.

Patches #1-#4 add the devlink-trap infrastructure.

Patches #5-#6 add an example implementation of netdevsim.

Patches #7-#11 add a real world implementation over mlxsw.

Tests for both the core infrastructure (over netdevsim) and mlxsw will
be sent separately as RFC as they are dependent on the acceptance of the
iproute2 changes.

Example
=======

Instantiate netdevsim
---------------------

# echo "10 1" > /sys/bus/netdevsim/new_device
# ip link set dev eth0 up

List supported traps
--------------------

# devlink trap show
netdevsim/netdevsim10:
  name source_mac_is_multicast type drop generic true report false action drop group l2_drops
  name vlan_tag_mismatch type drop generic true report false action drop group l2_drops
  name ingress_vlan_filter type drop generic true report false action drop group l2_drops
  name ingress_spanning_tree_filter type drop generic true report false action drop group l2_drops
  name port_list_is_empty type drop generic true report false action drop group l2_drops
  name port_loopback_filter type drop generic true report false action drop group l2_drops
  name fid_miss type exception generic false report false action trap group l2_drops
  name blackhole_route type drop generic true report false action drop group l3_drops
  name ttl_value_is_too_small type exception generic true report false action trap group l3_drops
  name tail_drop type drop generic true report false action drop group buffer_drops

Enable a trap
-------------

# devlink trap set netdevsim/netdevsim10 trap blackhole_route action trap report true

Query statistics
----------------

# devlink -s trap show netdevsim/netdevsim10 trap blackhole_route
netdevsim/netdevsim10:
  name blackhole_route type drop generic true report true action trap group l3_drops
    stats:
        rx:
          bytes 18744 packets 132

Monitor dropped packets
-----------------------

# devlink -v mon trap-report
[trap-report,report] netdevsim/netdevsim10: name blackhole_route type drop group l3_drops length 142 timestamp Sun Jun 30 20:26:12 2019 835605178 nsec
  input_port:
    netdevsim/netdevsim10/0: type eth netdev eth0

Future plans
============

* Provide more drop reasons as well as more metadata

v1:
* Rename trap names to make them more generic
* Change policer settings in mlxsw

Ido Schimmel (11):
  devlink: Create helper to fill port type information
  devlink: Add packet trap infrastructure
  devlink: Add generic packet traps and groups
  Documentation: Add devlink-trap documentation
  netdevsim: Add devlink-trap support
  Documentation: Add description of netdevsim traps
  mlxsw: core: Add API to set trap action
  mlxsw: reg: Add new trap action
  mlxsw: Add layer 2 discard trap IDs
  mlxsw: Add trap group for layer 2 discards
  mlxsw: spectrum: Add devlink-trap support

 .../networking/devlink-trap-netdevsim.rst     |   20 +
 Documentation/networking/devlink-trap.rst     |  190 +++
 Documentation/networking/index.rst            |    2 +
 drivers/net/ethernet/mellanox/mlxsw/Makefile  |    2 +-
 drivers/net/ethernet/mellanox/mlxsw/core.c    |   64 +
 drivers/net/ethernet/mellanox/mlxsw/core.h    |   12 +
 drivers/net/ethernet/mellanox/mlxsw/reg.h     |   10 +
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |   17 +
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |   13 +
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   |  270 ++++
 drivers/net/ethernet/mellanox/mlxsw/trap.h    |    7 +
 drivers/net/netdevsim/dev.c                   |  273 +++-
 drivers/net/netdevsim/netdevsim.h             |    1 +
 include/net/devlink.h                         |  175 +++
 include/uapi/linux/devlink.h                  |   68 +
 net/core/devlink.c                            | 1312 ++++++++++++++++-
 16 files changed, 2409 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/networking/devlink-trap-netdevsim.rst
 create mode 100644 Documentation/networking/devlink-trap.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c

Comments

Ido Schimmel July 7, 2019, 8:15 a.m. UTC | #1
More info that I did not want to put in the cover letter:

Wrote a Wireshark dissector [1] for devlink packets. It dissects various
devlink packets and is especially useful for trap report packets. Both
the metadata about the trapped packet (e.g., trap reason) and also the
trapped Ethernet packet itself are dissected and displayed. Example [6].
Will submit it after the kernel infrastructure is accepted.

Jiri wrote devlink_exporter [2] for Prometheus [3]. This allows one to
export trap statistics as metrics to Prometheus. Eventually, this should
be integrated into node_exporter [4]. We also experimented with
Cloudflare's ebpf_exporter [5], which allows for more fine-grained
statistics (e.g., number of drops per {5-tuple, drop reason}).

I imagine that Wireshark will be mainly used by kernel engineers that
need to pinpoint a specific problem, whereas Prometheus (and similar
time series databases) will be used by network engineers that need to
monitor the network 24x7.

[1] https://github.com/idosch/wireshark/blob/devlink/epan/dissectors/packet-netlink-devlink.c
[2] https://github.com/jpirko/prometheus-devlink-exporter/blob/master/devlink-exporter.py
[3] https://prometheus.io/
[4] https://github.com/prometheus/node_exporter
[5] https://blog.cloudflare.com/introducing-ebpf_exporter/
[6]
Linux netlink (cooked header)
    Link-layer address type: Netlink (824)
    Family: Generic (0x0010)
Linux Generic Netlink protocol
    Netlink message header (type: 0x0013)
        Length: 224
        Family ID: 0x13 (devlink)
        Flags: 0x0000
            .... .... .... ...0 = Request: 0
            .... .... .... ..0. = Multipart message: 0
            .... .... .... .0.. = Ack: 0
            .... .... .... 0... = Echo: 0
            .... .... ...0 .... = Dump inconsistent: 0
            .... .... ..0. .... = Dump filtered: 0
        Sequence: 0
        Port ID: 0
    Command: Trap report (65)
    Family Version: 1
    Reserved
Linux devlink (device netlink) protocol
    Attribute: Bus name: pci
        Len: 8
        Type: 0x0001, Bus name (1)
            0... .... .... .... = Nested: 0
            .0.. .... .... .... = Network byte order: 0
            Attribute type: Bus name (1)
        Bus name: pci
    Attribute: Device name: 0000:01:00.0
        Len: 17
        Type: 0x0002, Device name (2)
            0... .... .... .... = Nested: 0
            .0.. .... .... .... = Network byte order: 0
            Attribute type: Device name (2)
        Device name: 0000:01:00.0
    Attribute: Trap group name: l2_drops
        Len: 13
        Type: 0x0089, Trap group name (137)
            0... .... .... .... = Nested: 0
            .0.. .... .... .... = Network byte order: 0
            Attribute type: Trap group name (137)
        Trap group name: l2_drops
    Attribute: Trap name: source_mac_is_multicast
        Len: 28
        Type: 0x0080, Trap name (128)
            0... .... .... .... = Nested: 0
            .0.. .... .... .... = Network byte order: 0
            Attribute type: Trap name (128)
        Trap name: source_mac_is_multicast
    Attribute: Trap type
        Len: 5
        Type: 0x0083, Trap type (131)
            0... .... .... .... = Nested: 0
            .0.. .... .... .... = Network byte order: 0
            Attribute type: Trap type (131)
        Trap type: Drop (0)
    Attribute: Trap timestamp
        Len: 20
        Type: 0x0086, Trap timestamp (134)
            0... .... .... .... = Nested: 0
            .0.. .... .... .... = Network byte order: 0
            Attribute type: Trap timestamp (134)
        Trap timestamp: Jul  6, 2019 18:16:11.396492223 IDT
    Attribute: Trap input port
        Len: 40
        Type: 0x8087, Nested, Trap input port (135)
            1... .... .... .... = Nested: 1
            .0.. .... .... .... = Network byte order: 0
            Attribute type: Unknown (32903)
        Attribute: Port index: 17
            Len: 8
            Type: 0x0003, Port index (3)
                0... .... .... .... = Nested: 0
                .0.. .... .... .... = Network byte order: 0
                Attribute type: Port index (3)
            Port index: 17
        Attribute: Port type
            Len: 6
            Type: 0x0004, Port type (4)
                0... .... .... .... = Nested: 0
                .0.. .... .... .... = Network byte order: 0
                Attribute type: Port type (4)
            Port type: Ethernet (2)
        Attribute: Net device index: 133
            Len: 8
            Type: 0x0006, Net device index (6)
                0... .... .... .... = Nested: 0
                .0.. .... .... .... = Network byte order: 0
                Attribute type: Net device index (6)
            Port net device index: 133
        Attribute: Net device name: swp3
            Len: 9
            Type: 0x0007, Net device name (7)
                0... .... .... .... = Nested: 0
                .0.. .... .... .... = Network byte order: 0
                Attribute type: Net device name (7)
            Port net device name: swp3
    Attribute: Trap payload
        Len: 64
        Type: 0x0088, Trap payload (136)
            0... .... .... .... = Nested: 0
            .0.. .... .... .... = Network byte order: 0
            Attribute type: Trap payload (136)
        Ethernet II, Src: Woonsang_04:05:06 (01:02:03:04:05:06), Dst: Mellanox_ff:27:d1 (7c:fe:90:ff:27:d1)
            Destination: Mellanox_ff:27:d1 (7c:fe:90:ff:27:d1)
                Address: Mellanox_ff:27:d1 (7c:fe:90:ff:27:d1)
                .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
                .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
            Source: Woonsang_04:05:06 (01:02:03:04:05:06)
                [Expert Info (Warning/Protocol): Source MAC must not be a group address: IEEE 802.3-2002, Section 3.2.3(b)]
                    [Source MAC must not be a group address: IEEE 802.3-2002, Section 3.2.3(b)]
                    [Severity level: Warning]
                    [Group: Protocol]
                Address: Woonsang_04:05:06 (01:02:03:04:05:06)
                .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
                .... ...1 .... .... .... .... = IG bit: Group address (multicast/broadcast)
            Type: IPv4 (0x0800)
            Trailer: 000000000000000000000000000000000000000000000000…
        Internet Protocol Version 4, Src: 192.0.2.1, Dst: 192.0.2.2
            0100 .... = Version: 4
            .... 0101 = Header Length: 20 bytes (5)
            Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
                0000 00.. = Differentiated Services Codepoint: Default (0)
                .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
            Total Length: 20
            Identification: 0x0000 (0)
            Flags: 0x0000
                0... .... .... .... = Reserved bit: Not set
                .0.. .... .... .... = Don't fragment: Not set
                ..0. .... .... .... = More fragments: Not set
            ...0 0000 0000 0000 = Fragment offset: 0
            Time to live: 255
            Protocol: IPv6 Hop-by-Hop Option (0)
            Header checksum: 0x37e6 [validation disabled]
            [Header checksum status: Unverified]
            Source: 192.0.2.1
            Destination: 192.0.2.2
David Miller July 7, 2019, 7:45 p.m. UTC | #2
From: Ido Schimmel <idosch@idosch.org>
Date: Sun,  7 Jul 2019 10:58:17 +0300

> Users have several ways to debug the kernel and understand why a packet
> was dropped. For example, using "drop monitor" and "perf". Both
> utilities trace kfree_skb(), which is the function called when a packet
> is freed as part of a failure. The information provided by these tools
> is invaluable when trying to understand the cause of a packet loss.
> 
> In recent years, large portions of the kernel data path were offloaded
> to capable devices. Today, it is possible to perform L2 and L3
> forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> Different TC classifiers and actions are also offloaded to capable
> devices, at both ingress and egress.
> 
> However, when the data path is offloaded it is not possible to achieve
> the same level of introspection as tools such "perf" and "drop monitor"
> become irrelevant.
> 
> This patchset aims to solve this by allowing users to monitor packets
> that the underlying device decided to drop along with relevant metadata
> such as the drop reason and ingress port.

We are now going to have 5 or so ways to capture packets passing through
the system, this is nonsense.

AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
devlink thing.

This is insanity, too many ways to do the same thing and therefore the
worst possible user experience.

Pick _ONE_ method to trap packets and forward normal kfree_skb events,
XDP perf events, and these taps there too.

I mean really, think about it from the average user's perspective.  To
see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
listen on devlink but configure a special tap thing beforehand and then
if someone is using XDP I gotta setup another perf event buffer capture
thing too.

Sorry, this isn't where we are going.
Ido Schimmel July 8, 2019, 1:19 p.m. UTC | #3
On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> From: Ido Schimmel <idosch@idosch.org>
> Date: Sun,  7 Jul 2019 10:58:17 +0300
> 
> > Users have several ways to debug the kernel and understand why a packet
> > was dropped. For example, using "drop monitor" and "perf". Both
> > utilities trace kfree_skb(), which is the function called when a packet
> > is freed as part of a failure. The information provided by these tools
> > is invaluable when trying to understand the cause of a packet loss.
> > 
> > In recent years, large portions of the kernel data path were offloaded
> > to capable devices. Today, it is possible to perform L2 and L3
> > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > Different TC classifiers and actions are also offloaded to capable
> > devices, at both ingress and egress.
> > 
> > However, when the data path is offloaded it is not possible to achieve
> > the same level of introspection as tools such "perf" and "drop monitor"
> > become irrelevant.
> > 
> > This patchset aims to solve this by allowing users to monitor packets
> > that the underlying device decided to drop along with relevant metadata
> > such as the drop reason and ingress port.
> 
> We are now going to have 5 or so ways to capture packets passing through
> the system, this is nonsense.
> 
> AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> devlink thing.
> 
> This is insanity, too many ways to do the same thing and therefore the
> worst possible user experience.
> 
> Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> XDP perf events, and these taps there too.
> 
> I mean really, think about it from the average user's perspective.  To
> see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> listen on devlink but configure a special tap thing beforehand and then
> if someone is using XDP I gotta setup another perf event buffer capture
> thing too.

Let me try to explain again because I probably wasn't clear enough. The
devlink-trap mechanism is not doing the same thing as other solutions.

The packets we are capturing in this patchset are packets that the
kernel (the CPU) never saw up until now - they were silently dropped by
the underlying device performing the packet forwarding instead of the
CPU.

For each such packet we get valuable metadata from the underlying device
such as the drop reason and the ingress port. With time, even more
reasons and metadata could be provided (e.g., egress port, traffic
class). Netlink provides a structured and extensible way to report the
packet along with the metadata to interested users. The tc-sample action
uses a similar concept.

I would like to emphasize that these dropped packets are not injected to
the kernel's receive path and therefore not subject to kfree_skb() and
related infrastructure. There is no need to waste CPU cycles on packets
we already know were dropped (and why). Further, hardware tail/early
drops will not be dropped by the kernel, given its qdiscs are probably
empty.

Regarding the use of devlink, current ASICs can forward packets at
6.4Tb/s. We do not want to overwhelm the CPU with dropped packets and
therefore we give users the ability to control - via devlink - the
trapping of certain packets to the CPU and their reporting to user
space. In the future, devlink-trap can be extended to support the
configuration of the hardware policers of each trap.
Jakub Kicinski July 8, 2019, 10:51 p.m. UTC | #4
On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:
> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> > From: Ido Schimmel <idosch@idosch.org>
> > Date: Sun,  7 Jul 2019 10:58:17 +0300
> >   
> > > Users have several ways to debug the kernel and understand why a packet
> > > was dropped. For example, using "drop monitor" and "perf". Both
> > > utilities trace kfree_skb(), which is the function called when a packet
> > > is freed as part of a failure. The information provided by these tools
> > > is invaluable when trying to understand the cause of a packet loss.
> > > 
> > > In recent years, large portions of the kernel data path were offloaded
> > > to capable devices. Today, it is possible to perform L2 and L3
> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > > Different TC classifiers and actions are also offloaded to capable
> > > devices, at both ingress and egress.
> > > 
> > > However, when the data path is offloaded it is not possible to achieve
> > > the same level of introspection as tools such "perf" and "drop monitor"
> > > become irrelevant.
> > > 
> > > This patchset aims to solve this by allowing users to monitor packets
> > > that the underlying device decided to drop along with relevant metadata
> > > such as the drop reason and ingress port.  
> > 
> > We are now going to have 5 or so ways to capture packets passing through
> > the system, this is nonsense.
> > 
> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> > devlink thing.
> > 
> > This is insanity, too many ways to do the same thing and therefore the
> > worst possible user experience.
> > 
> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> > XDP perf events, and these taps there too.
> > 
> > I mean really, think about it from the average user's perspective.  To
> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> > listen on devlink but configure a special tap thing beforehand and then
> > if someone is using XDP I gotta setup another perf event buffer capture
> > thing too.  
> 
> Let me try to explain again because I probably wasn't clear enough. The
> devlink-trap mechanism is not doing the same thing as other solutions.
> 
> The packets we are capturing in this patchset are packets that the
> kernel (the CPU) never saw up until now - they were silently dropped by
> the underlying device performing the packet forwarding instead of the
> CPU.

When you say silently dropped do you mean that mlxsw as of today
doesn't have any counters exposed for those events?

If we wanted to consolidate this into something existing we can either
 (a) add similar traps in the kernel data path;
 (b) make these traps extension of statistics.

My knee jerk reaction to seeing the patches was that it adds a new
place where device statistics are reported. Users who want to know why
things are dropped will not get detailed breakdown from ethtool -S which
for better or worse is the one stop shop for device stats today. 

Having thought about it some more, however, I think that having a
forwarding "exception" object and hanging statistics off of it is a
better design, even if we need to deal with some duplication to get
there.

IOW having an way to "trap all packets which would increment a
statistic" (option (b) above) is probably a bad design.

As for (a) I wonder how many of those events have a corresponding event
in the kernel stack? If we could add corresponding trace points and
just feed those from the device driver, that'd obviously be a holy
grail. Not to mention that requiring trace points to be added to the
core would make Alexei happy:

http://vger.kernel.org/netconf2019_files/netconf2019_slides_ast.pdf#page=3

;)

That's my $.02, not very insightful.

> For each such packet we get valuable metadata from the underlying device
> such as the drop reason and the ingress port. With time, even more
> reasons and metadata could be provided (e.g., egress port, traffic
> class). Netlink provides a structured and extensible way to report the
> packet along with the metadata to interested users. The tc-sample action
> uses a similar concept.
> 
> I would like to emphasize that these dropped packets are not injected to
> the kernel's receive path and therefore not subject to kfree_skb() and
> related infrastructure. There is no need to waste CPU cycles on packets
> we already know were dropped (and why). Further, hardware tail/early
> drops will not be dropped by the kernel, given its qdiscs are probably
> empty.
> 
> Regarding the use of devlink, current ASICs can forward packets at
> 6.4Tb/s. We do not want to overwhelm the CPU with dropped packets and
> therefore we give users the ability to control - via devlink - the
> trapping of certain packets to the CPU and their reporting to user
> space. In the future, devlink-trap can be extended to support the
> configuration of the hardware policers of each trap.
Ido Schimmel July 9, 2019, 12:38 p.m. UTC | #5
On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
> On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:
> > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> > > From: Ido Schimmel <idosch@idosch.org>
> > > Date: Sun,  7 Jul 2019 10:58:17 +0300
> > >   
> > > > Users have several ways to debug the kernel and understand why a packet
> > > > was dropped. For example, using "drop monitor" and "perf". Both
> > > > utilities trace kfree_skb(), which is the function called when a packet
> > > > is freed as part of a failure. The information provided by these tools
> > > > is invaluable when trying to understand the cause of a packet loss.
> > > > 
> > > > In recent years, large portions of the kernel data path were offloaded
> > > > to capable devices. Today, it is possible to perform L2 and L3
> > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > > > Different TC classifiers and actions are also offloaded to capable
> > > > devices, at both ingress and egress.
> > > > 
> > > > However, when the data path is offloaded it is not possible to achieve
> > > > the same level of introspection as tools such "perf" and "drop monitor"
> > > > become irrelevant.
> > > > 
> > > > This patchset aims to solve this by allowing users to monitor packets
> > > > that the underlying device decided to drop along with relevant metadata
> > > > such as the drop reason and ingress port.  
> > > 
> > > We are now going to have 5 or so ways to capture packets passing through
> > > the system, this is nonsense.
> > > 
> > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> > > devlink thing.
> > > 
> > > This is insanity, too many ways to do the same thing and therefore the
> > > worst possible user experience.
> > > 
> > > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> > > XDP perf events, and these taps there too.
> > > 
> > > I mean really, think about it from the average user's perspective.  To
> > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> > > listen on devlink but configure a special tap thing beforehand and then
> > > if someone is using XDP I gotta setup another perf event buffer capture
> > > thing too.  
> > 
> > Let me try to explain again because I probably wasn't clear enough. The
> > devlink-trap mechanism is not doing the same thing as other solutions.
> > 
> > The packets we are capturing in this patchset are packets that the
> > kernel (the CPU) never saw up until now - they were silently dropped by
> > the underlying device performing the packet forwarding instead of the
> > CPU.

Jakub,

It seems to me that most of the criticism is about consolidation of
interfaces because you believe I'm doing something you can already do
today, but this is not the case.

Switch ASICs have dedicated traps for specific packets. Usually, these
packets are control packets (e.g., ARP, BGP) which are required for the
correct functioning of the control plane. You can see this in the SAI
interface, which is an abstraction layer over vendors' SDKs:

https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157

We need to be able to configure the hardware policers of these traps and
read their statistics to understand how many packets they dropped. We
currently do not have a way to do any of that and we rely on hardcoded
defaults in the driver which do not fit every use case (from
experience):

https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103

We plan to extend devlink-trap mechanism to cover all these use cases. I
hope you agree that this functionality belongs in devlink given it is a
device-specific configuration and not a netdev-specific one.

That being said, in its current form, this mechanism is focused on traps
that correlate to packets the device decided to drop as this is very
useful for debugging.

Given that the entire configuration is done via devlink and that devlink
stores all the information about these traps, it seems logical to also
report these packets and their metadata to user space as devlink events.

If this is not desirable, we can try to call into drop_monitor from
devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will
encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.

IMO, this is less desirable, as instead of having one tool (devlink) to
interact with this mechanism we will need two (devlink & dropwatch).

Below I tried to answer all your questions and refer to all the points
you brought up.

> When you say silently dropped do you mean that mlxsw as of today
> doesn't have any counters exposed for those events?

Some of these packets are counted, but not all of them.

> If we wanted to consolidate this into something existing we can either
>  (a) add similar traps in the kernel data path;
>  (b) make these traps extension of statistics.
> 
> My knee jerk reaction to seeing the patches was that it adds a new
> place where device statistics are reported.

Not at all. This would be a step back. We can already count discards due
to VLAN membership on ingress on a per-port basis. A software maintained
global counter does not buy us anything.

By also getting the dropped packet - coupled with the drop reason and
ingress port - you can understand exactly why and on which VLAN the
packet was dropped. I wrote a Wireshark dissector for these netlink
packets to make our life easier. You can see the details in my comment
to the cover letter:

https://marc.info/?l=linux-netdev&m=156248736710238&w=2

In case you do not care about individual packets, but still want more
fine-grained statistics for your monitoring application, you can use
eBPF. For example, one thing we did is attaching a kprobe to
devlink_trap_report() with an eBPF program that dissects the incoming
skbs and maintains a counter per-{5 tuple, drop reason}. With
ebpf_exporter you can export these statistics to Prometheus on which you
can run queries and visualize the results with Grafana. This is
especially useful for tail and early drops since it allows you to
understand which flows contribute to most of the drops.

> Users who want to know why things are dropped will not get detailed
> breakdown from ethtool -S which for better or worse is the one stop
> shop for device stats today.

I hope I managed to explain why counters are not enough, but I also want
to point out that ethtool statistics are not properly documented and
this hinders their effectiveness. I did my best to document the exposed
traps in order to avoid the same fate:

https://patchwork.ozlabs.org/patch/1128585/

In addition, there are selftests to show how each trap can be triggered
to reduce the ambiguity even further:

https://patchwork.ozlabs.org/patch/1128610/

And a note in the documentation to make sure future functionality is
tested as well:

https://patchwork.ozlabs.org/patch/1128608/

> Having thought about it some more, however, I think that having a
> forwarding "exception" object and hanging statistics off of it is a
> better design, even if we need to deal with some duplication to get
> there.
> 
> IOW having an way to "trap all packets which would increment a
> statistic" (option (b) above) is probably a bad design.
> 
> As for (a) I wonder how many of those events have a corresponding event
> in the kernel stack?

Generic packet drops all have a corresponding kfree_skb() calls in the
kernel, but that does not mean that every packet dropped by the hardware
would also be dropped by the kernel if it were to be injected to its Rx
path. In my reply to Dave I gave buffer drops as an example.

There are also situations in which packets can be dropped due to
device-specific exceptions and these do not have a corresponding drop
reason in the kernel. See example here:

https://patchwork.ozlabs.org/patch/1128587/

> If we could add corresponding trace points and just feed those from
> the device driver, that'd obviously be a holy grail.

Unlike tracepoints, netlink gives you a structured and extensible
interface. For example, in Spectrum-1 we cannot provide the Tx port for
early/tail drops, whereas for Spectrum-2 and later we can. With netlink,
we can just omit the DEVLINK_ATTR_TRAP_OUT_PORT attribute for
Spectrum-1. You also get a programmatic interface that you can query for
this information:

# devlink -v trap show netdevsim/netdevsim10 trap ingress_vlan_filter
netdevsim/netdevsim10:
  name ingress_vlan_filter type drop generic true report false action drop group l2_drops
    metadata:
       input_port

Thanks
Jakub Kicinski July 9, 2019, 10:34 p.m. UTC | #6
On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote:
> On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
> > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:  
> > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:  
> > > > From: Ido Schimmel <idosch@idosch.org>
> > > > Date: Sun,  7 Jul 2019 10:58:17 +0300
> > > >     
> > > > > Users have several ways to debug the kernel and understand why a packet
> > > > > was dropped. For example, using "drop monitor" and "perf". Both
> > > > > utilities trace kfree_skb(), which is the function called when a packet
> > > > > is freed as part of a failure. The information provided by these tools
> > > > > is invaluable when trying to understand the cause of a packet loss.
> > > > > 
> > > > > In recent years, large portions of the kernel data path were offloaded
> > > > > to capable devices. Today, it is possible to perform L2 and L3
> > > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > > > > Different TC classifiers and actions are also offloaded to capable
> > > > > devices, at both ingress and egress.
> > > > > 
> > > > > However, when the data path is offloaded it is not possible to achieve
> > > > > the same level of introspection as tools such "perf" and "drop monitor"
> > > > > become irrelevant.
> > > > > 
> > > > > This patchset aims to solve this by allowing users to monitor packets
> > > > > that the underlying device decided to drop along with relevant metadata
> > > > > such as the drop reason and ingress port.    
> > > > 
> > > > We are now going to have 5 or so ways to capture packets passing through
> > > > the system, this is nonsense.
> > > > 
> > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> > > > devlink thing.
> > > > 
> > > > This is insanity, too many ways to do the same thing and therefore the
> > > > worst possible user experience.
> > > > 
> > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> > > > XDP perf events, and these taps there too.
> > > > 
> > > > I mean really, think about it from the average user's perspective.  To
> > > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> > > > listen on devlink but configure a special tap thing beforehand and then
> > > > if someone is using XDP I gotta setup another perf event buffer capture
> > > > thing too.    
> > > 
> > > Let me try to explain again because I probably wasn't clear enough. The
> > > devlink-trap mechanism is not doing the same thing as other solutions.
> > > 
> > > The packets we are capturing in this patchset are packets that the
> > > kernel (the CPU) never saw up until now - they were silently dropped by
> > > the underlying device performing the packet forwarding instead of the
> > > CPU.  
> 
> Jakub,
> 
> It seems to me that most of the criticism is about consolidation of
> interfaces because you believe I'm doing something you can already do
> today, but this is not the case.

To be clear I'm not opposed to the patches, I'm just trying to
facilitate a discussion.

> Switch ASICs have dedicated traps for specific packets. Usually, these
> packets are control packets (e.g., ARP, BGP) which are required for the
> correct functioning of the control plane. You can see this in the SAI
> interface, which is an abstraction layer over vendors' SDKs:
> 
> https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157
> 
> We need to be able to configure the hardware policers of these traps and
> read their statistics to understand how many packets they dropped. We
> currently do not have a way to do any of that and we rely on hardcoded
> defaults in the driver which do not fit every use case (from
> experience):
> 
> https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103
> 
> We plan to extend devlink-trap mechanism to cover all these use cases. I
> hope you agree that this functionality belongs in devlink given it is a
> device-specific configuration and not a netdev-specific one.

No disagreement on providing knobs for traps.

> That being said, in its current form, this mechanism is focused on traps
> that correlate to packets the device decided to drop as this is very
> useful for debugging.

That'd be mixing two things - trap configuration and tracing exceptions
in one API. That's a little suboptimal but not too terrible, especially
if there is a higher level APIs users can default to.

> Given that the entire configuration is done via devlink and that devlink
> stores all the information about these traps, it seems logical to also
> report these packets and their metadata to user space as devlink events.
> 
> If this is not desirable, we can try to call into drop_monitor from
> devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will
> encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.
> 
> IMO, this is less desirable, as instead of having one tool (devlink) to
> interact with this mechanism we will need two (devlink & dropwatch).
> 
> Below I tried to answer all your questions and refer to all the points
> you brought up.
> 
> > When you say silently dropped do you mean that mlxsw as of today
> > doesn't have any counters exposed for those events?  
> 
> Some of these packets are counted, but not all of them.
> 
> > If we wanted to consolidate this into something existing we can either
> >  (a) add similar traps in the kernel data path;
> >  (b) make these traps extension of statistics.
> > 
> > My knee jerk reaction to seeing the patches was that it adds a new
> > place where device statistics are reported.  
> 
> Not at all. This would be a step back. We can already count discards due
> to VLAN membership on ingress on a per-port basis. A software maintained
> global counter does not buy us anything.
> 
> By also getting the dropped packet - coupled with the drop reason and
> ingress port - you can understand exactly why and on which VLAN the
> packet was dropped. I wrote a Wireshark dissector for these netlink
> packets to make our life easier. You can see the details in my comment
> to the cover letter:
> 
> https://marc.info/?l=linux-netdev&m=156248736710238&w=2
> 
> In case you do not care about individual packets, but still want more
> fine-grained statistics for your monitoring application, you can use
> eBPF. For example, one thing we did is attaching a kprobe to
> devlink_trap_report() with an eBPF program that dissects the incoming
> skbs and maintains a counter per-{5 tuple, drop reason}. With
> ebpf_exporter you can export these statistics to Prometheus on which you
> can run queries and visualize the results with Grafana. This is
> especially useful for tail and early drops since it allows you to
> understand which flows contribute to most of the drops.

No question that the mechanism is useful.

> > Users who want to know why things are dropped will not get detailed
> > breakdown from ethtool -S which for better or worse is the one stop
> > shop for device stats today.  
> 
> I hope I managed to explain why counters are not enough, but I also want
> to point out that ethtool statistics are not properly documented and
> this hinders their effectiveness. I did my best to document the exposed
> traps in order to avoid the same fate:
> 
> https://patchwork.ozlabs.org/patch/1128585/
> 
> In addition, there are selftests to show how each trap can be triggered
> to reduce the ambiguity even further:
> 
> https://patchwork.ozlabs.org/patch/1128610/
> 
> And a note in the documentation to make sure future functionality is
> tested as well:
> 
> https://patchwork.ozlabs.org/patch/1128608/
> 
> > Having thought about it some more, however, I think that having a
> > forwarding "exception" object and hanging statistics off of it is a
> > better design, even if we need to deal with some duplication to get
> > there.
> > 
> > IOW having an way to "trap all packets which would increment a
> > statistic" (option (b) above) is probably a bad design.
> > 
> > As for (a) I wonder how many of those events have a corresponding event
> > in the kernel stack?  
> 
> Generic packet drops all have a corresponding kfree_skb() calls in the
> kernel, but that does not mean that every packet dropped by the hardware
> would also be dropped by the kernel if it were to be injected to its Rx
> path.

The notion that all SW events get captured by kfree_skb() would not be
correct. We have the kfree_skb(), and xdp_exception(), and drivers can
drop packets if various allocations fail.. the situation is already not
great.

I think that having a single useful place where users can look to see
all traffic exception events would go a long way. Software side as I
mentioned is pretty brutal, IDK how many users are actually willing to
decode stack traces to figure out why their system is dropping
packets :/  

> In my reply to Dave I gave buffer drops as an example.

The example of buffer drops is also probably the case where having the
packet is least useful, but yes, I definitely agree devices need a way
of reporting events that can't happen in SW.

> There are also situations in which packets can be dropped due to
> device-specific exceptions and these do not have a corresponding drop
> reason in the kernel. See example here:
> 
> https://patchwork.ozlabs.org/patch/1128587/
> 
> > If we could add corresponding trace points and just feed those from
> > the device driver, that'd obviously be a holy grail.  
> 
> Unlike tracepoints, netlink gives you a structured and extensible
> interface. For example, in Spectrum-1 we cannot provide the Tx port for
> early/tail drops, whereas for Spectrum-2 and later we can. With netlink,
> we can just omit the DEVLINK_ATTR_TRAP_OUT_PORT attribute for
> Spectrum-1. You also get a programmatic interface that you can query for
> this information:
> 
> # devlink -v trap show netdevsim/netdevsim10 trap ingress_vlan_filter
> netdevsim/netdevsim10:
>   name ingress_vlan_filter type drop generic true report false action drop group l2_drops
>     metadata:
>        input_port

Right, you can set or not set skb fields to some extent but its
definitely not as flexible as netlink.
Ido Schimmel July 10, 2019, 11:20 a.m. UTC | #7
On Tue, Jul 09, 2019 at 03:34:30PM -0700, Jakub Kicinski wrote:
> On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote:
> > On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
> > > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:  
> > > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:  
> > > > > From: Ido Schimmel <idosch@idosch.org>
> > > > > Date: Sun,  7 Jul 2019 10:58:17 +0300
> > > > >     
> > > > > > Users have several ways to debug the kernel and understand why a packet
> > > > > > was dropped. For example, using "drop monitor" and "perf". Both
> > > > > > utilities trace kfree_skb(), which is the function called when a packet
> > > > > > is freed as part of a failure. The information provided by these tools
> > > > > > is invaluable when trying to understand the cause of a packet loss.
> > > > > > 
> > > > > > In recent years, large portions of the kernel data path were offloaded
> > > > > > to capable devices. Today, it is possible to perform L2 and L3
> > > > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > > > > > Different TC classifiers and actions are also offloaded to capable
> > > > > > devices, at both ingress and egress.
> > > > > > 
> > > > > > However, when the data path is offloaded it is not possible to achieve
> > > > > > the same level of introspection as tools such "perf" and "drop monitor"
> > > > > > become irrelevant.
> > > > > > 
> > > > > > This patchset aims to solve this by allowing users to monitor packets
> > > > > > that the underlying device decided to drop along with relevant metadata
> > > > > > such as the drop reason and ingress port.    
> > > > > 
> > > > > We are now going to have 5 or so ways to capture packets passing through
> > > > > the system, this is nonsense.
> > > > > 
> > > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> > > > > devlink thing.
> > > > > 
> > > > > This is insanity, too many ways to do the same thing and therefore the
> > > > > worst possible user experience.
> > > > > 
> > > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> > > > > XDP perf events, and these taps there too.
> > > > > 
> > > > > I mean really, think about it from the average user's perspective.  To
> > > > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> > > > > listen on devlink but configure a special tap thing beforehand and then
> > > > > if someone is using XDP I gotta setup another perf event buffer capture
> > > > > thing too.    
> > > > 
> > > > Let me try to explain again because I probably wasn't clear enough. The
> > > > devlink-trap mechanism is not doing the same thing as other solutions.
> > > > 
> > > > The packets we are capturing in this patchset are packets that the
> > > > kernel (the CPU) never saw up until now - they were silently dropped by
> > > > the underlying device performing the packet forwarding instead of the
> > > > CPU.  
> > 
> > Jakub,
> > 
> > It seems to me that most of the criticism is about consolidation of
> > interfaces because you believe I'm doing something you can already do
> > today, but this is not the case.
> 
> To be clear I'm not opposed to the patches, I'm just trying to
> facilitate a discussion.

Sure, sorry if it came out the wrong way. I appreciate your feedback and
the time you have spent on this subject.

> 
> > Switch ASICs have dedicated traps for specific packets. Usually, these
> > packets are control packets (e.g., ARP, BGP) which are required for the
> > correct functioning of the control plane. You can see this in the SAI
> > interface, which is an abstraction layer over vendors' SDKs:
> > 
> > https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157
> > 
> > We need to be able to configure the hardware policers of these traps and
> > read their statistics to understand how many packets they dropped. We
> > currently do not have a way to do any of that and we rely on hardcoded
> > defaults in the driver which do not fit every use case (from
> > experience):
> > 
> > https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103
> > 
> > We plan to extend devlink-trap mechanism to cover all these use cases. I
> > hope you agree that this functionality belongs in devlink given it is a
> > device-specific configuration and not a netdev-specific one.
> 
> No disagreement on providing knobs for traps.
> 
> > That being said, in its current form, this mechanism is focused on traps
> > that correlate to packets the device decided to drop as this is very
> > useful for debugging.
> 
> That'd be mixing two things - trap configuration and tracing exceptions
> in one API. That's a little suboptimal but not too terrible, especially
> if there is a higher level APIs users can default to.

TBH, initially I was only focused on the drops, but then it occurred to
me that this is a too narrow scope. These traps are only a subset of the
complete list of traps we have and we have similar requirements for both
(statistics, setting policers etc.). Therefore, I decided to design this
interface in a more generic way, so that it could support the different
use cases.

> 
> > Given that the entire configuration is done via devlink and that devlink
> > stores all the information about these traps, it seems logical to also
> > report these packets and their metadata to user space as devlink events.
> > 
> > If this is not desirable, we can try to call into drop_monitor from
> > devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will
> > encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.
> > 
> > IMO, this is less desirable, as instead of having one tool (devlink) to
> > interact with this mechanism we will need two (devlink & dropwatch).
> > 
> > Below I tried to answer all your questions and refer to all the points
> > you brought up.
> > 
> > > When you say silently dropped do you mean that mlxsw as of today
> > > doesn't have any counters exposed for those events?  
> > 
> > Some of these packets are counted, but not all of them.
> > 
> > > If we wanted to consolidate this into something existing we can either
> > >  (a) add similar traps in the kernel data path;
> > >  (b) make these traps extension of statistics.
> > > 
> > > My knee jerk reaction to seeing the patches was that it adds a new
> > > place where device statistics are reported.  
> > 
> > Not at all. This would be a step back. We can already count discards due
> > to VLAN membership on ingress on a per-port basis. A software maintained
> > global counter does not buy us anything.
> > 
> > By also getting the dropped packet - coupled with the drop reason and
> > ingress port - you can understand exactly why and on which VLAN the
> > packet was dropped. I wrote a Wireshark dissector for these netlink
> > packets to make our life easier. You can see the details in my comment
> > to the cover letter:
> > 
> > https://marc.info/?l=linux-netdev&m=156248736710238&w=2
> > 
> > In case you do not care about individual packets, but still want more
> > fine-grained statistics for your monitoring application, you can use
> > eBPF. For example, one thing we did is attaching a kprobe to
> > devlink_trap_report() with an eBPF program that dissects the incoming
> > skbs and maintains a counter per-{5 tuple, drop reason}. With
> > ebpf_exporter you can export these statistics to Prometheus on which you
> > can run queries and visualize the results with Grafana. This is
> > especially useful for tail and early drops since it allows you to
> > understand which flows contribute to most of the drops.
> 
> No question that the mechanism is useful.
> 
> > > Users who want to know why things are dropped will not get detailed
> > > breakdown from ethtool -S which for better or worse is the one stop
> > > shop for device stats today.  
> > 
> > I hope I managed to explain why counters are not enough, but I also want
> > to point out that ethtool statistics are not properly documented and
> > this hinders their effectiveness. I did my best to document the exposed
> > traps in order to avoid the same fate:
> > 
> > https://patchwork.ozlabs.org/patch/1128585/
> > 
> > In addition, there are selftests to show how each trap can be triggered
> > to reduce the ambiguity even further:
> > 
> > https://patchwork.ozlabs.org/patch/1128610/
> > 
> > And a note in the documentation to make sure future functionality is
> > tested as well:
> > 
> > https://patchwork.ozlabs.org/patch/1128608/
> > 
> > > Having thought about it some more, however, I think that having a
> > > forwarding "exception" object and hanging statistics off of it is a
> > > better design, even if we need to deal with some duplication to get
> > > there.
> > > 
> > > IOW having an way to "trap all packets which would increment a
> > > statistic" (option (b) above) is probably a bad design.
> > > 
> > > As for (a) I wonder how many of those events have a corresponding event
> > > in the kernel stack?  
> > 
> > Generic packet drops all have a corresponding kfree_skb() calls in the
> > kernel, but that does not mean that every packet dropped by the hardware
> > would also be dropped by the kernel if it were to be injected to its Rx
> > path.
> 
> The notion that all SW events get captured by kfree_skb() would not be
> correct.

I meant that the generic drop reasons I'm exposing with this patchset
all correspond to reasons for which the kernel would drop packets.

> We have the kfree_skb(), and xdp_exception(), and drivers can
> drop packets if various allocations fail.. the situation is already not
> great.
> 
> I think that having a single useful place where users can look to see
> all traffic exception events would go a long way. 

I believe this was Dave's point as well. We have one tool to monitor
kfree_skb() drops and with this patchset we will have another to monitor
HW drops. As I mentioned in my previous reply, I will look into sending
the events via drop_monitor by calling into it from devlink.

I'm not involved with XDP (as you might have noticed), but I assume
drop_monitor could be extended for this use case as well by doing
register_trace_xdp_exception(). Then you could monitor SW, HW and XDP
events using a single netlink channel, potentially split into different
multicast groups to allow user space programs to receive only the events
they care about.

> Software side as I mentioned is pretty brutal, IDK how many users are
> actually willing to decode stack traces to figure out why their system
> is dropping packets :/
> 
> > In my reply to Dave I gave buffer drops as an example.
> 
> The example of buffer drops is also probably the case where having the
> packet is least useful, but yes, I definitely agree devices need a way
> of reporting events that can't happen in SW.
> 
> > There are also situations in which packets can be dropped due to
> > device-specific exceptions and these do not have a corresponding drop
> > reason in the kernel. See example here:
> > 
> > https://patchwork.ozlabs.org/patch/1128587/
> > 
> > > If we could add corresponding trace points and just feed those from
> > > the device driver, that'd obviously be a holy grail.  
> > 
> > Unlike tracepoints, netlink gives you a structured and extensible
> > interface. For example, in Spectrum-1 we cannot provide the Tx port for
> > early/tail drops, whereas for Spectrum-2 and later we can. With netlink,
> > we can just omit the DEVLINK_ATTR_TRAP_OUT_PORT attribute for
> > Spectrum-1. You also get a programmatic interface that you can query for
> > this information:
> > 
> > # devlink -v trap show netdevsim/netdevsim10 trap ingress_vlan_filter
> > netdevsim/netdevsim10:
> >   name ingress_vlan_filter type drop generic true report false action drop group l2_drops
> >     metadata:
> >        input_port
> 
> Right, you can set or not set skb fields to some extent but its
> definitely not as flexible as netlink.
Toke Høiland-Jørgensen July 10, 2019, 11:39 a.m. UTC | #8
Ido Schimmel <idosch@idosch.org> writes:

> On Tue, Jul 09, 2019 at 03:34:30PM -0700, Jakub Kicinski wrote:
>> On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote:
>> > On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
>> > > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:  
>> > > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:  
>> > > > > From: Ido Schimmel <idosch@idosch.org>
>> > > > > Date: Sun,  7 Jul 2019 10:58:17 +0300
>> > > > >     
>> > > > > > Users have several ways to debug the kernel and understand why a packet
>> > > > > > was dropped. For example, using "drop monitor" and "perf". Both
>> > > > > > utilities trace kfree_skb(), which is the function called when a packet
>> > > > > > is freed as part of a failure. The information provided by these tools
>> > > > > > is invaluable when trying to understand the cause of a packet loss.
>> > > > > > 
>> > > > > > In recent years, large portions of the kernel data path were offloaded
>> > > > > > to capable devices. Today, it is possible to perform L2 and L3
>> > > > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
>> > > > > > Different TC classifiers and actions are also offloaded to capable
>> > > > > > devices, at both ingress and egress.
>> > > > > > 
>> > > > > > However, when the data path is offloaded it is not possible to achieve
>> > > > > > the same level of introspection as tools such "perf" and "drop monitor"
>> > > > > > become irrelevant.
>> > > > > > 
>> > > > > > This patchset aims to solve this by allowing users to monitor packets
>> > > > > > that the underlying device decided to drop along with relevant metadata
>> > > > > > such as the drop reason and ingress port.    
>> > > > > 
>> > > > > We are now going to have 5 or so ways to capture packets passing through
>> > > > > the system, this is nonsense.
>> > > > > 
>> > > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
>> > > > > devlink thing.
>> > > > > 
>> > > > > This is insanity, too many ways to do the same thing and therefore the
>> > > > > worst possible user experience.
>> > > > > 
>> > > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
>> > > > > XDP perf events, and these taps there too.
>> > > > > 
>> > > > > I mean really, think about it from the average user's perspective.  To
>> > > > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
>> > > > > listen on devlink but configure a special tap thing beforehand and then
>> > > > > if someone is using XDP I gotta setup another perf event buffer capture
>> > > > > thing too.    
>> > > > 
>> > > > Let me try to explain again because I probably wasn't clear enough. The
>> > > > devlink-trap mechanism is not doing the same thing as other solutions.
>> > > > 
>> > > > The packets we are capturing in this patchset are packets that the
>> > > > kernel (the CPU) never saw up until now - they were silently dropped by
>> > > > the underlying device performing the packet forwarding instead of the
>> > > > CPU.  
>> > 
>> > Jakub,
>> > 
>> > It seems to me that most of the criticism is about consolidation of
>> > interfaces because you believe I'm doing something you can already do
>> > today, but this is not the case.
>> 
>> To be clear I'm not opposed to the patches, I'm just trying to
>> facilitate a discussion.
>
> Sure, sorry if it came out the wrong way. I appreciate your feedback and
> the time you have spent on this subject.
>
>> 
>> > Switch ASICs have dedicated traps for specific packets. Usually, these
>> > packets are control packets (e.g., ARP, BGP) which are required for the
>> > correct functioning of the control plane. You can see this in the SAI
>> > interface, which is an abstraction layer over vendors' SDKs:
>> > 
>> > https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157
>> > 
>> > We need to be able to configure the hardware policers of these traps and
>> > read their statistics to understand how many packets they dropped. We
>> > currently do not have a way to do any of that and we rely on hardcoded
>> > defaults in the driver which do not fit every use case (from
>> > experience):
>> > 
>> > https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103
>> > 
>> > We plan to extend devlink-trap mechanism to cover all these use cases. I
>> > hope you agree that this functionality belongs in devlink given it is a
>> > device-specific configuration and not a netdev-specific one.
>> 
>> No disagreement on providing knobs for traps.
>> 
>> > That being said, in its current form, this mechanism is focused on traps
>> > that correlate to packets the device decided to drop as this is very
>> > useful for debugging.
>> 
>> That'd be mixing two things - trap configuration and tracing exceptions
>> in one API. That's a little suboptimal but not too terrible, especially
>> if there is a higher level APIs users can default to.
>
> TBH, initially I was only focused on the drops, but then it occurred to
> me that this is a too narrow scope. These traps are only a subset of the
> complete list of traps we have and we have similar requirements for both
> (statistics, setting policers etc.). Therefore, I decided to design this
> interface in a more generic way, so that it could support the different
> use cases.
>
>> 
>> > Given that the entire configuration is done via devlink and that devlink
>> > stores all the information about these traps, it seems logical to also
>> > report these packets and their metadata to user space as devlink events.
>> > 
>> > If this is not desirable, we can try to call into drop_monitor from
>> > devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will
>> > encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.
>> > 
>> > IMO, this is less desirable, as instead of having one tool (devlink) to
>> > interact with this mechanism we will need two (devlink & dropwatch).
>> > 
>> > Below I tried to answer all your questions and refer to all the points
>> > you brought up.
>> > 
>> > > When you say silently dropped do you mean that mlxsw as of today
>> > > doesn't have any counters exposed for those events?  
>> > 
>> > Some of these packets are counted, but not all of them.
>> > 
>> > > If we wanted to consolidate this into something existing we can either
>> > >  (a) add similar traps in the kernel data path;
>> > >  (b) make these traps extension of statistics.
>> > > 
>> > > My knee jerk reaction to seeing the patches was that it adds a new
>> > > place where device statistics are reported.  
>> > 
>> > Not at all. This would be a step back. We can already count discards due
>> > to VLAN membership on ingress on a per-port basis. A software maintained
>> > global counter does not buy us anything.
>> > 
>> > By also getting the dropped packet - coupled with the drop reason and
>> > ingress port - you can understand exactly why and on which VLAN the
>> > packet was dropped. I wrote a Wireshark dissector for these netlink
>> > packets to make our life easier. You can see the details in my comment
>> > to the cover letter:
>> > 
>> > https://marc.info/?l=linux-netdev&m=156248736710238&w=2
>> > 
>> > In case you do not care about individual packets, but still want more
>> > fine-grained statistics for your monitoring application, you can use
>> > eBPF. For example, one thing we did is attaching a kprobe to
>> > devlink_trap_report() with an eBPF program that dissects the incoming
>> > skbs and maintains a counter per-{5 tuple, drop reason}. With
>> > ebpf_exporter you can export these statistics to Prometheus on which you
>> > can run queries and visualize the results with Grafana. This is
>> > especially useful for tail and early drops since it allows you to
>> > understand which flows contribute to most of the drops.
>> 
>> No question that the mechanism is useful.
>> 
>> > > Users who want to know why things are dropped will not get detailed
>> > > breakdown from ethtool -S which for better or worse is the one stop
>> > > shop for device stats today.  
>> > 
>> > I hope I managed to explain why counters are not enough, but I also want
>> > to point out that ethtool statistics are not properly documented and
>> > this hinders their effectiveness. I did my best to document the exposed
>> > traps in order to avoid the same fate:
>> > 
>> > https://patchwork.ozlabs.org/patch/1128585/
>> > 
>> > In addition, there are selftests to show how each trap can be triggered
>> > to reduce the ambiguity even further:
>> > 
>> > https://patchwork.ozlabs.org/patch/1128610/
>> > 
>> > And a note in the documentation to make sure future functionality is
>> > tested as well:
>> > 
>> > https://patchwork.ozlabs.org/patch/1128608/
>> > 
>> > > Having thought about it some more, however, I think that having a
>> > > forwarding "exception" object and hanging statistics off of it is a
>> > > better design, even if we need to deal with some duplication to get
>> > > there.
>> > > 
>> > > IOW having an way to "trap all packets which would increment a
>> > > statistic" (option (b) above) is probably a bad design.
>> > > 
>> > > As for (a) I wonder how many of those events have a corresponding event
>> > > in the kernel stack?  
>> > 
>> > Generic packet drops all have a corresponding kfree_skb() calls in the
>> > kernel, but that does not mean that every packet dropped by the hardware
>> > would also be dropped by the kernel if it were to be injected to its Rx
>> > path.
>> 
>> The notion that all SW events get captured by kfree_skb() would not be
>> correct.
>
> I meant that the generic drop reasons I'm exposing with this patchset
> all correspond to reasons for which the kernel would drop packets.
>
>> We have the kfree_skb(), and xdp_exception(), and drivers can
>> drop packets if various allocations fail.. the situation is already not
>> great.
>> 
>> I think that having a single useful place where users can look to see
>> all traffic exception events would go a long way. 
>
> I believe this was Dave's point as well. We have one tool to monitor
> kfree_skb() drops and with this patchset we will have another to monitor
> HW drops. As I mentioned in my previous reply, I will look into sending
> the events via drop_monitor by calling into it from devlink.
>
> I'm not involved with XDP (as you might have noticed), but I assume
> drop_monitor could be extended for this use case as well by doing
> register_trace_xdp_exception(). Then you could monitor SW, HW and XDP
> events using a single netlink channel, potentially split into different
> multicast groups to allow user space programs to receive only the events
> they care about.

Yes, having XDP hook into this would be good! This ties in nicely with
the need for better statistics for XDP (see
https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#consistency-for-statistics-with-xdp).

Unfortunately, it's not enough to just hook into the xdp_exception trace
point, as that is generally only triggered on XDP_ABORTED, not if the
XDP program just decides to drop the packet with XDP_DROP. So we may
need another mechanism to hook this if it's going to comprehensive; and
we'd probably want a way to do this that doesn't impose a performance
penalty when the drop monitor is not enabled...

-Toke
Ido Schimmel July 11, 2019, 12:39 p.m. UTC | #9
On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> From: Ido Schimmel <idosch@idosch.org>
> Date: Sun,  7 Jul 2019 10:58:17 +0300
> 
> > Users have several ways to debug the kernel and understand why a packet
> > was dropped. For example, using "drop monitor" and "perf". Both
> > utilities trace kfree_skb(), which is the function called when a packet
> > is freed as part of a failure. The information provided by these tools
> > is invaluable when trying to understand the cause of a packet loss.
> > 
> > In recent years, large portions of the kernel data path were offloaded
> > to capable devices. Today, it is possible to perform L2 and L3
> > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > Different TC classifiers and actions are also offloaded to capable
> > devices, at both ingress and egress.
> > 
> > However, when the data path is offloaded it is not possible to achieve
> > the same level of introspection as tools such "perf" and "drop monitor"
> > become irrelevant.
> > 
> > This patchset aims to solve this by allowing users to monitor packets
> > that the underlying device decided to drop along with relevant metadata
> > such as the drop reason and ingress port.
> 
> We are now going to have 5 or so ways to capture packets passing through
> the system, this is nonsense.
> 
> AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> devlink thing.
> 
> This is insanity, too many ways to do the same thing and therefore the
> worst possible user experience.
> 
> Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> XDP perf events, and these taps there too.
> 
> I mean really, think about it from the average user's perspective.  To
> see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> listen on devlink but configure a special tap thing beforehand and then
> if someone is using XDP I gotta setup another perf event buffer capture
> thing too.

Dave,

Before I start working on v2, I would like to get your feedback on the
high level plan. Also adding Neil who is the maintainer of drop_monitor
(and counterpart DropWatch tool [1]).

IIUC, the problem you point out is that users need to use different
tools to monitor packet drops based on where these drops occur
(SW/HW/XDP).

Therefore, my plan is to extend the existing drop_monitor netlink
channel to also cover HW drops. I will add a new message type and a new
multicast group for HW drops and encode in the message what is currently
encoded in the devlink events.

I would like to emphasize that the configuration of whether these
dropped packets are even sent to the CPU from the device still needs to
reside in devlink given this is the go-to tool for device-specific
configuration. In addition, these drop traps are a small subset of the
entire packet traps devices support and all have similar needs such as
HW policer configuration and statistics.

In the future we might also want to report events that indicate the
formation of possible problems. For example, in case packets are queued
above a certain threshold or for long periods of time. I hope we could
re-use drop_monitor for this as well, thereby making it the go-to
channel for diagnosing current and to-be problems in the data path.

Thanks

[1] https://github.com/nhorman/dropwatch
David Miller July 11, 2019, 7:02 p.m. UTC | #10
From: Ido Schimmel <idosch@idosch.org>
Date: Thu, 11 Jul 2019 15:39:09 +0300

> Before I start working on v2, I would like to get your feedback on the
> high level plan. Also adding Neil who is the maintainer of drop_monitor
> (and counterpart DropWatch tool [1]).

I'll try to get back to this, but right now the merge window is completely
consuming me at the moment so you will have to exercise extreme patience.

Thank you.
Neil Horman July 11, 2019, 11:53 p.m. UTC | #11
On Thu, Jul 11, 2019 at 03:39:09PM +0300, Ido Schimmel wrote:
> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> > From: Ido Schimmel <idosch@idosch.org>
> > Date: Sun,  7 Jul 2019 10:58:17 +0300
> > 
> > > Users have several ways to debug the kernel and understand why a packet
> > > was dropped. For example, using "drop monitor" and "perf". Both
> > > utilities trace kfree_skb(), which is the function called when a packet
> > > is freed as part of a failure. The information provided by these tools
> > > is invaluable when trying to understand the cause of a packet loss.
> > > 
> > > In recent years, large portions of the kernel data path were offloaded
> > > to capable devices. Today, it is possible to perform L2 and L3
> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > > Different TC classifiers and actions are also offloaded to capable
> > > devices, at both ingress and egress.
> > > 
> > > However, when the data path is offloaded it is not possible to achieve
> > > the same level of introspection as tools such "perf" and "drop monitor"
> > > become irrelevant.
> > > 
> > > This patchset aims to solve this by allowing users to monitor packets
> > > that the underlying device decided to drop along with relevant metadata
> > > such as the drop reason and ingress port.
> > 
> > We are now going to have 5 or so ways to capture packets passing through
> > the system, this is nonsense.
> > 
> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> > devlink thing.
> > 
> > This is insanity, too many ways to do the same thing and therefore the
> > worst possible user experience.
> > 
> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> > XDP perf events, and these taps there too.
> > 
> > I mean really, think about it from the average user's perspective.  To
> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> > listen on devlink but configure a special tap thing beforehand and then
> > if someone is using XDP I gotta setup another perf event buffer capture
> > thing too.
> 
> Dave,
> 
> Before I start working on v2, I would like to get your feedback on the
> high level plan. Also adding Neil who is the maintainer of drop_monitor
> (and counterpart DropWatch tool [1]).
> 
> IIUC, the problem you point out is that users need to use different
> tools to monitor packet drops based on where these drops occur
> (SW/HW/XDP).
> 
> Therefore, my plan is to extend the existing drop_monitor netlink
> channel to also cover HW drops. I will add a new message type and a new
> multicast group for HW drops and encode in the message what is currently
> encoded in the devlink events.
> 
A few things here:
IIRC we don't announce individual hardware drops, drivers record them in
internal structures, and they are retrieved on demand via ethtool calls, so you
will either need to include some polling (probably not a very performant idea),
or some sort of flagging mechanism to indicate that on the next message sent to
user space you should go retrieve hw stats from a given interface.  I certainly
wouldn't mind seeing this happen, but its more work than just adding a new
netlink message.

Also, regarding XDP drops, we wont see them if the xdp program is offloaded to
hardware (you'll need your hw drop gathering mechanism for that), but for xdp
programs run on the cpu, dropwatch should alrady catch those.  I.e. if the xdp
program returns a DROP result for a packet being processed, the OS will call
kfree_skb on its behalf, and dropwatch wil call that.

> I would like to emphasize that the configuration of whether these
> dropped packets are even sent to the CPU from the device still needs to
> reside in devlink given this is the go-to tool for device-specific
> configuration. In addition, these drop traps are a small subset of the
> entire packet traps devices support and all have similar needs such as
> HW policer configuration and statistics.
> 
> In the future we might also want to report events that indicate the
> formation of possible problems. For example, in case packets are queued
> above a certain threshold or for long periods of time. I hope we could
> re-use drop_monitor for this as well, thereby making it the go-to
> channel for diagnosing current and to-be problems in the data path.
> 
Thats an interesting idea, but dropwatch certainly isn't currently setup for
that kind of messaging.  It may be worth creating a v2 of the netlink protocol
and really thinking out what you want to communicate.

Best
Neil

> Thanks
> 
> [1] https://github.com/nhorman/dropwatch
>
Florian Fainelli July 12, 2019, 3:40 a.m. UTC | #12
On 7/11/2019 4:53 PM, Neil Horman wrote:
>> I would like to emphasize that the configuration of whether these
>> dropped packets are even sent to the CPU from the device still needs to
>> reside in devlink given this is the go-to tool for device-specific
>> configuration. In addition, these drop traps are a small subset of the
>> entire packet traps devices support and all have similar needs such as
>> HW policer configuration and statistics.
>>
>> In the future we might also want to report events that indicate the
>> formation of possible problems. For example, in case packets are queued
>> above a certain threshold or for long periods of time. I hope we could
>> re-use drop_monitor for this as well, thereby making it the go-to
>> channel for diagnosing current and to-be problems in the data path.
>>
> Thats an interesting idea, but dropwatch certainly isn't currently setup for
> that kind of messaging.  It may be worth creating a v2 of the netlink protocol
> and really thinking out what you want to communicate.

Is not what you describe more or less what Ido has been doing here with
this patch series?
Toke Høiland-Jørgensen July 12, 2019, 9:27 a.m. UTC | #13
Neil Horman <nhorman@tuxdriver.com> writes:

> On Thu, Jul 11, 2019 at 03:39:09PM +0300, Ido Schimmel wrote:
>> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
>> > From: Ido Schimmel <idosch@idosch.org>
>> > Date: Sun,  7 Jul 2019 10:58:17 +0300
>> > 
>> > > Users have several ways to debug the kernel and understand why a packet
>> > > was dropped. For example, using "drop monitor" and "perf". Both
>> > > utilities trace kfree_skb(), which is the function called when a packet
>> > > is freed as part of a failure. The information provided by these tools
>> > > is invaluable when trying to understand the cause of a packet loss.
>> > > 
>> > > In recent years, large portions of the kernel data path were offloaded
>> > > to capable devices. Today, it is possible to perform L2 and L3
>> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
>> > > Different TC classifiers and actions are also offloaded to capable
>> > > devices, at both ingress and egress.
>> > > 
>> > > However, when the data path is offloaded it is not possible to achieve
>> > > the same level of introspection as tools such "perf" and "drop monitor"
>> > > become irrelevant.
>> > > 
>> > > This patchset aims to solve this by allowing users to monitor packets
>> > > that the underlying device decided to drop along with relevant metadata
>> > > such as the drop reason and ingress port.
>> > 
>> > We are now going to have 5 or so ways to capture packets passing through
>> > the system, this is nonsense.
>> > 
>> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
>> > devlink thing.
>> > 
>> > This is insanity, too many ways to do the same thing and therefore the
>> > worst possible user experience.
>> > 
>> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
>> > XDP perf events, and these taps there too.
>> > 
>> > I mean really, think about it from the average user's perspective.  To
>> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
>> > listen on devlink but configure a special tap thing beforehand and then
>> > if someone is using XDP I gotta setup another perf event buffer capture
>> > thing too.
>> 
>> Dave,
>> 
>> Before I start working on v2, I would like to get your feedback on the
>> high level plan. Also adding Neil who is the maintainer of drop_monitor
>> (and counterpart DropWatch tool [1]).
>> 
>> IIUC, the problem you point out is that users need to use different
>> tools to monitor packet drops based on where these drops occur
>> (SW/HW/XDP).
>> 
>> Therefore, my plan is to extend the existing drop_monitor netlink
>> channel to also cover HW drops. I will add a new message type and a new
>> multicast group for HW drops and encode in the message what is currently
>> encoded in the devlink events.
>> 
> A few things here:
> IIRC we don't announce individual hardware drops, drivers record them in
> internal structures, and they are retrieved on demand via ethtool calls, so you
> will either need to include some polling (probably not a very performant idea),
> or some sort of flagging mechanism to indicate that on the next message sent to
> user space you should go retrieve hw stats from a given interface.  I certainly
> wouldn't mind seeing this happen, but its more work than just adding a new
> netlink message.
>
> Also, regarding XDP drops, we wont see them if the xdp program is offloaded to
> hardware (you'll need your hw drop gathering mechanism for that), but for xdp
> programs run on the cpu, dropwatch should alrady catch those.  I.e. if the xdp
> program returns a DROP result for a packet being processed, the OS will call
> kfree_skb on its behalf, and dropwatch wil call that.

There is no skb by the time an XDP program runs, so this is not true. As
I mentioned upthread, there's a tracepoint that will get called if an
error occurs (or the program returns XDP_ABORTED), but in most cases,
XDP_DROP just means that the packet silently disappears...

-Toke
Neil Horman July 12, 2019, 12:05 p.m. UTC | #14
On Thu, Jul 11, 2019 at 08:40:34PM -0700, Florian Fainelli wrote:
> 
> 
> On 7/11/2019 4:53 PM, Neil Horman wrote:
> >> I would like to emphasize that the configuration of whether these
> >> dropped packets are even sent to the CPU from the device still needs to
> >> reside in devlink given this is the go-to tool for device-specific
> >> configuration. In addition, these drop traps are a small subset of the
> >> entire packet traps devices support and all have similar needs such as
> >> HW policer configuration and statistics.
> >>
> >> In the future we might also want to report events that indicate the
> >> formation of possible problems. For example, in case packets are queued
> >> above a certain threshold or for long periods of time. I hope we could
> >> re-use drop_monitor for this as well, thereby making it the go-to
> >> channel for diagnosing current and to-be problems in the data path.
> >>
> > Thats an interesting idea, but dropwatch certainly isn't currently setup for
> > that kind of messaging.  It may be worth creating a v2 of the netlink protocol
> > and really thinking out what you want to communicate.
> 
> Is not what you describe more or less what Ido has been doing here with
> this patch series?
possibly, I was only CCed on this thread halfway throught the conversation, and
only on the cover letter, I've not had a chance to look at the entire series

Neil

> -- 
> Florian
>
Neil Horman July 12, 2019, 12:18 p.m. UTC | #15
On Fri, Jul 12, 2019 at 11:27:55AM +0200, Toke Høiland-Jørgensen wrote:
> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > On Thu, Jul 11, 2019 at 03:39:09PM +0300, Ido Schimmel wrote:
> >> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> >> > From: Ido Schimmel <idosch@idosch.org>
> >> > Date: Sun,  7 Jul 2019 10:58:17 +0300
> >> > 
> >> > > Users have several ways to debug the kernel and understand why a packet
> >> > > was dropped. For example, using "drop monitor" and "perf". Both
> >> > > utilities trace kfree_skb(), which is the function called when a packet
> >> > > is freed as part of a failure. The information provided by these tools
> >> > > is invaluable when trying to understand the cause of a packet loss.
> >> > > 
> >> > > In recent years, large portions of the kernel data path were offloaded
> >> > > to capable devices. Today, it is possible to perform L2 and L3
> >> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> >> > > Different TC classifiers and actions are also offloaded to capable
> >> > > devices, at both ingress and egress.
> >> > > 
> >> > > However, when the data path is offloaded it is not possible to achieve
> >> > > the same level of introspection as tools such "perf" and "drop monitor"
> >> > > become irrelevant.
> >> > > 
> >> > > This patchset aims to solve this by allowing users to monitor packets
> >> > > that the underlying device decided to drop along with relevant metadata
> >> > > such as the drop reason and ingress port.
> >> > 
> >> > We are now going to have 5 or so ways to capture packets passing through
> >> > the system, this is nonsense.
> >> > 
> >> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> >> > devlink thing.
> >> > 
> >> > This is insanity, too many ways to do the same thing and therefore the
> >> > worst possible user experience.
> >> > 
> >> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> >> > XDP perf events, and these taps there too.
> >> > 
> >> > I mean really, think about it from the average user's perspective.  To
> >> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> >> > listen on devlink but configure a special tap thing beforehand and then
> >> > if someone is using XDP I gotta setup another perf event buffer capture
> >> > thing too.
> >> 
> >> Dave,
> >> 
> >> Before I start working on v2, I would like to get your feedback on the
> >> high level plan. Also adding Neil who is the maintainer of drop_monitor
> >> (and counterpart DropWatch tool [1]).
> >> 
> >> IIUC, the problem you point out is that users need to use different
> >> tools to monitor packet drops based on where these drops occur
> >> (SW/HW/XDP).
> >> 
> >> Therefore, my plan is to extend the existing drop_monitor netlink
> >> channel to also cover HW drops. I will add a new message type and a new
> >> multicast group for HW drops and encode in the message what is currently
> >> encoded in the devlink events.
> >> 
> > A few things here:
> > IIRC we don't announce individual hardware drops, drivers record them in
> > internal structures, and they are retrieved on demand via ethtool calls, so you
> > will either need to include some polling (probably not a very performant idea),
> > or some sort of flagging mechanism to indicate that on the next message sent to
> > user space you should go retrieve hw stats from a given interface.  I certainly
> > wouldn't mind seeing this happen, but its more work than just adding a new
> > netlink message.
> >
> > Also, regarding XDP drops, we wont see them if the xdp program is offloaded to
> > hardware (you'll need your hw drop gathering mechanism for that), but for xdp
> > programs run on the cpu, dropwatch should alrady catch those.  I.e. if the xdp
> > program returns a DROP result for a packet being processed, the OS will call
> > kfree_skb on its behalf, and dropwatch wil call that.
> 
> There is no skb by the time an XDP program runs, so this is not true. As
> I mentioned upthread, there's a tracepoint that will get called if an
> error occurs (or the program returns XDP_ABORTED), but in most cases,
> XDP_DROP just means that the packet silently disappears...
> 
As I noted, thats only true for xdp programs that are offloaded to hardware, I
was only speaking for XDP programs that run on the cpu.  For the former case, we
obviously need some other mechanism to detect drops, but for cpu executed xdp
programs, the OS is responsible for freeing skbs associated with programs the
return XDP_DROP.

Neil

> -Toke
>
Toke Høiland-Jørgensen July 12, 2019, 12:33 p.m. UTC | #16
Neil Horman <nhorman@tuxdriver.com> writes:

> On Fri, Jul 12, 2019 at 11:27:55AM +0200, Toke Høiland-Jørgensen wrote:
>> Neil Horman <nhorman@tuxdriver.com> writes:
>> 
>> > On Thu, Jul 11, 2019 at 03:39:09PM +0300, Ido Schimmel wrote:
>> >> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
>> >> > From: Ido Schimmel <idosch@idosch.org>
>> >> > Date: Sun,  7 Jul 2019 10:58:17 +0300
>> >> > 
>> >> > > Users have several ways to debug the kernel and understand why a packet
>> >> > > was dropped. For example, using "drop monitor" and "perf". Both
>> >> > > utilities trace kfree_skb(), which is the function called when a packet
>> >> > > is freed as part of a failure. The information provided by these tools
>> >> > > is invaluable when trying to understand the cause of a packet loss.
>> >> > > 
>> >> > > In recent years, large portions of the kernel data path were offloaded
>> >> > > to capable devices. Today, it is possible to perform L2 and L3
>> >> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
>> >> > > Different TC classifiers and actions are also offloaded to capable
>> >> > > devices, at both ingress and egress.
>> >> > > 
>> >> > > However, when the data path is offloaded it is not possible to achieve
>> >> > > the same level of introspection as tools such "perf" and "drop monitor"
>> >> > > become irrelevant.
>> >> > > 
>> >> > > This patchset aims to solve this by allowing users to monitor packets
>> >> > > that the underlying device decided to drop along with relevant metadata
>> >> > > such as the drop reason and ingress port.
>> >> > 
>> >> > We are now going to have 5 or so ways to capture packets passing through
>> >> > the system, this is nonsense.
>> >> > 
>> >> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
>> >> > devlink thing.
>> >> > 
>> >> > This is insanity, too many ways to do the same thing and therefore the
>> >> > worst possible user experience.
>> >> > 
>> >> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
>> >> > XDP perf events, and these taps there too.
>> >> > 
>> >> > I mean really, think about it from the average user's perspective.  To
>> >> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
>> >> > listen on devlink but configure a special tap thing beforehand and then
>> >> > if someone is using XDP I gotta setup another perf event buffer capture
>> >> > thing too.
>> >> 
>> >> Dave,
>> >> 
>> >> Before I start working on v2, I would like to get your feedback on the
>> >> high level plan. Also adding Neil who is the maintainer of drop_monitor
>> >> (and counterpart DropWatch tool [1]).
>> >> 
>> >> IIUC, the problem you point out is that users need to use different
>> >> tools to monitor packet drops based on where these drops occur
>> >> (SW/HW/XDP).
>> >> 
>> >> Therefore, my plan is to extend the existing drop_monitor netlink
>> >> channel to also cover HW drops. I will add a new message type and a new
>> >> multicast group for HW drops and encode in the message what is currently
>> >> encoded in the devlink events.
>> >> 
>> > A few things here:
>> > IIRC we don't announce individual hardware drops, drivers record them in
>> > internal structures, and they are retrieved on demand via ethtool calls, so you
>> > will either need to include some polling (probably not a very performant idea),
>> > or some sort of flagging mechanism to indicate that on the next message sent to
>> > user space you should go retrieve hw stats from a given interface.  I certainly
>> > wouldn't mind seeing this happen, but its more work than just adding a new
>> > netlink message.
>> >
>> > Also, regarding XDP drops, we wont see them if the xdp program is offloaded to
>> > hardware (you'll need your hw drop gathering mechanism for that), but for xdp
>> > programs run on the cpu, dropwatch should alrady catch those.  I.e. if the xdp
>> > program returns a DROP result for a packet being processed, the OS will call
>> > kfree_skb on its behalf, and dropwatch wil call that.
>> 
>> There is no skb by the time an XDP program runs, so this is not true. As
>> I mentioned upthread, there's a tracepoint that will get called if an
>> error occurs (or the program returns XDP_ABORTED), but in most cases,
>> XDP_DROP just means that the packet silently disappears...
>> 
> As I noted, thats only true for xdp programs that are offloaded to hardware, I
> was only speaking for XDP programs that run on the cpu.  For the former case, we
> obviously need some other mechanism to detect drops, but for cpu executed xdp
> programs, the OS is responsible for freeing skbs associated with programs the
> return XDP_DROP.

Ah, I think maybe you're thinking of generic XDP (also referred to as
skb mode)? That is a separate mode; an XDP program loaded in "native
mode" (or "driver mode") runs on the CPU, but before the skb is created;
this is the common case for XDP, and there is no skb and thus no drop
notification in this mode.

There is *also* an offload mode for XDP programs, but that is only
supported by netronome cards thus far, so not as commonly used...

-Toke
Ido Schimmel July 12, 2019, 1:52 p.m. UTC | #17
On Thu, Jul 11, 2019 at 07:53:54PM -0400, Neil Horman wrote:
> A few things here:
> IIRC we don't announce individual hardware drops, drivers record them in
> internal structures, and they are retrieved on demand via ethtool calls, so you
> will either need to include some polling (probably not a very performant idea),
> or some sort of flagging mechanism to indicate that on the next message sent to
> user space you should go retrieve hw stats from a given interface.  I certainly
> wouldn't mind seeing this happen, but its more work than just adding a new
> netlink message.

Neil,

The idea of this series is to pass the dropped packets themselves to
user space along with metadata, such as the drop reason and the ingress
port. In the future more metadata could be added thanks to the
extensible nature of netlink.

In v1 these packets were notified to user space as devlink events
and my plan for v2 is to send them as drop_monitor events, given it's an
existing generic netlink channel used to monitor SW drops. This will
allow users to listen on one netlink channel to diagnose potential
problems in either SW or HW (and hopefully XDP in the future).

Please note that the packets I'm talking about are packets users
currently do not see. They are dropped - potentially silently - by the
underlying device, thereby making it hard to debug whatever issues you
might be experiencing in your network.

The control path that determines if these packets are even sent to the
CPU from the HW needs to remain in devlink for the reasons I outlined in
my previous reply. However, the monitoring of these drops will be over
drop_monitor. This is similar to what we are currently doing with
tc-sample, where the control path that triggers the sampling and
determines the sampling rate and truncation is done over rtnetlink (tc),
but the sampled packets are notified over the generic netlink psample
channel.

To make it more real, you can check the example of the dissected devlink
message that notifies the drop of a packet due to a multicast source
MAC: https://marc.info/?l=linux-netdev&m=156248736710238&w=2

I will obviously have to create another Wireshark dissector for
drop_monitor in order to get the same information.

> Thats an interesting idea, but dropwatch certainly isn't currently setup for
> that kind of messaging.  It may be worth creating a v2 of the netlink protocol
> and really thinking out what you want to communicate.

I don't think we need a v2 of the netlink protocol. My current plan is
to extend the existing protocol with: New message type (e.g.,
NET_DM_CMD_HW_ALERT), new multicast group and a set of attributes to
encode the information that is currently encoded in the example message
I pasted above.

Thanks
Neil Horman July 13, 2019, 12:40 a.m. UTC | #18
On Fri, Jul 12, 2019 at 02:33:29PM +0200, Toke Høiland-Jørgensen wrote:
> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > On Fri, Jul 12, 2019 at 11:27:55AM +0200, Toke Høiland-Jørgensen wrote:
> >> Neil Horman <nhorman@tuxdriver.com> writes:
> >> 
> >> > On Thu, Jul 11, 2019 at 03:39:09PM +0300, Ido Schimmel wrote:
> >> >> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> >> >> > From: Ido Schimmel <idosch@idosch.org>
> >> >> > Date: Sun,  7 Jul 2019 10:58:17 +0300
> >> >> > 
> >> >> > > Users have several ways to debug the kernel and understand why a packet
> >> >> > > was dropped. For example, using "drop monitor" and "perf". Both
> >> >> > > utilities trace kfree_skb(), which is the function called when a packet
> >> >> > > is freed as part of a failure. The information provided by these tools
> >> >> > > is invaluable when trying to understand the cause of a packet loss.
> >> >> > > 
> >> >> > > In recent years, large portions of the kernel data path were offloaded
> >> >> > > to capable devices. Today, it is possible to perform L2 and L3
> >> >> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> >> >> > > Different TC classifiers and actions are also offloaded to capable
> >> >> > > devices, at both ingress and egress.
> >> >> > > 
> >> >> > > However, when the data path is offloaded it is not possible to achieve
> >> >> > > the same level of introspection as tools such "perf" and "drop monitor"
> >> >> > > become irrelevant.
> >> >> > > 
> >> >> > > This patchset aims to solve this by allowing users to monitor packets
> >> >> > > that the underlying device decided to drop along with relevant metadata
> >> >> > > such as the drop reason and ingress port.
> >> >> > 
> >> >> > We are now going to have 5 or so ways to capture packets passing through
> >> >> > the system, this is nonsense.
> >> >> > 
> >> >> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> >> >> > devlink thing.
> >> >> > 
> >> >> > This is insanity, too many ways to do the same thing and therefore the
> >> >> > worst possible user experience.
> >> >> > 
> >> >> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> >> >> > XDP perf events, and these taps there too.
> >> >> > 
> >> >> > I mean really, think about it from the average user's perspective.  To
> >> >> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> >> >> > listen on devlink but configure a special tap thing beforehand and then
> >> >> > if someone is using XDP I gotta setup another perf event buffer capture
> >> >> > thing too.
> >> >> 
> >> >> Dave,
> >> >> 
> >> >> Before I start working on v2, I would like to get your feedback on the
> >> >> high level plan. Also adding Neil who is the maintainer of drop_monitor
> >> >> (and counterpart DropWatch tool [1]).
> >> >> 
> >> >> IIUC, the problem you point out is that users need to use different
> >> >> tools to monitor packet drops based on where these drops occur
> >> >> (SW/HW/XDP).
> >> >> 
> >> >> Therefore, my plan is to extend the existing drop_monitor netlink
> >> >> channel to also cover HW drops. I will add a new message type and a new
> >> >> multicast group for HW drops and encode in the message what is currently
> >> >> encoded in the devlink events.
> >> >> 
> >> > A few things here:
> >> > IIRC we don't announce individual hardware drops, drivers record them in
> >> > internal structures, and they are retrieved on demand via ethtool calls, so you
> >> > will either need to include some polling (probably not a very performant idea),
> >> > or some sort of flagging mechanism to indicate that on the next message sent to
> >> > user space you should go retrieve hw stats from a given interface.  I certainly
> >> > wouldn't mind seeing this happen, but its more work than just adding a new
> >> > netlink message.
> >> >
> >> > Also, regarding XDP drops, we wont see them if the xdp program is offloaded to
> >> > hardware (you'll need your hw drop gathering mechanism for that), but for xdp
> >> > programs run on the cpu, dropwatch should alrady catch those.  I.e. if the xdp
> >> > program returns a DROP result for a packet being processed, the OS will call
> >> > kfree_skb on its behalf, and dropwatch wil call that.
> >> 
> >> There is no skb by the time an XDP program runs, so this is not true. As
> >> I mentioned upthread, there's a tracepoint that will get called if an
> >> error occurs (or the program returns XDP_ABORTED), but in most cases,
> >> XDP_DROP just means that the packet silently disappears...
> >> 
> > As I noted, thats only true for xdp programs that are offloaded to hardware, I
> > was only speaking for XDP programs that run on the cpu.  For the former case, we
> > obviously need some other mechanism to detect drops, but for cpu executed xdp
> > programs, the OS is responsible for freeing skbs associated with programs the
> > return XDP_DROP.
> 
> Ah, I think maybe you're thinking of generic XDP (also referred to as
> skb mode)? That is a separate mode; an XDP program loaded in "native
Yes, was I not clear about that?
Neil

> mode" (or "driver mode") runs on the CPU, but before the skb is created;
> this is the common case for XDP, and there is no skb and thus no drop
> notification in this mode.
> 
> There is *also* an offload mode for XDP programs, but that is only
> supported by netronome cards thus far, so not as commonly used...
> 
> -Toke
>
Toke Høiland-Jørgensen July 13, 2019, 8:07 a.m. UTC | #19
>Neil Horman <nhorman@tuxdriver.com> writes:

> On Fri, Jul 12, 2019 at 02:33:29PM +0200, Toke Høiland-Jørgensen wrote:
>> Neil Horman <nhorman@tuxdriver.com> writes:
>> 
>> > On Fri, Jul 12, 2019 at 11:27:55AM +0200, Toke Høiland-Jørgensen wrote:
>> >> Neil Horman <nhorman@tuxdriver.com> writes:
>> >> 
>> >> > On Thu, Jul 11, 2019 at 03:39:09PM +0300, Ido Schimmel wrote:
>> >> >> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
>> >> >> > From: Ido Schimmel <idosch@idosch.org>
>> >> >> > Date: Sun,  7 Jul 2019 10:58:17 +0300
>> >> >> > 
>> >> >> > > Users have several ways to debug the kernel and understand why a packet
>> >> >> > > was dropped. For example, using "drop monitor" and "perf". Both
>> >> >> > > utilities trace kfree_skb(), which is the function called when a packet
>> >> >> > > is freed as part of a failure. The information provided by these tools
>> >> >> > > is invaluable when trying to understand the cause of a packet loss.
>> >> >> > > 
>> >> >> > > In recent years, large portions of the kernel data path were offloaded
>> >> >> > > to capable devices. Today, it is possible to perform L2 and L3
>> >> >> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
>> >> >> > > Different TC classifiers and actions are also offloaded to capable
>> >> >> > > devices, at both ingress and egress.
>> >> >> > > 
>> >> >> > > However, when the data path is offloaded it is not possible to achieve
>> >> >> > > the same level of introspection as tools such "perf" and "drop monitor"
>> >> >> > > become irrelevant.
>> >> >> > > 
>> >> >> > > This patchset aims to solve this by allowing users to monitor packets
>> >> >> > > that the underlying device decided to drop along with relevant metadata
>> >> >> > > such as the drop reason and ingress port.
>> >> >> > 
>> >> >> > We are now going to have 5 or so ways to capture packets passing through
>> >> >> > the system, this is nonsense.
>> >> >> > 
>> >> >> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
>> >> >> > devlink thing.
>> >> >> > 
>> >> >> > This is insanity, too many ways to do the same thing and therefore the
>> >> >> > worst possible user experience.
>> >> >> > 
>> >> >> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
>> >> >> > XDP perf events, and these taps there too.
>> >> >> > 
>> >> >> > I mean really, think about it from the average user's perspective.  To
>> >> >> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
>> >> >> > listen on devlink but configure a special tap thing beforehand and then
>> >> >> > if someone is using XDP I gotta setup another perf event buffer capture
>> >> >> > thing too.
>> >> >> 
>> >> >> Dave,
>> >> >> 
>> >> >> Before I start working on v2, I would like to get your feedback on the
>> >> >> high level plan. Also adding Neil who is the maintainer of drop_monitor
>> >> >> (and counterpart DropWatch tool [1]).
>> >> >> 
>> >> >> IIUC, the problem you point out is that users need to use different
>> >> >> tools to monitor packet drops based on where these drops occur
>> >> >> (SW/HW/XDP).
>> >> >> 
>> >> >> Therefore, my plan is to extend the existing drop_monitor netlink
>> >> >> channel to also cover HW drops. I will add a new message type and a new
>> >> >> multicast group for HW drops and encode in the message what is currently
>> >> >> encoded in the devlink events.
>> >> >> 
>> >> > A few things here:
>> >> > IIRC we don't announce individual hardware drops, drivers record them in
>> >> > internal structures, and they are retrieved on demand via ethtool calls, so you
>> >> > will either need to include some polling (probably not a very performant idea),
>> >> > or some sort of flagging mechanism to indicate that on the next message sent to
>> >> > user space you should go retrieve hw stats from a given interface.  I certainly
>> >> > wouldn't mind seeing this happen, but its more work than just adding a new
>> >> > netlink message.
>> >> >
>> >> > Also, regarding XDP drops, we wont see them if the xdp program is offloaded to
>> >> > hardware (you'll need your hw drop gathering mechanism for that), but for xdp
>> >> > programs run on the cpu, dropwatch should alrady catch those.  I.e. if the xdp
>> >> > program returns a DROP result for a packet being processed, the OS will call
>> >> > kfree_skb on its behalf, and dropwatch wil call that.
>> >> 
>> >> There is no skb by the time an XDP program runs, so this is not true. As
>> >> I mentioned upthread, there's a tracepoint that will get called if an
>> >> error occurs (or the program returns XDP_ABORTED), but in most cases,
>> >> XDP_DROP just means that the packet silently disappears...
>> >> 
>> > As I noted, thats only true for xdp programs that are offloaded to hardware, I
>> > was only speaking for XDP programs that run on the cpu.  For the former case, we
>> > obviously need some other mechanism to detect drops, but for cpu executed xdp
>> > programs, the OS is responsible for freeing skbs associated with programs the
>> > return XDP_DROP.
>> 
>> Ah, I think maybe you're thinking of generic XDP (also referred to as
>> skb mode)? That is a separate mode; an XDP program loaded in "native
> Yes, was I not clear about that?

No, not really. "Generic XDP" is not the same as "XDP"; the generic mode
is more of a debug mode (as far as I'm concerned at least). So in the
common case, it is absolutely not the case that the kernel will end up
calling kfree_skb after an XDP_DROP; so I got somewhat thrown off by
your insistence that it would... :)

-Toke
David Miller July 14, 2019, 2:38 a.m. UTC | #20
From: Ido Schimmel <idosch@idosch.org>
Date: Thu, 11 Jul 2019 15:39:09 +0300

> Before I start working on v2, I would like to get your feedback on the
> high level plan. Also adding Neil who is the maintainer of drop_monitor
> (and counterpart DropWatch tool [1]).
> 
> IIUC, the problem you point out is that users need to use different
> tools to monitor packet drops based on where these drops occur
> (SW/HW/XDP).
> 
> Therefore, my plan is to extend the existing drop_monitor netlink
> channel to also cover HW drops. I will add a new message type and a new
> multicast group for HW drops and encode in the message what is currently
> encoded in the devlink events.

Consolidating the reporting to merge with the drop monitor facilities
is definitely a step in the right direction.
Neil Horman July 14, 2019, 11:29 a.m. UTC | #21
On Fri, Jul 12, 2019 at 04:52:30PM +0300, Ido Schimmel wrote:
> On Thu, Jul 11, 2019 at 07:53:54PM -0400, Neil Horman wrote:
> > A few things here:
> > IIRC we don't announce individual hardware drops, drivers record them in
> > internal structures, and they are retrieved on demand via ethtool calls, so you
> > will either need to include some polling (probably not a very performant idea),
> > or some sort of flagging mechanism to indicate that on the next message sent to
> > user space you should go retrieve hw stats from a given interface.  I certainly
> > wouldn't mind seeing this happen, but its more work than just adding a new
> > netlink message.
> 
> Neil,
> 
> The idea of this series is to pass the dropped packets themselves to
> user space along with metadata, such as the drop reason and the ingress
> port. In the future more metadata could be added thanks to the
> extensible nature of netlink.
> 
I had experimented with this idea previously.  Specifically I had investigated
the possibility of creating a dummy net_device that received only dropped
packets so that utilities like tcpdump could monitor the interface for said
packets along with the metadata that described where they dropped.

The concern I had was, as Dave mentioned, that you would wind up with either a
head of line blocking issue, or simply lots of lost "dropped" packets due to
queue overflow on receive, which kind of defeated the purpose of drop monitor.

That said, I like the idea, and if we can find a way around the fact that we
could potentially receive way more dropped packets than we could bounce back to
userspace, it would be a major improvement.

> In v1 these packets were notified to user space as devlink events
> and my plan for v2 is to send them as drop_monitor events, given it's an
> existing generic netlink channel used to monitor SW drops. This will
> allow users to listen on one netlink channel to diagnose potential
> problems in either SW or HW (and hopefully XDP in the future).
> 
Yeah, I'm supportive of that.

> Please note that the packets I'm talking about are packets users
> currently do not see. They are dropped - potentially silently - by the
> underlying device, thereby making it hard to debug whatever issues you
> might be experiencing in your network.
> 
Right I get that, you want the ability to register a listener of sorts to
monitor drops in hardware and report that back to user space as an drop even
with a location that (instead of being a kernel address, is a 'special location'
representing a hardware instance.  Makes sense.  Having that be a location +
counter tuple would make sense, but I don't think we can pass the skb itself (as
you mention above), without seeing significant loss.

> The control path that determines if these packets are even sent to the
> CPU from the HW needs to remain in devlink for the reasons I outlined in
> my previous reply. However, the monitoring of these drops will be over
> drop_monitor. This is similar to what we are currently doing with
> tc-sample, where the control path that triggers the sampling and
> determines the sampling rate and truncation is done over rtnetlink (tc),
> but the sampled packets are notified over the generic netlink psample
> channel.
> 
> To make it more real, you can check the example of the dissected devlink
> message that notifies the drop of a packet due to a multicast source
> MAC: https://marc.info/?l=linux-netdev&m=156248736710238&w=2
> 
> I will obviously have to create another Wireshark dissector for
> drop_monitor in order to get the same information.
> 
yes, Of course.
> > Thats an interesting idea, but dropwatch certainly isn't currently setup for
> > that kind of messaging.  It may be worth creating a v2 of the netlink protocol
> > and really thinking out what you want to communicate.
> 
> I don't think we need a v2 of the netlink protocol. My current plan is
> to extend the existing protocol with: New message type (e.g.,
> NET_DM_CMD_HW_ALERT), new multicast group and a set of attributes to
> encode the information that is currently encoded in the example message
> I pasted above.
> 
Ok, that makes sense.  I think we already do some very rudimentary version of
that (see trace_napi_poll_hit).  Here we check the device we receive frames on
to see if its rx_dropped count has increased, and if it has, store that as a
drop count in the NULL location.  Thats obviously insufficient, but I wonder if
its worth looking at using the dm_hw_stat_delta to encode and record those event
for sending with your new message type.

> Thanks
>
Ido Schimmel July 14, 2019, 12:43 p.m. UTC | #22
On Sun, Jul 14, 2019 at 07:29:04AM -0400, Neil Horman wrote:
> On Fri, Jul 12, 2019 at 04:52:30PM +0300, Ido Schimmel wrote:
> > On Thu, Jul 11, 2019 at 07:53:54PM -0400, Neil Horman wrote:
> > > A few things here:
> > > IIRC we don't announce individual hardware drops, drivers record them in
> > > internal structures, and they are retrieved on demand via ethtool calls, so you
> > > will either need to include some polling (probably not a very performant idea),
> > > or some sort of flagging mechanism to indicate that on the next message sent to
> > > user space you should go retrieve hw stats from a given interface.  I certainly
> > > wouldn't mind seeing this happen, but its more work than just adding a new
> > > netlink message.
> > 
> > Neil,
> > 
> > The idea of this series is to pass the dropped packets themselves to
> > user space along with metadata, such as the drop reason and the ingress
> > port. In the future more metadata could be added thanks to the
> > extensible nature of netlink.
> > 
> I had experimented with this idea previously.  Specifically I had investigated
> the possibility of creating a dummy net_device that received only dropped
> packets so that utilities like tcpdump could monitor the interface for said
> packets along with the metadata that described where they dropped.
> 
> The concern I had was, as Dave mentioned, that you would wind up with either a
> head of line blocking issue, or simply lots of lost "dropped" packets due to
> queue overflow on receive, which kind of defeated the purpose of drop monitor.
> 
> That said, I like the idea, and if we can find a way around the fact that we
> could potentially receive way more dropped packets than we could bounce back to
> userspace, it would be a major improvement.

We don't necessarily need the entire packet, so we could add an option
to truncate them and get only the headers. tc-sample does this (see man
tc-sample).

Also, packets that are dropped for the same reason and belong to the
same flow are not necessarily interesting. We could add an eBPF filter
on the netlink socket which will only allow "unique" packets to be
enqueued. Where "unique" is defined as {drop reason, 5-tuple}. In the
case of SW drops "drop reason" is instruction pointer (call site of
kfree_skb()) and in the case of HW drops it's the drop reason provided
by the device. The duplicate copies will be counted in the eBPF map.

> 
> > In v1 these packets were notified to user space as devlink events
> > and my plan for v2 is to send them as drop_monitor events, given it's an
> > existing generic netlink channel used to monitor SW drops. This will
> > allow users to listen on one netlink channel to diagnose potential
> > problems in either SW or HW (and hopefully XDP in the future).
> > 
> Yeah, I'm supportive of that.
> 
> > Please note that the packets I'm talking about are packets users
> > currently do not see. They are dropped - potentially silently - by the
> > underlying device, thereby making it hard to debug whatever issues you
> > might be experiencing in your network.
> > 
> Right I get that, you want the ability to register a listener of sorts to
> monitor drops in hardware and report that back to user space as an drop even
> with a location that (instead of being a kernel address, is a 'special location'
> representing a hardware instance.  Makes sense.  Having that be a location +
> counter tuple would make sense, but I don't think we can pass the skb itself (as
> you mention above), without seeing significant loss.

Getting the skb itself will be an option users will need to enable in
drop_monitor. If it is not enabled, then default behavior is maintained
and you only get notifications that contain the location + counter tuple
you mentioned.

> 
> > The control path that determines if these packets are even sent to the
> > CPU from the HW needs to remain in devlink for the reasons I outlined in
> > my previous reply. However, the monitoring of these drops will be over
> > drop_monitor. This is similar to what we are currently doing with
> > tc-sample, where the control path that triggers the sampling and
> > determines the sampling rate and truncation is done over rtnetlink (tc),
> > but the sampled packets are notified over the generic netlink psample
> > channel.
> > 
> > To make it more real, you can check the example of the dissected devlink
> > message that notifies the drop of a packet due to a multicast source
> > MAC: https://marc.info/?l=linux-netdev&m=156248736710238&w=2
> > 
> > I will obviously have to create another Wireshark dissector for
> > drop_monitor in order to get the same information.
> > 
> yes, Of course.
> > > Thats an interesting idea, but dropwatch certainly isn't currently setup for
> > > that kind of messaging.  It may be worth creating a v2 of the netlink protocol
> > > and really thinking out what you want to communicate.
> > 
> > I don't think we need a v2 of the netlink protocol. My current plan is
> > to extend the existing protocol with: New message type (e.g.,
> > NET_DM_CMD_HW_ALERT), new multicast group and a set of attributes to
> > encode the information that is currently encoded in the example message
> > I pasted above.
> > 
> Ok, that makes sense.  I think we already do some very rudimentary version of
> that (see trace_napi_poll_hit).  Here we check the device we receive frames on
> to see if its rx_dropped count has increased, and if it has, store that as a
> drop count in the NULL location.  Thats obviously insufficient, but I wonder if
> its worth looking at using the dm_hw_stat_delta to encode and record those event
> for sending with your new message type.

Will check.

Thanks, Neil.