[net-next,v2,0/3] Arrow SpeedChips XRS700x DSA Driver

Message ID 20201125193740.36825-1-george.mccollister@gmail.com

Message

George McCollister Nov. 25, 2020, 7:37 p.m. UTC
This series adds a DSA driver for the Arrow SpeedChips XRS 7000 series
of HSR/PRP gigabit switch chips.

The chips use Flexibilis IP.
More information can be found here:
 https://www.flexibilis.com/products/speedchips-xrs7000/

The switches have up to three RGMII ports and one MII port and are
managed via MDIO or I2C. They use a one-byte trailing tag to identify
the switch port when in managed mode, so I've added a tag driver that
implements this.
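
The transmit side of that tagger amounts to appending one byte; a
minimal sketch of the idea (the one-hot port byte and the helpers
mirror the existing in-tree tail taggers, so treat the exact layout as
illustrative rather than definitive):

static struct sk_buff *xrs700x_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct dsa_port *dp = dsa_slave_to_port(dev);
        u8 *trailer;

        /* Tailroom is guaranteed by the central TX reallocation in the core. */
        trailer = skb_put(skb, 1);
        trailer[0] = BIT(dp->index);    /* one-hot egress port selector */

        return skb;
}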

This series contains minimal DSA functionality which may be built upon
in future patches. The ultimate goal is to add HSR and PRP
(IEC 62439-3 Clause 5 & 4) offloading with integration into net/hsr.

Changes since v1:
 * Use central TX reallocation in tag driver. (Andrew Lunn)
 * Code style fixes. (Andrew Lunn, Vladimir Oltean)
 * Code simplifications. (Andrew Lunn, Vladimir Oltean)
 * Verify detected switch matches compatible. (Andrew Lunn)
 * Add inbound policy to allow BPDUs. (Andrew Lunn)
 * Move files into their own subdir. (Vladimir Oltean)
 * Automate regmap field allocation. (Vladimir Oltean)
 * Move setting link speed to .mac_link_up. (Vladimir Oltean)
 * Use different compatible strings for e/f variants.

George McCollister (3):
  dsa: add support for Arrow XRS700x tag trailer
  net: dsa: add Arrow SpeedChips XRS700x driver
  dt-bindings: net: dsa: add bindings for xrs700x switches

 .../devicetree/bindings/net/dsa/arrow,xrs700x.yaml |  74 +++
 drivers/net/dsa/Kconfig                            |   2 +
 drivers/net/dsa/Makefile                           |   1 +
 drivers/net/dsa/xrs700x/Kconfig                    |  26 +
 drivers/net/dsa/xrs700x/Makefile                   |   4 +
 drivers/net/dsa/xrs700x/xrs700x.c                  | 585 +++++++++++++++++++++
 drivers/net/dsa/xrs700x/xrs700x.h                  |  38 ++
 drivers/net/dsa/xrs700x/xrs700x_i2c.c              | 150 ++++++
 drivers/net/dsa/xrs700x/xrs700x_mdio.c             | 162 ++++++
 drivers/net/dsa/xrs700x/xrs700x_reg.h              | 205 ++++++++
 include/net/dsa.h                                  |   2 +
 net/dsa/Kconfig                                    |   6 +
 net/dsa/Makefile                                   |   1 +
 net/dsa/tag_xrs700x.c                              |  67 +++
 14 files changed, 1323 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/dsa/arrow,xrs700x.yaml
 create mode 100644 drivers/net/dsa/xrs700x/Kconfig
 create mode 100644 drivers/net/dsa/xrs700x/Makefile
 create mode 100644 drivers/net/dsa/xrs700x/xrs700x.c
 create mode 100644 drivers/net/dsa/xrs700x/xrs700x.h
 create mode 100644 drivers/net/dsa/xrs700x/xrs700x_i2c.c
 create mode 100644 drivers/net/dsa/xrs700x/xrs700x_mdio.c
 create mode 100644 drivers/net/dsa/xrs700x/xrs700x_reg.h
 create mode 100644 net/dsa/tag_xrs700x.c

Comments

Andrew Lunn Nov. 25, 2020, 8:34 p.m. UTC | #1
> +static struct sk_buff *xrs700x_rcv(struct sk_buff *skb, struct net_device *dev,
> +				   struct packet_type *pt)
> +{
> +	int source_port;
> +	u8 *trailer;
> +
> +	if (skb_linearize(skb))
> +		return NULL;

Something for Vladimir:

Could this linearise be moved into the core, depending on the
tail_tag?

> +	if (pskb_trim_rcsum(skb, skb->len - 1))
> +		return NULL;

And the overhead is also in dsa_devlink_ops, so maybe this can be
moved as well?

      Andrew
Andrew Lunn Nov. 25, 2020, 8:34 p.m. UTC | #2
On Wed, Nov 25, 2020 at 01:37:38PM -0600, George McCollister wrote:
> Add support for Arrow SpeedChips XRS700x single byte tag trailer. This
> is modeled on tag_trailer.c which works in a similar way.
> 
> Signed-off-by: George McCollister <george.mccollister@gmail.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew
Florian Fainelli Nov. 26, 2020, 1:31 a.m. UTC | #3
On 11/25/2020 11:37 AM, George McCollister wrote:
> Add support for Arrow SpeedChips XRS700x single byte tag trailer. This
> is modeled on tag_trailer.c which works in a similar way.
> 
> Signed-off-by: George McCollister <george.mccollister@gmail.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Jakub Kicinski Nov. 26, 2020, 1:42 a.m. UTC | #4
On Wed, 25 Nov 2020 13:37:39 -0600 George McCollister wrote:
> Add a driver with initial support for the Arrow SpeedChips XRS7000
> series of gigabit Ethernet switch chips which are typically used in
> critical networking applications.
> 
> The switches have up to three RGMII ports and one RMII port.
> Management to the switches can be performed over i2c or mdio.
> 
> Support for advanced features such as PTP and
> HSR/PRP (IEC 62439-3 Clause 5 & 4) is not included in this patch and
> may be added at a later date.
> 
> Signed-off-by: George McCollister <george.mccollister@gmail.com>

You need to add symbol exports otherwise this won't build with
allmodconfig:

ERROR: modpost: "xrs7004f_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
ERROR: modpost: "xrs7004e_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
ERROR: modpost: "xrs7003f_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
ERROR: modpost: "xrs7003e_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
ERROR: modpost: "xrs7004f_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!
ERROR: modpost: "xrs7004e_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!
ERROR: modpost: "xrs7003f_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!
ERROR: modpost: "xrs7003e_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!

> +	{XRS_RX_UNDERSIZE_L, "rx_undersize"},
> +	{XRS_RX_FRAGMENTS_L, "rx_fragments"},
> +	{XRS_RX_OVERSIZE_L, "rx_oversize"},
> +	{XRS_RX_JABBER_L, "rx_jabber"},
> +	{XRS_RX_ERR_L, "rx_err"},
> +	{XRS_RX_CRC_L, "rx_crc"},

As Vladimir already mentioned to you the statistics which have
corresponding entries in struct rtnl_link_stats64 should be reported
the standard way. The infra for DSA may not be in place yet, so best 
if you just drop those for now.
George McCollister Nov. 26, 2020, 2:25 a.m. UTC | #5
On Wed, Nov 25, 2020 at 7:42 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 25 Nov 2020 13:37:39 -0600 George McCollister wrote:
> > Add a driver with initial support for the Arrow SpeedChips XRS7000
> > series of gigabit Ethernet switch chips which are typically used in
> > critical networking applications.
> >
> > The switches have up to three RGMII ports and one RMII port.
> > Management to the switches can be performed over i2c or mdio.
> >
> > Support for advanced features such as PTP and
> > HSR/PRP (IEC 62439-3 Clause 5 & 4) is not included in this patch and
> > may be added at a later date.
> >
> > Signed-off-by: George McCollister <george.mccollister@gmail.com>
>
> You need to add symbol exports otherwise this won't build with
> allmodconfig:
>
> ERROR: modpost: "xrs7004f_info"
> [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined! ERROR: modpost:
> "xrs7004e_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
> ERROR: modpost: "xrs7003f_info"
> [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined! ERROR: modpost:
> "xrs7003e_info" [drivers/net/dsa/xrs700x/xrs700x_mdio.ko] undefined!
> ERROR: modpost: "xrs7004f_info"
> [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined! ERROR: modpost:
> "xrs7004e_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!
> ERROR: modpost: "xrs7003f_info"
> [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined! ERROR: modpost:
> "xrs7003e_info" [drivers/net/dsa/xrs700x/xrs700x_i2c.ko] undefined!

I was wondering if I possibly needed to do that, but I wasn't getting
any errors with the way I was building.

>
> > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > +     {XRS_RX_ERR_L, "rx_err"},
> > +     {XRS_RX_CRC_L, "rx_crc"},
>
> As Vladimir already mentioned to you the statistics which have
> corresponding entries in struct rtnl_link_stats64 should be reported
> the standard way. The infra for DSA may not be in place yet, so best
> if you just drop those for now.

Okay, that clears it up a bit. Just drop these 6? I'll read through
that thread again and try to make sense of it.

Thanks
Vladimir Oltean Nov. 26, 2020, 1:24 p.m. UTC | #6
On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister wrote:
> > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > +     {XRS_RX_ERR_L, "rx_err"},
> > > +     {XRS_RX_CRC_L, "rx_crc"},
> >
> > As Vladimir already mentioned to you the statistics which have
> > corresponding entries in struct rtnl_link_stats64 should be reported
> > the standard way. The infra for DSA may not be in place yet, so best
> > if you just drop those for now.
> 
> Okay, that clears it up a bit. Just drop these 6? I'll read through
> that thread again and try to make sense of it.

I feel that I should ask. Do you want me to look into exposing RMON
interface counters through rtnetlink (I've never done anything like that
before either, but there's a beginning for everything), or are you going
to?
Vladimir Oltean Nov. 26, 2020, 1:50 p.m. UTC | #7
On Wed, Nov 25, 2020 at 09:34:29PM +0100, Andrew Lunn wrote:
> > +static struct sk_buff *xrs700x_rcv(struct sk_buff *skb, struct net_device *dev,
> > +				   struct packet_type *pt)
> > +{
> > +	int source_port;
> > +	u8 *trailer;
> > +
> > +	if (skb_linearize(skb))
> > +		return NULL;
>
> Something for Vladimir:
>
> Could this linearise be moved into the core, depending on the
> tail_tag?

Honestly I believe that the skb_linearize is not needed at all. It is
copy-pasted from tag_trailer.c, a driver that has not been exercised at
runtime by anybody for a long time now. The pskb_trim_rcsum function
used for removing the tail tag should do the right thing even with
fragmented packets.

> > +	if (pskb_trim_rcsum(skb, skb->len - 1))
> > +		return NULL;
>
> And the overhead is also in dsa_devlink_ops, so maybe this can be
> moved as well?

Sorry, I don't understand this comment.
Andrew Lunn Nov. 26, 2020, 2:01 p.m. UTC | #8
On Thu, Nov 26, 2020 at 03:50:04PM +0200, Vladimir Oltean wrote:
> On Wed, Nov 25, 2020 at 09:34:29PM +0100, Andrew Lunn wrote:
> > > +static struct sk_buff *xrs700x_rcv(struct sk_buff *skb, struct net_device *dev,
> > > +				   struct packet_type *pt)
> > > +{
> > > +	int source_port;
> > > +	u8 *trailer;
> > > +
> > > +	if (skb_linearize(skb))
> > > +		return NULL;
> >
> > Something for Vladimir:
> >
> > Could this linearise be moved into the core, depending on the
> > tail_tag?
> 
> Honestly I believe that the skb_linearize is not needed at all.

Humm

I'm assuming this is here in case the frame is in fragments, and the
trailer could be spread over two fragments? If so, you cannot access
the trailer using straight pointers. Linearize should copy it into one
buffer.

For the normal case of a 1500-byte frame, I doubt we have hardware
which uses multiple scatter/gather buffers. But for jumbo frames?
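
A fragment-safe alternative to linearizing would be to copy the tag
byte out with skb_copy_bits(); a rough sketch (assuming the same
one-hot source-port byte that the transmit side appends):

static struct sk_buff *xrs700x_rcv(struct sk_buff *skb, struct net_device *dev,
                                   struct packet_type *pt)
{
        int source_port;
        u8 tag;

        /* Works whether or not the skb is fragmented. */
        if (skb_copy_bits(skb, skb->len - 1, &tag, 1) < 0)
                return NULL;

        source_port = ffs(tag) - 1;
        if (source_port < 0)
                return NULL;

        /* Single switch, so device index 0. */
        skb->dev = dsa_master_find_slave(dev, 0, source_port);
        if (!skb->dev)
                return NULL;

        /* pskb_trim_rcsum() also copes with fragmented skbs. */
        if (pskb_trim_rcsum(skb, skb->len - 1))
                return NULL;

        return skb;
}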

> > > +	if (pskb_trim_rcsum(skb, skb->len - 1))
> > > +		return NULL;
> >
> > And the overhead is also in dsa_devlink_ops, so maybe this can be
> > moved as well?
> 
> Sorry, I don't understand this comment.

I'm meaning, could that also be moved into the core? We seem to have
the needed information to do it in the core.

    Andrew
Vladimir Oltean Nov. 26, 2020, 5:56 p.m. UTC | #9
On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:
> On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister wrote:
> > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > +     {XRS_RX_CRC_L, "rx_crc"},
> > >
> > > As Vladimir already mentioned to you the statistics which have
> > > corresponding entries in struct rtnl_link_stats64 should be reported
> > > the standard way. The infra for DSA may not be in place yet, so best
> > > if you just drop those for now.
> >
> > Okay, that clears it up a bit. Just drop these 6? I'll read through
> > that thread again and try to make sense of it.
>
> I feel that I should ask. Do you want me to look into exposing RMON
> interface counters through rtnetlink (I've never done anything like that
> before either, but there's a beginning for everything), or are you going
> to?

So I started to add .ndo_get_stats64 based on the hardware counters, but
I already hit the first roadblock, as described by the wise words of
Documentation/networking/statistics.rst:

| The `.ndo_get_stats64` callback can not sleep because of accesses
| via `/proc/net/dev`. If driver may sleep when retrieving the statistics
| from the device it should do so periodically asynchronously and only return
| a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
| allows setting the frequency of refreshing statistics, if needed.


Unfortunately, I feel this is almost unacceptable for a DSA driver that
more often than not needs to retrieve these counters from a slow and
bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
option, because the only periodic interval that would not put absurdly
high pressure on the limited SPI bandwidth would be a readout interval
that gives you very old counters.

What exactly is it that incurs the atomic context? I cannot seem to
figure out from this stack trace:

[  869.692526] ------------[ cut here ]------------
[  869.697174] WARNING: CPU: 0 PID: 444 at kernel/rcu/tree_plugin.h:297 rcu_note_context_switch+0x54/0x438
[  869.706598] Modules linked in:
[  869.709662] CPU: 0 PID: 444 Comm: cat Not tainted 5.10.0-rc5-next-20201126-00006-g0598c9bbacc1-dirty #1452
[  869.724764] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
[  869.730790] pc : rcu_note_context_switch+0x54/0x438
[  869.735681] lr : rcu_note_context_switch+0x44/0x438
[  869.740570] sp : ffff80001039b420
[  869.743889] x29: ffff80001039b420 x28: ffff0f7a046c8e80
[  869.749220] x27: ffffcc70e3fad000 x26: ffff80001039b9d4
[  869.754550] x25: 0000000000000000 x24: ffffcc70e27ae82c
[  869.759879] x23: ffffcc70e3f90000 x22: 0000000000000000
[  869.765208] x21: ffff0f7a02177140 x20: ffff0f7a02177140
[  869.770537] x19: ffff0f7a7b9d3bc0 x18: 00000000ffffffff
[  869.775865] x17: 0000000000000000 x16: 0000000000000000
[  869.781193] x15: 0000000000000004 x14: 0000000000000000
[  869.786523] x13: 0000000000000000 x12: 0000000000000000
[  869.791852] x11: 0000000000000000 x10: 0000000000000000
[  869.797181] x9 : ffff0f7a022d0800 x8 : 0000000000000004
[  869.802510] x7 : 0000000000000004 x6 : ffff80001039b410
[  869.807838] x5 : 0000000000000001 x4 : 0000000000000001
[  869.813168] x3 : 503c00c9a4c6a300 x2 : 0000000000000000
[  869.818496] x1 : ffffcc70e3f90b98 x0 : 0000000000000001
[  869.823826] Call trace:
[  869.826276]  rcu_note_context_switch+0x54/0x438
[  869.830819]  __schedule+0xc0/0x708
[  869.834228]  schedule+0x4c/0x108
[  869.837462]  schedule_timeout+0x1a8/0x320
[  869.841480]  wait_for_completion+0x9c/0x148
[  869.845675]  dspi_transfer_one_message+0x158/0x550
[  869.850480]  __spi_pump_messages+0x208/0x818
[  869.854760]  __spi_sync+0x2a4/0x2e0
[  869.858257]  spi_sync+0x34/0x58
[  869.861404]  spi_sync_transfer+0x94/0xb8
[  869.865337]  sja1105_xfer.isra.1+0x250/0x2e0
[  869.869618]  sja1105_xfer_buf+0x4c/0x60
[  869.873462]  sja1105_port_status_get+0x68/0x8f0
[  869.878004]  sja1105_port_get_stats64+0x58/0x100
[  869.882633]  dsa_slave_get_stats64+0x3c/0x58
[  869.886916]  dev_get_stats+0xc0/0xd8
[  869.890500]  dev_seq_printf_stats+0x44/0x118
[  869.894780]  dev_seq_show+0x30/0x60
[  869.898276]  seq_read_iter+0x330/0x450
[  869.902032]  seq_read+0xf8/0x148
[  869.905268]  proc_reg_read+0xd4/0x110
[  869.908939]  vfs_read+0xac/0x1c8
[  869.912172]  ksys_read+0x74/0xf8
[  869.915405]  __arm64_sys_read+0x24/0x30
[  869.919251]  el0_svc_common.constprop.3+0x80/0x1b0
[  869.924055]  do_el0_svc+0x34/0xa0
[  869.927378]  el0_sync_handler+0x138/0x198
[  869.931397]  el0_sync+0x140/0x180
[  869.934718] ---[ end trace fd45b387ae2c6970 ]---
George McCollister Nov. 26, 2020, 7:07 p.m. UTC | #10
On Thu, Nov 26, 2020 at 11:56 AM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:
> > On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister wrote:
> > > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > > +     {XRS_RX_CRC_L, "rx_crc"},
> > > >
> > > > As Vladimir already mentioned to you the statistics which have
> > > > corresponding entries in struct rtnl_link_stats64 should be reported
> > > > the standard way. The infra for DSA may not be in place yet, so best
> > > > if you just drop those for now.
> > >
> > > Okay, that clears it up a bit. Just drop these 6? I'll read through
> > > that thread again and try to make sense of it.
> >
> > I feel that I should ask. Do you want me to look into exposing RMON
> > interface counters through rtnetlink (I've never done anything like that
> > before either, but there's a beginning for everything), or are you going
> > to?
>
> So I started to add .ndo_get_stats64 based on the hardware counters, but
> I already hit the first roadblock, as described by the wise words of
> Documentation/networking/statistics.rst:
>
> | The `.ndo_get_stats64` callback can not sleep because of accesses
> | via `/proc/net/dev`. If driver may sleep when retrieving the statistics
> | from the device it should do so periodically asynchronously and only return
> | a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
> | allows setting the frequency of refreshing statistics, if needed.
>
>
> Unfortunately, I feel this is almost unacceptable for a DSA driver that
> more often than not needs to retrieve these counters from a slow and
> bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
> option, because the only periodic interval that would not put absurdly
> high pressure on the limited SPI bandwidth would be a readout interval
> that gives you very old counters.

Indeed it seems ndo_get_stats64() usually gets data over something
like a local or PCIe bus or from software. I had a brief look to see
if I could find another driver that was getting the stats over a slow
bus and didn't notice anything. If you haven't already you might do a
quick grep and see if anything pops out to you.

>
> What exactly is it that incurs the atomic context? I cannot seem to
> figure out from this stack trace:

I think something in fs/seq_file.c is taking an RCU lock. I suppose it
doesn't really matter, though, since the documentation says we can't
sleep.

>
> [  869.692526] ------------[ cut here ]------------
> [  869.697174] WARNING: CPU: 0 PID: 444 at kernel/rcu/tree_plugin.h:297 rcu_note_context_switch+0x54/0x438
> [  869.706598] Modules linked in:
> [  869.709662] CPU: 0 PID: 444 Comm: cat Not tainted 5.10.0-rc5-next-20201126-00006-g0598c9bbacc1-dirty #1452
> [  869.724764] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
> [  869.730790] pc : rcu_note_context_switch+0x54/0x438
> [  869.735681] lr : rcu_note_context_switch+0x44/0x438
> [  869.740570] sp : ffff80001039b420
> [  869.743889] x29: ffff80001039b420 x28: ffff0f7a046c8e80
> [  869.749220] x27: ffffcc70e3fad000 x26: ffff80001039b9d4
> [  869.754550] x25: 0000000000000000 x24: ffffcc70e27ae82c
> [  869.759879] x23: ffffcc70e3f90000 x22: 0000000000000000
> [  869.765208] x21: ffff0f7a02177140 x20: ffff0f7a02177140
> [  869.770537] x19: ffff0f7a7b9d3bc0 x18: 00000000ffffffff
> [  869.775865] x17: 0000000000000000 x16: 0000000000000000
> [  869.781193] x15: 0000000000000004 x14: 0000000000000000
> [  869.786523] x13: 0000000000000000 x12: 0000000000000000
> [  869.791852] x11: 0000000000000000 x10: 0000000000000000
> [  869.797181] x9 : ffff0f7a022d0800 x8 : 0000000000000004
> [  869.802510] x7 : 0000000000000004 x6 : ffff80001039b410
> [  869.807838] x5 : 0000000000000001 x4 : 0000000000000001
> [  869.813168] x3 : 503c00c9a4c6a300 x2 : 0000000000000000
> [  869.818496] x1 : ffffcc70e3f90b98 x0 : 0000000000000001
> [  869.823826] Call trace:
> [  869.826276]  rcu_note_context_switch+0x54/0x438
> [  869.830819]  __schedule+0xc0/0x708
> [  869.834228]  schedule+0x4c/0x108
> [  869.837462]  schedule_timeout+0x1a8/0x320
> [  869.841480]  wait_for_completion+0x9c/0x148
> [  869.845675]  dspi_transfer_one_message+0x158/0x550
> [  869.850480]  __spi_pump_messages+0x208/0x818
> [  869.854760]  __spi_sync+0x2a4/0x2e0
> [  869.858257]  spi_sync+0x34/0x58
> [  869.861404]  spi_sync_transfer+0x94/0xb8
> [  869.865337]  sja1105_xfer.isra.1+0x250/0x2e0
> [  869.869618]  sja1105_xfer_buf+0x4c/0x60
> [  869.873462]  sja1105_port_status_get+0x68/0x8f0
> [  869.878004]  sja1105_port_get_stats64+0x58/0x100
> [  869.882633]  dsa_slave_get_stats64+0x3c/0x58
> [  869.886916]  dev_get_stats+0xc0/0xd8
> [  869.890500]  dev_seq_printf_stats+0x44/0x118
> [  869.894780]  dev_seq_show+0x30/0x60
> [  869.898276]  seq_read_iter+0x330/0x450
> [  869.902032]  seq_read+0xf8/0x148
> [  869.905268]  proc_reg_read+0xd4/0x110
> [  869.908939]  vfs_read+0xac/0x1c8
> [  869.912172]  ksys_read+0x74/0xf8
> [  869.915405]  __arm64_sys_read+0x24/0x30
> [  869.919251]  el0_svc_common.constprop.3+0x80/0x1b0
> [  869.924055]  do_el0_svc+0x34/0xa0
> [  869.927378]  el0_sync_handler+0x138/0x198
> [  869.931397]  el0_sync+0x140/0x180
> [  869.934718] ---[ end trace fd45b387ae2c6970 ]---

It does seem to me that this is something that needs to be sorted out
at the subsystem level and that this driver has been "caught in the
crossfire". If working this out will require substantial time, any
guidance on how we could proceed with this driver now and revisit this
once these questions are answered at the subsystem level would be
appreciated.

Thanks for taking the initiative to look into this.

-George
Vladimir Oltean Nov. 26, 2020, 10:05 p.m. UTC | #11
On Thu, Nov 26, 2020 at 01:07:12PM -0600, George McCollister wrote:
> On Thu, Nov 26, 2020 at 11:56 AM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:
> > > On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister wrote:
> > > > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > > > +     {XRS_RX_CRC_L, "rx_crc"},
> > > > >
> > > > > As Vladimir already mentioned to you the statistics which have
> > > > > corresponding entries in struct rtnl_link_stats64 should be reported
> > > > > the standard way. The infra for DSA may not be in place yet, so best
> > > > > if you just drop those for now.
> > > >
> > > > Okay, that clears it up a bit. Just drop these 6? I'll read through
> > > > that thread again and try to make sense of it.
> > >
> > > I feel that I should ask. Do you want me to look into exposing RMON
> > > interface counters through rtnetlink (I've never done anything like that
> > > before either, but there's a beginning for everything), or are you going
> > > to?
> >
> > So I started to add .ndo_get_stats64 based on the hardware counters, but
> > I already hit the first roadblock, as described by the wise words of
> > Documentation/networking/statistics.rst:
> >
> > | The `.ndo_get_stats64` callback can not sleep because of accesses
> > | via `/proc/net/dev`. If driver may sleep when retrieving the statistics
> > | from the device it should do so periodically asynchronously and only return
> > | a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
> > | allows setting the frequency of refreshing statistics, if needed.
> >
> >
> > Unfortunately, I feel this is almost unacceptable for a DSA driver that
> > more often than not needs to retrieve these counters from a slow and
> > bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
> > option, because the only periodic interval that would not put absurdly
> > high pressure on the limited SPI bandwidth would be a readout interval
> > that gives you very old counters.
>
> Indeed it seems ndo_get_stats64() usually gets data over something
> like a local or PCIe bus or from software. I had a brief look to see
> if I could find another driver that was getting the stats over a slow
> bus and didn't notice anything. If you haven't already you might do a
> quick grep and see if anything pops out to you.
>
> >
> > What exactly is it that incurs the atomic context? I cannot seem to
> > figure out from this stack trace:
>
> I think something in fs/seq_file.c is taking an rcu lock.

Not quite. It _is_ the RCU read-side lock that's taken, but it's taken
locally from dev_seq_start in net/core/net-procfs.c. The reason is that
/proc/net/dev iterates through all interfaces from the current netns,
and it is precisely that that creates atomic context. You used to need
to hold the rwlock_t dev_base_lock, but now you can also "get away" with
the RCU read-side lock. Either way, both are atomic context, so it
doesn't help.

commit c6d14c84566d6b70ad9dc1618db0dec87cca9300
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed Nov 4 05:43:23 2009 -0800

    net: Introduce for_each_netdev_rcu() iterator

    Adds RCU management to the list of netdevices.

    Convert some for_each_netdev() users to RCU version, if
    it can avoid read_lock-ing dev_base_lock

    Ie:
            read_lock(&dev_base_loack);
            for_each_netdev(net, dev)
                    some_action();
            read_unlock(&dev_base_lock);

    becomes :

            rcu_read_lock();
            for_each_netdev_rcu(net, dev)
                    some_action();
            rcu_read_unlock();


    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

So... yeah. As long as this kernel interface exists, it needs to run in
atomic context, by construction. Great.

> I suppose it doesn't really matter though since the documentation says
> we can't sleep.

You're talking, I suppose, about these words of wisdom in
Documentation/filesystems/seq_file.rst?

| However, the seq_file code (by design) will not sleep between the calls
| to start() and stop(), so holding a lock during that time is a
| reasonable thing to do. The seq_file code will also avoid taking any
| other locks while the iterator is active.

It _doesn't_ say that you can't sleep between start() and stop(), right?
It just says that if you want to keep the seq_file iterator atomic, the
seq_file code is not sabotaging you by sleeping. But you still could
sleep if you wanted to.

Back to the statistics counters.

How accurate do the counters in /proc/net/dev need to be? What programs
consume those? Could they be more out of date than the ones retrieved
through rtnetlink?

I'm thinking that maybe we could introduce another ndo, something like
.ndo_get_stats64_blocking, that could be called from all places except
from net/core/net-procfs.c. That one could still call the non-blocking
variant. Then, depending on the answer to the question "how inaccurate
could we reasonably leave /proc/net/dev", we could:
- just return zeroes there
- return the counters cached from the last blocking call

> It does seem to me that this is something that needs to be sorted out
> at the subsystem level and that this driver has been "caught in the
> crossfire". Any guidance on how we could proceed with this driver and
> revisit this when we have answers to these questions at the subsystem
> level would be appreciated if substantial time will be required to
> work this out.

Now seriously, who isn't caught in the crossfire here? Let's do some
brainstorming and it will be quick and painless.
Jakub Kicinski Nov. 27, 2020, 6:35 p.m. UTC | #12
On Fri, 27 Nov 2020 00:05:00 +0200 Vladimir Oltean wrote:
> On Thu, Nov 26, 2020 at 01:07:12PM -0600, George McCollister wrote:
> > On Thu, Nov 26, 2020 at 11:56 AM Vladimir Oltean <olteanv@gmail.com> wrote:  
> > > On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:  
> > > > On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister wrote:  
> > > > > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > > > > +     {XRS_RX_CRC_L, "rx_crc"},  
> > > > > >
> > > > > > As Vladimir already mentioned to you the statistics which have
> > > > > > corresponding entries in struct rtnl_link_stats64 should be reported
> > > > > > the standard way. The infra for DSA may not be in place yet, so best
> > > > > > if you just drop those for now.  
> > > > >
> > > > > Okay, that clears it up a bit. Just drop these 6? I'll read through
> > > > > that thread again and try to make sense of it.  
> > > >
> > > > I feel that I should ask. Do you want me to look into exposing RMON
> > > > interface counters through rtnetlink (I've never done anything like that
> > > > before either, but there's a beginning for everything), or are you going
> > > > to?  
> > >
> > > So I started to add .ndo_get_stats64 based on the hardware counters, but
> > > I already hit the first roadblock, as described by the wise words of
> > > Documentation/networking/statistics.rst:
> > >
> > > | The `.ndo_get_stats64` callback can not sleep because of accesses
> > > | via `/proc/net/dev`. If driver may sleep when retrieving the statistics
> > > | from the device it should do so periodically asynchronously and only return
> > > | a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
> > > | allows setting the frequency of refreshing statistics, if needed.

I should have probably also mentioned here that unlike most NDOs
.ndo_get_stats64 is called without rtnl lock held at all.

> > > Unfortunately, I feel this is almost unacceptable for a DSA driver that
> > > more often than not needs to retrieve these counters from a slow and
> > > bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
> > > option, because the only periodic interval that would not put absurdly
> > > high pressure on the limited SPI bandwidth would be a readout interval
> > > that gives you very old counters.  

What's a high interval? It's not uncommon to refresh the stats once a
second even in high performance NICs.

> > Indeed it seems ndo_get_stats64() usually gets data over something
> > like a local or PCIe bus or from software. I had a brief look to see
> > if I could find another driver that was getting the stats over a slow
> > bus and didn't notice anything. If you haven't already you might do a
> > quick grep and see if anything pops out to you.
> >  
> > >
> > > What exactly is it that incurs the atomic context? I cannot seem to
> > > figure out from this stack trace:  
> >
> > I think something in fs/seq_file.c is taking an rcu lock.  
> 
> Not quite. It _is_ the RCU read-side lock that's taken, but it's taken
> locally from dev_seq_start in net/core/net-procfs.c. The reason is that
> /proc/net/dev iterates through all interfaces from the current netns,
> and it is precisely that that creates atomic context. You used to need
> to hold the rwlock_t dev_base_lock, but now you can also "get away" with
> the RCU read-side lock. Either way, both are atomic context, so it
> doesn't help.
> 
> commit c6d14c84566d6b70ad9dc1618db0dec87cca9300
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Wed Nov 4 05:43:23 2009 -0800
> 
>     net: Introduce for_each_netdev_rcu() iterator
> 
>     Adds RCU management to the list of netdevices.
> 
>     Convert some for_each_netdev() users to RCU version, if
>     it can avoid read_lock-ing dev_base_lock
> 
>     Ie:
>             read_lock(&dev_base_loack);
>             for_each_netdev(net, dev)
>                     some_action();
>             read_unlock(&dev_base_lock);
> 
>     becomes :
> 
>             rcu_read_lock();
>             for_each_netdev_rcu(net, dev)
>                     some_action();
>             rcu_read_unlock();
> 
> 
>     Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> So... yeah. As long as this kernel interface exists, it needs to run in
> atomic context, by construction. Great.
>
> > I suppose it doesn't really matter though since the documentation says
> > we can't sleep.  
> 
> You're talking, I suppose, about these words of wisdom in
> Documentation/filesystems/seq_file.rst?
> 
> | However, the seq_file code (by design) will not sleep between the calls
> | to start() and stop(), so holding a lock during that time is a
> | reasonable thing to do. The seq_file code will also avoid taking any
> | other locks while the iterator is active.
> 
> It _doesn't_ say that you can't sleep between start() and stop(), right?
> It just says that if you want to keep the seq_file iterator atomic, the
> seq_file code is not sabotaging you by sleeping. But you still could
> sleep if you wanted to.
> 
> Back to the statistics counters.
> 
> How accurate do the counters in /proc/net/dev need to be? What programs
> consume those? Could they be more out of date than the ones retrieved
> through rtnetlink?

ifconfig does for sure.

> I'm thinking that maybe we could introduce another ndo, something like
> .ndo_get_stats64_blocking, that could be called from all places except
> from net/core/net-procfs.c. That one could still call the non-blocking
> variant. Then, depending on the answer to the question "how inaccurate
> could we reasonably leave /proc/net/dev", we could:
> - just return zeroes there
> - return the counters cached from the last blocking call

I'd rather not introduce divergent behavior like that.

Is the periodic refresh really that awful? We're mostly talking error
counters here so every second or every few seconds should be perfectly
fine.
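
For illustration, the kind of cached, periodically refreshed readout
being suggested could look roughly like the sketch below. The
xrs700x_read_port_counters() helper is hypothetical (it stands in for
the slow-bus MDIO/I2C/SPI readout), and the one-second interval is just
an example:

#include <linux/netdevice.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct xrs_port_stats {
        struct rtnl_link_stats64 cached;        /* last snapshot */
        spinlock_t lock;                        /* protects 'cached' */
        struct delayed_work work;
};

static void xrs_stats_work(struct work_struct *work)
{
        struct delayed_work *dw = to_delayed_work(work);
        struct xrs_port_stats *ps = container_of(dw, struct xrs_port_stats, work);
        struct rtnl_link_stats64 tmp = {};

        /* May sleep: hypothetical helper doing the slow-bus readout. */
        xrs700x_read_port_counters(ps, &tmp);

        spin_lock_bh(&ps->lock);
        ps->cached = tmp;
        spin_unlock_bh(&ps->lock);

        schedule_delayed_work(&ps->work, msecs_to_jiffies(1000));
}

/* Safe from atomic context (e.g. /proc/net/dev): only copies the snapshot. */
static void xrs_get_stats64(struct xrs_port_stats *ps,
                            struct rtnl_link_stats64 *s)
{
        spin_lock_bh(&ps->lock);
        *s = ps->cached;
        spin_unlock_bh(&ps->lock);
}

The work would be set up with INIT_DELAYED_WORK() and kicked off when
the port is brought up, then stopped with cancel_delayed_work_sync()
on teardown.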

> > It does seem to me that this is something that needs to be sorted out
> > at the subsystem level and that this driver has been "caught in the
> > crossfire". Any guidance on how we could proceed with this driver and
> > revisit this when we have answers to these questions at the subsystem
> > level would be appreciated if substantial time will be required to
> > work this out.  
> 
> Now seriously, who isn't caught in the crossfire here? Let's do some
> brainstorming and it will be quick and painless.
Vladimir Oltean Nov. 27, 2020, 7:50 p.m. UTC | #13
On Fri, Nov 27, 2020 at 12:47:41PM -0600, George McCollister wrote:
> On Fri, Nov 27, 2020, 12:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
> 
> > On Fri, 27 Nov 2020 00:05:00 +0200 Vladimir Oltean wrote:
> > > On Thu, Nov 26, 2020 at 01:07:12PM -0600, George McCollister wrote:
> > > > On Thu, Nov 26, 2020 at 11:56 AM Vladimir Oltean <olteanv@gmail.com>
> > wrote:
> > > > > On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:
> > > > > > On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister
> > wrote:
> > > > > > > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > > > > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > > > > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > > > > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > > > > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > > > > > > +     {XRS_RX_CRC_L, "rx_crc"},
> > > > > > > >
> > > > > > > > As Vladimir already mentioned to you the statistics which have
> > > > > > > > corresponding entries in struct rtnl_link_stats64 should be
> > reported
> > > > > > > > the standard way. The infra for DSA may not be in place yet,
> > so best
> > > > > > > > if you just drop those for now.
> > > > > > >
> > > > > > > Okay, that clears it up a bit. Just drop these 6? I'll read
> > through
> > > > > > > that thread again and try to make sense of it.
> > > > > >
> > > > > > I feel that I should ask. Do you want me to look into exposing RMON
> > > > > > interface counters through rtnetlink (I've never done anything
> > like that
> > > > > > before either, but there's a beginning for everything), or are you
> > going
> > > > > > to?
> > > > >
> > > > > So I started to add .ndo_get_stats64 based on the hardware counters,
> > but
> > > > > I already hit the first roadblock, as described by the wise words of
> > > > > Documentation/networking/statistics.rst:
> > > > >
> > > > > | The `.ndo_get_stats64` callback can not sleep because of accesses
> > > > > | via `/proc/net/dev`. If driver may sleep when retrieving the
> > statistics
> > > > > | from the device it should do so periodically asynchronously and
> > only return
> > > > > | a recent copy from `.ndo_get_stats64`. Ethtool interrupt
> > coalescing interface
> > > > > | allows setting the frequency of refreshing statistics, if needed.
> >
> > I should have probably also mentioned here that unlike most NDOs
> > .ndo_get_stats64 is called without rtnl lock held at all.
> >
> > > > > Unfortunately, I feel this is almost unacceptable for a DSA driver
> > that
> > > > > more often than not needs to retrieve these counters from a slow and
> > > > > bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
> > > > > option, because the only periodic interval that would not put
> > absurdly
> > > > > high pressure on the limited SPI bandwidth would be a readout
> > interval
> > > > > that gives you very old counters.
> >
> > What's a high interval? It's not uncommon to refresh the stats once a
> > second even in high performance NICs.
> >
> > > > Indeed it seems ndo_get_stats64() usually gets data over something
> > > > like a local or PCIe bus or from software. I had a brief look to see
> > > > if I could find another driver that was getting the stats over a slow
> > > > bus and didn't notice anything. If you haven't already you might do a
> > > > quick grep and see if anything pops out to you.
> > > >
> > > > >
> > > > > What exactly is it that incurs the atomic context? I cannot seem to
> > > > > figure out from this stack trace:
> > > >
> > > > I think something in fs/seq_file.c is taking an rcu lock.
> > >
> > > Not quite. It _is_ the RCU read-side lock that's taken, but it's taken
> > > locally from dev_seq_start in net/core/net-procfs.c. The reason is that
> > > /proc/net/dev iterates through all interfaces from the current netns,
> > > and it is precisely that that creates atomic context. You used to need
> > > to hold the rwlock_t dev_base_lock, but now you can also "get away" with
> > > the RCU read-side lock. Either way, both are atomic context, so it
> > > doesn't help.
> > >
> > > commit c6d14c84566d6b70ad9dc1618db0dec87cca9300
> > > Author: Eric Dumazet <eric.dumazet@gmail.com>
> > > Date:   Wed Nov 4 05:43:23 2009 -0800
> > >
> > >     net: Introduce for_each_netdev_rcu() iterator
> > >
> > >     Adds RCU management to the list of netdevices.
> > >
> > >     Convert some for_each_netdev() users to RCU version, if
> > >     it can avoid read_lock-ing dev_base_lock
> > >
> > >     Ie:
> > >             read_lock(&dev_base_loack);
> > >             for_each_netdev(net, dev)
> > >                     some_action();
> > >             read_unlock(&dev_base_lock);
> > >
> > >     becomes :
> > >
> > >             rcu_read_lock();
> > >             for_each_netdev_rcu(net, dev)
> > >                     some_action();
> > >             rcu_read_unlock();
> > >
> > >
> > >     Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > >
> > > So... yeah. As long as this kernel interface exists, it needs to run in
> > > atomic context, by construction. Great.
> > >
> > > > I suppose it doesn't really matter though since the documentation says
> > > > we can't sleep.
> > >
> > > You're talking, I suppose, about these words of wisdom in
> > > Documentation/filesystems/seq_file.rst?
> > >
> > > | However, the seq_file code (by design) will not sleep between the calls
> > > | to start() and stop(), so holding a lock during that time is a
> > > | reasonable thing to do. The seq_file code will also avoid taking any
> > > | other locks while the iterator is active.
> > >
> > > It _doesn't_ say that you can't sleep between start() and stop(), right?
> > > It just says that if you want to keep the seq_file iterator atomic, the
> > > seq_file code is not sabotaging you by sleeping. But you still could
> > > sleep if you wanted to.
> > >
> > > Back to the statistics counters.
> > >
> > > How accurate do the counters in /proc/net/dev need to be? What programs
> > > consume those? Could they be more out of date than the ones retrieved
> > > through rtnetlink?
> >
> > ifconfig does for sure.
> >
> > > I'm thinking that maybe we could introduce another ndo, something like
> > > .ndo_get_stats64_blocking, that could be called from all places except
> > > from net/core/net-procfs.c. That one could still call the non-blocking
> > > variant. Then, depending on the answer to the question "how inaccurate
> > > could we reasonably leave /proc/net/dev", we could:
> > > - just return zeroes there
> > > - return the counters cached from the last blocking call
> >
> > I'd rather not introduce divergent behavior like that.
> >
> > Is the periodic refresh really that awful? We're mostly talking error
> > counters here so every second or every few seconds should be perfectly
> > fine.
> >
> 
> That sounds pretty reasonable to me. Worst case is about a 100kbp/s
> management bus and with the amount of data we're talking about that should
> be no problem so long as there's a way to raise the interval or disable it
> entirely.

100 Kbps = 12.5KB/s.
sja1105 has 93 64-bit counters, and during every counter refresh cycle I
would need to get some counters from the beginning of that range, some
from the middle and some from the end. With all the back-and-forth
between the sja1105 driver and the SPI controller driver, and the
protocol overhead associated with creating a "SPI read" message, it is
all in all more efficient to just issue a burst read operation for all
the counters, even ones that I'm not going to use. So let's go with
that, 93x8 bytes (and ignore protocol overhead) = 744 bytes of SPI I/O
per second. At a throughput of 12.5KB/s, that takes 59 ms to complete,
and that's just for the raw I/O, that thing which keeps the SPI mutex
locked. You know what else I could do during that time? Anything else!
Like for example perform PTP timestamp reconstruction, which has a hard
deadline at 135 ms after the packet was received, and would appreciate
if the SPI mutex was not locked for 59 ms every second.
And all of that, for what benefit? Honestly any periodic I/O over the
management interface is too much I/O, unless there is any strong reason
to have it.

Also, even the simple idea of providing out-of-date counters to user
space running in syscall context has me scratching my head. I can only
think of all the drivers in selftests that are checking statistics
counters before, then they send a packet, then they check the counters
after. What do those need to do, put a sleep to make sure the counters
were updated?
Andrew Lunn Nov. 27, 2020, 8:47 p.m. UTC | #14
> Is the periodic refresh really that awful? We're mostly talking error
> counters here so every second or every few seconds should be perfectly
> fine.

Humm, I would prefer error counts to be more correct than anything
else. When debugging issues, you generally don't care how many packets
worked. It is how many failed that you are interested in, and how that
number of failures increases.

So long as these counters are still in ethtool -S, I guess it does not
matter. That I do trust to be accurate, and probably consistent across
the counters it returns.

	Andrew
George McCollister Nov. 27, 2020, 8:58 p.m. UTC | #15
On Fri, Nov 27, 2020 at 1:50 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Fri, Nov 27, 2020 at 12:47:41PM -0600, George McCollister wrote:
> > On Fri, Nov 27, 2020, 12:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > > On Fri, 27 Nov 2020 00:05:00 +0200 Vladimir Oltean wrote:
> > > > On Thu, Nov 26, 2020 at 01:07:12PM -0600, George McCollister wrote:
> > > > > On Thu, Nov 26, 2020 at 11:56 AM Vladimir Oltean <olteanv@gmail.com>
> > > wrote:
> > > > > > On Thu, Nov 26, 2020 at 03:24:18PM +0200, Vladimir Oltean wrote:
> > > > > > > On Wed, Nov 25, 2020 at 08:25:11PM -0600, George McCollister
> > > wrote:
> > > > > > > > > > +     {XRS_RX_UNDERSIZE_L, "rx_undersize"},
> > > > > > > > > > +     {XRS_RX_FRAGMENTS_L, "rx_fragments"},
> > > > > > > > > > +     {XRS_RX_OVERSIZE_L, "rx_oversize"},
> > > > > > > > > > +     {XRS_RX_JABBER_L, "rx_jabber"},
> > > > > > > > > > +     {XRS_RX_ERR_L, "rx_err"},
> > > > > > > > > > +     {XRS_RX_CRC_L, "rx_crc"},
> > > > > > > > >
> > > > > > > > > As Vladimir already mentioned to you the statistics which have
> > > > > > > > > corresponding entries in struct rtnl_link_stats64 should be
> > > reported
> > > > > > > > > the standard way. The infra for DSA may not be in place yet,
> > > so best
> > > > > > > > > if you just drop those for now.
> > > > > > > >
> > > > > > > > Okay, that clears it up a bit. Just drop these 6? I'll read
> > > through
> > > > > > > > that thread again and try to make sense of it.
> > > > > > >
> > > > > > > I feel that I should ask. Do you want me to look into exposing RMON
> > > > > > > interface counters through rtnetlink (I've never done anything
> > > like that
> > > > > > > before either, but there's a beginning for everything), or are you
> > > going
> > > > > > > to?
> > > > > >
> > > > > > So I started to add .ndo_get_stats64 based on the hardware counters,
> > > but
> > > > > > I already hit the first roadblock, as described by the wise words of
> > > > > > Documentation/networking/statistics.rst:
> > > > > >
> > > > > > | The `.ndo_get_stats64` callback can not sleep because of accesses
> > > > > > | via `/proc/net/dev`. If driver may sleep when retrieving the
> > > statistics
> > > > > > | from the device it should do so periodically asynchronously and
> > > only return
> > > > > > | a recent copy from `.ndo_get_stats64`. Ethtool interrupt
> > > coalescing interface
> > > > > > | allows setting the frequency of refreshing statistics, if needed.
> > >
> > > I should have probably also mentioned here that unlike most NDOs
> > > .ndo_get_stats64 is called without rtnl lock held at all.
> > >
> > > > > > Unfortunately, I feel this is almost unacceptable for a DSA driver
> > > that
> > > > > > more often than not needs to retrieve these counters from a slow and
> > > > > > bottlenecked bus (SPI, I2C, MDIO etc). Periodic readouts are not an
> > > > > > option, because the only periodic interval that would not put
> > > absurdly
> > > > > > high pressure on the limited SPI bandwidth would be a readout
> > > interval
> > > > > > that gives you very old counters.
> > >
> > > What's a high interval? It's not uncommon to refresh the stats once a
> > > second even in high performance NICs.
> > >
> > > > > Indeed it seems ndo_get_stats64() usually gets data over something
> > > > > like a local or PCIe bus or from software. I had a brief look to see
> > > > > if I could find another driver that was getting the stats over a slow
> > > > > bus and didn't notice anything. If you haven't already you might do a
> > > > > quick grep and see if anything pops out to you.
> > > > >
> > > > > >
> > > > > > What exactly is it that incurs the atomic context? I cannot seem to
> > > > > > figure out from this stack trace:
> > > > >
> > > > > I think something in fs/seq_file.c is taking an rcu lock.
> > > >
> > > > Not quite. It _is_ the RCU read-side lock that's taken, but it's taken
> > > > locally from dev_seq_start in net/core/net-procfs.c. The reason is that
> > > > /proc/net/dev iterates through all interfaces from the current netns,
> > > > and it is precisely that that creates atomic context. You used to need
> > > > to hold the rwlock_t dev_base_lock, but now you can also "get away" with
> > > > the RCU read-side lock. Either way, both are atomic context, so it
> > > > doesn't help.
> > > >
> > > > commit c6d14c84566d6b70ad9dc1618db0dec87cca9300
> > > > Author: Eric Dumazet <eric.dumazet@gmail.com>
> > > > Date:   Wed Nov 4 05:43:23 2009 -0800
> > > >
> > > >     net: Introduce for_each_netdev_rcu() iterator
> > > >
> > > >     Adds RCU management to the list of netdevices.
> > > >
> > > >     Convert some for_each_netdev() users to RCU version, if
> > > >     it can avoid read_lock-ing dev_base_lock
> > > >
> > > >     Ie:
> > > >             read_lock(&dev_base_loack);
> > > >             for_each_netdev(net, dev)
> > > >                     some_action();
> > > >             read_unlock(&dev_base_lock);
> > > >
> > > >     becomes :
> > > >
> > > >             rcu_read_lock();
> > > >             for_each_netdev_rcu(net, dev)
> > > >                     some_action();
> > > >             rcu_read_unlock();
> > > >
> > > >
> > > >     Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > >
> > > > So... yeah. As long as this kernel interface exists, it needs to run in
> > > > atomic context, by construction. Great.
> > > >
> > > > > I suppose it doesn't really matter though since the documentation says
> > > > > we can't sleep.
> > > >
> > > > You're talking, I suppose, about these words of wisdom in
> > > > Documentation/filesystems/seq_file.rst?
> > > >
> > > > | However, the seq_file code (by design) will not sleep between the calls
> > > > | to start() and stop(), so holding a lock during that time is a
> > > > | reasonable thing to do. The seq_file code will also avoid taking any
> > > > | other locks while the iterator is active.
> > > >
> > > > It _doesn't_ say that you can't sleep between start() and stop(), right?
> > > > It just says that if you want to keep the seq_file iterator atomic, the
> > > > seq_file code is not sabotaging you by sleeping. But you still could
> > > > sleep if you wanted to.
> > > >
> > > > Back to the statistics counters.
> > > >
> > > > How accurate do the counters in /proc/net/dev need to be? What programs
> > > > consume those? Could they be more out of date than the ones retrieved
> > > > through rtnetlink?
> > >
> > > ifconfig does for sure.
> > >
> > > > I'm thinking that maybe we could introduce another ndo, something like
> > > > .ndo_get_stats64_blocking, that could be called from all places except
> > > > from net/core/net-procfs.c. That one could still call the non-blocking
> > > > variant. Then, depending on the answer to the question "how inaccurate
> > > > could we reasonably leave /proc/net/dev", we could:
> > > > - just return zeroes there
> > > > - return the counters cached from the last blocking call
> > >
> > > I'd rather not introduce divergent behavior like that.
> > >
> > > Is the periodic refresh really that awful? We're mostly talking error
> > > counters here so every second or every few seconds should be perfectly
> > > fine.
> > >
> >
> > That sounds pretty reasonable to me. Worst case is about a 100kbp/s
> > management bus and with the amount of data we're talking about that should
> > be no problem so long as there's a way to raise the interval or disable it
> > entirely.
>
> 100 Kbps = 12.5KB/s.
> sja1105 has 93 64-bit counters, and during every counter refresh cycle I

Yeah, that's quite big. The xrs700x counters are only 16 bits wide.
They need to be polled on an interval anyway or they will roll over.
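
Accumulating them into 64-bit software counters is straightforward as
long as the poll period is shorter than the fastest possible wrap;
something like this sketch (names are illustrative, not from the
driver):

struct xrs_counter {
        u16 last;       /* last raw 16-bit value read from hardware */
        u64 total;      /* accumulated value reported to the stack */
};

static void xrs_counter_update(struct xrs_counter *c, u16 raw)
{
        /* 16-bit subtraction absorbs a single wrap since the last poll. */
        c->total += (u16)(raw - c->last);
        c->last = raw;
}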

> would need to get some counters from the beginning of that range, some
> from the middle and some from the end. With all the back-and-forth
> between the sja1105 driver and the SPI controller driver, and the
> protocol overhead associated with creating a "SPI read" message, it is
> all in all more efficient to just issue a burst read operation for all
> the counters, even ones that I'm not going to use. So let's go with
> that, 93x8 bytes (and ignore protocol overhead) = 744 bytes of SPI I/O
> per second. At a throughput of 12.5KB/s, that takes 59 ms to complete,
> and that's just for the raw I/O, that thing which keeps the SPI mutex
> locked. You know what else I could do during that time? Anything else!
> Like for example perform PTP timestamp reconstruction, which has a hard
> deadline at 135 ms after the packet was received, and would appreciate
> if the SPI mutex was not locked for 59 ms every second.

Indeed. However, if you need to acquire this data at all, it's going
to burden the system at that time, so unless you're able to stretch
the reads out over a length of time, whether you're polling every
second or once a day may not matter if you can never afford to miss a
deadline.

> And all of that, for what benefit? Honestly any periodic I/O over the
> management interface is too much I/O, unless there is any strong reason
> to have it.

True enough.

>
> Also, even the simple idea of providing out-of-date counters to user
> space running in syscall context has me scratching my head. I can only
> think of all the drivers in selftests that are checking statistics
> counters before, then they send a packet, then they check the counters
> after. What do those need to do, put a sleep to make sure the counters
> were updated?

Yeah, I don't know enough about how these are used. I'll defer to you.

I apologize if I'm slow at responding. Yesterday and today are
holidays in the US.
Jakub Kicinski Nov. 27, 2020, 9:13 p.m. UTC | #16
On Fri, 27 Nov 2020 21:47:14 +0100 Andrew Lunn wrote:
> > Is the periodic refresh really that awful? We're mostly talking error
> > counters here so every second or every few seconds should be perfectly
> > fine.  
> 
> Humm, i would prefer error counts to be more correct than anything
> else. When debugging issues, you generally don't care how many packets
> worked. It is how many failed you are interesting, and how that number
> of failures increases.

Right, but not sure I'd use the word "correct". Perhaps "immediately up
to date"?

High speed NICs usually go through a layer of firmware before they
access the stats, IOW even if we always synchronously ask for the stats
in the kernel - in practice a lot of NICs (most?) will return some form
of cached information.

> So long as these counters are still in ethtool -S, i guess it does not
> matter. That i do trust to be accurate, and probably consistent across
> the counters it returns.

Not in the NIC designs I'm familiar with.

But anyway - this only matters in some strict testing harness, right?
Normal users will look at stats after they notice issues (so minutes /
hours later) or at the very best they'll look at a graph, which will
hardly require <1sec accuracy as to when the error occurred.
Vladimir Oltean Nov. 27, 2020, 9:23 p.m. UTC | #17
On Fri, Nov 27, 2020 at 01:13:46PM -0800, Jakub Kicinski wrote:
> On Fri, 27 Nov 2020 21:47:14 +0100 Andrew Lunn wrote:
> > > Is the periodic refresh really that awful? We're mostly talking error
> > > counters here so every second or every few seconds should be perfectly
> > > fine.
> >
> > Humm, i would prefer error counts to be more correct than anything
> > else. When debugging issues, you generally don't care how many packets
> > worked. It is how many failed you are interesting, and how that number
> > of failures increases.
>
> Right, but not sure I'd use the word "correct". Perhaps "immediately up
> to date"?
>
> High speed NICs usually go through a layer of firmware before they
> access the stats, IOW even if we always synchronously ask for the stats
> in the kernel - in practice a lot of NICs (most?) will return some form
> of cached information.
>
> > So long as these counters are still in ethtool -S, i guess it does not
> > matter. That i do trust to be accurate, and probably consistent across
> > the counters it returns.
>
> Not in the NIC designs I'm familiar with.
>
> But anyway - this only matters in some strict testing harness, right?
> Normal users will look at a stats after they noticed issues (so minutes
> / hours later) or at the very best they'll look at a graph, which will
> hardly require <1sec accuracy to when error occurred.

Either way, can we conclude that ndo_get_stats64 is not a replacement
for ethtool -S, since the latter is blocking and, if implemented correctly,
can return the counters at the time of the call (therefore making sure
that anything that happened before the syscall has been accounted into
the retrieved values), and the former isn't?

The whole discussion started because you said we shouldn't expose some
statistics counters in ethtool as long as they have a standardized
equivalent. Well, I think we still should.
Andrew Lunn Nov. 27, 2020, 9:32 p.m. UTC | #18
> > So long as these counters are still in ethtool -S, i guess it does not
> > matter. That i do trust to be accurate, and probably consistent across
> > the counters it returns.
> 
> Not in the NIC designs I'm familiar with.

Many NICs have a way to take a hardware snapshot of all counters. You
can then read them out as fast or slow as you want, since you read the
snapshot, not the live counters. As a result you can compare counters
against each other.

> But anyway - this only matters in some strict testing harness, right?
> Normal users will look at stats after they notice issues (so minutes
> / hours later) or at the very best they'll look at a graph, which will
> hardly require <1 sec accuracy relative to when the error occurred.

As Vladimir has pointed out, polling once per second over an i2c bus
is expensive. And there is an obvious linear cost with the number of
ports on these switches. And we need to keep latency down so that PTP
is accurate. Do we really want to be polling, for something which is
very unlikely to be used? I think we should probably take another look
at the locking and see if it can be modified to allow block, so we can
avoid this wasteful polling.

	Andrew
Andrew Lunn Nov. 27, 2020, 9:36 p.m. UTC | #19
> Either way, can we conclude that ndo_get_stats64 is not a replacement
> for ethtool -S, since the latter is blocking and, if implemented correctly,
> can return the counters at the time of the call (therefore making sure
> that anything that happened before the syscall has been accounted into
> the retrieved values), and the former isn't?

ethtool -S is the best source of consistent, up to date statistics we
have. It seems silly not to include everything the hardware offers
there.

	Andrew
Jakub Kicinski Nov. 27, 2020, 9:37 p.m. UTC | #20
Replying to George's email 'cause I didn't get Vladimir's email from
the ML.

On Fri, 27 Nov 2020 14:58:29 -0600 George McCollister wrote:
> > 100 Kbps = 12.5KB/s.
> > sja1105 has 93 64-bit counters, and during every counter refresh cycle I  

Are these 93 for one port? That sounds like a lot.. There're usually 
~10 stats (per port) that are relevant to the standard netdev stats.

> Yeah, that's quite big. The xrs700x counters are only 16 bit. They
> need to be polled on an interval anyway or they will roll.

Yup! That's pretty common.
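
The usual pattern is to fold the narrow hardware reading into a 64-bit
software total on every poll, roughly like this (illustrative only, not
your driver's actual code):

/* Fold a 16-bit hardware counter into a 64-bit accumulator.  Must run
 * at least once per wrap period, i.e. before the hardware counter can
 * advance by 65536, so the delta below stays unambiguous.
 */
static void fold_counter(u64 *total, u16 *last, u16 now)
{
	*total += (u16)(now - *last);	/* modulo-2^16 delta */
	*last = now;
}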

> > would need to get some counters from the beginning of that range, some
> > from the middle and some from the end. With all the back-and-forth
> > between the sja1105 driver and the SPI controller driver, and the
> > protocol overhead associated with creating a "SPI read" message, it is
> > all in all more efficient to just issue a burst read operation for all
> > the counters, even ones that I'm not going to use. So let's go with
> > that, 93x8 bytes (and ignore protocol overhead) = 744 bytes of SPI I/O
> > per second. At a throughput of 12.5KB/s, that takes 59 ms to complete,
> > and that's just for the raw I/O, that thing which keeps the SPI mutex
> > locked. You know what else I could do during that time? Anything else!
> > Like for example perform PTP timestamp reconstruction, which has a hard
> > deadline at 135 ms after the packet was received, and would appreciate
> > if the SPI mutex was not locked for 59 ms every second.  
> 
> Indeed; however, if you need to acquire this data at all, it's going to
> burden the system at that time, so unless you're able to stretch the
> reads out over a length of time, whether you're polling every
> second or once a day may not matter if you're never able to miss a
> deadline.

Exactly, either way you gotta prepare for users polling those stats.
A design where stats are read synchronously and a user (an unprivileged
user, BTW) has the ability to disturb the operation of the system
sounds really flaky.

> > And all of that, for what benefit? Honestly any periodic I/O over the
> > management interface is too much I/O, unless there is any strong reason
> > to have it.  
> 
> True enough.
> 
> > Also, even the simple idea of providing out-of-date counters to user
> > space running in syscall context has me scratching my head. I can only
> > think of all the drivers in selftests that are checking statistics
> > counters before, then they send a packet, then they check the counters
> > after. What do those need to do, put a sleep to make sure the counters
> > were updated?  

Frankly life sounds simpler on the embedded networking planet than it is
on the one I'm living on ;) High speed systems are often eventually
consistent. Either because stats are gathered from HW periodically by
the FW, or RCU grace period has to expire, or workqueue has to run,
etc. etc. I know it's annoying for writing tests but it's manageable. 

If there is a better alternative I'm all ears but having /proc and
ifconfig return zeros for error counts while ip link doesn't will lead
to too much confusion IMO. While delayed update of stats is a fact of
life for _years_ now (hence it was baked into the ethtool -C API).
Jakub Kicinski Nov. 27, 2020, 10:03 p.m. UTC | #21
On Fri, 27 Nov 2020 23:23:42 +0200 Vladimir Oltean wrote:
> On Fri, Nov 27, 2020 at 01:13:46PM -0800, Jakub Kicinski wrote:
> > On Fri, 27 Nov 2020 21:47:14 +0100 Andrew Lunn wrote:  
> > > > Is the periodic refresh really that awful? We're mostly talking error
> > > > counters here so every second or every few seconds should be perfectly
> > > > fine.  
> > >
> > > Humm, i would prefer error counts to be more correct than anything
> > > else. When debugging issues, you generally don't care how many packets
> > > worked. It is how many failed you are interested in, and how that number
> > > of failures increases.  
> >
> > Right, but not sure I'd use the word "correct". Perhaps "immediately up
> > to date"?
> >
> > High speed NICs usually go through a layer of firmware before they
> > access the stats, IOW even if we always synchronously ask for the stats
> > in the kernel - in practice a lot of NICs (most?) will return some form
> > of cached information.
> >  
> > > So long as these counters are still in ethtool -S, i guess it does not
> > > matter. That i do trust to be accurate, and probably consistent across
> > > the counters it returns.  
> >
> > Not in the NIC designs I'm familiar with.
> >
> > But anyway - this only matters in some strict testing harness, right?
> > Normal users will look at stats after they notice issues (so minutes
> > / hours later) or at the very best they'll look at a graph, which will
> > hardly require <1 sec accuracy relative to when the error occurred.
> 
> Either way, can we conclude that ndo_get_stats64 is not a replacement
> for ethtool -S, since the latter is blocking and, if implemented correctly,
> can return the counters at the time of the call (therefore making sure
> that anything that happened before the syscall has been accounted into
> the retrieved values), and the former isn't?

ethtool -S stats are not 100% up to date. Not on Netronome, Intel,
Broadcom or Mellanox NICs AFAIK.

> The whole discussion started because you said we shouldn't expose some
> statistics counters in ethtool as long as they have a standardized
> equivalent. Well, I think we still should.

Users must have access to stats via standard Linux interfaces with
well-defined semantics. We cannot continue to live in a world where the
user has to guess the driver-specific name in ethtool -S to find out the
number of CRC errors...

I know it may not matter to a driver developer, and it didn't matter
much to me when I was one, because in my drivers they always had the
same name. But trying to monitor a fleet of hardware from multiple
vendors is very painful with the status quo; we must do better.
We can't have users scrape through what is basically a debug interface
to get to vital information.

I'd really love to find a way out of the procfs issue, but I'm not sure
if there is one.
Jakub Kicinski Nov. 27, 2020, 10:14 p.m. UTC | #22
On Fri, 27 Nov 2020 22:32:44 +0100 Andrew Lunn wrote:
> > > So long as these counters are still in ethtool -S, i guess it does not
> > > matter. That i do trust to be accurate, and probably consistent across
> > > the counters it returns.  
> > 
> > Not in the NIC designs I'm familiar with.  
> 
> Many NICs have a way to take a hardware snapshot of all counters.
> You can then read them out as fast or slow as you want, since you
> read the snapshot, not the live counters. As a result you can compare
> counters against each other.

Curious, does Marvell HW do it?
 
> > But anyway - this only matters in some strict testing harness,
> > right? Normal users will look at stats after they notice issues
> > (so minutes / hours later) or at the very best they'll look at a
> > graph, which will hardly require <1 sec accuracy relative to when
> > the error occurred.
> 
> As Vladimir has pointed out, polling once per second over an i2c bus
> is expensive. And there is an obvious linear cost with the number of
> ports on these switches. And we need to keep latency down so that PTP
> is accurate. Do we really want to be polling, for something which is
> very unlikely to be used?

IDK I find it very questionable if the system design doesn't take into
account that statistics are retrieved every n seconds. We can perhaps
scale the default period with the speed of the bus?
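
Something like this, with completely made-up numbers, just to illustrate
the idea:

/* Hypothetical heuristic: stretch the default stats poll period on slow
 * buses, but clamp it so narrow hardware counters can't wrap unnoticed
 * between two polls.
 */
static unsigned int stats_poll_interval_ms(unsigned int bus_bits_per_sec)
{
	unsigned int ms = 1000000000U / max(bus_bits_per_sec, 1U);

	return clamp(ms, 1000U, 30000U);	/* 1s on fast buses, 30s worst case */
}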

> I think we should probably take another look at the locking and see
> if it can be modified to allow block, so we can avoid this wasteful
> polling.

It'd be great. Worst case scenario we can have very, very rare polling
+ a synchronous callback? But we shouldn't leave /proc completely
incorrect.

Converting /proc to be blocking may be a little risky: there may be
legacy daemons and other software people have running which read
/proc just to get a list of interfaces or something silly, and which
would suddenly start causing latencies to the entire stack.

But perhaps we can try and find out. I'm not completely opposed.
Vladimir Oltean Nov. 27, 2020, 10:42 p.m. UTC | #23
On Fri, Nov 27, 2020 at 01:37:53PM -0800, Jakub Kicinski wrote:
> Replying to George's email 'cause I didn't get Vladimir's email from
> the ML.
>
> On Fri, 27 Nov 2020 14:58:29 -0600 George McCollister wrote:
> > > 100 Kbps = 12.5KB/s.
> > > sja1105 has 93 64-bit counters, and during every counter refresh cycle I
>
> Are these 93 for one port? That sounds like a lot.. There're usually
> ~10 stats (per port) that are relevant to the standard netdev stats.

I would like to report rx_errors as an aggregate of a few other
discrete counters that are spread far apart. Hence the idea of reading
the entire region dedicated to port counters in a single SPI transaction.
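
Conceptually, the readout ends up looking like this (sja1105 has its own
SPI transfer helpers, so take the regmap call and the register base below
as an illustration only):

/* Read the whole per-port counter block in one burst transaction
 * instead of many small register reads, then pick out the few counters
 * that are actually needed for rx_errors.
 */
static int read_port_counter_block(struct regmap *regmap, unsigned int base,
				   u64 *raw, size_t count)
{
	/* one burst covering the entire counter region of the port */
	return regmap_bulk_read(regmap, base, raw, count);
}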

> > Yeah, that's quite big. The xrs700x counters are only 16 bit. They
> > need to be polled on an interval anyway or they will roll.
>
> Yup! That's pretty common.
>
> > > would need to get some counters from the beginning of that range, some
> > > from the middle and some from the end. With all the back-and-forth
> > > between the sja1105 driver and the SPI controller driver, and the
> > > protocol overhead associated with creating a "SPI read" message, it is
> > > all in all more efficient to just issue a burst read operation for all
> > > the counters, even ones that I'm not going to use. So let's go with
> > > that, 93x8 bytes (and ignore protocol overhead) = 744 bytes of SPI I/O
> > > per second. At a throughput of 12.5KB/s, that takes 59 ms to complete,
> > > and that's just for the raw I/O, that thing which keeps the SPI mutex
> > > locked. You know what else I could do during that time? Anything else!
> > > Like for example perform PTP timestamp reconstruction, which has a hard
> > > deadline at 135 ms after the packet was received, and would appreciate
> > > if the SPI mutex was not locked for 59 ms every second.
> >
> > Indeed; however, if you need to acquire this data at all, it's going to
> > burden the system at that time, so unless you're able to stretch the
> > reads out over a length of time, whether you're polling every
> > second or once a day may not matter if you're never able to miss a
> > deadline.
>
> Exactly, either way you gotta prepare for users polling those stats.
> A design where stats are read synchronously and a user (an unprivileged
> user, BTW) has the ability to disturb the operation of the system
> sounds really flaky.

So I was probably overreacting with the previous estimate, since
- nobody in their right mind is going to use sja1105 at a SPI clock of
  100 kHz, but rather at least 100 times that value.
- other buses, like I2C/MDIO, don't really have the notion of variable
  buffer sizes and burst transactions. So the wait time has a smaller
  upper bound there, because of the implicit "cond_resched" given by the
  need to read word by word. In retrospect it would be wiser to not make
  use of burst reads for getting stats over SPI either.

But to your point: the worst case is worse than periodic readouts, and
the system needs to be resilient against it. Well, some things are
finicky no matter what you do and how well you protect, like phc2sys.
So for those, you need to be more radical, and you need to do cross
timestamping as close to the actual I/O as possible:
https://lwn.net/Articles/798467/
But nonetheless, point taken. As long as the system withstands the worst
case, then the impact it has on latency is not catastrophic, although it
is still there. But who would not like a system which achieves the same
thing by doing less? I feel that not sorting this out now will deter
many DSA driver writers from using ndo_get_stats64 if they don't
otherwise have a reason to schedule a periodic stats workqueue anyway.

> > > And all of that, for what benefit? Honestly any periodic I/O over the
> > > management interface is too much I/O, unless there is any strong reason
> > > to have it.
> >
> > True enough.
> >
> > > Also, even the simple idea of providing out-of-date counters to user
> > > space running in syscall context has me scratching my head. I can only
> > > think of all the drivers in selftests that are checking statistics
> > > counters before, then they send a packet, then they check the counters
> > > after. What do those need to do, put a sleep to make sure the counters
> > > were updated?
>
> Frankly life sounds simpler on the embedded networking planet than it is
> on the one I'm living on ;) High speed systems are often eventually
> consistent. Either because stats are gathered from HW periodically by
> the FW, or RCU grace period has to expire, or workqueue has to run,
> etc. etc. I know it's annoying for writing tests but it's manageable.

Yes, but all of these are problems which I do not have, so I don't see
why I would want to suffer from others' challenges.

> If there is a better alternative I'm all ears but having /proc and
> ifconfig return zeros for error counts while ip link doesn't will lead
> to too much confusion IMO. While delayed update of stats is a fact of
> life for _years_ now (hence it was baked into the ethtool -C API).

I don't know the net/core folder well enough to say anything definitive
about it. To me nothing looked obviously possible.
Although, in the mind of somebody who doesn't understand how other
pieces of code get away with exposing the same object through two
different types of locking (like br_vlan_get_pvid and
br_vlan_get_pvid_rcu), maybe one approach worth trying would be to take
the RTNL in net-procfs.c instead of the RCU read lock. But this is just,
you know, me speculating about things I don't understand.
Andrew Lunn Nov. 27, 2020, 10:46 p.m. UTC | #24
On Fri, Nov 27, 2020 at 02:14:02PM -0800, Jakub Kicinski wrote:
> On Fri, 27 Nov 2020 22:32:44 +0100 Andrew Lunn wrote:
> > > > So long as these counters are still in ethtool -S, i guess it does not
> > > > matter. That i do trust to be accurate, and probably consistent across
> > > > the counters it returns.  
> > > 
> > > Not in the NIC designs I'm familiar with.  
> > 
> > Many NICs have a way to take a hardware snapshot of all counters.
> > You can then read them out as fast or slow as you want, since you
> > read the snapshot, not the live counters. As a result you can compare
> > counters against each other.
> 
> Curious, does Marvell HW do it?

Yes. Every generation of Marvell SOHO switch has this.

> IDK I find it very questionable if the system design doesn't take into
> account that statistics are retrieved every n seconds. We can perhaps
> scale the default period with the speed of the bus?

You don't actually have much choice. I2C is defined to run at
100Kbps. There is a fast mode which runs at 400Kbps. How do you design
around that? MDIO is around 2.5Mbps, but has 50% overhead from the
preamble. SPI can be up to 50Mbps, but there is no standard set of
speeds. None of the data sheets i've seen ever talk about recommended
scheduling polices for transactions over these busses. But testing has
shown, if you want good PTP, you need to keep them loaded as lightly
as possible. If you don't have PTP it becomes less important.

   Andrew
Vladimir Oltean Nov. 27, 2020, 11:21 p.m. UTC | #25
On Fri, Nov 27, 2020 at 01:37:53PM -0800, Jakub Kicinski wrote:
> High speed systems are often eventually consistent. Either because
> stats are gathered from HW periodically by the FW, or RCU grace period
> has to expire, or workqueue has to run, etc. etc. I know it's annoying
> for writing tests but it's manageable.

Out of curiosity, what does a test writer need to do to get out of the
"eventual consistency" conundrum in a portable way and answer the
question "has my packet not been received by the interface or has the
counter just not updated"?
Andrew Lunn Nov. 27, 2020, 11:30 p.m. UTC | #26
> If there is a better alternative I'm all ears but having /proc and
> ifconfig return zeros for error counts while ip link doesn't will lead
> to too much confusion IMO. While delayed update of stats is a fact of
> life for _years_ now (hence it was baked into the ethtool -C API).

How about dev_seq_start() issues a netdev notifier chain event, asking
devices which care to update their cached rtnl_link_stats64 counters.
They can decide if their cache is too old, and do a blocking read for
new values.

Once the notifier has completed, dev_seq_start() can then
rcu_read_lock() and do the actual collection of stats from the drivers
non-blocking.

	Andrew
Vladimir Oltean Nov. 27, 2020, 11:39 p.m. UTC | #27
On Sat, Nov 28, 2020 at 12:30:48AM +0100, Andrew Lunn wrote:
> > If there is a better alternative I'm all ears but having /proc and
> > ifconfig return zeros for error counts while ip link doesn't will lead
> > to too much confusion IMO. While delayed update of stats is a fact of
> > life for _years_ now (hence it was baked into the ethtool -C API).
> 
> How about dev_seq_start() issues a netdev notifier chain event, asking
> devices which care to update their cached rtnl_link_stats64 counters.
> They can decide if their cache is too old, and do a blocking read for
> new values.
> 
> Once the notifier has completed, dev_seq_start() can then
> rcu_read_lock() and do the actual collection of stats from the drivers
> non-blocking.

That sounds smart. I can try to prototype that and see how well it
works, or do you want to?
Jakub Kicinski Nov. 27, 2020, 11:51 p.m. UTC | #28
On Sat, 28 Nov 2020 01:21:40 +0200 Vladimir Oltean wrote:
> On Fri, Nov 27, 2020 at 01:37:53PM -0800, Jakub Kicinski wrote:
> > High speed systems are often eventually consistent. Either because
> > stats are gathered from HW periodically by the FW, or RCU grace period
> > has to expire, or workqueue has to run, etc. etc. I know it's annoying
> > for writing tests but it's manageable.  
> 
> Out of curiosity, what does a test writer need to do to get out of the
> "eventual consistency" conundrum in a portable way and answer the
> question "has my packet not been received by the interface or has the
> counter just not updated"?

Retry for a reasonable amount of time for the system state to converge.
E.g. bpftool_prog_list_wait() in selftests/bpf/test_offloads.py, or
check_tables() in selftests/drivers/net/udp_tunnel_nic.sh.

Sadly I was too lazy to open source the driver tests we built at
Netronome.
Jakub Kicinski Nov. 27, 2020, 11:56 p.m. UTC | #29
What is it with my email today, didn't get this one again.

On Sat, 28 Nov 2020 01:39:16 +0200 Vladimir Oltean wrote:
> On Sat, Nov 28, 2020 at 12:30:48AM +0100, Andrew Lunn wrote:
> > > If there is a better alternative I'm all ears but having /proc and
> > > ifconfig return zeros for error counts while ip link doesn't will lead
> > > to too much confusion IMO. While delayed update of stats is a fact of
> > > life for _years_ now (hence it was baked into the ethtool -C API).
> > 
> > How about dev_seq_start() issues a netdev notifier chain event, asking
> > devices which care to update their cached rtnl_link_stats64 counters.
> > They can decide if their cache is too old, and do a blocking read for
> > new values.

Just to avoid breaking the assumption that seqfiles don't sleep after
.start? Hm. I thought BPF slept (or did cond_resched()) in the middle of
seq iteration. We should double-check that seqfiles really can't sleep.

> > Once the notifier has completed, dev_seq_start() can then
> > rcu_read_lock() and do the actual collection of stats from the drivers
> > non-blocking.
Vladimir Oltean Nov. 28, 2020, 12:02 a.m. UTC | #30
On Sat, Nov 28, 2020 at 01:39:16AM +0200, Vladimir Oltean wrote:
> On Sat, Nov 28, 2020 at 12:30:48AM +0100, Andrew Lunn wrote:
> > > If there is a better alternative I'm all ears but having /proc and
> > > ifconfig return zeros for error counts while ip link doesn't will lead
> > > to too much confusion IMO. While delayed update of stats is a fact of
> > > life for _years_ now (hence it was baked into the ethtool -C API).
> >
> > How about dev_seq_start() issues a netdev notifier chain event, asking
> > devices which care to update their cached rtnl_link_stats64 counters.
> > They can decide if their cache is too old, and do a blocking read for
> > new values.
> >
> > Once the notifier has completed, dev_seq_start() can then
> > rcu_read_lock() and do the actual collection of stats from the drivers
> > non-blocking.
>
> That sounds smart. I can try to prototype that and see how well it
> works, or do you want to?

The situation is like this:

static int call_netdevice_notifiers_info(unsigned long val,
					 struct netdev_notifier_info *info);

expects a non-NULL info->dev argument.

To get a net device you need to call:

#define for_each_netdev(net, d)		\
		list_for_each_entry(d, &(net)->dev_base_head, dev_list)

which has the following protection rules:

/*
 * The @dev_base_head list is protected by @dev_base_lock and the rtnl
 * semaphore.
 *
 * Pure readers hold dev_base_lock for reading, or rcu_read_lock()
 *
 * Writers must hold the rtnl semaphore while they loop through the
 * dev_base_head list, and hold dev_base_lock for writing when they do the
 * actual updates.  This allows pure readers to access the list even
 * while a writer is preparing to update it.
 *
 * To put it another way, dev_base_lock is held for writing only to
 * protect against pure readers; the rtnl semaphore provides the
 * protection against other writers.
 *
 * See, for example usages, register_netdevice() and
 * unregister_netdevice(), which must be called with the rtnl
 * semaphore held.
 */

This means, as far as I understand, 2 things:
1. call_netdevice_notifiers_info doesn't help, since we are back to the
   same problem: we need to walk the device list just to have a dev to
   pass to it
2. I think that holding the RTNL should also be a valid way to iterate
   through the net devices in the current netns, and doing just that
   could be the simplest way out. It certainly worked when I tried it.
   But those could also be famous last words...
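
Option 2 would look roughly like this (untested sketch; the existing
cursor logic over net->dev_base_head stays as it is):

/* Hold the RTNL for the /proc/net/dev walk so the context is sleepable
 * and ndo_get_stats64 implementations are allowed to block.
 */
static void *dev_seq_start(struct seq_file *seq, loff_t *pos)
{
	rtnl_lock();			/* instead of rcu_read_lock() */
	if (!*pos)
		return SEQ_START_TOKEN;
	/* ... existing cursor logic over net->dev_base_head ... */
	return NULL;
}

static void dev_seq_stop(struct seq_file *seq, void *v)
{
	rtnl_unlock();			/* instead of rcu_read_unlock() */
}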
Andrew Lunn Nov. 28, 2020, 12:39 a.m. UTC | #31
> This means, as far as I understand, 2 things:
> 1. call_netdevice_notifiers_info doesn't help, since our problem is the
>    same
> 2. I think that holding the RTNL should also be a valid way to iterate
>    through the net devices in the current netns, and doing just that
>    could be the simplest way out. It certainly worked when I tried it.
>    But those could also be famous last words...

DSA makes the assumption it can block in a notifier change event.  For
example, NETDEV_CHANGEUPPER calls dsa_slave_changeupper() which calls
into the driver. We have not seen any warnings about sleeping while
atomic. So maybe see how NETDEV_CHANGEUPPER is issued.

	Andrew
Vladimir Oltean Nov. 28, 2020, 1:45 a.m. UTC | #32
On Fri, Nov 27, 2020 at 03:56:49PM -0800, Jakub Kicinski wrote:
> What is it with my email today, didn't get this one again.
>
> On Sat, 28 Nov 2020 01:39:16 +0200 Vladimir Oltean wrote:
> > On Sat, Nov 28, 2020 at 12:30:48AM +0100, Andrew Lunn wrote:
> > > > If there is a better alternative I'm all ears but having /proc and
> > > > ifconfig return zeros for error counts while ip link doesn't will lead
> > > > to too much confusion IMO. While delayed update of stats is a fact of
> > > > life for _years_ now (hence it was baked into the ethtool -C API).
> > >
> > > How about dev_seq_start() issues a netdev notifier chain event, asking
> > > devices which care to update their cached rtnl_link_stats64 counters.
> > > They can decide if their cache is too old, and do a blocking read for
> > > new values.
>
> Just to avoid breaking the suggestion that seqfiles don't sleep after
> .start? Hm. I thought BPF slept (or did cond_reshed()) in the middle of
> seq iteration. We should double check that seqfiles really can't sleep.

I don't think that seqfiles are forbidden from sleeping after start(); at
least that's my interpretation, and some tests confirm it. I added a might_sleep()
in quite a few places and there is no issue now that we don't take the
RCU read-side lock any longer. Have you seen my previous reply to
George:

| > I suppose it doesn't really matter though since the documentation says
| > we can't sleep.
|
| You're talking, I suppose, about these words of wisdom in
| Documentation/filesystems/seq_file.rst?
|
| | However, the seq_file code (by design) will not sleep between the calls
| | to start() and stop(), so holding a lock during that time is a
| | reasonable thing to do. The seq_file code will also avoid taking any
| | other locks while the iterator is active.
|
| It _doesn't_ say that you can't sleep between start() and stop(), right?
| It just says that if you want to keep the seq_file iterator atomic, the
| seq_file code is not sabotaging you by sleeping. But you still could
| sleep if you wanted to.
Jakub Kicinski Nov. 28, 2020, 2:15 a.m. UTC | #33
On Sat, 28 Nov 2020 03:41:06 +0200 Vladimir Oltean wrote:
> Jakub, I would like to hear more from you. I would still like to try
> this patch out. You clearly have a lot more background with the code.

Well, I've seen people run into the problem of this NDO not being able
to sleep, but I don't have much background or knowledge of what impact
the locking will have on real systems.

We will need to bring this up with Eric (probably best after the turkey
weekend is over).

In the meantime if you feel like it you may want to add some tracing /
printing to check which processes are accessing /proc/net/dev on your
platforms of interest, see if there is anything surprising.

> You said in an earlier reply that you should have also documented that
> ndo_get_stats64 is one of the few NDOs that does not take the RTNL. Is
> there a particular reason for that being so, and a reason why it can't
> change?

I just meant that as a way of documenting the status quo. I'm not aware
of any other place reading stats under RCU (which doesn't mean it
doesn't exist :)).

That said it is a little tempting to add a new per-netdev mutex here,
instead of congesting the RTNL lock further, since today no correct driver
should depend on the RTNL lock.
George McCollister Nov. 30, 2020, 4:52 p.m. UTC | #34
On Fri, Nov 27, 2020 at 8:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 28 Nov 2020 03:41:06 +0200 Vladimir Oltean wrote:
> > Jakub, I would like to hear more from you. I would still like to try
> > this patch out. You clearly have a lot more background with the code.
>
> Well, I've seen people run into the problem of this NDO not being able
> to sleep, but I don't have much background or knowledge of what impact
> the locking will have on real systems.
>
> We will need to bring this up with Eric (probably best after the turkey
> weekend is over).
>
> In the meantime if you feel like it you may want to add some tracing /
> printing to check which processes are accessing /proc/net/dev on your
> platforms of interest, see if there is anything surprising.
>
> > You said in an earlier reply that you should have also documented that
> > ndo_get_stats64 is one of the few NDOs that does not take the RTNL. Is
> > there a particular reason for that being so, and a reason why it can't
> > change?
>
> I just meant that as a way of documenting the status quo. I'm not aware
> of any other place reading stats under RCU (which doesn't mean it
> doesn't exist :)).
>
> That said it is a little tempting to add a new per-netdev mutex here,
> instead of congesting RTNL lock further, since today no correct driver
> should depend on the RTNL lock.

Another possible option could be replacing for_each_netdev_rcu with
for_each_netdev_srcu and using list_for_each_entry_srcu (though it's
currently used nowhere else in the kernel). Has anyone considered
using sleepable RCUs or thought of a reason they wouldn't work or
wouldn't be desirable? For more info search for SRCU in
Documentation/RCU/RTFP.txt
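
For illustration, the read side would be something like this (untested,
and the SRCU domain name below is made up):

/* Readers may sleep inside an SRCU read-side critical section, so a
 * blocking ndo_get_stats64 would be allowed while walking the list.
 */
DEFINE_SRCU(netdev_list_srcu);

static void walk_netdevs_sleepable(struct net *net)
{
	struct net_device *dev;
	int idx;

	idx = srcu_read_lock(&netdev_list_srcu);
	list_for_each_entry_srcu(dev, &net->dev_base_head, dev_list,
				 srcu_read_lock_held(&netdev_list_srcu)) {
		/* dev_get_stats(dev, ...) may block here */
	}
	srcu_read_unlock(&netdev_list_srcu, idx);
}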
Vladimir Oltean Nov. 30, 2020, 11:50 p.m. UTC | #35
On Mon, Nov 30, 2020 at 10:52:35AM -0600, George McCollister wrote:
> Another possible option could be replacing for_each_netdev_rcu with
> for_each_netdev_srcu and using list_for_each_entry_srcu (though it's
> currently used nowhere else in the kernel). Has anyone considered
> using sleepable RCUs or thought of a reason they wouldn't work or
> wouldn't be desirable? For more info search for SRCU in
> Documentation/RCU/RTFP.txt

Want to take the lead?
George McCollister Nov. 30, 2020, 11:58 p.m. UTC | #36
On Mon, Nov 30, 2020 at 5:50 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 10:52:35AM -0600, George McCollister wrote:
> > Another possible option could be replacing for_each_netdev_rcu with
> > for_each_netdev_srcu and using list_for_each_entry_srcu (though it's
> > currently used nowhere else in the kernel). Has anyone considered
> > using sleepable RCUs or thought of a reason they wouldn't work or
> > wouldn't be desirable? For more info search for SRCU in
> > Documentation/RCU/RTFP.txt
>
> Want to take the lead?

I certainly could take a stab at it. It would be nice to get a
"doesn't sound like a terrible idea at first glance" from Jakub (and
anyone else) before starting on it. Maybe someone has brought this up
before and was shot down for $reason. If so that would be nice to
know. Of course it's possible I could also find some reason it won't
work after investigating/implementing/testing.
Vladimir Oltean Dec. 1, 2020, 12:19 a.m. UTC | #37
On Mon, Nov 30, 2020 at 05:58:17PM -0600, George McCollister wrote:
> On Mon, Nov 30, 2020 at 5:50 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 10:52:35AM -0600, George McCollister wrote:
> > > Another possible option could be replacing for_each_netdev_rcu with
> > > for_each_netdev_srcu and using list_for_each_entry_srcu (though it's
> > > currently used nowhere else in the kernel). Has anyone considered
> > > using sleepable RCUs or thought of a reason they wouldn't work or
> > > wouldn't be desirable? For more info search for SRCU in
> > > Documentation/RCU/RTFP.txt
> >
> > Want to take the lead?
>
> I certainly could take a stab at it. It would be nice to get a
> "doesn't sound like a terrible idea at first glance" from Jakub (and
> anyone else) before starting on it. Maybe someone has brought this up
> before and was shot down for $reason. If so that would be nice to
> know. Of course it's possible I could also find some reason it won't
> work after investigating/implementing/testing.

Well, for one thing, I don't think the benefit justifies the effort, but
then again, it's not my effort... I think there are simpler ways to get
sleepable context for the callers of dev_get_stats, I have some patches
that I'm working on. Not sure if you saw the discussion here.
https://lore.kernel.org/netdev/20201129205817.hti2l4hm2fbp2iwy@skbuf/
Let that not stop you, though.
Vladimir Oltean Dec. 2, 2020, 12:28 a.m. UTC | #38
Hi Jakub,

On Fri, Nov 27, 2020 at 10:36:42PM +0100, Andrew Lunn wrote:
> > Either way, can we conclude that ndo_get_stats64 is not a replacement
> > for ethtool -S, since the latter is blocking and, if implemented correctly,
> > can return the counters at the time of the call (therefore making sure
> > that anything that happened before the syscall has been accounted into
> > the retrieved values), and the former isn't?
>
> ethtool -S is the best source of consistent, up to date statistics we
> have. It seems silly not to include everything the hardware offers
> there.

To add to this, it would seem odd to me if we took the decision to not
expose MAC-level counters any longer in ethtool. Say the MAC has a counter
named rx_dropped. If we are only exposing this counter in ndo_get_stats64,
then we could hit the scenario where this counter keeps incrementing,
but it is the network stack that increments it, and not the MAC.

dev_get_stats() currently does:
	storage->rx_dropped += (unsigned long)atomic_long_read(&dev->rx_dropped);
	storage->tx_dropped += (unsigned long)atomic_long_read(&dev->tx_dropped);
	storage->rx_nohandler += (unsigned long)atomic_long_read(&dev->rx_nohandler);

thereby clobbering the MAC-provided counter. We would not know if it is
a MAC-level drop or not.
Jakub Kicinski Dec. 2, 2020, 12:54 a.m. UTC | #39
On Wed, 2 Dec 2020 02:28:51 +0200 Vladimir Oltean wrote:
> On Fri, Nov 27, 2020 at 10:36:42PM +0100, Andrew Lunn wrote:
> > > Either way, can we conclude that ndo_get_stats64 is not a replacement
> > > for ethtool -S, since the latter is blocking and, if implemented correctly,
> > > can return the counters at the time of the call (therefore making sure
> > > that anything that happened before the syscall has been accounted into
> > > the retrieved values), and the former isn't?  
> >
> > ethtool -S is the best source of consistent, up to date statistics we
> > have. It seems silly not to include everything the hardware offers
> > there.  
> 
> To add to this, it would seem odd to me if we took the decision to not
> expose MAC-level counters any longer in ethtool. Say the MAC has a counter
> named rx_dropped. If we are only exposing this counter in ndo_get_stats64,
> then we could hit the scenario where this counter keeps incrementing,
> but it is the network stack who increments it, and not the MAC.
> 
> dev_get_stats() currently does:
> 	storage->rx_dropped += (unsigned long)atomic_long_read(&dev->rx_dropped);
> 	storage->tx_dropped += (unsigned long)atomic_long_read(&dev->tx_dropped);
> 	storage->rx_nohandler += (unsigned long)atomic_long_read(&dev->rx_nohandler);
> 
> thereby clobbering the MAC-provided counter. We would not know if it is
> a MAC-level drop or not.

Fine granularity HW stats are fine, but the aggregate must be reported
in standard stats first.

The correct stat for MAC drops (AFAIU) is rx_missed.

This should act as a generic "device had to drop valid packets"
indication, and ethtool -S should serve for manual debugging to find
out which stage of the pipeline / which reason caused the drop.
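
In driver terms, something along these lines (the driver-side counter
names are hypothetical):

/* Report MAC-level drops in rx_missed_errors so they are not mixed up
 * with the software rx_dropped that dev_get_stats() adds on top.
 */
static void foo_get_stats64(struct net_device *dev,
			    struct rtnl_link_stats64 *s)
{
	struct foo_port *p = netdev_priv(dev);

	s->rx_packets = p->counters.rx_ok;
	s->rx_errors = p->counters.rx_crc_errors + p->counters.rx_align_errors;
	s->rx_missed_errors = p->counters.rx_overflow_drops;
	/* rx_dropped is left to the stack's own accounting */
}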