diff mbox series

[v2,net-next] docs: networking: timestamping: add section for stacked PHC devices

Message ID 20200709201733.71874-1-olteanv@gmail.com
State Accepted
Delegated to: David Miller
Headers show
Series [v2,net-next] docs: networking: timestamping: add section for stacked PHC devices | expand

Commit Message

Vladimir Oltean July 9, 2020, 8:17 p.m. UTC
The concept of timestamping DSA switches / Ethernet PHYs is becoming
more and more popular, however the Linux kernel timestamping code has
evolved quite organically and there's layers upon layers of new and old
code that need to work together for things to behave as expected.

Add this chapter to explain what the overall goals are.

Loosely based upon this email discussion plus some more info:
https://lkml.org/lkml/2020/7/6/481

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Richard Cochran <richardcochran@gmail.com>
---
Changes in v2:
Applied Richard's and Sergey's change suggestions.

 Documentation/networking/timestamping.rst | 165 ++++++++++++++++++++++
 1 file changed, 165 insertions(+)

Comments

Jakub Kicinski July 16, 2020, 12:53 a.m. UTC | #1
On Thu,  9 Jul 2020 23:17:33 +0300 Vladimir Oltean wrote:
> The concept of timestamping DSA switches / Ethernet PHYs is becoming
> more and more popular, however the Linux kernel timestamping code has
> evolved quite organically and there's layers upon layers of new and old
> code that need to work together for things to behave as expected.
> 
> Add this chapter to explain what the overall goals are.
> 
> Loosely based upon this email discussion plus some more info:
> https://lkml.org/lkml/2020/7/6/481
> 
> Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
> Reviewed-by: Richard Cochran <richardcochran@gmail.com>

Applied, thanks!
diff mbox series

Patch

diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index 1adead6a4527..03f7beade470 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -589,3 +589,168 @@  Time stamps for outgoing packets are to be generated as follows:
   this would occur at a later time in the processing pipeline than other
   software time stamping and therefore could lead to unexpected deltas
   between time stamps.
+
+3.2 Special considerations for stacked PTP Hardware Clocks
+----------------------------------------------------------
+
+There are situations when there may be more than one PHC (PTP Hardware Clock)
+in the data path of a packet. The kernel has no explicit mechanism to allow the
+user to select which PHC to use for timestamping Ethernet frames. Instead, the
+assumption is that the outermost PHC is always the most preferable, and that
+kernel drivers collaborate towards achieving that goal. Currently there are 3
+cases of stacked PHCs, detailed below:
+
+3.2.1 DSA (Distributed Switch Architecture) switches
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These are Ethernet switches which have one of their ports connected to an
+(otherwise completely unaware) host Ethernet interface, and perform the role of
+a port multiplier with optional forwarding acceleration features.  Each DSA
+switch port is visible to the user as a standalone (virtual) network interface,
+and its network I/O is performed, under the hood, indirectly through the host
+interface (redirecting to the host port on TX, and intercepting frames on RX).
+
+When a DSA switch is attached to a host port, PTP synchronization has to
+suffer, since the switch's variable queuing delay introduces a path delay
+jitter between the host port and its PTP partner. For this reason, some DSA
+switches include a timestamping clock of their own, and have the ability to
+perform network timestamping on their own MAC, such that path delays only
+measure wire and PHY propagation latencies. Timestamping DSA switches are
+supported in Linux and expose the same ABI as any other network interface (save
+for the fact that the DSA interfaces are in fact virtual in terms of network
+I/O, they do have their own PHC).  It is typical, but not mandatory, for all
+interfaces of a DSA switch to share the same PHC.
+
+By design, PTP timestamping with a DSA switch does not need any special
+handling in the driver for the host port it is attached to.  However, when the
+host port also supports PTP timestamping, DSA will take care of intercepting
+the ``.ndo_do_ioctl`` calls towards the host port, and block attempts to enable
+hardware timestamping on it. This is because the SO_TIMESTAMPING API does not
+allow the delivery of multiple hardware timestamps for the same packet, so
+anybody else except for the DSA switch port must be prevented from doing so.
+
+In code, DSA provides for most of the infrastructure for timestamping already,
+in generic code: a BPF classifier (``ptp_classify_raw``) is used to identify
+PTP event messages (any other packets, including PTP general messages, are not
+timestamped), and provides two hooks to drivers:
+
+- ``.port_txtstamp()``: The driver is passed a clone of the timestampable skb
+  to be transmitted, before actually transmitting it. Typically, a switch will
+  have a PTP TX timestamp register (or sometimes a FIFO) where the timestamp
+  becomes available. There may be an IRQ that is raised upon this timestamp's
+  availability, or the driver might have to poll after invoking
+  ``dev_queue_xmit()`` towards the host interface. Either way, in the
+  ``.port_txtstamp()`` method, the driver only needs to save the clone for
+  later use (when the timestamp becomes available). Each skb is annotated with
+  a pointer to its clone, in ``DSA_SKB_CB(skb)->clone``, to ease the driver's
+  job of keeping track of which clone belongs to which skb.
+
+- ``.port_rxtstamp()``: The original (and only) timestampable skb is provided
+  to the driver, for it to annotate it with a timestamp, if that is immediately
+  available, or defer to later. On reception, timestamps might either be
+  available in-band (through metadata in the DSA header, or attached in other
+  ways to the packet), or out-of-band (through another RX timestamping FIFO).
+  Deferral on RX is typically necessary when retrieving the timestamp needs a
+  sleepable context. In that case, it is the responsibility of the DSA driver
+  to call ``netif_rx_ni()`` on the freshly timestamped skb.
+
+3.2.2 Ethernet PHYs
+^^^^^^^^^^^^^^^^^^^
+
+These are devices that typically fulfill a Layer 1 role in the network stack,
+hence they do not have a representation in terms of a network interface as DSA
+switches do. However, PHYs may be able to detect and timestamp PTP packets, for
+performance reasons: timestamps taken as close as possible to the wire have the
+potential to yield a more stable and precise synchronization.
+
+A PHY driver that supports PTP timestamping must create a ``struct
+mii_timestamper`` and add a pointer to it in ``phydev->mii_ts``. The presence
+of this pointer will be checked by the networking stack.
+
+Since PHYs do not have network interface representations, the timestamping and
+ethtool ioctl operations for them need to be mediated by their respective MAC
+driver.  Therefore, as opposed to DSA switches, modifications need to be done
+to each individual MAC driver for PHY timestamping support. This entails:
+
+- Checking, in ``.ndo_do_ioctl``, whether ``phy_has_hwtstamp(netdev->phydev)``
+  is true or not. If it is, then the MAC driver should not process this request
+  but instead pass it on to the PHY using ``phy_mii_ioctl()``.
+
+- On RX, special intervention may or may not be needed, depending on the
+  function used to deliver skb's up the network stack. In the case of plain
+  ``netif_rx()`` and similar, MAC drivers must check whether
+  ``skb_defer_rx_timestamp(skb)`` is necessary or not - and if it is, don't
+  call ``netif_rx()`` at all.  If ``CONFIG_NETWORK_PHY_TIMESTAMPING`` is
+  enabled, and ``skb->dev->phydev->mii_ts`` exists, its ``.rxtstamp()`` hook
+  will be called now, to determine, using logic very similar to DSA, whether
+  deferral for RX timestamping is necessary.  Again like DSA, it becomes the
+  responsibility of the PHY driver to send the packet up the stack when the
+  timestamp is available.
+
+  For other skb receive functions, such as ``napi_gro_receive`` and
+  ``netif_receive_skb``, the stack automatically checks whether
+  ``skb_defer_rx_timestamp()`` is necessary, so this check is not needed inside
+  the driver.
+
+- On TX, again, special intervention might or might not be needed.  The
+  function that calls the ``mii_ts->txtstamp()`` hook is named
+  ``skb_clone_tx_timestamp()``. This function can either be called directly
+  (case in which explicit MAC driver support is indeed needed), but the
+  function also piggybacks from the ``skb_tx_timestamp()`` call, which many MAC
+  drivers already perform for software timestamping purposes. Therefore, if a
+  MAC supports software timestamping, it does not need to do anything further
+  at this stage.
+
+3.2.3 MII bus snooping devices
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These perform the same role as timestamping Ethernet PHYs, save for the fact
+that they are discrete devices and can therefore be used in conjunction with
+any PHY even if it doesn't support timestamping. In Linux, they are
+discoverable and attachable to a ``struct phy_device`` through Device Tree, and
+for the rest, they use the same mii_ts infrastructure as those. See
+Documentation/devicetree/bindings/ptp/timestamper.txt for more details.
+
+3.2.4 Other caveats for MAC drivers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Stacked PHCs, especially DSA (but not only) - since that doesn't require any
+modification to MAC drivers, so it is more difficult to ensure correctness of
+all possible code paths - is that they uncover bugs which were impossible to
+trigger before the existence of stacked PTP clocks.  One example has to do with
+this line of code, already presented earlier::
+
+      skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+
+Any TX timestamping logic, be it a plain MAC driver, a DSA switch driver, a PHY
+driver or a MII bus snooping device driver, should set this flag.
+But a MAC driver that is unaware of PHC stacking might get tripped up by
+somebody other than itself setting this flag, and deliver a duplicate
+timestamp.
+For example, a typical driver design for TX timestamping might be to split the
+transmission part into 2 portions:
+
+1. "TX": checks whether PTP timestamping has been previously enabled through
+   the ``.ndo_do_ioctl`` ("``priv->hwtstamp_tx_enabled == true``") and the
+   current skb requires a TX timestamp ("``skb_shinfo(skb)->tx_flags &
+   SKBTX_HW_TSTAMP``"). If this is true, it sets the
+   "``skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS``" flag. Note: as
+   described above, in the case of a stacked PHC system, this condition should
+   never trigger, as this MAC is certainly not the outermost PHC. But this is
+   not where the typical issue is.  Transmission proceeds with this packet.
+
+2. "TX confirmation": Transmission has finished. The driver checks whether it
+   is necessary to collect any TX timestamp for it. Here is where the typical
+   issues are: the MAC driver takes a shortcut and only checks whether
+   "``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``" was set. With a stacked
+   PHC system, this is incorrect because this MAC driver is not the only entity
+   in the TX data path who could have enabled SKBTX_IN_PROGRESS in the first
+   place.
+
+The correct solution for this problem is for MAC drivers to have a compound
+check in their "TX confirmation" portion, not only for
+"``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``", but also for
+"``priv->hwtstamp_tx_enabled == true``". Because the rest of the system ensures
+that PTP timestamping is not enabled for anything other than the outermost PHC,
+this enhanced check will avoid delivering a duplicated TX timestamp to user
+space.