diff mbox series

[v3,net-next,18/24] net: dsa: sja1105: Add support for traffic through standalone ports

Message ID 20190413012822.30931-19-olteanv@gmail.com
State Changes Requested
Delegated to: David Miller
Headers show
Series NXP SJA1105 DSA driver | expand

Commit Message

Vladimir Oltean April 13, 2019, 1:28 a.m. UTC
In order to support this, we are creating a make-shift switch tag out of
a VLAN trunk configured on the CPU port. Termination of normal traffic
on switch ports only works when not under a vlan_filtering bridge.
Termination of management (PTP, BPDU) traffic works under all
circumstances because it uses a different tagging mechanism
(incl_srcpt). We are making use of the generic CONFIG_NET_DSA_TAG_8021Q
code and leveraging it from our own CONFIG_NET_DSA_TAG_SJA1105.

There are two types of traffic: regular and link-local.
The link-local traffic received on the CPU port is trapped from the
switch's regular forwarding decisions because it matched one of the two
DMAC filters for management traffic.
On transmission, the switch requires special massaging for these
link-local frames. Due to a weird implementation of the switching IP, by
default it drops link-local frames that originate on the CPU port. It
needs to be told where to forward them to, through an SPI command
("management route") that is valid for only a single frame.
So when we're sending link-local traffic, we need to clone skb's from
DSA and send them in our custom xmit worker that also performs SPI access.

For that purpose, the DSA xmit handler and the xmit worker communicate
through a per-port "skb ring" software structure, with a producer and a
consumer index. At the moment this structure is rather fragile
(ping-flooding to a link-local DMAC would cause most of the frames to
get dropped). I would like to move the management traffic on a separate
netdev queue that I can stop when the skb ring got full and hardware is
busy processing, so that we are not forced to drop traffic.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes in v3:
Made management traffic be receivable on the DSA netdevices even when
switch tagging is disabled, as well as regular traffic be receivable on
the master netdevice in the same scenario. Both are accomplished using
the sja1105_filter() function and some small touch-ups in the .rcv
callback.

Changes in v2:
Fixed a mistake with a missing ntohs(), which was not caught because
the previously used EtherType (0xDADA) was endian-agnostic.

 drivers/net/dsa/sja1105/sja1105.h      |   8 ++
 drivers/net/dsa/sja1105/sja1105_main.c | 125 ++++++++++++++++++++-
 include/linux/dsa/sja1105.h            |  52 +++++++++
 include/net/dsa.h                      |   1 +
 net/dsa/Kconfig                        |   3 +
 net/dsa/Makefile                       |   1 +
 net/dsa/dsa.c                          |   6 +
 net/dsa/dsa_priv.h                     |   3 +
 net/dsa/tag_sja1105.c                  | 148 +++++++++++++++++++++++++
 9 files changed, 345 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/dsa/sja1105.h
 create mode 100644 net/dsa/tag_sja1105.c

Comments

Andrew Lunn April 13, 2019, 4:37 p.m. UTC | #1
On Sat, Apr 13, 2019 at 04:28:16AM +0300, Vladimir Oltean wrote:
> In order to support this, we are creating a make-shift switch tag out of
> a VLAN trunk configured on the CPU port. Termination of normal traffic
> on switch ports only works when not under a vlan_filtering bridge.
> Termination of management (PTP, BPDU) traffic works under all
> circumstances because it uses a different tagging mechanism
> (incl_srcpt). We are making use of the generic CONFIG_NET_DSA_TAG_8021Q
> code and leveraging it from our own CONFIG_NET_DSA_TAG_SJA1105.
> 
> There are two types of traffic: regular and link-local.
> The link-local traffic received on the CPU port is trapped from the
> switch's regular forwarding decisions because it matched one of the two
> DMAC filters for management traffic.
> On transmission, the switch requires special massaging for these
> link-local frames. Due to a weird implementation of the switching IP, by
> default it drops link-local frames that originate on the CPU port. It
> needs to be told where to forward them to, through an SPI command
> ("management route") that is valid for only a single frame.
> So when we're sending link-local traffic, we need to clone skb's from
> DSA and send them in our custom xmit worker that also performs SPI access.
> 
> For that purpose, the DSA xmit handler and the xmit worker communicate
> through a per-port "skb ring" software structure, with a producer and a
> consumer index. At the moment this structure is rather fragile
> (ping-flooding to a link-local DMAC would cause most of the frames to
> get dropped). I would like to move the management traffic on a separate
> netdev queue that I can stop when the skb ring got full and hardware is
> busy processing, so that we are not forced to drop traffic.
> 
> Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
> Changes in v3:
> Made management traffic be receivable on the DSA netdevices even when
> switch tagging is disabled, as well as regular traffic be receivable on
> the master netdevice in the same scenario. Both are accomplished using
> the sja1105_filter() function and some small touch-ups in the .rcv
> callback.

It seems like you made major changes to this. When you do that, you
should drop any reviewed-by tags you have. They are no longer valid
because of the major changes.

>  /* This callback needs to be present */
> @@ -1141,7 +1158,11 @@ static int sja1105_vlan_filtering(struct dsa_switch *ds, int port, bool enabled)
>  	if (rc)
>  		dev_err(ds->dev, "Failed to change VLAN Ethertype\n");
>  
> -	return rc;
> +	/* Switch port identification based on 802.1Q is only passable

possible, not passable.

> +	 * if we are not under a vlan_filtering bridge. So make sure
> +	 * the two configurations are mutually exclusive.
> +	 */
> +	return sja1105_setup_8021q_tagging(ds, !enabled);
>  }
>  
>  static void sja1105_vlan_add(struct dsa_switch *ds, int port,
> @@ -1233,9 +1254,107 @@ static int sja1105_setup(struct dsa_switch *ds)
>  	 */
>  	ds->vlan_filtering_is_global = true;
>  
> +	/* The DSA/switchdev model brings up switch ports in standalone mode by
> +	 * default, and that means vlan_filtering is 0 since they're not under
> +	 * a bridge, so it's safe to set up switch tagging at this time.
> +	 */
> +	return sja1105_setup_8021q_tagging(ds, true);
> +}
> +
> +#include "../../../net/dsa/dsa_priv.h"

No. Don't use relative includes like this.

What do you need from the header? Maybe move it into
include/linux/net/dsa.h

> +/* Deferred work is unfortunately necessary because setting up the management
> + * route cannot be done from atomit context (SPI transfer takes a sleepable
> + * lock on the bus)
> + */
> +static void sja1105_xmit_work_handler(struct work_struct *work)
> +{
> +	struct sja1105_port *sp = container_of(work, struct sja1105_port,
> +						xmit_work);
> +	struct sja1105_private *priv = sp->dp->ds->priv;
> +	struct net_device *slave = sp->dp->slave;
> +	struct net_device *master = dsa_slave_to_master(slave);
> +	int port = (uintptr_t)(sp - priv->ports);
> +	struct sk_buff *skb;
> +	int i, rc;
> +
> +	while ((i = sja1105_skb_ring_get(&sp->xmit_ring, &skb)) >= 0) {
> +		struct sja1105_mgmt_entry mgmt_route = { 0 };
> +		struct ethhdr *hdr;
> +		int timeout = 10;
> +		int skb_len;
> +
> +		skb_len = skb->len;
> +		hdr = eth_hdr(skb);
> +
> +		mgmt_route.macaddr = ether_addr_to_u64(hdr->h_dest);
> +		mgmt_route.destports = BIT(port);
> +		mgmt_route.enfport = 1;
> +		mgmt_route.tsreg = 0;
> +		mgmt_route.takets = false;
> +
> +		rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
> +						  port, &mgmt_route, true);
> +		if (rc < 0) {
> +			kfree_skb(skb);
> +			slave->stats.tx_dropped++;
> +			continue;
> +		}
> +
> +		/* Transfer skb to the host port. */
> +		skb->dev = master;
> +		dev_queue_xmit(skb);
> +
> +		/* Wait until the switch has processed the frame */
> +		do {
> +			rc = sja1105_dynamic_config_read(priv, BLK_IDX_MGMT_ROUTE,
> +							 port, &mgmt_route);
> +			if (rc < 0) {
> +				slave->stats.tx_errors++;
> +				dev_err(priv->ds->dev,
> +					"xmit: failed to poll for mgmt route\n");
> +				continue;
> +			}
> +
> +			/* UM10944: The ENFPORT flag of the respective entry is
> +			 * cleared when a match is found. The host can use this
> +			 * flag as an acknowledgment.
> +			 */
> +			cpu_relax();
> +		} while (mgmt_route.enfport && --timeout);
> +
> +		if (!timeout) {
> +			dev_err(priv->ds->dev, "xmit timed out\n");
> +			slave->stats.tx_errors++;
> +			continue;
> +		}
> +
> +		slave->stats.tx_packets++;
> +		slave->stats.tx_bytes += skb_len;
> +	}
> +}
> +
> +static int sja1105_port_enable(struct dsa_switch *ds, int port,
> +			       struct phy_device *phydev)
> +{
> +	struct sja1105_private *priv = ds->priv;
> +	struct sja1105_port *sp = &priv->ports[port];
> +
> +	sp->dp = &ds->ports[port];
> +	INIT_WORK(&sp->xmit_work, sja1105_xmit_work_handler);
>  	return 0;
>  }

I think i'm missing something here. You have a per port queue of link
local frames which need special handling. And you have a per-port work
queue. To send such a frame, you need to write some register, send the
frame, and then wait until the mgmt_route.enfport is reset.

Why are you doing this per port? How do you stop two ports/work queues
running at the same time? It seems like one queue, with one work queue
would be a better structure.

Also, please move all this code into the tagger. Just add exports for 
sja1105_dynamic_config_write() and sja1105_dynamic_config_read().

> +static void sja1105_port_disable(struct dsa_switch *ds, int port)
> +{
> +	struct sja1105_private *priv = ds->priv;
> +	struct sja1105_port *sp = &priv->ports[port];
> +	struct sk_buff *skb;
> +
> +	cancel_work_sync(&sp->xmit_work);
> +	while (sja1105_skb_ring_get(&sp->xmit_ring, &skb) >= 0)
> +		kfree_skb(skb);
> +}
> +
> diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c
> new file mode 100644
> index 000000000000..5c76a06c9093
> --- /dev/null
> +++ b/net/dsa/tag_sja1105.c
> @@ -0,0 +1,148 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
> + */
> +#include <linux/etherdevice.h>
> +#include <linux/if_vlan.h>
> +#include <linux/dsa/sja1105.h>
> +#include "../../drivers/net/dsa/sja1105/sja1105.h"

Again, no, don't do this.

> +
> +#include "dsa_priv.h"
> +
> +/* Similar to is_link_local_ether_addr(hdr->h_dest) but also covers PTP */
> +static inline bool sja1105_is_link_local(const struct sk_buff *skb)
> +{
> +	const struct ethhdr *hdr = eth_hdr(skb);
> +	u64 dmac = ether_addr_to_u64(hdr->h_dest);
> +
> +	if ((dmac & SJA1105_LINKLOCAL_FILTER_A_MASK) ==
> +		    SJA1105_LINKLOCAL_FILTER_A)
> +		return true;
> +	if ((dmac & SJA1105_LINKLOCAL_FILTER_B_MASK) ==
> +		    SJA1105_LINKLOCAL_FILTER_B)
> +		return true;
> +	return false;
> +}
> +
> +static bool sja1105_filter(const struct sk_buff *skb, struct net_device *dev)
> +{
> +	if (sja1105_is_link_local(skb))
> +		return true;
> +	if (!dev->dsa_ptr->vlan_filtering)
> +		return true;
> +	return false;
> +}

Please add a comment here about what frames cannot be handled by the
tagger. However, i'm not too happy about this design...

> +
> +static struct sk_buff *sja1105_xmit(struct sk_buff *skb,
> +				    struct net_device *netdev)
> +{
> +	struct dsa_port *dp = dsa_slave_to_port(netdev);
> +	struct dsa_switch *ds = dp->ds;
> +	struct sja1105_private *priv = ds->priv;
> +	struct sja1105_port *sp = &priv->ports[dp->index];
> +	struct sk_buff *clone;
> +
> +	if (likely(!sja1105_is_link_local(skb))) {
> +		/* Normal traffic path. */
> +		u16 tx_vid = dsa_tagging_tx_vid(ds, dp->index);
> +		u8 pcp = skb->priority;
> +
> +		/* If we are under a vlan_filtering bridge, IP termination on
> +		 * switch ports based on 802.1Q tags is simply too brittle to
> +		 * be passable. So just defer to the dsa_slave_notag_xmit
> +		 * implementation.
> +		 */
> +		if (dp->vlan_filtering)
> +			return skb;
> +
> +		return dsa_8021q_xmit(skb, netdev, ETH_P_EDSA,
> +				     ((pcp << VLAN_PRIO_SHIFT) | tx_vid));

Please don't reuse ETH_P_EDSA. Define an ETH_P_SJA1105.

> +	}
> +
> +	/* Code path for transmitting management traffic. This does not rely
> +	 * upon switch tagging, but instead SPI-installed management routes.
> +	 */
> +	clone = skb_clone(skb, GFP_ATOMIC);
> +	if (!clone) {
> +		dev_err(ds->dev, "xmit: failed to clone skb\n");
> +		return NULL;
> +	}
> +
> +	if (sja1105_skb_ring_add(&sp->xmit_ring, clone) < 0) {
> +		dev_err(ds->dev, "xmit: skb ring full\n");
> +		kfree_skb(clone);
> +		return NULL;
> +	}
> +
> +	if (sp->xmit_ring.count == SJA1105_SKB_RING_SIZE)
> +		/* TODO setup a dedicated netdev queue for management traffic
> +		 * so that we can selectively apply backpressure and not be
> +		 * required to stop the entire traffic when the software skb
> +		 * ring is full. This requires hooking the ndo_select_queue
> +		 * from DSA and matching on mac_fltres.
> +		 */
> +		dev_err(ds->dev, "xmit: reached maximum skb ring size\n");

This should be rate limited.

     Andrew

> +
> +	schedule_work(&sp->xmit_work);
> +	/* Let DSA free its reference to the skb and we will free
> +	 * the clone in the deferred worker
> +	 */
> +	return NULL;
> +}
> +
> +static struct sk_buff *sja1105_rcv(struct sk_buff *skb,
> +				   struct net_device *netdev,
> +				   struct packet_type *pt)
> +{
> +	unsigned int source_port, switch_id;
> +	struct ethhdr *hdr = eth_hdr(skb);
> +	struct sk_buff *nskb;
> +	u16 tpid, vid, tci;
> +	bool is_tagged;
> +
> +	nskb = dsa_8021q_rcv(skb, netdev, pt, &tpid, &tci);
> +	is_tagged = (nskb && tpid == ETH_P_EDSA);
> +
> +	skb->priority = (tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
> +	vid = tci & VLAN_VID_MASK;
> +
> +	skb->offload_fwd_mark = 1;
> +
> +	if (likely(!sja1105_is_link_local(skb))) {
> +		/* Normal traffic path. */
> +		source_port = dsa_tagging_rx_source_port(vid);
> +		switch_id = dsa_tagging_rx_switch_id(vid);
> +	} else {
> +		/* Management traffic path. Switch embeds the switch ID and
> +		 * port ID into bytes of the destination MAC, courtesy of
> +		 * the incl_srcpt options.
> +		 */
> +		source_port = hdr->h_dest[3];
> +		switch_id = hdr->h_dest[4];
> +		/* Clear the DMAC bytes that were mangled by the switch */
> +		hdr->h_dest[3] = 0;
> +		hdr->h_dest[4] = 0;
> +	}
> +
> +	skb->dev = dsa_master_find_slave(netdev, switch_id, source_port);
> +	if (!skb->dev) {
> +		netdev_warn(netdev, "Couldn't decode source port\n");
> +		return NULL;
> +	}
> +
> +	/* Delete/overwrite fake VLAN header, DSA expects to not find
> +	 * it there, see dsa_switch_rcv: skb_push(skb, ETH_HLEN).
> +	 */
> +	if (is_tagged)
> +		memmove(skb->data - ETH_HLEN, skb->data - ETH_HLEN - VLAN_HLEN,
> +			ETH_HLEN - VLAN_HLEN);
> +
> +	return skb;
> +}
> +
> +const struct dsa_device_ops sja1105_netdev_ops = {
> +	.xmit = sja1105_xmit,
> +	.rcv = sja1105_rcv,
> +	.filter = sja1105_filter,
> +	.overhead = VLAN_HLEN,
> +};
> +
> -- 
> 2.17.1
>
Vladimir Oltean April 13, 2019, 9:27 p.m. UTC | #2
On Sat, 13 Apr 2019 at 19:38, Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Sat, Apr 13, 2019 at 04:28:16AM +0300, Vladimir Oltean wrote:
> > In order to support this, we are creating a make-shift switch tag out of
> > a VLAN trunk configured on the CPU port. Termination of normal traffic
> > on switch ports only works when not under a vlan_filtering bridge.
> > Termination of management (PTP, BPDU) traffic works under all
> > circumstances because it uses a different tagging mechanism
> > (incl_srcpt). We are making use of the generic CONFIG_NET_DSA_TAG_8021Q
> > code and leveraging it from our own CONFIG_NET_DSA_TAG_SJA1105.
> >
> > There are two types of traffic: regular and link-local.
> > The link-local traffic received on the CPU port is trapped from the
> > switch's regular forwarding decisions because it matched one of the two
> > DMAC filters for management traffic.
> > On transmission, the switch requires special massaging for these
> > link-local frames. Due to a weird implementation of the switching IP, by
> > default it drops link-local frames that originate on the CPU port. It
> > needs to be told where to forward them to, through an SPI command
> > ("management route") that is valid for only a single frame.
> > So when we're sending link-local traffic, we need to clone skb's from
> > DSA and send them in our custom xmit worker that also performs SPI access.
> >
> > For that purpose, the DSA xmit handler and the xmit worker communicate
> > through a per-port "skb ring" software structure, with a producer and a
> > consumer index. At the moment this structure is rather fragile
> > (ping-flooding to a link-local DMAC would cause most of the frames to
> > get dropped). I would like to move the management traffic on a separate
> > netdev queue that I can stop when the skb ring got full and hardware is
> > busy processing, so that we are not forced to drop traffic.
> >
> > Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
> > Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
> > ---
> > Changes in v3:
> > Made management traffic be receivable on the DSA netdevices even when
> > switch tagging is disabled, as well as regular traffic be receivable on
> > the master netdevice in the same scenario. Both are accomplished using
> > the sja1105_filter() function and some small touch-ups in the .rcv
> > callback.
>
> It seems like you made major changes to this. When you do that, you
> should drop any reviewed-by tags you have. They are no longer valid
> because of the major changes.
>

Ok, noted.

> >  /* This callback needs to be present */
> > @@ -1141,7 +1158,11 @@ static int sja1105_vlan_filtering(struct dsa_switch *ds, int port, bool enabled)
> >       if (rc)
> >               dev_err(ds->dev, "Failed to change VLAN Ethertype\n");
> >
> > -     return rc;
> > +     /* Switch port identification based on 802.1Q is only passable
>
> possible, not passable.
>

Passable (satisfactory, decent, acceptable) is what I wanted to say.
Tagging using VLANs is possible even when the bridge wants to use
them, but it's smarter not to go there. But I get your point, maybe
I'll rephrase.

> > +      * if we are not under a vlan_filtering bridge. So make sure
> > +      * the two configurations are mutually exclusive.
> > +      */
> > +     return sja1105_setup_8021q_tagging(ds, !enabled);
> >  }
> >
> >  static void sja1105_vlan_add(struct dsa_switch *ds, int port,
> > @@ -1233,9 +1254,107 @@ static int sja1105_setup(struct dsa_switch *ds)
> >        */
> >       ds->vlan_filtering_is_global = true;
> >
> > +     /* The DSA/switchdev model brings up switch ports in standalone mode by
> > +      * default, and that means vlan_filtering is 0 since they're not under
> > +      * a bridge, so it's safe to set up switch tagging at this time.
> > +      */
> > +     return sja1105_setup_8021q_tagging(ds, true);
> > +}
> > +
> > +#include "../../../net/dsa/dsa_priv.h"
>
> No. Don't use relative includes like this.
>
> What do you need from the header? Maybe move it into
> include/linux/net/dsa.h
>

dsa_slave_to_master()

> > +/* Deferred work is unfortunately necessary because setting up the management
> > + * route cannot be done from atomit context (SPI transfer takes a sleepable
> > + * lock on the bus)
> > + */
> > +static void sja1105_xmit_work_handler(struct work_struct *work)
> > +{
> > +     struct sja1105_port *sp = container_of(work, struct sja1105_port,
> > +                                             xmit_work);
> > +     struct sja1105_private *priv = sp->dp->ds->priv;
> > +     struct net_device *slave = sp->dp->slave;
> > +     struct net_device *master = dsa_slave_to_master(slave);
> > +     int port = (uintptr_t)(sp - priv->ports);
> > +     struct sk_buff *skb;
> > +     int i, rc;
> > +
> > +     while ((i = sja1105_skb_ring_get(&sp->xmit_ring, &skb)) >= 0) {
> > +             struct sja1105_mgmt_entry mgmt_route = { 0 };
> > +             struct ethhdr *hdr;
> > +             int timeout = 10;
> > +             int skb_len;
> > +
> > +             skb_len = skb->len;
> > +             hdr = eth_hdr(skb);
> > +
> > +             mgmt_route.macaddr = ether_addr_to_u64(hdr->h_dest);
> > +             mgmt_route.destports = BIT(port);
> > +             mgmt_route.enfport = 1;
> > +             mgmt_route.tsreg = 0;
> > +             mgmt_route.takets = false;
> > +
> > +             rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
> > +                                               port, &mgmt_route, true);
> > +             if (rc < 0) {
> > +                     kfree_skb(skb);
> > +                     slave->stats.tx_dropped++;
> > +                     continue;
> > +             }
> > +
> > +             /* Transfer skb to the host port. */
> > +             skb->dev = master;
> > +             dev_queue_xmit(skb);
> > +
> > +             /* Wait until the switch has processed the frame */
> > +             do {
> > +                     rc = sja1105_dynamic_config_read(priv, BLK_IDX_MGMT_ROUTE,
> > +                                                      port, &mgmt_route);
> > +                     if (rc < 0) {
> > +                             slave->stats.tx_errors++;
> > +                             dev_err(priv->ds->dev,
> > +                                     "xmit: failed to poll for mgmt route\n");
> > +                             continue;
> > +                     }
> > +
> > +                     /* UM10944: The ENFPORT flag of the respective entry is
> > +                      * cleared when a match is found. The host can use this
> > +                      * flag as an acknowledgment.
> > +                      */
> > +                     cpu_relax();
> > +             } while (mgmt_route.enfport && --timeout);
> > +
> > +             if (!timeout) {
> > +                     dev_err(priv->ds->dev, "xmit timed out\n");
> > +                     slave->stats.tx_errors++;
> > +                     continue;
> > +             }
> > +
> > +             slave->stats.tx_packets++;
> > +             slave->stats.tx_bytes += skb_len;
> > +     }
> > +}
> > +
> > +static int sja1105_port_enable(struct dsa_switch *ds, int port,
> > +                            struct phy_device *phydev)
> > +{
> > +     struct sja1105_private *priv = ds->priv;
> > +     struct sja1105_port *sp = &priv->ports[port];
> > +
> > +     sp->dp = &ds->ports[port];
> > +     INIT_WORK(&sp->xmit_work, sja1105_xmit_work_handler);
> >       return 0;
> >  }
>
> I think i'm missing something here. You have a per port queue of link
> local frames which need special handling. And you have a per-port work
> queue. To send such a frame, you need to write some register, send the
> frame, and then wait until the mgmt_route.enfport is reset.
>
> Why are you doing this per port? How do you stop two ports/work queues
> running at the same time? It seems like one queue, with one work queue
> would be a better structure.
>

See the "port" parameter to this call here:

        rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
                          *port*, &mgmt_route, true);

The switch IP aptly allocates 4 slots for management routes. And it's
a 5-port switch where 1 port is the management port. I think the
structure is fine.

> Also, please move all this code into the tagger. Just add exports for
> sja1105_dynamic_config_write() and sja1105_dynamic_config_read().
>

Well, you see, the tagger code is part of the dsa_core object. If I
export function symbols from the driver, those still won't be there if
I compile the driver as a module. On the other hand, the way I'm doing
it, I think the schedule_work() gives me a pretty good separation.

> > +static void sja1105_port_disable(struct dsa_switch *ds, int port)
> > +{
> > +     struct sja1105_private *priv = ds->priv;
> > +     struct sja1105_port *sp = &priv->ports[port];
> > +     struct sk_buff *skb;
> > +
> > +     cancel_work_sync(&sp->xmit_work);
> > +     while (sja1105_skb_ring_get(&sp->xmit_ring, &skb) >= 0)
> > +             kfree_skb(skb);
> > +}
> > +
> > diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c
> > new file mode 100644
> > index 000000000000..5c76a06c9093
> > --- /dev/null
> > +++ b/net/dsa/tag_sja1105.c
> > @@ -0,0 +1,148 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
> > + */
> > +#include <linux/etherdevice.h>
> > +#include <linux/if_vlan.h>
> > +#include <linux/dsa/sja1105.h>
> > +#include "../../drivers/net/dsa/sja1105/sja1105.h"
>
> Again, no, don't do this.
>

This separation between driver and tagger is fairly arbitrary.
I need access to the driver's private structure, in order to get a
hold of the private shadow of the dsa_port. Moving the driver private
structure to include/linux/dsa/ would pull in quite a number of
dependencies. Maybe I could provide declarations for the most of them,
but anyway the private structure wouldn't be so private any longer,
would it?
Otherwise put, would you prefer a dp->priv similar to the already
existing ds->priv? struct sja1105_port is much more lightweight to
keep in include/linux/dsa/.

> > +
> > +#include "dsa_priv.h"
> > +
> > +/* Similar to is_link_local_ether_addr(hdr->h_dest) but also covers PTP */
> > +static inline bool sja1105_is_link_local(const struct sk_buff *skb)
> > +{
> > +     const struct ethhdr *hdr = eth_hdr(skb);
> > +     u64 dmac = ether_addr_to_u64(hdr->h_dest);
> > +
> > +     if ((dmac & SJA1105_LINKLOCAL_FILTER_A_MASK) ==
> > +                 SJA1105_LINKLOCAL_FILTER_A)
> > +             return true;
> > +     if ((dmac & SJA1105_LINKLOCAL_FILTER_B_MASK) ==
> > +                 SJA1105_LINKLOCAL_FILTER_B)
> > +             return true;
> > +     return false;
> > +}
> > +
> > +static bool sja1105_filter(const struct sk_buff *skb, struct net_device *dev)
> > +{
> > +     if (sja1105_is_link_local(skb))
> > +             return true;
> > +     if (!dev->dsa_ptr->vlan_filtering)
> > +             return true;
> > +     return false;
> > +}
>
> Please add a comment here about what frames cannot be handled by the
> tagger. However, i'm not too happy about this design...
>

Ok, let's put this another way.
A switch is primarily a device used to offload the forwarding of
traffic based on L2 rules. Additionally there may be some management
traffic for stuff like STP that needs to be terminated on the host
port of the switch. For that, the hardware's job is to filter and tag
management frames on their way to the host port, and the software's
job is to process the source port and switch id information in a
meaningful way.
Now both this particular switch hardware, and DSA, are taking the
above definitions to extremes.
The switch says: "that's all you want to see? ok, so that's all I'm
going to give you". So its native (hardware) tagging protocol is to
trap link-local traffic and overwrite two bytes of its destination MAC
with the switch ID and the source port. No more, no less. It is an
incomplete solution, but it does the job for practical use cases.
Now DSA says: "I want these to be fully capable net devices, I want
the user to not even realize what's going on under the hood". I don't
think that terminating iperf traffic through switch ports is a
realistic usage scenario. So in a way discussions about performance
and optimizations on DSA hotpath are slightly pointless IMO.
Now what my driver says is that it offers a bit of both. It speaks the
hardware's tagging protocol so it is capable of management traffic,
but it also speaks the DSA paradigm, so in a way pushes the hardware
to work in a mode it was never intended to, by repurposing VLANs when
the user doesn't request them. So on one hand there is some overlap
between the hardware tagging protocol and the VLAN one (in standalone
mode and in VLAN-unaware bridged mode, management traffic *could* use
VLAN tagging but it doesn't rely on it), and on the other hand the
reunion of the two tagging protocols is decent, but still doesn't
cover the entire spectrum (when put under a VLAN-aware bridge, you
lose the ability to decode general traffic). So you'd better not rely
on VLANs to decode the management traffic, because you won't be able
to always rely on that, and that is a shame since a bridge with both
vlan_filtering 1 and stp_state 1 is a real usage scenario, and the
hardware is capable of that combination.
But all of that is secondary. Let's forget about VLAN tagging for a
second and concentrate on the tagging of management traffic. The
limiting factor here is the software architecture of DSA, because in
order for me to decode that in the driver/tagger, I'd have to drop
everything else coming on the master net device (I explained in 13/24
why). I believe that DSA being all-or-nothing about switch tagging is
turning a blind eye to the devices that don't go overboard with
features, and give you what's needed in a real-world design but not
much else.
What would you improve about this design (assuming you're talking
about the filter function)?

Thanks,
-Vladimir





> > +
> > +static struct sk_buff *sja1105_xmit(struct sk_buff *skb,
> > +                                 struct net_device *netdev)
> > +{
> > +     struct dsa_port *dp = dsa_slave_to_port(netdev);
> > +     struct dsa_switch *ds = dp->ds;
> > +     struct sja1105_private *priv = ds->priv;
> > +     struct sja1105_port *sp = &priv->ports[dp->index];
> > +     struct sk_buff *clone;
> > +
> > +     if (likely(!sja1105_is_link_local(skb))) {
> > +             /* Normal traffic path. */
> > +             u16 tx_vid = dsa_tagging_tx_vid(ds, dp->index);
> > +             u8 pcp = skb->priority;
> > +
> > +             /* If we are under a vlan_filtering bridge, IP termination on
> > +              * switch ports based on 802.1Q tags is simply too brittle to
> > +              * be passable. So just defer to the dsa_slave_notag_xmit
> > +              * implementation.
> > +              */
> > +             if (dp->vlan_filtering)
> > +                     return skb;
> > +
> > +             return dsa_8021q_xmit(skb, netdev, ETH_P_EDSA,
> > +                                  ((pcp << VLAN_PRIO_SHIFT) | tx_vid));
>
> Please don't reuse ETH_P_EDSA. Define an ETH_P_SJA1105.
>
> > +     }
> > +
> > +     /* Code path for transmitting management traffic. This does not rely
> > +      * upon switch tagging, but instead SPI-installed management routes.
> > +      */
> > +     clone = skb_clone(skb, GFP_ATOMIC);
> > +     if (!clone) {
> > +             dev_err(ds->dev, "xmit: failed to clone skb\n");
> > +             return NULL;
> > +     }
> > +
> > +     if (sja1105_skb_ring_add(&sp->xmit_ring, clone) < 0) {
> > +             dev_err(ds->dev, "xmit: skb ring full\n");
> > +             kfree_skb(clone);
> > +             return NULL;
> > +     }
> > +
> > +     if (sp->xmit_ring.count == SJA1105_SKB_RING_SIZE)
> > +             /* TODO setup a dedicated netdev queue for management traffic
> > +              * so that we can selectively apply backpressure and not be
> > +              * required to stop the entire traffic when the software skb
> > +              * ring is full. This requires hooking the ndo_select_queue
> > +              * from DSA and matching on mac_fltres.
> > +              */
> > +             dev_err(ds->dev, "xmit: reached maximum skb ring size\n");
>
> This should be rate limited.
>
>      Andrew
>
> > +
> > +     schedule_work(&sp->xmit_work);
> > +     /* Let DSA free its reference to the skb and we will free
> > +      * the clone in the deferred worker
> > +      */
> > +     return NULL;
> > +}
> > +
> > +static struct sk_buff *sja1105_rcv(struct sk_buff *skb,
> > +                                struct net_device *netdev,
> > +                                struct packet_type *pt)
> > +{
> > +     unsigned int source_port, switch_id;
> > +     struct ethhdr *hdr = eth_hdr(skb);
> > +     struct sk_buff *nskb;
> > +     u16 tpid, vid, tci;
> > +     bool is_tagged;
> > +
> > +     nskb = dsa_8021q_rcv(skb, netdev, pt, &tpid, &tci);
> > +     is_tagged = (nskb && tpid == ETH_P_EDSA);
> > +
> > +     skb->priority = (tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
> > +     vid = tci & VLAN_VID_MASK;
> > +
> > +     skb->offload_fwd_mark = 1;
> > +
> > +     if (likely(!sja1105_is_link_local(skb))) {
> > +             /* Normal traffic path. */
> > +             source_port = dsa_tagging_rx_source_port(vid);
> > +             switch_id = dsa_tagging_rx_switch_id(vid);
> > +     } else {
> > +             /* Management traffic path. Switch embeds the switch ID and
> > +              * port ID into bytes of the destination MAC, courtesy of
> > +              * the incl_srcpt options.
> > +              */
> > +             source_port = hdr->h_dest[3];
> > +             switch_id = hdr->h_dest[4];
> > +             /* Clear the DMAC bytes that were mangled by the switch */
> > +             hdr->h_dest[3] = 0;
> > +             hdr->h_dest[4] = 0;
> > +     }
> > +
> > +     skb->dev = dsa_master_find_slave(netdev, switch_id, source_port);
> > +     if (!skb->dev) {
> > +             netdev_warn(netdev, "Couldn't decode source port\n");
> > +             return NULL;
> > +     }
> > +
> > +     /* Delete/overwrite fake VLAN header, DSA expects to not find
> > +      * it there, see dsa_switch_rcv: skb_push(skb, ETH_HLEN).
> > +      */
> > +     if (is_tagged)
> > +             memmove(skb->data - ETH_HLEN, skb->data - ETH_HLEN - VLAN_HLEN,
> > +                     ETH_HLEN - VLAN_HLEN);
> > +
> > +     return skb;
> > +}
> > +
> > +const struct dsa_device_ops sja1105_netdev_ops = {
> > +     .xmit = sja1105_xmit,
> > +     .rcv = sja1105_rcv,
> > +     .filter = sja1105_filter,
> > +     .overhead = VLAN_HLEN,
> > +};
> > +
> > --
> > 2.17.1
> >
Vladimir Oltean April 13, 2019, 10:08 p.m. UTC | #3
On Sun, 14 Apr 2019 at 00:27, Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Sat, 13 Apr 2019 at 19:38, Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > On Sat, Apr 13, 2019 at 04:28:16AM +0300, Vladimir Oltean wrote:
> > > In order to support this, we are creating a make-shift switch tag out of
> > > a VLAN trunk configured on the CPU port. Termination of normal traffic
> > > on switch ports only works when not under a vlan_filtering bridge.
> > > Termination of management (PTP, BPDU) traffic works under all
> > > circumstances because it uses a different tagging mechanism
> > > (incl_srcpt). We are making use of the generic CONFIG_NET_DSA_TAG_8021Q
> > > code and leveraging it from our own CONFIG_NET_DSA_TAG_SJA1105.
> > >
> > > There are two types of traffic: regular and link-local.
> > > The link-local traffic received on the CPU port is trapped from the
> > > switch's regular forwarding decisions because it matched one of the two
> > > DMAC filters for management traffic.
> > > On transmission, the switch requires special massaging for these
> > > link-local frames. Due to a weird implementation of the switching IP, by
> > > default it drops link-local frames that originate on the CPU port. It
> > > needs to be told where to forward them to, through an SPI command
> > > ("management route") that is valid for only a single frame.
> > > So when we're sending link-local traffic, we need to clone skb's from
> > > DSA and send them in our custom xmit worker that also performs SPI access.
> > >
> > > For that purpose, the DSA xmit handler and the xmit worker communicate
> > > through a per-port "skb ring" software structure, with a producer and a
> > > consumer index. At the moment this structure is rather fragile
> > > (ping-flooding to a link-local DMAC would cause most of the frames to
> > > get dropped). I would like to move the management traffic on a separate
> > > netdev queue that I can stop when the skb ring got full and hardware is
> > > busy processing, so that we are not forced to drop traffic.
> > >
> > > Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
> > > Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
> > > ---
> > > Changes in v3:
> > > Made management traffic be receivable on the DSA netdevices even when
> > > switch tagging is disabled, as well as regular traffic be receivable on
> > > the master netdevice in the same scenario. Both are accomplished using
> > > the sja1105_filter() function and some small touch-ups in the .rcv
> > > callback.
> >
> > It seems like you made major changes to this. When you do that, you
> > should drop any reviewed-by tags you have. They are no longer valid
> > because of the major changes.
> >
>
> Ok, noted.
>
> > >  /* This callback needs to be present */
> > > @@ -1141,7 +1158,11 @@ static int sja1105_vlan_filtering(struct dsa_switch *ds, int port, bool enabled)
> > >       if (rc)
> > >               dev_err(ds->dev, "Failed to change VLAN Ethertype\n");
> > >
> > > -     return rc;
> > > +     /* Switch port identification based on 802.1Q is only passable
> >
> > possible, not passable.
> >
>
> Passable (satisfactory, decent, acceptable) is what I wanted to say.
> Tagging using VLANs is possible even when the bridge wants to use
> them, but it's smarter not to go there. But I get your point, maybe
> I'll rephrase.
>
> > > +      * if we are not under a vlan_filtering bridge. So make sure
> > > +      * the two configurations are mutually exclusive.
> > > +      */
> > > +     return sja1105_setup_8021q_tagging(ds, !enabled);
> > >  }
> > >
> > >  static void sja1105_vlan_add(struct dsa_switch *ds, int port,
> > > @@ -1233,9 +1254,107 @@ static int sja1105_setup(struct dsa_switch *ds)
> > >        */
> > >       ds->vlan_filtering_is_global = true;
> > >
> > > +     /* The DSA/switchdev model brings up switch ports in standalone mode by
> > > +      * default, and that means vlan_filtering is 0 since they're not under
> > > +      * a bridge, so it's safe to set up switch tagging at this time.
> > > +      */
> > > +     return sja1105_setup_8021q_tagging(ds, true);
> > > +}
> > > +
> > > +#include "../../../net/dsa/dsa_priv.h"
> >
> > No. Don't use relative includes like this.
> >
> > What do you need from the header? Maybe move it into
> > include/linux/net/dsa.h
> >
>
> dsa_slave_to_master()
>
> > > +/* Deferred work is unfortunately necessary because setting up the management
> > > + * route cannot be done from atomit context (SPI transfer takes a sleepable
> > > + * lock on the bus)
> > > + */
> > > +static void sja1105_xmit_work_handler(struct work_struct *work)
> > > +{
> > > +     struct sja1105_port *sp = container_of(work, struct sja1105_port,
> > > +                                             xmit_work);
> > > +     struct sja1105_private *priv = sp->dp->ds->priv;
> > > +     struct net_device *slave = sp->dp->slave;
> > > +     struct net_device *master = dsa_slave_to_master(slave);
> > > +     int port = (uintptr_t)(sp - priv->ports);
> > > +     struct sk_buff *skb;
> > > +     int i, rc;
> > > +
> > > +     while ((i = sja1105_skb_ring_get(&sp->xmit_ring, &skb)) >= 0) {
> > > +             struct sja1105_mgmt_entry mgmt_route = { 0 };
> > > +             struct ethhdr *hdr;
> > > +             int timeout = 10;
> > > +             int skb_len;
> > > +
> > > +             skb_len = skb->len;
> > > +             hdr = eth_hdr(skb);
> > > +
> > > +             mgmt_route.macaddr = ether_addr_to_u64(hdr->h_dest);
> > > +             mgmt_route.destports = BIT(port);
> > > +             mgmt_route.enfport = 1;
> > > +             mgmt_route.tsreg = 0;
> > > +             mgmt_route.takets = false;
> > > +
> > > +             rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
> > > +                                               port, &mgmt_route, true);
> > > +             if (rc < 0) {
> > > +                     kfree_skb(skb);
> > > +                     slave->stats.tx_dropped++;
> > > +                     continue;
> > > +             }
> > > +
> > > +             /* Transfer skb to the host port. */
> > > +             skb->dev = master;
> > > +             dev_queue_xmit(skb);
> > > +
> > > +             /* Wait until the switch has processed the frame */
> > > +             do {
> > > +                     rc = sja1105_dynamic_config_read(priv, BLK_IDX_MGMT_ROUTE,
> > > +                                                      port, &mgmt_route);
> > > +                     if (rc < 0) {
> > > +                             slave->stats.tx_errors++;
> > > +                             dev_err(priv->ds->dev,
> > > +                                     "xmit: failed to poll for mgmt route\n");
> > > +                             continue;
> > > +                     }
> > > +
> > > +                     /* UM10944: The ENFPORT flag of the respective entry is
> > > +                      * cleared when a match is found. The host can use this
> > > +                      * flag as an acknowledgment.
> > > +                      */
> > > +                     cpu_relax();
> > > +             } while (mgmt_route.enfport && --timeout);
> > > +
> > > +             if (!timeout) {
> > > +                     dev_err(priv->ds->dev, "xmit timed out\n");
> > > +                     slave->stats.tx_errors++;
> > > +                     continue;
> > > +             }
> > > +
> > > +             slave->stats.tx_packets++;
> > > +             slave->stats.tx_bytes += skb_len;
> > > +     }
> > > +}
> > > +
> > > +static int sja1105_port_enable(struct dsa_switch *ds, int port,
> > > +                            struct phy_device *phydev)
> > > +{
> > > +     struct sja1105_private *priv = ds->priv;
> > > +     struct sja1105_port *sp = &priv->ports[port];
> > > +
> > > +     sp->dp = &ds->ports[port];
> > > +     INIT_WORK(&sp->xmit_work, sja1105_xmit_work_handler);
> > >       return 0;
> > >  }
> >
> > I think i'm missing something here. You have a per port queue of link
> > local frames which need special handling. And you have a per-port work
> > queue. To send such a frame, you need to write some register, send the
> > frame, and then wait until the mgmt_route.enfport is reset.
> >
> > Why are you doing this per port? How do you stop two ports/work queues
> > running at the same time? It seems like one queue, with one work queue
> > would be a better structure.
> >
>
> See the "port" parameter to this call here:
>
>         rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
>                           *port*, &mgmt_route, true);
>
> The switch IP aptly allocates 4 slots for management routes. And it's
> a 5-port switch where 1 port is the management port. I think the
> structure is fine.
>

"How do stop two work queues": if you're talking about contention on
the hardware management route, I responded to that.
If you're talking about netif_stop_queue(), I think I'm going to avoid
that altogether by not having a finite sized ring structure. While
studying the Marvell 88e6060 driver I found out that sk_buff_head
exists. Now I'm part of the elite club of wheel reinventors with my
struct sja1105_skb_ring :)


> > Also, please move all this code into the tagger. Just add exports for
> > sja1105_dynamic_config_write() and sja1105_dynamic_config_read().
> >
>
> Well, you see, the tagger code is part of the dsa_core object. If I
> export function symbols from the driver, those still won't be there if
> I compile the driver as a module. On the other hand, the way I'm doing
> it, I think the schedule_work() gives me a pretty good separation.
>
> > > +static void sja1105_port_disable(struct dsa_switch *ds, int port)
> > > +{
> > > +     struct sja1105_private *priv = ds->priv;
> > > +     struct sja1105_port *sp = &priv->ports[port];
> > > +     struct sk_buff *skb;
> > > +
> > > +     cancel_work_sync(&sp->xmit_work);
> > > +     while (sja1105_skb_ring_get(&sp->xmit_ring, &skb) >= 0)
> > > +             kfree_skb(skb);
> > > +}
> > > +
> > > diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c
> > > new file mode 100644
> > > index 000000000000..5c76a06c9093
> > > --- /dev/null
> > > +++ b/net/dsa/tag_sja1105.c
> > > @@ -0,0 +1,148 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/* Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
> > > + */
> > > +#include <linux/etherdevice.h>
> > > +#include <linux/if_vlan.h>
> > > +#include <linux/dsa/sja1105.h>
> > > +#include "../../drivers/net/dsa/sja1105/sja1105.h"
> >
> > Again, no, don't do this.
> >
>
> This separation between driver and tagger is fairly arbitrary.
> I need access to the driver's private structure, in order to get a
> hold of the private shadow of the dsa_port. Moving the driver private
> structure to include/linux/dsa/ would pull in quite a number of
> dependencies. Maybe I could provide declarations for the most of them,
> but anyway the private structure wouldn't be so private any longer,
> would it?
> Otherwise put, would you prefer a dp->priv similar to the already
> existing ds->priv? struct sja1105_port is much more lightweight to
> keep in include/linux/dsa/.
>
> > > +
> > > +#include "dsa_priv.h"
> > > +
> > > +/* Similar to is_link_local_ether_addr(hdr->h_dest) but also covers PTP */
> > > +static inline bool sja1105_is_link_local(const struct sk_buff *skb)
> > > +{
> > > +     const struct ethhdr *hdr = eth_hdr(skb);
> > > +     u64 dmac = ether_addr_to_u64(hdr->h_dest);
> > > +
> > > +     if ((dmac & SJA1105_LINKLOCAL_FILTER_A_MASK) ==
> > > +                 SJA1105_LINKLOCAL_FILTER_A)
> > > +             return true;
> > > +     if ((dmac & SJA1105_LINKLOCAL_FILTER_B_MASK) ==
> > > +                 SJA1105_LINKLOCAL_FILTER_B)
> > > +             return true;
> > > +     return false;
> > > +}
> > > +
> > > +static bool sja1105_filter(const struct sk_buff *skb, struct net_device *dev)
> > > +{
> > > +     if (sja1105_is_link_local(skb))
> > > +             return true;
> > > +     if (!dev->dsa_ptr->vlan_filtering)
> > > +             return true;
> > > +     return false;
> > > +}
> >
> > Please add a comment here about what frames cannot be handled by the
> > tagger. However, i'm not too happy about this design...
> >
>
> Ok, let's put this another way.
> A switch is primarily a device used to offload the forwarding of
> traffic based on L2 rules. Additionally there may be some management
> traffic for stuff like STP that needs to be terminated on the host
> port of the switch. For that, the hardware's job is to filter and tag
> management frames on their way to the host port, and the software's
> job is to process the source port and switch id information in a
> meaningful way.
> Now both this particular switch hardware, and DSA, are taking the
> above definitions to extremes.
> The switch says: "that's all you want to see? ok, so that's all I'm
> going to give you". So its native (hardware) tagging protocol is to
> trap link-local traffic and overwrite two bytes of its destination MAC
> with the switch ID and the source port. No more, no less. It is an
> incomplete solution, but it does the job for practical use cases.
> Now DSA says: "I want these to be fully capable net devices, I want
> the user to not even realize what's going on under the hood". I don't
> think that terminating iperf traffic through switch ports is a
> realistic usage scenario. So in a way discussions about performance
> and optimizations on DSA hotpath are slightly pointless IMO.
> Now what my driver says is that it offers a bit of both. It speaks the
> hardware's tagging protocol so it is capable of management traffic,
> but it also speaks the DSA paradigm, so in a way pushes the hardware
> to work in a mode it was never intended to, by repurposing VLANs when
> the user doesn't request them. So on one hand there is some overlap
> between the hardware tagging protocol and the VLAN one (in standalone
> mode and in VLAN-unaware bridged mode, management traffic *could* use
> VLAN tagging but it doesn't rely on it), and on the other hand the
> reunion of the two tagging protocols is decent, but still doesn't
> cover the entire spectrum (when put under a VLAN-aware bridge, you
> lose the ability to decode general traffic). So you'd better not rely
> on VLANs to decode the management traffic, because you won't be able
> to always rely on that, and that is a shame since a bridge with both
> vlan_filtering 1 and stp_state 1 is a real usage scenario, and the
> hardware is capable of that combination.
> But all of that is secondary. Let's forget about VLAN tagging for a
> second and concentrate on the tagging of management traffic. The
> limiting factor here is the software architecture of DSA, because in
> order for me to decode that in the driver/tagger, I'd have to drop
> everything else coming on the master net device (I explained in 13/24
> why). I believe that DSA being all-or-nothing about switch tagging is
> turning a blind eye to the devices that don't go overboard with
> features, and give you what's needed in a real-world design but not
> much else.
> What would you improve about this design (assuming you're talking
> about the filter function)?
>
> Thanks,
> -Vladimir
>
>
>
>
>
> > > +
> > > +static struct sk_buff *sja1105_xmit(struct sk_buff *skb,
> > > +                                 struct net_device *netdev)
> > > +{
> > > +     struct dsa_port *dp = dsa_slave_to_port(netdev);
> > > +     struct dsa_switch *ds = dp->ds;
> > > +     struct sja1105_private *priv = ds->priv;
> > > +     struct sja1105_port *sp = &priv->ports[dp->index];
> > > +     struct sk_buff *clone;
> > > +
> > > +     if (likely(!sja1105_is_link_local(skb))) {
> > > +             /* Normal traffic path. */
> > > +             u16 tx_vid = dsa_tagging_tx_vid(ds, dp->index);
> > > +             u8 pcp = skb->priority;
> > > +
> > > +             /* If we are under a vlan_filtering bridge, IP termination on
> > > +              * switch ports based on 802.1Q tags is simply too brittle to
> > > +              * be passable. So just defer to the dsa_slave_notag_xmit
> > > +              * implementation.
> > > +              */
> > > +             if (dp->vlan_filtering)
> > > +                     return skb;
> > > +
> > > +             return dsa_8021q_xmit(skb, netdev, ETH_P_EDSA,
> > > +                                  ((pcp << VLAN_PRIO_SHIFT) | tx_vid));
> >
> > Please don't reuse ETH_P_EDSA. Define an ETH_P_SJA1105.
> >
> > > +     }
> > > +
> > > +     /* Code path for transmitting management traffic. This does not rely
> > > +      * upon switch tagging, but instead SPI-installed management routes.
> > > +      */
> > > +     clone = skb_clone(skb, GFP_ATOMIC);
> > > +     if (!clone) {
> > > +             dev_err(ds->dev, "xmit: failed to clone skb\n");
> > > +             return NULL;
> > > +     }
> > > +
> > > +     if (sja1105_skb_ring_add(&sp->xmit_ring, clone) < 0) {
> > > +             dev_err(ds->dev, "xmit: skb ring full\n");
> > > +             kfree_skb(clone);
> > > +             return NULL;
> > > +     }
> > > +
> > > +     if (sp->xmit_ring.count == SJA1105_SKB_RING_SIZE)
> > > +             /* TODO setup a dedicated netdev queue for management traffic
> > > +              * so that we can selectively apply backpressure and not be
> > > +              * required to stop the entire traffic when the software skb
> > > +              * ring is full. This requires hooking the ndo_select_queue
> > > +              * from DSA and matching on mac_fltres.
> > > +              */
> > > +             dev_err(ds->dev, "xmit: reached maximum skb ring size\n");
> >
> > This should be rate limited.
> >
> >      Andrew
> >
> > > +
> > > +     schedule_work(&sp->xmit_work);
> > > +     /* Let DSA free its reference to the skb and we will free
> > > +      * the clone in the deferred worker
> > > +      */
> > > +     return NULL;
> > > +}
> > > +
> > > +static struct sk_buff *sja1105_rcv(struct sk_buff *skb,
> > > +                                struct net_device *netdev,
> > > +                                struct packet_type *pt)
> > > +{
> > > +     unsigned int source_port, switch_id;
> > > +     struct ethhdr *hdr = eth_hdr(skb);
> > > +     struct sk_buff *nskb;
> > > +     u16 tpid, vid, tci;
> > > +     bool is_tagged;
> > > +
> > > +     nskb = dsa_8021q_rcv(skb, netdev, pt, &tpid, &tci);
> > > +     is_tagged = (nskb && tpid == ETH_P_EDSA);
> > > +
> > > +     skb->priority = (tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
> > > +     vid = tci & VLAN_VID_MASK;
> > > +
> > > +     skb->offload_fwd_mark = 1;
> > > +
> > > +     if (likely(!sja1105_is_link_local(skb))) {
> > > +             /* Normal traffic path. */
> > > +             source_port = dsa_tagging_rx_source_port(vid);
> > > +             switch_id = dsa_tagging_rx_switch_id(vid);
> > > +     } else {
> > > +             /* Management traffic path. Switch embeds the switch ID and
> > > +              * port ID into bytes of the destination MAC, courtesy of
> > > +              * the incl_srcpt options.
> > > +              */
> > > +             source_port = hdr->h_dest[3];
> > > +             switch_id = hdr->h_dest[4];
> > > +             /* Clear the DMAC bytes that were mangled by the switch */
> > > +             hdr->h_dest[3] = 0;
> > > +             hdr->h_dest[4] = 0;
> > > +     }
> > > +
> > > +     skb->dev = dsa_master_find_slave(netdev, switch_id, source_port);
> > > +     if (!skb->dev) {
> > > +             netdev_warn(netdev, "Couldn't decode source port\n");
> > > +             return NULL;
> > > +     }
> > > +
> > > +     /* Delete/overwrite fake VLAN header, DSA expects to not find
> > > +      * it there, see dsa_switch_rcv: skb_push(skb, ETH_HLEN).
> > > +      */
> > > +     if (is_tagged)
> > > +             memmove(skb->data - ETH_HLEN, skb->data - ETH_HLEN - VLAN_HLEN,
> > > +                     ETH_HLEN - VLAN_HLEN);
> > > +
> > > +     return skb;
> > > +}
> > > +
> > > +const struct dsa_device_ops sja1105_netdev_ops = {
> > > +     .xmit = sja1105_xmit,
> > > +     .rcv = sja1105_rcv,
> > > +     .filter = sja1105_filter,
> > > +     .overhead = VLAN_HLEN,
> > > +};
> > > +
> > > --
> > > 2.17.1
> > >
Vladimir Oltean April 13, 2019, 10:26 p.m. UTC | #4
On Sun, 14 Apr 2019 at 01:08, Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Sun, 14 Apr 2019 at 00:27, Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Sat, 13 Apr 2019 at 19:38, Andrew Lunn <andrew@lunn.ch> wrote:
> > >
> > > On Sat, Apr 13, 2019 at 04:28:16AM +0300, Vladimir Oltean wrote:
> > > > In order to support this, we are creating a make-shift switch tag out of
> > > > a VLAN trunk configured on the CPU port. Termination of normal traffic
> > > > on switch ports only works when not under a vlan_filtering bridge.
> > > > Termination of management (PTP, BPDU) traffic works under all
> > > > circumstances because it uses a different tagging mechanism
> > > > (incl_srcpt). We are making use of the generic CONFIG_NET_DSA_TAG_8021Q
> > > > code and leveraging it from our own CONFIG_NET_DSA_TAG_SJA1105.
> > > >
> > > > There are two types of traffic: regular and link-local.
> > > > The link-local traffic received on the CPU port is trapped from the
> > > > switch's regular forwarding decisions because it matched one of the two
> > > > DMAC filters for management traffic.
> > > > On transmission, the switch requires special massaging for these
> > > > link-local frames. Due to a weird implementation of the switching IP, by
> > > > default it drops link-local frames that originate on the CPU port. It
> > > > needs to be told where to forward them to, through an SPI command
> > > > ("management route") that is valid for only a single frame.
> > > > So when we're sending link-local traffic, we need to clone skb's from
> > > > DSA and send them in our custom xmit worker that also performs SPI access.
> > > >
> > > > For that purpose, the DSA xmit handler and the xmit worker communicate
> > > > through a per-port "skb ring" software structure, with a producer and a
> > > > consumer index. At the moment this structure is rather fragile
> > > > (ping-flooding to a link-local DMAC would cause most of the frames to
> > > > get dropped). I would like to move the management traffic on a separate
> > > > netdev queue that I can stop when the skb ring got full and hardware is
> > > > busy processing, so that we are not forced to drop traffic.
> > > >
> > > > Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
> > > > Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
> > > > ---
> > > > Changes in v3:
> > > > Made management traffic be receivable on the DSA netdevices even when
> > > > switch tagging is disabled, as well as regular traffic be receivable on
> > > > the master netdevice in the same scenario. Both are accomplished using
> > > > the sja1105_filter() function and some small touch-ups in the .rcv
> > > > callback.
> > >
> > > It seems like you made major changes to this. When you do that, you
> > > should drop any reviewed-by tags you have. They are no longer valid
> > > because of the major changes.
> > >
> >
> > Ok, noted.
> >
> > > >  /* This callback needs to be present */
> > > > @@ -1141,7 +1158,11 @@ static int sja1105_vlan_filtering(struct dsa_switch *ds, int port, bool enabled)
> > > >       if (rc)
> > > >               dev_err(ds->dev, "Failed to change VLAN Ethertype\n");
> > > >
> > > > -     return rc;
> > > > +     /* Switch port identification based on 802.1Q is only passable
> > >
> > > possible, not passable.
> > >
> >
> > Passable (satisfactory, decent, acceptable) is what I wanted to say.
> > Tagging using VLANs is possible even when the bridge wants to use
> > them, but it's smarter not to go there. But I get your point, maybe
> > I'll rephrase.
> >
> > > > +      * if we are not under a vlan_filtering bridge. So make sure
> > > > +      * the two configurations are mutually exclusive.
> > > > +      */
> > > > +     return sja1105_setup_8021q_tagging(ds, !enabled);
> > > >  }
> > > >
> > > >  static void sja1105_vlan_add(struct dsa_switch *ds, int port,
> > > > @@ -1233,9 +1254,107 @@ static int sja1105_setup(struct dsa_switch *ds)
> > > >        */
> > > >       ds->vlan_filtering_is_global = true;
> > > >
> > > > +     /* The DSA/switchdev model brings up switch ports in standalone mode by
> > > > +      * default, and that means vlan_filtering is 0 since they're not under
> > > > +      * a bridge, so it's safe to set up switch tagging at this time.
> > > > +      */
> > > > +     return sja1105_setup_8021q_tagging(ds, true);
> > > > +}
> > > > +
> > > > +#include "../../../net/dsa/dsa_priv.h"
> > >
> > > No. Don't use relative includes like this.
> > >
> > > What do you need from the header? Maybe move it into
> > > include/linux/net/dsa.h
> > >
> >
> > dsa_slave_to_master()
> >
> > > > +/* Deferred work is unfortunately necessary because setting up the management
> > > > + * route cannot be done from atomit context (SPI transfer takes a sleepable
> > > > + * lock on the bus)
> > > > + */
> > > > +static void sja1105_xmit_work_handler(struct work_struct *work)
> > > > +{
> > > > +     struct sja1105_port *sp = container_of(work, struct sja1105_port,
> > > > +                                             xmit_work);
> > > > +     struct sja1105_private *priv = sp->dp->ds->priv;
> > > > +     struct net_device *slave = sp->dp->slave;
> > > > +     struct net_device *master = dsa_slave_to_master(slave);
> > > > +     int port = (uintptr_t)(sp - priv->ports);
> > > > +     struct sk_buff *skb;
> > > > +     int i, rc;
> > > > +
> > > > +     while ((i = sja1105_skb_ring_get(&sp->xmit_ring, &skb)) >= 0) {
> > > > +             struct sja1105_mgmt_entry mgmt_route = { 0 };
> > > > +             struct ethhdr *hdr;
> > > > +             int timeout = 10;
> > > > +             int skb_len;
> > > > +
> > > > +             skb_len = skb->len;
> > > > +             hdr = eth_hdr(skb);
> > > > +
> > > > +             mgmt_route.macaddr = ether_addr_to_u64(hdr->h_dest);
> > > > +             mgmt_route.destports = BIT(port);
> > > > +             mgmt_route.enfport = 1;
> > > > +             mgmt_route.tsreg = 0;
> > > > +             mgmt_route.takets = false;
> > > > +
> > > > +             rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
> > > > +                                               port, &mgmt_route, true);
> > > > +             if (rc < 0) {
> > > > +                     kfree_skb(skb);
> > > > +                     slave->stats.tx_dropped++;
> > > > +                     continue;
> > > > +             }
> > > > +
> > > > +             /* Transfer skb to the host port. */
> > > > +             skb->dev = master;
> > > > +             dev_queue_xmit(skb);
> > > > +
> > > > +             /* Wait until the switch has processed the frame */
> > > > +             do {
> > > > +                     rc = sja1105_dynamic_config_read(priv, BLK_IDX_MGMT_ROUTE,
> > > > +                                                      port, &mgmt_route);
> > > > +                     if (rc < 0) {
> > > > +                             slave->stats.tx_errors++;
> > > > +                             dev_err(priv->ds->dev,
> > > > +                                     "xmit: failed to poll for mgmt route\n");
> > > > +                             continue;
> > > > +                     }
> > > > +
> > > > +                     /* UM10944: The ENFPORT flag of the respective entry is
> > > > +                      * cleared when a match is found. The host can use this
> > > > +                      * flag as an acknowledgment.
> > > > +                      */
> > > > +                     cpu_relax();
> > > > +             } while (mgmt_route.enfport && --timeout);
> > > > +
> > > > +             if (!timeout) {
> > > > +                     dev_err(priv->ds->dev, "xmit timed out\n");
> > > > +                     slave->stats.tx_errors++;
> > > > +                     continue;
> > > > +             }
> > > > +
> > > > +             slave->stats.tx_packets++;
> > > > +             slave->stats.tx_bytes += skb_len;
> > > > +     }
> > > > +}
> > > > +
> > > > +static int sja1105_port_enable(struct dsa_switch *ds, int port,
> > > > +                            struct phy_device *phydev)
> > > > +{
> > > > +     struct sja1105_private *priv = ds->priv;
> > > > +     struct sja1105_port *sp = &priv->ports[port];
> > > > +
> > > > +     sp->dp = &ds->ports[port];
> > > > +     INIT_WORK(&sp->xmit_work, sja1105_xmit_work_handler);
> > > >       return 0;
> > > >  }
> > >
> > > I think i'm missing something here. You have a per port queue of link
> > > local frames which need special handling. And you have a per-port work
> > > queue. To send such a frame, you need to write some register, send the
> > > frame, and then wait until the mgmt_route.enfport is reset.
> > >
> > > Why are you doing this per port? How do you stop two ports/work queues
> > > running at the same time? It seems like one queue, with one work queue
> > > would be a better structure.
> > >
> >
> > See the "port" parameter to this call here:
> >
> >         rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
> >                           *port*, &mgmt_route, true);
> >
> > The switch IP aptly allocates 4 slots for management routes. And it's
> > a 5-port switch where 1 port is the management port. I think the
> > structure is fine.
> >
>
> "How do stop two work queues": if you're talking about contention on
> the hardware management route, I responded to that.
> If you're talking about netif_stop_queue(), I think I'm going to avoid
> that altogether by not having a finite sized ring structure. While
> studying the Marvell 88e6060 driver I found out that sk_buff_head
> exists. Now I'm part of the elite club of wheel reinventors with my
> struct sja1105_skb_ring :)
>
>
> > > Also, please move all this code into the tagger. Just add exports for
> > > sja1105_dynamic_config_write() and sja1105_dynamic_config_read().
> > >
> >
> > Well, you see, the tagger code is part of the dsa_core object. If I
> > export function symbols from the driver, those still won't be there if
> > I compile the driver as a module. On the other hand, the way I'm doing
> > it, I think the schedule_work() gives me a pretty good separation.
> >
> > > > +static void sja1105_port_disable(struct dsa_switch *ds, int port)
> > > > +{
> > > > +     struct sja1105_private *priv = ds->priv;
> > > > +     struct sja1105_port *sp = &priv->ports[port];
> > > > +     struct sk_buff *skb;
> > > > +
> > > > +     cancel_work_sync(&sp->xmit_work);
> > > > +     while (sja1105_skb_ring_get(&sp->xmit_ring, &skb) >= 0)
> > > > +             kfree_skb(skb);
> > > > +}
> > > > +
> > > > diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c
> > > > new file mode 100644
> > > > index 000000000000..5c76a06c9093
> > > > --- /dev/null
> > > > +++ b/net/dsa/tag_sja1105.c
> > > > @@ -0,0 +1,148 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/* Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
> > > > + */
> > > > +#include <linux/etherdevice.h>
> > > > +#include <linux/if_vlan.h>
> > > > +#include <linux/dsa/sja1105.h>
> > > > +#include "../../drivers/net/dsa/sja1105/sja1105.h"
> > >
> > > Again, no, don't do this.
> > >
> >
> > This separation between driver and tagger is fairly arbitrary.
> > I need access to the driver's private structure, in order to get a
> > hold of the private shadow of the dsa_port. Moving the driver private
> > structure to include/linux/dsa/ would pull in quite a number of
> > dependencies. Maybe I could provide declarations for the most of them,
> > but anyway the private structure wouldn't be so private any longer,
> > would it?
> > Otherwise put, would you prefer a dp->priv similar to the already
> > existing ds->priv? struct sja1105_port is much more lightweight to
> > keep in include/linux/dsa/.
> >
> > > > +
> > > > +#include "dsa_priv.h"
> > > > +
> > > > +/* Similar to is_link_local_ether_addr(hdr->h_dest) but also covers PTP */
> > > > +static inline bool sja1105_is_link_local(const struct sk_buff *skb)
> > > > +{
> > > > +     const struct ethhdr *hdr = eth_hdr(skb);
> > > > +     u64 dmac = ether_addr_to_u64(hdr->h_dest);
> > > > +
> > > > +     if ((dmac & SJA1105_LINKLOCAL_FILTER_A_MASK) ==
> > > > +                 SJA1105_LINKLOCAL_FILTER_A)
> > > > +             return true;
> > > > +     if ((dmac & SJA1105_LINKLOCAL_FILTER_B_MASK) ==
> > > > +                 SJA1105_LINKLOCAL_FILTER_B)
> > > > +             return true;
> > > > +     return false;
> > > > +}
> > > > +
> > > > +static bool sja1105_filter(const struct sk_buff *skb, struct net_device *dev)
> > > > +{
> > > > +     if (sja1105_is_link_local(skb))
> > > > +             return true;
> > > > +     if (!dev->dsa_ptr->vlan_filtering)
> > > > +             return true;
> > > > +     return false;
> > > > +}
> > >
> > > Please add a comment here about what frames cannot be handled by the
> > > tagger. However, i'm not too happy about this design...
> > >
> >
> > Ok, let's put this another way.
> > A switch is primarily a device used to offload the forwarding of
> > traffic based on L2 rules. Additionally there may be some management
> > traffic for stuff like STP that needs to be terminated on the host
> > port of the switch. For that, the hardware's job is to filter and tag
> > management frames on their way to the host port, and the software's
> > job is to process the source port and switch id information in a
> > meaningful way.
> > Now both this particular switch hardware, and DSA, are taking the
> > above definitions to extremes.
> > The switch says: "that's all you want to see? ok, so that's all I'm
> > going to give you". So its native (hardware) tagging protocol is to
> > trap link-local traffic and overwrite two bytes of its destination MAC
> > with the switch ID and the source port. No more, no less. It is an
> > incomplete solution, but it does the job for practical use cases.
> > Now DSA says: "I want these to be fully capable net devices, I want
> > the user to not even realize what's going on under the hood". I don't
> > think that terminating iperf traffic through switch ports is a
> > realistic usage scenario. So in a way discussions about performance
> > and optimizations on DSA hotpath are slightly pointless IMO.
> > Now what my driver says is that it offers a bit of both. It speaks the
> > hardware's tagging protocol so it is capable of management traffic,
> > but it also speaks the DSA paradigm, so in a way pushes the hardware
> > to work in a mode it was never intended to, by repurposing VLANs when
> > the user doesn't request them. So on one hand there is some overlap
> > between the hardware tagging protocol and the VLAN one (in standalone
> > mode and in VLAN-unaware bridged mode, management traffic *could* use
> > VLAN tagging but it doesn't rely on it), and on the other hand the
> > reunion of the two tagging protocols is decent, but still doesn't
> > cover the entire spectrum (when put under a VLAN-aware bridge, you
> > lose the ability to decode general traffic). So you'd better not rely
> > on VLANs to decode the management traffic, because you won't be able
> > to always rely on that, and that is a shame since a bridge with both
> > vlan_filtering 1 and stp_state 1 is a real usage scenario, and the
> > hardware is capable of that combination.
> > But all of that is secondary. Let's forget about VLAN tagging for a
> > second and concentrate on the tagging of management traffic. The
> > limiting factor here is the software architecture of DSA, because in
> > order for me to decode that in the driver/tagger, I'd have to drop
> > everything else coming on the master net device (I explained in 13/24
> > why). I believe that DSA being all-or-nothing about switch tagging is
> > turning a blind eye to the devices that don't go overboard with
> > features, and give you what's needed in a real-world design but not
> > much else.
> > What would you improve about this design (assuming you're talking
> > about the filter function)?
> >
> > Thanks,
> > -Vladimir
> >
> >
> >
> >
> >
> > > > +
> > > > +static struct sk_buff *sja1105_xmit(struct sk_buff *skb,
> > > > +                                 struct net_device *netdev)
> > > > +{
> > > > +     struct dsa_port *dp = dsa_slave_to_port(netdev);
> > > > +     struct dsa_switch *ds = dp->ds;
> > > > +     struct sja1105_private *priv = ds->priv;
> > > > +     struct sja1105_port *sp = &priv->ports[dp->index];
> > > > +     struct sk_buff *clone;
> > > > +
> > > > +     if (likely(!sja1105_is_link_local(skb))) {
> > > > +             /* Normal traffic path. */
> > > > +             u16 tx_vid = dsa_tagging_tx_vid(ds, dp->index);
> > > > +             u8 pcp = skb->priority;
> > > > +
> > > > +             /* If we are under a vlan_filtering bridge, IP termination on
> > > > +              * switch ports based on 802.1Q tags is simply too brittle to
> > > > +              * be passable. So just defer to the dsa_slave_notag_xmit
> > > > +              * implementation.
> > > > +              */
> > > > +             if (dp->vlan_filtering)
> > > > +                     return skb;
> > > > +
> > > > +             return dsa_8021q_xmit(skb, netdev, ETH_P_EDSA,
> > > > +                                  ((pcp << VLAN_PRIO_SHIFT) | tx_vid));
> > >
> > > Please don't reuse ETH_P_EDSA. Define an ETH_P_SJA1105.
> > >

I'm receiving contradictory advice on this. Why should I define a new
ethertype, and if I do, what scope should the definition have (local
to the driver and the tagger, local to DSA, UAPI)?

> > > > +     }
> > > > +
> > > > +     /* Code path for transmitting management traffic. This does not rely
> > > > +      * upon switch tagging, but instead SPI-installed management routes.
> > > > +      */
> > > > +     clone = skb_clone(skb, GFP_ATOMIC);
> > > > +     if (!clone) {
> > > > +             dev_err(ds->dev, "xmit: failed to clone skb\n");
> > > > +             return NULL;
> > > > +     }
> > > > +
> > > > +     if (sja1105_skb_ring_add(&sp->xmit_ring, clone) < 0) {
> > > > +             dev_err(ds->dev, "xmit: skb ring full\n");
> > > > +             kfree_skb(clone);
> > > > +             return NULL;
> > > > +     }
> > > > +
> > > > +     if (sp->xmit_ring.count == SJA1105_SKB_RING_SIZE)
> > > > +             /* TODO setup a dedicated netdev queue for management traffic
> > > > +              * so that we can selectively apply backpressure and not be
> > > > +              * required to stop the entire traffic when the software skb
> > > > +              * ring is full. This requires hooking the ndo_select_queue
> > > > +              * from DSA and matching on mac_fltres.
> > > > +              */
> > > > +             dev_err(ds->dev, "xmit: reached maximum skb ring size\n");
> > >
> > > This should be rate limited.
> > >

Again, with sk_buff_head it'll probably completely go away.

> > >      Andrew
> > >
> > > > +
> > > > +     schedule_work(&sp->xmit_work);
> > > > +     /* Let DSA free its reference to the skb and we will free
> > > > +      * the clone in the deferred worker
> > > > +      */
> > > > +     return NULL;
> > > > +}
> > > > +
> > > > +static struct sk_buff *sja1105_rcv(struct sk_buff *skb,
> > > > +                                struct net_device *netdev,
> > > > +                                struct packet_type *pt)
> > > > +{
> > > > +     unsigned int source_port, switch_id;
> > > > +     struct ethhdr *hdr = eth_hdr(skb);
> > > > +     struct sk_buff *nskb;
> > > > +     u16 tpid, vid, tci;
> > > > +     bool is_tagged;
> > > > +
> > > > +     nskb = dsa_8021q_rcv(skb, netdev, pt, &tpid, &tci);
> > > > +     is_tagged = (nskb && tpid == ETH_P_EDSA);
> > > > +
> > > > +     skb->priority = (tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
> > > > +     vid = tci & VLAN_VID_MASK;
> > > > +
> > > > +     skb->offload_fwd_mark = 1;
> > > > +
> > > > +     if (likely(!sja1105_is_link_local(skb))) {
> > > > +             /* Normal traffic path. */
> > > > +             source_port = dsa_tagging_rx_source_port(vid);
> > > > +             switch_id = dsa_tagging_rx_switch_id(vid);
> > > > +     } else {
> > > > +             /* Management traffic path. Switch embeds the switch ID and
> > > > +              * port ID into bytes of the destination MAC, courtesy of
> > > > +              * the incl_srcpt options.
> > > > +              */
> > > > +             source_port = hdr->h_dest[3];
> > > > +             switch_id = hdr->h_dest[4];
> > > > +             /* Clear the DMAC bytes that were mangled by the switch */
> > > > +             hdr->h_dest[3] = 0;
> > > > +             hdr->h_dest[4] = 0;
> > > > +     }
> > > > +
> > > > +     skb->dev = dsa_master_find_slave(netdev, switch_id, source_port);
> > > > +     if (!skb->dev) {
> > > > +             netdev_warn(netdev, "Couldn't decode source port\n");
> > > > +             return NULL;
> > > > +     }
> > > > +
> > > > +     /* Delete/overwrite fake VLAN header, DSA expects to not find
> > > > +      * it there, see dsa_switch_rcv: skb_push(skb, ETH_HLEN).
> > > > +      */
> > > > +     if (is_tagged)
> > > > +             memmove(skb->data - ETH_HLEN, skb->data - ETH_HLEN - VLAN_HLEN,
> > > > +                     ETH_HLEN - VLAN_HLEN);
> > > > +
> > > > +     return skb;
> > > > +}
> > > > +
> > > > +const struct dsa_device_ops sja1105_netdev_ops = {
> > > > +     .xmit = sja1105_xmit,
> > > > +     .rcv = sja1105_rcv,
> > > > +     .filter = sja1105_filter,
> > > > +     .overhead = VLAN_HLEN,
> > > > +};
> > > > +
> > > > --
> > > > 2.17.1
> > > >
Andrew Lunn April 14, 2019, 4:05 p.m. UTC | #5
On Sun, Apr 14, 2019 at 12:27:50AM +0300, Vladimir Oltean wrote:
> > > +             mgmt_route.macaddr = ether_addr_to_u64(hdr->h_dest);
> > > +             mgmt_route.destports = BIT(port);
> > > +             mgmt_route.enfport = 1;
> > > +             mgmt_route.tsreg = 0;
> > > +             mgmt_route.takets = false;
> > > +
> > > +             rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
> > > +                                               port, &mgmt_route, true);

>         rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
>                           *port*, &mgmt_route, true);
> 
> The switch IP aptly allocates 4 slots for management routes. And it's
> a 5-port switch where 1 port is the management port. I think the
> structure is fine.

So does the hardware look over all the slots and find the first one
which has a matching mgmt_route.macaddr destination MAC address? You
wait for the enfport to be cleared. I assume a slot with enfport
cleared is not active and won't match?

So we need to consider if there is a race condition where we have
multiple slots with the same destination MAC address, but different
destination ports? Say the bridge sends out BPDU to all ports of a
bridge in quick succession.

These work queues run in any order, and can sleep. Can we get into a
situation where we get the two slots setup, and then the frames sent
in reverse order? The match then happens backwards, and the frames get
sent out the wrong port?

Or say the two slots are setup, the two frames are sent in order, but
the stack decided to drop the first frame because buffers are
full. Can the second frame make it to the switch and match on the
first slot and go out the wrong port?

> > Also, please move all this code into the tagger. Just add exports for
> > sja1105_dynamic_config_write() and sja1105_dynamic_config_read().
> >
> 
> Well, you see, the tagger code is part of the dsa_core object. If I
> export function symbols from the driver, those still won't be there if
> I compile the driver as a module. On the other hand, the way I'm doing
> it, I think the schedule_work() gives me a pretty good separation.

That is solvable via Kconfig, don't allow it to be built as a module.

Also, DSA has been very successful, we keep getting more switches from
different vendors, and hence more taggers. So at some point, we should
turn the taggers into modules. I'm not saying that should happen now,
but when it does happen, this driver can then become a module.

The real reason is, tagger as all about handling frames, where as
drivers are all about configuring the switch. The majority of this
code is about frames, so it belongs in the tagger.
 
> > > +#include <linux/etherdevice.h>
> > > +#include <linux/if_vlan.h>
> > > +#include <linux/dsa/sja1105.h>
> > > +#include "../../drivers/net/dsa/sja1105/sja1105.h"
> >
> > Again, no, don't do this.
> >
> 
> This separation between driver and tagger is fairly arbitrary.
> I need access to the driver's private structure, in order to get a
> hold of the private shadow of the dsa_port. Moving the driver private
> structure to include/linux/dsa/ would pull in quite a number of
> dependencies. Maybe I could provide declarations for the most of them,
> but anyway the private structure wouldn't be so private any longer,
> would it?
> Otherwise put, would you prefer a dp->priv similar to the already
> existing ds->priv? struct sja1105_port is much more lightweight to
> keep in include/linux/dsa/.

Linux simply does not make use of relative paths going between
directories like this. That is the key point here. Whatever you need
to share between the tagger and the driver has to be put into
include/linux/dsa/. 

Assuming we are just exporting something like
sja1105_dynamic_config_write() and _read()


> > > +             rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
> > > +                                               port, &mgmt_route, true);

priv can be replaced with ds, which the tagger has. port is
known. BLK_IDX_MGMT_ROUTE is implicit, and all that the tagger needs
to pass for mgmt_route is the destination MAC address, which it has.

The tagger does need somewhere to keep the queue of frames to be sent
and its workqueue. I would probably add a void *tagger_priv to
dsa_switch, and two new methods to dsa_device_ops, .probe and
.release, to allow it to create and destroy what it needs in
tagger_priv.

> > > +#include "dsa_priv.h"
> > > +
> > > +/* Similar to is_link_local_ether_addr(hdr->h_dest) but also covers PTP */
> > > +static inline bool sja1105_is_link_local(const struct sk_buff *skb)
> > > +{
> > > +     const struct ethhdr *hdr = eth_hdr(skb);
> > > +     u64 dmac = ether_addr_to_u64(hdr->h_dest);
> > > +
> > > +     if ((dmac & SJA1105_LINKLOCAL_FILTER_A_MASK) ==
> > > +                 SJA1105_LINKLOCAL_FILTER_A)
> > > +             return true;
> > > +     if ((dmac & SJA1105_LINKLOCAL_FILTER_B_MASK) ==
> > > +                 SJA1105_LINKLOCAL_FILTER_B)
> > > +             return true;
> > > +     return false;
> > > +}
> > > +
> > > +static bool sja1105_filter(const struct sk_buff *skb, struct net_device *dev)
> > > +{
> > > +     if (sja1105_is_link_local(skb))
> > > +             return true;
> > > +     if (!dev->dsa_ptr->vlan_filtering)
> > > +             return true;
> > > +     return false;
> > > +}
> >
> > Please add a comment here about what frames cannot be handled by the
> > tagger. However, i'm not too happy about this design...
> >

> What would you improve about this design (assuming you're talking
> about the filter function)?

I want to understand what frames get passed via the master device, and
how ultimately they get to where they should be going.

Once i understand what sort of frames they are and what is
generating/consuming them, maybe we can find a better solution which
preserves the DSA concepts.

To me, it looks like they are not management frames, at least not BPDU
or PTP, since they are link local. If VLAN filtering is off, the VLAN
tag tells us which port they came in, so we can strip the tag and pass
them to the correct slave.

So it looks like real user frames with a VLAN tag are getting passed
to the master device. So i then assume you put vlan interfaces on top
of the master device, and your application then uses the vlan
interfaces? Your application does not care where the frames came from,
it is using the switch as a dumb switch. The DSA slaves are unused?

Could we enforce that a VLAN can only be assigned to a single port?
The tagger could then pass the tagged frame to the correct slave? Is
that too restrictive for your use case? Do you need the same VLAN on
multiple ports.

	 Andrew
Andrew Lunn April 14, 2019, 4:17 p.m. UTC | #6
> > > > > +             return dsa_8021q_xmit(skb, netdev, ETH_P_EDSA,
> > > > > +                                  ((pcp << VLAN_PRIO_SHIFT) | tx_vid));
> > > >
> > > > Please don't reuse ETH_P_EDSA. Define an ETH_P_SJA1105.
> > > >
> 
> I'm receiving contradictory advice on this. Why should I define a new
> ethertype, and if I do, what scope should the definition have (local
> to the driver and the tagger, local to DSA, UAPI)?

ETH_P_EDSA has a well defined meaning. It is a true global EtherType,
and means a Marvell EtherType DSA header follows.

You are polluting this meaning of ETH_P_EDSA. Would you put ETH_P_IP
or ETH_P_8021AD here?

   Andrew
Vladimir Oltean April 14, 2019, 6:42 p.m. UTC | #7
On Sun, 14 Apr 2019 at 19:05, Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Sun, Apr 14, 2019 at 12:27:50AM +0300, Vladimir Oltean wrote:
> > > > +             mgmt_route.macaddr = ether_addr_to_u64(hdr->h_dest);
> > > > +             mgmt_route.destports = BIT(port);
> > > > +             mgmt_route.enfport = 1;
> > > > +             mgmt_route.tsreg = 0;
> > > > +             mgmt_route.takets = false;
> > > > +
> > > > +             rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
> > > > +                                               port, &mgmt_route, true);
>
> >         rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
> >                           *port*, &mgmt_route, true);
> >
> > The switch IP aptly allocates 4 slots for management routes. And it's
> > a 5-port switch where 1 port is the management port. I think the
> > structure is fine.
>
> So does the hardware look over all the slots and find the first one
> which has a matching mgmt_route.macaddr destination MAC address? You
> wait for the enfport to be cleared. I assume a slot with enfport
> cleared is not active and won't match?
>
> So we need to consider if there is a race condition where we have
> multiple slots with the same destination MAC address, but different
> destination ports? Say the bridge sends out BPDU to all ports of a
> bridge in quick succession.
>
> These work queues run in any order, and can sleep. Can we get into a
> situation where we get the two slots setup, and then the frames sent
> in reverse order? The match then happens backwards, and the frames get
> sent out the wrong port?
>

Yes, it looks like the hardware isn't doing me any favors on this one.
From UM10944: "If the host provides several management route entries
with identical values for the MACADDR, the one at the lowest index is
used first."
So the 4 hardware management slots serve no purpose unless I'm willing
to lock the sja1105_xmit_work_handler with a mutex. And even then
there's no reason to use separate slots since the workers are
serialized anyway. Weird.

> Or say the two slots are setup, the two frames are sent in order, but
> the stack decided to drop the first frame because buffers are
> full. Can the second frame make it to the switch and match on the
> first slot and go out the wrong port?
>

Yes if waiting for enfport times out there's some cleanup work I'm
currently not doing.

> > > Also, please move all this code into the tagger. Just add exports for
> > > sja1105_dynamic_config_write() and sja1105_dynamic_config_read().
> > >
> >
> > Well, you see, the tagger code is part of the dsa_core object. If I
> > export function symbols from the driver, those still won't be there if
> > I compile the driver as a module. On the other hand, the way I'm doing
> > it, I think the schedule_work() gives me a pretty good separation.
>
> That is solvable via Kconfig, don't allow it to be built as a module.
>
> Also, DSA has been very successful, we keep getting more switches from
> different vendors, and hence more taggers. So at some point, we should
> turn the taggers into modules. I'm not saying that should happen now,
> but when it does happen, this driver can then become a module.
>
> The real reason is, tagger as all about handling frames, where as
> drivers are all about configuring the switch. The majority of this
> code is about frames, so it belongs in the tagger.
>

The xmit worker needs to configure the switch to be able to handle
frames. Why doesn't that belong in the driver?
Not allowing the driver to be built as module is hardly any cleaner
than a schedule_work().
I only need dsa_slave_to_master for the delayed enqueue. And if DSA
supported a delayed enqueue method natively I wouldn't need it at all.

> > > > +#include <linux/etherdevice.h>
> > > > +#include <linux/if_vlan.h>
> > > > +#include <linux/dsa/sja1105.h>
> > > > +#include "../../drivers/net/dsa/sja1105/sja1105.h"
> > >
> > > Again, no, don't do this.
> > >
> >
> > This separation between driver and tagger is fairly arbitrary.
> > I need access to the driver's private structure, in order to get a
> > hold of the private shadow of the dsa_port. Moving the driver private
> > structure to include/linux/dsa/ would pull in quite a number of
> > dependencies. Maybe I could provide declarations for the most of them,
> > but anyway the private structure wouldn't be so private any longer,
> > would it?
> > Otherwise put, would you prefer a dp->priv similar to the already
> > existing ds->priv? struct sja1105_port is much more lightweight to
> > keep in include/linux/dsa/.
>
> Linux simply does not make use of relative paths going between
> directories like this. That is the key point here. Whatever you need
> to share between the tagger and the driver has to be put into
> include/linux/dsa/.
>
> Assuming we are just exporting something like
> sja1105_dynamic_config_write() and _read()
>
>
> > > > +             rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
> > > > +                                               port, &mgmt_route, true);
>
> priv can be replaced with ds, which the tagger has. port is
> known. BLK_IDX_MGMT_ROUTE is implicit, and all that the tagger needs
> to pass for mgmt_route is the destination MAC address, which it has.
>
> The tagger does need somewhere to keep the queue of frames to be sent
> and its workqueue. I would probably add a void *tagger_priv to
> dsa_switch, and two new methods to dsa_device_ops, .probe and
> .release, to allow it to create and destroy what it needs in
> tagger_priv.
>

I need to think about this.

> > > > +#include "dsa_priv.h"
> > > > +
> > > > +/* Similar to is_link_local_ether_addr(hdr->h_dest) but also covers PTP */
> > > > +static inline bool sja1105_is_link_local(const struct sk_buff *skb)
> > > > +{
> > > > +     const struct ethhdr *hdr = eth_hdr(skb);
> > > > +     u64 dmac = ether_addr_to_u64(hdr->h_dest);
> > > > +
> > > > +     if ((dmac & SJA1105_LINKLOCAL_FILTER_A_MASK) ==
> > > > +                 SJA1105_LINKLOCAL_FILTER_A)
> > > > +             return true;
> > > > +     if ((dmac & SJA1105_LINKLOCAL_FILTER_B_MASK) ==
> > > > +                 SJA1105_LINKLOCAL_FILTER_B)
> > > > +             return true;
> > > > +     return false;
> > > > +}
> > > > +
> > > > +static bool sja1105_filter(const struct sk_buff *skb, struct net_device *dev)
> > > > +{
> > > > +     if (sja1105_is_link_local(skb))
> > > > +             return true;
> > > > +     if (!dev->dsa_ptr->vlan_filtering)
> > > > +             return true;
> > > > +     return false;
> > > > +}
> > >
> > > Please add a comment here about what frames cannot be handled by the
> > > tagger. However, i'm not too happy about this design...
> > >
>
> > What would you improve about this design (assuming you're talking
> > about the filter function)?
>
> I want to understand what frames get passed via the master device, and
> how ultimately they get to where they should be going.
>
> Once i understand what sort of frames they are and what is
> generating/consuming them, maybe we can find a better solution which
> preserves the DSA concepts.
>
> To me, it looks like they are not management frames, at least not BPDU
> or PTP, since they are link local. If VLAN filtering is off, the VLAN
> tag tells us which port they came in, so we can strip the tag and pass
> them to the correct slave.
>
> So it looks like real user frames with a VLAN tag are getting passed
> to the master device. So i then assume you put vlan interfaces on top
> of the master device, and your application then uses the vlan
> interfaces? Your application does not care where the frames came from,
> it is using the switch as a dumb switch. The DSA slaves are unused?
>

Whatever the application may be, the DSA solution to switches that
can't decode all incoming traffic is to drop the rest. In this case it
means that the host port is no longer a valid destination for the L2
switching process.

> Could we enforce that a VLAN can only be assigned to a single port?
> The tagger could then pass the tagged frame to the correct slave? Is
> that too restrictive for your use case? Do you need the same VLAN on
> multiple ports.
>
>          Andrew

No we can't enforce that. The commit message of 07/24 has a pretty
lengthy explanation why.

Thanks,
-Vladimir
Vladimir Oltean April 14, 2019, 6:53 p.m. UTC | #8
On Sun, 14 Apr 2019 at 19:18, Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > > > > +             return dsa_8021q_xmit(skb, netdev, ETH_P_EDSA,
> > > > > > +                                  ((pcp << VLAN_PRIO_SHIFT) | tx_vid));
> > > > >
> > > > > Please don't reuse ETH_P_EDSA. Define an ETH_P_SJA1105.
> > > > >
> >
> > I'm receiving contradictory advice on this. Why should I define a new
> > ethertype, and if I do, what scope should the definition have (local
> > to the driver and the tagger, local to DSA, UAPI)?
>
> ETH_P_EDSA has a well defined meaning. It is a true global EtherType,
> and means a Marvell EtherType DSA header follows.
>
> You are polluting this meaning of ETH_P_EDSA. Would you put ETH_P_IP
> or ETH_P_8021AD here?
>
>    Andrew

You are putting an equality sign here between things that are not quite equal.
The MEDSA EtherType is used for the same purpose as what I'm using it for.
The only situation when I can receive ETH_P_EDSA frames is if somebody
designed a system with a cascaded SJA1105 and a MV88E6xx. I think
that's unlikely but I might be wrong.
Don't get me wrong, I could use literally any EtherType and that's
exactly why I'm reluctant to define a new one.
The only thing is that if I pick an EtherType smaller than 1500
(LLC/SNAP) like ETH_P_XDSA (or even zero works), then I get the
hardware incrementing the n_sizeerr counter for each received tagged
frame (it doesn't drop it, though).

-Vladimir
Andrew Lunn April 14, 2019, 7:06 p.m. UTC | #9
> Yes, it looks like the hardware isn't doing me any favors on this one.
> >From UM10944: "If the host provides several management route entries
> with identical values for the MACADDR, the one at the lowest index is
> used first."
> So the 4 hardware management slots serve no purpose unless I'm willing
> to lock the sja1105_xmit_work_handler with a mutex. And even then
> there's no reason to use separate slots since the workers are
> serialized anyway. Weird.

So you can simply this down to just using one queue. Maybe even one
queue for all instances of this switch. You can then keep it all as
static private structures in the tag driver, maybe not even needing
ds->tag_priv.

> > Also, DSA has been very successful, we keep getting more switches from
> > different vendors, and hence more taggers. So at some point, we should
> > turn the taggers into modules. I'm not saying that should happen now,
> > but when it does happen, this driver can then become a module.

FYI: I started on this already. There might be patches today to allow
the tag drivers to be kernel modules.

    Andrew
Andrew Lunn April 14, 2019, 7:13 p.m. UTC | #10
On Sun, Apr 14, 2019 at 09:53:42PM +0300, Vladimir Oltean wrote:
> On Sun, 14 Apr 2019 at 19:18, Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > > > > > > +             return dsa_8021q_xmit(skb, netdev, ETH_P_EDSA,
> > > > > > > +                                  ((pcp << VLAN_PRIO_SHIFT) | tx_vid));
> > > > > >
> > > > > > Please don't reuse ETH_P_EDSA. Define an ETH_P_SJA1105.
> > > > > >
> > >
> > > I'm receiving contradictory advice on this. Why should I define a new
> > > ethertype, and if I do, what scope should the definition have (local
> > > to the driver and the tagger, local to DSA, UAPI)?
> >
> > ETH_P_EDSA has a well defined meaning. It is a true global EtherType,
> > and means a Marvell EtherType DSA header follows.
> >
> > You are polluting this meaning of ETH_P_EDSA. Would you put ETH_P_IP
> > or ETH_P_8021AD here?
> >
> >    Andrew
> 
> You are putting an equality sign here between things that are not quite equal.
> The MEDSA EtherType is used for the same purpose as what I'm using it for.

I don't think it is. tcpdump will match on this EtherType and decode
what follows as an EDSA header, just in the same way it matches a
ETH_P_IP and decodes what comes next as an IP packet. I also have
wireshark patches, which i never submitted, which do the same.

Please run tcpdump on the master interface with your test system and
see what it does.

    Andrew
Vladimir Oltean April 14, 2019, 10:30 p.m. UTC | #11
On Sun, 14 Apr 2019 at 22:13, Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Sun, Apr 14, 2019 at 09:53:42PM +0300, Vladimir Oltean wrote:
> > On Sun, 14 Apr 2019 at 19:18, Andrew Lunn <andrew@lunn.ch> wrote:
> > >
> > > > > > > > +             return dsa_8021q_xmit(skb, netdev, ETH_P_EDSA,
> > > > > > > > +                                  ((pcp << VLAN_PRIO_SHIFT) | tx_vid));
> > > > > > >
> > > > > > > Please don't reuse ETH_P_EDSA. Define an ETH_P_SJA1105.
> > > > > > >
> > > >
> > > > I'm receiving contradictory advice on this. Why should I define a new
> > > > ethertype, and if I do, what scope should the definition have (local
> > > > to the driver and the tagger, local to DSA, UAPI)?
> > >
> > > ETH_P_EDSA has a well defined meaning. It is a true global EtherType,
> > > and means a Marvell EtherType DSA header follows.
> > >
> > > You are polluting this meaning of ETH_P_EDSA. Would you put ETH_P_IP
> > > or ETH_P_8021AD here?
> > >
> > >    Andrew
> >
> > You are putting an equality sign here between things that are not quite equal.
> > The MEDSA EtherType is used for the same purpose as what I'm using it for.
>
> I don't think it is. tcpdump will match on this EtherType and decode
> what follows as an EDSA header, just in the same way it matches a
> ETH_P_IP and decodes what comes next as an IP packet. I also have
> wireshark patches, which i never submitted, which do the same.
>
> Please run tcpdump on the master interface with your test system and
> see what it does.
>
>     Andrew

It fails to decode the frames, obviously. But so does any other EtherType.
Florian was hinting
(https://lwn.net/ml/netdev/b52f4cdf-edcf-0757-1d6e-d4a831ec7943@gmail.com/)
at the recent pull requests submitted to tcpdump and libpcap that make
it possible to decode based on the string in
/sys/class/net/${master}/dsa/tagging. I admit I haven't actually
tested or studied those closely yet (there are more important things
to focus on ATM), but since my driver returns "8021q" in sysfs and
yours returns "edsa", I would presume tcpdump could use that
information. Anyway, since you obviously know more on this topic than
I do, please make me understand what are the real problems in spoofing
the Ethertype as a Marvell one.

Thanks,
-Vladimir
Andrew Lunn April 15, 2019, 3:07 a.m. UTC | #12
> It fails to decode the frames, obviously. But so does any other EtherType.

> Florian was hinting
> (https://lwn.net/ml/netdev/b52f4cdf-edcf-0757-1d6e-d4a831ec7943@gmail.com/)
> at the recent pull requests submitted to tcpdump and libpcap that make
> it possible to decode based on the string in
> /sys/class/net/${master}/dsa/tagging. I admit I haven't actually
> tested or studied those closely yet (there are more important things
> to focus on ATM), but since my driver returns "8021q" in sysfs and
> yours returns "edsa", I would presume tcpdump could use that
> information.

No it does not. It is a valid EtherType, that is what is used to
trigger the decoding, it takes no notice of what is in
/sys/class/net/${master}/dsa/tagging, nor the extra meta-data added to
the pcap file. There is no need. you can identify it is a Marvell EDSA
header from the EtherType.

In fact, this tcpdump code for decoding EDSA pre-dated Florians
patches by a few years.

You only need the code which Florian added when you cannot identify
the header directly from the packet. And that is true for most of the
tagging protocols. But EDSA you can.

> Anyway, since you obviously know more on this topic than I do,
> please make me understand what are the real problems in spoofing the
> Ethertype as a Marvell one.

Despite there being an EDSA EtherType in the frame, what follows is
not an ESDA header. It is like having the IPv4 EtherType but what
following is not an IP header. Broken.

    Andrew
Florian Fainelli April 17, 2019, 12:09 a.m. UTC | #13
On 14/04/2019 20:07, Andrew Lunn wrote:
>> It fails to decode the frames, obviously. But so does any other EtherType.
> 
>> Florian was hinting
>> (https://lwn.net/ml/netdev/b52f4cdf-edcf-0757-1d6e-d4a831ec7943@gmail.com/)
>> at the recent pull requests submitted to tcpdump and libpcap that make
>> it possible to decode based on the string in
>> /sys/class/net/${master}/dsa/tagging. I admit I haven't actually
>> tested or studied those closely yet (there are more important things
>> to focus on ATM), but since my driver returns "8021q" in sysfs and
>> yours returns "edsa", I would presume tcpdump could use that
>> information.
> 
> No it does not. It is a valid EtherType, that is what is used to
> trigger the decoding, it takes no notice of what is in
> /sys/class/net/${master}/dsa/tagging, nor the extra meta-data added to
> the pcap file. There is no need. you can identify it is a Marvell EDSA
> header from the EtherType.
> 
> In fact, this tcpdump code for decoding EDSA pre-dated Florians
> patches by a few years.
> 
> You only need the code which Florian added when you cannot identify
> the header directly from the packet. And that is true for most of the
> tagging protocols. But EDSA you can.

Correct.

> 
>> Anyway, since you obviously know more on this topic than I do,
>> please make me understand what are the real problems in spoofing the
>> Ethertype as a Marvell one.
> 
> Despite there being an EDSA EtherType in the frame, what follows is
> not an ESDA header. It is like having the IPv4 EtherType but what
> following is not an IP header. Broken.
I suppose this is a valid point, but in that case any EtherType would do 
and technically using ETH_P_EDSA is just an one of the many possible 
choices for configuring the Marvell EDSA EtherType, you just need to 
pick one that is not going to trick the switching into thinking this is 
invalid LLC/SNAP.

With Vivien's latest tcpdump changes, I don't think we need to rely on 
ETH_P_EDSA to be present anyway since the kernel tells us (when available).
Florian Fainelli April 17, 2019, 12:16 a.m. UTC | #14
On 13/04/2019 14:27, Vladimir Oltean wrote:
> 
> Ok, let's put this another way.
> A switch is primarily a device used to offload the forwarding of
> traffic based on L2 rules. Additionally there may be some management
> traffic for stuff like STP that needs to be terminated on the host
> port of the switch. For that, the hardware's job is to filter and tag
> management frames on their way to the host port, and the software's
> job is to process the source port and switch id information in a
> meaningful way.
> Now both this particular switch hardware, and DSA, are taking the
> above definitions to extremes.
> The switch says: "that's all you want to see? ok, so that's all I'm
> going to give you". So its native (hardware) tagging protocol is to
> trap link-local traffic and overwrite two bytes of its destination MAC
> with the switch ID and the source port. No more, no less. It is an
> incomplete solution, but it does the job for practical use cases.

Indeed.

> Now DSA says: "I want these to be fully capable net devices, I want
> the user to not even realize what's going on under the hood". I don't
> think that terminating iperf traffic through switch ports is a
> realistic usage scenario. So in a way discussions about performance
> and optimizations on DSA hotpath are slightly pointless IMO.

Actually it is on Broadcom devices that I directly or indirectly helped 
to support with bcm_sf2/b53 we have 2Gb/sec capable management ports and 
we run iperf directly on the host CPUs. Some ports remain standalone 
(e.g.: WAN) and the others can be bridged together (LAN + WLAN).

> Now what my driver says is that it offers a bit of both. It speaks the
> hardware's tagging protocol so it is capable of management traffic,
> but it also speaks the DSA paradigm, so in a way pushes the hardware
> to work in a mode it was never intended to, by repurposing VLANs when
> the user doesn't request them. So on one hand there is some overlap
> between the hardware tagging protocol and the VLAN one (in standalone
> mode and in VLAN-unaware bridged mode, management traffic *could* use
> VLAN tagging but it doesn't rely on it), and on the other hand the
> reunion of the two tagging protocols is decent, but still doesn't
> cover the entire spectrum (when put under a VLAN-aware bridge, you
> lose the ability to decode general traffic). So you'd better not rely
> on VLANs to decode the management traffic, because you won't be able
> to always rely on that, and that is a shame since a bridge with both
> vlan_filtering 1 and stp_state 1 is a real usage scenario, and the
> hardware is capable of that combination.
> But all of that is secondary. Let's forget about VLAN tagging for a
> second and concentrate on the tagging of management traffic. The
> limiting factor here is the software architecture of DSA, because in
> order for me to decode that in the driver/tagger, I'd have to drop
> everything else coming on the master net device (I explained in 13/24
> why). I believe that DSA being all-or-nothing about switch tagging is
> turning a blind eye to the devices that don't go overboard with
> features, and give you what's needed in a real-world design but not
> much else.

I would word it differently and say that up until now, whatever DSA 
assumed about switches was something that was supportable and with the 
sja1105 we are faced with an interesting of limits on both designs. I 
don't think DSA is unreasonable in assuming that management frame is 
always tagged with a proprietary switch protocol; because that's what 
has happened across a wide variety of vendors. The NXP SJA1105 is not 
unreasonable but it does present some challenges.

> What would you improve about this design (assuming you're talking
> about the filter function)?

Would assigning different MAC addresses to each standalone port help in 
any way such that you could leverage filtering in HW based on MAC DA?
diff mbox series

Patch

diff --git a/drivers/net/dsa/sja1105/sja1105.h b/drivers/net/dsa/sja1105/sja1105.h
index 80b20bdd8f9c..b7e745c0bb3a 100644
--- a/drivers/net/dsa/sja1105/sja1105.h
+++ b/drivers/net/dsa/sja1105/sja1105.h
@@ -5,6 +5,7 @@ 
 #ifndef _SJA1105_H
 #define _SJA1105_H
 
+#include <linux/dsa/sja1105.h>
 #include <net/dsa.h>
 #include "sja1105_static_config.h"
 
@@ -19,6 +20,12 @@ 
 #define SJA1105_NUM_TC			8
 #define SJA1105ET_FDB_BIN_SIZE		4
 
+struct sja1105_port {
+	struct dsa_port *dp;
+	struct work_struct xmit_work;
+	struct sja1105_skb_ring xmit_ring;
+};
+
 /* Keeps the different addresses between E/T and P/Q/R/S */
 struct sja1105_regs {
 	u64 device_id;
@@ -63,6 +70,7 @@  struct sja1105_private {
 	struct gpio_desc *reset_gpio;
 	struct spi_device *spidev;
 	struct dsa_switch *ds;
+	struct sja1105_port ports[SJA1105_NUM_PORTS];
 };
 
 #include "sja1105_dynamic_config.h"
diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c
index 9d28436fe466..018044f358fd 100644
--- a/drivers/net/dsa/sja1105/sja1105_main.c
+++ b/drivers/net/dsa/sja1105/sja1105_main.c
@@ -1112,10 +1112,27 @@  static int sja1105_vlan_apply(struct sja1105_private *priv, int port, u16 vid,
 	return 0;
 }
 
+static int sja1105_setup_8021q_tagging(struct dsa_switch *ds, bool enabled)
+{
+	int rc, i;
+
+	for (i = 0; i < SJA1105_NUM_PORTS; i++) {
+		rc = dsa_port_setup_8021q_tagging(ds, i, enabled);
+		if (rc < 0) {
+			dev_err(ds->dev, "Failed to setup VLAN tagging for port %d: %d\n",
+				i, rc);
+			return rc;
+		}
+	}
+	dev_info(ds->dev, "%s switch tagging\n",
+		 enabled ? "Enabled" : "Disabled");
+	return 0;
+}
+
 static enum dsa_tag_protocol
 sja1105_get_tag_protocol(struct dsa_switch *ds, int port)
 {
-	return DSA_TAG_PROTO_NONE;
+	return DSA_TAG_PROTO_SJA1105;
 }
 
 /* This callback needs to be present */
@@ -1141,7 +1158,11 @@  static int sja1105_vlan_filtering(struct dsa_switch *ds, int port, bool enabled)
 	if (rc)
 		dev_err(ds->dev, "Failed to change VLAN Ethertype\n");
 
-	return rc;
+	/* Switch port identification based on 802.1Q is only passable
+	 * if we are not under a vlan_filtering bridge. So make sure
+	 * the two configurations are mutually exclusive.
+	 */
+	return sja1105_setup_8021q_tagging(ds, !enabled);
 }
 
 static void sja1105_vlan_add(struct dsa_switch *ds, int port,
@@ -1233,9 +1254,107 @@  static int sja1105_setup(struct dsa_switch *ds)
 	 */
 	ds->vlan_filtering_is_global = true;
 
+	/* The DSA/switchdev model brings up switch ports in standalone mode by
+	 * default, and that means vlan_filtering is 0 since they're not under
+	 * a bridge, so it's safe to set up switch tagging at this time.
+	 */
+	return sja1105_setup_8021q_tagging(ds, true);
+}
+
+#include "../../../net/dsa/dsa_priv.h"
+/* Deferred work is unfortunately necessary because setting up the management
+ * route cannot be done from atomit context (SPI transfer takes a sleepable
+ * lock on the bus)
+ */
+static void sja1105_xmit_work_handler(struct work_struct *work)
+{
+	struct sja1105_port *sp = container_of(work, struct sja1105_port,
+						xmit_work);
+	struct sja1105_private *priv = sp->dp->ds->priv;
+	struct net_device *slave = sp->dp->slave;
+	struct net_device *master = dsa_slave_to_master(slave);
+	int port = (uintptr_t)(sp - priv->ports);
+	struct sk_buff *skb;
+	int i, rc;
+
+	while ((i = sja1105_skb_ring_get(&sp->xmit_ring, &skb)) >= 0) {
+		struct sja1105_mgmt_entry mgmt_route = { 0 };
+		struct ethhdr *hdr;
+		int timeout = 10;
+		int skb_len;
+
+		skb_len = skb->len;
+		hdr = eth_hdr(skb);
+
+		mgmt_route.macaddr = ether_addr_to_u64(hdr->h_dest);
+		mgmt_route.destports = BIT(port);
+		mgmt_route.enfport = 1;
+		mgmt_route.tsreg = 0;
+		mgmt_route.takets = false;
+
+		rc = sja1105_dynamic_config_write(priv, BLK_IDX_MGMT_ROUTE,
+						  port, &mgmt_route, true);
+		if (rc < 0) {
+			kfree_skb(skb);
+			slave->stats.tx_dropped++;
+			continue;
+		}
+
+		/* Transfer skb to the host port. */
+		skb->dev = master;
+		dev_queue_xmit(skb);
+
+		/* Wait until the switch has processed the frame */
+		do {
+			rc = sja1105_dynamic_config_read(priv, BLK_IDX_MGMT_ROUTE,
+							 port, &mgmt_route);
+			if (rc < 0) {
+				slave->stats.tx_errors++;
+				dev_err(priv->ds->dev,
+					"xmit: failed to poll for mgmt route\n");
+				continue;
+			}
+
+			/* UM10944: The ENFPORT flag of the respective entry is
+			 * cleared when a match is found. The host can use this
+			 * flag as an acknowledgment.
+			 */
+			cpu_relax();
+		} while (mgmt_route.enfport && --timeout);
+
+		if (!timeout) {
+			dev_err(priv->ds->dev, "xmit timed out\n");
+			slave->stats.tx_errors++;
+			continue;
+		}
+
+		slave->stats.tx_packets++;
+		slave->stats.tx_bytes += skb_len;
+	}
+}
+
+static int sja1105_port_enable(struct dsa_switch *ds, int port,
+			       struct phy_device *phydev)
+{
+	struct sja1105_private *priv = ds->priv;
+	struct sja1105_port *sp = &priv->ports[port];
+
+	sp->dp = &ds->ports[port];
+	INIT_WORK(&sp->xmit_work, sja1105_xmit_work_handler);
 	return 0;
 }
 
+static void sja1105_port_disable(struct dsa_switch *ds, int port)
+{
+	struct sja1105_private *priv = ds->priv;
+	struct sja1105_port *sp = &priv->ports[port];
+	struct sk_buff *skb;
+
+	cancel_work_sync(&sp->xmit_work);
+	while (sja1105_skb_ring_get(&sp->xmit_ring, &skb) >= 0)
+		kfree_skb(skb);
+}
+
 static const struct dsa_switch_ops sja1105_switch_ops = {
 	.get_tag_protocol	= sja1105_get_tag_protocol,
 	.setup			= sja1105_setup,
@@ -1255,6 +1374,8 @@  static const struct dsa_switch_ops sja1105_switch_ops = {
 	.port_mdb_prepare	= sja1105_mdb_prepare,
 	.port_mdb_add		= sja1105_mdb_add,
 	.port_mdb_del		= sja1105_mdb_del,
+	.port_enable		= sja1105_port_enable,
+	.port_disable		= sja1105_port_disable,
 };
 
 static int sja1105_check_device_id(struct sja1105_private *priv)
diff --git a/include/linux/dsa/sja1105.h b/include/linux/dsa/sja1105.h
new file mode 100644
index 000000000000..dd40ffa475cc
--- /dev/null
+++ b/include/linux/dsa/sja1105.h
@@ -0,0 +1,52 @@ 
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
+ */
+
+/* Included by drivers/net/dsa/sja1105/sja1105.h and net/dsa/tag_sja1105.c */
+
+#ifndef _NET_DSA_SJA1105_H
+#define _NET_DSA_SJA1105_H
+
+#include <linux/skbuff.h>
+#include <net/dsa.h>
+
+#define SJA1105_SKB_RING_SIZE	20
+
+struct sja1105_skb_ring {
+	struct sk_buff *skb[SJA1105_SKB_RING_SIZE];
+	int count;
+	int pi; /* Producer index */
+	int ci; /* Consumer index */
+};
+
+static inline int sja1105_skb_ring_add(struct sja1105_skb_ring *ring,
+				       struct sk_buff *skb)
+{
+	int index;
+
+	if (ring->count == SJA1105_SKB_RING_SIZE)
+		return -1;
+
+	index = ring->pi;
+	ring->skb[index] = skb;
+	ring->pi = (index + 1) % SJA1105_SKB_RING_SIZE;
+	ring->count++;
+	return index;
+}
+
+static inline int sja1105_skb_ring_get(struct sja1105_skb_ring *ring,
+				       struct sk_buff **skb)
+{
+	int index;
+
+	if (ring->count == 0)
+		return -1;
+
+	index = ring->ci;
+	*skb = ring->skb[index];
+	ring->ci = (index + 1) % SJA1105_SKB_RING_SIZE;
+	ring->count--;
+	return index;
+}
+
+#endif /* _NET_DSA_SJA1105_H */
diff --git a/include/net/dsa.h b/include/net/dsa.h
index e46c107507d8..9c7e29c4f22a 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -41,6 +41,7 @@  enum dsa_tag_protocol {
 	DSA_TAG_PROTO_KSZ9893,
 	DSA_TAG_PROTO_LAN9303,
 	DSA_TAG_PROTO_MTK,
+	DSA_TAG_PROTO_SJA1105,
 	DSA_TAG_PROTO_QCA,
 	DSA_TAG_PROTO_TRAILER,
 	DSA_TAG_LAST,		/* MUST BE LAST */
diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index b2fc07de8bcb..18deab52890f 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -65,6 +65,9 @@  config NET_DSA_TAG_LAN9303
 config NET_DSA_TAG_MTK
 	bool
 
+config NET_DSA_TAG_SJA1105
+	bool
+
 config NET_DSA_TAG_TRAILER
 	bool
 
diff --git a/net/dsa/Makefile b/net/dsa/Makefile
index d7fc3253d497..8c294cdb895a 100644
--- a/net/dsa/Makefile
+++ b/net/dsa/Makefile
@@ -15,4 +15,5 @@  dsa_core-$(CONFIG_NET_DSA_TAG_KSZ) += tag_ksz.o
 dsa_core-$(CONFIG_NET_DSA_TAG_LAN9303) += tag_lan9303.o
 dsa_core-$(CONFIG_NET_DSA_TAG_MTK) += tag_mtk.o
 dsa_core-$(CONFIG_NET_DSA_TAG_QCA) += tag_qca.o
+dsa_core-$(CONFIG_NET_DSA_TAG_SJA1105) += tag_sja1105.o
 dsa_core-$(CONFIG_NET_DSA_TAG_TRAILER) += tag_trailer.o
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 36de4f2a3366..7e2542d756e0 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -65,6 +65,9 @@  const struct dsa_device_ops *dsa_device_ops[DSA_TAG_LAST] = {
 #ifdef CONFIG_NET_DSA_TAG_MTK
 	[DSA_TAG_PROTO_MTK] = &mtk_netdev_ops,
 #endif
+#ifdef CONFIG_NET_DSA_TAG_SJA1105
+	[DSA_TAG_PROTO_SJA1105] = &sja1105_netdev_ops,
+#endif
 #ifdef CONFIG_NET_DSA_TAG_QCA
 	[DSA_TAG_PROTO_QCA] = &qca_netdev_ops,
 #endif
@@ -102,6 +105,9 @@  const char *dsa_tag_protocol_to_str(const struct dsa_device_ops *ops)
 #ifdef CONFIG_NET_DSA_TAG_MTK
 		[DSA_TAG_PROTO_MTK] = "mtk",
 #endif
+#ifdef CONFIG_NET_DSA_TAG_SJA1105
+		[DSA_TAG_PROTO_SJA1105] = "8021q",
+#endif
 #ifdef CONFIG_NET_DSA_TAG_QCA
 		[DSA_TAG_PROTO_QCA] = "qca",
 #endif
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index cc5ec3759952..d58a749c8b5b 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -236,6 +236,9 @@  extern const struct dsa_device_ops lan9303_netdev_ops;
 /* tag_mtk.c */
 extern const struct dsa_device_ops mtk_netdev_ops;
 
+/* tag_sja1105.c */
+extern const struct dsa_device_ops sja1105_netdev_ops;
+
 /* tag_qca.c */
 extern const struct dsa_device_ops qca_netdev_ops;
 
diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c
new file mode 100644
index 000000000000..5c76a06c9093
--- /dev/null
+++ b/net/dsa/tag_sja1105.c
@@ -0,0 +1,148 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
+ */
+#include <linux/etherdevice.h>
+#include <linux/if_vlan.h>
+#include <linux/dsa/sja1105.h>
+#include "../../drivers/net/dsa/sja1105/sja1105.h"
+
+#include "dsa_priv.h"
+
+/* Similar to is_link_local_ether_addr(hdr->h_dest) but also covers PTP */
+static inline bool sja1105_is_link_local(const struct sk_buff *skb)
+{
+	const struct ethhdr *hdr = eth_hdr(skb);
+	u64 dmac = ether_addr_to_u64(hdr->h_dest);
+
+	if ((dmac & SJA1105_LINKLOCAL_FILTER_A_MASK) ==
+		    SJA1105_LINKLOCAL_FILTER_A)
+		return true;
+	if ((dmac & SJA1105_LINKLOCAL_FILTER_B_MASK) ==
+		    SJA1105_LINKLOCAL_FILTER_B)
+		return true;
+	return false;
+}
+
+static bool sja1105_filter(const struct sk_buff *skb, struct net_device *dev)
+{
+	if (sja1105_is_link_local(skb))
+		return true;
+	if (!dev->dsa_ptr->vlan_filtering)
+		return true;
+	return false;
+}
+
+static struct sk_buff *sja1105_xmit(struct sk_buff *skb,
+				    struct net_device *netdev)
+{
+	struct dsa_port *dp = dsa_slave_to_port(netdev);
+	struct dsa_switch *ds = dp->ds;
+	struct sja1105_private *priv = ds->priv;
+	struct sja1105_port *sp = &priv->ports[dp->index];
+	struct sk_buff *clone;
+
+	if (likely(!sja1105_is_link_local(skb))) {
+		/* Normal traffic path. */
+		u16 tx_vid = dsa_tagging_tx_vid(ds, dp->index);
+		u8 pcp = skb->priority;
+
+		/* If we are under a vlan_filtering bridge, IP termination on
+		 * switch ports based on 802.1Q tags is simply too brittle to
+		 * be passable. So just defer to the dsa_slave_notag_xmit
+		 * implementation.
+		 */
+		if (dp->vlan_filtering)
+			return skb;
+
+		return dsa_8021q_xmit(skb, netdev, ETH_P_EDSA,
+				     ((pcp << VLAN_PRIO_SHIFT) | tx_vid));
+	}
+
+	/* Code path for transmitting management traffic. This does not rely
+	 * upon switch tagging, but instead SPI-installed management routes.
+	 */
+	clone = skb_clone(skb, GFP_ATOMIC);
+	if (!clone) {
+		dev_err(ds->dev, "xmit: failed to clone skb\n");
+		return NULL;
+	}
+
+	if (sja1105_skb_ring_add(&sp->xmit_ring, clone) < 0) {
+		dev_err(ds->dev, "xmit: skb ring full\n");
+		kfree_skb(clone);
+		return NULL;
+	}
+
+	if (sp->xmit_ring.count == SJA1105_SKB_RING_SIZE)
+		/* TODO setup a dedicated netdev queue for management traffic
+		 * so that we can selectively apply backpressure and not be
+		 * required to stop the entire traffic when the software skb
+		 * ring is full. This requires hooking the ndo_select_queue
+		 * from DSA and matching on mac_fltres.
+		 */
+		dev_err(ds->dev, "xmit: reached maximum skb ring size\n");
+
+	schedule_work(&sp->xmit_work);
+	/* Let DSA free its reference to the skb and we will free
+	 * the clone in the deferred worker
+	 */
+	return NULL;
+}
+
+static struct sk_buff *sja1105_rcv(struct sk_buff *skb,
+				   struct net_device *netdev,
+				   struct packet_type *pt)
+{
+	unsigned int source_port, switch_id;
+	struct ethhdr *hdr = eth_hdr(skb);
+	struct sk_buff *nskb;
+	u16 tpid, vid, tci;
+	bool is_tagged;
+
+	nskb = dsa_8021q_rcv(skb, netdev, pt, &tpid, &tci);
+	is_tagged = (nskb && tpid == ETH_P_EDSA);
+
+	skb->priority = (tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
+	vid = tci & VLAN_VID_MASK;
+
+	skb->offload_fwd_mark = 1;
+
+	if (likely(!sja1105_is_link_local(skb))) {
+		/* Normal traffic path. */
+		source_port = dsa_tagging_rx_source_port(vid);
+		switch_id = dsa_tagging_rx_switch_id(vid);
+	} else {
+		/* Management traffic path. Switch embeds the switch ID and
+		 * port ID into bytes of the destination MAC, courtesy of
+		 * the incl_srcpt options.
+		 */
+		source_port = hdr->h_dest[3];
+		switch_id = hdr->h_dest[4];
+		/* Clear the DMAC bytes that were mangled by the switch */
+		hdr->h_dest[3] = 0;
+		hdr->h_dest[4] = 0;
+	}
+
+	skb->dev = dsa_master_find_slave(netdev, switch_id, source_port);
+	if (!skb->dev) {
+		netdev_warn(netdev, "Couldn't decode source port\n");
+		return NULL;
+	}
+
+	/* Delete/overwrite fake VLAN header, DSA expects to not find
+	 * it there, see dsa_switch_rcv: skb_push(skb, ETH_HLEN).
+	 */
+	if (is_tagged)
+		memmove(skb->data - ETH_HLEN, skb->data - ETH_HLEN - VLAN_HLEN,
+			ETH_HLEN - VLAN_HLEN);
+
+	return skb;
+}
+
+const struct dsa_device_ops sja1105_netdev_ops = {
+	.xmit = sja1105_xmit,
+	.rcv = sja1105_rcv,
+	.filter = sja1105_filter,
+	.overhead = VLAN_HLEN,
+};
+