diff mbox series

[net] net: dsa: reference count the host mdb addresses

Message ID 20201015212711.724678-1-vladimir.oltean@nxp.com
State Changes Requested
Delegated to: David Miller
Headers show
Series [net] net: dsa: reference count the host mdb addresses | expand

Commit Message

Vladimir Oltean Oct. 15, 2020, 9:27 p.m. UTC
Currently any DSA switch that implements the multicast ops (properly,
that is) gets these errors after just sitting for a while, with at least
2 ports bridged:

[  286.013814] mscc_felix 0000:00:00.5 swp3: failed (err=-2) to del object (id=3)

The reason has to do with this piece of code:

	netdev_for_each_lower_dev(dev, lower_dev, iter)
		br_mdb_switchdev_host_port(dev, lower_dev, mp, type);

called from:

br_multicast_group_expired
-> br_multicast_host_leave
   -> br_mdb_notify
      -> br_mdb_switchdev_host

Basically, that code is correct. It tells each switchdev port that the
host can leave that multicast group. But in the case of DSA, all user
ports are connected to the host through the same pipe. So, because DSA
translates a host MDB to a normal MDB on the CPU port, this means that
when all user ports leave a multicast group, DSA tries to remove it N
times from the CPU port.

We should be reference-counting these addresses.

Fixes: 5f4dbc50ce4d ("net: dsa: slave: Handle switchdev host mdb add/del")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/dsa.h |  9 +++++
 net/dsa/dsa2.c    |  2 ++
 net/dsa/slave.c   | 92 ++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 94 insertions(+), 9 deletions(-)

Comments

Jakub Kicinski Oct. 19, 2020, 11:55 p.m. UTC | #1
On Fri, 16 Oct 2020 00:27:11 +0300 Vladimir Oltean wrote:
> Currently any DSA switch that implements the multicast ops (properly,
> that is) gets these errors after just sitting for a while, with at least
> 2 ports bridged:
> 
> [  286.013814] mscc_felix 0000:00:00.5 swp3: failed (err=-2) to del object (id=3)
> 
> The reason has to do with this piece of code:
> 
> 	netdev_for_each_lower_dev(dev, lower_dev, iter)
> 		br_mdb_switchdev_host_port(dev, lower_dev, mp, type);

We need a review on this one, anyone?
Andrew Lunn Oct. 20, 2020, 12:06 a.m. UTC | #2
On Fri, Oct 16, 2020 at 12:27:11AM +0300, Vladimir Oltean wrote:
> Currently any DSA switch that implements the multicast ops (properly,
> that is) gets these errors after just sitting for a while, with at least
> 2 ports bridged:
> 
> [  286.013814] mscc_felix 0000:00:00.5 swp3: failed (err=-2) to del object (id=3)
> 
> The reason has to do with this piece of code:
> 
> 	netdev_for_each_lower_dev(dev, lower_dev, iter)
> 		br_mdb_switchdev_host_port(dev, lower_dev, mp, type);
> 
> called from:
> 
> br_multicast_group_expired
> -> br_multicast_host_leave
>    -> br_mdb_notify
>       -> br_mdb_switchdev_host
> 
> Basically, that code is correct. It tells each switchdev port that the
> host can leave that multicast group. But in the case of DSA, all user
> ports are connected to the host through the same pipe. So, because DSA
> translates a host MDB to a normal MDB on the CPU port, this means that
> when all user ports leave a multicast group, DSA tries to remove it N
> times from the CPU port.

Hi Vladimir

I agree with the analysis. This is how i designed it!

As far as i remember, none of the switches at the time would report an
error when asked to delete an MDB which did not exist. They also would
not give an error when adding an MDB which already exists.

So i decided to keep it KISS and not bother with reference counting.

> +static int dsa_host_mdb_add(struct dsa_port *dp,
> +			    const struct switchdev_obj_port_mdb *mdb,
> +			    struct switchdev_trans *trans)
> +{
> +	struct dsa_port *cpu_dp = dp->cpu_dp;
> +	struct dsa_switch *ds = dp->ds;
> +	struct dsa_host_addr *a;
> +	int err;
> +
> +	/* Only the commit phase is refcounted, which means that for the
> +	 * second, third, etc port which is member of this host address,
> +	 * we'll call the prepare phase but never commit.
> +	 */
> +	if (switchdev_trans_ph_prepare(trans))
> +		return dsa_port_mdb_add(cpu_dp, mdb, trans);
> +
> +	a = dsa_host_mdb_find(ds, mdb);
> +	if (a) {
> +		refcount_inc(&a->refcount);
> +		return 0;
> +	}
> +
> +	a = kzalloc(sizeof(*a), GFP_KERNEL);
> +	if (!a)
> +		return -ENOMEM;
> +

The other part of the argument is that DSA is stateless, in that there
is no dynamic memory allocation. Drivers are also stateless in terms
of dynamically allocating memory.

So, what value do this code add? Why do we actually need reference
counting?

	Andrew
Andrew Lunn Oct. 20, 2020, 12:07 a.m. UTC | #3
On Mon, Oct 19, 2020 at 04:55:14PM -0700, Jakub Kicinski wrote:
> On Fri, 16 Oct 2020 00:27:11 +0300 Vladimir Oltean wrote:
> > Currently any DSA switch that implements the multicast ops (properly,
> > that is) gets these errors after just sitting for a while, with at least
> > 2 ports bridged:
> > 
> > [  286.013814] mscc_felix 0000:00:00.5 swp3: failed (err=-2) to del object (id=3)
> > 
> > The reason has to do with this piece of code:
> > 
> > 	netdev_for_each_lower_dev(dev, lower_dev, iter)
> > 		br_mdb_switchdev_host_port(dev, lower_dev, mp, type);
> 
> We need a review on this one, anyone?

Hi Jakub

Thanks for the reminder. It has been on my TODO list since i got back
from vacation.

	Andrew
Jakub Kicinski Oct. 20, 2020, 12:13 a.m. UTC | #4
On Tue, 20 Oct 2020 02:07:46 +0200 Andrew Lunn wrote:
> On Mon, Oct 19, 2020 at 04:55:14PM -0700, Jakub Kicinski wrote:
> > On Fri, 16 Oct 2020 00:27:11 +0300 Vladimir Oltean wrote:  
> > > Currently any DSA switch that implements the multicast ops (properly,
> > > that is) gets these errors after just sitting for a while, with at least
> > > 2 ports bridged:
> > > 
> > > [  286.013814] mscc_felix 0000:00:00.5 swp3: failed (err=-2) to del object (id=3)
> > > 
> > > The reason has to do with this piece of code:
> > > 
> > > 	netdev_for_each_lower_dev(dev, lower_dev, iter)
> > > 		br_mdb_switchdev_host_port(dev, lower_dev, mp, type);  
> > 
> > We need a review on this one, anyone?  
> 
> Hi Jakub
> 
> Thanks for the reminder. It has been on my TODO list since i got back
> from vacation.

Good to have you back! :)
Vladimir Oltean Oct. 20, 2020, 12:28 a.m. UTC | #5
On Tue, Oct 20, 2020 at 02:06:32AM +0200, Andrew Lunn wrote:
> I agree with the analysis. This is how i designed it!
[...]
> So i decided to keep it KISS and not bother with reference counting.

Well, I can't argue with that then...

> The other part of the argument is that DSA is stateless, in that there
> is no dynamic memory allocation. Drivers are also stateless in terms
> of dynamically allocating memory.

And is that some sort of design goal? I think you'll run out of things
to do very quickly if you set out to never call kmalloc().

> So, what value do this code add? Why do we actually need reference
> counting?

Well, for one thing, if you have multiple bridge interfaces spanning the
ports of a single switch ASIC (and this is not uncommon with port count > 4)
then you'll have the timer expiry from one bridge clear the host MDB
entries of the other. Weird. And in general, you can't just "add first,
delete first" with this type of things. I actually got some pushback
from Vivien an year ago on a topic very similar to this: in the other
place where DSA is "lazy" and does not implement refcounting, which is
VLANs on the CPU port, at least the VLAN is not deleted. That would be
more correct, at least, than performing 6 additions, then 1 single
deletion which would invalidate the entry for all the other ports.

Also, I am taking the opportunity to add the refcounting infrastructure
for host FDB entries (this goes back to the patch series about DSA RX
filtering). At the moment, host addresses are added from a single type
of source (aka, a switchdev HOST_MDB object). But with unicast, there
could be more than one sources that a unicast address could be
added from:
- the MAC addresses of the ports. These addresses should be installed as
  host FDB entries
- the MAC addresses of the upper interfaces. Similar argument here.
- the local (master) FDB of the bridge. My plan of record is to offload
  this using a new switchdev HOST_FDB object, similar to what you've
  done for HOST_MDB.
In this case, having reference counting is pretty much something to
have, when an address could come from more than one place, we wouldn't
want to break anything.
Also, consider what happens when you start having the ability to install
the MAC address of the switch ports as a host FDB entry. DSA configures
all interfaces to have the same MAC address, which is inherited from the
master's MAC address. But you can also change the MAC address of a
switch interface. What do you do with the old one? Do you delete it? Do
you keep it? If you don't delete it, you might run out of FDB space
after enough MAC address changes. If you delete it, you might break
traffic for the other switch ports that are still using it...
Florian Fainelli Oct. 20, 2020, 2:11 a.m. UTC | #6
On 10/15/2020 2:27 PM, Vladimir Oltean wrote:
> Currently any DSA switch that implements the multicast ops (properly,
> that is) gets these errors after just sitting for a while, with at least
> 2 ports bridged:
> 
> [  286.013814] mscc_felix 0000:00:00.5 swp3: failed (err=-2) to del object (id=3)
> 
> The reason has to do with this piece of code:
> 
> 	netdev_for_each_lower_dev(dev, lower_dev, iter)
> 		br_mdb_switchdev_host_port(dev, lower_dev, mp, type);
> 
> called from:
> 
> br_multicast_group_expired
> -> br_multicast_host_leave
>     -> br_mdb_notify
>        -> br_mdb_switchdev_host
> 
> Basically, that code is correct. It tells each switchdev port that the
> host can leave that multicast group. But in the case of DSA, all user
> ports are connected to the host through the same pipe. So, because DSA
> translates a host MDB to a normal MDB on the CPU port, this means that
> when all user ports leave a multicast group, DSA tries to remove it N
> times from the CPU port.
> 
> We should be reference-counting these addresses.
> 
> Fixes: 5f4dbc50ce4d ("net: dsa: slave: Handle switchdev host mdb add/del")
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

This looks good to me, just one possible problem below:

[snip]

> +
> +	a = kzalloc(sizeof(*a), GFP_KERNEL);
> +	if (!a)
> +		return -ENOMEM;

I believe this needs to be GFP_ATOMIC if we are to follow 
net/bridge/br_mdb.c::br_mdb_notify which does its netlink messages 
allocations using GFP_ATOMIC. On a side note, it woul dbe very helpful 
if we could annotate all bridge functions accordingly with their context 
(BH, process, etc.).
diff mbox series

Patch

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 35429a140dfa..bad3877761b9 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -251,6 +251,13 @@  struct dsa_link {
 	struct list_head list;
 };
 
+struct dsa_host_addr {
+	unsigned char addr[ETH_ALEN];
+	u16 vid;
+	refcount_t refcount;
+	struct list_head list;
+};
+
 struct dsa_switch {
 	bool setup;
 
@@ -333,6 +340,8 @@  struct dsa_switch {
 	 */
 	bool			mtu_enforcement_ingress;
 
+	struct list_head	host_mdb;
+
 	size_t num_ports;
 };
 
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 183003e45762..52b3ef34a2cb 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -413,6 +413,8 @@  static int dsa_switch_setup(struct dsa_switch *ds)
 	if (ds->setup)
 		return 0;
 
+	INIT_LIST_HEAD(&ds->host_mdb);
+
 	/* Initialize ds->phys_mii_mask before registering the slave MDIO bus
 	 * driver and before ops->setup() has run, since the switch drivers and
 	 * the slave MDIO bus driver rely on these values for probing PHY
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 3bc5ca40c9fb..1bf3c406e2f8 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -376,6 +376,87 @@  static int dsa_slave_vlan_add(struct net_device *dev,
 	return 0;
 }
 
+static struct dsa_host_addr *
+dsa_host_mdb_find(struct dsa_switch *ds,
+		  const struct switchdev_obj_port_mdb *mdb)
+{
+	struct dsa_host_addr *a;
+
+	list_for_each_entry(a, &ds->host_mdb, list)
+		if (ether_addr_equal(a->addr, mdb->addr) && a->vid == mdb->vid)
+			return a;
+
+	return NULL;
+}
+
+/* DSA can directly translate this to a normal MDB add, but on the CPU port.
+ * But because multiple user ports can join the same multicast group and the
+ * bridge will emit a notification for each port, we need to add/delete the
+ * entry towards the host only once, so we reference count it.
+ */
+static int dsa_host_mdb_add(struct dsa_port *dp,
+			    const struct switchdev_obj_port_mdb *mdb,
+			    struct switchdev_trans *trans)
+{
+	struct dsa_port *cpu_dp = dp->cpu_dp;
+	struct dsa_switch *ds = dp->ds;
+	struct dsa_host_addr *a;
+	int err;
+
+	/* Only the commit phase is refcounted, which means that for the
+	 * second, third, etc port which is member of this host address,
+	 * we'll call the prepare phase but never commit.
+	 */
+	if (switchdev_trans_ph_prepare(trans))
+		return dsa_port_mdb_add(cpu_dp, mdb, trans);
+
+	a = dsa_host_mdb_find(ds, mdb);
+	if (a) {
+		refcount_inc(&a->refcount);
+		return 0;
+	}
+
+	a = kzalloc(sizeof(*a), GFP_KERNEL);
+	if (!a)
+		return -ENOMEM;
+
+	err = dsa_port_mdb_add(cpu_dp, mdb, trans);
+	if (err)
+		return err;
+
+	ether_addr_copy(a->addr, mdb->addr);
+	a->vid = mdb->vid;
+	refcount_set(&a->refcount, 1);
+	list_add_tail(&a->list, &ds->host_mdb);
+
+	return 0;
+}
+
+static int dsa_host_mdb_del(struct dsa_port *dp,
+			    const struct switchdev_obj_port_mdb *mdb)
+{
+	struct dsa_port *cpu_dp = dp->cpu_dp;
+	struct dsa_switch *ds = dp->ds;
+	struct dsa_host_addr *a;
+	int err;
+
+	a = dsa_host_mdb_find(ds, mdb);
+	if (!a)
+		return -ENOENT;
+
+	if (!refcount_dec_and_test(&a->refcount))
+		return 0;
+
+	err = dsa_port_mdb_del(cpu_dp, mdb);
+	if (err)
+		return err;
+
+	list_del(&a->list);
+	kfree(a);
+
+	return 0;
+}
+
 static int dsa_slave_port_obj_add(struct net_device *dev,
 				  const struct switchdev_obj *obj,
 				  struct switchdev_trans *trans,
@@ -396,11 +477,7 @@  static int dsa_slave_port_obj_add(struct net_device *dev,
 		err = dsa_port_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
 		break;
 	case SWITCHDEV_OBJ_ID_HOST_MDB:
-		/* DSA can directly translate this to a normal MDB add,
-		 * but on the CPU port.
-		 */
-		err = dsa_port_mdb_add(dp->cpu_dp, SWITCHDEV_OBJ_PORT_MDB(obj),
-				       trans);
+		err = dsa_host_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
 		break;
 	case SWITCHDEV_OBJ_ID_PORT_VLAN:
 		err = dsa_slave_vlan_add(dev, obj, trans);
@@ -455,10 +532,7 @@  static int dsa_slave_port_obj_del(struct net_device *dev,
 		err = dsa_port_mdb_del(dp, SWITCHDEV_OBJ_PORT_MDB(obj));
 		break;
 	case SWITCHDEV_OBJ_ID_HOST_MDB:
-		/* DSA can directly translate this to a normal MDB add,
-		 * but on the CPU port.
-		 */
-		err = dsa_port_mdb_del(dp->cpu_dp, SWITCHDEV_OBJ_PORT_MDB(obj));
+		err = dsa_host_mdb_del(dp, SWITCHDEV_OBJ_PORT_MDB(obj));
 		break;
 	case SWITCHDEV_OBJ_ID_PORT_VLAN:
 		err = dsa_slave_vlan_del(dev, obj);