diff mbox

[net-next,2/2] sunvnet: vnet_start_xmit() must hold a refcnt on port.

Message ID 20141001185622.GI17706@oracle.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Sowmini Varadhan Oct. 1, 2014, 6:56 p.m. UTC
A vnet_port_remove could be triggered as a result of an ldm-unbind operation
by the peer, or a module unload, or other changes to the inter-vnet-link
configuration.  When this is concurrent with vnet_start_xmit(),
the following sequence could occur

  thread 1                                    thread 2
vnet_start_xmit
 -> tx_port_find
    spin_lock_irqsave(&vp->lock..)
    ret = __tx_port_find(..)
    spin_lock_irqrestore(&vp->lock..)
                                           vio_remove -> ..
                                               ->vnet_port_remove
                                           spin_lock_irqsave(&vp->lock..)
                                           cleanup
                                           spin_lock_irqrestore(&vp->lock..)
                                           kfree(port)
 /* attempt to use ret will bomb */

This patch avoids the problem by holding/releasing a refcnt on
the vnet_port.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Raghuram Kothakota <raghuram.kothakota@oracle.com>

---
 drivers/net/ethernet/sun/sunvnet.c | 61 +++++++++++++++++++++++++++++++++-----
 drivers/net/ethernet/sun/sunvnet.h |  2 ++
 2 files changed, 55 insertions(+), 8 deletions(-)

Comments

Eric Dumazet Oct. 1, 2014, 7:05 p.m. UTC | #1
On Wed, 2014-10-01 at 14:56 -0400, Sowmini Varadhan wrote:
> A vnet_port_remove could be triggered as a result of an ldm-unbind operation
> by the peer, or a module unload, or other changes to the inter-vnet-link
> configuration.  When this is concurrent with vnet_start_xmit(),
> the following sequence could occur
> 
>   thread 1                                    thread 2
> vnet_start_xmit
>  -> tx_port_find
>     spin_lock_irqsave(&vp->lock..)
>     ret = __tx_port_find(..)
>     spin_lock_irqrestore(&vp->lock..)
>                                            vio_remove -> ..
>                                                ->vnet_port_remove
>                                            spin_lock_irqsave(&vp->lock..)
>                                            cleanup
>                                            spin_lock_irqrestore(&vp->lock..)
>                                            kfree(port)
>  /* attempt to use ret will bomb */
> 
> This patch avoids the problem by holding/releasing a refcnt on
> the vnet_port.

Hmpff... This calls for rcu protection here !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David L Stevens Oct. 1, 2014, 7:06 p.m. UTC | #2
On 10/01/2014 02:56 PM, Sowmini Varadhan wrote:

> -	list_for_each_entry(port, &vp->port_list, list) {
> -		if (!port->switch_port)
> -			continue;
> -		if (!port_is_up(port))
> -			continue;
> +	port = vp->switch_port;
> +	if (port != NULL && port_is_up(port)) {
> +		vnet_hold(port);
>  		return port;
>  	}
>  	return NULL;

This "vp->switch_port" addition doesn't appear to be related to the port refcnt
change, and doesn't allow for multiple switch ports.

						+-DLS


> @@ -1581,10 +1621,12 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
>  	port->switch_port = switch_port;
>  
>  	spin_lock_irqsave(&vp->lock, flags);
> -	if (switch_port)
> +	if (switch_port) {
> +		vp->switch_port = port;
>  		list_add(&port->list, &vp->port_list);

...and here.

> -	else
> +	} else {
>  		list_add_tail(&port->list, &vp->port_list);
> +	}
>  	hlist_add_head(&port->hash, &vp->port_hash[vnet_hashfn(port->raddr)]);
>  	spin_unlock_irqrestore(&vp->lock, flags);
>  
> @@ -1631,11 +1673,14 @@ static int vnet_port_remove(struct vio_dev *vdev)
>  		 */
>  		spin_lock_irqsave(&vp->lock, flags);
>  		port->flags |= VNET_PORT_DEAD;
> +		if (port->switch_port)
> +			vp->switch_port = NULL;
>  		list_del(&port->list);
>  		hlist_del(&port->hash);

..and here.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sowmini Varadhan Oct. 1, 2014, 7:23 p.m. UTC | #3
On (10/01/14 15:06), David L Stevens wrote:
> 
> This "vp->switch_port" addition doesn't appear to be related to the port refcnt
> change, and doesn't allow for multiple switch ports.

The switch_port is the connection to Dom0. Do you envision us having more than
one switch_port? How?

--Sowmini

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David L Stevens Oct. 1, 2014, 7:31 p.m. UTC | #4
On 10/01/2014 03:23 PM, Sowmini Varadhan wrote:
> On (10/01/14 15:06), David L Stevens wrote:
>>
>> This "vp->switch_port" addition doesn't appear to be related to the port refcnt
>> change, and doesn't allow for multiple switch ports.
> 
> The switch_port is the connection to Dom0. Do you envision us having more than
> one switch_port? How?

While Dom0 might only create one port with the "switch" flag, the flag just means
"I can reach anybody" and is not inherently unique. I don't think an attached
VM should assume there is always only one; it prevents multipath load balancing
kinds of things in the future.

Also, there is the broader point that this sort of change should be a separate patch.
It isn't required for fixing the dangling reference -- it is an independent change.

							+-DLS

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sowmini Varadhan Oct. 1, 2014, 7:44 p.m. UTC | #5
On (10/01/14 12:05), Eric Dumazet wrote:
> 
> Hmpff... This calls for rcu protection here !
> 

I did consider that, but given that the lists containing the ports are
accessed in multiple contexts, some of which can sleep, and given that
the vnet port is similar in spirit to the net_device, I followed the
net_device model of dev_put etc.

--Sowmini

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 1, 2014, 7:52 p.m. UTC | #6
From: David L Stevens <david.stevens@oracle.com>
Date: Wed, 01 Oct 2014 15:31:49 -0400

> 
> 
> On 10/01/2014 03:23 PM, Sowmini Varadhan wrote:
>> On (10/01/14 15:06), David L Stevens wrote:
>>>
>>> This "vp->switch_port" addition doesn't appear to be related to the port refcnt
>>> change, and doesn't allow for multiple switch ports.
>> 
>> The switch_port is the connection to Dom0. Do you envision us having more than
>> one switch_port? How?
> 
> While Dom0 might only create one port with the "switch" flag, the flag just means
> "I can reach anybody" and is not inherently unique. I don't think an attached
> VM should assume there is always only one; it prevents multipath load balancing
> kinds of things in the future.
> 
> Also, there is the broader point that this sort of change should be a separate patch.
> It isn't required for fixing the dangling reference -- it is an independent change.

Multiple switch ports are absolutely allowed by the protocol spec and can
provide the suggested facilities David mentioned, don't prevent them from
being used.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 1, 2014, 8:15 p.m. UTC | #7
On Wed, 2014-10-01 at 15:44 -0400, Sowmini Varadhan wrote:
> On (10/01/14 12:05), Eric Dumazet wrote:
> > 
> > Hmpff... This calls for rcu protection here !
> > 
> 
> I did consider that, but given that the lists containing the ports are
> accessed in multiple contexts, some of which can sleep, and given that
> the vnet port is similar in spirit to the net_device, I followed the
> net_device model of dev_put etc.

dev_put() uses per cpu variables, an order of magnitude faster than a
atomic put/get :(



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sowmini Varadhan Oct. 1, 2014, 8:19 p.m. UTC | #8
On 10/01/2014 04:15 PM, Eric Dumazet wrote:
> On Wed, 2014-10-01 at 15:44 -0400, Sowmini Varadhan wrote:
>> On (10/01/14 12:05), Eric Dumazet wrote:
>>>
>>> Hmpff... This calls for rcu protection here !
>>>
>>
>> I did consider that, but given that the lists containing the ports are
>> accessed in multiple contexts, some of which can sleep, and given that
>> the vnet port is similar in spirit to the net_device, I followed the
>> net_device model of dev_put etc.
>
> dev_put() uses per cpu variables, an order of magnitude faster than a
> atomic put/get :(

I could make this a per-cpu variable, it would not be too hard to change
vnet_put/hold etc. That's not a problem.

--Sowmini


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 1, 2014, 8:24 p.m. UTC | #9
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 01 Oct 2014 16:19:41 -0400

> On 10/01/2014 04:15 PM, Eric Dumazet wrote:
>> On Wed, 2014-10-01 at 15:44 -0400, Sowmini Varadhan wrote:
>>> On (10/01/14 12:05), Eric Dumazet wrote:
>>>>
>>>> Hmpff... This calls for rcu protection here !
>>>>
>>>
>>> I did consider that, but given that the lists containing the ports are
>>> accessed in multiple contexts, some of which can sleep, and given that
>>> the vnet port is similar in spirit to the net_device, I followed the
>>> net_device model of dev_put etc.
>>
>> dev_put() uses per cpu variables, an order of magnitude faster than a
>> atomic put/get :(
> 
> I could make this a per-cpu variable, it would not be too hard to
> change
> vnet_put/hold etc. That's not a problem.

If you went with NAPI you could use RCU.

All of the negative and complicated aspects of your changes strictly
have to do with not going with that implementation route.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Raghuram Kothakota Oct. 2, 2014, 3:23 a.m. UTC | #10
On Oct 1, 2014, at 12:31 PM, David L Stevens <david.stevens@oracle.com> wrote:

> 
> 
> On 10/01/2014 03:23 PM, Sowmini Varadhan wrote:
>> On (10/01/14 15:06), David L Stevens wrote:
>>> 
>>> This "vp->switch_port" addition doesn't appear to be related to the port refcnt
>>> change, and doesn't allow for multiple switch ports.
>> 
>> The switch_port is the connection to Dom0. Do you envision us having more than
>> one switch_port? How?
> 
> While Dom0 might only create one port with the "switch" flag, the flag just means
> "I can reach anybody" and is not inherently unique. I don't think an attached
> VM should assume there is always only one; it prevents multipath load balancing
> kinds of things in the future.

At the moment our architecture defines only one switch-port. We certainly toyed
with the idea of multiple paths(actually LDCs) mainly for the performance but
we were able to achieve our goal with one LDC so that was not introduced.
Our original design included the idea of multiple LDCs in a port, but not 
multiple ports for performance purpose. We also explored adding paths to
different virtual switches, mainly intended for path failover when one of them
failed due to service domain crash/panic.  We have such feature in our virtual
disk, there we realized it is very proprietary implementation and cost of maintaining
and improving those features is high. We abandoned these proprietary solutions
but rely on the rich features that already exists in the Guest OS operating systems
with multiple network devices.  In summary, currently we do not have any plans
to multiple paths.

Even if we implement multiple paths, we expect it is the choice of the Guest on how
it utilizes those paths as it need to have the knowledge of these multiple paths
and use them as designed.  That is, the driver implementation has to change.
For now, I do not see any issues with this driver assuming one switch-port. If we
ever support multiple paths, I would expect this driver to change, at that time
the code can implement appropriate method of load balancing or failover.

-Raghuram
> 
> Also, there is the broader point that this sort of change should be a separate patch.
> It isn't required for fixing the dangling reference -- it is an independent change.
> 
> 							+-DLS
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Raghuram Kothakota Oct. 2, 2014, 3:36 a.m. UTC | #11
On Oct 1, 2014, at 12:52 PM, David Miller <davem@davemloft.net> wrote:

> From: David L Stevens <david.stevens@oracle.com>
> Date: Wed, 01 Oct 2014 15:31:49 -0400
> 
>> 
>> 
>> On 10/01/2014 03:23 PM, Sowmini Varadhan wrote:
>>> On (10/01/14 15:06), David L Stevens wrote:
>>>> 
>>>> This "vp->switch_port" addition doesn't appear to be related to the port refcnt
>>>> change, and doesn't allow for multiple switch ports.
>>> 
>>> The switch_port is the connection to Dom0. Do you envision us having more than
>>> one switch_port? How?
>> 
>> While Dom0 might only create one port with the "switch" flag, the flag just means
>> "I can reach anybody" and is not inherently unique. I don't think an attached
>> VM should assume there is always only one; it prevents multipath load balancing
>> kinds of things in the future.
>> 
>> Also, there is the broader point that this sort of change should be a separate patch.
>> It isn't required for fixing the dangling reference -- it is an independent change.
> 
> Multiple switch ports are absolutely allowed by the protocol spec and can
> provide the suggested facilities David mentioned, don't prevent them from
> being used.

In reality, introducing multiple switch-ports will need Guest driver change
as well. The existing sunvnet driver will not automatically utilize all switch-ports
and requires changes. When we add the full support to use multiple switch-ports,
I am sure we can change the current optimization for switch-port lookup with
a different method probably with another optimized method than what it is today.

-Raghuram--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index e2aacf5..dca4397 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -47,6 +47,42 @@  MODULE_VERSION(DRV_MODULE_VERSION);
 
 static int __vnet_tx_trigger(struct vnet_port *port, u32 start);
 
+static inline int vnet_refcnt_read(const struct vnet_port *port)
+{
+	return atomic_read(&port->refcnt);
+}
+
+static inline void vnet_hold(struct vnet_port *port)
+{
+	atomic_inc(&port->refcnt);
+	BUG_ON(vnet_refcnt_read(port) == 0);
+}
+
+static void vnet_put(struct vnet_port *port)
+{
+	atomic_dec(&port->refcnt);
+}
+
+static void vnet_wait_allrefs(struct vnet_port *port)
+{
+	unsigned long warning_time;
+	int refcnt = vnet_refcnt_read(port);
+
+	warning_time = jiffies;
+	while (refcnt != 0) {
+		msleep(250);
+		refcnt = vnet_refcnt_read(port);
+		if (time_after(jiffies, warning_time + 10 * HZ)) {
+			pr_emerg("vnet_wait_allrefs: waiting for port to "
+				 "%x:%x:%x:%x:%x:%x. Usage count = %d\n",
+				port->raddr[0], port->raddr[1],
+				port->raddr[2], port->raddr[3],
+				port->raddr[4], port->raddr[5], refcnt);
+			warning_time = jiffies;
+		}
+	}
+}
+
 /* Ordered from largest major to lowest */
 static struct vio_version vnet_versions[] = {
 	{ .major = 1, .minor = 6 },
@@ -773,14 +809,14 @@  struct vnet_port *__tx_port_find(struct vnet *vp, struct sk_buff *skb)
 	hlist_for_each_entry(port, hp, hash) {
 		if (!port_is_up(port))
 			continue;
-		if (ether_addr_equal(port->raddr, skb->data))
+		if (ether_addr_equal(port->raddr, skb->data)) {
+			vnet_hold(port);
 			return port;
+		}
 	}
-	list_for_each_entry(port, &vp->port_list, list) {
-		if (!port->switch_port)
-			continue;
-		if (!port_is_up(port))
-			continue;
+	port = vp->switch_port;
+	if (port != NULL && port_is_up(port)) {
+		vnet_hold(port);
 		return port;
 	}
 	return NULL;
@@ -974,6 +1010,7 @@  static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			dev->stats.tx_errors++;
 		}
 		spin_unlock_irqrestore(&port->vio.lock, flags);
+		vnet_put(port);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -1071,6 +1108,7 @@  ldc_start_done:
 	vnet_free_skbs(freeskbs);
 
 	(void)mod_timer(&port->clean_timer, jiffies + VNET_CLEAN_TIMEOUT);
+	vnet_put(port);
 
 	return NETDEV_TX_OK;
 
@@ -1087,6 +1125,8 @@  out_dropped:
 	else
 		del_timer(&port->clean_timer);
 	dev->stats.tx_dropped++;
+	if (port)
+		vnet_put(port);
 	return NETDEV_TX_OK;
 }
 
@@ -1581,10 +1621,12 @@  static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 	port->switch_port = switch_port;
 
 	spin_lock_irqsave(&vp->lock, flags);
-	if (switch_port)
+	if (switch_port) {
+		vp->switch_port = port;
 		list_add(&port->list, &vp->port_list);
-	else
+	} else {
 		list_add_tail(&port->list, &vp->port_list);
+	}
 	hlist_add_head(&port->hash, &vp->port_hash[vnet_hashfn(port->raddr)]);
 	spin_unlock_irqrestore(&vp->lock, flags);
 
@@ -1631,11 +1673,14 @@  static int vnet_port_remove(struct vio_dev *vdev)
 		 */
 		spin_lock_irqsave(&vp->lock, flags);
 		port->flags |= VNET_PORT_DEAD;
+		if (port->switch_port)
+			vp->switch_port = NULL;
 		list_del(&port->list);
 		hlist_del(&port->hash);
 		spin_unlock_irqrestore(&vp->lock, flags);
 
 		vnet_workq_disable(port);
+		vnet_wait_allrefs(port);
 
 		vnet_port_free_tx_bufs(port);
 		vio_ldc_free(&port->vio);
diff --git a/drivers/net/ethernet/sun/sunvnet.h b/drivers/net/ethernet/sun/sunvnet.h
index 1182ec6..6f62ea5 100644
--- a/drivers/net/ethernet/sun/sunvnet.h
+++ b/drivers/net/ethernet/sun/sunvnet.h
@@ -53,6 +53,7 @@  struct vnet_port {
 	u32			stop_rx_idx;
 	bool			stop_rx;
 	bool			start_cons;
+	atomic_t		refcnt;
 
 	struct timer_list	clean_timer;
 
@@ -104,6 +105,7 @@  struct vnet {
 	u64			local_mac;
 
 	struct tasklet_struct	vnet_tx_wakeup;
+	struct vnet_port	*switch_port;
 };
 
 #endif /* _SUNVNET_H */