diff mbox

[net,V1,1/3] net/mlx4_en: Don't use irq_affinity_notifier to track changes in IRQ affinity map

Message ID 1404032097-5132-2-git-send-email-amirv@mellanox.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Amir Vadai June 29, 2014, 8:54 a.m. UTC
An IRQ can have only a single affinity notifier, and it is already used
by the cpu_rmap notifier, so it can't also be used to track changes in
the IRQ affinity map. Instead, detect IRQ affinity changes by comparing
the current CPU to the IRQ affinity map during the NAPI poll.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ben Hutchings <ben@decadent.org.uk>
Fixes: 2eacc23 ("net/mlx4_core: Enforce irq affinity changes immediatly")
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/cq.c      |  2 -
 drivers/net/ethernet/mellanox/mlx4/en_cq.c   |  4 ++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 16 +++++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  6 ---
 drivers/net/ethernet/mellanox/mlx4/eq.c      | 69 ++++------------------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  1 +
 include/linux/mlx4/device.h                  |  4 +-
 7 files changed, 28 insertions(+), 74 deletions(-)

Comments

Eric Dumazet June 30, 2014, 6:41 a.m. UTC | #1
On Sun, 2014-06-29 at 11:54 +0300, Amir Vadai wrote:
> An IRQ can have only a single affinity notifier, and it is already used
> by the cpu_rmap notifier, so it can't also be used to track changes in
> the IRQ affinity map. Instead, detect IRQ affinity changes by comparing
> the current CPU to the IRQ affinity map during the NAPI poll.

...

> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 8be7483..ac3dead 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -474,15 +474,9 @@ int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget)
>  	/* If we used up all the quota - we're probably not done yet... */
>  	if (done < budget) {
>  		/* Done for now */
> -		cq->mcq.irq_affinity_change = false;
>  		napi_complete(napi);
>  		mlx4_en_arm_cq(priv, cq);
>  		return done;
> -	} else if (unlikely(cq->mcq.irq_affinity_change)) {
> -		cq->mcq.irq_affinity_change = false;
> -		napi_complete(napi);
> -		mlx4_en_arm_cq(priv, cq);
> -		return 0;
>  	}
>  	return budget;
>  }

It seems nothing is done then for the TX side after this patch?

You might want to drain the whole queue instead of limiting to the
'budget'; otherwise, a CPU might be stuck servicing (soft)irqs for TX
completions even if the IRQ affinities say otherwise.




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Mellanox June 30, 2014, 8:34 a.m. UTC | #2
On 6/30/2014 9:41 AM, Eric Dumazet wrote:
> On Sun, 2014-06-29 at 11:54 +0300, Amir Vadai wrote:
>> An IRQ can have only a single affinity notifier, and it is already used
>> by the cpu_rmap notifier, so it can't also be used to track changes in
>> the IRQ affinity map. Instead, detect IRQ affinity changes by comparing
>> the current CPU to the IRQ affinity map during the NAPI poll.
> 
> ...
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> index 8be7483..ac3dead 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> @@ -474,15 +474,9 @@ int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget)
>>  	/* If we used up all the quota - we're probably not done yet... */
>>  	if (done < budget) {
>>  		/* Done for now */
>> -		cq->mcq.irq_affinity_change = false;
>>  		napi_complete(napi);
>>  		mlx4_en_arm_cq(priv, cq);
>>  		return done;
>> -	} else if (unlikely(cq->mcq.irq_affinity_change)) {
>> -		cq->mcq.irq_affinity_change = false;
>> -		napi_complete(napi);
>> -		mlx4_en_arm_cq(priv, cq);
>> -		return 0;
>>  	}
>>  	return budget;
>>  }
> 
> It seems nothing is done then for the TX side after this patch?
> 
> You might want to drain the whole queue instead of limiting to the
> 'budget'; otherwise, a CPU might be stuck servicing (soft)irqs for TX
> completions even if the IRQ affinities say otherwise.
> 

TX completions are very quick compared to skb preparation and sending,
which is not the case for RX completions. Because of that, it is very
easy to reproduce the problem in RX flows, but we have never had any
report of the problem in the TX flow.
I prefer not to spend time on the TX side, since we plan to send a patch
soon that uses the same NAPI for both TX and RX.

Amir
Eric Dumazet June 30, 2014, 9:11 a.m. UTC | #3
On Mon, 2014-06-30 at 11:34 +0300, Amir Vadai wrote:

> TX completions are very quick compared to skb preparation and sending,
> which is not the case for RX completions. Because of that, it is very
> easy to reproduce the problem in RX flows, but we have never had any
> report of the problem in the TX flow.

This is because reporters probably use the same number of RX and TX
queues.

With TCP Small Queues, TX completions are not always quick if thousands
of flows are active.

Some people hit a locked CPU when, say, one CPU has to drain 8 TX queues
while 7 other CPUs continuously feed in more packets.

> I prefer not to spend time on the TX, since we plan to send a patch soon
> to use the same NAPI for both TX and RX.

Thanks, this sounds great.


Amir Mellanox June 30, 2014, 10:43 a.m. UTC | #4
On 6/30/2014 12:11 PM, Eric Dumazet wrote:
> On Mon, 2014-06-30 at 11:34 +0300, Amir Vadai wrote:
> 
>> TX completions are very quick compared to skb preparation and sending,
>> which is not the case for RX completions. Because of that, it is very
>> easy to reproduce the problem in RX flows, but we have never had any
>> report of the problem in the TX flow.
> 
> This is because reporters probably use the same number of RX and TX
> queues.
> 
> With TCP Small Queues, TX completions are not always quick if thousands
> of flows are active.
> 
> Some people hit a locked CPU when, say, one CPU has to drain 8 TX queues
> while 7 other CPUs continuously feed in more packets.
> 
>> I prefer not to spend time on the TX, since we plan to send a patch soon
>> to use the same NAPI for both TX and RX.
> 
> Thanks, this sounds great.
> 
> 

Ok, so unless anyone objects, the plan is to continue with this patch
(RX only). If the unified NAPI patch will be delayed I will send the TX
fix too.

Amir

David Miller July 1, 2014, 3:33 a.m. UTC | #5
From: Amir Vadai <amirv.mellanox@gmail.com>
Date: Mon, 30 Jun 2014 11:34:22 +0300

> On 6/30/2014 9:41 AM, Eric Dumazet wrote:
>> You might want to drain the whole queue instead of limiting to the
>> 'budget'; otherwise, a CPU might be stuck servicing (soft)irqs for TX
>> completions even if the IRQ affinities say otherwise.
>> 
> 
> TX completions are very quick compared to skb preparation and sending,
> which is not the case for RX completions. Because of that, it is very
> easy to reproduce the problem in RX flows, but we have never had any
> report of the problem in the TX flow.
> I prefer not to spend time on the TX side, since we plan to send a patch
> soon that uses the same NAPI for both TX and RX.

It is always advised to completely ignore the budget for TX work; this is
what we tell every driver author when discussing NAPI implementations.

Please make your driver conform to this, thanks.
Amir Vadai July 1, 2014, 9:14 a.m. UTC | #6
On 07/01/2014 06:33 AM, David Miller wrote:
> From: Amir Vadai <amirv.mellanox@gmail.com>
> Date: Mon, 30 Jun 2014 11:34:22 +0300
>
>> On 6/30/2014 9:41 AM, Eric Dumazet wrote:
>>> You might want to drain the whole queue instead of limiting to the
>>> 'budget'; otherwise, a CPU might be stuck servicing (soft)irqs for TX
>>> completions even if the IRQ affinities say otherwise.
>>>
>>
>> TX completions are very quick compared to skb preparation and sending,
>> which is not the case for RX completions. Because of that, it is very
>> easy to reproduce the problem in RX flows, but we have never had any
>> report of the problem in the TX flow.
>> I prefer not to spend time on the TX side, since we plan to send a patch
>> soon that uses the same NAPI for both TX and RX.
>
> It is always advised to completely ignore the budget for TX work; this is
> what we tell every driver author when discussing NAPI implementations.
>
> Please make your driver conform to this, thanks.
>

Ok.

Please continue the process on this V1 of the patchset. The fix to the
TX poll is not related to this patch - this patch fixes a regression
that broke aRFS in mlx4_en.

I will send a separate fix to purge all packets on TX work later this
week.

Thanks,
Amir

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 80f7252..56022d6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -294,8 +294,6 @@  int mlx4_cq_alloc(struct mlx4_dev *dev, int nent,
 	init_completion(&cq->free);
 
 	cq->irq = priv->eq_table.eq[cq->vector].irq;
-	cq->irq_affinity_change = false;
-
 	return 0;
 
 err_radix:
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 4b21307..1213cc7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -128,6 +128,10 @@  int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 					mlx4_warn(mdev, "Failed assigning an EQ to %s, falling back to legacy EQ's\n",
 						  name);
 				}
+
+				cq->irq_desc =
+					irq_to_desc(mlx4_eq_get_irq(mdev->dev,
+								    cq->vector));
 			}
 		} else {
 			cq->vector = (cq->ring + 1 + priv->port) %
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index d2d4157..9672417 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -40,6 +40,7 @@ 
 #include <linux/if_ether.h>
 #include <linux/if_vlan.h>
 #include <linux/vmalloc.h>
+#include <linux/irq.h>
 
 #include "mlx4_en.h"
 
@@ -896,16 +897,25 @@  int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 
 	/* If we used up all the quota - we're probably not done yet... */
 	if (done == budget) {
+		int cpu_curr;
+		const struct cpumask *aff;
+
 		INC_PERF_COUNTER(priv->pstats.napi_quota);
-		if (unlikely(cq->mcq.irq_affinity_change)) {
-			cq->mcq.irq_affinity_change = false;
+
+		cpu_curr = smp_processor_id();
+		aff = irq_desc_get_irq_data(cq->irq_desc)->affinity;
+
+		if (unlikely(!cpumask_test_cpu(cpu_curr, aff))) {
+			/* Current cpu is not according to smp_irq_affinity -
+			 * probably affinity changed. need to stop this NAPI
+			 * poll, and restart it on the right CPU
+			 */
 			napi_complete(napi);
 			mlx4_en_arm_cq(priv, cq);
 			return 0;
 		}
 	} else {
 		/* Done for now */
-		cq->mcq.irq_affinity_change = false;
 		napi_complete(napi);
 		mlx4_en_arm_cq(priv, cq);
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 8be7483..ac3dead 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -474,15 +474,9 @@  int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget)
 	/* If we used up all the quota - we're probably not done yet... */
 	if (done < budget) {
 		/* Done for now */
-		cq->mcq.irq_affinity_change = false;
 		napi_complete(napi);
 		mlx4_en_arm_cq(priv, cq);
 		return done;
-	} else if (unlikely(cq->mcq.irq_affinity_change)) {
-		cq->mcq.irq_affinity_change = false;
-		napi_complete(napi);
-		mlx4_en_arm_cq(priv, cq);
-		return 0;
 	}
 	return budget;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index d954ec1..2a004b3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -53,11 +53,6 @@  enum {
 	MLX4_EQ_ENTRY_SIZE	= 0x20
 };
 
-struct mlx4_irq_notify {
-	void *arg;
-	struct irq_affinity_notify notify;
-};
-
 #define MLX4_EQ_STATUS_OK	   ( 0 << 28)
 #define MLX4_EQ_STATUS_WRITE_FAIL  (10 << 28)
 #define MLX4_EQ_OWNER_SW	   ( 0 << 24)
@@ -1088,57 +1083,6 @@  static void mlx4_unmap_clr_int(struct mlx4_dev *dev)
 	iounmap(priv->clr_base);
 }
 
-static void mlx4_irq_notifier_notify(struct irq_affinity_notify *notify,
-				     const cpumask_t *mask)
-{
-	struct mlx4_irq_notify *n = container_of(notify,
-						 struct mlx4_irq_notify,
-						 notify);
-	struct mlx4_priv *priv = (struct mlx4_priv *)n->arg;
-	struct radix_tree_iter iter;
-	void **slot;
-
-	radix_tree_for_each_slot(slot, &priv->cq_table.tree, &iter, 0) {
-		struct mlx4_cq *cq = (struct mlx4_cq *)(*slot);
-
-		if (cq->irq == notify->irq)
-			cq->irq_affinity_change = true;
-	}
-}
-
-static void mlx4_release_irq_notifier(struct kref *ref)
-{
-	struct mlx4_irq_notify *n = container_of(ref, struct mlx4_irq_notify,
-						 notify.kref);
-	kfree(n);
-}
-
-static void mlx4_assign_irq_notifier(struct mlx4_priv *priv,
-				     struct mlx4_dev *dev, int irq)
-{
-	struct mlx4_irq_notify *irq_notifier = NULL;
-	int err = 0;
-
-	irq_notifier = kzalloc(sizeof(*irq_notifier), GFP_KERNEL);
-	if (!irq_notifier) {
-		mlx4_warn(dev, "Failed to allocate irq notifier. irq %d\n",
-			  irq);
-		return;
-	}
-
-	irq_notifier->notify.irq = irq;
-	irq_notifier->notify.notify = mlx4_irq_notifier_notify;
-	irq_notifier->notify.release = mlx4_release_irq_notifier;
-	irq_notifier->arg = priv;
-	err = irq_set_affinity_notifier(irq, &irq_notifier->notify);
-	if (err) {
-		kfree(irq_notifier);
-		irq_notifier = NULL;
-		mlx4_warn(dev, "Failed to set irq notifier. irq %d\n", irq);
-	}
-}
-
-
 int mlx4_alloc_eq_table(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
@@ -1409,8 +1353,6 @@  int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
 				continue;
 				/*we dont want to break here*/
 			}
-			mlx4_assign_irq_notifier(priv, dev,
-						 priv->eq_table.eq[vec].irq);
 
 			eq_set_ci(&priv->eq_table.eq[vec], 1);
 		}
@@ -1427,6 +1369,14 @@  int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
 }
 EXPORT_SYMBOL(mlx4_assign_eq);
 
+int mlx4_eq_get_irq(struct mlx4_dev *dev, int vec)
+{
+	struct mlx4_priv *priv = mlx4_priv(dev);
+
+	return priv->eq_table.eq[vec].irq;
+}
+EXPORT_SYMBOL(mlx4_eq_get_irq);
+
 void mlx4_release_eq(struct mlx4_dev *dev, int vec)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
@@ -1438,9 +1388,6 @@  void mlx4_release_eq(struct mlx4_dev *dev, int vec)
 		  Belonging to a legacy EQ*/
 		mutex_lock(&priv->msix_ctl.pool_lock);
 		if (priv->msix_ctl.pool_bm & 1ULL << i) {
-			irq_set_affinity_notifier(
-				priv->eq_table.eq[vec].irq,
-				NULL);
 			free_irq(priv->eq_table.eq[vec].irq,
 				 &priv->eq_table.eq[vec]);
 			priv->msix_ctl.pool_bm &= ~(1ULL << i);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 0e15295..624e193 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -343,6 +343,7 @@  struct mlx4_en_cq {
 #define CQ_USER_PEND (MLX4_EN_CQ_STATE_POLL | MLX4_EN_CQ_STATE_POLL_YIELD)
 	spinlock_t poll_lock; /* protects from LLS/napi conflicts */
 #endif  /* CONFIG_NET_RX_BUSY_POLL */
+	struct irq_desc *irq_desc;
 };
 
 struct mlx4_en_port_profile {
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index b12f4bb..35b51e7 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -578,8 +578,6 @@  struct mlx4_cq {
 	u32			cons_index;
 
 	u16                     irq;
-	bool                    irq_affinity_change;
-
 	__be32		       *set_ci_db;
 	__be32		       *arm_db;
 	int			arm_sn;
@@ -1167,6 +1165,8 @@  int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
 		   int *vector);
 void mlx4_release_eq(struct mlx4_dev *dev, int vec);
 
+int mlx4_eq_get_irq(struct mlx4_dev *dev, int vec);
+
 int mlx4_get_phys_port_id(struct mlx4_dev *dev);
 int mlx4_wol_read(struct mlx4_dev *dev, u64 *config, int port);
 int mlx4_wol_write(struct mlx4_dev *dev, u64 config, int port);