[net-next,02/11] igb: Use node specific allocations for the q_vectors and rings

Message ID: 1318056461-19562-3-git-send-email-jeffrey.t.kirsher@intel.com
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Kirsher, Jeffrey T Oct. 8, 2011, 6:47 a.m. UTC
From: Alexander Duyck <alexander.h.duyck@intel.com>

This change is meant to update the ring and vector allocations so that they
are per node instead of allocating everything on the node that
ifconfig/modprobe is called on.  By doing this we can cut down
significantly on cross node traffic.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by:  Aaron Brown  <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/igb.h      |    4 ++
 drivers/net/ethernet/intel/igb/igb_main.c |   80 +++++++++++++++++++++++++++--
 2 files changed, 79 insertions(+), 5 deletions(-)

Comments

David Miller Oct. 8, 2011, 7:51 p.m. UTC | #1
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Fri,  7 Oct 2011 23:47:32 -0700

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change is meant to update the ring and vector allocations so that they
> are per node instead of allocating everything on the node that
> ifconfig/modprobe is called on.  By doing this we can cut down
> significantly on cross node traffic.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by:  Aaron Brown  <aaron.f.brown@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

adapter->node seems superfluous.

It's always "-1" when we enter the allocation functions, and we
always restore it to its original value upon exit from such
functions.

Just get rid of it and use a local variable in these functions
to keep track of the current allocation node.
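
A minimal sketch of that refactor (illustrative only, not code from the
posted series):

	/* sketch: a local node walk replaces adapter->node */
	static int igb_alloc_queues(struct igb_adapter *adapter)
	{
		struct igb_ring *ring;
		int cur_node = -1;
		int i;

		for (i = 0; i < adapter->num_tx_queues; i++) {
			/* next_online_node(-1) yields the first online node */
			cur_node = next_online_node(cur_node);
			if (cur_node == MAX_NUMNODES)
				cur_node = first_online_node;

			ring = kzalloc_node(sizeof(*ring), GFP_KERNEL, cur_node);
			if (!ring)
				ring = kzalloc(sizeof(*ring), GFP_KERNEL);
			if (!ring)
				goto err;
			ring->numa_node = cur_node;
			/* ... remaining ring init as in the patch ... */
			adapter->tx_ring[i] = ring;
		}
		/* ... the rx loop continues the same walk ... */
		return 0;
	err:
		igb_free_queues(adapter);
		return -ENOMEM;
	}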

Also, what ensures that MSI-X interrupts are targeted to a cpu
on the node where you've made these allocations?  I was
pretty sure Ben Hutchings added infrastructure that's usable
to ensure this, but I can't see where you're using it.
Andi Kleen Oct. 9, 2011, 6:08 p.m. UTC | #2
Jeff Kirsher <jeffrey.t.kirsher@intel.com> writes:
>  
>  	for (i = 0; i < adapter->num_tx_queues; i++) {
> -		ring = kzalloc(sizeof(struct igb_ring), GFP_KERNEL);
> +		if (orig_node == -1) {
> +			int cur_node = next_online_node(adapter->node);
> +			if (cur_node == MAX_NUMNODES)
> +				cur_node = first_online_node;

RR seems quite arbitrary. Who guarantees those nodes have any
relationship with the CPUs submitting on those queues, or with
the node the device is on?
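
For reference, the node the device sits on is available from
dev_to_node(); a hypothetical allocation pinned there, rather than
walked round-robin, might look like:

	/* hypothetical: put every ring on the adapter's home node */
	int node = dev_to_node(&adapter->pdev->dev);

	ring = kzalloc_node(sizeof(struct igb_ring), GFP_KERNEL, node);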

Anyway, if it's a good idea we probably need to add a
dma_alloc_coherent_node() too.

-Andi
Duyck, Alexander H Oct. 10, 2011, 4:15 p.m. UTC | #3
On 10/08/2011 12:51 PM, David Miller wrote:
> From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Date: Fri,  7 Oct 2011 23:47:32 -0700
> 
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>>
>> This change is meant to update the ring and vector allocations so that they
>> are per node instead of allocating everything on the node that
>> ifconfig/modprobe is called on.  By doing this we can cut down
>> significantly on cross node traffic.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> Tested-by:  Aaron Brown  <aaron.f.brown@intel.com>
>> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> 
> adapter->node seems superfluous.
> 
> It's always "-1" when we enter the allocation functions, and we
> always restore it to its original value upon exit from such
> functions.
> 
> Just get rid of it and use a local variable in these functions
> to keep track of the current allocation node.
> 
> Also, what ensures that MSI-X interrupts are targeted to a cpu
> on the node where you've made these allocations?  I was
> pretty sure Ben Hutchings added infrastructure that's usable
> to ensure this, but I can't see where you're using it.

Actually, the main reason for having adapter->node is that in our
out-of-tree driver we end up using it as a module parameter in the event
that someone is running in single-queue mode and wants to split up the
ports between nodes.  As such I would prefer to keep the parameter
around and just default it to -1, as I am currently doing.  However, if
it must go I guess I can work around that sync-up issue.

In this case we don't have any guarantee other than the fact that most
people, when trying to get performance, will arrange their IRQs in
round-robin fashion.  However, this approach is still preferable to
allocating all of the rings on one node and incurring the possible
overhead of nearly all accesses landing on a single node.  The igb
implementation doesn't yet have the code in place for the irq affinity
hints.  It is one of the few things remaining for me to sync up between
igb and ixgbe, but it is on my list of things to do.
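
A sketch of what wiring that up might look like, using the generic
irq_set_affinity_hint() API and the numa_node field this patch adds
(hypothetical; it assumes numa_node holds a valid node and is not part
of the posted series):

	/* hypothetical: hint each MSI-X vector toward the node its
	 * q_vector was allocated on */
	for (v_idx = 0; v_idx < adapter->num_q_vectors; v_idx++) {
		struct igb_q_vector *q_vector = adapter->q_vector[v_idx];

		irq_set_affinity_hint(adapter->msix_entries[v_idx].vector,
				      cpumask_of_node(q_vector->numa_node));
	}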

Thanks,

Alex
Duyck, Alexander H Oct. 10, 2011, 4:25 p.m. UTC | #4
On 10/09/2011 11:08 AM, Andi Kleen wrote:
> Jeff Kirsher <jeffrey.t.kirsher@intel.com> writes:
>>  
>>  	for (i = 0; i < adapter->num_tx_queues; i++) {
>> -		ring = kzalloc(sizeof(struct igb_ring), GFP_KERNEL);
>> +		if (orig_node == -1) {
>> +			int cur_node = next_online_node(adapter->node);
>> +			if (cur_node == MAX_NUMNODES)
>> +				cur_node = first_online_node;
> 
> RR seems quite arbitrary. Who guarantees those nodes have any
> relationship with the CPUs submitting on those queues, or with
> the node the device is on?
> 
> Anyway, if it's a good idea we probably need to add a
> dma_alloc_coherent_node() too.
> 
> -Andi
> 

The RR configuration is somewhat arbitrary.  However, it is still better
than dumping everything on a single node, and it works well in the
configuration where the ring numbers line up with the CPU numbers, since
normally the CPUs are assigned round-robin across the nodes.  From what
I have seen it does work quite well and prevents almost all cross-node
memory accesses when running a routing workload.

I was thinking along the same lines for dma_alloc_coherent_node as well.
I've been meaning to get to it but I just haven't had the time.  I'm
intentionally holding off on the ixgbe version of these patches until I
get the time to write up such a function, at which point I will write a
patch to convert igb over to it.
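
A hypothetical sketch of such a helper, lifting the set_dev_node()
pattern the patch below open-codes in igb_setup_tx_resources() and
igb_setup_rx_resources() (not an existing kernel API):

	/* hypothetical wrapper: temporarily re-home the device so the
	 * coherent allocation lands on the requested node */
	static void *dma_alloc_coherent_node(struct device *dev, size_t size,
					     dma_addr_t *dma_handle, gfp_t gfp,
					     int node)
	{
		int orig_node = dev_to_node(dev);
		void *cpu_addr;

		set_dev_node(dev, node);
		cpu_addr = dma_alloc_coherent(dev, size, dma_handle, gfp);
		set_dev_node(dev, orig_node);

		return cpu_addr;
	}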

Thanks,

Alex
Andi Kleen Oct. 10, 2011, 4:32 p.m. UTC | #5
> The RR configuration is somewhat arbitrary.  However, it is still better
> than dumping everything on a single node, and it works well in the
> configuration where the ring numbers line up with the CPU numbers, since
> normally the CPUs are assigned round-robin across the nodes.  From what
> I have seen it does work quite well and prevents almost all cross-node
> memory accesses when running a routing workload.

Ok so it's optimized for one specific workload. I'm sure you'll
find some other workload where it doesn't work out.

I suppose it's hard to get right in the general case, but it would be
best if ethtool at least had a nice and easy interface to set it.

However, one disadvantage of that patch over the existing state of the
art (numactl modprobe ...) is that there's no longer a way to override
the placement.  So if you do the forced RR I think you need the ethtool
part too, or at least some parameter to turn it off.

-Andi
Duyck, Alexander H Oct. 10, 2011, 5:02 p.m. UTC | #6
On 10/10/2011 09:32 AM, Andi Kleen wrote:
>> The RR configuration is somewhat arbitrary.  However, it is still better
>> than dumping everything on a single node, and it works well in the
>> configuration where the ring numbers line up with the CPU numbers, since
>> normally the CPUs are assigned round-robin across the nodes.  From what
>> I have seen it does work quite well and prevents almost all cross-node
>> memory accesses when running a routing workload.
> 
> Ok so it's optimized for one specific workload. I'm sure you'll
> find some other workload where it doesn't work out.

It isn't that I optimized it for one specific workload.  I was just
citing that specific workload as one of the ones seeing the advantage.

> I suppose it's hard to get right in the general case, but it would be
> best if ethtool at least had a nice and easy interface to set it.

It seems like the general case is never right for this.  At least in
this case it becomes much easier to line up the memory and interrupts so
that they are all affinitized to the same core.  From there, RPS/RFS can
typically be used to spread the work out further if necessary.

> However, one disadvantage of that patch over the existing state of the
> art (numactl modprobe ...) is that there's no longer a way to override
> the placement.  So if you do the forced RR I think you need the ethtool
> part too, or at least some parameter to turn it off.
> 
> -Andi

The counter-argument, though, is that the approach you mention always
limits you to one node.  At least with this approach we are spread out
over multiple nodes so that we can make full use of the memory bandwidth
on the system.

Thanks,

Alex
David Miller Oct. 10, 2011, 5:50 p.m. UTC | #7
From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Mon, 10 Oct 2011 09:15:02 -0700

> Actually, the main reason for having adapter->node is that in our
> out-of-tree driver we end up using it as a module parameter in the event
> that someone is running in single-queue mode and wants to split up the
> ports between nodes.  As such I would prefer to keep the parameter
> around and just default it to -1, as I am currently doing.  However, if
> it must go I guess I can work around that sync-up issue.

Please stop adding such hacks to your out-of-tree driver and add
appropriate, generic configuration mechanisms to the upstream tree.

It absolutely is not appropriate to add something which is completely
useless to the upstream tree for the sake of something being done
only externally.

You guys are the best at upstream net driver maintenance, so it
really surprises me that you continue to do completely unacceptable
crap like this.  Write the necessary generic non-module-option
mechanisms to facilitate the features you need and kill your
out-of-tree driver _now_.

Patch

diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index de35c02..9e4bed3 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -185,6 +185,8 @@ struct igb_q_vector {
 	u16 cpu;
 	u16 tx_work_limit;
 
+	int numa_node;
+
 	u16 itr_val;
 	u8 set_itr;
 	void __iomem *itr_register;
@@ -232,6 +234,7 @@ struct igb_ring {
 	};
 	/* Items past this point are only used during ring alloc / free */
 	dma_addr_t dma;                /* phys address of the ring */
+	int numa_node;                  /* node to alloc ring memory on */
 };
 
 #define IGB_RING_FLAG_RX_CSUM        0x00000001 /* RX CSUM enabled */
@@ -341,6 +344,7 @@ struct igb_adapter {
 	int vf_rate_link_speed;
 	u32 rss_queues;
 	u32 wvbr;
+	int node;
 };
 
 #define IGB_FLAG_HAS_MSI           (1 << 0)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 1c234f0..287be85 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -687,41 +687,68 @@ static int igb_alloc_queues(struct igb_adapter *adapter)
 {
 	struct igb_ring *ring;
 	int i;
+	int orig_node = adapter->node;
 
 	for (i = 0; i < adapter->num_tx_queues; i++) {
-		ring = kzalloc(sizeof(struct igb_ring), GFP_KERNEL);
+		if (orig_node == -1) {
+			int cur_node = next_online_node(adapter->node);
+			if (cur_node == MAX_NUMNODES)
+				cur_node = first_online_node;
+			adapter->node = cur_node;
+		}
+		ring = kzalloc_node(sizeof(struct igb_ring), GFP_KERNEL,
+				    adapter->node);
+		if (!ring)
+			ring = kzalloc(sizeof(struct igb_ring), GFP_KERNEL);
 		if (!ring)
 			goto err;
 		ring->count = adapter->tx_ring_count;
 		ring->queue_index = i;
 		ring->dev = &adapter->pdev->dev;
 		ring->netdev = adapter->netdev;
+		ring->numa_node = adapter->node;
 		/* For 82575, context index must be unique per ring. */
 		if (adapter->hw.mac.type == e1000_82575)
 			ring->flags = IGB_RING_FLAG_TX_CTX_IDX;
 		adapter->tx_ring[i] = ring;
 	}
+	/* Restore the adapter's original node */
+	adapter->node = orig_node;
 
 	for (i = 0; i < adapter->num_rx_queues; i++) {
-		ring = kzalloc(sizeof(struct igb_ring), GFP_KERNEL);
+		if (orig_node == -1) {
+			int cur_node = next_online_node(adapter->node);
+			if (cur_node == MAX_NUMNODES)
+				cur_node = first_online_node;
+			adapter->node = cur_node;
+		}
+		ring = kzalloc_node(sizeof(struct igb_ring), GFP_KERNEL,
+				    adapter->node);
+		if (!ring)
+			ring = kzalloc(sizeof(struct igb_ring), GFP_KERNEL);
 		if (!ring)
 			goto err;
 		ring->count = adapter->rx_ring_count;
 		ring->queue_index = i;
 		ring->dev = &adapter->pdev->dev;
 		ring->netdev = adapter->netdev;
+		ring->numa_node = adapter->node;
 		ring->flags = IGB_RING_FLAG_RX_CSUM; /* enable rx checksum */
 		/* set flag indicating ring supports SCTP checksum offload */
 		if (adapter->hw.mac.type >= e1000_82576)
 			ring->flags |= IGB_RING_FLAG_RX_SCTP_CSUM;
 		adapter->rx_ring[i] = ring;
 	}
+	/* Restore the adapter's original node */
+	adapter->node = orig_node;
 
 	igb_cache_ring_register(adapter);
 
 	return 0;
 
 err:
+	/* Restore the adapter's original node */
+	adapter->node = orig_node;
 	igb_free_queues(adapter);
 
 	return -ENOMEM;
@@ -1087,9 +1114,24 @@ static int igb_alloc_q_vectors(struct igb_adapter *adapter)
 	struct igb_q_vector *q_vector;
 	struct e1000_hw *hw = &adapter->hw;
 	int v_idx;
+	int orig_node = adapter->node;
 
 	for (v_idx = 0; v_idx < adapter->num_q_vectors; v_idx++) {
-		q_vector = kzalloc(sizeof(struct igb_q_vector), GFP_KERNEL);
+		if ((adapter->num_q_vectors == (adapter->num_rx_queues +
+						adapter->num_tx_queues)) &&
+		    (adapter->num_rx_queues == v_idx))
+			adapter->node = orig_node;
+		if (orig_node == -1) {
+			int cur_node = next_online_node(adapter->node);
+			if (cur_node == MAX_NUMNODES)
+				cur_node = first_online_node;
+			adapter->node = cur_node;
+		}
+		q_vector = kzalloc_node(sizeof(struct igb_q_vector), GFP_KERNEL,
+					adapter->node);
+		if (!q_vector)
+			q_vector = kzalloc(sizeof(struct igb_q_vector),
+					   GFP_KERNEL);
 		if (!q_vector)
 			goto err_out;
 		q_vector->adapter = adapter;
@@ -1098,9 +1140,14 @@ static int igb_alloc_q_vectors(struct igb_adapter *adapter)
 		netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);
 		adapter->q_vector[v_idx] = q_vector;
 	}
+	/* Restore the adapter's original node */
+	adapter->node = orig_node;
+
 	return 0;
 
 err_out:
+	/* Restore the adapter's original node */
+	adapter->node = orig_node;
 	igb_free_q_vectors(adapter);
 	return -ENOMEM;
 }
@@ -2409,6 +2456,8 @@ static int __devinit igb_sw_init(struct igb_adapter *adapter)
 				  VLAN_HLEN;
 	adapter->min_frame_size = ETH_ZLEN + ETH_FCS_LEN;
 
+	adapter->node = -1;
+
 	spin_lock_init(&adapter->stats64_lock);
 #ifdef CONFIG_PCI_IOV
 	switch (hw->mac.type) {
@@ -2579,10 +2628,13 @@ static int igb_close(struct net_device *netdev)
 int igb_setup_tx_resources(struct igb_ring *tx_ring)
 {
 	struct device *dev = tx_ring->dev;
+	int orig_node = dev_to_node(dev);
 	int size;
 
 	size = sizeof(struct igb_tx_buffer) * tx_ring->count;
-	tx_ring->tx_buffer_info = vzalloc(size);
+	tx_ring->tx_buffer_info = vzalloc_node(size, tx_ring->numa_node);
+	if (!tx_ring->tx_buffer_info)
+		tx_ring->tx_buffer_info = vzalloc(size);
 	if (!tx_ring->tx_buffer_info)
 		goto err;
 
@@ -2590,16 +2642,24 @@ int igb_setup_tx_resources(struct igb_ring *tx_ring)
 	tx_ring->size = tx_ring->count * sizeof(union e1000_adv_tx_desc);
 	tx_ring->size = ALIGN(tx_ring->size, 4096);
 
+	set_dev_node(dev, tx_ring->numa_node);
 	tx_ring->desc = dma_alloc_coherent(dev,
 					   tx_ring->size,
 					   &tx_ring->dma,
 					   GFP_KERNEL);
+	set_dev_node(dev, orig_node);
+	if (!tx_ring->desc)
+		tx_ring->desc = dma_alloc_coherent(dev,
+						   tx_ring->size,
+						   &tx_ring->dma,
+						   GFP_KERNEL);
 
 	if (!tx_ring->desc)
 		goto err;
 
 	tx_ring->next_to_use = 0;
 	tx_ring->next_to_clean = 0;
+
 	return 0;
 
 err:
@@ -2722,10 +2782,13 @@ static void igb_configure_tx(struct igb_adapter *adapter)
 int igb_setup_rx_resources(struct igb_ring *rx_ring)
 {
 	struct device *dev = rx_ring->dev;
+	int orig_node = dev_to_node(dev);
 	int size, desc_len;
 
 	size = sizeof(struct igb_rx_buffer) * rx_ring->count;
-	rx_ring->rx_buffer_info = vzalloc(size);
+	rx_ring->rx_buffer_info = vzalloc_node(size, rx_ring->numa_node);
+	if (!rx_ring->rx_buffer_info)
+		rx_ring->rx_buffer_info = vzalloc(size);
 	if (!rx_ring->rx_buffer_info)
 		goto err;
 
@@ -2735,10 +2798,17 @@ int igb_setup_rx_resources(struct igb_ring *rx_ring)
 	rx_ring->size = rx_ring->count * desc_len;
 	rx_ring->size = ALIGN(rx_ring->size, 4096);
 
+	set_dev_node(dev, rx_ring->numa_node);
 	rx_ring->desc = dma_alloc_coherent(dev,
 					   rx_ring->size,
 					   &rx_ring->dma,
 					   GFP_KERNEL);
+	set_dev_node(dev, orig_node);
+	if (!rx_ring->desc)
+		rx_ring->desc = dma_alloc_coherent(dev,
+						   rx_ring->size,
+						   &rx_ring->dma,
+						   GFP_KERNEL);
 
 	if (!rx_ring->desc)
 		goto err;