
[net-next,2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

Message ID d46b4fa36c32affff20fcc92148e33ee6176f899.1474060371.git.jbaron@akamai.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Jason Baron Sept. 16, 2016, 9:30 p.m. UTC
Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 131 ++++++++++++++++++-------
 1 file changed, 94 insertions(+), 37 deletions(-)

Comments

Mintz, Yuval Sept. 18, 2016, 10:25 a.m. UTC | #1
> Currently, we can have high order page allocations that specify
> GFP_ATOMIC when configuring multicast MAC address filters.
> 
> For example, we have seen order 2 page allocation failures with
> ~500 multicast addresses configured.
> 
> Convert the allocation for the pending list to be done in PAGE_SIZE
> increments.
> 
> Signed-off-by: Jason Baron <jbaron@akamai.com>

While I appreciate the effort, I wonder whether it's worth it:

- The hardware [even in its newer generation] provides an approximation-based
classification [i.e., hashed] with 256 bins.
When configuring 500 multicast addresses, one can argue the
difference between multicast-promisc mode and actual configuration
is insignificant.
Perhaps the easier-to-maintain alternative would simply be to
determine the maximal number of multicast addresses that can be
configured using a single PAGE, and if more than that is needed,
simply move into multicast-promisc mode.

 - While GFP_ATOMIC is required in this flow due to the fact it's being
called from sleepless context, I do believe this is mostly a remnant -
it's possible that by slightly changing the locking scheme we can have
the configuration done from a sleepable context and simply switch to
GFP_KERNEL instead.

Regarding the patch itself, the only comment I have is:
> +			elem_group = (struct bnx2x_mcast_elem_group *)
> +				     elem_group->mcast_group_link.next;
Let's use list_next_entry() instead.
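i.e., something like (untested):

	elem_group = list_next_entry(elem_group, mcast_group_link);

in both places where the cast is done.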
Jason Baron Sept. 19, 2016, 6:33 p.m. UTC | #2
On 09/18/2016 06:25 AM, Mintz, Yuval wrote:
>> Currently, we can have high order page allocations that specify
>> GFP_ATOMIC when configuring multicast MAC address filters.
>>
>> For example, we have seen order 2 page allocation failures with
>> ~500 multicast addresses configured.
>>
>> Convert the allocation for the pending list to be done in PAGE_SIZE
>> increments.
>>
>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>
> While I appreciate the effort, I wonder whether it's worth it:
>
> - The hardware [even in its newer generation] provides an approximate
> based classification [I.e., hashed] with 256 bins.
> When configuring 500 multicast addresses, one can argue the
> difference between multicast-promisc mode and actual configuration
> is insignificant.

With 256 bins, I think it takes close to 256*lg(256), or 2,048,
multicast addresses before we would expect every bin to have at least
one hash, assuming a uniform distribution of the hashes.
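For reference, that is just the coupon-collector estimate; the textbook
expectation for n = 256 bins works out a bit lower, so 2,048 is a
conservative round figure:

	E[T] = n \cdot H_n \approx n(\ln n + \gamma) = 256 \cdot (5.55 + 0.58) \approx 1570
	256 \cdot \log_2 256 = 256 \cdot 8 = 2048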

> Perhaps the easier-to-maintain alternative would simply be to
> determine the maximal number of multicast addresses that can be
> configured using a single PAGE, and if in need of more than that
> simply move into multicast-promisc.
>

sizeof(struct bnx2x_mcast_list_elem) = 24, so there are 170 per
page on x86. If we want to fit 2,048 elements, we therefore need 12 pages.

>   - While GFP_ATOMIC is required in this flow due to the fact it's being
> called from sleepless context, I do believe this is mostly a remnant -
> it's possible that by slightly changing the locking scheme we can have
> the configuration done from a sleepable context and simply switch to
> GFP_KERNEL instead.
>

Ok, even if it's GFP_KERNEL, I think it's still undesirable to do large
page-order allocations (unless of course it's converted to a single page,
but I'm not sure that makes sense, as mentioned).

> Regarding the patch itself, only comment I have:
>> +			elem_group = (struct bnx2x_mcast_elem_group *)
>> +				     elem_group->mcast_group_link.next;
> Let's use list_next_entry() instead.
>
>

Yes, agreed.

I think it would be easy to add a check to bnx2x_set_rx_mode_inner() to
enforce some maximum number of elements (perhaps 2,048 based on the
above math) for the !CHIP_IS_E1() case on top of what I already posted.
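A rough sketch of what I mean - the limit value and the helper name here
are purely illustrative, not part of what I posted:

	/* Illustrative only: cap the number of multicast filters we try to
	 * configure and fall back to accepting all multicast beyond that.
	 * The cap would apply to the !CHIP_IS_E1() case only, per the above.
	 */
	#define BNX2X_MAX_MC_FILTERS	2048	/* ~256 * lg(256) */

	static bool bnx2x_mc_over_limit(struct bnx2x *bp)
	{
		if (CHIP_IS_E1(bp))
			return false;

		return netdev_mc_count(bp->dev) > BNX2X_MAX_MC_FILTERS;
	}

	/* ... and in bnx2x_set_rx_mode_inner(), select BNX2X_RX_MODE_ALLMULTI
	 * instead of programming individual filters when
	 * bnx2x_mc_over_limit(bp) returns true.
	 */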

Thanks,

-Jason
Mintz, Yuval Sept. 20, 2016, 7:41 a.m. UTC | #3
> >> Currently, we can have high order page allocations that specify
> >> GFP_ATOMIC when configuring multicast MAC address filters.
> >>
> >> For example, we have seen order 2 page allocation failures with
> >> ~500 multicast addresses configured.
> >>
> >> Convert the allocation for the pending list to be done in PAGE_SIZE
> >> increments.
> >>
> >> Signed-off-by: Jason Baron <jbaron@akamai.com>
> >
> > While I appreciate the effort, I wonder whether it's worth it:
> >
> > - The hardware [even in its newer generation] provides an approximate
> > based classification [I.e., hashed] with 256 bins.
> > When configuring 500 multicast addresses, one can argue the difference
> > between multicast-promisc mode and actual configuration is
> > insignificant.
> 
> With 256 bins, I think it takes close to: 256*lg(256) or 2,048 multicast addresses
> to expect to have all bins have at least one hash, assuming a uniform distribution
> of the hashes.
> 
> > Perhaps the easier-to-maintain alternative would simply be to
> > determine the maximal number of multicast addresses that can be
> > configured using a single PAGE, and if in need of more than that
> > simply move into multicast-promisc.
> >
> 
> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per page on x86. So
> if we want to fit 2,048 elements, we need 12 pages.

That's not exactly what I mean - let's assume you'd have problems
allocating more than a PAGE. According to your calculation, that
means you're already using more than 170 multicast addresses.
I didn't bother trying to solve the combinatorics question of how
many bins you'd use on average for 170 filters given there are only
256 bins, but that would be a significant portion.
The question I raised was whether it actually makes a difference
under such circumstances whether the device would actually filter
those multicast addresses or be completely multicast promiscuous.
e.g., whether it's significant to be filtering out multicast ingress
traffic when you're already allowing 1/2 of all random multicast
packets to be classified for the interface.

But again, given that you've actually taken the trouble of solving
this, I guess this question is mostly theoretical. We HAVE a better
solution now [a.k.a., yours ;-) ]

> I think it would be easy to add a check to bnx2x_set_rx_mode_inner() to enforce
> some maximum number of elements (perhaps 2,048 based on the above math)
> for the !CHIP_IS_E1() case on top of what I already posted.

The benefit should have been that we could have dropped your
solution by limiting the driver to use at most the number of filters
that would fit in a single page.
I don't think it would serve any purpose to take your change and
in addition choose some combinatorics-based upper limit.
David Laight Sept. 20, 2016, 11:30 a.m. UTC | #4
From: Jason Baron
> Sent: 19 September 2016 19:34
...
> 
> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per
> page on x86. So if we want to fit 2,048 elements, we need 12 pages.

If you only need to save the mcast addresses you could use a 'heap'
that requires no overhead per entry and gives you O(log) lookup.
At 6 bytes per entry, that's 682 entries in a 4k page.
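Rough sketch of that layout - a sorted flat array searched with a binary
chop, which is one way to read the 'heap' idea; the names are illustrative:

	/* Store raw 6-byte MACs back to back in a single page and
	 * binary-search them, instead of keeping a list_head per entry.
	 */
	#define MACS_PER_PAGE	((PAGE_SIZE - sizeof(unsigned int)) / ETH_ALEN)

	struct packed_mac_table {
		unsigned int count;			/* entries in use, kept sorted */
		u8 macs[MACS_PER_PAGE][ETH_ALEN];	/* 682 on a 4k page */
	};

	static bool packed_mac_find(const struct packed_mac_table *t,
				    const u8 *mac)
	{
		unsigned int lo = 0, hi = t->count;

		while (lo < hi) {			/* O(log n) lookup */
			unsigned int mid = (lo + hi) / 2;
			int cmp = memcmp(t->macs[mid], mac, ETH_ALEN);

			if (!cmp)
				return true;
			if (cmp < 0)
				lo = mid + 1;
			else
				hi = mid;
		}
		return false;
	}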

	David
Jason Baron Sept. 20, 2016, 2:52 p.m. UTC | #5
On 09/20/2016 03:41 AM, Mintz, Yuval wrote:
>>>> Currently, we can have high order page allocations that specify
>>>> GFP_ATOMIC when configuring multicast MAC address filters.
>>>>
>>>> For example, we have seen order 2 page allocation failures with
>>>> ~500 multicast addresses configured.
>>>>
>>>> Convert the allocation for the pending list to be done in PAGE_SIZE
>>>> increments.
>>>>
>>>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>>>
>>> While I appreciate the effort, I wonder whether it's worth it:
>>>
>>> - The hardware [even in its newer generation] provides an approximate
>>> based classification [I.e., hashed] with 256 bins.
>>> When configuring 500 multicast addresses, one can argue the difference
>>> between multicast-promisc mode and actual configuration is
>>> insignificant.
>>
>> With 256 bins, I think it takes close to: 256*lg(256) or 2,048 multicast addresses
>> to expect to have all bins have at least one hash, assuming a uniform distribution
>> of the hashes.
>>
>>> Perhaps the easier-to-maintain alternative would simply be to
>>> determine the maximal number of multicast addresses that can be
>>> configured using a single PAGE, and if in need of more than that
>>> simply move into multicast-promisc.
>>>
>>
>> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per page on x86. So
>> if we want to fit 2,048 elements, we need 12 pages.
>
> That's not exactly what I mean - let's assume you'd have problems
> allocating more than a PAGE. According to your calculation, that
> means you're already using more than 170 multicast addresses.
> I didn't bother trying to solve the combinatorics question of how
> many bins you'd use on average for 170 filters given there are only
> 256 bins, but that would be a significant portion.

For 170 filters, I get an average of 124 bins in use out
of 256 possible bins.
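That's just the expected-occupancy formula for m = 170 uniform hashes
into n = 256 bins:

	E[\text{bins in use}] = n \left(1 - (1 - 1/n)^m\right)
	                      = 256 \left(1 - (255/256)^{170}\right)
	                      \approx 256 \cdot (1 - e^{-170/256}) \approx 124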

> The question I rose was whether it actually makes a difference
> under such circumstances whether the device would actually filter
> those multicast addresses or be completely multicast promiscuous.
> e.g., whether it's significant to be filtering out multicast ingress
> traffic when you're already allowing 1/2 of all random multicast
> packets to be classified for the interface.
>

Agreed, I think this is the more interesting question here. I thought
that we would want to make sure we are using most of the bins before
falling back to accepting all multicast ingress. The reason being that
even if it's more expensive for the NIC to do the filtering than to run
in multicast-promisc mode, that cost would be more than made up for by
not having to drop the unwanted traffic higher up the stack. So I think
if we can determine the percentage of the bins that we want to use, we
can then work back to the average number of filters required to get
there. As I said, I thought we would want to make sure we filled
basically all the bins (with a high probability, that is) before falling
back to multicast-promisc, and so I threw out 2,048.

Thanks,

-Jason
Mintz, Yuval Sept. 20, 2016, 3 p.m. UTC | #6
> > The question I rose was whether it actually makes a difference under
> > such circumstances whether the device would actually filter those
> > multicast addresses or be completely multicast promiscuous.
> > e.g., whether it's significant to be filtering out multicast ingress
> > traffic when you're already allowing 1/2 of all random multicast
> > packets to be classified for the interface.
> >
> 
> Agreed, I think this is the more interesting question here. I thought that we
> would want to make sure we are using most of the bins before falling back to
> multicast ingress. The reason being that even if its more expensive for the NIC to
> do the filtering than the multicast mode, it would be more than made up for by
> having to drop the traffic higher up the stack. So I think if we can determine the
> percent of the bins that we want to use, we can then back into the average
> number of filters required to get there. As I said, I thought we would want to
> make sure we filled basically all the bins (with a high probability that is) before
> falling back to multicast, and so I threw out 2,048.

AFAIK configuring multiple filters doesn't incur any performance penalty
from the adapter side.
And I agree that from an 'offloading' perspective it's probably better to
filter in HW even if the gain is negligible.
So for the upper limit - there's not much of a reason for it; the only
gain would be to prevent the driver from temporarily allocating lots and
lots of memory for an unnecessary configuration.
Jason Baron Sept. 20, 2016, 3:19 p.m. UTC | #7
On 09/20/2016 11:00 AM, Mintz, Yuval wrote:
>>> The question I rose was whether it actually makes a difference under
>>> such circumstances whether the device would actually filter those
>>> multicast addresses or be completely multicast promiscuous.
>>> e.g., whether it's significant to be filtering out multicast ingress
>>> traffic when you're already allowing 1/2 of all random multicast
>>> packets to be classified for the interface.
>>>
>>
>> Agreed, I think this is the more interesting question here. I thought that we
>> would want to make sure we are using most of the bins before falling back to
>> multicast ingress. The reason being that even if its more expensive for the NIC to
>> do the filtering than the multicast mode, it would be more than made up for by
>> having to drop the traffic higher up the stack. So I think if we can determine the
>> percent of the bins that we want to use, we can then back into the average
>> number of filters required to get there. As I said, I thought we would want to
>> make sure we filled basically all the bins (with a high probability that is) before
>> falling back to multicast, and so I threw out 2,048.
>
> AFAIK configuring multiple filters doesn't incur any performance penalty
> from the adapter side.
> And I agree that from 'offloading' perspective it's probably better to
> filter in HW even if the gain is negligible.
> So for the upper limit - there's not much of a reason to it; The only gain
> would be to prevent driver from allocating lots-and-lots of memory
> temporarily for an unnecessary configuration.
>

Ok. We already have an upper limit to an extent with
/proc/sys/net/ipv4/igmp_max_memberships. And as posted I didn't include
one because of the higher-level limits already in place.

Thanks,

-Jason
Jason Baron Sept. 20, 2016, 6:46 p.m. UTC | #8
On 09/20/2016 07:30 AM, David Laight wrote:
> From: Jason Baron
>> Sent: 19 September 2016 19:34
> ...
>>
>> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per
>> page on x86. So if we want to fit 2,048 elements, we need 12 pages.
>
> If you only need to save the mcast addresses you could use a 'heap'
> that requires no overhead per entry and gives you O(log) lookup.
> 6 bytes per entry is 682 in a 4k page.
>
> 	David
>

Indeed, that would save space here.

Based on Yuval's comments it sounds as though he agrees that it makes
sense to go beyond a page (even if we get 682 per page as you suggest)
when configuring these MAC filters. So we would then have to allocate
and manage the page pointers. Currently, there is a list_head per entry
to manage the MACs as a linked list. The patch I proposed continues to
use that same data structure, so it will not add to the memory
footprint; it only breaks that footprint up into PAGE_SIZE chunks.

So I think the change you suggest can be viewed as an additional
enhancement here. Also note that the memory allocations here are
short-lived: they only exist until the NIC is re-configured.

Thanks,

-Jason

Patch

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index d468380c2a23..6db8dd252d7c 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2606,8 +2606,23 @@  struct bnx2x_mcast_bin_elem {
 	int type; /* BNX2X_MCAST_CMD_SET_{ADD, DEL} */
 };
 
+union bnx2x_mcast_elem {
+	struct bnx2x_mcast_bin_elem bin_elem;
+	struct bnx2x_mcast_mac_elem mac_elem;
+};
+
+struct bnx2x_mcast_elem_group {
+	struct list_head mcast_group_link;
+	union bnx2x_mcast_elem mcast_elems[];
+};
+
+#define MCAST_MAC_ELEMS_PER_PG \
+	((PAGE_SIZE - sizeof(struct bnx2x_mcast_elem_group)) / \
+	sizeof(union bnx2x_mcast_elem))
+
 struct bnx2x_pending_mcast_cmd {
 	struct list_head link;
+	struct list_head group_head;
 	int type; /* BNX2X_MCAST_CMD_X */
 	union {
 		struct list_head macs_head;
@@ -2638,16 +2653,30 @@  static int bnx2x_mcast_wait(struct bnx2x *bp,
 	return 0;
 }
 
+static void bnx2x_free_groups(struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_elem_group *current_mcast_group;
+
+	while (!list_empty(mcast_group_list)) {
+		current_mcast_group = list_first_entry(mcast_group_list,
+				      struct bnx2x_mcast_elem_group,
+				      mcast_group_link);
+		list_del(&current_mcast_group->mcast_group_link);
+		kfree(current_mcast_group);
+	}
+}
+
 static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 				   struct bnx2x_mcast_obj *o,
 				   struct bnx2x_mcast_ramrod_params *p,
 				   enum bnx2x_mcast_cmd cmd)
 {
-	int total_sz;
 	struct bnx2x_pending_mcast_cmd *new_cmd;
-	struct bnx2x_mcast_mac_elem *cur_mac = NULL;
 	struct bnx2x_mcast_list_elem *pos;
-	int macs_list_len = 0, macs_list_len_size;
+	struct bnx2x_mcast_elem_group *elem_group;
+	struct bnx2x_mcast_mac_elem *mac_elem;
+	int i = 0, offset = 0, macs_list_len = 0;
+	int total_elems, alloc_size;
 
 	/* When adding MACs we'll need to store their values */
 	if (cmd == BNX2X_MCAST_CMD_ADD || cmd == BNX2X_MCAST_CMD_SET)
@@ -2657,50 +2686,68 @@  static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 	if (!p->mcast_list_len)
 		return 0;
 
-	/* For a set command, we need to allocate sufficient memory for all
-	 * the bins, since we can't analyze at this point how much memory would
-	 * be required.
-	 */
-	macs_list_len_size = macs_list_len *
-			     sizeof(struct bnx2x_mcast_mac_elem);
-	if (cmd == BNX2X_MCAST_CMD_SET) {
-		int bin_size = BNX2X_MCAST_BINS_NUM *
-			       sizeof(struct bnx2x_mcast_bin_elem);
-
-		if (bin_size > macs_list_len_size)
-			macs_list_len_size = bin_size;
-	}
-	total_sz = sizeof(*new_cmd) + macs_list_len_size;
-
 	/* Add mcast is called under spin_lock, thus calling with GFP_ATOMIC */
-	new_cmd = kzalloc(total_sz, GFP_ATOMIC);
-
+	new_cmd = kzalloc(sizeof(*new_cmd), GFP_ATOMIC);
 	if (!new_cmd)
 		return -ENOMEM;
 
-	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
-	   cmd, macs_list_len);
-
 	INIT_LIST_HEAD(&new_cmd->data.macs_head);
-
+	INIT_LIST_HEAD(&new_cmd->group_head);
 	new_cmd->type = cmd;
 	new_cmd->done = false;
 
+	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
+	   cmd, macs_list_len);
+
 	switch (cmd) {
 	case BNX2X_MCAST_CMD_ADD:
 	case BNX2X_MCAST_CMD_SET:
-		cur_mac = (struct bnx2x_mcast_mac_elem *)
-			  ((u8 *)new_cmd + sizeof(*new_cmd));
-
-		/* Push the MACs of the current command into the pending command
-		 * MACs list: FIFO
+		/* For a set command, we need to allocate sufficient memory for
+		 * all the bins, since we can't analyze at this point how much
+		 * memory would be required.
 		 */
+		total_elems = macs_list_len;
+		if (cmd == BNX2X_MCAST_CMD_SET) {
+			if (total_elems < BNX2X_MCAST_BINS_NUM)
+				total_elems = BNX2X_MCAST_BINS_NUM;
+		}
+		i = total_elems;
+		while (i) {
+			if (i < MCAST_MAC_ELEMS_PER_PG) {
+				alloc_size = sizeof(struct list_head) +
+					     i * sizeof(union bnx2x_mcast_elem);
+				i = 0;
+			} else {
+				alloc_size = PAGE_SIZE;
+				i -= MCAST_MAC_ELEMS_PER_PG;
+			}
+			elem_group = kzalloc(alloc_size, GFP_ATOMIC);
+			if (!elem_group) {
+				bnx2x_free_groups(&new_cmd->group_head);
+				kfree(new_cmd);
+				return -ENOMEM;
+			}
+			list_add_tail(&elem_group->mcast_group_link,
+				      &new_cmd->group_head);
+		}
+		elem_group = list_first_entry(&new_cmd->group_head,
+					      struct bnx2x_mcast_elem_group,
+					      mcast_group_link);
 		list_for_each_entry(pos, &p->mcast_list, link) {
-			memcpy(cur_mac->mac, pos->mac, ETH_ALEN);
-			list_add_tail(&cur_mac->link, &new_cmd->data.macs_head);
-			cur_mac++;
+			mac_elem = &elem_group->mcast_elems[offset].mac_elem;
+			memcpy(mac_elem->mac, pos->mac, ETH_ALEN);
+			/* Push the MACs of the current command into the pending
+			 * command MACs list: FIFO
+			 */
+			list_add_tail(&mac_elem->link,
+				      &new_cmd->data.macs_head);
+			offset++;
+			if (offset == MCAST_MAC_ELEMS_PER_PG) {
+				offset = 0;
+				elem_group = (struct bnx2x_mcast_elem_group *)
+					     elem_group->mcast_group_link.next;
+			}
 		}
-
 		break;
 
 	case BNX2X_MCAST_CMD_DEL:
@@ -2978,7 +3025,8 @@  bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 	u64 cur[BNX2X_MCAST_VEC_SZ], req[BNX2X_MCAST_VEC_SZ];
 	struct bnx2x_mcast_mac_elem *pmac_pos, *pmac_pos_n;
 	struct bnx2x_mcast_bin_elem *p_item;
-	int i, cnt = 0, mac_cnt = 0;
+	struct bnx2x_mcast_elem_group *elem_group;
+	int cnt = 0, mac_cnt = 0, offset = 0, i;
 
 	memset(req, 0, sizeof(u64) * BNX2X_MCAST_VEC_SZ);
 	memcpy(cur, o->registry.aprox_match.vec,
@@ -3001,9 +3049,10 @@  bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 	 * a list that will be used to configure bins.
 	 */
 	cmd_pos->set_convert = true;
-	p_item = (struct bnx2x_mcast_bin_elem *)(cmd_pos + 1);
 	INIT_LIST_HEAD(&cmd_pos->data.macs_head);
-
+	elem_group = list_first_entry(&cmd_pos->group_head,
+				      struct bnx2x_mcast_elem_group,
+				      mcast_group_link);
 	for (i = 0; i < BNX2X_MCAST_BINS_NUM; i++) {
 		bool b_current = !!BIT_VEC64_TEST_BIT(cur, i);
 		bool b_required = !!BIT_VEC64_TEST_BIT(req, i);
@@ -3011,12 +3060,18 @@  bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 		if (b_current == b_required)
 			continue;
 
+		p_item = &elem_group->mcast_elems[offset].bin_elem;
 		p_item->bin = i;
 		p_item->type = b_required ? BNX2X_MCAST_CMD_SET_ADD
 					  : BNX2X_MCAST_CMD_SET_DEL;
 		list_add_tail(&p_item->link , &cmd_pos->data.macs_head);
-		p_item++;
 		cnt++;
+		offset++;
+		if (offset == MCAST_MAC_ELEMS_PER_PG) {
+			offset = 0;
+			elem_group = (struct bnx2x_mcast_elem_group *)
+				     elem_group->mcast_group_link.next;
+		}
 	}
 
 	/* We now definitely know how many commands are hiding here.
@@ -3103,6 +3158,7 @@  static inline int bnx2x_mcast_handle_pending_cmds_e2(struct bnx2x *bp,
 		 */
 		if (cmd_pos->done) {
 			list_del(&cmd_pos->link);
+			bnx2x_free_groups(&cmd_pos->group_head);
 			kfree(cmd_pos);
 		}
 
@@ -3741,6 +3797,7 @@  static inline int bnx2x_mcast_handle_pending_cmds_e1(
 	}
 
 	list_del(&cmd_pos->link);
+	bnx2x_free_groups(&cmd_pos->group_head);
 	kfree(cmd_pos);
 
 	return cnt;