Message ID: d46b4fa36c32affff20fcc92148e33ee6176f899.1474060371.git.jbaron@akamai.com
State: Changes Requested, archived
Delegated to: David Miller
> Currently, we can have high order page allocations that specify
> GFP_ATOMIC when configuring multicast MAC address filters.
>
> For example, we have seen order 2 page allocation failures with
> ~500 multicast addresses configured.
>
> Convert the allocation for the pending list to be done in PAGE_SIZE
> increments.
>
> Signed-off-by: Jason Baron <jbaron@akamai.com>

While I appreciate the effort, I wonder whether it's worth it:

- The hardware [even in its newer generation] provides an
  approximation-based classification [i.e., hashed] with 256 bins.
  When configuring 500 multicast addresses, one can argue the
  difference between multicast-promisc mode and actual configuration
  is insignificant.
  Perhaps the easier-to-maintain alternative would simply be to
  determine the maximal number of multicast addresses that can be
  configured using a single PAGE, and if in need of more than that
  simply move into multicast-promisc.

- While GFP_ATOMIC is required in this flow due to the fact it's being
  called from sleepless context, I do believe this is mostly a remnant -
  it's possible that by slightly changing the locking scheme we can have
  the configuration done from sleepable context and simply switch to
  GFP_KERNEL instead.

Regarding the patch itself, the only comment I have:

> +			elem_group = (struct bnx2x_mcast_elem_group *)
> +				     elem_group->mcast_group_link.next;

Let's use list_next_entry() instead.
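For readers unfamiliar with the kernel list helper Yuval mentions, here is a compact userspace re-creation of the relevant `<linux/list.h>` macros showing why `list_next_entry()` is preferable to the open-coded cast in the patch. The struct here is a cut-down stand-in, not the driver's real definition:

```c
#include <stddef.h>

/* Userspace re-creations of the <linux/list.h> helpers under discussion */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)
#define list_next_entry(pos, member) \
	list_entry((pos)->member.next, __typeof__(*(pos)), member)

/* Cut-down stand-in for the patch's group struct */
struct bnx2x_mcast_elem_group {
	struct list_head mcast_group_link;
};

/* The open-coded cast from the patch only works because
 * mcast_group_link happens to be the first member of the struct. */
static struct bnx2x_mcast_elem_group *
next_group_cast(struct bnx2x_mcast_elem_group *g)
{
	return (struct bnx2x_mcast_elem_group *)g->mcast_group_link.next;
}

/* list_next_entry() is correct for any member position and states
 * the intent directly. */
static struct bnx2x_mcast_elem_group *
next_group(struct bnx2x_mcast_elem_group *g)
{
	return list_next_entry(g, mcast_group_link);
}
```

Both forms return the same pointer here, but only the `list_next_entry()` form survives a reordering of the struct members.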
On 09/18/2016 06:25 AM, Mintz, Yuval wrote:
>> Currently, we can have high order page allocations that specify
>> GFP_ATOMIC when configuring multicast MAC address filters.
>>
>> For example, we have seen order 2 page allocation failures with
>> ~500 multicast addresses configured.
>>
>> Convert the allocation for the pending list to be done in PAGE_SIZE
>> increments.
>>
>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>
> While I appreciate the effort, I wonder whether it's worth it:
>
> - The hardware [even in its newer generation] provides an
>   approximation-based classification [i.e., hashed] with 256 bins.
>   When configuring 500 multicast addresses, one can argue the
>   difference between multicast-promisc mode and actual configuration
>   is insignificant.

With 256 bins, I think it takes close to 256*lg(256), or 2,048, multicast
addresses to expect every bin to hold at least one hash, assuming a
uniform distribution of the hashes.

> Perhaps the easier-to-maintain alternative would simply be to
> determine the maximal number of multicast addresses that can be
> configured using a single PAGE, and if in need of more than that
> simply move into multicast-promisc.

sizeof(struct bnx2x_mcast_list_elem) = 24, so there are 170 per page on
x86. So if we want to fit 2,048 elements, we need about 12 pages.

> - While GFP_ATOMIC is required in this flow due to the fact it's being
>   called from sleepless context, I do believe this is mostly a remnant -
>   it's possible that by slightly changing the locking scheme we can have
>   the configuration done from sleepable context and simply switch to
>   GFP_KERNEL instead.

Ok, even if it's GFP_KERNEL, I think it's still undesirable to do large
page order allocations (unless of course it's converted to a single page,
but I'm not sure that makes sense, as mentioned).

> Regarding the patch itself, only comment I have:
>> +			elem_group = (struct bnx2x_mcast_elem_group *)
>> +				     elem_group->mcast_group_link.next;
> Let's use list_next_entry() instead.
Yes, agreed.

I think it would be easy to add a check to bnx2x_set_rx_mode_inner() to
enforce some maximum number of elements (perhaps 2,048 based on the
above math) for the !CHIP_IS_E1() case, on top of what I already posted.

Thanks,

-Jason
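The 2,048 figure above uses 256·lg(256); the exact coupon-collector expectation is 256·H(256) ≈ 1,568, so 2,048 is a comfortably rounded-up version of the same estimate. A quick userspace sketch of the computation (the function name is mine, not the driver's):

```c
/* Coupon-collector expectation: the average number of uniformly hashed
 * multicast addresses needed before every one of `bins` hash bins
 * holds at least one address:  E = bins * (1 + 1/2 + ... + 1/bins). */
static double expected_to_fill_all_bins(int bins)
{
	double e = 0.0;
	int k;

	for (k = 1; k <= bins; k++)
		e += (double)bins / k;
	return e;
}
```

For 256 bins this evaluates to roughly 1,568, which supports treating on the order of 2,048 configured addresses as the point where the hash table is effectively saturated.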
> >> Currently, we can have high order page allocations that specify
> >> GFP_ATOMIC when configuring multicast MAC address filters.
> >>
> >> For example, we have seen order 2 page allocation failures with
> >> ~500 multicast addresses configured.
> >>
> >> Convert the allocation for the pending list to be done in PAGE_SIZE
> >> increments.
> >>
> >> Signed-off-by: Jason Baron <jbaron@akamai.com>
> >
> > While I appreciate the effort, I wonder whether it's worth it:
> >
> > - The hardware [even in its newer generation] provides an
> >   approximation-based classification [i.e., hashed] with 256 bins.
> >   When configuring 500 multicast addresses, one can argue the
> >   difference between multicast-promisc mode and actual configuration
> >   is insignificant.
>
> With 256 bins, I think it takes close to 256*lg(256), or 2,048,
> multicast addresses to expect every bin to hold at least one hash,
> assuming a uniform distribution of the hashes.
>
> > Perhaps the easier-to-maintain alternative would simply be to
> > determine the maximal number of multicast addresses that can be
> > configured using a single PAGE, and if in need of more than that
> > simply move into multicast-promisc.
>
> sizeof(struct bnx2x_mcast_list_elem) = 24, so there are 170 per page
> on x86. So if we want to fit 2,048 elements, we need about 12 pages.

That's not exactly what I mean - let's assume you'd have problems
allocating more than a PAGE. According to your calculation, that means
you're already using more than 170 multicast addresses.
I didn't bother trying to solve the combinatorics question of how many
bins you'd use on average for 170 filters given there are only 256 bins,
but that would be a significant portion.
The question I raised was whether it actually makes a difference under
such circumstances whether the device would actually filter those
multicast addresses or be completely multicast promiscuous.
e.g., whether it's significant to be filtering out multicast ingress
traffic when you're already allowing 1/2 of all random multicast packets
to be classified for the interface.

But again, given that you've actually taken the trouble of solving this,
I guess this question is mostly theoretical. We HAVE a better solution
now [a.k.a. yours ;-)].

> I think it would be easy to add a check to bnx2x_set_rx_mode_inner() to
> enforce some maximum number of elements (perhaps 2,048 based on the
> above math) for the !CHIP_IS_E1() case on top of what I already posted.

The benefit would have been that we could have dropped your solution by
limiting the driver to use at most the number of filters that would fit
in a single page. I don't think it would serve any purpose to take your
change and in addition choose some combinatorics-based upper limit.
From: Jason Baron
> Sent: 19 September 2016 19:34
...
> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per
> page on x86. So if we want to fit 2,048 elements, we need about 12
> pages.

If you only need to save the mcast addresses you could use a 'heap'
that requires no overhead per entry and gives you O(log) lookup.
6 bytes per entry is 682 in a 4k page.

	David
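To make David's arithmetic concrete: a raw 6-byte entry packs 4096 / 6 = 682 addresses into a 4k page, versus 170 of the 24-byte list elements. Any sorted packed layout gets the O(log n) lookup he mentions; a minimal userspace sketch using a sorted array plus `bsearch()` (all names are mine, not the driver's):

```c
#include <stdlib.h>
#include <string.h>

#define PG_SIZE 4096u
#define MAC_LEN 6

/* Raw 6-byte entries: 4096 / 6 = 682 addresses per 4k page,
 * versus 170 with the 24-byte bnx2x_mcast_list_elem. */
#define MACS_PER_PG (PG_SIZE / MAC_LEN)

static int mac_cmp(const void *a, const void *b)
{
	return memcmp(a, b, MAC_LEN);
}

/* Sort the packed table once after filling it... */
static void mac_table_sort(unsigned char (*tbl)[MAC_LEN], size_t n)
{
	qsort(tbl, n, MAC_LEN, mac_cmp);
}

/* ...then each lookup is O(log n) with zero per-entry overhead. */
static int mac_table_contains(unsigned char (*tbl)[MAC_LEN], size_t n,
			      const unsigned char *mac)
{
	return bsearch(mac, tbl, n, MAC_LEN, mac_cmp) != NULL;
}
```

This is only a sketch of the space/lookup trade-off being discussed; the driver's pending list also carries per-element type information that a bare 6-byte table would not hold.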
On 09/20/2016 03:41 AM, Mintz, Yuval wrote:
>>>> Currently, we can have high order page allocations that specify
>>>> GFP_ATOMIC when configuring multicast MAC address filters.
>>>>
>>>> For example, we have seen order 2 page allocation failures with
>>>> ~500 multicast addresses configured.
>>>>
>>>> Convert the allocation for the pending list to be done in PAGE_SIZE
>>>> increments.
>>>>
>>>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>>>
>>> While I appreciate the effort, I wonder whether it's worth it:
>>>
>>> - The hardware [even in its newer generation] provides an
>>>   approximation-based classification [i.e., hashed] with 256 bins.
>>>   When configuring 500 multicast addresses, one can argue the
>>>   difference between multicast-promisc mode and actual configuration
>>>   is insignificant.
>>
>> With 256 bins, I think it takes close to 256*lg(256), or 2,048,
>> multicast addresses to expect every bin to hold at least one hash,
>> assuming a uniform distribution of the hashes.
>>
>>> Perhaps the easier-to-maintain alternative would simply be to
>>> determine the maximal number of multicast addresses that can be
>>> configured using a single PAGE, and if in need of more than that
>>> simply move into multicast-promisc.
>>
>> sizeof(struct bnx2x_mcast_list_elem) = 24, so there are 170 per page
>> on x86. So if we want to fit 2,048 elements, we need about 12 pages.
>
> That's not exactly what I mean - let's assume you'd have problems
> allocating more than a PAGE. According to your calculation, that
> means you're already using more than 170 multicast addresses.
> I didn't bother trying to solve the combinatorics question of how
> many bins you'd use on average for 170 filters given there are only
> 256 bins, but that would be a significant portion.

For 170 filters, I get an average of 124 bins in use out of 256 possible
bins.
> The question I raised was whether it actually makes a difference
> under such circumstances whether the device would actually filter
> those multicast addresses or be completely multicast promiscuous.
> e.g., whether it's significant to be filtering out multicast ingress
> traffic when you're already allowing 1/2 of all random multicast
> packets to be classified for the interface.
>

Agreed, I think this is the more interesting question here. I thought
that we would want to make sure we are using most of the bins before
falling back to multicast-promisc. The reason being that even if it's
more expensive for the NIC to do the filtering than the promiscuous
mode, it would be more than made up for by not having to drop the
traffic higher up the stack. So I think if we can determine the percent
of the bins that we want to use, we can then back into the average
number of filters required to get there. As I said, I thought we would
want to make sure we filled basically all the bins (with a high
probability, that is) before falling back to multicast-promisc, and so
I threw out 2,048.

Thanks,

-Jason
> > The question I raised was whether it actually makes a difference
> > under such circumstances whether the device would actually filter
> > those multicast addresses or be completely multicast promiscuous.
> > e.g., whether it's significant to be filtering out multicast ingress
> > traffic when you're already allowing 1/2 of all random multicast
> > packets to be classified for the interface.
> >
>
> Agreed, I think this is the more interesting question here. I thought
> that we would want to make sure we are using most of the bins before
> falling back to multicast-promisc. The reason being that even if it's
> more expensive for the NIC to do the filtering than the promiscuous
> mode, it would be more than made up for by not having to drop the
> traffic higher up the stack. So I think if we can determine the
> percent of the bins that we want to use, we can then back into the
> average number of filters required to get there. As I said, I thought
> we would want to make sure we filled basically all the bins (with a
> high probability, that is) before falling back to multicast-promisc,
> and so I threw out 2,048.

AFAIK configuring multiple filters doesn't incur any performance penalty
from the adapter side.
And I agree that from an 'offloading' perspective it's probably better
to filter in HW even if the gain is negligible.
So for the upper limit - there's not much of a reason for it; the only
gain would be to prevent the driver from allocating lots-and-lots of
memory temporarily for an unnecessary configuration.
On 09/20/2016 11:00 AM, Mintz, Yuval wrote:
>>> The question I raised was whether it actually makes a difference
>>> under such circumstances whether the device would actually filter
>>> those multicast addresses or be completely multicast promiscuous.
>>> e.g., whether it's significant to be filtering out multicast ingress
>>> traffic when you're already allowing 1/2 of all random multicast
>>> packets to be classified for the interface.
>>>
>>
>> Agreed, I think this is the more interesting question here. I thought
>> that we would want to make sure we are using most of the bins before
>> falling back to multicast-promisc. The reason being that even if it's
>> more expensive for the NIC to do the filtering than the promiscuous
>> mode, it would be more than made up for by not having to drop the
>> traffic higher up the stack. So I think if we can determine the
>> percent of the bins that we want to use, we can then back into the
>> average number of filters required to get there. As I said, I thought
>> we would want to make sure we filled basically all the bins (with a
>> high probability, that is) before falling back to multicast-promisc,
>> and so I threw out 2,048.
>
> AFAIK configuring multiple filters doesn't incur any performance
> penalty from the adapter side.
> And I agree that from an 'offloading' perspective it's probably better
> to filter in HW even if the gain is negligible.
> So for the upper limit - there's not much of a reason for it; the only
> gain would be to prevent the driver from allocating lots-and-lots of
> memory temporarily for an unnecessary configuration.
>

Ok. We already have an upper limit to an extent with
/proc/sys/net/ipv4/igmp_max_memberships. And as posted I didn't include
one because of the higher-level limits already in place.

Thanks,

-Jason
On 09/20/2016 07:30 AM, David Laight wrote:
> From: Jason Baron
>> Sent: 19 September 2016 19:34
> ...
>> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per
>> page on x86. So if we want to fit 2,048 elements, we need about 12
>> pages.
>
> If you only need to save the mcast addresses you could use a 'heap'
> that requires no overhead per entry and gives you O(log) lookup.
> 6 bytes per entry is 682 in a 4k page.
>
> David

Indeed, that would save space here. Based on Yuval's comments it sounds
as though he agrees that it makes sense to go beyond a page (even if we
get 682 per page as you suggest) when configuring these mac filters. So
we would then have to allocate and manage the page pointers regardless.

Currently, there is a list_head per entry to manage the macs as a linked
list. The patch I proposed continues to use that same data structure, so
it will not add to the memory footprint; it only proposes to break that
footprint up into PAGE_SIZE chunks. So I think the change you suggest
can be viewed as an additional enhancement here. Also note that the
memory allocations here are short-lived - they only exist until the NIC
is re-configured.

Thanks,

-Jason
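The PAGE_SIZE-grouping scheme the patch below implements can be summarized in a simplified userspace sketch: fixed-size chunks holding a link plus a flexible array of entries, a short tail allocation, and unwind-on-failure. Plain malloc/calloc stands in for kzalloc(GFP_ATOMIC), and all names only approximate the driver's:

```c
#include <stdlib.h>

#define PG_SIZE 4096u

struct mac_elem_group {
	struct mac_elem_group *next;	/* stands in for mcast_group_link */
	unsigned char macs[][6];	/* flexible array of MAC entries */
};

#define MACS_PER_GROUP ((PG_SIZE - sizeof(struct mac_elem_group)) / 6)

/* Number of groups needed: full pages plus one short tail allocation */
static size_t groups_needed(size_t n)
{
	return (n + MACS_PER_GROUP - 1) / MACS_PER_GROUP;
}

/* Allocate the chain, mirroring the while (i) loop in the patch */
static struct mac_elem_group *alloc_groups(size_t n)
{
	struct mac_elem_group *head = NULL, **tail = &head;

	while (n) {
		size_t in_this = n < MACS_PER_GROUP ? n : MACS_PER_GROUP;
		size_t sz = sizeof(struct mac_elem_group) + in_this * 6;
		struct mac_elem_group *g = calloc(1, sz);

		if (!g) {	/* unwind, like bnx2x_free_groups() */
			while (head) {
				struct mac_elem_group *t = head->next;
				free(head);
				head = t;
			}
			return NULL;
		}
		*tail = g;
		tail = &g->next;
		n -= in_this;
	}
	return head;
}

static void free_groups(struct mac_elem_group *head)
{
	while (head) {
		struct mac_elem_group *t = head->next;
		free(head);
		head = t;
	}
}
```

Each allocation is at most one page, so no high-order allocation is ever requested no matter how many filters are configured, which is the whole point of the patch.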
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index d468380c2a23..6db8dd252d7c 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2606,8 +2606,23 @@ struct bnx2x_mcast_bin_elem {
 	int type; /* BNX2X_MCAST_CMD_SET_{ADD, DEL} */
 };
 
+union bnx2x_mcast_elem {
+	struct bnx2x_mcast_bin_elem bin_elem;
+	struct bnx2x_mcast_mac_elem mac_elem;
+};
+
+struct bnx2x_mcast_elem_group {
+	struct list_head mcast_group_link;
+	union bnx2x_mcast_elem mcast_elems[];
+};
+
+#define MCAST_MAC_ELEMS_PER_PG \
+	((PAGE_SIZE - sizeof(struct bnx2x_mcast_elem_group)) / \
+	 sizeof(union bnx2x_mcast_elem))
+
 struct bnx2x_pending_mcast_cmd {
 	struct list_head link;
+	struct list_head group_head;
 	int type; /* BNX2X_MCAST_CMD_X */
 	union {
 		struct list_head macs_head;
@@ -2638,16 +2653,30 @@ static int bnx2x_mcast_wait(struct bnx2x *bp,
 	return 0;
 }
 
+static void bnx2x_free_groups(struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_elem_group *current_mcast_group;
+
+	while (!list_empty(mcast_group_list)) {
+		current_mcast_group = list_first_entry(mcast_group_list,
+				      struct bnx2x_mcast_elem_group,
+				      mcast_group_link);
+		list_del(&current_mcast_group->mcast_group_link);
+		kfree(current_mcast_group);
+	}
+}
+
 static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 				   struct bnx2x_mcast_obj *o,
 				   struct bnx2x_mcast_ramrod_params *p,
 				   enum bnx2x_mcast_cmd cmd)
 {
-	int total_sz;
 	struct bnx2x_pending_mcast_cmd *new_cmd;
-	struct bnx2x_mcast_mac_elem *cur_mac = NULL;
 	struct bnx2x_mcast_list_elem *pos;
-	int macs_list_len = 0, macs_list_len_size;
+	struct bnx2x_mcast_elem_group *elem_group;
+	struct bnx2x_mcast_mac_elem *mac_elem;
+	int i = 0, offset = 0, macs_list_len = 0;
+	int total_elems, alloc_size;
 
 	/* When adding MACs we'll need to store their values */
 	if (cmd == BNX2X_MCAST_CMD_ADD || cmd == BNX2X_MCAST_CMD_SET)
@@ -2657,50 +2686,68 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 	if (!p->mcast_list_len)
 		return 0;
 
-	/* For a set command, we need to allocate sufficient memory for all
-	 * the bins, since we can't analyze at this point how much memory would
-	 * be required.
-	 */
-	macs_list_len_size = macs_list_len *
-		sizeof(struct bnx2x_mcast_mac_elem);
-	if (cmd == BNX2X_MCAST_CMD_SET) {
-		int bin_size = BNX2X_MCAST_BINS_NUM *
-			sizeof(struct bnx2x_mcast_bin_elem);
-
-		if (bin_size > macs_list_len_size)
-			macs_list_len_size = bin_size;
-	}
-
-	total_sz = sizeof(*new_cmd) + macs_list_len_size;
-
 	/* Add mcast is called under spin_lock, thus calling with GFP_ATOMIC */
-	new_cmd = kzalloc(total_sz, GFP_ATOMIC);
-
+	new_cmd = kzalloc(sizeof(*new_cmd), GFP_ATOMIC);
 	if (!new_cmd)
 		return -ENOMEM;
 
-	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
-	   cmd, macs_list_len);
-
 	INIT_LIST_HEAD(&new_cmd->data.macs_head);
-
+	INIT_LIST_HEAD(&new_cmd->group_head);
 	new_cmd->type = cmd;
 	new_cmd->done = false;
 
+	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
+	   cmd, macs_list_len);
+
 	switch (cmd) {
 	case BNX2X_MCAST_CMD_ADD:
 	case BNX2X_MCAST_CMD_SET:
-		cur_mac = (struct bnx2x_mcast_mac_elem *)
-			((u8 *)new_cmd + sizeof(*new_cmd));
-
-		/* Push the MACs of the current command into the pending command
-		 * MACs list: FIFO
+		/* For a set command, we need to allocate sufficient memory for
+		 * all the bins, since we can't analyze at this point how much
+		 * memory would be required.
 		 */
+		total_elems = macs_list_len;
+		if (cmd == BNX2X_MCAST_CMD_SET) {
+			if (total_elems < BNX2X_MCAST_BINS_NUM)
+				total_elems = BNX2X_MCAST_BINS_NUM;
+		}
+		i = total_elems;
+		while (i) {
+			if (i < MCAST_MAC_ELEMS_PER_PG) {
+				alloc_size = sizeof(struct list_head) +
+					i * sizeof(union bnx2x_mcast_elem);
+				i = 0;
+			} else {
+				alloc_size = PAGE_SIZE;
+				i -= MCAST_MAC_ELEMS_PER_PG;
+			}
+			elem_group = kzalloc(alloc_size, GFP_ATOMIC);
+			if (!elem_group) {
+				bnx2x_free_groups(&new_cmd->group_head);
+				kfree(new_cmd);
+				return -ENOMEM;
+			}
+			list_add_tail(&elem_group->mcast_group_link,
+				      &new_cmd->group_head);
+		}
+		elem_group = list_first_entry(&new_cmd->group_head,
+					      struct bnx2x_mcast_elem_group,
+					      mcast_group_link);
 		list_for_each_entry(pos, &p->mcast_list, link) {
-			memcpy(cur_mac->mac, pos->mac, ETH_ALEN);
-			list_add_tail(&cur_mac->link, &new_cmd->data.macs_head);
-			cur_mac++;
+			mac_elem = &elem_group->mcast_elems[offset].mac_elem;
+			memcpy(mac_elem->mac, pos->mac, ETH_ALEN);
+			/* Push the MACs of the current command into the pending
+			 * command MACs list: FIFO
+			 */
+			list_add_tail(&mac_elem->link,
+				      &new_cmd->data.macs_head);
+			offset++;
+			if (offset == MCAST_MAC_ELEMS_PER_PG) {
+				offset = 0;
+				elem_group = (struct bnx2x_mcast_elem_group *)
					elem_group->mcast_group_link.next;
+			}
 		}
-
 		break;
 
 	case BNX2X_MCAST_CMD_DEL:
@@ -2978,7 +3025,8 @@ bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 	u64 cur[BNX2X_MCAST_VEC_SZ], req[BNX2X_MCAST_VEC_SZ];
 	struct bnx2x_mcast_mac_elem *pmac_pos, *pmac_pos_n;
 	struct bnx2x_mcast_bin_elem *p_item;
-	int i, cnt = 0, mac_cnt = 0;
+	struct bnx2x_mcast_elem_group *elem_group;
+	int cnt = 0, mac_cnt = 0, offset = 0, i;
 
 	memset(req, 0, sizeof(u64) * BNX2X_MCAST_VEC_SZ);
 	memcpy(cur, o->registry.aprox_match.vec,
@@ -3001,9 +3049,10 @@ bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 	 * a list that will be used to configure bins.
 	 */
 	cmd_pos->set_convert = true;
-	p_item = (struct bnx2x_mcast_bin_elem *)(cmd_pos + 1);
 	INIT_LIST_HEAD(&cmd_pos->data.macs_head);
-
+	elem_group = list_first_entry(&cmd_pos->group_head,
+				      struct bnx2x_mcast_elem_group,
+				      mcast_group_link);
 	for (i = 0; i < BNX2X_MCAST_BINS_NUM; i++) {
 		bool b_current = !!BIT_VEC64_TEST_BIT(cur, i);
 		bool b_required = !!BIT_VEC64_TEST_BIT(req, i);
@@ -3011,12 +3060,18 @@ bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 		if (b_current == b_required)
 			continue;
 
+		p_item = &elem_group->mcast_elems[offset].bin_elem;
 		p_item->bin = i;
 		p_item->type = b_required ? BNX2X_MCAST_CMD_SET_ADD
 					  : BNX2X_MCAST_CMD_SET_DEL;
 		list_add_tail(&p_item->link , &cmd_pos->data.macs_head);
-		p_item++;
 		cnt++;
+		offset++;
+		if (offset == MCAST_MAC_ELEMS_PER_PG) {
+			offset = 0;
+			elem_group = (struct bnx2x_mcast_elem_group *)
+				elem_group->mcast_group_link.next;
+		}
 	}
 
 	/* We now definitely know how many commands are hiding here.
@@ -3103,6 +3158,7 @@ static inline int bnx2x_mcast_handle_pending_cmds_e2(struct bnx2x *bp,
 		 */
 		if (cmd_pos->done) {
 			list_del(&cmd_pos->link);
+			bnx2x_free_groups(&cmd_pos->group_head);
 			kfree(cmd_pos);
 		}
 
@@ -3741,6 +3797,7 @@ static inline int bnx2x_mcast_handle_pending_cmds_e1(
 		}
 
 		list_del(&cmd_pos->link);
+		bnx2x_free_groups(&cmd_pos->group_head);
 		kfree(cmd_pos);
 
 		return cnt;
Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 131 ++++++++++++++++++-------
 1 file changed, 94 insertions(+), 37 deletions(-)