[net-next,01/15] net/mlx5: DR, Add buddy allocator utilities

Message ID 20200925193809.463047-2-saeed@kernel.org
State Changes Requested
Delegated to: David Miller

Commit Message

Saeed Mahameed Sept. 25, 2020, 7:37 p.m. UTC
From: Yevgeny Kliteynik <kliteyn@nvidia.com>

Add implementation of SW Steering variation of buddy allocator.

The buddy system for ICM memory uses 2 main data structures:
  - Bitmap per order, that keeps the current state of allocated
    blocks for this order
  - Indicator for the number of available blocks per each order

In addition, there is one more hierarchy of searching in the bitmaps
in order to accelerate the search of the next free block, which is done
via a find-first function:
The buddy system spends a lot of its time searching for the next free
space using the function find_first_bit, which scans a big array of long
values in order to find the first set bit. We added one more array of
longs, where each bit indicates whether the corresponding long in the
original array contains any set bit; that way far fewer words need to be
scanned to find the next free area.

For example, for the following bits array of 128 bits where all
bits are zero except for the last bit  :  0000........00001
the corresponding bits-per-long array is:  0001

The search will be done over the bits-per-long array first, and after
the first bit is found, we will use it as a start pointer in the
bigger array (bits array).
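
For illustration only - this sketch is not part of the patch, and the
names "summary", "bits" and "first_set" are made up for the example -
the two-level lookup is roughly equivalent to the following userspace
model:

/* Illustration only: a userspace model of the two-level search described
 * above. "bits" is a per-order free bitmap; "summary" has one bit per
 * 64-bit word of "bits", set when that word contains at least one set bit.
 */
#include <stdint.h>
#include <stddef.h>

#define WORD_BITS 64

static size_t first_set(const uint64_t *summary, const uint64_t *bits,
			size_t nbits)
{
	size_t nwords = (nbits + WORD_BITS - 1) / WORD_BITS;
	size_t w, b;

	/* Level 1: find the first word of "bits" that has any free bit. */
	for (w = 0; w < nwords; w++)
		if (summary[w / WORD_BITS] & (1ULL << (w % WORD_BITS)))
			break;
	if (w == nwords)
		return nbits;		/* nothing free */

	/* Level 2: scan the big bitmap starting only from that word. */
	for (b = w * WORD_BITS; b < nbits; b++)
		if (bits[b / WORD_BITS] & (1ULL << (b % WORD_BITS)))
			return b;

	return nbits;			/* not reached if summary is kept in sync */
}

The patch below implements the same idea with the kernel bitmap helpers
(find_first_bit()/find_next_bit()) and keeps one such summary bitmap
per order.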

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 .../mellanox/mlx5/core/steering/dr_buddy.c    | 255 ++++++++++++++++++
 .../mellanox/mlx5/core/steering/mlx5dr.h      |  34 +++
 3 files changed, 290 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_buddy.c

Comments

David Miller Sept. 26, 2020, 10:15 p.m. UTC | #1
From: saeed@kernel.org
Date: Fri, 25 Sep 2020 12:37:55 -0700

> From: Yevgeny Kliteynik <kliteyn@nvidia.com>
> 
> Add implementation of SW Steering variation of buddy allocator.
> 
> The buddy system for ICM memory uses 2 main data structures:
>   - Bitmap per order, that keeps the current state of allocated
>     blocks for this order
>   - Indicator for the number of available blocks per each order
> 
> In addition, there is one more hierarchy of searching in the bitmaps
> in order to accelerate the search of the next free block, which is done
> via a find-first function:
> The buddy system spends a lot of its time searching for the next free
> space using the function find_first_bit, which scans a big array of long
> values in order to find the first set bit. We added one more array of
> longs, where each bit indicates whether the corresponding long in the
> original array contains any set bit; that way far fewer words need to be
> scanned to find the next free area.
> 
> For example, for the following bits array of 128 bits where all
> bits are zero except for the last bit  :  0000........00001
> the corresponding bits-per-long array is:  0001
> 
> The search will be done over the bits-per-long array first, and after
> the first bit is found, we will use it as a start pointer in the
> bigger array (bits array).
> 
> Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
> Reviewed-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

Instead of a bits-per-long array, it seems so much simpler and more
cache friendly to maintain instead just a "lowest set bit" value.

In the initial state it is zero for all values; remember, this is just
a hint.

When you allocate, if num_free is non-zero of course, you start the
bit scan from the "lowest set bit" value.  When the set bit is found,
update the "lowest set bit" cache to the set bit plus one (or zero if
the new value exceeds the bitmap size).

Then on free you update "lowest set bit" to the bit being set if it is
smaller than the current "lowest set bit" value for that order.

No double scanning of bitmap arrays, just a single bitmap search with
a variable start point.
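
Roughly, in kernel-style code (a sketch only; these names are
illustrative and not taken from this series), the per-order hint would
look like:

/* Hypothetical sketch of the per-order "lowest set bit" hint described
 * above; field and function names are illustrative only.
 */
#include <linux/bitops.h>
#include <linux/errno.h>

struct order_bitmap {
	unsigned long *bitmap;	/* set bit == free block of this order */
	unsigned int num_free;	/* number of set bits in @bitmap */
	unsigned int nbits;	/* size of @bitmap in bits */
	unsigned int hint;	/* "lowest set bit" cache (just a hint) */
};

static int hint_alloc(struct order_bitmap *o, unsigned int *seg)
{
	unsigned int bit;

	if (!o->num_free)
		return -ENOMEM;

	/* Start the scan from the hint rather than from bit 0. */
	bit = find_next_bit(o->bitmap, o->nbits, o->hint);
	if (bit >= o->nbits)	/* it is only a hint; fall back to a full scan */
		bit = find_first_bit(o->bitmap, o->nbits);

	clear_bit(bit, o->bitmap);
	o->num_free--;
	o->hint = (bit + 1 < o->nbits) ? bit + 1 : 0;
	*seg = bit;
	return 0;
}

static void hint_free(struct order_bitmap *o, unsigned int seg)
{
	set_bit(seg, o->bitmap);
	o->num_free++;
	if (seg < o->hint)	/* pull the hint back on free */
		o->hint = seg;
}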
Yevgeny Kliteynik Sept. 28, 2020, 7:58 p.m. UTC | #2
> From: David Miller <davem@davemloft.net>
> Sent: Sunday, September 27, 2020 01:16
> To: saeed@kernel.org
> Cc: kuba@kernel.org; netdev@vger.kernel.org; Yevgeny Kliteynik
> <kliteyn@nvidia.com>; Erez Shitrit <erezsh@nvidia.com>; Mark Bloch
> <mbloch@nvidia.com>; Saeed Mahameed <saeedm@nvidia.com>
> Subject: Re: [net-next 01/15] net/mlx5: DR, Add buddy allocator utilities
> 
> From: saeed@kernel.org
> Date: Fri, 25 Sep 2020 12:37:55 -0700
> 
> > From: Yevgeny Kliteynik <kliteyn@nvidia.com>
> >
> > Add implementation of SW Steering variation of buddy allocator.
> >
> > The buddy system for ICM memory uses 2 main data structures:
> >   - Bitmap per order, that keeps the current state of allocated
> >     blocks for this order
> >   - Indicator for the number of available blocks per each order
> >
> > In addition, there is one more hierarchy of searching in the bitmaps
> > in order to accelerate the search of the next free block, which is
> > done via a find-first function:
> > The buddy system spends a lot of its time searching for the next free
> > space using the function find_first_bit, which scans a big array of
> > long values in order to find the first set bit. We added one more
> > array of longs, where each bit indicates whether the corresponding
> > long in the original array contains any set bit; that way far fewer
> > words need to be scanned to find the next free area.
> >
> > For example, for the following bits array of 128 bits where all bits
> > are zero except for the last bit  :  0000........00001 the
> > corresponding bits-per-long array is:  0001
> >
> > The search will be done over the bits-per-long array first, and after
> > the first bit is found, we will use it as a start pointer in the
> > bigger array (bits array).
> >
> > Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
> > Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
> > Reviewed-by: Mark Bloch <mbloch@nvidia.com>
> > Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> 
> Instead of a bits-per-long array, it seems so much simpler and more cache
> friendly to maintain instead just a "lowest set bit" value.
> 
> In the initial state it is zero for all values; remember, this is just a hint.
> 
> When you allocate, if num_free is non-zero of course, you start the bit scan
> from the "lowest set bit" value.  When the set bit is found, update the "lowest
> set bit" cache to the set bit plus one (or zero if the new value exceeds the bitmap
> size).

This will work when you allocate everything and free "in order".
In our case the system is more dynamic - we allocate, and then some chunks
are freed all over, creating free spots in the bit array in a rather random manner.
By replacing the bits-per-long array with a single counter we lose this ability
to jump faster to the free spot.
It is still an improvement to start from a certain place and not from the beginning,
but the bits-per-long array saves more time.
Also, in terms of memory, it's not a big hit - it's a bit per 32 chunks (or a byte per 256 chunks).

-- YK

> Then on free you update "lowest set bit" to the bit being set if it is smaller than
> the current "lowest set bit" value for that order.
> 
> No double scanning of bitmap arrays, just a single bitmap search with a variable
> start point.
David Miller Sept. 28, 2020, 9:41 p.m. UTC | #3
From: Yevgeny Kliteynik <kliteyn@nvidia.com>
Date: Mon, 28 Sep 2020 19:58:59 +0000

> By replacing the bits-per-long array with a single counter we lose
> this ability to jump faster to the free spot.

I don't understand why this is true, because upon the free we will
update the hint and that's where the next bit search will start.

Have you even tried my suggestion?
Yevgeny Kliteynik Sept. 29, 2020, 8:44 p.m. UTC | #4
On 29-Sep-20 00:41, David Miller wrote:
> 
> From: Yevgeny Kliteynik <kliteyn@nvidia.com>
> Date: Mon, 28 Sep 2020 19:58:59 +0000
> 
>> By replacing the bits-per-long array with a single counter we lose
>> this ability to jump faster to the free spot.
> 
> I don't understand why this is true, because upon the free we will
> update the hint and that's where the next bit search will start.
> 
> Have you even tried my suggestion?

I did, but I couldn't make it work for some use cases that
I expect to be fairly common. Perhaps I misunderstood something.
Let's look at the following use case (I'll make it 'extreme' to
better illustrate the problem):

A buddy is fully allocated, mostly by small chunks.
This means that the bit array of size n will look something like this:

idx  | 0 | 1 | ...                                     |n-2|n-1|
      |---|---|-----------------------------------------|---|---|
array| 0 | 0 | ....... all zeroes - all full .......   | 0 | 0 |
      |---|---|-----------------------------------------|---|---|

Then chunk that was allocated at index n-1 is freed, which means
that 'lowest set bit' is set to n-1.
Array will look like this:

idx  | 0 | 1 | ...                                     |n-2|n-1|
      |---|---|-----------------------------------------|---|---|
array| 0 | 0 | ....... all zeroes - all full .......   | 0 | 1 |
      |---|---|-----------------------------------------|---|---|


Then chunk that was allocated at index 0 is freed, which means
that 'lowest set bit' is set to 0.
Array will look like this:

idx  | 0 | 1 | ...                                     |n-2|n-1|
      |---|---|-----------------------------------------|---|---|
array| 1 | 0 | ....... all zeroes - all full .......   | 0 | 1 |
      |---|---|-----------------------------------------|---|---|

Then a new allocation request comes in.
The 'lowest set bit' is 0, which allows us to find the spot.

The 'lowest set bit' is now set to 1.
Array will look like this:

idx  | 0 | 1 | ...                                     |n-2|n-1|
      |---|---|-----------------------------------------|---|---|
array| 0 | 0 | ....... all zeroes - all full .......   | 0 | 1 |
      |---|---|-----------------------------------------|---|---|

Then another allocation request comes in.
The 'lowest set bit' is 1, which means that we now need
to scan the whole array.

Is there a way to make the 'lowest set bit' hint work for
use cases similar to what I've described?

-- YK
Yevgeny Kliteynik Oct. 6, 2020, 1:02 p.m. UTC | #5
On 29-Sep-20 23:44, Yevgeny Kliteynik wrote:
 > On 29-Sep-20 00:41, David Miller wrote:
 >>
 >> From: Yevgeny Kliteynik <kliteyn@nvidia.com>
 >> Date: Mon, 28 Sep 2020 19:58:59 +0000
 >>
 >>> By replacing the bits-per-long array with a single counter we lose
 >>> this ability to jump faster to the free spot.
 >>
 >> I don't understand why this is true, because upon the free we will
 >> update the hint and that's where the next bit search will start.
 >>
 >> Have you even tried my suggestion?
 >
 > I did, but I couldn't make it work for some use cases that
 > I expect to be fairly common. Perhaps I misunderstood something.
 > Let's look at the following use case (I'll make it 'extreme' to
 > better illustrate the problem):

Following up on this mail.
In addition to the case I've described, I'd like to clarify why
the 'lowest set bit' hint doesn't work here.

The buddy allocator allocates blocks of different sizes, so when it
scans the bits array, the allocator looks for a free *area* of at
least the required size.
This info can't be stored in a 'lowest set bit' counter.

Furthermore, when the buddy allocator scans for such areas, it
takes into consideration block alignment as required by HW,
which can't be stored in an external counter.

The code here implements a standard buddy allocator with a small
optimization of an additional level to speed up the search.
Do you see something wrong with this additional level?

-- YK


 > A buddy is fully allocated, mostly by small chunks.
 > This means that the bit array of size n will look something like this:
 >
 > idx  | 0 | 1 | ...                                     |n-2|n-1|
 >      |---|---|-----------------------------------------|---|---|
 > array| 0 | 0 | ....... all zeroes - all full .......   | 0 | 0 |
 >      |---|---|-----------------------------------------|---|---|
 >
 > Then chunk that was allocated at index n-1 is freed, which means
 > that 'lowest set bit' is set to n-1.
 > Array will look like this:
 >
 > idx  | 0 | 1 | ...                                     |n-2|n-1|
 >      |---|---|-----------------------------------------|---|---|
 > array| 0 | 0 | ....... all zeroes - all full .......   | 0 | 1 |
 >      |---|---|-----------------------------------------|---|---|
 >
 >
 > Then chunk that was allocated at index 0 is freed, which means
 > that 'lowest set bit' is set to 0.
 > Array will look like this:
 >
 > idx  | 0 | 1 | ...                                     |n-2|n-1|
 >      |---|---|-----------------------------------------|---|---|
 > array| 1 | 0 | ....... all zeroes - all full .......   | 0 | 1 |
 >      |---|---|-----------------------------------------|---|---|
 >
 > Then a new allocation request comes in.
 > The 'lowest set bit' is 0, which allows us to find the spot.
 >
 > The 'lowest set bit' is now set to 1.
 > Array will look like this:
 >
 > idx  | 0 | 1 | ...                                     |n-2|n-1|
 >      |---|---|-----------------------------------------|---|---|
 > array| 0 | 0 | ....... all zeroes - all full .......   | 0 | 1 |
 >      |---|---|-----------------------------------------|---|---|
 >
 > Then another allocation request comes in.
 > The 'lowest set bit' is 1, which means that we now need
 > to scan the whole array.
 >
 > Is there a way to make the 'lowest set bit' hint work for
 > use cases similar to what I've described?
 >
 > -- YK
 >
David Miller Oct. 6, 2020, 2:47 p.m. UTC | #6
From: Yevgeny Kliteynik <kliteyn@nvidia.com>
Date: Tue, 6 Oct 2020 16:02:24 +0300

> The buddy allocator allocates blocks of different sizes, so when it
> scans the bits array, the allocator looks for a free *area* of at
> least the required size.
> This info can't be stored in a 'lowest set bit' counter.

If you make it per-order, why not?

> Furthermore, when the buddy allocator scans for such areas, it
> takes into consideration block alignment as required by HW,
> which can't be stored in an external counter.

That's why you scan the bits, which you have to do anyways.

I'm kinda disappointed in how this discussion is going, to be honest.
Yevgeny Kliteynik Oct. 7, 2020, 9:25 p.m. UTC | #7
On 06-Oct-20 17:47, David Miller wrote:
 > From: Yevgeny Kliteynik <kliteyn@nvidia.com>
 > Date: Tue, 6 Oct 2020 16:02:24 +0300
 >
 >> The buddy allocator allocates blocks of different sizes, so when it
 >> scans the bits array, the allocator looks for a free *area* of at
 >> least the required size.
 >> This info can't be stored in a 'lowest set bit' counter.
 >
 > If you make it per-order, why not?

OK, I’ll send the V2 series with the standard allocator.
The optimization will be handled in a separate patch later on.

Thanks

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9826a041e407..132786e853aa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -80,7 +80,7 @@  mlx5_core-$(CONFIG_MLX5_EN_TLS) += en_accel/tls.o en_accel/tls_rxtx.o en_accel/t
 
 mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o \
 					steering/dr_matcher.o steering/dr_rule.o \
-					steering/dr_icm_pool.o \
+					steering/dr_icm_pool.o steering/dr_buddy.o \
 					steering/dr_ste.o steering/dr_send.o \
 					steering/dr_cmd.o steering/dr_fw.o \
 					steering/dr_action.o steering/fs_dr.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_buddy.c b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_buddy.c
new file mode 100644
index 000000000000..9cabf968ac11
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_buddy.c
@@ -0,0 +1,255 @@ 
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2004 Topspin Communications. All rights reserved.
+ * Copyright (c) 2005 - 2008 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2006 - 2007 Cisco Systems, Inc. All rights reserved.
+ * Copyright (c) 2020 NVIDIA CORPORATION. All rights reserved.
+ */
+
+#include "dr_types.h"
+
+static unsigned long dr_find_first_bit(const unsigned long *bitmap_per_long,
+				       const unsigned long *bitmap,
+				       unsigned long size)
+{
+	unsigned int bit_per_long_size = DIV_ROUND_UP(size, BITS_PER_LONG);
+	unsigned int bitmap_idx;
+
+	/* find the first set bit in the per-long (summary) level */
+	bitmap_idx = find_first_bit(bitmap_per_long, bit_per_long_size);
+	/* then find the exact free bit within the full bitmap */
+	return find_next_bit(bitmap, size, bitmap_idx * BITS_PER_LONG);
+}
+
+int mlx5dr_buddy_init(struct mlx5dr_icm_buddy_mem *buddy,
+		      unsigned int max_order)
+{
+	int i;
+
+	buddy->max_order = max_order;
+
+	INIT_LIST_HEAD(&buddy->list_node);
+	INIT_LIST_HEAD(&buddy->used_list);
+	INIT_LIST_HEAD(&buddy->hot_list);
+
+	buddy->bitmap = kcalloc(buddy->max_order + 1,
+				sizeof(*buddy->bitmap),
+				GFP_KERNEL);
+	buddy->num_free = kcalloc(buddy->max_order + 1,
+				  sizeof(*buddy->num_free),
+				  GFP_KERNEL);
+	buddy->bitmap_per_long = kcalloc(buddy->max_order + 1,
+					 sizeof(*buddy->bitmap_per_long),
+					 GFP_KERNEL);
+
+	if (!buddy->bitmap || !buddy->num_free || !buddy->bitmap_per_long)
+		goto err_free_all;
+
+	/* Allocating max_order + 1 bitmaps, one per order */
+
+	for (i = 0; i <= buddy->max_order; ++i) {
+		unsigned int size = 1 << (buddy->max_order - i);
+
+		buddy->bitmap[i] = bitmap_zalloc(size, GFP_KERNEL);
+		if (!buddy->bitmap[i])
+			goto err_out_free_each_bit_per_order;
+	}
+
+	for (i = 0; i <= buddy->max_order; ++i) {
+		unsigned int size = BITS_TO_LONGS(1 << (buddy->max_order - i));
+
+		buddy->bitmap_per_long[i] = bitmap_zalloc(size, GFP_KERNEL);
+		if (!buddy->bitmap_per_long[i])
+			goto err_out_free_set;
+	}
+
+	/* In the beginning, we have only one order that is available for
+	 * use (the biggest one), so mark the first bit in both bitmaps.
+	 */
+
+	bitmap_set(buddy->bitmap[buddy->max_order], 0, 1);
+	bitmap_set(buddy->bitmap_per_long[buddy->max_order], 0, 1);
+
+	buddy->num_free[buddy->max_order] = 1;
+
+	return 0;
+
+err_out_free_set:
+	for (i = 0; i <= buddy->max_order; ++i)
+		bitmap_free(buddy->bitmap_per_long[i]);
+
+err_out_free_each_bit_per_order:
+	for (i = 0; i <= buddy->max_order; ++i)
+		bitmap_free(buddy->bitmap[i]);
+
+err_free_all:
+	kfree(buddy->bitmap_per_long);
+	kfree(buddy->num_free);
+	kfree(buddy->bitmap);
+	return -ENOMEM;
+}
+
+void mlx5dr_buddy_cleanup(struct mlx5dr_icm_buddy_mem *buddy)
+{
+	int i;
+
+	list_del(&buddy->list_node);
+
+	for (i = 0; i <= buddy->max_order; ++i) {
+		bitmap_free(buddy->bitmap[i]);
+		bitmap_free(buddy->bitmap_per_long[i]);
+	}
+
+	kfree(buddy->bitmap_per_long);
+	kfree(buddy->num_free);
+	kfree(buddy->bitmap);
+}
+
+/**
+ * dr_buddy_get_seg_borders() - Find the borders of a specific segment.
+ * @seg: Segment number.
+ * @low: Pointer to hold the low border of the provided segment.
+ * @high: Pointer to hold the high border of the provided segment.
+ *
+ * Find the borders (high and low) of a specific seg (segment location)
+ * in the lower level of the bitmap, in order to update the upper layer
+ * of the bitmap.
+ */
+static void dr_buddy_get_seg_borders(unsigned int seg,
+				     unsigned int *low,
+				     unsigned int *high)
+{
+	*low = (seg / BITS_PER_LONG) * BITS_PER_LONG;
+	*high = ((seg / BITS_PER_LONG) + 1) * BITS_PER_LONG;
+}
+
+/**
+ * dr_buddy_update_upper_bitmap() - Update second level bitmap.
+ * @buddy: Buddy to update.
+ * @seg: Segment number.
+ * @order: Order of the buddy to update.
+ *
+ * We have two layers of searching in the bitmaps, so when
+ * needed update the second layer of search.
+ */
+static void dr_buddy_update_upper_bitmap(struct mlx5dr_icm_buddy_mem *buddy,
+					 unsigned long seg,
+					 unsigned int order)
+{
+	unsigned int h, l, m;
+
+	/* clear upper layer of search if needed */
+	dr_buddy_get_seg_borders(seg, &l, &h);
+	m = find_next_bit(buddy->bitmap[order], h, l);
+	if (m == h) /* nothing in the long that includes seg */
+		bitmap_clear(buddy->bitmap_per_long[order],
+			     seg / BITS_PER_LONG, 1);
+}
+
+static int dr_buddy_find_free_seg(struct mlx5dr_icm_buddy_mem *buddy,
+				  unsigned int start_order,
+				  unsigned int *segment,
+				  unsigned int *order)
+{
+	unsigned int seg, order_iter, m;
+
+	for (order_iter = start_order;
+	     order_iter <= buddy->max_order; ++order_iter) {
+		if (!buddy->num_free[order_iter])
+			continue;
+
+		m = 1 << (buddy->max_order - order_iter);
+		seg = dr_find_first_bit(buddy->bitmap_per_long[order_iter],
+					buddy->bitmap[order_iter], m);
+
+		if (WARN(seg >= m,
+			 "ICM Buddy: failed finding free mem for order %d\n",
+			 order_iter))
+			return -ENOMEM;
+
+		break;
+	}
+
+	if (order_iter > buddy->max_order)
+		return -ENOMEM;
+
+	*segment = seg;
+	*order = order_iter;
+	return 0;
+}
+
+/**
+ * mlx5dr_buddy_alloc_mem() - Allocate a memory segment of the requested order.
+ * @buddy: Buddy to allocate from.
+ * @order: Order of the requested segment.
+ * @segment: Pointer to hold the allocated segment index.
+ *
+ * This function finds the first free area of the ICM memory managed by this
+ * buddy. It uses the data structures of the buddy system to look for a free
+ * area, starting from the requested order up to the maximum order in the
+ * system.
+ *
+ * Return: 0 when segment is set, non-zero error status otherwise.
+ *
+ * The function returns the location (segment) in the whole buddy ICM memory
+ * area - the index of the memory segment that is available for use.
+ */
+int mlx5dr_buddy_alloc_mem(struct mlx5dr_icm_buddy_mem *buddy,
+			   unsigned int order,
+			   unsigned int *segment)
+{
+	unsigned int seg, order_iter;
+	int err;
+
+	err = dr_buddy_find_free_seg(buddy, order, &seg, &order_iter);
+	if (err)
+		return err;
+
+	bitmap_clear(buddy->bitmap[order_iter], seg, 1);
+	/* clear upper layer of search if needed */
+	dr_buddy_update_upper_bitmap(buddy, seg, order_iter);
+	--buddy->num_free[order_iter];
+
+	/* If we found free memory in some order that is bigger than the
+	 * required order, we need to split every order between the required
+	 * order and the order that we found into two parts, and mark accordingly.
+	 */
+	while (order_iter > order) {
+		--order_iter;
+		seg <<= 1;
+		bitmap_set(buddy->bitmap[order_iter], seg ^ 1, 1);
+		bitmap_set(buddy->bitmap_per_long[order_iter],
+			   (seg ^ 1) / BITS_PER_LONG, 1);
+
+		++buddy->num_free[order_iter];
+	}
+
+	seg <<= order;
+	*segment = seg;
+
+	return 0;
+}
+
+void mlx5dr_buddy_free_mem(struct mlx5dr_icm_buddy_mem *buddy,
+			   unsigned int seg, unsigned int order)
+{
+	seg >>= order;
+
+	/* Whenever a segment is freed, the memory is returned to the
+	 * buddy that gave it, merging adjacent free buddies when possible.
+	 */
+	while (test_bit(seg ^ 1, buddy->bitmap[order])) {
+		bitmap_clear(buddy->bitmap[order], seg ^ 1, 1);
+		dr_buddy_update_upper_bitmap(buddy, seg ^ 1, order);
+		--buddy->num_free[order];
+		seg >>= 1;
+		++order;
+	}
+	bitmap_set(buddy->bitmap[order], seg, 1);
+	bitmap_set(buddy->bitmap_per_long[order],
+		   seg / BITS_PER_LONG, 1);
+
+	++buddy->num_free[order];
+}
+
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5dr.h b/drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5dr.h
index 7deaca9ade3b..40bdbdc6e3f2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5dr.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5dr.h
@@ -126,4 +126,38 @@  mlx5dr_is_supported(struct mlx5_core_dev *dev)
 	return MLX5_CAP_ESW_FLOWTABLE_FDB(dev, sw_owner);
 }
 
+/* buddy functions & structure */
+
+struct mlx5dr_icm_mr;
+
+struct mlx5dr_icm_buddy_mem {
+	unsigned long		**bitmap;
+	unsigned int		*num_free;
+	unsigned long		**bitmap_per_long;
+	u32			max_order;
+	struct list_head	list_node;
+	struct mlx5dr_icm_mr	*icm_mr;
+	struct mlx5dr_icm_pool	*pool;
+
+	/* This is the list of used chunks. HW may be accessing this memory */
+	struct list_head	used_list;
+
+	/* Hardware may be accessing this memory but at some future,
+	 * undetermined time, it might cease to do so.
+	 * sync_ste command sets them free.
+	 */
+	struct list_head	hot_list;
+	/* indicates the byte size of hot mem */
+	unsigned int		hot_memory_size;
+};
+
+int mlx5dr_buddy_init(struct mlx5dr_icm_buddy_mem *buddy,
+		      unsigned int max_order);
+void mlx5dr_buddy_cleanup(struct mlx5dr_icm_buddy_mem *buddy);
+int mlx5dr_buddy_alloc_mem(struct mlx5dr_icm_buddy_mem *buddy,
+			   unsigned int order,
+			   unsigned int *segment);
+void mlx5dr_buddy_free_mem(struct mlx5dr_icm_buddy_mem *buddy,
+			   unsigned int seg, unsigned int order);
+
 #endif /* _MLX5DR_H_ */
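
For reference, a caller of the API above would be expected to drive it
roughly as follows. This is only a hedged sketch with illustrative
names, sizes and error handling; the real consumer is the ICM pool code
(dr_icm_pool.c), which subsequent patches in this series are expected
to wire up to this allocator and which also links the buddy into a
pool list.

#include <linux/errno.h>
#include <linux/slab.h>
#include "mlx5dr.h"

static int example_buddy_usage(void)
{
	struct mlx5dr_icm_buddy_mem *buddy;
	unsigned int seg;
	int err;

	buddy = kzalloc(sizeof(*buddy), GFP_KERNEL);
	if (!buddy)
		return -ENOMEM;

	/* Manage 2^7 = 128 segments of the smallest chunk size. */
	err = mlx5dr_buddy_init(buddy, 7);
	if (err)
		goto free_buddy;

	/* Request a contiguous, naturally aligned run of 2^2 = 4 segments. */
	err = mlx5dr_buddy_alloc_mem(buddy, 2, &seg);
	if (err)
		goto cleanup;

	/* ... map/use ICM starting at segment index 'seg' ... */

	/* Return the run to the same buddy with the same order. */
	mlx5dr_buddy_free_mem(buddy, seg, 2);

cleanup:
	mlx5dr_buddy_cleanup(buddy);
free_buddy:
	kfree(buddy);
	return err;
}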