[v3,net-next,08/14] mlx4: use order-0 pages for RX

Message ID	20170213195858.5215-9-edumazet@google.com
State	Deferred, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> From: Eric Dumazet <edumazet@google.com> To: "David S . Miller" <davem@davemloft.net> Cc: netdev <netdev@vger.kernel.org>, Tariq Toukan <tariqt@mellanox.com>, Martin KaFai Lau <kafai@fb.com>, Saeed Mahameed <saeedm@mellanox.com>, Willem de Bruijn <willemb@google.com>, Jesper Dangaard Brouer <brouer@redhat.com>, Brenden Blanco <bblanco@plumgrid.com>, Alexei Starovoitov <ast@kernel.org>, Eric Dumazet <edumazet@google.com>, Eric Dumazet <eric.dumazet@gmail.com> Subject: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX Date: Mon, 13 Feb 2017 11:58:52 -0800 Message-Id: <20170213195858.5215-9-edumazet@google.com> In-Reply-To: <20170213195858.5215-1-edumazet@google.com> References: <20170213195858.5215-1-edumazet@google.com> Sender: netdev-owner@vger.kernel.org Precedence: bulk

Eric Dumazet Feb. 13, 2017, 7:58 p.m. UTC

Use of order-3 pages is problematic in some cases.

This patch might add three kinds of regression :

1) a CPU performance regression, but we will add later page
recycling and performance should be back.

2) TCP receiver could grow its receive window slightly slower,
   because skb->len/skb->truesize ratio will decrease.
   This is mostly ok, we prefer being conservative to not risk OOM,
   and eventually tune TCP better in the future.
   This is consistent with other drivers using 2048 per ethernet frame.

3) Because we allocate one page per RX slot, we consume more
   memory for the ring buffers. XDP already had this constraint anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 72 +++++++++++++---------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 --
 2 files changed, 33 insertions(+), 43 deletions(-)

Alexander H Duyck Feb. 13, 2017, 8:51 p.m. UTC | #1

On Mon, Feb 13, 2017 at 11:58 AM, Eric Dumazet <edumazet@google.com> wrote:
> Use of order-3 pages is problematic in some cases.
>
> This patch might add three kinds of regression :
>
> 1) a CPU performance regression, but we will add later page
> recycling and performance should be back.
>
> 2) TCP receiver could grow its receive window slightly slower,
>    because skb->len/skb->truesize ratio will decrease.
>    This is mostly ok, we prefer being conservative to not risk OOM,
>    and eventually tune TCP better in the future.
>    This is consistent with other drivers using 2048 per ethernet frame.
>
> 3) Because we allocate one page per RX slot, we consume more
>    memory for the ring buffers. XDP already had this constraint anyway.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 72 +++++++++++++---------------
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 --
>  2 files changed, 33 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 0c61c1200f2aec4c74d7403d9ec3ed06821c1bac..069ea09185fb0669d5c1f9b1b88f517b2d69c5ed 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -53,38 +53,26 @@
>  static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
>                             struct mlx4_en_rx_alloc *page_alloc,
>                             const struct mlx4_en_frag_info *frag_info,
> -                           gfp_t _gfp)
> +                           gfp_t gfp)
>  {
> -       int order;
>         struct page *page;
>         dma_addr_t dma;
>
> -       for (order = priv->rx_page_order; ;) {
> -               gfp_t gfp = _gfp;
> -
> -               if (order)
> -                       gfp |= __GFP_COMP | __GFP_NOWARN | __GFP_NOMEMALLOC;
> -               page = alloc_pages(gfp, order);
> -               if (likely(page))
> -                       break;
> -               if (--order < 0 ||
> -                   ((PAGE_SIZE << order) < frag_info->frag_size))
> -                       return -ENOMEM;
> -       }
> -       dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
> -                          priv->dma_dir);
> +       page = alloc_page(gfp);
> +       if (unlikely(!page))
> +               return -ENOMEM;
> +       dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
>         if (unlikely(dma_mapping_error(priv->ddev, dma))) {
>                 put_page(page);
>                 return -ENOMEM;
>         }
> -       page_alloc->page_size = PAGE_SIZE << order;
>         page_alloc->page = page;
>         page_alloc->dma = dma;
>         page_alloc->page_offset = 0;
>         /* Not doing get_page() for each frag is a big win
>          * on asymetric workloads. Note we can not use atomic_set().
>          */
> -       page_ref_add(page, page_alloc->page_size / frag_info->frag_stride - 1);
> +       page_ref_add(page, PAGE_SIZE / frag_info->frag_stride - 1);
>         return 0;
>  }
>
> @@ -105,7 +93,7 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>                 page_alloc[i].page_offset += frag_info->frag_stride;
>
>                 if (page_alloc[i].page_offset + frag_info->frag_stride <=
> -                   ring_alloc[i].page_size)
> +                   PAGE_SIZE)
>                         continue;
>
>                 if (unlikely(mlx4_alloc_pages(priv, &page_alloc[i],
> @@ -127,11 +115,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>         while (i--) {
>                 if (page_alloc[i].page != ring_alloc[i].page) {
>                         dma_unmap_page(priv->ddev, page_alloc[i].dma,
> -                               page_alloc[i].page_size,
> -                               priv->dma_dir);
> +                                      PAGE_SIZE, priv->dma_dir);
>                         page = page_alloc[i].page;
>                         /* Revert changes done by mlx4_alloc_pages */
> -                       page_ref_sub(page, page_alloc[i].page_size /
> +                       page_ref_sub(page, PAGE_SIZE /
>                                            priv->frag_info[i].frag_stride - 1);
>                         put_page(page);
>                 }
> @@ -147,8 +134,8 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
>         u32 next_frag_end = frags[i].page_offset + 2 * frag_info->frag_stride;
>
>
> -       if (next_frag_end > frags[i].page_size)
> -               dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
> +       if (next_frag_end > PAGE_SIZE)
> +               dma_unmap_page(priv->ddev, frags[i].dma, PAGE_SIZE,
>                                priv->dma_dir);
>
>         if (frags[i].page)
> @@ -168,9 +155,8 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
>                                      frag_info, GFP_KERNEL | __GFP_COLD))
>                         goto out;
>
> -               en_dbg(DRV, priv, "  frag %d allocator: - size:%d frags:%d\n",
> -                      i, ring->page_alloc[i].page_size,
> -                      page_ref_count(ring->page_alloc[i].page));
> +               en_dbg(DRV, priv, "  frag %d allocator: - frags:%d\n",
> +                      i, page_ref_count(ring->page_alloc[i].page));
>         }
>         return 0;
>
> @@ -180,11 +166,10 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
>
>                 page_alloc = &ring->page_alloc[i];
>                 dma_unmap_page(priv->ddev, page_alloc->dma,
> -                              page_alloc->page_size,
> -                              priv->dma_dir);
> +                              PAGE_SIZE, priv->dma_dir);
>                 page = page_alloc->page;
>                 /* Revert changes done by mlx4_alloc_pages */
> -               page_ref_sub(page, page_alloc->page_size /
> +               page_ref_sub(page, PAGE_SIZE /
>                                    priv->frag_info[i].frag_stride - 1);
>                 put_page(page);
>                 page_alloc->page = NULL;

You can probably simplify this by using __page_frag_cache_drain since
that way you can just doe the sub and test for the whole set instead
of doing N - 1 and 1.

Eric Dumazet Feb. 13, 2017, 9:09 p.m. UTC | #2

On Mon, Feb 13, 2017 at 12:51 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 11:58 AM, Eric Dumazet <edumazet@google.com> wrote:

>> +                              PAGE_SIZE, priv->dma_dir);
>>                 page = page_alloc->page;
>>                 /* Revert changes done by mlx4_alloc_pages */
>> -               page_ref_sub(page, page_alloc->page_size /
>> +               page_ref_sub(page, PAGE_SIZE /
>>                                    priv->frag_info[i].frag_stride - 1);
>>                 put_page(page);
>>                 page_alloc->page = NULL;
>
> You can probably simplify this by using __page_frag_cache_drain since
> that way you can just doe the sub and test for the whole set instead
> of doing N - 1 and 1.

Well, we remove this anyway in following patch ;)

Alexander H Duyck Feb. 13, 2017, 11:16 p.m. UTC | #3

On Mon, Feb 13, 2017 at 1:09 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Feb 13, 2017 at 12:51 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Mon, Feb 13, 2017 at 11:58 AM, Eric Dumazet <edumazet@google.com> wrote:
>
>>> +                              PAGE_SIZE, priv->dma_dir);
>>>                 page = page_alloc->page;
>>>                 /* Revert changes done by mlx4_alloc_pages */
>>> -               page_ref_sub(page, page_alloc->page_size /
>>> +               page_ref_sub(page, PAGE_SIZE /
>>>                                    priv->frag_info[i].frag_stride - 1);
>>>                 put_page(page);
>>>                 page_alloc->page = NULL;
>>
>> You can probably simplify this by using __page_frag_cache_drain since
>> that way you can just doe the sub and test for the whole set instead
>> of doing N - 1 and 1.
>
> Well, we remove this anyway in following patch ;)

Any plans to add the bulk page count updates back at some point?  I
just got around to adding it for igb in commit bd4171a5d4c2 ("igb:
update code to better handle incrementing page count").  I should have
patches for ixgbe, i40e, and i40evf here pretty soon.  That was one of
the reasons why I implemented __page_frag_cache_drain in the first
place.  As I'm sure Jesper will probably point out the atomic op for
get_page/page_ref_inc can be pretty expensive if I recall correctly.

- Alex

Eric Dumazet Feb. 13, 2017, 11:22 p.m. UTC | #4

On Mon, Feb 13, 2017 at 3:16 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:

> Any plans to add the bulk page count updates back at some point?  I
> just got around to adding it for igb in commit bd4171a5d4c2 ("igb:
> update code to better handle incrementing page count").  I should have
> patches for ixgbe, i40e, and i40evf here pretty soon.  That was one of
> the reasons why I implemented __page_frag_cache_drain in the first
> place.  As I'm sure Jesper will probably point out the atomic op for
> get_page/page_ref_inc can be pretty expensive if I recall correctly.

As a matter of fact I did this last week, but could not show any
improvement on one TCP flow,
so decided to not include this in this patch series.

Maybe the issue was more a sender issue, I might revisit this later.

Alexander H Duyck Feb. 13, 2017, 11:26 p.m. UTC | #5

On Mon, Feb 13, 2017 at 3:22 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Feb 13, 2017 at 3:16 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>
>> Any plans to add the bulk page count updates back at some point?  I
>> just got around to adding it for igb in commit bd4171a5d4c2 ("igb:
>> update code to better handle incrementing page count").  I should have
>> patches for ixgbe, i40e, and i40evf here pretty soon.  That was one of
>> the reasons why I implemented __page_frag_cache_drain in the first
>> place.  As I'm sure Jesper will probably point out the atomic op for
>> get_page/page_ref_inc can be pretty expensive if I recall correctly.
>
> As a matter of fact I did this last week, but could not show any
> improvement on one TCP flow,
> so decided to not include this in this patch series.
>
> Maybe the issue was more a sender issue, I might revisit this later.

Odds are for a single TCP flow you won't notice.  This tends to be
more of a small packets type performance issue.  If you hammer on he
Rx using pktgen you would be more likely to see it.

Anyway patch looks fine from a functional perspective so I am okay
with it.  The page count bulking is something that can be addressed
later.

Thanks.

- Alex

Eric Dumazet Feb. 13, 2017, 11:29 p.m. UTC | #6

On Mon, Feb 13, 2017 at 3:26 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:

>
> Odds are for a single TCP flow you won't notice.  This tends to be
> more of a small packets type performance issue.  If you hammer on he
> Rx using pktgen you would be more likely to see it.
>
> Anyway patch looks fine from a functional perspective so I am okay
> with it.  The page count bulking is something that can be addressed
> later.

Well, if we hammer enough, pages are not recycled because we cycle
through RX ring buffer too fast.

This might be different for PowerPC though, because of its huge PAGE_SIZE.

Alexander H Duyck Feb. 13, 2017, 11:47 p.m. UTC | #7

On Mon, Feb 13, 2017 at 3:29 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Feb 13, 2017 at 3:26 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>
>>
>> Odds are for a single TCP flow you won't notice.  This tends to be
>> more of a small packets type performance issue.  If you hammer on he
>> Rx using pktgen you would be more likely to see it.
>>
>> Anyway patch looks fine from a functional perspective so I am okay
>> with it.  The page count bulking is something that can be addressed
>> later.
>
> Well, if we hammer enough, pages are not recycled because we cycle
> through RX ring buffer too fast.
>
> This might be different for PowerPC though, because of its huge PAGE_SIZE.

Actually it depends on the use case.  In the case of pktgen packets
they are usually dropped pretty early in the receive path.  Think
something more along the lines of a TCP syn flood versus something
that would be loading up a socket.

- Alex

Eric Dumazet Feb. 14, 2017, 12:22 a.m. UTC | #8

On Mon, Feb 13, 2017 at 3:47 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:

> Actually it depends on the use case.  In the case of pktgen packets
> they are usually dropped pretty early in the receive path.  Think
> something more along the lines of a TCP syn flood versus something
> that would be loading up a socket.

So I gave another spin an it, reducing the MTU on the sender to 500
instead of 1500 to triple the load (in term of packets per second)
since the sender seems to hit some kind of limit around 30Gbit.

Number of packets we process on one RX queue , and one TCP flow.

Current patch series : 6.3 Mpps

Which is not too bad ;)

This number does not change putting your __page_frag_cache_drain() trick.

Alexander H Duyck Feb. 14, 2017, 12:34 a.m. UTC | #9

On Mon, Feb 13, 2017 at 4:22 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Feb 13, 2017 at 3:47 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>
>> Actually it depends on the use case.  In the case of pktgen packets
>> they are usually dropped pretty early in the receive path.  Think
>> something more along the lines of a TCP syn flood versus something
>> that would be loading up a socket.
>
> So I gave another spin an it, reducing the MTU on the sender to 500
> instead of 1500 to triple the load (in term of packets per second)
> since the sender seems to hit some kind of limit around 30Gbit.
>
> Number of packets we process on one RX queue , and one TCP flow.
>
> Current patch series : 6.3 Mpps
>
> Which is not too bad ;)
>
> This number does not change putting your __page_frag_cache_drain() trick.

Well the __page_frag_cache_drain by itself isn't going to add much of
anything.  You really aren't going to be using it except for in the
slow path.  I was talking about the bulk page count update by itself.
All __page_frag_cache_drain gets you is it cleans up the
page_frag_sub(n-1); put_page(1); code calls.

The approach I took for the Intel drivers isn't too different than
what we did for the skbuff page frag cache.  Basically I update the
page count once every 65535 frames setting it to 64K, and holding no
more than 65535 references in the pagecnt_bias.  Also in my code I
don't update the count until after the first recycle call since the
highest likelihood of us discarding the frame is after the first
allocation anyway so we wait until the first recycle to start
repopulating the count.

- Alex

Eric Dumazet Feb. 14, 2017, 12:46 a.m. UTC | #10

On Mon, 2017-02-13 at 16:34 -0800, Alexander Duyck wrote:
> On Mon, Feb 13, 2017 at 4:22 PM, Eric Dumazet <edumazet@google.com> wrote:
> > On Mon, Feb 13, 2017 at 3:47 PM, Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> >
> >> Actually it depends on the use case.  In the case of pktgen packets
> >> they are usually dropped pretty early in the receive path.  Think
> >> something more along the lines of a TCP syn flood versus something
> >> that would be loading up a socket.
> >
> > So I gave another spin an it, reducing the MTU on the sender to 500
> > instead of 1500 to triple the load (in term of packets per second)
> > since the sender seems to hit some kind of limit around 30Gbit.
> >
> > Number of packets we process on one RX queue , and one TCP flow.
> >
> > Current patch series : 6.3 Mpps
> >
> > Which is not too bad ;)
> >
> > This number does not change putting your __page_frag_cache_drain() trick.
> 
> Well the __page_frag_cache_drain by itself isn't going to add much of
> anything.  You really aren't going to be using it except for in the
> slow path.  I was talking about the bulk page count update by itself.
> All __page_frag_cache_drain gets you is it cleans up the
> page_frag_sub(n-1); put_page(1); code calls.
> 
> The approach I took for the Intel drivers isn't too different than
> what we did for the skbuff page frag cache.  Basically I update the
> page count once every 65535 frames setting it to 64K, and holding no
> more than 65535 references in the pagecnt_bias.  Also in my code I
> don't update the count until after the first recycle call since the
> highest likelihood of us discarding the frame is after the first
> allocation anyway so we wait until the first recycle to start
> repopulating the count.

Alex, be assured that I implemented the full thing, of course.

( The pagecnt_bias field, .. refilled every 16K rounds )

And I repeat : I got _no_ convincing results to justify this patch being
sent in this series.

This series is not about adding 1% or 2% performance changes, but adding
robustness to RX allocations.

Eric Dumazet Feb. 14, 2017, 12:47 a.m. UTC | #11

On Mon, 2017-02-13 at 16:46 -0800, Eric Dumazet wrote:

> Alex, be assured that I implemented the full thing, of course.
> 
> ( The pagecnt_bias field, .. refilled every 16K rounds )

Correction, USHRT_MAX is  ~64K, not 16K

Jesper Dangaard Brouer Feb. 14, 2017, 12:12 p.m. UTC | #12

On Mon, 13 Feb 2017 15:16:35 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

[...]
> ... As I'm sure Jesper will probably point out the atomic op for
> get_page/page_ref_inc can be pretty expensive if I recall correctly.

It is important to understand that there are two cases for the cost of
an atomic op, which depend on the cache-coherency state of the
cacheline.

Measured on Skylake CPU i7-6700K CPU @ 4.00GHz

(1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
(2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns

Notice the huge difference. And in case 2, it is enough that the remote
CPU reads the cacheline and brings it into "Shared" (MESI) state, and
the local CPU then does the atomic op.

One key ideas behind the page_pool, is that remote CPUs read/detect
refcnt==1 (Shared-state), and store the page in a small per-CPU array.
When array is full, it gets bulk returned to the shared-ptr-ring pool.
When "local" CPU need new pages, from the shared-ptr-ring it prefetchw
during it's bulk refill, to latency-hide the MESI transitions needed.

Eric Dumazet Feb. 14, 2017, 1:45 p.m. UTC | #13

On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:

> It is important to understand that there are two cases for the cost of
> an atomic op, which depend on the cache-coherency state of the
> cacheline.
>
> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>
> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>

Okay, it seems you guys really want a patch that I said was not giving
good results

Let me publish the numbers I get , adding or not the last (and not
official) patch.

If I _force_ the user space process to run on the other node,
then the results are not the ones Alex or you are expecting.

I have with this patch about 2.7 Mpps of this silly single TCP flow,
and 3.5 Mpps without it.

lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
0.00      0.00      4.50

Profile of the cpu on NUMA node 1 ( netserver consuming data ) :

    54.73%  [kernel]      [k] copy_user_enhanced_fast_string
    31.07%  [kernel]      [k] skb_release_data
     4.24%  [kernel]      [k] skb_copy_datagram_iter
     1.35%  [kernel]      [k] copy_page_to_iter
     0.98%  [kernel]      [k] _raw_spin_lock
     0.90%  [kernel]      [k] skb_release_head_state
     0.60%  [kernel]      [k] tcp_transmit_skb
     0.51%  [kernel]      [k] mlx4_en_xmit
     0.33%  [kernel]      [k] ___cache_free
     0.28%  [kernel]      [k] tcp_rcv_established

Profile of cpu handling mlx4 softirqs (NUMA node 0)


    48.00%  [kernel]          [k] mlx4_en_process_rx_cq
    12.92%  [kernel]          [k] napi_gro_frags
     7.28%  [kernel]          [k] inet_gro_receive
     7.17%  [kernel]          [k] tcp_gro_receive
     5.10%  [kernel]          [k] dev_gro_receive
     4.87%  [kernel]          [k] skb_gro_receive
     2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
     2.04%  [kernel]          [k] __build_skb
     1.02%  [kernel]          [k] napi_reuse_skb.isra.95
     1.01%  [kernel]          [k] tcp4_gro_receive
     0.65%  [kernel]          [k] kmem_cache_alloc
     0.45%  [kernel]          [k] _raw_spin_lock

Without the latest  patch (the exact patch series v3 I submitted),
thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only reads.

lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
0.00      0.00      4.50

Profiles of the two cpus :

    74.85%  [kernel]      [k] copy_user_enhanced_fast_string
     6.42%  [kernel]      [k] skb_release_data
     5.65%  [kernel]      [k] skb_copy_datagram_iter
     1.83%  [kernel]      [k] copy_page_to_iter
     1.59%  [kernel]      [k] _raw_spin_lock
     1.48%  [kernel]      [k] skb_release_head_state
     0.72%  [kernel]      [k] tcp_transmit_skb
     0.68%  [kernel]      [k] mlx4_en_xmit
     0.43%  [kernel]      [k] page_frag_free
     0.38%  [kernel]      [k] ___cache_free
     0.37%  [kernel]      [k] tcp_established_options
     0.37%  [kernel]      [k] __ip_local_out


   37.98%  [kernel]          [k] mlx4_en_process_rx_cq
    26.47%  [kernel]          [k] napi_gro_frags
     7.02%  [kernel]          [k] inet_gro_receive
     5.89%  [kernel]          [k] tcp_gro_receive
     5.17%  [kernel]          [k] dev_gro_receive
     4.80%  [kernel]          [k] skb_gro_receive
     2.61%  [kernel]          [k] __build_skb
     2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
     1.59%  [kernel]          [k] napi_reuse_skb.isra.95
     0.95%  [kernel]          [k] tcp4_gro_receive
     0.51%  [kernel]          [k] kmem_cache_alloc
     0.42%  [kernel]          [k] __inet_lookup_established
     0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu


So probably this will need further analysis, outside of the scope of
this patch series.

Could we now please Ack this v3 and merge it ?

Thanks.



> Notice the huge difference. And in case 2, it is enough that the remote
> CPU reads the cacheline and brings it into "Shared" (MESI) state, and
> the local CPU then does the atomic op.
>
> One key ideas behind the page_pool, is that remote CPUs read/detect
> refcnt==1 (Shared-state), and store the page in a small per-CPU array.
> When array is full, it gets bulk returned to the shared-ptr-ring pool.
> When "local" CPU need new pages, from the shared-ptr-ring it prefetchw
> during it's bulk refill, to latency-hide the MESI transitions needed.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Eric Dumazet Feb. 14, 2017, 2:12 p.m. UTC | #14

On Tue, Feb 14, 2017 at 5:45 AM, Eric Dumazet <edumazet@google.com> wrote:
>
> Could we now please Ack this v3 and merge it ?
>

BTW I found the limitation on sender side.

After doing :

lpaa23:~# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: on  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 400000
pkt-rate-high: 450000

rx-usecs: 16
rx-frames: 44
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 16
tx-frames: 16
tx-usecs-irq: 0
tx-frames-irq: 256

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 128
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

lpaa23:~# ethtool -C eth0 tx-usecs 8 tx-frames 8

-> 36 Gbit on a single TCP flow, this is the highest number I ever got.

Tariq Toukan Feb. 14, 2017, 2:56 p.m. UTC | #15

On 14/02/2017 3:45 PM, Eric Dumazet wrote:
> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>
>> It is important to understand that there are two cases for the cost of
>> an atomic op, which depend on the cache-coherency state of the
>> cacheline.
>>
>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>>
>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>>
> Okay, it seems you guys really want a patch that I said was not giving
> good results
>
> Let me publish the numbers I get , adding or not the last (and not
> official) patch.
>
> If I _force_ the user space process to run on the other node,
> then the results are not the ones Alex or you are expecting.
>
> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> and 3.5 Mpps without it.
>
> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
> 0.00      0.00      4.50
>
> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>
>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
>      31.07%  [kernel]      [k] skb_release_data
>       4.24%  [kernel]      [k] skb_copy_datagram_iter
>       1.35%  [kernel]      [k] copy_page_to_iter
>       0.98%  [kernel]      [k] _raw_spin_lock
>       0.90%  [kernel]      [k] skb_release_head_state
>       0.60%  [kernel]      [k] tcp_transmit_skb
>       0.51%  [kernel]      [k] mlx4_en_xmit
>       0.33%  [kernel]      [k] ___cache_free
>       0.28%  [kernel]      [k] tcp_rcv_established
>
> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>
>
>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
>      12.92%  [kernel]          [k] napi_gro_frags
>       7.28%  [kernel]          [k] inet_gro_receive
>       7.17%  [kernel]          [k] tcp_gro_receive
>       5.10%  [kernel]          [k] dev_gro_receive
>       4.87%  [kernel]          [k] skb_gro_receive
>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>       2.04%  [kernel]          [k] __build_skb
>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
>       1.01%  [kernel]          [k] tcp4_gro_receive
>       0.65%  [kernel]          [k] kmem_cache_alloc
>       0.45%  [kernel]          [k] _raw_spin_lock
>
> Without the latest  patch (the exact patch series v3 I submitted),
> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only reads.
>
> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
> 0.00      0.00      4.50
>
> Profiles of the two cpus :
>
>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
>       6.42%  [kernel]      [k] skb_release_data
>       5.65%  [kernel]      [k] skb_copy_datagram_iter
>       1.83%  [kernel]      [k] copy_page_to_iter
>       1.59%  [kernel]      [k] _raw_spin_lock
>       1.48%  [kernel]      [k] skb_release_head_state
>       0.72%  [kernel]      [k] tcp_transmit_skb
>       0.68%  [kernel]      [k] mlx4_en_xmit
>       0.43%  [kernel]      [k] page_frag_free
>       0.38%  [kernel]      [k] ___cache_free
>       0.37%  [kernel]      [k] tcp_established_options
>       0.37%  [kernel]      [k] __ip_local_out
>
>
>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
>      26.47%  [kernel]          [k] napi_gro_frags
>       7.02%  [kernel]          [k] inet_gro_receive
>       5.89%  [kernel]          [k] tcp_gro_receive
>       5.17%  [kernel]          [k] dev_gro_receive
>       4.80%  [kernel]          [k] skb_gro_receive
>       2.61%  [kernel]          [k] __build_skb
>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
>       0.95%  [kernel]          [k] tcp4_gro_receive
>       0.51%  [kernel]          [k] kmem_cache_alloc
>       0.42%  [kernel]          [k] __inet_lookup_established
>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
>
>
> So probably this will need further analysis, outside of the scope of
> this patch series.
>
> Could we now please Ack this v3 and merge it ?
>
> Thanks.
Thanks Eric.

As the previous series caused hangs, we must run functional regression 
tests over this series as well.
Run has already started, and results will be available tomorrow morning.

In general, I really like this series. The re-factorization looks more 
elegant and more correct, functionally.

However, performance wise: we fear that the numbers will be drastically 
lower with this transition to order-0 pages,
because of the (becoming critical) page allocator and dma operations 
bottlenecks, especially on systems with costly
dma operations, such as ARM, iommu=on, etc...

We already have this exact issue in mlx5, where we moved to order-0 
allocations with a fixed size cache, but that was not enough.
Customers of mlx5 have already complained about the performance 
degradation, and currently this is hurting our business.
We get a clear nack from our performance regression team regarding doing 
the same in mlx4.
So, the question is, can we live with this degradation until those 
bottleneck challenges are addressed?
Following our perf experts feedback, I cannot just simply Ack. We need 
to have a clear plan to close the perf gap or reduce the impact.

Internally, I already implemented "dynamic page-cache" and "page-reuse" 
mechanisms in the driver,
and together they totally bridge the performance gap.
That's why I would like to hear from Jesper what is the status of his 
page_pool API, it is promising and could totally solve these issues.

Regards,
Tariq

>
>
>
>> Notice the huge difference. And in case 2, it is enough that the remote
>> CPU reads the cacheline and brings it into "Shared" (MESI) state, and
>> the local CPU then does the atomic op.
>>
>> One key ideas behind the page_pool, is that remote CPUs read/detect
>> refcnt==1 (Shared-state), and store the page in a small per-CPU array.
>> When array is full, it gets bulk returned to the shared-ptr-ring pool.
>> When "local" CPU need new pages, from the shared-ptr-ring it prefetchw
>> during it's bulk refill, to latency-hide the MESI transitions needed.
>>
>> --
>> Best regards,
>>    Jesper Dangaard Brouer
>>    MSc.CS, Principal Kernel Engineer at Red Hat
>>    LinkedIn: http://www.linkedin.com/in/brouer

Eric Dumazet Feb. 14, 2017, 3:51 p.m. UTC | #16

On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:

> As the previous series caused hangs, we must run functional regression 
> tests over this series as well.
> Run has already started, and results will be available tomorrow morning.
> 
> In general, I really like this series. The re-factorization looks more 
> elegant and more correct, functionally.
> 
> However, performance wise: we fear that the numbers will be drastically 
> lower with this transition to order-0 pages,
> because of the (becoming critical) page allocator and dma operations 
> bottlenecks, especially on systems with costly
> dma operations, such as ARM, iommu=on, etc...
> 

So, again, performance after this patch series his higher,
once you have sensible RX queues parameters, for the expected workload.

Only in pathological cases, you might have some regression.

The old schem was _maybe_ better _when_ memory is not fragmented.

When you run hosts for months, memory _is_ fragmented.

You never see that on benchmarks, unless you force memory being
fragmented.

> We already have this exact issue in mlx5, where we moved to order-0 
> allocations with a fixed size cache, but that was not enough.
> Customers of mlx5 have already complained about the performance 
> degradation, and currently this is hurting our business.
> We get a clear nack from our performance regression team regarding doing 
> the same in mlx4.
> So, the question is, can we live with this degradation until those 
> bottleneck challenges are addressed?

Again, there is no degradation.

We have been using order-0 pages for years at Google.

Only when we made the mistake to rebase from the upstream driver and
order-3 pages we got horrible regressions, causing production outages.

I was silly to believe that mm layer got better.

> Following our perf experts feedback, I cannot just simply Ack. We need 
> to have a clear plan to close the perf gap or reduce the impact.

Your perf experts need to talk to me, or any experts at Google and
Facebook, really.

Anything _relying_ on order-3 pages being available to impress
friends/customers is a lie.

Eric Dumazet Feb. 14, 2017, 4:03 p.m. UTC | #17

> Anything _relying_ on order-3 pages being available to impress
> friends/customers is a lie.
>

BTW, you do understand that on PowerPC right now, an Ethernet frame
holds 65536*8 =  half a MByte , right ?

So any PowerPC host using mlx4 NIC can easily be bringed down, by
using a few TCP flows and sending out of order packets.

That is in itself quite scary.

David Miller Feb. 14, 2017, 5:04 p.m. UTC | #18

From: Tariq Toukan <ttoukan.linux@gmail.com>
Date: Tue, 14 Feb 2017 16:56:49 +0200

> Internally, I already implemented "dynamic page-cache" and
> "page-reuse" mechanisms in the driver, and together they totally
> bridge the performance gap.

I worry about a dynamically growing page cache inside of drivers
because it is invisible to the rest of the kernel.

It responds only to local needs.

The price of the real page allocator comes partly because it can
respond to global needs.

If a driver consumes some unreasonable percentage of system memory, it
is keeping that memory from being used from other parts of the system
even if it would be better for networking to be slightly slower with
less cache because that other thing that needs memory is more
important.

I think this is one of the primary reasons that the MM guys severely
chastise us when we build special purpose local caches into networking
facilities.

And the more I think about it the more I think they are right.

One path I see around all of this is full integration.  Meaning that
we can free pages into the page allocator which are still DMA mapped.
And future allocations from that device are prioritized to take still
DMA mapped objects.

Yes, we still need to make the page allocator faster, but this kind of
work helps everyone not just 100GB ethernet NICs.

David Laight Feb. 14, 2017, 5:17 p.m. UTC | #19

From: David Miller
> Sent: 14 February 2017 17:04
...
> One path I see around all of this is full integration.  Meaning that
> we can free pages into the page allocator which are still DMA mapped.
> And future allocations from that device are prioritized to take still
> DMA mapped objects.
...

For systems with 'expensive' iommu has anyone tried separating the
allocation of iommu resource (eg page table slots) from their
assignment to physical pages?

Provided the page sizes all match, setting up a receive buffer might
be as simple as writing the physical address into the iommu slot
that matches the ring entry.

Or am I thinking about hardware that is much simpler than real life?

	David

David Miller Feb. 14, 2017, 5:22 p.m. UTC | #20

From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 14 Feb 2017 17:17:22 +0000

> From: David Miller
>> Sent: 14 February 2017 17:04
> ...
>> One path I see around all of this is full integration.  Meaning that
>> we can free pages into the page allocator which are still DMA mapped.
>> And future allocations from that device are prioritized to take still
>> DMA mapped objects.
> ...
> 
> For systems with 'expensive' iommu has anyone tried separating the
> allocation of iommu resource (eg page table slots) from their
> assignment to physical pages?
> 
> Provided the page sizes all match, setting up a receive buffer might
> be as simple as writing the physical address into the iommu slot
> that matches the ring entry.
> 
> Or am I thinking about hardware that is much simpler than real life?

You still will eat an expensive MMIO or hypervisor call to setup the
mapping.

IOMMU is expensive because of two operations, the slot allocation
(which takes locks) and the modification of the IOMMU PTE to setup
or teardown the mapping.

This is why attempts to preallocate slots (which people have looked
into) never really takes off.  You really have to eliminate the
entire operation to get worthwhile gains.

Tom Herbert Feb. 14, 2017, 5:29 p.m. UTC | #21

On Tue, Feb 14, 2017 at 7:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:
>
>> As the previous series caused hangs, we must run functional regression
>> tests over this series as well.
>> Run has already started, and results will be available tomorrow morning.
>>
>> In general, I really like this series. The re-factorization looks more
>> elegant and more correct, functionally.
>>
>> However, performance wise: we fear that the numbers will be drastically
>> lower with this transition to order-0 pages,
>> because of the (becoming critical) page allocator and dma operations
>> bottlenecks, especially on systems with costly
>> dma operations, such as ARM, iommu=on, etc...
>>
>
> So, again, performance after this patch series his higher,
> once you have sensible RX queues parameters, for the expected workload.
>
> Only in pathological cases, you might have some regression.
>
> The old schem was _maybe_ better _when_ memory is not fragmented.
>
> When you run hosts for months, memory _is_ fragmented.
>
> You never see that on benchmarks, unless you force memory being
> fragmented.
>
>
>
>> We already have this exact issue in mlx5, where we moved to order-0
>> allocations with a fixed size cache, but that was not enough.
>> Customers of mlx5 have already complained about the performance
>> degradation, and currently this is hurting our business.
>> We get a clear nack from our performance regression team regarding doing
>> the same in mlx4.
>> So, the question is, can we live with this degradation until those
>> bottleneck challenges are addressed?
>
> Again, there is no degradation.
>
> We have been using order-0 pages for years at Google.
>
> Only when we made the mistake to rebase from the upstream driver and
> order-3 pages we got horrible regressions, causing production outages.
>
> I was silly to believe that mm layer got better.
>
>> Following our perf experts feedback, I cannot just simply Ack. We need
>> to have a clear plan to close the perf gap or reduce the impact.
>
> Your perf experts need to talk to me, or any experts at Google and
> Facebook, really.
>

I agree with this 100%! To be blunt, power users like this are testing
your drivers far beyond what Mellanox is doing and understand how
performance gains in benchmarks translate to possible gains in real
production way more than your perf experts can. Listen to Eric!

Tom


> Anything _relying_ on order-3 pages being available to impress
> friends/customers is a lie.
>
>

Alexander H Duyck Feb. 14, 2017, 5:29 p.m. UTC | #22

On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
> On 14/02/2017 3:45 PM, Eric Dumazet wrote:
>>
>> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>
>>> It is important to understand that there are two cases for the cost of
>>> an atomic op, which depend on the cache-coherency state of the
>>> cacheline.
>>>
>>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>>>
>>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
>>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>>>
>> Okay, it seems you guys really want a patch that I said was not giving
>> good results
>>
>> Let me publish the numbers I get , adding or not the last (and not
>> official) patch.
>>
>> If I _force_ the user space process to run on the other node,
>> then the results are not the ones Alex or you are expecting.
>>
>> I have with this patch about 2.7 Mpps of this silly single TCP flow,
>> and 3.5 Mpps without it.
>>
>> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
>> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
>> 0.00      0.00      4.50
>>
>> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>>
>>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
>>      31.07%  [kernel]      [k] skb_release_data
>>       4.24%  [kernel]      [k] skb_copy_datagram_iter
>>       1.35%  [kernel]      [k] copy_page_to_iter
>>       0.98%  [kernel]      [k] _raw_spin_lock
>>       0.90%  [kernel]      [k] skb_release_head_state
>>       0.60%  [kernel]      [k] tcp_transmit_skb
>>       0.51%  [kernel]      [k] mlx4_en_xmit
>>       0.33%  [kernel]      [k] ___cache_free
>>       0.28%  [kernel]      [k] tcp_rcv_established
>>
>> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>>
>>
>>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
>>      12.92%  [kernel]          [k] napi_gro_frags
>>       7.28%  [kernel]          [k] inet_gro_receive
>>       7.17%  [kernel]          [k] tcp_gro_receive
>>       5.10%  [kernel]          [k] dev_gro_receive
>>       4.87%  [kernel]          [k] skb_gro_receive
>>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>>       2.04%  [kernel]          [k] __build_skb
>>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
>>       1.01%  [kernel]          [k] tcp4_gro_receive
>>       0.65%  [kernel]          [k] kmem_cache_alloc
>>       0.45%  [kernel]          [k] _raw_spin_lock
>>
>> Without the latest  patch (the exact patch series v3 I submitted),
>> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
>> reads.
>>
>> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
>> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
>> 0.00      0.00      4.50
>>
>> Profiles of the two cpus :
>>
>>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
>>       6.42%  [kernel]      [k] skb_release_data
>>       5.65%  [kernel]      [k] skb_copy_datagram_iter
>>       1.83%  [kernel]      [k] copy_page_to_iter
>>       1.59%  [kernel]      [k] _raw_spin_lock
>>       1.48%  [kernel]      [k] skb_release_head_state
>>       0.72%  [kernel]      [k] tcp_transmit_skb
>>       0.68%  [kernel]      [k] mlx4_en_xmit
>>       0.43%  [kernel]      [k] page_frag_free
>>       0.38%  [kernel]      [k] ___cache_free
>>       0.37%  [kernel]      [k] tcp_established_options
>>       0.37%  [kernel]      [k] __ip_local_out
>>
>>
>>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
>>      26.47%  [kernel]          [k] napi_gro_frags
>>       7.02%  [kernel]          [k] inet_gro_receive
>>       5.89%  [kernel]          [k] tcp_gro_receive
>>       5.17%  [kernel]          [k] dev_gro_receive
>>       4.80%  [kernel]          [k] skb_gro_receive
>>       2.61%  [kernel]          [k] __build_skb
>>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
>>       0.95%  [kernel]          [k] tcp4_gro_receive
>>       0.51%  [kernel]          [k] kmem_cache_alloc
>>       0.42%  [kernel]          [k] __inet_lookup_established
>>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
>>
>>
>> So probably this will need further analysis, outside of the scope of
>> this patch series.
>>
>> Could we now please Ack this v3 and merge it ?
>>
>> Thanks.
>
> Thanks Eric.
>
> As the previous series caused hangs, we must run functional regression tests
> over this series as well.
> Run has already started, and results will be available tomorrow morning.
>
> In general, I really like this series. The re-factorization looks more
> elegant and more correct, functionally.
>
> However, performance wise: we fear that the numbers will be drastically
> lower with this transition to order-0 pages,
> because of the (becoming critical) page allocator and dma operations
> bottlenecks, especially on systems with costly
> dma operations, such as ARM, iommu=on, etc...

So to give you an idea I originally came up with the approach used in
the Intel drivers when I was dealing with a PowerPC system where the
IOMMU was a requirement.  With this setup correctly the map/unmap
calls should be almost non-existent.  Basically the only time we
should have to allocate or free a page is if something is sitting on
the pages for an excessively long time or if the interrupt is bouncing
between memory nodes which forces us to free the memory since it is
sitting on the wrong node.

> We already have this exact issue in mlx5, where we moved to order-0
> allocations with a fixed size cache, but that was not enough.
> Customers of mlx5 have already complained about the performance degradation,
> and currently this is hurting our business.
> We get a clear nack from our performance regression team regarding doing the
> same in mlx4.

What form of recycling were you doing?  If you were doing the offset
based setup then that obviously only gets you so much.  The advantage
to using the page count is that you get almost a mobius strip effect
for your buffers and descriptor rings where you just keep flipping the
page offset back and forth via an XOR and the only cost is
dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
do the full allocation, mapping, and unmapping.

> So, the question is, can we live with this degradation until those
> bottleneck challenges are addressed?
> Following our perf experts feedback, I cannot just simply Ack. We need to
> have a clear plan to close the perf gap or reduce the impact.

I think we need to define what is the degradation you are expecting to
see.  Using the page count based approach should actually be better in
some cases than the order 3 page and just walking frags approach since
the theoretical reuse is infinite instead of fixed.

> Internally, I already implemented "dynamic page-cache" and "page-reuse"
> mechanisms in the driver,
> and together they totally bridge the performance gap.
> That's why I would like to hear from Jesper what is the status of his
> page_pool API, it is promising and could totally solve these issues.
>
> Regards,
> Tariq

The page pool may provide gains but we have to actually see it before
we can guarantee it.  If worse comes to worse I might just resort to
standardizing the logic used for the Intel driver page count based
approach.  Then if nothing else we would have a standard path for all
the drivers to use if we start going down this route.

- Alex

Jesper Dangaard Brouer Feb. 14, 2017, 6:46 p.m. UTC | #23

On Tue, 14 Feb 2017 09:29:54 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:
> >
> >
> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:  
> >>
> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> >> <brouer@redhat.com> wrote:
> >>  
> >>> It is important to understand that there are two cases for the cost of
> >>> an atomic op, which depend on the cache-coherency state of the
> >>> cacheline.
> >>>
> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
> >>>
> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
> >>>  
> >> Okay, it seems you guys really want a patch that I said was not giving
> >> good results
> >>
> >> Let me publish the numbers I get , adding or not the last (and not
> >> official) patch.
> >>
> >> If I _force_ the user space process to run on the other node,
> >> then the results are not the ones Alex or you are expecting.
> >>
> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> >> and 3.5 Mpps without it.
> >>
> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> >> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
> >> 0.00      0.00      4.50
> >>
> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
> >>
> >>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
> >>      31.07%  [kernel]      [k] skb_release_data
> >>       4.24%  [kernel]      [k] skb_copy_datagram_iter
> >>       1.35%  [kernel]      [k] copy_page_to_iter
> >>       0.98%  [kernel]      [k] _raw_spin_lock
> >>       0.90%  [kernel]      [k] skb_release_head_state
> >>       0.60%  [kernel]      [k] tcp_transmit_skb
> >>       0.51%  [kernel]      [k] mlx4_en_xmit
> >>       0.33%  [kernel]      [k] ___cache_free
> >>       0.28%  [kernel]      [k] tcp_rcv_established
> >>
> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
> >>
> >>
> >>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
> >>      12.92%  [kernel]          [k] napi_gro_frags
> >>       7.28%  [kernel]          [k] inet_gro_receive
> >>       7.17%  [kernel]          [k] tcp_gro_receive
> >>       5.10%  [kernel]          [k] dev_gro_receive
> >>       4.87%  [kernel]          [k] skb_gro_receive
> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >>       2.04%  [kernel]          [k] __build_skb
> >>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
> >>       1.01%  [kernel]          [k] tcp4_gro_receive
> >>       0.65%  [kernel]          [k] kmem_cache_alloc
> >>       0.45%  [kernel]          [k] _raw_spin_lock
> >>
> >> Without the latest  patch (the exact patch series v3 I submitted),
> >> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
> >> reads.
> >>
> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> >> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
> >> 0.00      0.00      4.50
> >>
> >> Profiles of the two cpus :
> >>
> >>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
> >>       6.42%  [kernel]      [k] skb_release_data
> >>       5.65%  [kernel]      [k] skb_copy_datagram_iter
> >>       1.83%  [kernel]      [k] copy_page_to_iter
> >>       1.59%  [kernel]      [k] _raw_spin_lock
> >>       1.48%  [kernel]      [k] skb_release_head_state
> >>       0.72%  [kernel]      [k] tcp_transmit_skb
> >>       0.68%  [kernel]      [k] mlx4_en_xmit
> >>       0.43%  [kernel]      [k] page_frag_free
> >>       0.38%  [kernel]      [k] ___cache_free
> >>       0.37%  [kernel]      [k] tcp_established_options
> >>       0.37%  [kernel]      [k] __ip_local_out
> >>
> >>
> >>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
> >>      26.47%  [kernel]          [k] napi_gro_frags
> >>       7.02%  [kernel]          [k] inet_gro_receive
> >>       5.89%  [kernel]          [k] tcp_gro_receive
> >>       5.17%  [kernel]          [k] dev_gro_receive
> >>       4.80%  [kernel]          [k] skb_gro_receive
> >>       2.61%  [kernel]          [k] __build_skb
> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
> >>       0.95%  [kernel]          [k] tcp4_gro_receive
> >>       0.51%  [kernel]          [k] kmem_cache_alloc
> >>       0.42%  [kernel]          [k] __inet_lookup_established
> >>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
> >>
> >>
> >> So probably this will need further analysis, outside of the scope of
> >> this patch series.
> >>
> >> Could we now please Ack this v3 and merge it ?
> >>
> >> Thanks.  
> >
> > Thanks Eric.
> >
> > As the previous series caused hangs, we must run functional regression tests
> > over this series as well.
> > Run has already started, and results will be available tomorrow morning.
> >
> > In general, I really like this series. The re-factorization looks more
> > elegant and more correct, functionally.
> >
> > However, performance wise: we fear that the numbers will be drastically
> > lower with this transition to order-0 pages,
> > because of the (becoming critical) page allocator and dma operations
> > bottlenecks, especially on systems with costly
> > dma operations, such as ARM, iommu=on, etc...  
> 
> So to give you an idea I originally came up with the approach used in
> the Intel drivers when I was dealing with a PowerPC system where the
> IOMMU was a requirement.  With this setup correctly the map/unmap
> calls should be almost non-existent.  Basically the only time we
> should have to allocate or free a page is if something is sitting on
> the pages for an excessively long time or if the interrupt is bouncing
> between memory nodes which forces us to free the memory since it is
> sitting on the wrong node.
>
> > We already have this exact issue in mlx5, where we moved to order-0
> > allocations with a fixed size cache, but that was not enough.
> > Customers of mlx5 have already complained about the performance degradation,
> > and currently this is hurting our business.
> > We get a clear nack from our performance regression team regarding doing the
> > same in mlx4.  
> 
> What form of recycling were you doing?  If you were doing the offset
> based setup then that obviously only gets you so much.  The advantage
> to using the page count is that you get almost a mobius strip effect
> for your buffers and descriptor rings where you just keep flipping the
> page offset back and forth via an XOR and the only cost is
> dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
> do the full allocation, mapping, and unmapping.
> 
> > So, the question is, can we live with this degradation until those
> > bottleneck challenges are addressed?
> > Following our perf experts feedback, I cannot just simply Ack. We need to
> > have a clear plan to close the perf gap or reduce the impact.  
> 
> I think we need to define what is the degradation you are expecting to
> see.  Using the page count based approach should actually be better in
> some cases than the order 3 page and just walking frags approach since
> the theoretical reuse is infinite instead of fixed.
> 
> > Internally, I already implemented "dynamic page-cache" and "page-reuse"
> > mechanisms in the driver,
> > and together they totally bridge the performance gap.
> > That's why I would like to hear from Jesper what is the status of his
> > page_pool API, it is promising and could totally solve these issues.
> >
> > Regards,
> > Tariq  
> 
> The page pool may provide gains but we have to actually see it before
> we can guarantee it.  If worse comes to worse I might just resort to
> standardizing the logic used for the Intel driver page count based
> approach.  Then if nothing else we would have a standard path for all
> the drivers to use if we start going down this route.

With this Intel driver page count based recycle approach, the recycle
size is tied to the size of the RX ring.  As Eric and Tariq discovered.
And for other performance reasons (memory footprint of walking RX ring
data-structures), don't want to increase the RX ring sizes.  Thus, it
create two opposite performance needs.  That is why I think a more
explicit approach with a pool is more attractive.

How is this approach doing to work for XDP?
(XDP doesn't "share" the page, and in-general we don't want the extra
atomic.)

We absolutely need recycling with XDP, when transmitting out another
device, and the other devices DMA-TX completion need some way of
returning this page.
What is basically needed is a standardized callback to allow the remote
driver to return the page to the originating driver.  As we don't have
a NDP for XDP-forward/transmit yet, we could pass this callback as a
parameter along with the packet-page to send?

Eric Dumazet Feb. 14, 2017, 7:02 p.m. UTC | #24

On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>

>
> With this Intel driver page count based recycle approach, the recycle
> size is tied to the size of the RX ring.  As Eric and Tariq discovered.
> And for other performance reasons (memory footprint of walking RX ring
> data-structures), don't want to increase the RX ring sizes.  Thus, it
> create two opposite performance needs.  That is why I think a more
> explicit approach with a pool is more attractive.
>
> How is this approach doing to work for XDP?
> (XDP doesn't "share" the page, and in-general we don't want the extra
> atomic.)
>
> We absolutely need recycling with XDP, when transmitting out another
> device, and the other devices DMA-TX completion need some way of
> returning this page.
> What is basically needed is a standardized callback to allow the remote
> driver to return the page to the originating driver.  As we don't have
> a NDP for XDP-forward/transmit yet, we could pass this callback as a
> parameter along with the packet-page to send?
>
>


mlx4 already has a cache for XDP.
I believe I did not change this part, it still should work.

commit d576acf0a22890cf3f8f7a9b035f1558077f6770
Author: Brenden Blanco <bblanco@plumgrid.com>
Date:   Tue Jul 19 12:16:52 2016 -0700

    net/mlx4_en: add page recycle to prepare rx ring for tx support

I have not checked if recent Tom work added core infra for this cache.

Alexander H Duyck Feb. 14, 2017, 7:06 p.m. UTC | #25

On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Tue, 14 Feb 2017 09:29:54 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>> >
>> >
>> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:
>> >>
>> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
>> >> <brouer@redhat.com> wrote:
>> >>
>> >>> It is important to understand that there are two cases for the cost of
>> >>> an atomic op, which depend on the cache-coherency state of the
>> >>> cacheline.
>> >>>
>> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>> >>>
>> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
>> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>> >>>
>> >> Okay, it seems you guys really want a patch that I said was not giving
>> >> good results
>> >>
>> >> Let me publish the numbers I get , adding or not the last (and not
>> >> official) patch.
>> >>
>> >> If I _force_ the user space process to run on the other node,
>> >> then the results are not the ones Alex or you are expecting.
>> >>
>> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
>> >> and 3.5 Mpps without it.
>> >>
>> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
>> >> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
>> >> 0.00      0.00      4.50
>> >>
>> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>> >>
>> >>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
>> >>      31.07%  [kernel]      [k] skb_release_data
>> >>       4.24%  [kernel]      [k] skb_copy_datagram_iter
>> >>       1.35%  [kernel]      [k] copy_page_to_iter
>> >>       0.98%  [kernel]      [k] _raw_spin_lock
>> >>       0.90%  [kernel]      [k] skb_release_head_state
>> >>       0.60%  [kernel]      [k] tcp_transmit_skb
>> >>       0.51%  [kernel]      [k] mlx4_en_xmit
>> >>       0.33%  [kernel]      [k] ___cache_free
>> >>       0.28%  [kernel]      [k] tcp_rcv_established
>> >>
>> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>> >>
>> >>
>> >>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
>> >>      12.92%  [kernel]          [k] napi_gro_frags
>> >>       7.28%  [kernel]          [k] inet_gro_receive
>> >>       7.17%  [kernel]          [k] tcp_gro_receive
>> >>       5.10%  [kernel]          [k] dev_gro_receive
>> >>       4.87%  [kernel]          [k] skb_gro_receive
>> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>> >>       2.04%  [kernel]          [k] __build_skb
>> >>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
>> >>       1.01%  [kernel]          [k] tcp4_gro_receive
>> >>       0.65%  [kernel]          [k] kmem_cache_alloc
>> >>       0.45%  [kernel]          [k] _raw_spin_lock
>> >>
>> >> Without the latest  patch (the exact patch series v3 I submitted),
>> >> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
>> >> reads.
>> >>
>> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
>> >> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
>> >> 0.00      0.00      4.50
>> >>
>> >> Profiles of the two cpus :
>> >>
>> >>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
>> >>       6.42%  [kernel]      [k] skb_release_data
>> >>       5.65%  [kernel]      [k] skb_copy_datagram_iter
>> >>       1.83%  [kernel]      [k] copy_page_to_iter
>> >>       1.59%  [kernel]      [k] _raw_spin_lock
>> >>       1.48%  [kernel]      [k] skb_release_head_state
>> >>       0.72%  [kernel]      [k] tcp_transmit_skb
>> >>       0.68%  [kernel]      [k] mlx4_en_xmit
>> >>       0.43%  [kernel]      [k] page_frag_free
>> >>       0.38%  [kernel]      [k] ___cache_free
>> >>       0.37%  [kernel]      [k] tcp_established_options
>> >>       0.37%  [kernel]      [k] __ip_local_out
>> >>
>> >>
>> >>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
>> >>      26.47%  [kernel]          [k] napi_gro_frags
>> >>       7.02%  [kernel]          [k] inet_gro_receive
>> >>       5.89%  [kernel]          [k] tcp_gro_receive
>> >>       5.17%  [kernel]          [k] dev_gro_receive
>> >>       4.80%  [kernel]          [k] skb_gro_receive
>> >>       2.61%  [kernel]          [k] __build_skb
>> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>> >>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
>> >>       0.95%  [kernel]          [k] tcp4_gro_receive
>> >>       0.51%  [kernel]          [k] kmem_cache_alloc
>> >>       0.42%  [kernel]          [k] __inet_lookup_established
>> >>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
>> >>
>> >>
>> >> So probably this will need further analysis, outside of the scope of
>> >> this patch series.
>> >>
>> >> Could we now please Ack this v3 and merge it ?
>> >>
>> >> Thanks.
>> >
>> > Thanks Eric.
>> >
>> > As the previous series caused hangs, we must run functional regression tests
>> > over this series as well.
>> > Run has already started, and results will be available tomorrow morning.
>> >
>> > In general, I really like this series. The re-factorization looks more
>> > elegant and more correct, functionally.
>> >
>> > However, performance wise: we fear that the numbers will be drastically
>> > lower with this transition to order-0 pages,
>> > because of the (becoming critical) page allocator and dma operations
>> > bottlenecks, especially on systems with costly
>> > dma operations, such as ARM, iommu=on, etc...
>>
>> So to give you an idea I originally came up with the approach used in
>> the Intel drivers when I was dealing with a PowerPC system where the
>> IOMMU was a requirement.  With this setup correctly the map/unmap
>> calls should be almost non-existent.  Basically the only time we
>> should have to allocate or free a page is if something is sitting on
>> the pages for an excessively long time or if the interrupt is bouncing
>> between memory nodes which forces us to free the memory since it is
>> sitting on the wrong node.
>>
>> > We already have this exact issue in mlx5, where we moved to order-0
>> > allocations with a fixed size cache, but that was not enough.
>> > Customers of mlx5 have already complained about the performance degradation,
>> > and currently this is hurting our business.
>> > We get a clear nack from our performance regression team regarding doing the
>> > same in mlx4.
>>
>> What form of recycling were you doing?  If you were doing the offset
>> based setup then that obviously only gets you so much.  The advantage
>> to using the page count is that you get almost a mobius strip effect
>> for your buffers and descriptor rings where you just keep flipping the
>> page offset back and forth via an XOR and the only cost is
>> dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
>> do the full allocation, mapping, and unmapping.
>>
>> > So, the question is, can we live with this degradation until those
>> > bottleneck challenges are addressed?
>> > Following our perf experts feedback, I cannot just simply Ack. We need to
>> > have a clear plan to close the perf gap or reduce the impact.
>>
>> I think we need to define what is the degradation you are expecting to
>> see.  Using the page count based approach should actually be better in
>> some cases than the order 3 page and just walking frags approach since
>> the theoretical reuse is infinite instead of fixed.
>>
>> > Internally, I already implemented "dynamic page-cache" and "page-reuse"
>> > mechanisms in the driver,
>> > and together they totally bridge the performance gap.
>> > That's why I would like to hear from Jesper what is the status of his
>> > page_pool API, it is promising and could totally solve these issues.
>> >
>> > Regards,
>> > Tariq
>>
>> The page pool may provide gains but we have to actually see it before
>> we can guarantee it.  If worse comes to worse I might just resort to
>> standardizing the logic used for the Intel driver page count based
>> approach.  Then if nothing else we would have a standard path for all
>> the drivers to use if we start going down this route.
>
> With this Intel driver page count based recycle approach, the recycle
> size is tied to the size of the RX ring.  As Eric and Tariq discovered.
> And for other performance reasons (memory footprint of walking RX ring
> data-structures), don't want to increase the RX ring sizes.  Thus, it
> create two opposite performance needs.  That is why I think a more
> explicit approach with a pool is more attractive.
>
> How is this approach doing to work for XDP?
> (XDP doesn't "share" the page, and in-general we don't want the extra
> atomic.)

The extra atomic is moot since it is we can get rid of that by doing
bulk page count updates.

Why can't XDP share a page?  You have brought up the user space aspect
of things repeatedly but the AF_PACKET patches John did more or less
demonstrated that wasn't the case.  What you end up with is you have
to either be providing pure user space pages or pure kernel pages.
You can't have pages being swapped between the two without introducing
security issues.  So even with your pool approach it doesn't matter
which way things are run.  Your pool is either device to user space,
or device to kernel space and it can't be both without creating
sharing concerns.

> We absolutely need recycling with XDP, when transmitting out another
> device, and the other devices DMA-TX completion need some way of
> returning this page.

I fully agree here.  However the easiest way of handling this is via
the page count in my opinion.

> What is basically needed is a standardized callback to allow the remote
> driver to return the page to the originating driver.  As we don't have
> a NDP for XDP-forward/transmit yet, we could pass this callback as a
> parameter along with the packet-page to send?

No.  This assumes way too much.  Most packets aren't going to be
device to device routing.  We have seen this type of thing and
rejected it multiple times.  Don't think driver to driver.  This is
driver to network stack, socket, device, virtual machine, storage,
etc.  The fact is there are many spots where a frame might get
terminated.  This is why the bulk alloc/free of skbs never went
anywhere.  We never really addressed the fact that there are many more
spots where a frame is terminated.

This is why I said before what we need to do is have a page destructor
to handle this sort of thing.  The idea is you want to have this work
everywhere.  Having just drivers do this would make it a nice toy but
completely useless since not too many people are doing
routing/bridging between interfaces.  Using the page destructor it is
easy to create a pool of "free" pages that you can then pull your DMA
pages out of or that you can release back into the page allocator.

Jesper Dangaard Brouer Feb. 14, 2017, 7:38 p.m. UTC | #26

On Tue, 14 Feb 2017 12:04:26 -0500 (EST)
David Miller <davem@davemloft.net> wrote:

> From: Tariq Toukan <ttoukan.linux@gmail.com>
> Date: Tue, 14 Feb 2017 16:56:49 +0200
> 
> > Internally, I already implemented "dynamic page-cache" and
> > "page-reuse" mechanisms in the driver, and together they totally
> > bridge the performance gap.  

It sounds like you basically implemented a page_pool scheme...

> I worry about a dynamically growing page cache inside of drivers
> because it is invisible to the rest of the kernel.

Exactly, that is why I wanted a separate standardized thing, I call the
page_pool, which is part of the MM-tree and interacts with the page
allocator.  E.g. it must implement/support a way the page allocator can
reclaim pages from it (admit I didn't implement this in RFC patches).

> It responds only to local needs.

Generally true, but a side effect of recycling these pages, result in
less fragmentation of the page allocator/buddy system.

> The price of the real page allocator comes partly because it can
> respond to global needs.
> 
> If a driver consumes some unreasonable percentage of system memory, it
> is keeping that memory from being used from other parts of the system
> even if it would be better for networking to be slightly slower with
> less cache because that other thing that needs memory is more
> important.

(That is why I want to have OOM protection at device level, with the
recycle feedback from page pool we have this knowledge, and further I
wanted to allow blocking a specific RX queue[1])
[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html#userspace-delivery-and-oom

> I think this is one of the primary reasons that the MM guys severely
> chastise us when we build special purpose local caches into networking
> facilities.
> 
> And the more I think about it the more I think they are right.

+1

> One path I see around all of this is full integration.  Meaning that
> we can free pages into the page allocator which are still DMA mapped.
> And future allocations from that device are prioritized to take still
> DMA mapped objects.

I like this idea.  Are you saying that this should be done per DMA
engine or per device?

If this is per device, it is almost the page_pool idea.  

> Yes, we still need to make the page allocator faster, but this kind of
> work helps everyone not just 100GB ethernet NICs.

True.  And Mel already have some generic improvements to the page
allocator queued for the next merge.  And I have the responsibility to
get the bulking API into shape.

Jesper Dangaard Brouer Feb. 14, 2017, 7:50 p.m. UTC | #27

On Tue, 14 Feb 2017 11:06:25 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Tue, 14 Feb 2017 09:29:54 -0800
> > Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >  
> >> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:  
> >> >
> >> >
> >> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:  
> >> >>
> >> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> >> >> <brouer@redhat.com> wrote:
> >> >>  
> >> >>> It is important to understand that there are two cases for the cost of
> >> >>> an atomic op, which depend on the cache-coherency state of the
> >> >>> cacheline.
> >> >>>
> >> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
> >> >>>
> >> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> >> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
> >> >>>  
> >> >> Okay, it seems you guys really want a patch that I said was not giving
> >> >> good results
> >> >>
> >> >> Let me publish the numbers I get , adding or not the last (and not
> >> >> official) patch.
> >> >>
> >> >> If I _force_ the user space process to run on the other node,
> >> >> then the results are not the ones Alex or you are expecting.
> >> >>
> >> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> >> >> and 3.5 Mpps without it.
> >> >>
> >> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> >> >> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
> >> >> 0.00      0.00      4.50
> >> >>
> >> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
> >> >>
> >> >>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
> >> >>      31.07%  [kernel]      [k] skb_release_data
> >> >>       4.24%  [kernel]      [k] skb_copy_datagram_iter
> >> >>       1.35%  [kernel]      [k] copy_page_to_iter
> >> >>       0.98%  [kernel]      [k] _raw_spin_lock
> >> >>       0.90%  [kernel]      [k] skb_release_head_state
> >> >>       0.60%  [kernel]      [k] tcp_transmit_skb
> >> >>       0.51%  [kernel]      [k] mlx4_en_xmit
> >> >>       0.33%  [kernel]      [k] ___cache_free
> >> >>       0.28%  [kernel]      [k] tcp_rcv_established
> >> >>
> >> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
> >> >>
> >> >>
> >> >>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
> >> >>      12.92%  [kernel]          [k] napi_gro_frags
> >> >>       7.28%  [kernel]          [k] inet_gro_receive
> >> >>       7.17%  [kernel]          [k] tcp_gro_receive
> >> >>       5.10%  [kernel]          [k] dev_gro_receive
> >> >>       4.87%  [kernel]          [k] skb_gro_receive
> >> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >> >>       2.04%  [kernel]          [k] __build_skb
> >> >>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
> >> >>       1.01%  [kernel]          [k] tcp4_gro_receive
> >> >>       0.65%  [kernel]          [k] kmem_cache_alloc
> >> >>       0.45%  [kernel]          [k] _raw_spin_lock
> >> >>
> >> >> Without the latest  patch (the exact patch series v3 I submitted),
> >> >> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
> >> >> reads.
> >> >>
> >> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> >> >> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
> >> >> 0.00      0.00      4.50
> >> >>
> >> >> Profiles of the two cpus :
> >> >>
> >> >>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
> >> >>       6.42%  [kernel]      [k] skb_release_data
> >> >>       5.65%  [kernel]      [k] skb_copy_datagram_iter
> >> >>       1.83%  [kernel]      [k] copy_page_to_iter
> >> >>       1.59%  [kernel]      [k] _raw_spin_lock
> >> >>       1.48%  [kernel]      [k] skb_release_head_state
> >> >>       0.72%  [kernel]      [k] tcp_transmit_skb
> >> >>       0.68%  [kernel]      [k] mlx4_en_xmit
> >> >>       0.43%  [kernel]      [k] page_frag_free
> >> >>       0.38%  [kernel]      [k] ___cache_free
> >> >>       0.37%  [kernel]      [k] tcp_established_options
> >> >>       0.37%  [kernel]      [k] __ip_local_out
> >> >>
> >> >>
> >> >>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
> >> >>      26.47%  [kernel]          [k] napi_gro_frags
> >> >>       7.02%  [kernel]          [k] inet_gro_receive
> >> >>       5.89%  [kernel]          [k] tcp_gro_receive
> >> >>       5.17%  [kernel]          [k] dev_gro_receive
> >> >>       4.80%  [kernel]          [k] skb_gro_receive
> >> >>       2.61%  [kernel]          [k] __build_skb
> >> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >> >>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
> >> >>       0.95%  [kernel]          [k] tcp4_gro_receive
> >> >>       0.51%  [kernel]          [k] kmem_cache_alloc
> >> >>       0.42%  [kernel]          [k] __inet_lookup_established
> >> >>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
> >> >>
> >> >>
> >> >> So probably this will need further analysis, outside of the scope of
> >> >> this patch series.
> >> >>
> >> >> Could we now please Ack this v3 and merge it ?
> >> >>
> >> >> Thanks.  
> >> >
> >> > Thanks Eric.
> >> >
> >> > As the previous series caused hangs, we must run functional regression tests
> >> > over this series as well.
> >> > Run has already started, and results will be available tomorrow morning.
> >> >
> >> > In general, I really like this series. The re-factorization looks more
> >> > elegant and more correct, functionally.
> >> >
> >> > However, performance wise: we fear that the numbers will be drastically
> >> > lower with this transition to order-0 pages,
> >> > because of the (becoming critical) page allocator and dma operations
> >> > bottlenecks, especially on systems with costly
> >> > dma operations, such as ARM, iommu=on, etc...  
> >>
> >> So to give you an idea I originally came up with the approach used in
> >> the Intel drivers when I was dealing with a PowerPC system where the
> >> IOMMU was a requirement.  With this setup correctly the map/unmap
> >> calls should be almost non-existent.  Basically the only time we
> >> should have to allocate or free a page is if something is sitting on
> >> the pages for an excessively long time or if the interrupt is bouncing
> >> between memory nodes which forces us to free the memory since it is
> >> sitting on the wrong node.
> >>  
> >> > We already have this exact issue in mlx5, where we moved to order-0
> >> > allocations with a fixed size cache, but that was not enough.
> >> > Customers of mlx5 have already complained about the performance degradation,
> >> > and currently this is hurting our business.
> >> > We get a clear nack from our performance regression team regarding doing the
> >> > same in mlx4.  
> >>
> >> What form of recycling were you doing?  If you were doing the offset
> >> based setup then that obviously only gets you so much.  The advantage
> >> to using the page count is that you get almost a mobius strip effect
> >> for your buffers and descriptor rings where you just keep flipping the
> >> page offset back and forth via an XOR and the only cost is
> >> dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
> >> do the full allocation, mapping, and unmapping.
> >>  
> >> > So, the question is, can we live with this degradation until those
> >> > bottleneck challenges are addressed?
> >> > Following our perf experts feedback, I cannot just simply Ack. We need to
> >> > have a clear plan to close the perf gap or reduce the impact.  
> >>
> >> I think we need to define what is the degradation you are expecting to
> >> see.  Using the page count based approach should actually be better in
> >> some cases than the order 3 page and just walking frags approach since
> >> the theoretical reuse is infinite instead of fixed.
> >>  
> >> > Internally, I already implemented "dynamic page-cache" and "page-reuse"
> >> > mechanisms in the driver,
> >> > and together they totally bridge the performance gap.
> >> > That's why I would like to hear from Jesper what is the status of his
> >> > page_pool API, it is promising and could totally solve these issues.
> >> >
> >> > Regards,
> >> > Tariq  
> >>
> >> The page pool may provide gains but we have to actually see it before
> >> we can guarantee it.  If worse comes to worse I might just resort to
> >> standardizing the logic used for the Intel driver page count based
> >> approach.  Then if nothing else we would have a standard path for all
> >> the drivers to use if we start going down this route.  
> >
> > With this Intel driver page count based recycle approach, the recycle
> > size is tied to the size of the RX ring.  As Eric and Tariq discovered.
> > And for other performance reasons (memory footprint of walking RX ring
> > data-structures), don't want to increase the RX ring sizes.  Thus, it
> > create two opposite performance needs.  That is why I think a more
> > explicit approach with a pool is more attractive.
> >
> > How is this approach doing to work for XDP?
> > (XDP doesn't "share" the page, and in-general we don't want the extra
> > atomic.)  
> 
> The extra atomic is moot since it is we can get rid of that by doing
> bulk page count updates.
> 
> Why can't XDP share a page?  You have brought up the user space aspect
> of things repeatedly but the AF_PACKET patches John did more or less
> demonstrated that wasn't the case.  What you end up with is you have
> to either be providing pure user space pages or pure kernel pages.
> You can't have pages being swapped between the two without introducing
> security issues.  So even with your pool approach it doesn't matter
> which way things are run.  Your pool is either device to user space,
> or device to kernel space and it can't be both without creating
> sharing concerns.

(I think we are talking past each other here.)

 
> > We absolutely need recycling with XDP, when transmitting out another
> > device, and the other devices DMA-TX completion need some way of
> > returning this page.  
> 
> I fully agree here.  However the easiest way of handling this is via
> the page count in my opinion.
> 
> > What is basically needed is a standardized callback to allow the remote
> > driver to return the page to the originating driver.  As we don't have
> > a NDP for XDP-forward/transmit yet, we could pass this callback as a
> > parameter along with the packet-page to send?  
> 
> No.  This assumes way too much.  Most packets aren't going to be
> device to device routing.  We have seen this type of thing and
> rejected it multiple times.  Don't think driver to driver.  This is
> driver to network stack, socket, device, virtual machine, storage,
> etc.  The fact is there are many spots where a frame might get
> terminated. [...]

I fully agree, that XDP need to think further than driver to driver.

 
> This is why I said before what we need to do is have a page destructor
> to handle this sort of thing.  The idea is you want to have this work
> everywhere.  Having just drivers do this would make it a nice toy but
> completely useless since not too many people are doing
> routing/bridging between interfaces.  Using the page destructor it is
> easy to create a pool of "free" pages that you can then pull your DMA
> pages out of or that you can release back into the page allocator.

How is this page destructor different from my page_pool RFC
implementation?  (It basically functions as a page destructor...)

Help me understand what I'm missing?

Pointers to page_pool discussions
* page_pool RFC patchset:
 - http://lkml.kernel.org/r/20161220132444.18788.50875.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132812.18788.20431.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132817.18788.64726.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132822.18788.19768.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132827.18788.8658.stgit@firesoul

David Miller Feb. 14, 2017, 7:59 p.m. UTC | #28

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Tue, 14 Feb 2017 20:38:22 +0100

> On Tue, 14 Feb 2017 12:04:26 -0500 (EST)
> David Miller <davem@davemloft.net> wrote:
> 
>> One path I see around all of this is full integration.  Meaning that
>> we can free pages into the page allocator which are still DMA mapped.
>> And future allocations from that device are prioritized to take still
>> DMA mapped objects.
> 
> I like this idea.  Are you saying that this should be done per DMA
> engine or per device?
> 
> If this is per device, it is almost the page_pool idea.  

Per-device is simplest, at least at first.

Maybe later down the road we can try to pool by "mapping entity" be
that a parent IOMMU or something else like a hypervisor managed
page table.

Jesper Dangaard Brouer Feb. 14, 2017, 8:02 p.m. UTC | #29

On Tue, 14 Feb 2017 11:02:01 -0800
Eric Dumazet <edumazet@google.com> wrote:

> On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >  
> 
> >
> > With this Intel driver page count based recycle approach, the recycle
> > size is tied to the size of the RX ring.  As Eric and Tariq discovered.
> > And for other performance reasons (memory footprint of walking RX ring
> > data-structures), don't want to increase the RX ring sizes.  Thus, it
> > create two opposite performance needs.  That is why I think a more
> > explicit approach with a pool is more attractive.
> >
> > How is this approach doing to work for XDP?
> > (XDP doesn't "share" the page, and in-general we don't want the extra
> > atomic.)
> >
> > We absolutely need recycling with XDP, when transmitting out another
> > device, and the other devices DMA-TX completion need some way of
> > returning this page.
> > What is basically needed is a standardized callback to allow the remote
> > driver to return the page to the originating driver.  As we don't have
> > a NDP for XDP-forward/transmit yet, we could pass this callback as a
> > parameter along with the packet-page to send?
> >
> >  
> 
> 
> mlx4 already has a cache for XDP.
> I believe I did not change this part, it still should work.
> 
> commit d576acf0a22890cf3f8f7a9b035f1558077f6770
> Author: Brenden Blanco <bblanco@plumgrid.com>
> Date:   Tue Jul 19 12:16:52 2016 -0700
> 
>     net/mlx4_en: add page recycle to prepare rx ring for tx support

This obviously does not work for the case I'm talking about
(transmitting out another device with XDP).

Eric Dumazet Feb. 14, 2017, 9:56 p.m. UTC | #30

>
> This obviously does not work for the case I'm talking about
> (transmitting out another device with XDP).
>

XDP_TX does not handle this yet.

When XDP_TX was added, it was very clear that the transmit _had_ to be
done on the same port.

Since all this discussion happened in this thread ( mlx4: use order-0
pages for RX )
I was kind of assuming all the comments were relevant to current code or patch,
not future devs ;)

Tariq Toukan Feb. 15, 2017, 4:42 p.m. UTC | #31

On 14/02/2017 7:29 PM, Tom Herbert wrote:
> On Tue, Feb 14, 2017 at 7:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:
>>
>>> As the previous series caused hangs, we must run functional regression
>>> tests over this series as well.
>>> Run has already started, and results will be available tomorrow morning.
>>>
>>> In general, I really like this series. The re-factorization looks more
>>> elegant and more correct, functionally.
>>>
>>> However, performance wise: we fear that the numbers will be drastically
>>> lower with this transition to order-0 pages,
>>> because of the (becoming critical) page allocator and dma operations
>>> bottlenecks, especially on systems with costly
>>> dma operations, such as ARM, iommu=on, etc...
>>>
>> So, again, performance after this patch series his higher,
>> once you have sensible RX queues parameters, for the expected workload.
>>
>> Only in pathological cases, you might have some regression.
>>
>> The old schem was _maybe_ better _when_ memory is not fragmented.
>>
>> When you run hosts for months, memory _is_ fragmented.
>>
>> You never see that on benchmarks, unless you force memory being
>> fragmented.
>>
>>
>>
>>> We already have this exact issue in mlx5, where we moved to order-0
>>> allocations with a fixed size cache, but that was not enough.
>>> Customers of mlx5 have already complained about the performance
>>> degradation, and currently this is hurting our business.
>>> We get a clear nack from our performance regression team regarding doing
>>> the same in mlx4.
>>> So, the question is, can we live with this degradation until those
>>> bottleneck challenges are addressed?
>> Again, there is no degradation.
>>
>> We have been using order-0 pages for years at Google.
>>
>> Only when we made the mistake to rebase from the upstream driver and
>> order-3 pages we got horrible regressions, causing production outages.
>>
>> I was silly to believe that mm layer got better.
>>
>>> Following our perf experts feedback, I cannot just simply Ack. We need
>>> to have a clear plan to close the perf gap or reduce the impact.
>> Your perf experts need to talk to me, or any experts at Google and
>> Facebook, really.
>>
> I agree with this 100%! To be blunt, power users like this are testing
> your drivers far beyond what Mellanox is doing and understand how
> performance gains in benchmarks translate to possible gains in real
> production way more than your perf experts can. Listen to Eric!
>
> Tom
>
>
>> Anything _relying_ on order-3 pages being available to impress
>> friends/customers is a lie.

Isn't it the same principle in page_frag_alloc() ?
It is called form __netdev_alloc_skb()/__napi_alloc_skb().

Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is 
a frag of a huge page,
and it is not going to be freed before the other non-linear frags.
Cannot this cause the same threats (memory pinning and so...)?

Currently, mlx4 doesn't use this generic API, while most other drivers do.

Similar claims are true for TX:
https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81

>>

Eric Dumazet Feb. 15, 2017, 4:57 p.m. UTC | #32

On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>

>
> Isn't it the same principle in page_frag_alloc() ?
> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>
> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?

This is not ok.

This is a very well known problem, we already mentioned that here in the past,
but at least core networking stack uses  order-0 pages on PowerPC.

mlx4 driver suffers from this problem 100% more than other drivers ;)

One problem at a time Tariq. Right now, only mlx4 has this big problem
compared to other NIC.

Then, if we _still_ hit major issues, we might also need to force
napi_get_frags()
to allocate skb->head using kmalloc() instead of a page frag.

That is a very simple fix.

Remember that we have skb->truesize that is an approximation, it will
never be completely accurate,
but we need to make it better.

mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
is obviously wrong when host is under memory pressure
(2 frags per page -> truesize should be 2048)

> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is a
> frag of a huge page,
> and it is not going to be freed before the other non-linear frags.
> Cannot this cause the same threats (memory pinning and so...)?
>
> Currently, mlx4 doesn't use this generic API, while most other drivers do.
>
> Similar claims are true for TX:
> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81

We do not have such problem on TX. GFP_KERNEL allocations do not have
the same issues.

Tasks are usually not malicious in our DC, and most serious
applications use memcg or such memory control.

Tariq Toukan Feb. 16, 2017, 1:08 p.m. UTC | #33

On 15/02/2017 6:57 PM, Eric Dumazet wrote:
> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>> Isn't it the same principle in page_frag_alloc() ?
>> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>>
>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
> This is not ok.
>
> This is a very well known problem, we already mentioned that here in the past,
> but at least core networking stack uses  order-0 pages on PowerPC.
You're right, we should have done this as well in mlx4 on PPC.
> mlx4 driver suffers from this problem 100% more than other drivers ;)
>
> One problem at a time Tariq. Right now, only mlx4 has this big problem
> compared to other NIC.
We _do_ agree that the series improves the driver's quality, stability,
and performance in a fragmented system.

But due to the late rc we're in, and the fact that we know what benchmarks
our customers are going to run, we cannot Ack the series and get it
as is inside kernel 4.11.

We are interested to get your series merged along another perf improvement
we are preparing for next rc1. This way we will earn the desired stability
without breaking existing benchmarks.
I think this is the right thing to do at this point of time.


The idea behind the perf improvement, suggested by Jesper, is to split
the napi_poll call mlx4_en_process_rx_cq() loop into two.
The first loop extracts completed CQEs and starts prefetching on data
and RX descriptors. The second loop process the real packets.

>
> Then, if we _still_ hit major issues, we might also need to force
> napi_get_frags()
> to allocate skb->head using kmalloc() instead of a page frag.
>
> That is a very simple fix.
>
> Remember that we have skb->truesize that is an approximation, it will
> never be completely accurate,
> but we need to make it better.
>
> mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
> is obviously wrong when host is under memory pressure
> (2 frags per page -> truesize should be 2048)
>
>
>> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is a
>> frag of a huge page,
>> and it is not going to be freed before the other non-linear frags.
>> Cannot this cause the same threats (memory pinning and so...)?
>>
>> Currently, mlx4 doesn't use this generic API, while most other drivers do.
>>
>> Similar claims are true for TX:
>> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81
> We do not have such problem on TX. GFP_KERNEL allocations do not have
> the same issues.
>
> Tasks are usually not malicious in our DC, and most serious
> applications use memcg or such memory control.

Eric Dumazet Feb. 16, 2017, 3:47 p.m. UTC | #34

On Thu, 2017-02-16 at 15:08 +0200, Tariq Toukan wrote:
> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
> > On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
> >> Isn't it the same principle in page_frag_alloc() ?
> >> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
> >>
> >> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
> > This is not ok.
> >
> > This is a very well known problem, we already mentioned that here in the past,
> > but at least core networking stack uses  order-0 pages on PowerPC.
> You're right, we should have done this as well in mlx4 on PPC.
> > mlx4 driver suffers from this problem 100% more than other drivers ;)
> >
> > One problem at a time Tariq. Right now, only mlx4 has this big problem
> > compared to other NIC.
> We _do_ agree that the series improves the driver's quality, stability,
> and performance in a fragmented system.
> 
> But due to the late rc we're in, and the fact that we know what benchmarks
> our customers are going to run, we cannot Ack the series and get it
> as is inside kernel 4.11.
> 
> We are interested to get your series merged along another perf improvement
> we are preparing for next rc1. This way we will earn the desired stability
> without breaking existing benchmarks.
> I think this is the right thing to do at this point of time.
> 
> 
> The idea behind the perf improvement, suggested by Jesper, is to split
> the napi_poll call mlx4_en_process_rx_cq() loop into two.
> The first loop extracts completed CQEs and starts prefetching on data
> and RX descriptors. The second loop process the real packets.

Make sure to resubmit my patches before anything new.

We need to backport them to stable versions, without XDP, without
anything fancy.

And submit what is needed for 4.11, since current mlx4 driver in
net-next is broken, in case you missed it.

Thanks.

Tom Herbert Feb. 16, 2017, 5:05 p.m. UTC | #35

On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>
> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>
>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>
>>> Isn't it the same principle in page_frag_alloc() ?
>>> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>>>
>>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>
>> This is not ok.
>>
>> This is a very well known problem, we already mentioned that here in the
>> past,
>> but at least core networking stack uses  order-0 pages on PowerPC.
>
> You're right, we should have done this as well in mlx4 on PPC.
>>
>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>
>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>> compared to other NIC.
>
> We _do_ agree that the series improves the driver's quality, stability,
> and performance in a fragmented system.
>
> But due to the late rc we're in, and the fact that we know what benchmarks
> our customers are going to run, we cannot Ack the series and get it
> as is inside kernel 4.11.
>
You're admitting that Eric's patches improve driver quality,
stability, and performance but you're not allowing this in the kernel
because "we know what benchmarks our customers are going to run".
Sorry, but that is a weak explanation.

> We are interested to get your series merged along another perf improvement
> we are preparing for next rc1. This way we will earn the desired stability
> without breaking existing benchmarks.
> I think this is the right thing to do at this point of time.
>
>
> The idea behind the perf improvement, suggested by Jesper, is to split
> the napi_poll call mlx4_en_process_rx_cq() loop into two.
> The first loop extracts completed CQEs and starts prefetching on data
> and RX descriptors. The second loop process the real packets.
>
>
>>
>> Then, if we _still_ hit major issues, we might also need to force
>> napi_get_frags()
>> to allocate skb->head using kmalloc() instead of a page frag.
>>
>> That is a very simple fix.
>>
>> Remember that we have skb->truesize that is an approximation, it will
>> never be completely accurate,
>> but we need to make it better.
>>
>> mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
>> is obviously wrong when host is under memory pressure
>> (2 frags per page -> truesize should be 2048)
>>
>>
>>> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is
>>> a
>>> frag of a huge page,
>>> and it is not going to be freed before the other non-linear frags.
>>> Cannot this cause the same threats (memory pinning and so...)?
>>>
>>> Currently, mlx4 doesn't use this generic API, while most other drivers
>>> do.
>>>
>>> Similar claims are true for TX:
>>>
>>> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81
>>
>> We do not have such problem on TX. GFP_KERNEL allocations do not have
>> the same issues.
>>
>> Tasks are usually not malicious in our DC, and most serious
>> applications use memcg or such memory control.
>
>

David Miller Feb. 16, 2017, 7:03 p.m. UTC | #36

From: Tom Herbert <tom@herbertland.com>
Date: Thu, 16 Feb 2017 09:05:26 -0800

> On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>
>> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>>
>>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>>
>>>> Isn't it the same principle in page_frag_alloc() ?
>>>> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>>>>
>>>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>>
>>> This is not ok.
>>>
>>> This is a very well known problem, we already mentioned that here in the
>>> past,
>>> but at least core networking stack uses  order-0 pages on PowerPC.
>>
>> You're right, we should have done this as well in mlx4 on PPC.
>>>
>>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>>
>>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>>> compared to other NIC.
>>
>> We _do_ agree that the series improves the driver's quality, stability,
>> and performance in a fragmented system.
>>
>> But due to the late rc we're in, and the fact that we know what benchmarks
>> our customers are going to run, we cannot Ack the series and get it
>> as is inside kernel 4.11.
>>
> You're admitting that Eric's patches improve driver quality,
> stability, and performance but you're not allowing this in the kernel
> because "we know what benchmarks our customers are going to run".
> Sorry, but that is a weak explanation.

I have to agree with Tom and Eric.

If your customers have gotten into the habit of using metrics which
actually do not represent real life performance, that is a completely
inappropriate reason to not include Eric's changes as-is.

Saeed Mahameed Feb. 16, 2017, 9:06 p.m. UTC | #37

On Thu, Feb 16, 2017 at 9:03 PM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Thu, 16 Feb 2017 09:05:26 -0800
>
>> On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>
>>> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>>>
>>>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>>>
>>>>> Isn't it the same principle in page_frag_alloc() ?
>>>>> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>>>>>
>>>>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>>>
>>>> This is not ok.
>>>>
>>>> This is a very well known problem, we already mentioned that here in the
>>>> past,
>>>> but at least core networking stack uses  order-0 pages on PowerPC.
>>>
>>> You're right, we should have done this as well in mlx4 on PPC.
>>>>
>>>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>>>
>>>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>>>> compared to other NIC.
>>>
>>> We _do_ agree that the series improves the driver's quality, stability,
>>> and performance in a fragmented system.
>>>
>>> But due to the late rc we're in, and the fact that we know what benchmarks
>>> our customers are going to run, we cannot Ack the series and get it
>>> as is inside kernel 4.11.
>>>
>> You're admitting that Eric's patches improve driver quality,
>> stability, and performance but you're not allowing this in the kernel
>> because "we know what benchmarks our customers are going to run".
>> Sorry, but that is a weak explanation.
>
> I have to agree with Tom and Eric.
>
> If your customers have gotten into the habit of using metrics which
> actually do not represent real life performance, that is a completely
> inappropriate reason to not include Eric's changes as-is.

Guys, It is not about customers, benchmark and great performance numbers.
We agree with you that those patches are good and needed for the long term,
but as we are already at rc8 and although this change is correct, i
think it is a little bit too late
to have such huge change in the core RX engine of the driver. ( the
code is already like this for
more kernel releases than one could count, it will hurt no one to keep
it like this for two weeks more).

All we ask is to have a little bit more time - one or two weeks- to
test them and evaluate the impact.
As Eric stated we don't care if they make it to 4.11 or 4.12, the idea
is to have them in ASAP.
so why not wait to 4.11 (just two more weeks) and Tariq already said
that he will accept it as is.
and by then we will be smarter and have a clear plan of the gaps and
how to close them.

For PowerPC page order issue, Eric already have a simpler suggestion
that i support, and can easily be
sent to net and -stable.

Eric Dumazet Feb. 22, 2017, 4:22 p.m. UTC | #38

On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> Use of order-3 pages is problematic in some cases.
> 
> This patch might add three kinds of regression :
> 
> 1) a CPU performance regression, but we will add later page
> recycling and performance should be back.
> 
> 2) TCP receiver could grow its receive window slightly slower,
>    because skb->len/skb->truesize ratio will decrease.
>    This is mostly ok, we prefer being conservative to not risk OOM,
>    and eventually tune TCP better in the future.
>    This is consistent with other drivers using 2048 per ethernet frame.
> 
> 3) Because we allocate one page per RX slot, we consume more
>    memory for the ring buffers. XDP already had this constraint anyway.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

Note that we also could use a different strategy.

Assume RX rings of 4096 entries/slots.

With this patch, mlx4 gets the strategy used by Alexander in Intel
drivers : 

Each RX slot has an allocated page, and uses half of it, flipping to the
other half every time the slot is used.

So a ring buffer of 4096 slots allocates 4096 pages.

When we receive a packet train for the same flow, GRO builds an skb with
~45 page frags, all from different pages.

The put_page() done from skb_release_data() touches ~45 different struct
page cache lines, and show a high cost. (compared to the order-3 used
today by mlx4, this adds extra cache line misses and stalls for the
consumer)

If we instead try to use the two halves of one page on consecutive RX
slots, we might instead cook skb with the same number of MSS (45), but
half the number of cache lines for put_page(), so we should speed up the
consumer.

This means the number of active pages would be minimal, especially on
PowerPC. Pages that have been used by X=2 received frags would be put in
a quarantine (size to be determined).
On PowerPC, X would be PAGE_SIZE/frag_size

This strategy would consume less memory on PowerPC : 
65535/1536 = 42, so a 4096 RX ring would need 98 active pages instead of
4096.

The quarantine would be sized to increase chances of reusing an old
page, without consuming too much memory.

Probably roundup_pow_of_two(rx_ring_size / (PAGE_SIZE/frag_size))

x86 would still use 4096 pages, but PowerPC would use 98+128 pages
instead of 4096) (14 MBytes instead of 256 MBytes)

Alexander H Duyck Feb. 22, 2017, 5:23 p.m. UTC | #39

On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
>> Use of order-3 pages is problematic in some cases.
>>
>> This patch might add three kinds of regression :
>>
>> 1) a CPU performance regression, but we will add later page
>> recycling and performance should be back.
>>
>> 2) TCP receiver could grow its receive window slightly slower,
>>    because skb->len/skb->truesize ratio will decrease.
>>    This is mostly ok, we prefer being conservative to not risk OOM,
>>    and eventually tune TCP better in the future.
>>    This is consistent with other drivers using 2048 per ethernet frame.
>>
>> 3) Because we allocate one page per RX slot, we consume more
>>    memory for the ring buffers. XDP already had this constraint anyway.
>>
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> ---
>
> Note that we also could use a different strategy.
>
> Assume RX rings of 4096 entries/slots.
>
> With this patch, mlx4 gets the strategy used by Alexander in Intel
> drivers :
>
> Each RX slot has an allocated page, and uses half of it, flipping to the
> other half every time the slot is used.
>
> So a ring buffer of 4096 slots allocates 4096 pages.
>
> When we receive a packet train for the same flow, GRO builds an skb with
> ~45 page frags, all from different pages.
>
> The put_page() done from skb_release_data() touches ~45 different struct
> page cache lines, and show a high cost. (compared to the order-3 used
> today by mlx4, this adds extra cache line misses and stalls for the
> consumer)
>
> If we instead try to use the two halves of one page on consecutive RX
> slots, we might instead cook skb with the same number of MSS (45), but
> half the number of cache lines for put_page(), so we should speed up the
> consumer.

So there is a problem that is being overlooked here.  That is the cost
of the DMA map/unmap calls.  The problem is many PowerPC systems have
an IOMMU that you have to work around, and that IOMMU comes at a heavy
cost for every map/unmap call.  So unless you are saying you wan to
setup a hybrid between the mlx5 and this approach where we have a page
cache that these all fall back into you will take a heavy cost for
having to map and unmap pages.

The whole reason why I implemented the Intel page reuse approach the
way I did is to try and mitigate the IOMMU issue, it wasn't so much to
resolve allocator/freeing expense.  Basically the allocator scales,
the IOMMU does not.  So any solution would require making certain that
we can leave the pages pinned in the DMA to avoid having to take the
global locks involved in accessing the IOMMU.

> This means the number of active pages would be minimal, especially on
> PowerPC. Pages that have been used by X=2 received frags would be put in
> a quarantine (size to be determined).
> On PowerPC, X would be PAGE_SIZE/frag_size
>
>
> This strategy would consume less memory on PowerPC :
> 65535/1536 = 42, so a 4096 RX ring would need 98 active pages instead of
> 4096.
>
> The quarantine would be sized to increase chances of reusing an old
> page, without consuming too much memory.
>
> Probably roundup_pow_of_two(rx_ring_size / (PAGE_SIZE/frag_size))
>
> x86 would still use 4096 pages, but PowerPC would use 98+128 pages
> instead of 4096) (14 MBytes instead of 256 MBytes)

So any solution will need to work with an IOMMU enabled on the
platform.  I assume you have some x86 test systems you could run with
an IOMMU enabled.  My advice would be to try running in that
environment and see where the overhead lies.

- Alex

David Laight Feb. 22, 2017, 5:58 p.m. UTC | #40

From: Alexander Duyck

> Sent: 22 February 2017 17:24

...
> So there is a problem that is being overlooked here.  That is the cost

> of the DMA map/unmap calls.  The problem is many PowerPC systems have

> an IOMMU that you have to work around, and that IOMMU comes at a heavy

> cost for every map/unmap call.  So unless you are saying you wan to

> setup a hybrid between the mlx5 and this approach where we have a page

> cache that these all fall back into you will take a heavy cost for

> having to map and unmap pages.

..

I can't help feeling that you need to look at how to get the iommu
code to reuse pages, rather the ethernet driver.
Maybe something like:

1) The driver requests a mapped receive buffer from the iommu.
This might give it memory that is already mapped but not in use.

2) When the receive completes the driver tells the iommu the mapping
is no longer needed. The iommu is not (yet) changed.

3) When the skb is freed the iommu is told that the buffer can be freed.

4) At (1), if the driver is using too much iommu resource then the mapping
for completed receives can be removed to free up iommu space.

Probably not as simple as it looks :-)

	David

Eric Dumazet Feb. 22, 2017, 6:21 p.m. UTC | #41

On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> >> Use of order-3 pages is problematic in some cases.
> >>
> >> This patch might add three kinds of regression :
> >>
> >> 1) a CPU performance regression, but we will add later page
> >> recycling and performance should be back.
> >>
> >> 2) TCP receiver could grow its receive window slightly slower,
> >>    because skb->len/skb->truesize ratio will decrease.
> >>    This is mostly ok, we prefer being conservative to not risk OOM,
> >>    and eventually tune TCP better in the future.
> >>    This is consistent with other drivers using 2048 per ethernet frame.
> >>
> >> 3) Because we allocate one page per RX slot, we consume more
> >>    memory for the ring buffers. XDP already had this constraint anyway.
> >>
> >> Signed-off-by: Eric Dumazet <edumazet@google.com>
> >> ---
> >
> > Note that we also could use a different strategy.
> >
> > Assume RX rings of 4096 entries/slots.
> >
> > With this patch, mlx4 gets the strategy used by Alexander in Intel
> > drivers :
> >
> > Each RX slot has an allocated page, and uses half of it, flipping to the
> > other half every time the slot is used.
> >
> > So a ring buffer of 4096 slots allocates 4096 pages.
> >
> > When we receive a packet train for the same flow, GRO builds an skb with
> > ~45 page frags, all from different pages.
> >
> > The put_page() done from skb_release_data() touches ~45 different struct
> > page cache lines, and show a high cost. (compared to the order-3 used
> > today by mlx4, this adds extra cache line misses and stalls for the
> > consumer)
> >
> > If we instead try to use the two halves of one page on consecutive RX
> > slots, we might instead cook skb with the same number of MSS (45), but
> > half the number of cache lines for put_page(), so we should speed up the
> > consumer.
> 
> So there is a problem that is being overlooked here.  That is the cost
> of the DMA map/unmap calls.  The problem is many PowerPC systems have
> an IOMMU that you have to work around, and that IOMMU comes at a heavy
> cost for every map/unmap call.  So unless you are saying you wan to
> setup a hybrid between the mlx5 and this approach where we have a page
> cache that these all fall back into you will take a heavy cost for
> having to map and unmap pages.
> 
> The whole reason why I implemented the Intel page reuse approach the
> way I did is to try and mitigate the IOMMU issue, it wasn't so much to
> resolve allocator/freeing expense.  Basically the allocator scales,
> the IOMMU does not.  So any solution would require making certain that
> we can leave the pages pinned in the DMA to avoid having to take the
> global locks involved in accessing the IOMMU.


I do not see any difference for the fact that we keep pages mapped the
same way.

mlx4_en_complete_rx_desc() will still use the :

dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
                              frag_size, priv->dma_dir);

for every single MSS we receive.

This wont change.

Alexander H Duyck Feb. 23, 2017, 1:08 a.m. UTC | #42

On Wed, Feb 22, 2017 at 10:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
>> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
>> >> Use of order-3 pages is problematic in some cases.
>> >>
>> >> This patch might add three kinds of regression :
>> >>
>> >> 1) a CPU performance regression, but we will add later page
>> >> recycling and performance should be back.
>> >>
>> >> 2) TCP receiver could grow its receive window slightly slower,
>> >>    because skb->len/skb->truesize ratio will decrease.
>> >>    This is mostly ok, we prefer being conservative to not risk OOM,
>> >>    and eventually tune TCP better in the future.
>> >>    This is consistent with other drivers using 2048 per ethernet frame.
>> >>
>> >> 3) Because we allocate one page per RX slot, we consume more
>> >>    memory for the ring buffers. XDP already had this constraint anyway.
>> >>
>> >> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> >> ---
>> >
>> > Note that we also could use a different strategy.
>> >
>> > Assume RX rings of 4096 entries/slots.
>> >
>> > With this patch, mlx4 gets the strategy used by Alexander in Intel
>> > drivers :
>> >
>> > Each RX slot has an allocated page, and uses half of it, flipping to the
>> > other half every time the slot is used.
>> >
>> > So a ring buffer of 4096 slots allocates 4096 pages.
>> >
>> > When we receive a packet train for the same flow, GRO builds an skb with
>> > ~45 page frags, all from different pages.
>> >
>> > The put_page() done from skb_release_data() touches ~45 different struct
>> > page cache lines, and show a high cost. (compared to the order-3 used
>> > today by mlx4, this adds extra cache line misses and stalls for the
>> > consumer)
>> >
>> > If we instead try to use the two halves of one page on consecutive RX
>> > slots, we might instead cook skb with the same number of MSS (45), but
>> > half the number of cache lines for put_page(), so we should speed up the
>> > consumer.
>>
>> So there is a problem that is being overlooked here.  That is the cost
>> of the DMA map/unmap calls.  The problem is many PowerPC systems have
>> an IOMMU that you have to work around, and that IOMMU comes at a heavy
>> cost for every map/unmap call.  So unless you are saying you wan to
>> setup a hybrid between the mlx5 and this approach where we have a page
>> cache that these all fall back into you will take a heavy cost for
>> having to map and unmap pages.
>>
>> The whole reason why I implemented the Intel page reuse approach the
>> way I did is to try and mitigate the IOMMU issue, it wasn't so much to
>> resolve allocator/freeing expense.  Basically the allocator scales,
>> the IOMMU does not.  So any solution would require making certain that
>> we can leave the pages pinned in the DMA to avoid having to take the
>> global locks involved in accessing the IOMMU.
>
>
> I do not see any difference for the fact that we keep pages mapped the
> same way.
>
> mlx4_en_complete_rx_desc() will still use the :
>
> dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
>                               frag_size, priv->dma_dir);
>
> for every single MSS we receive.
>
> This wont change.

Right but you were talking about using both halves one after the
other.  If that occurs you have nothing left that you can reuse.  That
was what I was getting at.  If you use up both halves you end up
having to unmap the page.

The whole idea behind using only half the page per descriptor is to
allow us to loop through the ring before we end up reusing it again.
That buys us enough time that usually the stack has consumed the frame
before we need it again.

- Alex

Eric Dumazet Feb. 23, 2017, 2:06 a.m. UTC | #43

On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:

> 
> Right but you were talking about using both halves one after the
> other.  If that occurs you have nothing left that you can reuse.  That
> was what I was getting at.  If you use up both halves you end up
> having to unmap the page.
> 

You must have misunderstood me.

Once we use both halves of a page, we _keep_ the page, we do not unmap
it.

We save the page pointer in a ring buffer of pages.
Call it the 'quarantine'

When we _need_ to replenish the RX desc, we take a look at the oldest
entry in the quarantine ring.

If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
this saved page.

If not, _then_ we unmap and release the page.

Note that we would have received 4096 frames before looking at the page
count, so there is high chance both halves were consumed.

To recap on x86 :

2048 active pages would be visible by the device, because 4096 RX desc
would contain dma addresses pointing to the 4096 halves.

And 2048 pages would be in the reserve.

> The whole idea behind using only half the page per descriptor is to
> allow us to loop through the ring before we end up reusing it again.
> That buys us enough time that usually the stack has consumed the frame
> before we need it again.

The same will happen really.

Best maybe is for me to send the patch ;)

Alexander H Duyck Feb. 23, 2017, 2:18 a.m. UTC | #44

On Wed, Feb 22, 2017 at 6:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
>
>>
>> Right but you were talking about using both halves one after the
>> other.  If that occurs you have nothing left that you can reuse.  That
>> was what I was getting at.  If you use up both halves you end up
>> having to unmap the page.
>>
>
> You must have misunderstood me.
>
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
>
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
>
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
>
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
>
> If not, _then_ we unmap and release the page.

Okay, that was what I was referring to when I mentioned a "hybrid
between the mlx5 and the Intel approach".  Makes sense.

> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
>
> To recap on x86 :
>
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
>
> And 2048 pages would be in the reserve.

The buffer info layout for something like that would probably be
pretty interesting.  Basically you would be doubling up the ring so
that you handle 2 Rx descriptors per a single buffer info since you
would automatically know that it would be an even/odd setup in terms
of the buffer offsets.

If you get a chance to do something like that I would love to know the
result.  Otherwise if I get a chance I can try messing with i40e or
ixgbe some time and see what kind of impact it has.

>> The whole idea behind using only half the page per descriptor is to
>> allow us to loop through the ring before we end up reusing it again.
>> That buys us enough time that usually the stack has consumed the frame
>> before we need it again.
>
>
> The same will happen really.
>
> Best maybe is for me to send the patch ;)

I think I have the idea now.  However patches are always welcome..  :-)

Tariq Toukan Feb. 23, 2017, 2:02 p.m. UTC | #45

On 23/02/2017 4:18 AM, Alexander Duyck wrote:
> On Wed, Feb 22, 2017 at 6:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
>>
>>>
>>> Right but you were talking about using both halves one after the
>>> other.  If that occurs you have nothing left that you can reuse.  That
>>> was what I was getting at.  If you use up both halves you end up
>>> having to unmap the page.
>>>
>>
>> You must have misunderstood me.
>>
>> Once we use both halves of a page, we _keep_ the page, we do not unmap
>> it.
>>
>> We save the page pointer in a ring buffer of pages.
>> Call it the 'quarantine'
>>
>> When we _need_ to replenish the RX desc, we take a look at the oldest
>> entry in the quarantine ring.
>>
>> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
>> this saved page.
>>
>> If not, _then_ we unmap and release the page.
>
> Okay, that was what I was referring to when I mentioned a "hybrid
> between the mlx5 and the Intel approach".  Makes sense.
>

Indeed, in mlx5 Striding RQ (mpwqe) we do something similar.
Our NIC (ConnectX-4 LX and newer) knows to write multiple _consecutive_ 
packets into the same RX buffer (page).
AFAIU, this is what Eric suggests to do in SW in mlx4.

Here are the main characteristics of our page-cache in mlx5:
1) FIFO (for higher chances of an available page).
2) If the page-cache head is busy, it is not freed. This has its pros 
and cons. We might reconsider.
3) Pages in cache have no page-to-WQE assignment (WQE is for Work Queue 
Element, == RX descriptor). They are shared for all WQEs of an RQ and 
might be used by different WQEs in different rounds.
4) Cache size is smaller than suggested, we would happily increase it to 
reflect a whole ring.

Still, performance tests over mlx5 show that on high load we quickly end 
up allocating pages as the stack does not release its ref count on time.
Increasing the cache size helps of course.
As there's no _fixed_ fair size that guarantees the availability of 
pages every ring cycle, reflecting a ring size can help, and would give 
the opportunity for users to tune their performance by setting their 
ring size according to how powerful their CPUs are, and what traffic 
type/load they're running.

>> Note that we would have received 4096 frames before looking at the page
>> count, so there is high chance both halves were consumed.
>>
>> To recap on x86 :
>>
>> 2048 active pages would be visible by the device, because 4096 RX desc
>> would contain dma addresses pointing to the 4096 halves.
>>
>> And 2048 pages would be in the reserve.
>
> The buffer info layout for something like that would probably be
> pretty interesting.  Basically you would be doubling up the ring so
> that you handle 2 Rx descriptors per a single buffer info since you
> would automatically know that it would be an even/odd setup in terms
> of the buffer offsets.
>
> If you get a chance to do something like that I would love to know the
> result.  Otherwise if I get a chance I can try messing with i40e or
> ixgbe some time and see what kind of impact it has.
>
>>> The whole idea behind using only half the page per descriptor is to
>>> allow us to loop through the ring before we end up reusing it again.
>>> That buys us enough time that usually the stack has consumed the frame
>>> before we need it again.
>>
>>
>> The same will happen really.
>>
>> Best maybe is for me to send the patch ;)
>
> I think I have the idea now.  However patches are always welcome..  :-)
>

Same here :-)

Jesper Dangaard Brouer Feb. 24, 2017, 9:42 a.m. UTC | #46

On Wed, 22 Feb 2017 18:06:58 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
> 
> > 
> > Right but you were talking about using both halves one after the
> > other.  If that occurs you have nothing left that you can reuse.  That
> > was what I was getting at.  If you use up both halves you end up
> > having to unmap the page.
> >   
> 
> You must have misunderstood me.

FYI, I also misunderstood you (Eric) to start with ;-)
 
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
> 
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
> 
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
> 
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
> 
> If not, _then_ we unmap and release the page.
> 
> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
> 
> To recap on x86 :
> 
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
> 
> And 2048 pages would be in the reserve.

I do like it, and it should work.  I like it because it solves my
concern, regarding being able to adjust the amount of
outstanding-frames independently of the RX ring size.


Do notice: driver developers have to use Alex'es new DMA API in-order
to get writable-pages, else this will violate the DMA API.  And XDP
requires writable pages.

Eric Dumazet March 12, 2017, 2:57 p.m. UTC | #47

On Wed, 2017-02-22 at 18:06 -0800, Eric Dumazet wrote:
> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
> 
> > 
> > Right but you were talking about using both halves one after the
> > other.  If that occurs you have nothing left that you can reuse.  That
> > was what I was getting at.  If you use up both halves you end up
> > having to unmap the page.
> > 
> 
> You must have misunderstood me.
> 
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
> 
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
> 
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
> 
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
> 
> If not, _then_ we unmap and release the page.
> 
> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
> 
> To recap on x86 :
> 
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
> 
> And 2048 pages would be in the reserve.
> 
> 
> > The whole idea behind using only half the page per descriptor is to
> > allow us to loop through the ring before we end up reusing it again.
> > That buys us enough time that usually the stack has consumed the frame
> > before we need it again.
> 
> 
> The same will happen really.
> 
> Best maybe is for me to send the patch ;)

Excellent results so far, performance on PowerPC is back, and x86 gets a
gain as well.

Problem is XDP TX :

I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
RX is owned by current cpu.

Since TX completion is using a different NAPI, I really do not believe
we can avoid an atomic operation, like a spinlock, to protect the list
of pages ( ring->page_cache )

[v3,net-next,08/14] mlx4: use order-0 pages for RX

Commit Message

Comments

Patch