
[RFC,11/11] net/mlx5e: XDP TX xmit more

Message ID 1473252152-11379-12-git-send-email-saeedm@mellanox.com
State RFC, archived
Delegated to: David Miller

Commit Message

Saeed Mahameed Sept. 7, 2016, 12:42 p.m. UTC
Previously we rang the XDP SQ doorbell on every forwarded XDP packet.

Here we introduce an xmit_more-like mechanism that queues up more than
one packet into the SQ (up to the RX napi budget) without notifying the
hardware.

Once the RX napi budget is consumed and we exit the napi RX loop, we
flush (ring the doorbell for) all XDP forwarded packets, if any are
pending.
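
Conceptually, the resulting RX poll flow looks roughly like this (a
simplified sketch of the patch below, not the exact code):

int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
{
	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
	bool xdp_doorbell = false;  /* set when XDP_TX posted descriptors */
	int work_done = 0;

	for (; work_done < budget; work_done++) {
		struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);

		if (!cqe)
			break;
		mlx5_cqwq_pop(&cq->wq);
		/* XDP_TX packets are posted to the SQ here, but the
		 * doorbell is only recorded in xdp_doorbell, not rung
		 */
		rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
	}

	/* one doorbell for all XDP_TX packets queued in this napi poll */
	if (xdp_doorbell)
		mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);

	mlx5_cqwq_update_db_record(&cq->wq);
	return work_done;
}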

XDP forward packet rate:

Comparing XDP with and w/o xmit more (bulk transmit):

Streams     XDP TX       XDP TX (xmit more)
---------------------------------------------------
1           4.90Mpps      7.50Mpps
2           9.50Mpps      14.8Mpps
4           16.5Mpps      25.1Mpps
8           21.5Mpps      27.5Mpps*
16          24.1Mpps      27.5Mpps*

*We seem to hit a wall at 27.5Mpps for 8 and 16 streams; we are
working on the analysis and will publish the conclusions later.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
 2 files changed, 49 insertions(+), 17 deletions(-)

Comments

John Fastabend Sept. 7, 2016, 1:44 p.m. UTC | #1
On 16-09-07 05:42 AM, Saeed Mahameed wrote:
> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> 
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> 
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.
> 
> XDP forward packet rate:
> 
> Comparing XDP with and w/o xmit more (bulk transmit):
> 
> Streams     XDP TX       XDP TX (xmit more)
> ---------------------------------------------------
> 1           4.90Mpps      7.50Mpps
> 2           9.50Mpps      14.8Mpps
> 4           16.5Mpps      25.1Mpps
> 8           21.5Mpps      27.5Mpps*
> 16          24.1Mpps      27.5Mpps*
> 

Hi Saeed,

How many cores are you using with these numbers? Just a single
core? Or are the streams being RSS'd across cores somehow?

> *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> we will be working on the analysis and will publish the conclusions
> later.
> 

Thanks,
John
Saeed Mahameed Sept. 7, 2016, 2:40 p.m. UTC | #2
On Wed, Sep 7, 2016 at 4:44 PM, John Fastabend via iovisor-dev
<iovisor-dev@lists.iovisor.org> wrote:
> On 16-09-07 05:42 AM, Saeed Mahameed wrote:
>> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>>
>> Here we introduce a xmit more like mechanism that will queue up more
>> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>>
>> Once RX napi budget is consumed and we exit napi RX loop, we will
>> flush (doorbell) all XDP looped packets in case there are such.
>>
>> XDP forward packet rate:
>>
>> Comparing XDP with and w/o xmit more (bulk transmit):
>>
>> Streams     XDP TX       XDP TX (xmit more)
>> ---------------------------------------------------
>> 1           4.90Mpps      7.50Mpps
>> 2           9.50Mpps      14.8Mpps
>> 4           16.5Mpps      25.1Mpps
>> 8           21.5Mpps      27.5Mpps*
>> 16          24.1Mpps      27.5Mpps*
>>
>
> Hi Saeed,
>
> How many cores are you using with these numbers? Just a single
> core? Or are streams being RSS'd across cores somehow.
>

Hi John,

Right, I should have been clearer here: the number of streams refers to
the number of active RSS cores.
We just vary the number of rings with ethtool -L to test this.

>> *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
>> we will be working on the analysis and will publish the conclusions
>> later.
>>
>
> Thanks,
> John
> _______________________________________________
> iovisor-dev mailing list
> iovisor-dev@lists.iovisor.org
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Eric Dumazet Sept. 7, 2016, 2:41 p.m. UTC | #3
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> 
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> 
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.

Why does this idea depend on XDP?

It looks like we could apply it to any driver having one IRQ servicing
one RX and one TX queue, without XDP being involved.
Saeed Mahameed Sept. 7, 2016, 3:08 p.m. UTC | #4
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>>
>> Here we introduce a xmit more like mechanism that will queue up more
>> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>>
>> Once RX napi budget is consumed and we exit napi RX loop, we will
>> flush (doorbell) all XDP looped packets in case there are such.
>
> Why is this idea depends on XDP ?
>
> It looks like we could apply it to any driver having one IRQ servicing
> one RX and one TX, without XDP being involved.
>

Yes, but it is more complicated than the XDP case, where the RX ring
posts the TX descriptors and, once done, hits the doorbell once for all
the TX descriptors it posted; that is also the only possible place to
hit a doorbell for the XDP TX ring.

For a regular TX and RX ring sharing the same IRQ, there is no such
simple connection between them, and hitting a doorbell from the RX ring
napi would race with the xmit ndo function of the TX ring.

How do you synchronize in such a case?
Isn't the existing xmit_more mechanism sufficient? Maybe we can have a
fence in the napi RX function that holds the xmit queue until done and
then flushes the TX queue with the right xmit_more flags set, without
explicitly intervening in the TX flow (hitting the doorbell).
Eric Dumazet Sept. 7, 2016, 3:32 p.m. UTC | #5
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
> >> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> >>
> >> Here we introduce a xmit more like mechanism that will queue up more
> >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> >>
> >> Once RX napi budget is consumed and we exit napi RX loop, we will
> >> flush (doorbell) all XDP looped packets in case there are such.
> >
> > Why is this idea depends on XDP ?
> >
> > It looks like we could apply it to any driver having one IRQ servicing
> > one RX and one TX, without XDP being involved.
> >
> 
> Yes but it is more complicated than XDP case, where the RX ring posts
> the TX descriptors and once done
> the RX ring hits the doorbell once for all the TX descriptors it
> posted, and it is the only possible place to hit a doorbell
> for XDP TX ring.
> 
> For regular TX and RX ring sharing the same IRQ, there is no such
> simple connection between them, and hitting a doorbell
> from RX ring napi would race with xmit ndo function of the TX ring.
> 
> How do you synchronize in such case ?
> isn't the existing xmit more mechanism sufficient enough ?

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.

A simple cmpxchg could be used to synchronize the thing, if we really
cared about doorbell cost (i.e. if the cost of this cmpxchg() is much
smaller than the doorbell's).
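
Something along these lines, as a rough illustration only (the per-queue
flag and the helpers are made-up names, not existing mlx5 code, and
memory-ordering details are omitted):

/* Both ndo_start_xmit and the RX napi poll may have posted descriptors;
 * whoever successfully flips the flag from 1 to 0 rings the doorbell,
 * so no TX lock is needed just for the notify.
 */
struct example_sq {
	atomic_t db_pending;            /* 1 = posted, doorbell not rung yet */
	/* ... real SQ state ... */
};

static void example_sq_mark_pending(struct example_sq *sq)
{
	atomic_set(&sq->db_pending, 1); /* after posting descriptors */
}

static void example_sq_flush_doorbell(struct example_sq *sq)
{
	if (atomic_cmpxchg(&sq->db_pending, 1, 0) == 1)
		example_ring_hw_doorbell(sq);   /* hypothetical helper */
}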
Saeed Mahameed Sept. 7, 2016, 4:57 p.m. UTC | #6
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>> >>
>> >> Here we introduce a xmit more like mechanism that will queue up more
>> >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>> >>
>> >> Once RX napi budget is consumed and we exit napi RX loop, we will
>> >> flush (doorbell) all XDP looped packets in case there are such.
>> >
>> > Why is this idea depends on XDP ?
>> >
>> > It looks like we could apply it to any driver having one IRQ servicing
>> > one RX and one TX, without XDP being involved.
>> >
>>
>> Yes but it is more complicated than XDP case, where the RX ring posts
>> the TX descriptors and once done
>> the RX ring hits the doorbell once for all the TX descriptors it
>> posted, and it is the only possible place to hit a doorbell
>> for XDP TX ring.
>>
>> For regular TX and RX ring sharing the same IRQ, there is no such
>> simple connection between them, and hitting a doorbell
>> from RX ring napi would race with xmit ndo function of the TX ring.
>>
>> How do you synchronize in such case ?
>> isn't the existing xmit more mechanism sufficient enough ?
>
> Only if a qdisc is present and pressure is high enough.
>
> But in a forwarding setup, we likely receive at a lower rate than the
> NIC can transmit.
>

Jesper has a similar idea: make the qdisc think it is under pressure
when the device TX ring is idle most of the time. I think his idea can
come in handy here. I am not fully involved in the details; maybe he
can elaborate more.

But if it works, it will be transparent to napi, and xmit_more will
happen by design.

> A simple cmpxchg could be used to synchronize the thing, if we really
> cared about doorbell cost. (Ie if the cost of this cmpxchg() is way
> smaller than doorbell one)
>
Eric Dumazet Sept. 7, 2016, 6:19 p.m. UTC | #7
On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:

> Jesper has a similar Idea to make the qdisc think it is under
> pressure, when the device
> TX ring is idle most of the time, i think his idea can come in handy here.
> I am not fully involved in the details, maybe he can elaborate more.
> 
> But if it works, it will be transparent to napi, and xmit more will
> happen by design.

I do not think the qdisc is relevant here.

Right now, skb->xmit_more is set only by the qdisc layer (and the
pktgen tool), because only this layer can know whether more packets are
to come.

What I am saying is that, regardless of whether skb->xmit_more is set
(for example if no qdisc is used at all), a NAPI driver can arm a bit
asking for the doorbell to be rung at the end of NAPI.

I am not saying this must be done, only that the idea could be extended
to the non-XDP world, if we care enough.
Jesper Dangaard Brouer Sept. 7, 2016, 6:22 p.m. UTC | #8
On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> >> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
[...]
> >
> > Only if a qdisc is present and pressure is high enough.
> >
> > But in a forwarding setup, we likely receive at a lower rate than the
> > NIC can transmit.

Yes, I can confirm this happens in my experiments.

> >  
> 
> Jesper has a similar Idea to make the qdisc think it is under
> pressure, when the device TX ring is idle most of the time, i think
> his idea can come in handy here. I am not fully involved in the
> details, maybe he can elaborate more.
> 
> But if it works, it will be transparent to napi, and xmit more will
> happen by design.

Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating that xmit_more should be possible.  This will be a topic at
the Network Performance Workshop[1] at NetDev 1.2, where I will
hopefully challenge people to come up with a good solution ;-)
Saeed Mahameed Sept. 7, 2016, 8:09 p.m. UTC | #9
On Wed, Sep 7, 2016 at 9:19 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:
>
>> Jesper has a similar Idea to make the qdisc think it is under
>> pressure, when the device
>> TX ring is idle most of the time, i think his idea can come in handy here.
>> I am not fully involved in the details, maybe he can elaborate more.
>>
>> But if it works, it will be transparent to napi, and xmit more will
>> happen by design.
>
> I do not think qdisc is relevant here.
>
> Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool),
> because only this layer can know if more packets are to come.
>
>
> What I am saying is that regardless of skb->xmit_more being set or not,
> (for example if no qdisc is even used)
> a NAPI driver can arm a bit asking the doorbell being sent at the end of
> NAPI.
>
> I am not saying this must be done, only that the idea could be extended
> to non XDP world, if we care enough.
>

Yes, and I am just trying to suggest ideas that do not require
communication between RX (NAPI) and TX.

The problem here is the synchronization (ringing the TX doorbell from
RX), which is not as simple as an atomic operation for some drivers.

How about RX bulking? It can also help here, since in the forwarding
case the forwarding path will be able to process a bulk of RX SKBs and
bulk-xmit the portion of the SKBs that are forwarded.

As Jesper suggested, let's talk at NetDev 1.2 in Jesper's session (if
you are joining, of course).

Thanks
Saeed.
John Fastabend Sept. 8, 2016, 2:58 a.m. UTC | #10
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
> 
> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:  
>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> [...]
>>>
>>> Only if a qdisc is present and pressure is high enough.
>>>
>>> But in a forwarding setup, we likely receive at a lower rate than the
>>> NIC can transmit.
> 
> Yes, I can confirm this happens in my experiments.
> 
>>>  
>>
>> Jesper has a similar Idea to make the qdisc think it is under
>> pressure, when the device TX ring is idle most of the time, i think
>> his idea can come in handy here. I am not fully involved in the
>> details, maybe he can elaborate more.
>>
>> But if it works, it will be transparent to napi, and xmit more will
>> happen by design.
> 
> Yes. I have some ideas around getting more bulking going from the qdisc
> layer, by having the drivers provide some feedback to the qdisc layer
> indicating xmit_more should be possible.  This will be a topic at the
> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> challenge people to come up with a good solution ;-)
> 

One thing I've noticed, but haven't actually analyzed much yet, is that
if I shrink the NIC descriptor ring size to be only slightly larger
than the qdisc layer bulking size, I get more bulking and better perf
numbers, at least on microbenchmarks. The reason is that the NIC pushes
back more on the qdisc. So maybe there is a case for making the ring
size in the NIC some factor of the expected number of queues feeding
the descriptor ring.

.John
Tom Herbert Sept. 8, 2016, 3:21 a.m. UTC | #11
On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@gmail.com> wrote:
> On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>>
>> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> [...]
>>>>
>>>> Only if a qdisc is present and pressure is high enough.
>>>>
>>>> But in a forwarding setup, we likely receive at a lower rate than the
>>>> NIC can transmit.
>>
>> Yes, I can confirm this happens in my experiments.
>>
>>>>
>>>
>>> Jesper has a similar Idea to make the qdisc think it is under
>>> pressure, when the device TX ring is idle most of the time, i think
>>> his idea can come in handy here. I am not fully involved in the
>>> details, maybe he can elaborate more.
>>>
>>> But if it works, it will be transparent to napi, and xmit more will
>>> happen by design.
>>
>> Yes. I have some ideas around getting more bulking going from the qdisc
>> layer, by having the drivers provide some feedback to the qdisc layer
>> indicating xmit_more should be possible.  This will be a topic at the
>> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
>> challenge people to come up with a good solution ;-)
>>
>
> One thing I've noticed but haven't yet actually analyzed much is if
> I shrink the nic descriptor ring size to only be slightly larger than
> the qdisc layer bulking size I get more bulking and better perf numbers.
> At least on microbenchmarks. The reason being the nic pushes back more
> on the qdisc. So maybe a case for making the ring size in the NIC some
> factor of the expected number of queues feeding the descriptor ring.
>

BQL is not helping with that?

Tom

> .John
Jesper Dangaard Brouer Sept. 8, 2016, 5:11 a.m. UTC | #12
On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@herbertland.com> wrote:

> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@gmail.com> wrote:
> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:  
> >>
> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:  
> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> >>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> >>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> >> [...]  
> >>>>
> >>>> Only if a qdisc is present and pressure is high enough.
> >>>>
> >>>> But in a forwarding setup, we likely receive at a lower rate than the
> >>>> NIC can transmit.  
> >>
> >> Yes, I can confirm this happens in my experiments.
> >>  
> >>>>  
> >>>
> >>> Jesper has a similar Idea to make the qdisc think it is under
> >>> pressure, when the device TX ring is idle most of the time, i think
> >>> his idea can come in handy here. I am not fully involved in the
> >>> details, maybe he can elaborate more.
> >>>
> >>> But if it works, it will be transparent to napi, and xmit more will
> >>> happen by design.  
> >>
> >> Yes. I have some ideas around getting more bulking going from the qdisc
> >> layer, by having the drivers provide some feedback to the qdisc layer
> >> indicating xmit_more should be possible.  This will be a topic at the
> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> >> challenge people to come up with a good solution ;-)
> >>  
> >
> > One thing I've noticed but haven't yet actually analyzed much is if
> > I shrink the nic descriptor ring size to only be slightly larger than
> > the qdisc layer bulking size I get more bulking and better perf numbers.
> > At least on microbenchmarks. The reason being the nic pushes back more
> > on the qdisc. So maybe a case for making the ring size in the NIC some
> > factor of the expected number of queues feeding the descriptor ring.
> >  

I've also played with shrinking the NIC descriptor ring size; it works,
but it is an ugly hack to get the NIC to push back, and I foresee it
will hurt normal use-cases. (There are other reasons for shrinking the
ring size, like cache usage, but that is unrelated to this.)

 
> BQL is not helping with that?

Exactly. But the BQL _byte_ limit is not what is needed; what we need
to know is the number of _packets_ currently "in-flight", which Tom
already has a patch for :-)  Once we have that, the algorithm is simple.

The qdisc dequeue looks at the BQL pkts-in-flight; if the driver has
"enough" packets in-flight, the qdisc starts its bulk dequeue building
phase before calling the driver. The allowed max qdisc bulk size should
likely be related to pkts-in-flight.
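
A minimal sketch of that heuristic, just to make the idea concrete (the
packets-in-flight accessor and the thresholds are made up, they are not
existing kernel APIs):

/* Decide how many packets the qdisc may bulk-dequeue before calling
 * the driver, based on how many packets the driver already has
 * in-flight (sketch only, names are illustrative).
 */
static unsigned int qdisc_bulk_budget(const struct netdev_queue *txq)
{
	unsigned int in_flight = example_dql_pkts_in_flight(&txq->dql);

	/* link nearly idle: send immediately, keep latency low */
	if (in_flight < 4)
		return 1;

	/* driver is busy anyway: build a bulk roughly matching what is
	 * already outstanding, capped to something sane
	 */
	return min(in_flight, 32u);
}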
Jesper Dangaard Brouer Sept. 8, 2016, 8:11 a.m. UTC | #13
I'm sorry but I have a problem with this patch!

Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.


What you are trying to implement, with delaying the doorbell, is
basically TX bulking for TX_XDP.

 Why not implement a TX bulking interface directly instead?!?

Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (The benefit is smaller, but every cycle counts in this area.)

This whole XDP exercise is about avoiding a per-packet transaction
cost; that reads "bulking" or "bundling" of packets, where possible.

 Let's do bundling/bulking from the start!

The reason behind the xmit_more API is that we could not change the
API of all the drivers.  And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost
that would hit the common single-packet use-case.

It should be really easy to build a bundle of packets that need the
XDP_TX action, especially given you only have a single destination
"port".  And then you XDP_TX-send this bundle before
mlx5_cqwq_update_db_record.
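
To illustrate the direction, an untested sketch of such a bundle (the
structure and helpers are made up; only mlx5e_xmit_xdp_frame() and
mlx5e_xmit_xdp_doorbell() are taken from this patch):

#define XDP_TX_BUNDLE_MAX 64

struct xdp_tx_bundle {
	unsigned int count;
	struct {
		struct mlx5e_dma_info *di;
		u16 len;
	} pkt[XDP_TX_BUNDLE_MAX];
};

static void xdp_bundle_add(struct xdp_tx_bundle *b,
			   struct mlx5e_dma_info *di, u16 len)
{
	/* collect XDP_TX packets during the RX poll */
	b->pkt[b->count].di  = di;
	b->pkt[b->count].len = len;
	b->count++;
}

static void xdp_bundle_flush(struct mlx5e_sq *sq, struct xdp_tx_bundle *b)
{
	unsigned int i;

	/* post all descriptors back-to-back, keeping the XDP_TX code
	 * and data hot in the caches, then ring the doorbell once
	 */
	for (i = 0; i < b->count; i++)
		mlx5e_xmit_xdp_frame(sq, b->pkt[i].di, MLX5_RX_HEADROOM,
				     b->pkt[i].len);
	if (b->count)
		mlx5e_xmit_xdp_doorbell(sq);
	b->count = 0;
}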

In the future, XDP needs to support XDP_FWD forwarding of packets/pages
out other interfaces.  I also want bulk transmit from day one here.  It
is slightly more tricky to sort packets for multiple outgoing
interfaces efficiently in the poll loop.

But the mSwitch[1] article actually already solved this destination
sorting.  Please read section 3.3 "Switch Fabric Algorithm" of [1] to
understand the next steps towards a smarter data structure, once we
start to have more TX "ports".  And perhaps align your single XDP_TX
destination data structure with this future development.

[1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf

--Jesper
(top post)


On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:

> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> 
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> 
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.
> 
> XDP forward packet rate:
> 
> Comparing XDP with and w/o xmit more (bulk transmit):
> 
> Streams     XDP TX       XDP TX (xmit more)
> ---------------------------------------------------
> 1           4.90Mpps      7.50Mpps
> 2           9.50Mpps      14.8Mpps
> 4           16.5Mpps      25.1Mpps
> 8           21.5Mpps      27.5Mpps*
> 16          24.1Mpps      27.5Mpps*
> 
> *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> we will be working on the analysis and will publish the conclusions
> later.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
>  2 files changed, 49 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index df2c9e0..6846208 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -265,7 +265,8 @@ struct mlx5e_cq {
>  
>  struct mlx5e_rq;
>  typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
> -				       struct mlx5_cqe64 *cqe);
> +				       struct mlx5_cqe64 *cqe,
> +				       bool *xdp_doorbell);
>  typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,
>  				  u16 ix);
>  
> @@ -742,8 +743,10 @@ void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
>  
>  void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
>  			bool recycle);
> -void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> +void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +			 bool *xdp_doorbell);
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +			       bool *xdp_doorbell);
>  bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
>  int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
>  int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 912a0e2..ed93251 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -117,7 +117,8 @@ static inline void mlx5e_decompress_cqe_no_hash(struct mlx5e_rq *rq,
>  static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>  					     struct mlx5e_cq *cq,
>  					     int update_owner_only,
> -					     int budget_rem)
> +					     int budget_rem,
> +					     bool *xdp_doorbell)
>  {
>  	u32 cqcc = cq->wq.cc + update_owner_only;
>  	u32 cqe_count;
> @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>  			mlx5e_read_mini_arr_slot(cq, cqcc);
>  
>  		mlx5e_decompress_cqe_no_hash(rq, cq, cqcc);
> -		rq->handle_rx_cqe(rq, &cq->title);
> +		rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
>  	}
>  	mlx5e_cqes_update_owner(cq, cq->wq.cc, cqcc - cq->wq.cc);
>  	cq->wq.cc = cqcc;
> @@ -143,15 +144,16 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>  
>  static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
>  					      struct mlx5e_cq *cq,
> -					      int budget_rem)
> +					      int budget_rem,
> +					      bool *xdp_doorbell)
>  {
>  	mlx5e_read_title_slot(rq, cq, cq->wq.cc);
>  	mlx5e_read_mini_arr_slot(cq, cq->wq.cc + 1);
>  	mlx5e_decompress_cqe(rq, cq, cq->wq.cc);
> -	rq->handle_rx_cqe(rq, &cq->title);
> +	rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
>  	cq->mini_arr_idx++;
>  
> -	return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem) - 1;
> +	return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem, xdp_doorbell) - 1;
>  }
>  
>  void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val)
> @@ -670,23 +672,36 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
>  	wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
>  	sq->pc += MLX5E_XDP_TX_WQEBBS;
>  
> -	/* TODO: xmit more */
> +	/* mlx5e_sq_xmit_doorbel will be called after RX napi loop */
> +	return true;
> +}
> +
> +static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_sq *sq)
> +{
> +	struct mlx5_wq_cyc *wq = &sq->wq;
> +	struct mlx5e_tx_wqe *wqe;
> +	u16 pi = (sq->pc - MLX5E_XDP_TX_WQEBBS) & wq->sz_m1; /* last pi */
> +
> +	wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
> +
>  	wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
>  	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>  
> +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
>  	/* fill sq edge with nops to avoid wqe wrap around */
>  	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
>  		sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
>  		mlx5e_send_nop(sq, false);
>  	}
> -	return true;
> +#endif
>  }
>  
>  /* returns true if packet was consumed by xdp */
>  static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>  				    const struct bpf_prog *prog,
>  				    struct mlx5e_dma_info *di,
> -				    void *data, u16 len)
> +				    void *data, u16 len,
> +				    bool *xdp_doorbell)
>  {
>  	bool consumed = false;
>  	struct xdp_buff xdp;
> @@ -705,7 +720,13 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>  		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
>  						MLX5_RX_HEADROOM,
>  						len);
> +		if (unlikely(!consumed) && (*xdp_doorbell)) {
> +			/* SQ is full, ring doorbell */
> +			mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
> +			*xdp_doorbell = false;
> +		}
>  		rq->stats.xdp_tx += consumed;
> +		*xdp_doorbell |= consumed;
>  		return consumed;
>  	default:
>  		bpf_warn_invalid_xdp_action(act);
> @@ -720,7 +741,8 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>  	return false;
>  }
>  
> -void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +			 bool *xdp_doorbell)
>  {
>  	struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
>  	struct mlx5e_dma_info *di;
> @@ -752,7 +774,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>  		goto wq_ll_pop;
>  	}
>  
> -	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
> +	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt, xdp_doorbell))
>  		goto wq_ll_pop; /* page/packet was consumed by XDP */
>  
>  	skb = build_skb(va, RQ_PAGE_SIZE(rq));
> @@ -814,7 +836,8 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
>  	skb->len  += headlen;
>  }
>  
> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +			       bool *xdp_doorbell)
>  {
>  	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
>  	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
> @@ -860,13 +883,15 @@ mpwrq_cqe_out:
>  int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>  {
>  	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
> +	bool xdp_doorbell = false;
>  	int work_done = 0;
>  
>  	if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state)))
>  		return 0;
>  
>  	if (cq->decmprs_left)
> -		work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget);
> +		work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget,
> +							&xdp_doorbell);
>  
>  	for (; work_done < budget; work_done++) {
>  		struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);
> @@ -877,15 +902,19 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>  		if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) {
>  			work_done +=
>  				mlx5e_decompress_cqes_start(rq, cq,
> -							    budget - work_done);
> +							    budget - work_done,
> +							    &xdp_doorbell);
>  			continue;
>  		}
>  
>  		mlx5_cqwq_pop(&cq->wq);
>  
> -		rq->handle_rx_cqe(rq, cqe);
> +		rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
>  	}
>  
> +	if (xdp_doorbell)
> +		mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
> +
>  	mlx5_cqwq_update_db_record(&cq->wq);
>  
>  	/* ensure cq space is freed before enabling more cqes */
Tom Herbert Sept. 8, 2016, 4:26 p.m. UTC | #14
On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@herbertland.com> wrote:
>
>> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@gmail.com> wrote:
>> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>> >>
>> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> >>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> [...]
>> >>>>
>> >>>> Only if a qdisc is present and pressure is high enough.
>> >>>>
>> >>>> But in a forwarding setup, we likely receive at a lower rate than the
>> >>>> NIC can transmit.
>> >>
>> >> Yes, I can confirm this happens in my experiments.
>> >>
>> >>>>
>> >>>
>> >>> Jesper has a similar Idea to make the qdisc think it is under
>> >>> pressure, when the device TX ring is idle most of the time, i think
>> >>> his idea can come in handy here. I am not fully involved in the
>> >>> details, maybe he can elaborate more.
>> >>>
>> >>> But if it works, it will be transparent to napi, and xmit more will
>> >>> happen by design.
>> >>
>> >> Yes. I have some ideas around getting more bulking going from the qdisc
>> >> layer, by having the drivers provide some feedback to the qdisc layer
>> >> indicating xmit_more should be possible.  This will be a topic at the
>> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
>> >> challenge people to come up with a good solution ;-)
>> >>
>> >
>> > One thing I've noticed but haven't yet actually analyzed much is if
>> > I shrink the nic descriptor ring size to only be slightly larger than
>> > the qdisc layer bulking size I get more bulking and better perf numbers.
>> > At least on microbenchmarks. The reason being the nic pushes back more
>> > on the qdisc. So maybe a case for making the ring size in the NIC some
>> > factor of the expected number of queues feeding the descriptor ring.
>> >
>
> I've also played with shrink the NIC descriptor ring size, it works,
> but it is an ugly hack to get NIC pushes backs, and I foresee it will
> hurt normal use-cases. (There are other reasons for shrinking the ring
> size like cache usage, but that is unrelated to this).
>
>
>> BQL is not helping with that?
>
> Exactly. But the BQL _byte_ limit is not what is needed, what we need
> to know is the _packets_ currently "in-flight".  Which Tom already have
> a patch for :-)  Once we have that the algorithm is simple.
>
> Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough"
> packets in-flight, the qdisc start it's bulk dequeue building phase,
> before calling the driver. The allowed max qdisc bulk size should
> likely be related to pkts-in-flight.
>
Sorry, I'm still missing it. The point of BQL is that we minimize the
amount of data (and hence the number of packets) that needs to be
queued in the device in order to prevent the link from going idle while
there are outstanding packets to be sent. The algorithm is based on
counting bytes, not packets, because bytes are roughly an equal-cost
unit of work. So if we've queued 100K bytes on the queue, we know how
long that takes: around 80 usecs @10G. But if we count packets, we
really don't know much about that. 100 packets enqueued could represent
6400 bytes or 6400K worth of data, so the time to transmit is anywhere
from 5 usecs to 5 msecs.

Shouldn't the qdisc bulk size be based on the BQL limit? What is the
simple algorithm to apply to in-flight packets?

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
Jesper Dangaard Brouer Sept. 8, 2016, 5:19 p.m. UTC | #15
On Thu, 8 Sep 2016 09:26:03 -0700
Tom Herbert <tom@herbertland.com> wrote:

> On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@herbertland.com> wrote:
> >  
> >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@gmail.com> wrote:  
> >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:  
> >> >>
> >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:  
> >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> >> >>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >> >>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> >> >>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> >> >> [...]  
> >> >>>>
> >> >>>> Only if a qdisc is present and pressure is high enough.
> >> >>>>
> >> >>>> But in a forwarding setup, we likely receive at a lower rate than the
> >> >>>> NIC can transmit.  
> >> >>
> >> >> Yes, I can confirm this happens in my experiments.
> >> >>  
> >> >>>>  
> >> >>>
> >> >>> Jesper has a similar Idea to make the qdisc think it is under
> >> >>> pressure, when the device TX ring is idle most of the time, i think
> >> >>> his idea can come in handy here. I am not fully involved in the
> >> >>> details, maybe he can elaborate more.
> >> >>>
> >> >>> But if it works, it will be transparent to napi, and xmit more will
> >> >>> happen by design.  
> >> >>
> >> >> Yes. I have some ideas around getting more bulking going from the qdisc
> >> >> layer, by having the drivers provide some feedback to the qdisc layer
> >> >> indicating xmit_more should be possible.  This will be a topic at the
> >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> >> >> challenge people to come up with a good solution ;-)
> >> >>  
> >> >
> >> > One thing I've noticed but haven't yet actually analyzed much is if
> >> > I shrink the nic descriptor ring size to only be slightly larger than
> >> > the qdisc layer bulking size I get more bulking and better perf numbers.
> >> > At least on microbenchmarks. The reason being the nic pushes back more
> >> > on the qdisc. So maybe a case for making the ring size in the NIC some
> >> > factor of the expected number of queues feeding the descriptor ring.
> >> >  
> >
> > I've also played with shrink the NIC descriptor ring size, it works,
> > but it is an ugly hack to get NIC pushes backs, and I foresee it will
> > hurt normal use-cases. (There are other reasons for shrinking the ring
> > size like cache usage, but that is unrelated to this).
> >
> >  
> >> BQL is not helping with that?  
> >
> > Exactly. But the BQL _byte_ limit is not what is needed, what we need
> > to know is the _packets_ currently "in-flight".  Which Tom already have
> > a patch for :-)  Once we have that the algorithm is simple.
> >
> > Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough"
> > packets in-flight, the qdisc start it's bulk dequeue building phase,
> > before calling the driver. The allowed max qdisc bulk size should
> > likely be related to pkts-in-flight.
> >  
> Sorry, I'm still missing it. The point of BQL is that we minimize the
> amount of data (and hence number of packets) that needs to be queued
> in the device in order to prevent the link from going idle while there
> are outstanding packets to be sent. The algorithm is based on counting
> bytes not packets because bytes are roughly an equal cost unit of
> work. So if we've queued 100K of bytes on the queue we know how long
> that takes around 80 usecs @10G, but if we count packets then we
> really don't know much about that. 100 packets enqueued could
> represent 6400 bytes or 6400K worth of data so time to transmit is
> anywhere from 5usecs to 5msecs....
> 
> Shouldn't qdisc bulk size be based on the BQL limit? What is the
> simple algorithm to apply to in-flight packets?

Maybe the algorithm is not so simple, and we likely also have to take
the BQL bytes into account.

The reason for wanting packets-in-flight is that we are attacking a
transaction cost.  The tailptr/doorbell costs around 70ns.  (Based on
the data in this patch description, 4.9Mpps -> 7.5Mpps gives
(1/4.90 - 1/7.5) * 1000 = 70.74ns.)  The 10G wirespeed budget for small
(64B) packets is 67.2ns per packet, so with a fixed per-packet overhead
of 70ns we can never reach 10G wirespeed.

The idea/algorithm is trying to predict the future.  If we see a
given/high packet rate, which equals a high transaction cost, then
let's try not calling the driver and instead backlog the packet in the
qdisc, speculatively hoping the current rate continues.  This will in
effect allow bulking and amortize the 70ns transaction cost over N
packets.

Instead of tracking a rate of packets or doorbells per second, I will
let BQL's packets-in-flight tell me when the driver sees a rate high
enough that the driver (DMA-TX completion) considers several packets to
be in-flight.
When that happens, I will bet that I can stop sending packets to the
device and instead queue them in the qdisc layer.  If I'm unlucky and
the flow stops, then I'm hoping that the last packet stuck in the qdisc
will be picked up by the next napi-schedule, before the device driver
runs "dry".
Tom Herbert Sept. 8, 2016, 6:16 p.m. UTC | #16
On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Thu, 8 Sep 2016 09:26:03 -0700
> Tom Herbert <tom@herbertland.com> wrote:
>
>> On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> >
>> > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@herbertland.com> wrote:
>> >
>> >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@gmail.com> wrote:
>> >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>> >> >>
>> >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> >>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> >> >>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> >>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> >> [...]
>> >> >>>>
>> >> >>>> Only if a qdisc is present and pressure is high enough.
>> >> >>>>
>> >> >>>> But in a forwarding setup, we likely receive at a lower rate than the
>> >> >>>> NIC can transmit.
>> >> >>
>> >> >> Yes, I can confirm this happens in my experiments.
>> >> >>
>> >> >>>>
>> >> >>>
>> >> >>> Jesper has a similar Idea to make the qdisc think it is under
>> >> >>> pressure, when the device TX ring is idle most of the time, i think
>> >> >>> his idea can come in handy here. I am not fully involved in the
>> >> >>> details, maybe he can elaborate more.
>> >> >>>
>> >> >>> But if it works, it will be transparent to napi, and xmit more will
>> >> >>> happen by design.
>> >> >>
>> >> >> Yes. I have some ideas around getting more bulking going from the qdisc
>> >> >> layer, by having the drivers provide some feedback to the qdisc layer
>> >> >> indicating xmit_more should be possible.  This will be a topic at the
>> >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
>> >> >> challenge people to come up with a good solution ;-)
>> >> >>
>> >> >
>> >> > One thing I've noticed but haven't yet actually analyzed much is if
>> >> > I shrink the nic descriptor ring size to only be slightly larger than
>> >> > the qdisc layer bulking size I get more bulking and better perf numbers.
>> >> > At least on microbenchmarks. The reason being the nic pushes back more
>> >> > on the qdisc. So maybe a case for making the ring size in the NIC some
>> >> > factor of the expected number of queues feeding the descriptor ring.
>> >> >
>> >
>> > I've also played with shrink the NIC descriptor ring size, it works,
>> > but it is an ugly hack to get NIC pushes backs, and I foresee it will
>> > hurt normal use-cases. (There are other reasons for shrinking the ring
>> > size like cache usage, but that is unrelated to this).
>> >
>> >
>> >> BQL is not helping with that?
>> >
>> > Exactly. But the BQL _byte_ limit is not what is needed, what we need
>> > to know is the _packets_ currently "in-flight".  Which Tom already have
>> > a patch for :-)  Once we have that the algorithm is simple.
>> >
>> > Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough"
>> > packets in-flight, the qdisc start it's bulk dequeue building phase,
>> > before calling the driver. The allowed max qdisc bulk size should
>> > likely be related to pkts-in-flight.
>> >
>> Sorry, I'm still missing it. The point of BQL is that we minimize the
>> amount of data (and hence number of packets) that needs to be queued
>> in the device in order to prevent the link from going idle while there
>> are outstanding packets to be sent. The algorithm is based on counting
>> bytes not packets because bytes are roughly an equal cost unit of
>> work. So if we've queued 100K of bytes on the queue we know how long
>> that takes around 80 usecs @10G, but if we count packets then we
>> really don't know much about that. 100 packets enqueued could
>> represent 6400 bytes or 6400K worth of data so time to transmit is
>> anywhere from 5usecs to 5msecs....
>>
>> Shouldn't qdisc bulk size be based on the BQL limit? What is the
>> simple algorithm to apply to in-flight packets?
>
> Maybe the algorithm is not so simple, and we likely also have to take
> BQL bytes into account.
>
> The reason for wanting packets-in-flight is because we are attacking a
> transaction cost.  The tailptr/doorbell cost around 70ns.  (Based on
> data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 =
> 70.74). The 10G wirespeed small packets budget is 67.2ns, this with
> fixed overhead per packet of 70ns we can never reach 10G wirespeed.
>
But you should be able to do this with BQL, and it is more accurate.
BQL tells you how many bytes need to be sent, and that can be used to
create a bulk of packets to send with one doorbell.

> The idea/algo is trying to predict the future.  If we see a given/high
> packet rate, which equals a high transaction cost, then lets try not
> calling the driver, and instead backlog the packet in the qdisc,
> speculatively hoping the current rate continues.  This will in effect
> allow bulking and amortize the 70ns transaction cost over N packets.
>
> Instead of tracking a rate of packets or doorbells per sec, I will let
> BQLs packet-in-flight tell me when the driver sees a rate high enough
> that the drivers (DMA-TX completion) consider several packets are
> in-flight.
> When that happens, I will bet on, I can stop sending packets to the
> device, and instead queue them in the qdisc layer.  If I'm unlucky and
> the flow stops, then I'm hoping that the last packet stuck in the qdisc,
> will be picked by the next napi-schedule, before the device driver runs
> "dry".
>
This is exactly what BQL already does (except that the queue limit is
on bytes). Once the byte limit is reached, the queue is stopped. At TX
completion time some number of bytes is freed up, so that a bulk of
packets can be sent up to the queue limit.
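
For reference, this is roughly how a driver hooks into BQL today
(heavily simplified, not mlx5-specific; the example_* names and the
queue lookup are placeholders):

/* On the transmit path, after posting a packet of skb->len bytes to
 * the hardware ring backed by txq:
 */
static netdev_tx_t example_xmit(struct sk_buff *skb, struct netdev_queue *txq)
{
	/* ... post skb to the hardware ring ... */
	netdev_tx_sent_queue(txq, skb->len); /* may stop the queue at the byte limit */
	return NETDEV_TX_OK;
}

/* In the TX completion napi, after reclaiming descriptors: */
static void example_tx_complete(struct netdev_queue *txq,
				unsigned int pkts, unsigned int bytes)
{
	/* re-opens the queue, so the stack can push the next bulk */
	netdev_tx_completed_queue(txq, pkts, bytes);
}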

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
Rick Jones Sept. 8, 2016, 6:48 p.m. UTC | #17
On 09/08/2016 11:16 AM, Tom Herbert wrote:
> On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>> On Thu, 8 Sep 2016 09:26:03 -0700
>> Tom Herbert <tom@herbertland.com> wrote:
>>> Shouldn't qdisc bulk size be based on the BQL limit? What is the
>>> simple algorithm to apply to in-flight packets?
>>
>> Maybe the algorithm is not so simple, and we likely also have to take
>> BQL bytes into account.
>>
>> The reason for wanting packets-in-flight is because we are attacking a
>> transaction cost.  The tailptr/doorbell cost around 70ns.  (Based on
>> data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 =
>> 70.74). The 10G wirespeed small packets budget is 67.2ns, this with
>> fixed overhead per packet of 70ns we can never reach 10G wirespeed.
>>
> But you should be able to do this with BQL and it is more accurate.
> BQL tells how many bytes need to be sent and that can be used to
> create a bulk of packets to send with one doorbell.

With small packets and the "default" ring size for this NIC/driver
combination, is the BQL limit large enough that the ring fills before
one hits it?

rick jones
Eric Dumazet Sept. 8, 2016, 6:52 p.m. UTC | #18
On Thu, 2016-09-08 at 11:48 -0700, Rick Jones wrote:

> With small packets and the "default" ring size for this NIC/driver 
> combination, is the BQL large enough that the ring fills before one hits 
> the BQL?

It depends on how TX completion (the NAPI handler) is implemented in
the driver.

That is, how many packets can be dequeued by each invocation.

Drivers vary a lot there.
Alexei Starovoitov Sept. 9, 2016, 3:22 a.m. UTC | #19
On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
> 
> I'm sorry but I have a problem with this patch!

Is it because the variable is called 'xdp_doorbell'?
Frankly, I see nothing scary in this patch.
It extends existing code by adding a flag indicating whether to ring
the doorbell or not.
The end of RX napi is used as an obvious heuristic to flush the pipe.
Looks pretty generic to me.
The same code can be used for non-XDP as well, once we figure out a
good algorithm for xmit_more in the stack.

> Looking at this patch, I want to bring up a fundamental architectural
> concern with the development direction of XDP transmit.
> 
> 
> What you are trying to implement, with delaying the doorbell, is
> basically TX bulking for TX_XDP.
> 
>  Why not implement a TX bulking interface directly instead?!?
> 
> Yes, the tailptr/doorbell is the most costly operation, but why not
> also take advantage of the benefits of bulking for other parts of the
> code? (benefit is smaller, by every cycles counts in this area)
> 
> This hole XDP exercise is about avoiding having a transaction cost per
> packet, that reads "bulking" or "bundling" of packets, where possible.
> 
>  Lets do bundling/bulking from the start!

mlx4 already does bulking, and this proposed mlx5 set of patches
does bulking as well.
I see nothing wrong with it. The RX side processes the packets and,
when it's done, it tells TX to xmit whatever it collected.

> The reason behind the xmit_more API is that we could not change the
> API of all the drivers.  And we found that calling an explicit NDO
> flush came at a cost (only approx 7 ns IIRC), but it still a cost that
> would hit the common single packet use-case.
> 
> It should be really easy to build a bundle of packets that need XDP_TX
> action, especially given you only have a single destination "port".
> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.

Not sure what you are proposing here.
Sounds like you want to extend it to multiple ports in the future?
Sure. The proposed code is easily extendable.

Or do you want to see something like a linked list of packets,
or an array of packets that the RX side prepares and then
sends as a whole array/list to the TX port?
I don't think that would be efficient, since it would mean
unnecessary copying of pointers.

> In the future, XDP need to support XDP_FWD forwarding of packets/pages
> out other interfaces.  I also want bulk transmit from day-1 here.  It
> is slightly more tricky to sort packets for multiple outgoing
> interfaces efficiently in the pool loop.

I don't think so. Multi-port is a natural extension to this set of
patches. With multi-port, the end of RX will tell multiple ports (that
were used to tx) to ring the bell. Pretty trivial, and it doesn't
involve any extra arrays or linked lists.

> But the mSwitch[1] article actually already solved this destination
> sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
> understanding the next steps, for a smarter data structure, when
> starting to have more TX "ports".  And perhaps align your single
> XDP_TX destination data structure to this future development.
> 
> [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf

I don't see how this particular paper applies to the existing kernel code.
It's great to take ideas from research papers, but real code is different.

> --Jesper
> (top post)

Since when is it OK to top-post?

> On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:
> 
> > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> > 
> > Here we introduce a xmit more like mechanism that will queue up more
> > than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> > 
> > Once RX napi budget is consumed and we exit napi RX loop, we will
> > flush (doorbell) all XDP looped packets in case there are such.
> > 
> > XDP forward packet rate:
> > 
> > Comparing XDP with and w/o xmit more (bulk transmit):
> > 
> > Streams     XDP TX       XDP TX (xmit more)
> > ---------------------------------------------------
> > 1           4.90Mpps      7.50Mpps
> > 2           9.50Mpps      14.8Mpps
> > 4           16.5Mpps      25.1Mpps
> > 8           21.5Mpps      27.5Mpps*
> > 16          24.1Mpps      27.5Mpps*
> > 
> > *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> > we will be working on the analysis and will publish the conclusions
> > later.
> > 
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
> >  2 files changed, 49 insertions(+), 17 deletions(-)
...
> > @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
> >  			mlx5e_read_mini_arr_slot(cq, cqcc);
> >  
> >  	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> >  
> > +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */

Saeed,
please make sure to remove such debug bits.
Jesper Dangaard Brouer Sept. 9, 2016, 5:36 a.m. UTC | #20
On Thu, 8 Sep 2016 20:22:04 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > I'm sorry but I have a problem with this patch!  
> 
> is it because the variable is called 'xdp_doorbell'?
> Frankly I see nothing scary in this patch.
> It extends existing code by adding a flag to ring doorbell or not.
> The end of rx napi is used as an obvious heuristic to flush the pipe.
> Looks pretty generic to me.
> The same code can be used for non-xdp as well once we figure out
> good algorithm for xmit_more in the stack.

What I'm proposing can also be used by the normal stack.
 
> > Looking at this patch, I want to bring up a fundamental architectural
> > concern with the development direction of XDP transmit.
> > 
> > 
> > What you are trying to implement, with delaying the doorbell, is
> > basically TX bulking for TX_XDP.
> > 
> >  Why not implement a TX bulking interface directly instead?!?
> > 
> > Yes, the tailptr/doorbell is the most costly operation, but why not
> > also take advantage of the benefits of bulking for other parts of the
> > code? (benefit is smaller, by every cycles counts in this area)
> > 
> > This hole XDP exercise is about avoiding having a transaction cost per
> > packet, that reads "bulking" or "bundling" of packets, where possible.
> > 
> >  Lets do bundling/bulking from the start!  
> 
> mlx4 already does bulking and this proposed mlx5 set of patches
> does bulking as well.
> See nothing wrong about it. RX side processes the packets and
> when it's done it tells TX to xmit whatever it collected.

This is doing "hidden" bulking and not really taking advantage of using
the icache more effeciently.  

Let me explain the problem I see, little more clear then, so you
hopefully see where I'm going.

Imagine you have packets intermixed towards the stack and XDP_TX. 
Every time you call the stack code, then you flush your icache.  When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX, this is a costly operation.

 
> > The reason behind the xmit_more API is that we could not change the
> > API of all the drivers.  And we found that calling an explicit NDO
> > flush came at a cost (only approx 7 ns IIRC), but it still a cost that
> > would hit the common single packet use-case.
> > 
> > It should be really easy to build a bundle of packets that need XDP_TX
> > action, especially given you only have a single destination "port".
> > And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.  
> 
> not sure what are you proposing here?
> Sounds like you want to extend it to multi port in the future?
> Sure. The proposed code is easily extendable.
> 
> Or you want to see something like a link list of packets
> or an array of packets that RX side is preparing and then
> send the whole array/list to TX port?
> I don't think that would be efficient, since it would mean
> unnecessary copy of pointers.

I just explained that it will be more efficient due to better use of
the icache.

 
> > In the future, XDP need to support XDP_FWD forwarding of packets/pages
> > out other interfaces.  I also want bulk transmit from day-1 here.  It
> > is slightly more tricky to sort packets for multiple outgoing
> > interfaces efficiently in the pool loop.  
> 
> I don't think so. Multi port is natural extension to this set of patches.
> With multi port the end of RX will tell multiple ports (that were
> used to tx) to ring the bell. Pretty trivial and doesn't involve any
> extra arrays or link lists.

So, have you solved the problem of exclusive access to a TX ring of a
remote/different net_device when sending?

In your design you assume there exist many TX rings available for other
devices to access.  In my design I also want to support devices that
don't have this HW capability and e.g. only have one TX queue.
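For such a single-queue device, something like a lock on the forwarding
xmit path seems unavoidable (a sketch only, all names hypothetical):

/* Sketch only: if the target device cannot give each forwarding CPU
 * its own TX ring, XDP_FWD needs e.g. a per-ring lock on the fast
 * path.  xdp_tx_ring, xdp_tx_desc and the helpers are hypothetical. */
static int xdp_fwd_xmit(struct net_device *dst, struct xdp_tx_desc *d)
{
        struct xdp_tx_ring *ring = dst_xdp_ring(dst, smp_processor_id());

        if (ring->shared)               /* only one HW queue available */
                spin_lock(&ring->lock);

        ring_post_frame(ring, d);       /* post descriptor + doorbell */

        if (ring->shared)
                spin_unlock(&ring->lock);
        return 0;
}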


> > But the mSwitch[1] article actually already solved this destination
> > sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
> > understanding the next steps, for a smarter data structure, when
> > starting to have more TX "ports".  And perhaps align your single
> > XDP_TX destination data structure to this future development.
> > 
> > [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf  
> 
> I don't see how this particular paper applies to the existing kernel code.
> It's great to take ideas from research papers, but real code is different.
> 
> > --Jesper
> > (top post)  
> 
> since when it's ok to top post?

What a kneejerk reaction.  When writing something general we often
reply at the top of the email and delete the rest (which makes it hard
for latecomers to follow).  I was bcc'ing some people who needed the
context, so it was a service note to you that I didn't write anything
below.

 
> > On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:
> >   
> > > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> > > 
> > > Here we introduce a xmit more like mechanism that will queue up more
> > > than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> > > 
> > > Once RX napi budget is consumed and we exit napi RX loop, we will
> > > flush (doorbell) all XDP looped packets in case there are such.
> > > 
> > > XDP forward packet rate:
> > > 
> > > Comparing XDP with and w/o xmit more (bulk transmit):
> > > 
> > > Streams     XDP TX       XDP TX (xmit more)
> > > ---------------------------------------------------
> > > 1           4.90Mpps      7.50Mpps
> > > 2           9.50Mpps      14.8Mpps
> > > 4           16.5Mpps      25.1Mpps
> > > 8           21.5Mpps      27.5Mpps*
> > > 16          24.1Mpps      27.5Mpps*
> > > 
> > > *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> > > we will be working on the analysis and will publish the conclusions
> > > later.
> > > 
> > > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > > ---
> > >  drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
> > >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
> > >  2 files changed, 49 insertions(+), 17 deletions(-)  
> ...
> > > @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
> > >  			mlx5e_read_mini_arr_slot(cq, cqcc);
> > >  
> > >  	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> > >  
> > > +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */  
> 
> Saeed,
> please make sure to remove such debug bits.
>
Alexei Starovoitov Sept. 9, 2016, 6:30 a.m. UTC | #21
On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > >  Lets do bundling/bulking from the start!  
> > 
> > mlx4 already does bulking and this proposed mlx5 set of patches
> > does bulking as well.
> > See nothing wrong about it. RX side processes the packets and
> > when it's done it tells TX to xmit whatever it collected.
> 
> This is doing "hidden" bulking and not really taking advantage of using
> the icache more effeciently.  
> 
> Let me explain the problem I see, little more clear then, so you
> hopefully see where I'm going.
> 
> Imagine you have packets intermixed towards the stack and XDP_TX. 
> Every time you call the stack code, then you flush your icache.  When
> returning to the driver code, you will have to reload all the icache
> associated with the XDP_TX, this is a costly operation.

correct. And why is that a problem?
As we discussed numerous times before, XDP is deliberately not trying
to work with only 10% of the traffic. If most of the traffic is going into
the stack there is no reason to use XDP. We have tc and netfilter
to deal with it. The cases where most of the traffic needs an
skb should not use XDP. If we try to add such use cases to XDP we
will only hurt XDP performance, increase complexity and gain nothing back.

Let's say a user wants to send 50% into the stack->tcp->socket->user and
another 50% via XDP_TX. The performance is going to be dominated by the stack.
So everything that XDP does to receive and/or transmit is irrelevant.
If we try to optimize XDP for that, we gain nothing in performance.
The user could have used netfilter just as well in such a scenario.
The performance would have been the same.

XDP only makes sense when it's servicing most of the traffic,
like L4 load balancer, ILA router or DoS prevention use cases.
Sorry for the broken record. XDP is not a solution for every
networking use case. It only makes sense for packet in and out.
When a packet goes to the host it has to go through an skb, and
optimizing that path is a task that is orthogonal to the XDP patches.

To make further progress in this discussion can we talk about
the use case you have in mind instead? Then the solution will
be much clearer, I hope.
Saeed Mahameed Sept. 9, 2016, 3:03 p.m. UTC | #22
On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
<iovisor-dev@lists.iovisor.org> wrote:
> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>>
>> I'm sorry but I have a problem with this patch!
>
> is it because the variable is called 'xdp_doorbell'?
> Frankly I see nothing scary in this patch.
> It extends existing code by adding a flag to ring doorbell or not.
> The end of rx napi is used as an obvious heuristic to flush the pipe.
> Looks pretty generic to me.
> The same code can be used for non-xdp as well once we figure out
> good algorithm for xmit_more in the stack.
>
>> Looking at this patch, I want to bring up a fundamental architectural
>> concern with the development direction of XDP transmit.
>>
>>
>> What you are trying to implement, with delaying the doorbell, is
>> basically TX bulking for TX_XDP.
>>
>>  Why not implement a TX bulking interface directly instead?!?
>>
>> Yes, the tailptr/doorbell is the most costly operation, but why not
>> also take advantage of the benefits of bulking for other parts of the
>> code? (benefit is smaller, by every cycles counts in this area)
>>
>> This hole XDP exercise is about avoiding having a transaction cost per
>> packet, that reads "bulking" or "bundling" of packets, where possible.
>>
>>  Lets do bundling/bulking from the start!

Jesper, what we did here is also bulking: instead of bulking in a
temporary list in the driver, we list the packets in the HW and once
done we transmit all at once via the xdp_doorbell indication.
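In simplified pseudo-C, the flow this patch implements is roughly
(CQE-compression handling omitted; see the full diff at the bottom of
this page):

/* Simplified shape of this patch: each XDP_TX posts its WQE
 * immediately, only the doorbell is deferred to the end of the RX
 * napi loop. */
int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
{
        struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
        bool xdp_doorbell = false;
        int work_done;

        for (work_done = 0; work_done < budget; work_done++) {
                struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);

                if (!cqe)
                        break;
                mlx5_cqwq_pop(&cq->wq);
                /* an XDP_TX verdict inside the handler posts a WQE and
                 * sets xdp_doorbell = true, without ringing the bell */
                rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
        }

        if (xdp_doorbell)
                mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);

        mlx5_cqwq_update_db_record(&cq->wq);
        return work_done;
}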

I agree with you that we can take advantage and improve the icache by
bulking first in software and then queueing all at once in the HW and
ringing one doorbell.

but I also agree with Alexei that this will introduce extra
pointer/list handling in the driver, and we need to do the comparison
between both approaches before we decide which is better.

this should be marked as future work and not required from the start.

>
> mlx4 already does bulking and this proposed mlx5 set of patches
> does bulking as well.
> See nothing wrong about it. RX side processes the packets and
> when it's done it tells TX to xmit whatever it collected.
>
>> The reason behind the xmit_more API is that we could not change the
>> API of all the drivers.  And we found that calling an explicit NDO
>> flush came at a cost (only approx 7 ns IIRC), but it still a cost that
>> would hit the common single packet use-case.
>>
>> It should be really easy to build a bundle of packets that need XDP_TX
>> action, especially given you only have a single destination "port".
>> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
>
> not sure what are you proposing here?
> Sounds like you want to extend it to multi port in the future?
> Sure. The proposed code is easily extendable.
>
> Or you want to see something like a link list of packets
> or an array of packets that RX side is preparing and then
> send the whole array/list to TX port?
> I don't think that would be efficient, since it would mean
> unnecessary copy of pointers.
>
>> In the future, XDP need to support XDP_FWD forwarding of packets/pages
>> out other interfaces.  I also want bulk transmit from day-1 here.  It
>> is slightly more tricky to sort packets for multiple outgoing
>> interfaces efficiently in the pool loop.
>
> I don't think so. Multi port is natural extension to this set of patches.
> With multi port the end of RX will tell multiple ports (that were
> used to tx) to ring the bell. Pretty trivial and doesn't involve any
> extra arrays or link lists.
>
>> But the mSwitch[1] article actually already solved this destination
>> sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
>> understanding the next steps, for a smarter data structure, when
>> starting to have more TX "ports".  And perhaps align your single
>> XDP_TX destination data structure to this future development.
>>
>> [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
>
> I don't see how this particular paper applies to the existing kernel code.
> It's great to take ideas from research papers, but real code is different.
>
>> --Jesper
>> (top post)
>
> since when it's ok to top post?
>
>> On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:
>>
>> > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>> >
>> > Here we introduce a xmit more like mechanism that will queue up more
>> > than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>> >
>> > Once RX napi budget is consumed and we exit napi RX loop, we will
>> > flush (doorbell) all XDP looped packets in case there are such.
>> >
>> > XDP forward packet rate:
>> >
>> > Comparing XDP with and w/o xmit more (bulk transmit):
>> >
>> > Streams     XDP TX       XDP TX (xmit more)
>> > ---------------------------------------------------
>> > 1           4.90Mpps      7.50Mpps
>> > 2           9.50Mpps      14.8Mpps
>> > 4           16.5Mpps      25.1Mpps
>> > 8           21.5Mpps      27.5Mpps*
>> > 16          24.1Mpps      27.5Mpps*
>> >
>> > *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
>> > we will be working on the analysis and will publish the conclusions
>> > later.
>> >
>> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>> > ---
>> >  drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
>> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
>> >  2 files changed, 49 insertions(+), 17 deletions(-)
> ...
>> > @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>> >                     mlx5e_read_mini_arr_slot(cq, cqcc);
>> >
>> >     mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>> >
>> > +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
>
> Saeed,
> please make sure to remove such debug bits.
>

Sure, will fix this.

Thanks,
Saeed.
Tom Herbert Sept. 9, 2016, 7:02 p.m. UTC | #23
On Thu, Sep 8, 2016 at 10:36 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Thu, 8 Sep 2016 20:22:04 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
>> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>> >
>> > I'm sorry but I have a problem with this patch!
>>
>> is it because the variable is called 'xdp_doorbell'?
>> Frankly I see nothing scary in this patch.
>> It extends existing code by adding a flag to ring doorbell or not.
>> The end of rx napi is used as an obvious heuristic to flush the pipe.
>> Looks pretty generic to me.
>> The same code can be used for non-xdp as well once we figure out
>> good algorithm for xmit_more in the stack.
>
> What I'm proposing can also be used by the normal stack.
>
>> > Looking at this patch, I want to bring up a fundamental architectural
>> > concern with the development direction of XDP transmit.
>> >
>> >
>> > What you are trying to implement, with delaying the doorbell, is
>> > basically TX bulking for TX_XDP.
>> >
>> >  Why not implement a TX bulking interface directly instead?!?
>> >
>> > Yes, the tailptr/doorbell is the most costly operation, but why not
>> > also take advantage of the benefits of bulking for other parts of the
>> > code? (benefit is smaller, by every cycles counts in this area)
>> >
>> > This hole XDP exercise is about avoiding having a transaction cost per
>> > packet, that reads "bulking" or "bundling" of packets, where possible.
>> >
>> >  Lets do bundling/bulking from the start!
>>
>> mlx4 already does bulking and this proposed mlx5 set of patches
>> does bulking as well.
>> See nothing wrong about it. RX side processes the packets and
>> when it's done it tells TX to xmit whatever it collected.
>
> This is doing "hidden" bulking and not really taking advantage of using
> the icache more effeciently.
>
> Let me explain the problem I see, little more clear then, so you
> hopefully see where I'm going.
>
> Imagine you have packets intermixed towards the stack and XDP_TX.
> Every time you call the stack code, then you flush your icache.  When
> returning to the driver code, you will have to reload all the icache
> associated with the XDP_TX, this is a costly operation.
>
>
>> > The reason behind the xmit_more API is that we could not change the
>> > API of all the drivers.  And we found that calling an explicit NDO
>> > flush came at a cost (only approx 7 ns IIRC), but it still a cost that
>> > would hit the common single packet use-case.
>> >
>> > It should be really easy to build a bundle of packets that need XDP_TX
>> > action, especially given you only have a single destination "port".
>> > And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
>>
>> not sure what are you proposing here?
>> Sounds like you want to extend it to multi port in the future?
>> Sure. The proposed code is easily extendable.
>>
>> Or you want to see something like a link list of packets
>> or an array of packets that RX side is preparing and then
>> send the whole array/list to TX port?
>> I don't think that would be efficient, since it would mean
>> unnecessary copy of pointers.
>
> I just explain it will be more efficient due to better use of icache.
>
>
>> > In the future, XDP need to support XDP_FWD forwarding of packets/pages
>> > out other interfaces.  I also want bulk transmit from day-1 here.  It
>> > is slightly more tricky to sort packets for multiple outgoing
>> > interfaces efficiently in the pool loop.
>>
>> I don't think so. Multi port is natural extension to this set of patches.
>> With multi port the end of RX will tell multiple ports (that were
>> used to tx) to ring the bell. Pretty trivial and doesn't involve any
>> extra arrays or link lists.
>
> So, have you solved the problem exclusive access to a TX ring of a
> remote/different net_device when sending?
>
> In you design you assume there exist many TX ring available for other
> devices to access.  In my design I also want to support devices that
> doesn't have this HW capability, and e.g. only have one TX queue.
>
Right, but segregating TX queues used by the stack from those used
by XDP is pretty fundamental to the design. If we start mixing them,
then we need to pull several features (such as BQL, which seems like
what you're proposing) into the XDP path. If this starts to slow
things down, or we need to reinvent a bunch of existing features to
avoid skbuffs, that seems to run contrary to the "as simple as
possible" model for XDP -- may as well use the regular stack at that
point maybe...

Tom

>
>> > But the mSwitch[1] article actually already solved this destination
>> > sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
>> > understanding the next steps, for a smarter data structure, when
>> > starting to have more TX "ports".  And perhaps align your single
>> > XDP_TX destination data structure to this future development.
>> >
>> > [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
>>
>> I don't see how this particular paper applies to the existing kernel code.
>> It's great to take ideas from research papers, but real code is different.
>>
>> > --Jesper
>> > (top post)
>>
>> since when it's ok to top post?
>
> What a kneejerk reaction.  When writing something general we often
> reply to the top of the email, and then often delete the rest (which
> makes it hard for later comers to follow).  I was bcc'ing some people,
> which needed the context, so it was a service note to you, that I
> didn't write anything below.
>
>
>> > On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:
>> >
>> > > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>> > >
>> > > Here we introduce a xmit more like mechanism that will queue up more
>> > > than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>> > >
>> > > Once RX napi budget is consumed and we exit napi RX loop, we will
>> > > flush (doorbell) all XDP looped packets in case there are such.
>> > >
>> > > XDP forward packet rate:
>> > >
>> > > Comparing XDP with and w/o xmit more (bulk transmit):
>> > >
>> > > Streams     XDP TX       XDP TX (xmit more)
>> > > ---------------------------------------------------
>> > > 1           4.90Mpps      7.50Mpps
>> > > 2           9.50Mpps      14.8Mpps
>> > > 4           16.5Mpps      25.1Mpps
>> > > 8           21.5Mpps      27.5Mpps*
>> > > 16          24.1Mpps      27.5Mpps*
>> > >
>> > > *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
>> > > we will be working on the analysis and will publish the conclusions
>> > > later.
>> > >
>> > > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>> > > ---
>> > >  drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
>> > >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
>> > >  2 files changed, 49 insertions(+), 17 deletions(-)
>> ...
>> > > @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>> > >                   mlx5e_read_mini_arr_slot(cq, cqcc);
>> > >
>> > >   mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>> > >
>> > > +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
>>
>> Saeed,
>> please make sure to remove such debug bits.
>>
>
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
Jesper Dangaard Brouer Sept. 12, 2016, 8:56 a.m. UTC | #24
On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > > >  Lets do bundling/bulking from the start!    
> > > 
> > > mlx4 already does bulking and this proposed mlx5 set of patches
> > > does bulking as well.
> > > See nothing wrong about it. RX side processes the packets and
> > > when it's done it tells TX to xmit whatever it collected.  
> > 
> > This is doing "hidden" bulking and not really taking advantage of using
> > the icache more effeciently.  
> > 
> > Let me explain the problem I see, little more clear then, so you
> > hopefully see where I'm going.
> > 
> > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > Every time you call the stack code, then you flush your icache.  When
> > returning to the driver code, you will have to reload all the icache
> > associated with the XDP_TX, this is a costly operation.  
> 
> correct. And why is that a problem?

It is good that you can see and acknowledge the I-cache problem.

XDP is all about performance.  What I hear is that you are arguing
against a model that will yield better performance, which does not make
sense to me.  Let me explain this again, in another way.

This is a mental model switch.  Stop seeing the lowest driver RX as
something that works on a per-packet basis.  Maybe it is easier to
understand if we instead see this as vector processing?  This is about
having a vector of packets, where we apply some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle (directly measurable with perf, while I-cache
is not directly measurable).


Let's assume everything fits into the I-cache (XDP+driver code). The
CPU front-end still has to decode the instructions from the I-cache
into micro-ops.  The next level of optimization is to reuse the
decoded I-cache by running it on all elements in the packet-vector.

The Intel "64 and IA-32 Architectures Optimization Reference Manual"
(section 3.4.2.6 "Optimization for Decoded ICache"[1][2]) states that
each hot code block should be less than about 500 instructions.  Thus,
the different "stages" working on the packet-vector need to be rather
small and compact.

[1] http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
[2] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
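
As a sketch of what such small vector "stages" could look like in a
driver RX path (all types and helper names below are hypothetical, not
existing driver code):

/* Sketch only: run each small stage over the whole packet vector
 * before moving to the next stage, so each stage stays within the
 * decoded-icache budget. */
#define RX_VECTOR_SZ 16

struct rx_vector {
        struct packet *pkts[RX_VECTOR_SZ];      /* hypothetical type */
        int cnt;
};

static void rx_stage_prefetch(struct rx_vector *v)
{
        int i;

        for (i = 0; i < v->cnt; i++)
                prefetch(v->pkts[i]->data);
}

static void rx_stage_xdp(struct rx_vector *v, struct bpf_prog *prog)
{
        int i;

        for (i = 0; i < v->cnt; i++)
                v->pkts[i]->verdict = run_xdp_prog(prog, v->pkts[i]);
}

static void rx_stage_deliver(struct rx_vector *v)
{
        int i;

        for (i = 0; i < v->cnt; i++) {
                if (v->pkts[i]->verdict == XDP_TX)
                        queue_for_xdp_tx(v->pkts[i]);
                else
                        queue_for_stack(v->pkts[i]);
        }
}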



Notice: The same mental model switch applies to delivering packets to
the regular netstack.  I've brought this up before[3].  Instead of
flushing the driver's I-cache for every packet by calling the stack,
let us instead bundle up N packets in the driver before calling the
stack.  I showed a 10% speedup with a naive implementation of this
approach.  Edward Cree also showed[4] a 10% performance boost, and
went further into the stack, showing a 25% increase.

A goal is also to make optimizing netstack code-size independent of
the driver code-size, by separating the netstack's I-cache usage from
the driver's.

[3] http://lists.openwall.net/netdev/2016/01/15/51
[4] http://lists.openwall.net/netdev/2016/04/19/89
Jesper Dangaard Brouer Sept. 12, 2016, 10:15 a.m. UTC | #25
On Fri, 9 Sep 2016 18:03:09 +0300
Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:

> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
> <iovisor-dev@lists.iovisor.org> wrote:
> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:  
> >>
> >> I'm sorry but I have a problem with this patch!  
> >> Looking at this patch, I want to bring up a fundamental architectural
> >> concern with the development direction of XDP transmit.
> >>
> >>
> >> What you are trying to implement, with delaying the doorbell, is
> >> basically TX bulking for TX_XDP.
> >>
> >>  Why not implement a TX bulking interface directly instead?!?
> >>
> >> Yes, the tailptr/doorbell is the most costly operation, but why not
> >> also take advantage of the benefits of bulking for other parts of the
> >> code? (benefit is smaller, by every cycles counts in this area)
> >>
> >> This hole XDP exercise is about avoiding having a transaction cost per
> >> packet, that reads "bulking" or "bundling" of packets, where possible.
> >>
> >>  Lets do bundling/bulking from the start!  
> 
> Jesper, what we did here is also bulking, instead of bulking in a
> temporary list in the driver we list the packets in the HW and once
> done we transmit all at once via the xdp_doorbell indication.
> 
> I agree with you that we can take advantage and improve the icache by
> bulking first in software and then queue all at once in the hw then
> ring one doorbell.
> 
> but I also agree with Alexei that this will introduce an extra
> pointer/list handling in the driver and we need to do the comparison
> between both approaches before we decide which is better.

I welcome implementing both approaches and benchmarking them against
each other; I'll gladly dedicate time for this!

I'm reacting so loudly because this is a mental model switch that
needs to be applied to the full driver RX path, also for normal stack
delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
there is a 10%-25% perf gain to be had here.

The key point is to stop seeing the lowest driver RX as something that
works on a per-packet basis.  It might be easier to view this as a kind
of vector processing.  This is about having a vector of packets, where
we apply some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle.  The next level of optimization (for >= Sandy
Bridge CPUs) is to make these vector processing stages small enough to fit
into the CPU's decoded-I-cache section.


It might also be important to mention that, for netstack delivery, I
don't imagine bulking 64 packets.  Instead, I imagine doing 8-16
packets.  Why? Because the NIC-HW runs independently and has the
opportunity to deliver more frames into the RX ring queue while the
stack is (slowly) processing packets.  You can view this as "bulking"
from the RX ring queue, with a "look-back" before exiting the NAPI
poll loop.
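
A rough sketch of that look-back bundling (hypothetical helper names,
not existing driver or stack code; in particular the bundle-receive
entry point into the stack does not exist today):

/* Sketch only: hand 8-16 skbs to the stack per call, and look back at
 * the RX ring before leaving the poll loop, since the NIC may have
 * delivered more frames meanwhile. */
#define STACK_BUNDLE_SZ 16

static int napi_poll_bundled(struct rx_ring *ring, int budget)
{
        struct sk_buff *bundle[STACK_BUNDLE_SZ];
        int work_done = 0;

        while (work_done < budget) {
                int n = rx_ring_dequeue(ring, bundle, STACK_BUNDLE_SZ);

                if (!n)
                        break;  /* look-back: ring really is empty */

                /* one call into the stack for the whole bundle keeps
                 * the driver's icache footprint separate from the
                 * stack's */
                netif_receive_skb_bundle(bundle, n);
                work_done += n;
        }
        return work_done;
}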


> this must be marked as future work and not have this from the start.

We both know that statement is BS, and the other approach will never be
implemented once this patch is accepted upstream.


> > mlx4 already does bulking and this proposed mlx5 set of patches
> > does bulking as well.

I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
way IMHO.  And now mlx5 is building on the same principle. That is why
I'm yelling STOP.


> >> The reason behind the xmit_more API is that we could not change the
> >> API of all the drivers.  And we found that calling an explicit NDO
> >> flush came at a cost (only approx 7 ns IIRC), but it still a cost that
> >> would hit the common single packet use-case.
> >>
> >> It should be really easy to build a bundle of packets that need XDP_TX
> >> action, especially given you only have a single destination "port".
> >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.  


[1] http://lists.openwall.net/netdev/2016/04/19/89
[2] http://lists.openwall.net/netdev/2016/01/15/51
Jesper Dangaard Brouer Sept. 12, 2016, 11:30 a.m. UTC | #26
On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
[...]
> > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > Every time you call the stack code, then you flush your icache.  When
> > returning to the driver code, you will have to reload all the icache
> > associated with the XDP_TX, this is a costly operation.  
> 
[...]
> To make further progress in this discussion can we talk about
> the use case you have in mind instead? Then solution will
> be much clear, I hope.

The DDoS use-case _is_ affected by this "hidden" bulking design.

Let's say I want to implement a DDoS facility. Instead of just
dropping the malicious packets, I want to see the bad packets.  I
implement this by rewriting the destination-MAC to be my monitor
machine and then XDP_TX'ing the packet.
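
A minimal sketch of such a program (the monitor MAC and the
"malicious" test below are placeholders, not a real classifier):

#include <linux/bpf.h>
#include <linux/if_ether.h>

#ifndef SEC
#define SEC(NAME) __attribute__((section(NAME), used))
#endif

SEC("xdp")
int xdp_ddos_monitor(struct xdp_md *ctx)
{
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        /* placeholder MAC of the monitor machine */
        const unsigned char mon_mac[ETH_ALEN] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55};
        int malicious = 0;      /* placeholder: real classification goes here */

        if (data + sizeof(*eth) > data_end)
                return XDP_PASS;

        /* classify the packet and set malicious accordingly */

        if (malicious) {
                /* redirect the bad packet to the monitor box instead
                 * of just dropping it: rewrite dst MAC and bounce it
                 * back out the same port */
                __builtin_memcpy(eth->h_dest, mon_mac, ETH_ALEN);
                return XDP_TX;
        }
        return XDP_PASS;
}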

In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
of the traffic is delivered to the stack. (See note 1)

Once the DDoS attack starts, the traffic pattern changes, and XDP
should (hopefully only) catch the malicious traffic (the monitor machine
can help diagnose false positives).  Now, due to interleaving the DDoS
traffic with the clean traffic, the efficiency of XDP_TX is reduced due
to more icache misses...



Note(1): Notice I have already demonstrated that loading an XDP/eBPF
program with 100% delivery to the stack actually slows down the
normal stack.  This is due to hitting a bottleneck in the page
allocator.  I'm working on removing that bottleneck with page_pool,
and that solution is orthogonal to this problem.
It is actually an excellent argument for why you would want to run a
DDoS XDP filter only on a restricted number of RX queues.
Alexei Starovoitov Sept. 12, 2016, 5:53 p.m. UTC | #27
On Mon, Sep 12, 2016 at 10:56:55AM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 8 Sep 2016 23:30:50 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > > > >  Lets do bundling/bulking from the start!    
> > > > 
> > > > mlx4 already does bulking and this proposed mlx5 set of patches
> > > > does bulking as well.
> > > > See nothing wrong about it. RX side processes the packets and
> > > > when it's done it tells TX to xmit whatever it collected.  
> > > 
> > > This is doing "hidden" bulking and not really taking advantage of using
> > > the icache more effeciently.  
> > > 
> > > Let me explain the problem I see, little more clear then, so you
> > > hopefully see where I'm going.
> > > 
> > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > Every time you call the stack code, then you flush your icache.  When
> > > returning to the driver code, you will have to reload all the icache
> > > associated with the XDP_TX, this is a costly operation.  
> > 
> > correct. And why is that a problem?
> 
> It is good that you can see and acknowledge the I-cache problem.
> 
> XDP is all about performance.  What I hear is, that you are arguing
> against a model that will yield better performance, that does not make
> sense to me.  Let me explain this again, in another way.

I'm arguing against your proposal, which I think will be more complex and
lower performance than what Saeed and the team already implemented.
Therefore I don't think it's fair to block the patch and ask them to
reimplement it just to test an idea that may or may not improve performance.

Getting maximum performance is tricky. Good is better than perfect.
It's important to argue about user space visible bits upfront, but
on the kernel performance side we should build/test incrementally.
This particular patch 11/11 is simple, easy to review and provides
good performance. What's not to like?
Alexei Starovoitov Sept. 12, 2016, 7:56 p.m. UTC | #28
On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 8 Sep 2016 23:30:50 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > Every time you call the stack code, then you flush your icache.  When
> > > returning to the driver code, you will have to reload all the icache
> > > associated with the XDP_TX, this is a costly operation.  
> > 
> [...]
> > To make further progress in this discussion can we talk about
> > the use case you have in mind instead? Then solution will
> > be much clear, I hope.
> 
> The DDoS use-case _is_ affected by this "hidden" bulking design.
> 
> Lets say, I want to implement a DDoS facility. Instead of just
> dropping the malicious packets, I want to see the bad packets.  I
> implement this by rewriting the destination-MAC to be my monitor
> machine and then XDP_TX the packet.

not following the use case. you want to implement a DDoS generator?
Or just forward all bad packets from the affected host to another host
in the same rack? so two servers will be spammed with traffic and
there is even more load on the ToR? I really don't see how this is
useful for anything but stress testing.

> In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
> of the traffic is delivered to the stack. (See note 1)

hmm. DoS prevention use case is when 99% of the traffic is dropped.

> Once the DDoS attack starts, then the traffic pattern changes, and XDP
> should (hopefully only) catch the malicious traffic (monitor machine can
> help diagnose false positive).  Now, due to interleaving the DDoS
> traffic with the clean traffic, then efficiency of XDP_TX is reduced due to
> more icache misses...
> 
> 
> 
> Note(1): Notice I have already demonstrated that loading a XDP/eBPF
> program with 100% delivery to the stack, actually slows down the
> normal stack.  This is due to hitting a bottleneck in the page
> allocator.  I'm working removing that bottleneck with page_pool, and
> that solution is orthogonal to this problem.

sure. no one is arguing against improving the page allocator.

>  It is actually an excellent argument, for why you would want to run a
> DDoS XDP filter only on a restricted number of RX queues.

no. it's the opposite. If the host is under DoS there is no way
the host can tell in advance which rx queue will be seeing bad packets.
Jesper Dangaard Brouer Sept. 12, 2016, 8:48 p.m. UTC | #29
On Mon, 12 Sep 2016 12:56:28 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 8 Sep 2016 23:30:50 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >   
> > > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:  
> > [...]  
> > > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > > Every time you call the stack code, then you flush your icache.  When
> > > > returning to the driver code, you will have to reload all the icache
> > > > associated with the XDP_TX, this is a costly operation.    
> > >   
> > [...]  
> > > To make further progress in this discussion can we talk about
> > > the use case you have in mind instead? Then solution will
> > > be much clear, I hope.  
> > 
> > The DDoS use-case _is_ affected by this "hidden" bulking design.
> > 
> > Lets say, I want to implement a DDoS facility. Instead of just
> > dropping the malicious packets, I want to see the bad packets.  I
> > implement this by rewriting the destination-MAC to be my monitor
> > machine and then XDP_TX the packet.  
> 
> not following the use case. you want to implement a DDoS generator?
> Or just forward all bad packets from affected host to another host
> in the same rack? so two servers will be spammed with traffic and
> even more load on the tor? I really don't see how this is useful
> for anything but stress testing.

As I wrote below, the purpose of the monitor machine is to diagnose
false positives.  If you worry about the added load, I would either
forward out another interface (which is not supported yet) or simply
sample the packets being forwarded to the monitor host.

> > In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
> > of the traffic is delivered to the stack. (See note 1)  
> 
> hmm. DoS prevention use case is when 99% of the traffic is dropped.

As I wrote below, until the DDoS attack starts, all packets are
delivered to the stack.

> > Once the DDoS attack starts, then the traffic pattern changes, and XDP
> > should (hopefully only) catch the malicious traffic (monitor machine can
> > help diagnose false positive).  Now, due to interleaving the DDoS
> > traffic with the clean traffic, then efficiency of XDP_TX is reduced due to
> > more icache misses...
> > 
> > 
> > 
> > Note(1): Notice I have already demonstrated that loading a XDP/eBPF
> > program with 100% delivery to the stack, actually slows down the
> > normal stack.  This is due to hitting a bottleneck in the page
> > allocator.  I'm working removing that bottleneck with page_pool, and
> > that solution is orthogonal to this problem.  
> 
> sure. no one arguing against improving page allocator.
> 
> >  It is actually an excellent argument, for why you would want to run a
> > DDoS XDP filter only on a restricted number of RX queues.  
> 
> no. it's the opposite. If the host is under DoS there is no way
> the host can tell in advance which rx queue will be seeing bad
> packets.

Sorry, this note was not related to the DoS use-case.  You
misunderstood it.
Tom Herbert Sept. 12, 2016, 9:45 p.m. UTC | #30
On Mon, Sep 12, 2016 at 3:15 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Fri, 9 Sep 2016 18:03:09 +0300
> Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>
>> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
>> <iovisor-dev@lists.iovisor.org> wrote:
>> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>> >>
>> >> I'm sorry but I have a problem with this patch!
>> >> Looking at this patch, I want to bring up a fundamental architectural
>> >> concern with the development direction of XDP transmit.
>> >>
>> >>
>> >> What you are trying to implement, with delaying the doorbell, is
>> >> basically TX bulking for TX_XDP.
>> >>
>> >>  Why not implement a TX bulking interface directly instead?!?
>> >>
>> >> Yes, the tailptr/doorbell is the most costly operation, but why not
>> >> also take advantage of the benefits of bulking for other parts of the
>> >> code? (benefit is smaller, by every cycles counts in this area)
>> >>
>> >> This hole XDP exercise is about avoiding having a transaction cost per
>> >> packet, that reads "bulking" or "bundling" of packets, where possible.
>> >>
>> >>  Lets do bundling/bulking from the start!
>>
>> Jesper, what we did here is also bulking, instead of bulking in a
>> temporary list in the driver we list the packets in the HW and once
>> done we transmit all at once via the xdp_doorbell indication.
>>
>> I agree with you that we can take advantage and improve the icache by
>> bulking first in software and then queue all at once in the hw then
>> ring one doorbell.
>>
>> but I also agree with Alexei that this will introduce an extra
>> pointer/list handling in the driver and we need to do the comparison
>> between both approaches before we decide which is better.
>
> I welcome implementing both approaches and benchmarking them against
> each-other, I'll gladly dedicate time for this!
>
Yes, please implement this so we can have something clear to evaluate
and compare. There is far too much spewing of "expert opinions"
happening here :-(

> I'm reacting so loudly, because this is a mental model switch, that
> need to be applied to the full drivers RX path. Also for normal stack
> delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> there is between 10%-25% perf gain here.
>
> The key point is stop seeing the lowest driver RX as something that
> works on a per packet basis.  It might be easier to view this as a kind
> of vector processing.  This is about having a vector of packets, where
> we apply some action/operation.
>
> This is about using the CPU more efficiently, getting it to do more
> instructions per cycle.  The next level of optimization (for >= Sandy
> Bridge CPUs) is to make these vector processing stages small enough to fit
> into the CPUs decoded-I-cache section.
>
>
> It might also be important to mention, that for netstack delivery, I
> don't imagine bulking 64 packets.  Instead, I imagine doing 8-16
> packets.  Why, because the NIC-HW runs independently and have the
> opportunity to deliver more frames in the RX ring queue, while the
> stack "slow" processed packets.  You can view this as "bulking" from
> the RX ring queue, with a "look-back" before exiting the NAPI poll loop.
>
>
>> this must be marked as future work and not have this from the start.
>
> We both know that statement is BS, and the other approach will never be
> implemented once this patch is accepted upstream.
>
>
>> > mlx4 already does bulking and this proposed mlx5 set of patches
>> > does bulking as well.
>
> I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
> way IMHO.  And now mlx5 is building on the same principle. That is why
> I'm yelling STOP.
>
>
>> >> The reason behind the xmit_more API is that we could not change the
>> >> API of all the drivers.  And we found that calling an explicit NDO
>> >> flush came at a cost (only approx 7 ns IIRC), but it still a cost that
>> >> would hit the common single packet use-case.
>> >>
>> >> It should be really easy to build a bundle of packets that need XDP_TX
>> >> action, especially given you only have a single destination "port".
>> >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
>
>
> [1] http://lists.openwall.net/netdev/2016/04/19/89
> [2] http://lists.openwall.net/netdev/2016/01/15/51
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
Edward Cree Sept. 13, 2016, 3:20 p.m. UTC | #31
On 12/09/16 11:15, Jesper Dangaard Brouer wrote:
> I'm reacting so loudly, because this is a mental model switch, that
> need to be applied to the full drivers RX path. Also for normal stack
> delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> there is between 10%-25% perf gain here.
>
> [1] http://lists.openwall.net/netdev/2016/04/19/89
> [2] http://lists.openwall.net/netdev/2016/01/15/51
BTW, I'd also still rather like to see that happen; I never really
understood the objections people had to those patches when I posted them.  I
still believe that dealing in skb-lists instead of skbs, and thus
'automatically' bulking similar packets, is better than trying to categorise
packets into flows early on based on some set of keys.  The problem with the
latter approach is that there are now two definitions of "similar":
1) the set of fields used to index the flow
2) what will actually cause the stack's behaviour to differ if not using the
cached values.
Quite apart from the possibility of bugs if one changes but not the other,
this forces (1) to be conservative, only considering things "similar" if the
entire stack will treat them as such.  Whereas with bundling, the stack can
keep packets together until they reach a layer at which they are no longer
"similar" enough.  Thus, for instance, packets with the same IP 3-tuple but
different port numbers can be grouped together for IP-layer processing, then
split apart for L4.
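As a rough sketch of the bundling shape I mean (hypothetical helpers,
not the actual patches):

/* Sketch only: process a bundle of skbs through one layer at a time,
 * splitting the bundle only when the next layer needs to treat the
 * packets differently.  skb_bundle, ip_layer_process and
 * l4_deliver_bundle are hypothetical. */
struct skb_bundle {
        struct sk_buff *skbs[16];
        int cnt;
};

static void ip_rcv_bundle(struct skb_bundle *b)
{
        struct skb_bundle l4 = { .cnt = 0 };
        int i;

        for (i = 0; i < b->cnt; i++) {
                struct sk_buff *skb = b->skbs[i];

                /* IP-layer work (route lookup etc.) done while the
                 * IP code is hot in the icache */
                if (ip_layer_process(skb))
                        l4.skbs[l4.cnt++] = skb;
                else
                        kfree_skb(skb);
        }

        /* only now move the surviving packets to L4, which may split
         * them further, e.g. per destination port */
        l4_deliver_bundle(&l4);
}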

-Ed
Eric Dumazet Sept. 13, 2016, 3:58 p.m. UTC | #32
On Tue, 2016-09-13 at 16:20 +0100, Edward Cree wrote:
> On 12/09/16 11:15, Jesper Dangaard Brouer wrote:
> > I'm reacting so loudly, because this is a mental model switch, that
> > need to be applied to the full drivers RX path. Also for normal stack
> > delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> > there is between 10%-25% perf gain here.
> >
> > [1] http://lists.openwall.net/netdev/2016/04/19/89
> > [2] http://lists.openwall.net/netdev/2016/01/15/51
> BTW, I'd also still rather like to see that happen, I never really
> understood the objections people had to those patches when I posted them.  I
> still believe that dealing in skb-lists instead of skbs, and thus
> 'automatically' bulking similar packets, is better than trying to categorise
> packets into flows early on based on some set of keys.  The problem with the
> latter approach is that there are now two definitions of "similar":
> 1) the set of fields used to index the flow
> 2) what will actually cause the stack's behaviour to differ if not using the
> cached values.
> Quite apart from the possibility of bugs if one changes but not the other,
> this forces (1) to be conservative, only considering things "similar" if the
> entire stack will.  Whereas with bundling, the stack can keep packets
> together until they reach a layer at which they are no longer "similar"
> enough.  Thus, for instance, packets with the same IP 3-tuple but different
> port numbers can be grouped together for IP layer processing, then split
> apart for L4.

To be fair, you never showed us the numbers for DDoS traffic, and you did
not show us how typical TCP + netfilter-module kinds of traffic would be
handled.

Show us real numbers, not synthetic ones, say when receiving traffic on
100,000 or more TCP sockets.

We also care about icache pressure, and GRO/TSO already provide
bundling where applicable, without adding insane complexity in the
stack.

Just look at how complex the software fallbacks for GSO/checksumming
are, how many bugs we had to fix... And this is only at the edge of our
stack.
Jesper Dangaard Brouer Sept. 13, 2016, 4:47 p.m. UTC | #33
On Tue, 13 Sep 2016 08:58:30 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> We also care about icache pressure, and GRO/TSO already provides
> bundling where it is applicable, without adding insane complexity in
> the stacks.

Sorry, I cannot resist. The GRO code is really bad regarding icache
pressure/usage, due to how everything is function pointers calling
function pointers, even when the common case is calling the function
defined right next to it in the same C-file (which would otherwise
usually be inlined).  I can easily get 10% more performance for UDP
use-cases by simply disabling the GRO code, and I measure a significant
drop in icache misses.

Edward's solution should lower icache pressure.
diff mbox

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index df2c9e0..6846208 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -265,7 +265,8 @@  struct mlx5e_cq {
 
 struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
-				       struct mlx5_cqe64 *cqe);
+				       struct mlx5_cqe64 *cqe,
+				       bool *xdp_doorbell);
 typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,
 				  u16 ix);
 
@@ -742,8 +743,10 @@  void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
 
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle);
-void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			 bool *xdp_doorbell);
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			       bool *xdp_doorbell);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
 int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
 int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 912a0e2..ed93251 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -117,7 +117,8 @@  static inline void mlx5e_decompress_cqe_no_hash(struct mlx5e_rq *rq,
 static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
 					     struct mlx5e_cq *cq,
 					     int update_owner_only,
-					     int budget_rem)
+					     int budget_rem,
+					     bool *xdp_doorbell)
 {
 	u32 cqcc = cq->wq.cc + update_owner_only;
 	u32 cqe_count;
@@ -131,7 +132,7 @@  static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
 			mlx5e_read_mini_arr_slot(cq, cqcc);
 
 		mlx5e_decompress_cqe_no_hash(rq, cq, cqcc);
-		rq->handle_rx_cqe(rq, &cq->title);
+		rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
 	}
 	mlx5e_cqes_update_owner(cq, cq->wq.cc, cqcc - cq->wq.cc);
 	cq->wq.cc = cqcc;
@@ -143,15 +144,16 @@  static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
 
 static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
 					      struct mlx5e_cq *cq,
-					      int budget_rem)
+					      int budget_rem,
+					      bool *xdp_doorbell)
 {
 	mlx5e_read_title_slot(rq, cq, cq->wq.cc);
 	mlx5e_read_mini_arr_slot(cq, cq->wq.cc + 1);
 	mlx5e_decompress_cqe(rq, cq, cq->wq.cc);
-	rq->handle_rx_cqe(rq, &cq->title);
+	rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
 	cq->mini_arr_idx++;
 
-	return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem) - 1;
+	return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem, xdp_doorbell) - 1;
 }
 
 void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val)
@@ -670,23 +672,36 @@  static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
 	wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
 	sq->pc += MLX5E_XDP_TX_WQEBBS;
 
-	/* TODO: xmit more */
+	/* mlx5e_xmit_xdp_doorbell will be called after the RX napi loop */
+	return true;
+}
+
+static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_sq *sq)
+{
+	struct mlx5_wq_cyc *wq = &sq->wq;
+	struct mlx5e_tx_wqe *wqe;
+	u16 pi = (sq->pc - MLX5E_XDP_TX_WQEBBS) & wq->sz_m1; /* last pi */
+
+	wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
+
 	wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
 	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
 
+#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
 	/* fill sq edge with nops to avoid wqe wrap around */
 	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
 		sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
 		mlx5e_send_nop(sq, false);
 	}
-	return true;
+#endif
 }
 
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    const struct bpf_prog *prog,
 				    struct mlx5e_dma_info *di,
-				    void *data, u16 len)
+				    void *data, u16 len,
+				    bool *xdp_doorbell)
 {
 	bool consumed = false;
 	struct xdp_buff xdp;
@@ -705,7 +720,13 @@  static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
 						MLX5_RX_HEADROOM,
 						len);
+		if (unlikely(!consumed) && (*xdp_doorbell)) {
+			/* SQ is full, ring doorbell */
+			mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
+			*xdp_doorbell = false;
+		}
 		rq->stats.xdp_tx += consumed;
+		*xdp_doorbell |= consumed;
 		return consumed;
 	default:
 		bpf_warn_invalid_xdp_action(act);
@@ -720,7 +741,8 @@  static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 	return false;
 }
 
-void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			 bool *xdp_doorbell)
 {
 	struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
 	struct mlx5e_dma_info *di;
@@ -752,7 +774,7 @@  void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 		goto wq_ll_pop;
 	}
 
-	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
+	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt, xdp_doorbell))
 		goto wq_ll_pop; /* page/packet was consumed by XDP */
 
 	skb = build_skb(va, RQ_PAGE_SIZE(rq));
@@ -814,7 +836,8 @@  static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
 	skb->len  += headlen;
 }
 
-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			       bool *xdp_doorbell)
 {
 	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
 	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
@@ -860,13 +883,15 @@  mpwrq_cqe_out:
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 {
 	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
+	bool xdp_doorbell = false;
 	int work_done = 0;
 
 	if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state)))
 		return 0;
 
 	if (cq->decmprs_left)
-		work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget);
+		work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget,
+							&xdp_doorbell);
 
 	for (; work_done < budget; work_done++) {
 		struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);
@@ -877,15 +902,19 @@  int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 		if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) {
 			work_done +=
 				mlx5e_decompress_cqes_start(rq, cq,
-							    budget - work_done);
+							    budget - work_done,
+							    &xdp_doorbell);
 			continue;
 		}
 
 		mlx5_cqwq_pop(&cq->wq);
 
-		rq->handle_rx_cqe(rq, cqe);
+		rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
 	}
 
+	if (xdp_doorbell)
+		mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
+
 	mlx5_cqwq_update_db_record(&cq->wq);
 
 	/* ensure cq space is freed before enabling more cqes */