[bpf-next,1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy

Message ID: 1604498942-24274-2-git-send-email-magnus.karlsson@gmail.com
State: Not Applicable
Delegated to: BPF Maintainers
Series: xsk: i40e: Tx performance improvements

Checks

Context                            Check    Description
jkicinski/patch_count              success
jkicinski/cover_letter             success
jkicinski/fixes_present            success
jkicinski/tree_selection           success  Clearly marked for bpf-next
jkicinski/subject_prefix           success
jkicinski/source_inline            success  Was 0 now: 0
jkicinski/verify_signedoff         success
jkicinski/module_param             success  Was 0 now: 0
jkicinski/build_32bit              fail     Errors and warnings before: 5 this patch: 5
jkicinski/kdoc                     success  Errors and warnings before: 0 this patch: 0
jkicinski/verify_fixes             success
jkicinski/checkpatch               fail
jkicinski/build_allmodconfig_warn  success  Errors and warnings before: 1 this patch: 1
jkicinski/header_inline            success
jkicinski/stable                   success  Stable not CCed

Commit Message

Magnus Karlsson Nov. 4, 2020, 2:08 p.m. UTC
From: Magnus Karlsson <magnus.karlsson@intel.com>

Introduce lazy Tx completions when a queue is used for AF_XDP
zero-copy. In the current design, each time we get into the NAPI poll
loop we try to complete as many Tx packets as possible from the
NIC. This is performed by reading the head pointer register in the NIC
that tells us how many packets have been completed. Reading this
register is expensive as it goes across PCIe, so let us try to limit the
number of times it is read by only completing Tx packets to user-space
when the number of available descriptors in the Tx HW ring is below
some threshold. This will decrease the number of reads issued to the
NIC and improve performance by 1.5% - 2% for the l2fwd xdpsock
microbenchmark.

The threshold is set to the minimum possible size that the HW ring can
have. This is so that we do not run into a scenario where the threshold
is higher than the configured number of descriptors in the HW ring.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
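
In outline (the full diff is at the bottom of this page), the change is a
single early exit at the top of the Tx clean routine; everything after the
check is the pre-existing completion path:

    #define I40E_TX_COMPLETION_THRESHOLD I40E_MIN_NUM_DESCRIPTORS

    bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi, struct i40e_ring *tx_ring)
    {
            /* Plenty of free descriptors left in the HW ring: skip the
             * expensive PCIe read of the head pointer and complete
             * nothing on this NAPI pass.
             */
            if (I40E_DESC_UNUSED(tx_ring) >= I40E_TX_COMPLETION_THRESHOLD)
                    goto out_xmit;

            /* ... existing completion logic ... */
    }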

Comments

Jakub Kicinski Nov. 4, 2020, 11:33 p.m. UTC | #1
On Wed,  4 Nov 2020 15:08:57 +0100 Magnus Karlsson wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Introduce lazy Tx completions when a queue is used for AF_XDP
> zero-copy. In the current design, each time we get into the NAPI poll
> loop we try to complete as many Tx packets as possible from the
> NIC. This is performed by reading the head pointer register in the NIC
> that tells us how many packets have been completed. Reading this
> register is expensive as it goes across PCIe, so let us try to limit the
> number of times it is read by only completing Tx packets to user-space
> when the number of available descriptors in the Tx HW ring is below
> some threshold. This will decrease the number of reads issued to the
> NIC and improve performance by 1.5% - 2% for the l2fwd xdpsock
> microbenchmark.
> 
> The threshold is set to the minimum possible size that the HW ring can
> have. This is so that we do not run into a scenario where the threshold
> is higher than the configured number of descriptors in the HW ring.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>

I feel like this needs a big fat warning somewhere.

It's perfectly fine to never complete TCP packets, but AF_XDP could be
used to implement protocols in user space. What if someone wants to
implement something like TSQ?
Jakub Kicinski Nov. 4, 2020, 11:35 p.m. UTC | #2
On Wed, 4 Nov 2020 15:33:20 -0800 Jakub Kicinski wrote:
> I feel like this needs a big fat warning somewhere.
> 
> It's perfectly fine to never complete TCP packets,

s/TCP/normal XDP/, sorry

> but AF_XDP could be used to implement protocols in user space. What
> if someone wants to implement something like TSQ?
Magnus Karlsson Nov. 5, 2020, 2:17 p.m. UTC | #3
On Thu, Nov 5, 2020 at 12:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed,  4 Nov 2020 15:08:57 +0100 Magnus Karlsson wrote:
> > From: Magnus Karlsson <magnus.karlsson@intel.com>
> >
> > Introduce lazy Tx completions when a queue is used for AF_XDP
> > zero-copy. In the current design, each time we get into the NAPI poll
> > loop we try to complete as many Tx packets as possible from the
> > NIC. This is performed by reading the head pointer register in the NIC
> > that tells us how many packets have been completed. Reading this
> > register is expensive as it goes across PCIe, so let us try to limit the
> > number of times it is read by only completing Tx packets to user-space
> > when the number of available descriptors in the Tx HW ring is below
> > some threshold. This will decrease the number of reads issued to the
> > NIC and improve performance by 1.5% - 2% for the l2fwd xdpsock
> > microbenchmark.
> >
> > The threshold is set to the minimum possible size that the HW ring can
> > have. This is so that we do not run into a scenario where the threshold
> > is higher than the configured number of descriptors in the HW ring.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>
> I feel like this needs a big fat warning somewhere.
>
> It's perfectly fine to never complete TCP packets, but AF_XDP could be
> used to implement protocols in user space. What if someone wants to
> implement something like TSQ?

I might be misunderstanding you, but by TSQ here (for something that
bypasses the qdisc and any buffering and just goes straight to the
driver) do you mean the ability to have just a few buffers
outstanding and continuously reuse them? If so, that is likely best
achieved by setting a low Tx queue size on the NIC. Note that even
without this patch, completions could be delayed, though this patch
makes that the normal case. In any case, I think this calls for some
improved documentation.

I also discovered a corner case that will lead to a deadlock if the
completion ring size is half the size of the NIC Tx ring. This needs
to be fixed, so I will spin a v2.

Thanks, Magnus
Jakub Kicinski Nov. 5, 2020, 3:45 p.m. UTC | #4
On Thu, 5 Nov 2020 15:17:50 +0100 Magnus Karlsson wrote:
> > I feel like this needs a big fat warning somewhere.
> >
> > It's perfectly fine to never complete TCP packets, but AF_XDP could be
> > used to implement protocols in user space. What if someone wants to
> > implement something like TSQ?  
> 
> I might be misunderstanding you, but by TSQ here (for something that
> bypasses the qdisc and any buffering and just goes straight to the
> driver) do you mean the ability to have just a few buffers
> outstanding and continuously reuse them? If so, that is likely best
> achieved by setting a low Tx queue size on the NIC. Note that even
> without this patch, completions could be delayed, though this patch
> makes that the normal case. In any case, I think this calls for some
> improved documentation.

TSQ tries to limit the amount of data the TCP stack queues into TC/sched
and drivers. Say 1MB ~ 16 GSO frames. It will not queue more data until
some of the transfer is reported as completed. 

IIUC you're allowing up to 64 descriptors to linger without reporting
back that the transfer is done. That means that user space implementing
a scheme similar to TSQ may see its transfers stalled.
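
For illustration, a minimal sketch of such a TSQ-like scheme in user
space, written against the libbpf xsk.h completion-ring helpers
(TX_BUDGET and the function names here are made up):

    #include <bpf/xsk.h>    /* libbpf AF_XDP ring helpers */

    #define TX_BUDGET 16    /* made-up cap on in-flight frames */

    /* Reap whatever the kernel has completed and credit it back. */
    static void reap_completions(struct xsk_ring_cons *cq,
                                 unsigned int *outstanding)
    {
            __u32 idx;
            unsigned int done = xsk_ring_cons__peek(cq, *outstanding, &idx);

            if (done) {
                    xsk_ring_cons__release(cq, done);
                    *outstanding -= done;
            }
    }

    /* TSQ-style gate: queue no new frames while TX_BUDGET are still in
     * flight. With lazy completions the driver reports nothing until
     * the HW ring runs low, so this gate can stay closed for a long
     * time on a lightly loaded ring.
     */
    static bool may_send(struct xsk_ring_cons *cq, unsigned int *outstanding)
    {
            reap_completions(cq, outstanding);
            return *outstanding < TX_BUDGET;
    }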
Magnus Karlsson Nov. 6, 2020, 7:09 p.m. UTC | #5
On Thu, Nov 5, 2020 at 4:45 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 5 Nov 2020 15:17:50 +0100 Magnus Karlsson wrote:
> > > I feel like this needs a big fat warning somewhere.
> > >
> > > It's perfectly fine to never complete TCP packets, but AF_XDP could be
> > > used to implement protocols in user space. What if someone wants to
> > > implement something like TSQ?
> >
> > I might be misunderstanding you, but by TSQ here (for something that
> > bypasses the qdisc and any buffering and just goes straight to the
> > driver) do you mean the ability to have just a few buffers
> > outstanding and continuously reuse them? If so, that is likely best
> > achieved by setting a low Tx queue size on the NIC. Note that even
> > without this patch, completions could be delayed, though this patch
> > makes that the normal case. In any case, I think this calls for some
> > improved documentation.
>
> TSQ tries to limit the amount of data the TCP stack queues into TC/sched
> and drivers. Say 1MB ~ 16 GSO frames. It will not queue more data until
> some of the transfer is reported as completed.

Thanks. Got it. There is one more use case I can think of for quick
completions of Tx buffers: metadata associated with the completion,
for example a Tx timestamp. Not that this capability exists today,
but hopefully it will get added at some point.

Anyway, after some more thinking, I would like to remove this patch
from the patch set and put it on the shelf for a while. The reason
is that if we can get a good busy-poll solution for AF_XDP sockets,
then we do not need this patch. With busy-poll, the choice of when
to complete Tx buffers would be left to the application in a nice
way. If the application would like to get buffers completed quickly
(at the cost of some performance), it would call sendto() (or
friends) soon after it puts a packet on the Tx ring. If max
throughput is desired with no regard to when a buffer is returned,
then sendto() would be called only after a large batch of packets
has been put on the Tx ring. No need for any threshold or new knob;
in other words, much nicer. So let us wait for Björn's busy-poll
patches and see where they lead. Please protest if you do not agree.
Otherwise I will submit a v2 without this patch and with Maciej's
proposed simplification.
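
A rough sketch of that application-side choice, using the libbpf xsk.h
helpers (BATCH is a made-up knob; descriptor fill-in is omitted):

    #include <sys/socket.h>
    #include <bpf/xsk.h>

    #define BATCH 64    /* made-up batch size chosen by the application */

    /* Low latency: kick the kernel after every frame so Tx buffers are
     * completed back to the application as soon as the driver allows.
     */
    static void kick_per_packet(int xsk_fd)
    {
            sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
    }

    /* Max throughput: put a whole batch on the Tx ring first
     * (xsk_ring_prod__reserve() and descriptor fill-in omitted), then
     * issue a single kick; completions come back correspondingly later.
     */
    static void kick_per_batch(int xsk_fd, struct xsk_ring_prod *tx)
    {
            xsk_ring_prod__submit(tx, BATCH);
            sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
    }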

> IIUC you're allowing up to 64 descriptors to linger without reporting
> back that the transfer is done. That means that user space implementing
> a scheme similar to TSQ may see its transfers stalled.

Patch

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 6acede0..f8815b3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -9,6 +9,8 @@
 #include "i40e_txrx_common.h"
 #include "i40e_xsk.h"
 
+#define I40E_TX_COMPLETION_THRESHOLD I40E_MIN_NUM_DESCRIPTORS
+
 int i40e_alloc_rx_bi_zc(struct i40e_ring *rx_ring)
 {
 	unsigned long sz = sizeof(*rx_ring->rx_bi_zc) * rx_ring->count;
@@ -460,12 +462,15 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
  **/
 bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi, struct i40e_ring *tx_ring)
 {
+	u32 i, completed_frames, xsk_frames = 0, head_idx;
 	struct xsk_buff_pool *bp = tx_ring->xsk_pool;
-	u32 i, completed_frames, xsk_frames = 0;
-	u32 head_idx = i40e_get_head(tx_ring);
 	struct i40e_tx_buffer *tx_bi;
 	unsigned int ntc;
 
+	if (I40E_DESC_UNUSED(tx_ring) >= I40E_TX_COMPLETION_THRESHOLD)
+		goto out_xmit;
+
+	head_idx = i40e_get_head(tx_ring);
 	if (head_idx < tx_ring->next_to_clean)
 		head_idx += tx_ring->count;
 	completed_frames = head_idx - tx_ring->next_to_clean;