| Message ID | 1604498942-24274-2-git-send-email-magnus.karlsson@gmail.com |
|---|---|
| State | Not Applicable |
| Delegated to | BPF Maintainers |
| Series | xsk: i40e: Tx performance improvements |
| Context | Check | Description |
|---|---|---|
| jkicinski/patch_count | success | Link |
| jkicinski/cover_letter | success | Link |
| jkicinski/fixes_present | success | Link |
| jkicinski/tree_selection | success | Clearly marked for bpf-next |
| jkicinski/subject_prefix | success | Link |
| jkicinski/source_inline | success | Was 0 now: 0 |
| jkicinski/verify_signedoff | success | Link |
| jkicinski/module_param | success | Was 0 now: 0 |
| jkicinski/build_32bit | fail | Errors and warnings before: 5 this patch: 5 |
| jkicinski/kdoc | success | Errors and warnings before: 0 this patch: 0 |
| jkicinski/verify_fixes | success | Link |
| jkicinski/checkpatch | fail | Link |
| jkicinski/build_allmodconfig_warn | success | Errors and warnings before: 1 this patch: 1 |
| jkicinski/header_inline | success | Link |
| jkicinski/stable | success | Stable not CCed |
On Wed, 4 Nov 2020 15:08:57 +0100 Magnus Karlsson wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Introduce lazy Tx completions when a queue is used for AF_XDP
> zero-copy. In the current design, each time we get into the NAPI poll
> loop we try to complete as many Tx packets as possible from the
> NIC. This is performed by reading the head pointer register in the NIC
> that tells us how many packets have been completed. Reading this
> register is expensive as it is across PCIe, so let us try to limit the
> number of times it is read by only completing Tx packets to user-space
> when the number of available descriptors in the Tx HW ring is below
> some threshold. This will decrease the number of reads issued to the
> NIC and improve performance by 1.5% - 2% for the l2fwd xdpsock
> microbenchmark.
>
> The threshold is set to the minimum possible size that the HW ring can
> have. This is so that we do not run into a scenario where the threshold
> is higher than the configured number of descriptors in the HW ring.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>

I feel like this needs a big fat warning somewhere.

It's perfectly fine to never complete TCP packets, but AF_XDP could be
used to implement protocols in user space. What if someone wants to
implement something like TSQ?
On Wed, 4 Nov 2020 15:33:20 -0800 Jakub Kicinski wrote:
> I feel like this needs a big fat warning somewhere.
>
> It's perfectly fine to never complete TCP packets,

s/TCP/normal XDP/, sorry

> but AF_XDP could be used to implement protocols in user space. What
> if someone wants to implement something like TSQ?
On Thu, Nov 5, 2020 at 12:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 4 Nov 2020 15:08:57 +0100 Magnus Karlsson wrote:
> > From: Magnus Karlsson <magnus.karlsson@intel.com>
> >
> > Introduce lazy Tx completions when a queue is used for AF_XDP
> > zero-copy. In the current design, each time we get into the NAPI poll
> > loop we try to complete as many Tx packets as possible from the
> > NIC. This is performed by reading the head pointer register in the NIC
> > that tells us how many packets have been completed. Reading this
> > register is expensive as it is across PCIe, so let us try to limit the
> > number of times it is read by only completing Tx packets to user-space
> > when the number of available descriptors in the Tx HW ring is below
> > some threshold. This will decrease the number of reads issued to the
> > NIC and improve performance by 1.5% - 2% for the l2fwd xdpsock
> > microbenchmark.
> >
> > The threshold is set to the minimum possible size that the HW ring can
> > have. This is so that we do not run into a scenario where the threshold
> > is higher than the configured number of descriptors in the HW ring.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>
> I feel like this needs a big fat warning somewhere.
>
> It's perfectly fine to never complete TCP packets, but AF_XDP could be
> used to implement protocols in user space. What if someone wants to
> implement something like TSQ?

I might misunderstand you, but with TSQ here (for something that
bypasses the qdisc and any buffering and just goes straight to the
driver) you mean the ability to have just a few buffers outstanding
and continuously reuse these? If so, that is likely best achieved by
setting a low Tx queue size on the NIC. Note that even without this
patch, completions could be delayed, though this patch makes that the
normal case. In any case, I think this calls for some improved
documentation.

I also discovered a corner case that will lead to a deadlock if the
completion ring size is half the size of the Tx NIC ring size. This
needs to be fixed, so I will spin a v2.

Thanks: Magnus
On Thu, 5 Nov 2020 15:17:50 +0100 Magnus Karlsson wrote:
> > I feel like this needs a big fat warning somewhere.
> >
> > It's perfectly fine to never complete TCP packets, but AF_XDP could be
> > used to implement protocols in user space. What if someone wants to
> > implement something like TSQ?
>
> I might misunderstand you, but with TSQ here (for something that
> bypasses the qdisc and any buffering and just goes straight to the
> driver) you mean the ability to have just a few buffers outstanding
> and continuously reuse these? If so, that is likely best achieved by
> setting a low Tx queue size on the NIC. Note that even without this
> patch, completions could be delayed, though this patch makes that the
> normal case. In any case, I think this calls for some improved
> documentation.

TSQ tries to limit the amount of data the TCP stack queues into
TC/sched and drivers. Say 1MB ~ 16 GSO frames. It will not queue more
data until some of the transfer is reported as completed.

IIUC you're allowing up to 64 descriptors to linger without reporting
back that the transfer is done. That means that user space
implementing a scheme similar to TSQ may see its transfers stalled.
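To make the concern concrete: a user-space TSQ-like scheme would cap the
number of descriptors in flight and refuse to queue more until the
completion ring reports some of them done. Below is a minimal sketch,
assuming the xsk ring helpers from libbpf's <bpf/xsk.h> (since moved to
libxdp); MAX_INFLIGHT, struct tx_state, and the function names are
illustrative, not part of the patch under discussion.

```c
#include <sys/socket.h>
#include <bpf/xsk.h>

#define MAX_INFLIGHT 64			/* illustrative budget, like TSQ's cap */

struct tx_state {			/* hypothetical app-side bookkeeping */
	struct xsk_socket *xsk;
	struct xsk_ring_prod tx;	/* Tx descriptor ring */
	struct xsk_ring_cons cq;	/* completion ring */
	unsigned int inflight;		/* queued but not yet completed */
};

/* Reclaim finished buffers from the completion ring. With lazy
 * completions the kernel may leave this ring empty for a while even
 * though the frames are long gone on the wire. */
static void reap_completions(struct tx_state *s)
{
	__u32 idx;
	unsigned int done = xsk_ring_cons__peek(&s->cq, MAX_INFLIGHT, &idx);

	if (done) {
		xsk_ring_cons__release(&s->cq, done);
		s->inflight -= done;
	}
}

/* Queue one frame, but only while under the in-flight budget. */
static int tx_one(struct tx_state *s, __u64 addr, __u32 len)
{
	struct xdp_desc *desc;
	__u32 idx;

	reap_completions(s);
	if (s->inflight >= MAX_INFLIGHT)
		return -1;	/* throttled: wait for completions */

	if (!xsk_ring_prod__reserve(&s->tx, 1, &idx))
		return -1;	/* Tx ring full */

	desc = xsk_ring_prod__tx_desc(&s->tx, idx);
	desc->addr = addr;
	desc->len = len;
	xsk_ring_prod__submit(&s->tx, 1);
	s->inflight++;

	/* Kick the driver so the frame actually moves. */
	sendto(xsk_socket__fd(s->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
	return 0;
}
```

If the driver holds back completions until its ring drops below a
threshold, the inflight count in a scheme like this can sit at the cap
and stall the sender, which is exactly the hazard raised above.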
On Thu, Nov 5, 2020 at 4:45 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 5 Nov 2020 15:17:50 +0100 Magnus Karlsson wrote:
> > > I feel like this needs a big fat warning somewhere.
> > >
> > > It's perfectly fine to never complete TCP packets, but AF_XDP could be
> > > used to implement protocols in user space. What if someone wants to
> > > implement something like TSQ?
> >
> > I might misunderstand you, but with TSQ here (for something that
> > bypasses the qdisc and any buffering and just goes straight to the
> > driver) you mean the ability to have just a few buffers outstanding
> > and continuously reuse these? If so, that is likely best achieved by
> > setting a low Tx queue size on the NIC. Note that even without this
> > patch, completions could be delayed, though this patch makes that the
> > normal case. In any case, I think this calls for some improved
> > documentation.
>
> TSQ tries to limit the amount of data the TCP stack queues into
> TC/sched and drivers. Say 1MB ~ 16 GSO frames. It will not queue more
> data until some of the transfer is reported as completed.

Thanks. Got it.

There is one more use case I can think of for quick completions of Tx
buffers, and that is if you have metadata associated with the
completion, for example a Tx time stamp. Not that this capability
exists today, but hopefully it will get added at some point.

Anyway, after some more thinking, I would like to remove this patch
from the patch set and put it on the shelf for a while. The reason
behind this is that if we can get a good busy-poll solution for AF_XDP
sockets, then we do not need this patch. With busy poll, the choice of
when to complete Tx buffers would be left to the application in a nice
way. If the application would like to quickly get buffers completed
(at the cost of some performance), it would call sendto() (or friends)
soon after it put the packet on the Tx ring. If max throughput is
desired with no regard to when a buffer is returned, then sendto()
would be called only after a large batch of packets has been put on
the Tx ring. No need for any threshold or new knob; in other words,
much nicer.

So let us wait for Björn's busy-poll patches and see where it leads.
Please protest if you do not agree. Otherwise I will submit a v2
without this patch and with Maciej's proposed simplification.

> IIUC you're allowing up to 64 descriptors to linger without reporting
> back that the transfer is done. That means that user space
> implementing a scheme similar to TSQ may see its transfers stalled.
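The trade-off Magnus describes maps onto when the application issues
the kick. A sketch of the two extremes, again assuming libbpf's
<bpf/xsk.h> helpers; BATCH is an illustrative constant, not an
existing knob.

```c
#include <sys/socket.h>
#include <bpf/xsk.h>

#define BATCH 64	/* illustrative batch size */

/* Low-latency completions: kick after every frame, paying a syscall
 * per packet so buffers come back quickly. */
static void tx_eager(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
		     const struct xdp_desc *frame)
{
	__u32 idx;

	if (!xsk_ring_prod__reserve(tx, 1, &idx))
		return;
	*xsk_ring_prod__tx_desc(tx, idx) = *frame;
	xsk_ring_prod__submit(tx, 1);
	sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
}

/* Max throughput: queue a whole batch, then issue a single kick;
 * buffer completions are correspondingly delayed. */
static void tx_batched(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
		       const struct xdp_desc frames[BATCH])
{
	__u32 idx, i;

	if (xsk_ring_prod__reserve(tx, BATCH, &idx) != BATCH)
		return;
	for (i = 0; i < BATCH; i++)
		*xsk_ring_prod__tx_desc(tx, idx + i) = frames[i];
	xsk_ring_prod__submit(tx, BATCH);
	sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
}
```

With busy polling, the application picks between these per call site,
which is why no driver-side threshold or new knob would be needed.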
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 6acede0..f8815b3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -9,6 +9,8 @@
 #include "i40e_txrx_common.h"
 #include "i40e_xsk.h"
 
+#define I40E_TX_COMPLETION_THRESHOLD I40E_MIN_NUM_DESCRIPTORS
+
 int i40e_alloc_rx_bi_zc(struct i40e_ring *rx_ring)
 {
 	unsigned long sz = sizeof(*rx_ring->rx_bi_zc) * rx_ring->count;
@@ -460,12 +462,15 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
 **/
 bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi, struct i40e_ring *tx_ring)
 {
+	u32 i, completed_frames, xsk_frames = 0, head_idx;
 	struct xsk_buff_pool *bp = tx_ring->xsk_pool;
-	u32 i, completed_frames, xsk_frames = 0;
-	u32 head_idx = i40e_get_head(tx_ring);
 	struct i40e_tx_buffer *tx_bi;
 	unsigned int ntc;
 
+	if (I40E_DESC_UNUSED(tx_ring) >= I40E_TX_COMPLETION_THRESHOLD)
+		goto out_xmit;
+
+	head_idx = i40e_get_head(tx_ring);
 	if (head_idx < tx_ring->next_to_clean)
 		head_idx += tx_ring->count;
 	completed_frames = head_idx - tx_ring->next_to_clean;