diff mbox series

[hirsute/linux] veth: Do not record rx queue hint in veth_xmit

Message ID 20220117155715.29431-2-tim.gardner@canonical.com
State New
Headers show
Series [hirsute/linux] veth: Do not record rx queue hint in veth_xmit | expand

Commit Message

Tim Gardner Jan. 17, 2022, 3:57 p.m. UTC
From: Daniel Borkmann <daniel@iogearbox.net>

BugLink: https://bugs.launchpad.net/bugs/1958155

commit 710ad98c363a66a0cd8526465426c5c5f8377ee0 upstream.

Laurent reported that they have seen a significant amount of TCP retransmissions
at high throughput from applications residing in network namespaces talking to
the outside world via veths. The drops were seen on the qdisc layer (fq_codel,
as per systemd default) of the phys device such as ena or virtio_net due to all
traffic hitting a _single_ TX queue _despite_ multi-queue device. (Note that the
setup was _not_ using XDP on veths as the issue is generic.)

More specifically, after edbea9220251 ("veth: Store queue_mapping independently
of XDP prog presence") which made it all the way back to v4.19.184+,
skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX
queue by default for veths) instead of leaving at 0.

This is eventually retained and callbacks like ena_select_queue() will also pick
single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the traffic
is forwarded to that device via upper stack or other means. Similarly, for others
not implementing ndo_select_queue() if XPS is disabled, netdev_pick_tx() might
call into the skb_tx_hash() and check for prior skb_rx_queue_recorded() as well.

In general, it is a _bad_ idea for virtual devices like veth to mess around with
queue selection [by default]. Given dev->real_num_tx_queues is by default 1,
the skb->queue_mapping was left untouched, and so prior to edbea9220251 the
netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys device.

Unbreak this and restore prior behavior by removing the skb_record_rx_queue()
from veth_xmit() altogether.

If the veth peer has an XDP program attached, then it would return the first RX
queue index in xdp_md->rx_queue_index (unless configured in non-default manner).
However, this is still better than breaking the generic case.

Fixes: edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence")
Fixes: 638264dc9022 ("veth: Support per queue XDP ring")
Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Toshiaki Makita <toshiaki.makita1@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5f76445a31b79be85b337cce5f35affb77cc18fc linux-stable linux-5.10.y)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
---
 drivers/net/veth.c | 1 -
 1 file changed, 1 deletion(-)

Comments

Stefan Bader Jan. 25, 2022, 2:12 p.m. UTC | #1
On 17.01.22 16:57, Tim Gardner wrote:
> From: Daniel Borkmann <daniel@iogearbox.net>
> 
> BugLink: https://bugs.launchpad.net/bugs/1958155
> 
> commit 710ad98c363a66a0cd8526465426c5c5f8377ee0 upstream.
> 
> Laurent reported that they have seen a significant amount of TCP retransmissions
> at high throughput from applications residing in network namespaces talking to
> the outside world via veths. The drops were seen on the qdisc layer (fq_codel,
> as per systemd default) of the phys device such as ena or virtio_net due to all
> traffic hitting a _single_ TX queue _despite_ multi-queue device. (Note that the
> setup was _not_ using XDP on veths as the issue is generic.)
> 
> More specifically, after edbea9220251 ("veth: Store queue_mapping independently
> of XDP prog presence") which made it all the way back to v4.19.184+,
> skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX
> queue by default for veths) instead of leaving at 0.
> 
> This is eventually retained and callbacks like ena_select_queue() will also pick
> single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the traffic
> is forwarded to that device via upper stack or other means. Similarly, for others
> not implementing ndo_select_queue() if XPS is disabled, netdev_pick_tx() might
> call into the skb_tx_hash() and check for prior skb_rx_queue_recorded() as well.
> 
> In general, it is a _bad_ idea for virtual devices like veth to mess around with
> queue selection [by default]. Given dev->real_num_tx_queues is by default 1,
> the skb->queue_mapping was left untouched, and so prior to edbea9220251 the
> netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys device.
> 
> Unbreak this and restore prior behavior by removing the skb_record_rx_queue()
> from veth_xmit() altogether.
> 
> If the veth peer has an XDP program attached, then it would return the first RX
> queue index in xdp_md->rx_queue_index (unless configured in non-default manner).
> However, this is still better than breaking the generic case.
> 
> Fixes: edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence")
> Fixes: 638264dc9022 ("veth: Support per queue XDP ring")
> Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Cc: Toshiaki Makita <toshiaki.makita1@gmail.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> (cherry picked from commit 5f76445a31b79be85b337cce5f35affb77cc18fc linux-stable linux-5.10.y)
> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
> ---

Applied to focal:linux-hwe-5.11/hwe-5.11-next (Hirsute EOL). Thanks.

-Stefan

>   drivers/net/veth.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 36abe756282e..0fc3210ca69c 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -301,7 +301,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>   	if (rxq < rcv->real_num_rx_queues) {
>   		rq = &rcv_priv->rq[rxq];
>   		rcv_xdp = rcu_access_pointer(rq->xdp_prog);
> -		skb_record_rx_queue(skb, rxq);
>   	}
>   
>   	skb_tx_timestamp(skb);
diff mbox series

Patch

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 36abe756282e..0fc3210ca69c 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -301,7 +301,6 @@  static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (rxq < rcv->real_num_rx_queues) {
 		rq = &rcv_priv->rq[rxq];
 		rcv_xdp = rcu_access_pointer(rq->xdp_prog);
-		skb_record_rx_queue(skb, rxq);
 	}
 
 	skb_tx_timestamp(skb);