diff mbox series

[ovs-dev] netdev-dpdk: fix incorrect shinfo initialization

Message ID 20201014072248.374384-1-yang_y_yi@163.com
State Accepted
Series [ovs-dev] netdev-dpdk: fix incorrect shinfo initialization

Commit Message

yang_y_yi Oct. 14, 2020, 7:22 a.m. UTC
From: Yi Yang <yangyi01@inspur.com>

shinfo is used to store reference counter and free callback
of an external buffer, but it is stored in mbuf if the mbuf
has tailroom for it.

This is wrong because the mbuf (and its data) can be freed
before the external buffer, for example:

  pkt2 = rte_pktmbuf_alloc(mp);
  rte_pktmbuf_attach(pkt2, pkt);
  rte_pktmbuf_free(pkt);

After this, pkt is freed, but it still contains shinfo, which
is referenced by pkt2.

Fix this by always storing shinfo at the tail of the external buffer.

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Co-authored-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: Yi Yang <yangyi01@inspur.com>
---
 lib/netdev-dpdk.c | 30 ++++++++++--------------------
 1 file changed, 10 insertions(+), 20 deletions(-)

Comments

Flavio Leitner Oct. 27, 2020, 11:34 a.m. UTC | #1
On Wed, Oct 14, 2020 at 03:22:48PM +0800, yang_y_yi@163.com wrote:
> From: Yi Yang <yangyi01@inspur.com>
> 
> shinfo is used to store reference counter and free callback
> of an external buffer, but it is stored in mbuf if the mbuf
> has tailroom for it.
> 
> This is wrong because the mbuf (and its data) can be freed
> before the external buffer, for example:
> 
>   pkt2 = rte_pktmbuf_alloc(mp);
>   rte_pktmbuf_attach(pkt2, pkt);
>   rte_pktmbuf_free(pkt);
> 
> After this, pkt is freed, but it still contains shinfo, which
> is referenced by pkt2.
> 
> Fix this by always storing shinfo at the tail of external buffer.
> 
> Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
> Co-authored-by: Olivier Matz <olivier.matz@6wind.com>
> Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
> Signed-off-by: Yi Yang <yangyi01@inspur.com>
> ---

Acked-by: Flavio Leitner <fbl@sysclose.org>

Thanks Yi,
fbl
Ilya Maximets Oct. 27, 2020, 12:47 p.m. UTC | #2
On 10/27/20 12:34 PM, Flavio Leitner wrote:
> On Wed, Oct 14, 2020 at 03:22:48PM +0800, yang_y_yi@163.com wrote:
>> From: Yi Yang <yangyi01@inspur.com>
>>
>> shinfo is used to store reference counter and free callback
>> of an external buffer, but it is stored in mbuf if the mbuf
>> has tailroom for it.
>>
>> This is wrong because the mbuf (and its data) can be freed
>> before the external buffer, for example:
>>
>>   pkt2 = rte_pktmbuf_alloc(mp);
>>   rte_pktmbuf_attach(pkt2, pkt);
>>   rte_pktmbuf_free(pkt);

How is that possible with OVS?  Right now OVS doesn't support multi-segment
mbufs and will likely not support them in the near future, because that requires
changes all over the codebase.

Is there any other scenario that could lead to issues with current OVS
implementation?

>>
>> After this, pkt is freed, but it still contains shinfo, which
>> is referenced by pkt2.
>>
>> Fix this by always storing shinfo at the tail of external buffer.
>>
>> Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
>> Co-authored-by: Olivier Matz <olivier.matz@6wind.com>
>> Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
>> Signed-off-by: Yi Yang <yangyi01@inspur.com>
>> ---
> 
> Acked-by: Flavio Leitner <fbl@sysclose.org>
> 
> Thanks Yi,
> fbl
>
Flavio Leitner Oct. 27, 2020, 1:08 p.m. UTC | #3
On Tue, Oct 27, 2020 at 01:47:22PM +0100, Ilya Maximets wrote:
> On 10/27/20 12:34 PM, Flavio Leitner wrote:
> > On Wed, Oct 14, 2020 at 03:22:48PM +0800, yang_y_yi@163.com wrote:
> >> From: Yi Yang <yangyi01@inspur.com>
> >>
> >> shinfo is used to store reference counter and free callback
> >> of an external buffer, but it is stored in mbuf if the mbuf
> >> has tailroom for it.
> >>
> >> This is wrong because the mbuf (and its data) can be freed
> >> before the external buffer, for example:
> >>
> >>   pkt2 = rte_pktmbuf_alloc(mp);
> >>   rte_pktmbuf_attach(pkt2, pkt);
> >>   rte_pktmbuf_free(pkt);
> 
> How is that possible with OVS?  Right now OVS doesn't support multi-segment
> mbufs and will likely not support them in the near future, because that requires
> changes all over the codebase.
> 
> Is there any other scenario that could lead to issues with current OVS
> implementation?

This happens when copying packets. The shinfo is allocated in the mbuf of
the first packet, which can be deleted while references to the
external buffer are still using it.

fbl


> 
> >>
> >> After this, pkt is freed, but it still contains shinfo, which
> >> is referenced by pkt2.
> >>
> >> Fix this by always storing shinfo at the tail of external buffer.
> >>
> >> Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
> >> Co-authored-by: Olivier Matz <olivier.matz@6wind.com>
> >> Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
> >> Signed-off-by: Yi Yang <yangyi01@inspur.com>
> >> ---
> > 
> > Acked-by: Flavio Leitner <fbl@sysclose.org>
> > 
> > Thanks Yi,
> > fbl
> > 
>
Yi Yang (杨燚)-云服务集团 Oct. 28, 2020, 12:35 a.m. UTC | #4
-----Original Message-----
From: dev [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Flavio Leitner
Sent: October 27, 2020 21:08
To: Ilya Maximets <i.maximets@ovn.org>
Cc: yang_y_yi@163.com; ovs-dev@openvswitch.org; olivier.matz@6wind.com
Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: fix incorrect shinfo initialization

On Tue, Oct 27, 2020 at 01:47:22PM +0100, Ilya Maximets wrote:
> On 10/27/20 12:34 PM, Flavio Leitner wrote:
> > On Wed, Oct 14, 2020 at 03:22:48PM +0800, yang_y_yi@163.com wrote:
> >> From: Yi Yang <yangyi01@inspur.com>
> >>
> >> shinfo is used to store reference counter and free callback of an 
> >> external buffer, but it is stored in mbuf if the mbuf has tailroom 
> >> for it.
> >>
> >> This is wrong because the mbuf (and its data) can be freed before 
> >> the external buffer, for example:
> >>
> >>   pkt2 = rte_pktmbuf_alloc(mp);
> >>   rte_pktmbuf_attach(pkt2, pkt);
> >>   rte_pktmbuf_free(pkt);
> 
> How is that possible with OVS?  Right now OVS doesn't support
> multi-segment mbufs and will likely not support them in the near
> future, because that requires changes all over the codebase.
> 
> Is there any other scenario that could lead to issues with current OVS 
> implementation?

This happens when copying packets. The shinfo is allocated in the mbuf of the first packet, which can be deleted while references to the external buffer are still using it.

Fbl

[Yi Yang]  Yes, this is not related to multi-segment mbufs. Communication between DPDK interfaces and system interfaces will use it whenever the packet size is greater than the MTU, i.e. the TSO case from veth/tap to dpdk/vhost and back. This was a wrong use of shinfo; the same fix (which is used by the virtio/vhost driver) has been merged into the DPDK branch.
 



> 
> >>
> >> After this, pkt is freed, but it still contains shinfo, which is 
> >> referenced by pkt2.
> >>
> >> Fix this by always storing shinfo at the tail of external buffer.
> >>
> >> Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload 
> >> support")
> >> Co-authored-by: Olivier Matz <olivier.matz@6wind.com>
> >> Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
> >> Signed-off-by: Yi Yang <yangyi01@inspur.com>
> >> ---
> > 
> > Acked-by: Flavio Leitner <fbl@sysclose.org>
> > 
> > Thanks Yi,
> > fbl
> > 
> 

--
fbl
Ilya Maximets Feb. 1, 2021, 7:47 p.m. UTC | #5
On 10/28/20 1:35 AM, Yi Yang (杨燚)-云服务集团 wrote:
> -----Original Message-----
> From: dev [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Flavio Leitner
> Sent: October 27, 2020 21:08
> To: Ilya Maximets <i.maximets@ovn.org>
> Cc: yang_y_yi@163.com; ovs-dev@openvswitch.org; olivier.matz@6wind.com
> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: fix incorrect shinfo initialization
> 
> On Tue, Oct 27, 2020 at 01:47:22PM +0100, Ilya Maximets wrote:
>> On 10/27/20 12:34 PM, Flavio Leitner wrote:
>>> On Wed, Oct 14, 2020 at 03:22:48PM +0800, yang_y_yi@163.com wrote:
>>>> From: Yi Yang <yangyi01@inspur.com>
>>>>
>>>> shinfo is used to store reference counter and free callback of an 
>>>> external buffer, but it is stored in mbuf if the mbuf has tailroom 
>>>> for it.
>>>>
>>>> This is wrong because the mbuf (and its data) can be freed before 
>>>> the external buffer, for example:
>>>>
>>>>   pkt2 = rte_pktmbuf_alloc(mp);
>>>>   rte_pktmbuf_attach(pkt2, pkt);
>>>>   rte_pktmbuf_free(pkt);
>>
>> How is that possible with OVS?  Right now OVS doesn't support 
>> multi-segement mbufs and will, likely, not support them in a near 
>> future because it requires changes all other the codebase.
>>
>> Is there any other scenario that could lead to issues with current OVS 
>> implementation?
> 
> This happens when copying packets. The shinfo is allocated in the mbuf of the first packet, which can be deleted while references to the external buffer are still using it.
> 
> Fbl
> 
> [Yi Yang]  Yes, this is not related to multi-segment mbufs. Communication between DPDK interfaces and system interfaces will use it whenever the packet size is greater than the MTU, i.e. the TSO case from veth/tap to dpdk/vhost and back. This was a wrong use of shinfo; the same fix (which is used by the virtio/vhost driver) has been merged into the DPDK branch.

Thanks.  Sorry for the delay.
I added some of this information to the commit message and
applied to master.  Backported down to 2.13.

I'm wondering, though, why net_tap PMD implements TSO in userspace and
doesn't offload this to kernel via virtio headers?   In many cases
actual segmentation is not necessary or could be done later by HW,
so it makes sense to not waste cycles in userspace and let the kernel
decide if it's needed or not.

Best regards, Ilya Maximets.
Yi Yang (杨燚)-云服务集团 Feb. 2, 2021, 1:27 a.m. UTC | #6
Thanks Ilya. The net_tap PMD handles the tap device on the host side, so it could leverage the vnet header to do TSO/GSO; maybe the net_tap authors didn't know how to do this. From the source code, the tap fd doesn't have the vnet header or TSO enabled.

-----Original Message-----
From: Ilya Maximets [mailto:i.maximets@ovn.org]
Sent: February 2, 2021 3:47
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>; fbl@sysclose.org; i.maximets@ovn.org
Cc: yang_y_yi@163.com; ovs-dev@openvswitch.org; olivier.matz@6wind.com
Subject: Re: RE: [ovs-dev] [PATCH] netdev-dpdk: fix incorrect shinfo initialization

On 10/28/20 1:35 AM, Yi Yang (杨燚)-云服务集团 wrote:
> -----Original Message-----
> From: dev [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Flavio Leitner
> Sent: October 27, 2020 21:08
> To: Ilya Maximets <i.maximets@ovn.org>
> Cc: yang_y_yi@163.com; ovs-dev@openvswitch.org; olivier.matz@6wind.com
> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: fix incorrect shinfo initialization
> 
> On Tue, Oct 27, 2020 at 01:47:22PM +0100, Ilya Maximets wrote:
>> On 10/27/20 12:34 PM, Flavio Leitner wrote:
>>> On Wed, Oct 14, 2020 at 03:22:48PM +0800, yang_y_yi@163.com wrote:
>>>> From: Yi Yang <yangyi01@inspur.com>
>>>>
>>>> shinfo is used to store reference counter and free callback of an 
>>>> external buffer, but it is stored in mbuf if the mbuf has tailroom 
>>>> for it.
>>>>
>>>> This is wrong because the mbuf (and its data) can be freed before 
>>>> the external buffer, for example:
>>>>
>>>>   pkt2 = rte_pktmbuf_alloc(mp);
>>>>   rte_pktmbuf_attach(pkt2, pkt);
>>>>   rte_pktmbuf_free(pkt);
>>
>> How is that possible with OVS?  Right now OVS doesn't support 
>> multi-segement mbufs and will, likely, not support them in a near 
>> future because it requires changes all other the codebase.
>>
>> Is there any other scenario that could lead to issues with current 
>> OVS implementation?
> 
> This happens when copying packets. The shinfo is allocated in the mbuf of the first packet, which can be deleted while references to the external buffer are still using it.
> 
> Fbl
> 
> [Yi Yang]  Yes, this is not related to multi-segment mbufs. Communication between DPDK interfaces and system interfaces will use it whenever the packet size is greater than the MTU, i.e. the TSO case from veth/tap to dpdk/vhost and back. This was a wrong use of shinfo; the same fix (which is used by the virtio/vhost driver) has been merged into the DPDK branch.

Thanks.  Sorry for the delay.
I added some of this information to the commit message and applied to master.  Backported down to 2.13.

I'm wondering, though, why net_tap PMD implements TSO in userspace and
doesn't offload this to kernel via virtio headers?   In many cases
actual segmentation is not necessary or could be done later by HW, so it makes sense to not waste cycles in userspace and let the kernel decide if it's needed or not.

Best regards, Ilya Maximets.
William Tu Feb. 6, 2021, 4:15 p.m. UTC | #7
On Mon, Feb 1, 2021 at 5:48 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Thanks Ilya. The net_tap PMD handles the tap device on the host side, so it could leverage the vnet header to do TSO/GSO; maybe the net_tap authors didn't know how to do this. From the source code, the tap fd doesn't have the vnet header or TSO enabled.
>
thanks, learned a lot from these discussions.

I looked at the DPDK net_tap and indeed it doesn't support virtio net hdr.
Do you guys think it makes sense to add TSO at dpdk net_tap?
Or simply using the current OVS's userspace-enable-tso on tap/veth is
good enough?
(using type=system, not using dpdk port type on tap/veth.)

Regards,
William
Yi Yang (杨燚)-云服务集团 Feb. 7, 2021, 1:13 a.m. UTC | #8
In OVS DPDK (for tap in lib/netdev-linux.c), we don't use net_tap; the internal type is also a tap type in the OVS DPDK use case, which means all the bridge ports (i.e. br-int, ovs-netdev, br-phy, etc.) are taps, and it would result in a deadlock if we used PMDs to handle them.

For non-bridge tap ports, a PMD is much better than ovs-vswitchd, and OVS DPDK already supports that: you just add the tap by vdev (if the tap has already been created, it will be opened). But an existing tap in a network namespace can't be handled. A patch I sent before added a netns option that can fix this issue.

The TSO-related code in lib/netdev-linux.c can be ported into the DPDK net_tap PMD driver; it is very easy.

-----Original Message-----
From: William Tu [mailto:u9012063@gmail.com]
Sent: February 7, 2021 0:16
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
Cc: i.maximets@ovn.org; fbl@sysclose.org; yang_y_yi@163.com; ovs-dev@openvswitch.org; olivier.matz@6wind.com
Subject: Re: [ovs-dev] RE: RE: [PATCH] netdev-dpdk: fix incorrect shinfo initialization

On Mon, Feb 1, 2021 at 5:48 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> Thanks Ilya. The net_tap PMD handles the tap device on the host side, so it could leverage the vnet header to do TSO/GSO; maybe the net_tap authors didn't know how to do this. From the source code, the tap fd doesn't have the vnet header or TSO enabled.
>
thanks, learned a lot from these discussions.

I looked at the DPDK net_tap and indeed it doesn't support virtio net hdr.
Do you guys think it makes sense to add TSO at dpdk net_tap?
Or simply using the current OVS's userspace-enable-tso on tap/veth is good enough?
(using type=system, not using dpdk port type on tap/veth.)

Regards,
William
Ilya Maximets Feb. 8, 2021, 4:57 p.m. UTC | #9
On 2/6/21 5:15 PM, William Tu wrote:
> On Mon, Feb 1, 2021 at 5:48 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>>
>> Thanks Ilya, net_tap PMD is handling tap device on host side, so it can leverage vnet header to do TSO/GSO, maybe net_pmd authors don't know how to do this, from source code, tap fd isn't enabled vnet header and TSO.
>>
> thanks, learned a lot from these discussions.
> 
> I looked at the DPDK net_tap and indeed it doesn't support virtio net hdr.
> Do you guys think it makes sense to add TSO at dpdk net_tap?
> Or simply using the current OVS's userspace-enable-tso on tap/veth is
> good enough?
> (using type=system, not using dpdk port type on tap/veth.)
> 
> Regards,
> William
> 

I didn't benchmark all types of interfaces, but I'd say that, if you
need more or less high performance solution for userspace<->kernel
communication, you should, probably, take a look at virtio-user
ports with vhost kernel backend:
  https://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html
This should be the fastest and also feature-rich solution.
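Following the linked howto, such a port can be attached to a userspace bridge roughly as below. This is a sketch only: the vdev parameters (path=/dev/vhost-net, queues, queue_size) come from the DPDK guide above, and the exact ovs-vsctl form should be verified against the OVS vdev documentation for the release in use:

```shell
# Hypothetical example: add a virtio-user port backed by the kernel
# vhost-net driver to an existing userspace bridge br0.
ovs-vsctl add-port br0 virtio_user0 -- set Interface virtio_user0 \
    type=dpdk \
    options:dpdk-devargs=virtio_user0,path=/dev/vhost-net,queues=1,queue_size=1024
```

The kernel side then appears as a tap netdev that can be moved into a namespace, while OVS polls the virtio-user ring with full vnet-header offload support.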

Tap devices are not designed for high performance in general,
so I'd not suggest any of them for highly loaded ports.
If it's only for some small management traffic, it should be fine
to just use netdev-linux implementation.

netdev-afxdp with pmd or non-pmd modes on a veth devices is another
(potentially high performance) solution.

Best regards, Ilya Maximets.
William Tu Feb. 8, 2021, 7:13 p.m. UTC | #10
On Mon, Feb 8, 2021 at 8:57 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 2/6/21 5:15 PM, William Tu wrote:
> > On Mon, Feb 1, 2021 at 5:48 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
> >>
> >> Thanks Ilya, net_tap PMD is handling tap device on host side, so it can leverage vnet header to do TSO/GSO, maybe net_pmd authors don't know how to do this, from source code, tap fd isn't enabled vnet header and TSO.
> >>
> > thanks, learned a lot from these discussions.
> >
> > I looked at the DPDK net_tap and indeed it doesn't support virtio net hdr.
> > Do you guys think it makes sense to add TSO at dpdk net_tap?
> > Or simply using the current OVS's userspace-enable-tso on tap/veth is
> > good enough?
> > (using type=system, not using dpdk port type on tap/veth.)
> >
> > Regards,
> > William
> >
>
> I didn't benchmark all types of interfaces, but I'd say that, if you
> need more or less high performance solution for userspace<->kernel
> communication, you should, probably, take a look at virtio-user
> ports with vhost kernel backend:
>   https://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html
> This should be the fastest and also feature-rich solution.
Thanks! I will give it a try.
>
> Tap devices are not designed for high performance in general,
> so I'd not suggest any of them for highly loaded ports.
> If it's only for some small management traffic, it should be fine
> to just use netdev-linux implementation.

That's what I thought until Flavio enabled the vnet header.
>
> netdev-afxdp with pmd or non-pmd modes on a veth devices is another
> (potentially high performance) solution.

When testing intra-host container-to-container performance,
the tap device becomes much faster than netdev-afxdp, especially with iperf TCP,
mostly due to the vnet header's TSO and csum offload features.
It's a big limitation that an XDP frame can't carry a large buffer or
partial csum information.

I reached the conclusion that, for intra-host container-to-container
TCP performance, the configurations rank from fastest to slowest (ns: namespace):
0) dpdk vhostuser in ns0 -> vhostuser - OVS userspace
(But requires TCP in userspace and application modification)
1) veth0 in ns0 -> veth with TSO - OVS kernel module - veth with TSO
-> veth1 in ns1
2) tap0 in ns0 -> virtio_user - OVS userspace - virtio_user -> tap1 in ns1
3) tap0 in ns0 -> recv_tap - OVS with userspace-tso - tap_batch_send
-> tap1 in ns1
4) veth0 in ns0 -> af_packet sock - OVS with userspace-tso -
af_packet_sock -> veth1 in ns1
5) veth0 in ns0 -> netdev-afxdp - OVS - netdev-afxdp -> veth1 in ns1

I also tested Toshiaki's XDP offload patch,
https://www.mail-archive.com/ovs-dev@openvswitch.org/msg45930.html
I would guess it's between 2 to 4.

Regards,
William
Toshiaki Makita Feb. 9, 2021, 10:37 a.m. UTC | #11
On 2021/02/09 4:13, William Tu wrote:
> On Mon, Feb 8, 2021 at 8:57 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 2/6/21 5:15 PM, William Tu wrote:
>>> On Mon, Feb 1, 2021 at 5:48 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>>>>
>>>> Thanks Ilya, net_tap PMD is handling tap device on host side, so it can leverage vnet header to do TSO/GSO, maybe net_pmd authors don't know how to do this, from source code, tap fd isn't enabled vnet header and TSO.
>>>>
>>> thanks, learned a lot from these discussions.
>>>
>>> I looked at the DPDK net_tap and indeed it doesn't support virtio net hdr.
>>> Do you guys think it makes sense to add TSO at dpdk net_tap?
>>> Or simply using the current OVS's userspace-enable-tso on tap/veth is
>>> good enough?
>>> (using type=system, not using dpdk port type on tap/veth.)
>>>
>>> Regards,
>>> William
>>>
>>
>> I didn't benchmark all types of interfaces, but I'd say that, if you
>> need more or less high performance solution for userspace<->kernel
>> communication, you should, probably, take a look at virtio-user
>> ports with vhost kernel backend:
>>    https://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html
>> This should be the fastest and also feature-rich solution.
> Thanks! I will give it a try.
>>
>> Tap devices are not designed for high performance in general,
>> so I'd not suggest any of them for highly loaded ports.
>> If it's only for some small management traffic, it should be fine
>> to just use netdev-linux implementation.
> 
> That's what I thought until Flavio enables vnet header.
>>
>> netdev-afxdp with pmd or non-pmd modes on a veth devices is another
>> (potentially high performance) solution.
> 
> When testing intra-host container to container performance,
> Tap device becomes much faster than netdev-afxdp, especially with iperf TCP.
> Mostly due to vnet header's TSO and csum offload feature.
> It's a big limitation for XDP frame which couldn't carry large buffer or carry
> the partial csum information.
> 
> I reach a conclusion that for intra-host container to container
> TCP performance, from the fastest configuration to slowest (ns: namespace)
> 0) dpdk vhostuser in ns0 -> vhostuser - OVS userspace
> (But requires TCP in userspace and application modification)
> 1) veth0 in ns0 -> veth with TSO - OVS kernel module - veth with TSO
> -> veth1 in ns1
> 2) tap0 in ns0 -> virtio_user - OVS userspace - virtio_user -> tap1 in ns1
> 3) tap0 in ns0 -> recv_tap - OVS with userspace-tso - tap_batch_send
> -> tap1 in ns1
> 4) veth0 in ns0 -> af_packet sock - OVS with userspace-tso -
> af_packet_sock -> veth1 in ns1
> 5) veth0 in ns0 -> netdev-afxdp - OVS - netdev-afxdp -> veth1 in ns1
> 
> I also tested Toshiaki's XDP offload patch,
> https://www.mail-archive.com/ovs-dev@openvswitch.org/msg45930.html
> I would guess it's between 2 to 4.

Note that veth native XDP is fast when all of the packet processing is done in the XDP world.
That means packets generated in containers are not fast even with native veth XDP.
The fast case is when a phy device XDP_REDIRECTs frames to a veth, the peer
veth in the ns does something and XDP_TXes the frame, and it is then REDIRECTed to another veth pair.

phy --XDP_REDIRECT--> veth-host0 --> veth-ns0 --XDP_TX--> veth-host0 --XDP_REDIRECT--> veth-host1 --> veth-ns1

Having said that, missing TSO is indeed a big limitation.
BTW there is some progress on TSO in XDP world...
https://patchwork.kernel.org/project/netdevbpf/cover/cover.1611086134.git.lorenzo@kernel.org/

Toshiaki Makita

Patch

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 0b830be..c7f9326 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2654,12 +2654,8 @@  dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
     uint16_t buf_len;
     void *buf;
 
-    if (rte_pktmbuf_tailroom(pkt) >= sizeof *shinfo) {
-        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
-    } else {
-        total_len += sizeof *shinfo + sizeof(uintptr_t);
-        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
-    }
+    total_len += sizeof *shinfo + sizeof(uintptr_t);
+    total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
 
     if (OVS_UNLIKELY(total_len > UINT16_MAX)) {
         VLOG_ERR("Can't copy packet: too big %u", total_len);
@@ -2674,20 +2670,14 @@  dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
     }
 
     /* Initialize shinfo. */
-    if (shinfo) {
-        shinfo->free_cb = netdev_dpdk_extbuf_free;
-        shinfo->fcb_opaque = buf;
-        rte_mbuf_ext_refcnt_set(shinfo, 1);
-    } else {
-        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
-                                                    netdev_dpdk_extbuf_free,
-                                                    buf);
-        if (OVS_UNLIKELY(shinfo == NULL)) {
-            rte_free(buf);
-            VLOG_ERR("Failed to initialize shared info for mbuf while "
-                     "attempting to attach an external buffer.");
-            return NULL;
-        }
+    shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+                                                netdev_dpdk_extbuf_free,
+                                                buf);
+    if (OVS_UNLIKELY(shinfo == NULL)) {
+        rte_free(buf);
+        VLOG_ERR("Failed to initialize shared info for mbuf while "
+                 "attempting to attach an external buffer.");
+        return NULL;
     }
 
     rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,