Message ID: 20191126100744.5083-1-prashantbhole.linux@gmail.com
Series: virtio_net XDP offload
On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
> Note: This RFC has been sent to netdev as well as qemu-devel lists
>
> This series introduces XDP offloading from virtio_net. It is based on
> the following work by Jason Wang:
> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
>
> Current XDP performance in virtio-net is far from what we can achieve
> on host. Several major factors cause the difference:
> - Cost of virtualization
> - Cost of virtio (populating virtqueue and context switching)
> - Cost of vhost, it needs more optimization
> - Cost of data copy
> Because of above reasons there is a need of offloading XDP program to
> host. This set is an attempt to implement XDP offload from the guest.

This turns the guest kernel into a uAPI proxy.

BPF uAPI calls related to the "offloaded" BPF objects are forwarded to
the hypervisor; they pop up in QEMU, which makes the requested call to
the hypervisor kernel. Today it's the Linux kernel; tomorrow it may be
someone's proprietary "SmartNIC" implementation.

Why can't those calls be forwarded at a higher layer? Why do they have
to go through the guest kernel?

If the kernel performs no significant work (or "adds value", pardon
the expression), and the problem can easily be solved otherwise, we
shouldn't do the work of maintaining the mechanism.

The approach of the kernel generating actual machine code which is
then loaded into a sandbox on the hypervisor/SmartNIC is another
story.

I'd appreciate it if others could chime in.
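To make the flow under discussion concrete, here is a minimal sketch of
what a virtio control command carrying a forwarded BPF uAPI call could
look like. The layout and all names are illustrative assumptions, not
taken from the RFC patches:

	#include <linux/types.h>

	/* Hypothetical layout of a virtio-net control command that
	 * forwards one guest bpf(2) call to the device (QEMU or real
	 * hardware). */
	struct virtio_net_ctrl_bpf {
		__le32 bpf_cmd;   /* mirrors the bpf(2) cmd: BPF_PROG_LOAD, ... */
		__le32 attr_size; /* length of the bpf_attr payload below */
		__u8   attr[];    /* union bpf_attr, with guest pointers
				   * translated for the host side */
	};

In such a scheme the device would execute the call on the host and
return the result (an error code or a host object handle) in the status
part of the same virtqueue request.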
Hi Jakub:

On 2019/11/27 4:35 AM, Jakub Kicinski wrote:
> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
>> Note: This RFC has been sent to netdev as well as qemu-devel lists
>>
>> This series introduces XDP offloading from virtio_net. It is based on
>> the following work by Jason Wang:
>> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
>>
>> Current XDP performance in virtio-net is far from what we can achieve
>> on host. Several major factors cause the difference:
>> - Cost of virtualization
>> - Cost of virtio (populating virtqueue and context switching)
>> - Cost of vhost, it needs more optimization
>> - Cost of data copy
>> Because of above reasons there is a need of offloading XDP program to
>> host. This set is an attempt to implement XDP offload from the guest.
> This turns the guest kernel into a uAPI proxy.
>
> BPF uAPI calls related to the "offloaded" BPF objects are forwarded to
> the hypervisor; they pop up in QEMU, which makes the requested call to
> the hypervisor kernel. Today it's the Linux kernel; tomorrow it may be
> someone's proprietary "SmartNIC" implementation.
>
> Why can't those calls be forwarded at a higher layer? Why do they have
> to go through the guest kernel?

I think doing the forwarding at a higher layer has the following
issues:

- Need a dedicated library (probably libbpf), but an application may
  choose to do the eBPF syscall directly
- Depends on a guest agent to work
- Can't work for virtio-net hardware, since it still requires a
  hardware interface for carrying the offloading information
- Implementing at the kernel level may help future extensions like BPF
  object pinning, eBPF helpers, etc.

Basically, this series is trying to have an implementation of
transporting eBPF through virtio, so it's not necessarily guest to host
but driver and device. The device could be either a virtual one (as
done in qemu) or real hardware.

> If the kernel performs no significant work (or "adds value", pardon
> the expression), and the problem can easily be solved otherwise, we
> shouldn't do the work of maintaining the mechanism.

My understanding is that it should not be much different from other
offloading technologies.

> The approach of the kernel generating actual machine code which is
> then loaded into a sandbox on the hypervisor/SmartNIC is another
> story.

We've considered such a way, but actual machine code is not as portable
as eBPF bytecode, considering we may want to:

- Support migration
- Further offload the program to a smart NIC (e.g. through macvtap
  passthrough mode etc.)

Thanks

> I'd appreciate it if others could chime in.
On Wed, 27 Nov 2019 10:59:37 +0800, Jason Wang wrote:
> On 2019/11/27 4:35 AM, Jakub Kicinski wrote:
> > On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
> >> Note: This RFC has been sent to netdev as well as qemu-devel lists
> >>
> >> This series introduces XDP offloading from virtio_net. It is based on
> >> the following work by Jason Wang:
> >> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
> >>
> >> Current XDP performance in virtio-net is far from what we can achieve
> >> on host. Several major factors cause the difference:
> >> - Cost of virtualization
> >> - Cost of virtio (populating virtqueue and context switching)
> >> - Cost of vhost, it needs more optimization
> >> - Cost of data copy
> >> Because of above reasons there is a need of offloading XDP program to
> >> host. This set is an attempt to implement XDP offload from the guest.
> > This turns the guest kernel into a uAPI proxy.
> >
> > BPF uAPI calls related to the "offloaded" BPF objects are forwarded to
> > the hypervisor; they pop up in QEMU, which makes the requested call to
> > the hypervisor kernel. Today it's the Linux kernel; tomorrow it may be
> > someone's proprietary "SmartNIC" implementation.
> >
> > Why can't those calls be forwarded at a higher layer? Why do they have
> > to go through the guest kernel?
>
> I think doing the forwarding at a higher layer has the following
> issues:
>
> - Need a dedicated library (probably libbpf), but an application may
>   choose to do the eBPF syscall directly
> - Depends on a guest agent to work

This can be said about any user space functionality.

> - Can't work for virtio-net hardware, since it still requires a
>   hardware interface for carrying the offloading information

The HW virtio-net presumably still has a PF and hopefully reprs for
VFs, so why can't it attach the program there?

> - Implementing at the kernel level may help future extensions like BPF
>   object pinning, eBPF helpers, etc.

No idea what you mean by this.

> Basically, this series is trying to have an implementation of
> transporting eBPF through virtio, so it's not necessarily guest to host
> but driver and device. The device could be either a virtual one (as
> done in qemu) or real hardware.

A SmartNIC with multi-core 64-bit ARM CPUs is as much of a host as the
x86 hypervisor side is. This set turns the kernel into a uAPI
forwarder.

3 years ago my answer to this proposal would have been very different.
Today, after all the CPU bugs, it seems like the SmartNICs (which are
just another CPU running proprietary code) may just take off..

> > If the kernel performs no significant work (or "adds value", pardon
> > the expression), and the problem can easily be solved otherwise, we
> > shouldn't do the work of maintaining the mechanism.
>
> My understanding is that it should not be much different from other
> offloading technologies.

I presume you mean TC offloads? In virtualization there is inherently a
hypervisor which will receive the request, be it an IO hub/SmartNIC or
the traditional hypervisor on the same CPU.

The ACL/routing offloads differ significantly, because either the
driver does all the HW register poking directly, or the complexity of
programming a rule into a HW table is quite low.

The same is true for the NFP BPF offload, BTW; the driver does all the
heavy lifting and compiles the final machine code image.

You can't say verifying and JITing BPF code into machine code entirely
in the hypervisor is similarly simple. So no, there is a huge
difference.

> > The approach of the kernel generating actual machine code which is
> > then loaded into a sandbox on the hypervisor/SmartNIC is another
> > story.
>
> We've considered such a way, but actual machine code is not as portable
> as eBPF bytecode, considering we may want to:
>
> - Support migration
> - Further offload the program to a smart NIC (e.g. through macvtap
>   passthrough mode etc.)

You can re-JIT or JIT for the SmartNIC..? Having the BPF bytecode does
not guarantee migration either, if the environment is expected to be
running different versions of HW and SW. But yes, JITing in the guest
kernel when you don't know what to JIT for may be hard. I was just
saying that I don't mean to discourage people from implementing
sandboxes which run JITed code on SmartNICs. My criticism is (as
always?) against turning the kernel into a one-to-one uAPI forwarder
into unknown platform code.

For cloud use cases I believe the higher layer should solve this.
On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
> > Note: This RFC has been sent to netdev as well as qemu-devel lists
> >
> > This series introduces XDP offloading from virtio_net. It is based on
> > the following work by Jason Wang:
> > https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
> >
> > Current XDP performance in virtio-net is far from what we can achieve
> > on host. Several major factors cause the difference:
> > - Cost of virtualization
> > - Cost of virtio (populating virtqueue and context switching)
> > - Cost of vhost, it needs more optimization
> > - Cost of data copy
> > Because of above reasons there is a need of offloading XDP program to
> > host. This set is an attempt to implement XDP offload from the guest.
>
> This turns the guest kernel into a uAPI proxy.
>
> BPF uAPI calls related to the "offloaded" BPF objects are forwarded to
> the hypervisor; they pop up in QEMU, which makes the requested call to
> the hypervisor kernel. Today it's the Linux kernel; tomorrow it may be
> someone's proprietary "SmartNIC" implementation.
>
> Why can't those calls be forwarded at a higher layer? Why do they have
> to go through the guest kernel?

Well, everyone is writing these programs and attaching them to NICs.
For better or worse, that's how userspace is written. Yes, in the
simple case where everything is passed through, it could instead be
passed through some other channel just as well, but then userspace
would need significant changes just to make it work with virtio.

> If the kernel performs no significant work (or "adds value", pardon
> the expression), and the problem can easily be solved otherwise, we
> shouldn't do the work of maintaining the mechanism.
>
> The approach of the kernel generating actual machine code which is
> then loaded into a sandbox on the hypervisor/SmartNIC is another
> story.

But that's transparent to guest userspace. Making userspace care
whether it's a SmartNIC or a software device breaks part of
virtualization's appeal, which is that it looks like a hardware box to
the guest.

> I'd appreciate it if others could chime in.
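For context on how that userspace is written today: with the libbpf of
that era, loading and attaching an XDP program looks roughly like the
sketch below (error handling trimmed; file and interface names are
placeholders):

	#include <bpf/bpf.h>
	#include <bpf/libbpf.h>
	#include <linux/if_link.h>
	#include <net/if.h>

	int attach_xdp(const char *obj_path, const char *ifname)
	{
		struct bpf_object *obj;
		int prog_fd;
		int ifindex = if_nametoindex(ifname);

		/* Load the object; the kernel verifies and JITs the program. */
		if (bpf_prog_load(obj_path, BPF_PROG_TYPE_XDP, &obj, &prog_fd))
			return -1;

		/* Attach to the NIC in native driver mode. */
		return bpf_set_link_xdp_fd(ifindex, prog_fd, XDP_FLAGS_DRV_MODE);
	}

Forwarding at a higher layer would mean intercepting or rewriting this
flow, which is the change to existing userspace being debated here.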
On Wed, 27 Nov 2019 15:32:17 -0500, Michael S. Tsirkin wrote:
> On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
> > On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
> > > Note: This RFC has been sent to netdev as well as qemu-devel lists
> > >
> > > This series introduces XDP offloading from virtio_net. It is based on
> > > the following work by Jason Wang:
> > > https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
> > >
> > > Current XDP performance in virtio-net is far from what we can achieve
> > > on host. Several major factors cause the difference:
> > > - Cost of virtualization
> > > - Cost of virtio (populating virtqueue and context switching)
> > > - Cost of vhost, it needs more optimization
> > > - Cost of data copy
> > > Because of above reasons there is a need of offloading XDP program to
> > > host. This set is an attempt to implement XDP offload from the guest.
> >
> > This turns the guest kernel into a uAPI proxy.
> >
> > BPF uAPI calls related to the "offloaded" BPF objects are forwarded to
> > the hypervisor; they pop up in QEMU, which makes the requested call to
> > the hypervisor kernel. Today it's the Linux kernel; tomorrow it may be
> > someone's proprietary "SmartNIC" implementation.
> >
> > Why can't those calls be forwarded at a higher layer? Why do they have
> > to go through the guest kernel?
>
> Well, everyone is writing these programs and attaching them to NICs.

Who's everyone?

> For better or worse, that's how userspace is written.

HW offload requires modifying the user space, too. The offload is not
transparent. Do you know that?

> Yes, in the simple case where everything is passed through, it could
> instead be passed through some other channel just as well, but then
> userspace would need significant changes just to make it work with
> virtio.

There is a recently spawned effort to create an "XDP daemon", or
otherwise a control application which would, among other things, link
separate XDP apps to share a NIC attachment point. Making use of cloud
APIs would make a perfect addition to that.

Obviously, if one asks a kernel guy to solve a problem, one will get
kernel code as an answer. And writing higher layer code requires
companies to actually organize their teams and have "full stack"
strategies. We've seen this story already with the net_failover wart.
At least that time we weren't risking building a proxy to someone's
proprietary FW.

> > If the kernel performs no significant work (or "adds value", pardon
> > the expression), and the problem can easily be solved otherwise, we
> > shouldn't do the work of maintaining the mechanism.
> >
> > The approach of the kernel generating actual machine code which is
> > then loaded into a sandbox on the hypervisor/SmartNIC is another
> > story.
>
> But that's transparent to guest userspace. Making userspace care
> whether it's a SmartNIC or a software device breaks part of
> virtualization's appeal, which is that it looks like a hardware box to
> the guest.

It's not hardware unless you JITed machine code for it; it's just
someone else's software. I'm not arguing with the appeal. I'm arguing
the risk/benefit ratio doesn't justify opening this can of worms.

> > I'd appreciate it if others could chime in.
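For reference on the claim that HW offload is not transparent: with the
existing NFP-style offload, user space must already pass the target
ifindex at program load time (so verification and JIT target the
device) and request hardware mode explicitly on attach. A sketch
against the libbpf API of the time; the object file name is a
placeholder:

	#include <bpf/bpf.h>
	#include <bpf/libbpf.h>
	#include <linux/if_link.h>

	int attach_xdp_offloaded(int ifindex)
	{
		struct bpf_prog_load_attr attr = {
			.file      = "xdp_prog.o",
			.prog_type = BPF_PROG_TYPE_XDP,
			.ifindex   = ifindex,	/* offload target, at load time */
		};
		struct bpf_object *obj;
		int prog_fd;

		if (bpf_prog_load_xattr(&attr, &obj, &prog_fd))
			return -1;

		/* Offload must also be requested explicitly on attach. */
		return bpf_set_link_xdp_fd(ifindex, prog_fd, XDP_FLAGS_HW_MODE);
	}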
On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
> I'd appreciate it if others could chime in.

The performance improvements are quite appealing. In general,
offloading from higher layers into lower layers is necessary long term.

But the approach taken by patches 15 and 17 is a dead end. I don't see
how it can ever catch up with the pace of bpf development. As
presented, this approach works for the most basic programs and simple
maps. No line info, no BTF, no debuggability. There are no tail_calls
either. I don't think I've seen a single production XDP program that
doesn't use tail calls. Static and dynamic linking is coming. Wrapping
one bpf feature at a time with a virtio api is never going to be
complete. How are FDs going to be passed back? OBJ_GET_INFO_BY_FD?
OBJ_PIN/GET? Where is bpffs going to live? Any realistic XDP
application will be using a lot more than a single self-contained XDP
prog with hash and array maps.

It feels that the whole sys_bpf needs to be forwarded as a whole from
guest into host. In the case of true hw offload, the host is managing
the HW, so it doesn't forward syscalls into the driver. The offload
from guest into host is different. BPF can be seen as a resource that
the host provides, and the guest kernel plus qemu would be forwarding
requests between guest user space and host kernel. For example,
sys_bpf(BPF_MAP_CREATE) can pass through into the host directly. The FD
that the host sees would need a corresponding mirror FD in the guest.
There are still questions about bpffs paths, but the main issue of
one-feature-at-a-time will be addressed in such an approach. There
could be other solutions, of course.

Suggestions are welcomed.
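A rough sketch of the FD-mirroring idea, under the assumption of some
guest-to-host transport (all virtbpf_* names are made up for
illustration): the guest kernel passes the whole bpf_attr through and
wraps the FD returned by the host in a guest-local anonymous inode:

	#include <linux/anon_inodes.h>
	#include <linux/bpf.h>
	#include <linux/fcntl.h>

	/* Hypothetical: forward the call to the host, return the host FD. */
	extern int virtbpf_transport_call(int cmd, union bpf_attr *attr,
					  unsigned int size);
	extern const struct file_operations virtbpf_fops;

	static int virtbpf_forward(int cmd, union bpf_attr *attr,
				   unsigned int size)
	{
		int host_fd;

		/* Pass the syscall through, e.g. cmd == BPF_MAP_CREATE. */
		host_fd = virtbpf_transport_call(cmd, attr, size);
		if (host_fd < 0)
			return host_fd;

		/* Mirror the host FD with a guest-local file so that later
		 * syscalls referencing the guest FD can be translated back. */
		return anon_inode_getfd("virtio-bpf", &virtbpf_fops,
					(void *)(long)host_fd,
					O_RDWR | O_CLOEXEC);
	}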
On 2019/11/28 3:49 AM, Jakub Kicinski wrote:
> On Wed, 27 Nov 2019 10:59:37 +0800, Jason Wang wrote:
>> On 2019/11/27 4:35 AM, Jakub Kicinski wrote:
>>> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
>>>> Note: This RFC has been sent to netdev as well as qemu-devel lists
>>>>
>>>> This series introduces XDP offloading from virtio_net. It is based on
>>>> the following work by Jason Wang:
>>>> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
>>>>
>>>> Current XDP performance in virtio-net is far from what we can achieve
>>>> on host. Several major factors cause the difference:
>>>> - Cost of virtualization
>>>> - Cost of virtio (populating virtqueue and context switching)
>>>> - Cost of vhost, it needs more optimization
>>>> - Cost of data copy
>>>> Because of above reasons there is a need of offloading XDP program to
>>>> host. This set is an attempt to implement XDP offload from the guest.
>>> This turns the guest kernel into a uAPI proxy.
>>>
>>> BPF uAPI calls related to the "offloaded" BPF objects are forwarded to
>>> the hypervisor; they pop up in QEMU, which makes the requested call to
>>> the hypervisor kernel. Today it's the Linux kernel; tomorrow it may be
>>> someone's proprietary "SmartNIC" implementation.
>>>
>>> Why can't those calls be forwarded at a higher layer? Why do they have
>>> to go through the guest kernel?
>>
>> I think doing the forwarding at a higher layer has the following
>> issues:
>>
>> - Need a dedicated library (probably libbpf), but an application may
>>   choose to do the eBPF syscall directly
>> - Depends on a guest agent to work
> This can be said about any user space functionality.

Yes, but the feature may have too many unnecessary dependencies: a
dedicated library, a guest agent, a host agent, etc. This can only work
for some specific setups and will lead to vendor-specific
implementations.

>> - Can't work for virtio-net hardware, since it still requires a
>>   hardware interface for carrying the offloading information
> The HW virtio-net presumably still has a PF and hopefully reprs for
> VFs, so why can't it attach the program there?

Then you still need an interface for carrying such information.
Assuming we had a virtio-net VF with reprs, it would work like:

libbpf(guest) -> guest agent -> host agent -> libbpf(host) ->
    BPF syscall -> VF reprs/PF driver -> VF/PF reprs -> virtio-net VF

You still need a vendor-specific way of passing the eBPF commands from
the driver to the reprs/PF, and possibly it could still be a virtio
interface there. In this proposal it works out of the box, as simply
as:

libbpf(guest) -> guest kernel -> virtio-net driver -> virtio-net VF

If the request comes from the host (e.g. flow offloading,
configuration, etc.), VF reprs are a perfect fit. But if the request
comes from the guest, the much longer journey looks quite like a burden
(dependencies, bugs, etc.). What's more important, we cannot assume how
the virtio-net HW is structured; it might not even be an SR-IOV or PCI
card.

>> - Implementing at the kernel level may help future extensions like BPF
>>   object pinning, eBPF helpers, etc.
> No idea what you mean by this.

My understanding is that we should narrow the gap between non-offloaded
and offloaded eBPF programs. Making maps or progs visible to the kernel
may help to preserve a unified API, e.g. object pinning through sysfs,
tracepoints, debugging, etc.

>> Basically, this series is trying to have an implementation of
>> transporting eBPF through virtio, so it's not necessarily guest to host
>> but driver and device. The device could be either a virtual one (as
>> done in qemu) or real hardware.
> A SmartNIC with multi-core 64-bit ARM CPUs is as much of a host as the
> x86 hypervisor side is. This set turns the kernel into a uAPI
> forwarder.

Not necessarily. As was done for NFP, the driver filters out the
features that are not supported, and the bpf object is still visible in
the kernel (and see the comment above).

> 3 years ago my answer to this proposal would have been very different.
> Today, after all the CPU bugs, it seems like the SmartNICs (which are
> just another CPU running proprietary code) may just take off..

That's interesting, but a vendor may choose to use an FPGA rather than
a SoC in this case. Anyhow, discussion like this is somewhat out of the
scope of this series.

>>> If the kernel performs no significant work (or "adds value", pardon
>>> the expression), and the problem can easily be solved otherwise, we
>>> shouldn't do the work of maintaining the mechanism.
>> My understanding is that it should not be much different from other
>> offloading technologies.
> I presume you mean TC offloads? In virtualization there is inherently a
> hypervisor which will receive the request, be it an IO hub/SmartNIC or
> the traditional hypervisor on the same CPU.
>
> The ACL/routing offloads differ significantly, because either the
> driver does all the HW register poking directly, or the complexity of
> programming a rule into a HW table is quite low.
>
> The same is true for the NFP BPF offload, BTW; the driver does all the
> heavy lifting and compiles the final machine code image.

Yes, and this series benefits from the infrastructure invented for NFP.
But I'm not sure this is a good point, since technically the machine
code could be generated by the smart NIC as well.

> You can't say verifying and JITing BPF code into machine code entirely
> in the hypervisor is similarly simple.

Yes, and that's why we chose to do it on the device (host) to simplify
things.

> So no, there is a huge difference.
>
>>> The approach of the kernel generating actual machine code which is
>>> then loaded into a sandbox on the hypervisor/SmartNIC is another
>>> story.
>> We've considered such a way, but actual machine code is not as portable
>> as eBPF bytecode, considering we may want to:
>>
>> - Support migration
>> - Further offload the program to a smart NIC (e.g. through macvtap
>>   passthrough mode etc.)
> You can re-JIT or JIT for the SmartNIC..? Having the BPF bytecode does
> not guarantee migration either,

Yes, but it's more portable than machine code.

> if the environment is expected to be
> running different versions of HW and SW.

Right, we plan to have feature negotiation.

> But yes, JITing in the guest
> kernel when you don't know what to JIT for may be hard.

Yes.

> I was just
> saying that I don't mean to discourage people from implementing
> sandboxes which run JITed code on SmartNICs. My criticism is (as
> always?) against turning the kernel into a one-to-one uAPI forwarder
> into unknown platform code.

We have FUSE, and I think it's not only a forwarder; we may do much
more work on top in the future. As for unknown platform code, I'm not
sure why we need to care about that. There's no way for us to prevent
such implementations, and if we try to formalize it through a
specification (the virtio spec and probably an eBPF spec), it may
actually help.

> For cloud use cases I believe the higher layer should solve this.

Technically possible, but it has lots of drawbacks.

Thanks
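The feature negotiation mentioned above could plausibly reuse virtio's
standard feature-bit mechanism. A sketch with made-up feature names and
bit numbers (the actual bits would have to be defined in the virtio
spec), where prog_uses_tail_calls() is a hypothetical helper:

	#include <linux/bpf.h>
	#include <linux/errno.h>
	#include <linux/virtio_config.h>

	/* Hypothetical feature bits; names and numbers are illustrative. */
	#define VIRTIO_NET_F_XDP_OFFLOAD	46
	#define VIRTIO_NET_F_BPF_TAIL_CALL	47

	extern bool prog_uses_tail_calls(const struct bpf_prog *prog);

	static int virtnet_xdp_offload_check(struct virtio_device *vdev,
					     const struct bpf_prog *prog)
	{
		if (!virtio_has_feature(vdev, VIRTIO_NET_F_XDP_OFFLOAD))
			return -EOPNOTSUPP;

		/* Refuse offload when the program needs a capability the
		 * device did not advertise; the caller then falls back to
		 * the regular, non-offloaded XDP path. */
		if (prog_uses_tail_calls(prog) &&
		    !virtio_has_feature(vdev, VIRTIO_NET_F_BPF_TAIL_CALL))
			return -EOPNOTSUPP;

		return 0;
	}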
On 2019/11/28 11:32 AM, Alexei Starovoitov wrote:
> On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
>> I'd appreciate it if others could chime in.
> The performance improvements are quite appealing. In general,
> offloading from higher layers into lower layers is necessary long
> term.
>
> But the approach taken by patches 15 and 17 is a dead end. I don't see
> how it can ever catch up with the pace of bpf development.

This applies to any hardware offloading feature, doesn't it?

> As presented, this approach works for the most basic programs and
> simple maps. No line info, no BTF, no debuggability. There are no
> tail_calls either.

If I understand correctly, none of the above were implemented in NFP.
We can collaborate to find a solution for all of those.

> I don't think I've seen a single production XDP program that doesn't
> use tail calls.

It looks to me like we can manage to add this support.

> Static and dynamic linking is coming. Wrapping one bpf feature at a
> time with a virtio api is never going to be complete.

It's a common problem for any hardware that wants to implement eBPF
offloading, not a virtio-specific one.

> How are FDs going to be passed back? OBJ_GET_INFO_BY_FD?
> OBJ_PIN/GET? Where is bpffs going to live?

If we want pinning to work in the virt case, it should probably live in
both the host and the guest.

> Any realistic XDP application will be using a lot more than a single
> self-contained XDP prog with hash and array maps.

It's possible, if we want to use XDP offloading to accelerate VNFs,
which often have simple logic.

> It feels that the whole sys_bpf needs to be forwarded as a whole from
> guest into host. In the case of true hw offload, the host is managing
> the HW, so it doesn't forward syscalls into the driver. The offload
> from guest into host is different. BPF can be seen as a resource that
> the host provides, and the guest kernel plus qemu would be forwarding
> requests between guest user space and host kernel. For example,
> sys_bpf(BPF_MAP_CREATE) can pass through into the host directly. The
> FD that the host sees would need a corresponding mirror FD in the
> guest. There are still questions about bpffs paths, but the main issue
> of one-feature-at-a-time will be addressed in such an approach.

We try to follow what NFP did by starting from a fraction of the whole
eBPF feature set. It would be very hard to have all eBPF features
implemented from the start. It would be helpful to clarify what's the
minimal set of features that you want to have from the start.

> There could be other solutions, of course.
>
> Suggestions are welcomed.

Thanks
On 11/27/19 10:18 PM, Jason Wang wrote:
> We try to follow what NFP did by starting from a fraction of the whole
> eBPF feature set. It would be very hard to have all eBPF features
> implemented from the start. It would be helpful to clarify what's the
> minimal set of features that you want to have from the start.

Offloading guest programs needs to prevent a guest XDP program from
running bpf helpers that access host kernel data, e.g. bpf_fib_lookup.
On 2019/12/2 12:54 AM, David Ahern wrote:
> On 11/27/19 10:18 PM, Jason Wang wrote:
>> We try to follow what NFP did by starting from a fraction of the whole
>> eBPF feature set. It would be very hard to have all eBPF features
>> implemented from the start. It would be helpful to clarify what's the
>> minimal set of features that you want to have from the start.
> Offloading guest programs needs to prevent a guest XDP program from
> running bpf helpers that access host kernel data, e.g. bpf_fib_lookup.

Right, so we probably need a new type of eBPF program on the host and
filter out the unsupported helpers there.

Thanks
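A minimal sketch of that idea: a dedicated host-side program type whose
get_func_proto verifier hook returns NULL for helpers that would expose
host kernel state. The program-type registration is omitted and the
helper list is only an example of the kind of filtering meant here:

	#include <linux/bpf.h>
	#include <linux/filter.h>

	static const struct bpf_func_proto *
	virtbpf_offload_func_proto(enum bpf_func_id func_id,
				   const struct bpf_prog *prog)
	{
		switch (func_id) {
		case BPF_FUNC_fib_lookup:	/* would walk the *host* FIB */
		case BPF_FUNC_sk_lookup_tcp:	/* host socket tables */
		case BPF_FUNC_sk_lookup_udp:
			return NULL;		/* verifier rejects the program */
		default:
			return bpf_base_func_proto(func_id);
		}
	}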
On Wed, Nov 27, 2019 at 03:40:14PM -0800, Jakub Kicinski wrote:
> > For better or worse, that's how userspace is written.
>
> HW offload requires modifying the user space, too. The offload is not
> transparent. Do you know that?

It's true, offload of the program itself isn't transparent. Adding a
3rd interface (software/hardware/host) isn't welcome though, IMHO.