Message ID: 20191209135522.16576-1-bjorn.topel@gmail.com
Series: Introduce the BPF dispatcher
Björn Töpel <bjorn.topel@gmail.com> writes:

> Overview
> ========
>
> This is the 4th iteration of the series that introduces the BPF
> dispatcher, which is a mechanism to avoid indirect calls.
>
> The BPF dispatcher is a multi-way branch code generator, targeted for
> BPF programs. E.g. when an XDP program is executed via
> bpf_prog_run_xdp(), it is invoked via an indirect call. With
> retpolines enabled, the indirect call has a substantial performance
> impact. The dispatcher is a mechanism that transforms indirect calls
> to direct calls, and therefore avoids the retpoline. The dispatcher
> is generated using the BPF JIT, and relies on text poking provided by
> bpf_arch_text_poke().
>
> The dispatcher hijacks a trampoline function via the __fentry__ nop
> of the trampoline. One dispatcher instance currently supports up to
> 48 dispatch points. This can be extended in the future.
>
> In this series, only one dispatcher instance is supported, and the
> only user is XDP. The dispatcher is updated when an XDP program is
> attached/detached to/from a netdev. An alternative to this could have
> been to update the dispatcher at program load time, but as there are
> usually more XDP programs loaded than attached, the attach/detach
> point was picked.

I like the new version where it's integrated into bpf_prog_run_xdp();
nice! :)

> The XDP dispatcher is always enabled, if available, because it helps
> even when retpolines are disabled. Please refer to the "Performance"
> section below.

Looking at those numbers, I think I would moderate "helps" to "doesn't
hurt" - a difference of less than 1 ns is basically in the noise.

You mentioned in the earlier version that this would impact the time it
takes to attach an XDP program. Got any numbers for this?

-Toke
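For a concrete mental model of the mechanism the cover letter describes: the
generated dispatcher replaces the single retpolined indirect call with a chain
of compares against the registered program entry points, each ending in a
direct call, and keeps an indirect call only as the fallback. Below is a rough
plain-C sketch of that idea -- the real dispatcher is machine code emitted by
the BPF JIT and patched in via bpf_arch_text_poke(), and all names here are
made up for illustration.

#include <stdio.h>

/* Signature of a JITed BPF program entry point (illustrative). */
typedef unsigned int (*bpf_func_t)(const void *ctx, const void *insns);

static unsigned int prog_a(const void *ctx, const void *insns)
{
	(void)ctx; (void)insns;
	return 1;
}

static unsigned int prog_b(const void *ctx, const void *insns)
{
	(void)ctx; (void)insns;
	return 2;
}

/* What the emitted dispatcher conceptually does: compare the wanted
 * program against the (up to 48) registered entries, ending in a
 * direct call so no retpoline is taken; unknown programs fall back
 * to a plain indirect call.
 */
static unsigned int dispatch(bpf_func_t selected, const void *ctx,
			     const void *insns)
{
	if (selected == prog_a)
		return prog_a(ctx, insns);	/* direct call */
	if (selected == prog_b)
		return prog_b(ctx, insns);	/* direct call */
	return selected(ctx, insns);		/* fallback: indirect call */
}

int main(void)
{
	printf("%u\n", dispatch(prog_b, NULL, NULL));	/* prints 2 */
	return 0;
}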
On Mon, 9 Dec 2019 14:55:16 +0100
Björn Töpel <bjorn.topel@gmail.com> wrote:

> Performance
> ===========
>
> The tests were performed using the xdp_rxq_info sample program with
> the following command-line:
>
> 1. XDP_DRV:
>      # xdp_rxq_info --dev eth0 --action XDP_DROP
> 2. XDP_SKB:
>      # xdp_rxq_info --dev eth0 -S --action XDP_DROP
> 3. xdp-perf, from selftests/bpf:
>      # test_progs -v -t xdp_perf
>
>
> Run with mitigations=auto
> -------------------------
>
> Baseline:
> 1. 22.0 Mpps
> 2. 3.8 Mpps
> 3. 15 ns
>
> Dispatcher:
> 1. 29.4 Mpps (+34%)
> 2. 4.0 Mpps (+5%)
> 3. 5 ns (+66%)

Thanks for providing these extra measurement points. This is good
work. I just want to remind people that when working at these high
speeds, it is easy to get amazed by a +34% improvement, but we have to
be careful to understand that this is saving approx 10 ns of time or
cycles.

In reality the time saved in #2 (3.8 Mpps -> 4.0 Mpps),
(1/3.8-1/4)*1000 = 13.15 ns, is larger than in #1 (22.0 Mpps -> 29.4
Mpps), (1/22-1/29.4)*1000 = 11.44 ns. Test #3 keeps us honest: 15 ns ->
5 ns = 10 ns. The 10 ns improvement is a big deal in an XDP context,
and also corresponds to my own experience with retpoline (approx 12 ns
overhead).

To Bjørn, I would appreciate more digits on your Mpps numbers, so I get
more accuracy on the checks-and-balances I described above. I suspect
the 3.8 Mpps -> 4.0 Mpps number will be closer to the other numbers
when we get more accuracy.

> Dispatcher (full; walk all entries, and fallback):
> 1. 20.4 Mpps (-7%)
> 2. 3.8 Mpps
> 3. 18 ns (-20%)
>
> Run with mitigations=off
> ------------------------
>
> Baseline:
> 1. 29.6 Mpps
> 2. 4.1 Mpps
> 3. 5 ns
>
> Dispatcher:
> 1. 30.7 Mpps (+4%)
> 2. 4.1 Mpps
> 3. 5 ns

While +4% sounds good, it could be measurement noise ;-)

(1/29.6-1/30.7)*1000 = 1.21 ns

As both #3 lines say 5 ns.
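The conversion Jesper uses above is simply the per-packet cost: 1000/Mpps gives
nanoseconds per packet, and the saving is the difference between the baseline
and dispatcher figures. A quick check of the quoted numbers in C:

#include <stdio.h>

/* Per-packet cost in nanoseconds for a rate given in Mpps. */
static double ns_per_pkt(double mpps)
{
	return 1000.0 / mpps;
}

int main(void)
{
	/* mitigations=auto: dispatcher vs. baseline */
	printf("XDP_SKB: %.2f ns saved\n", ns_per_pkt(3.8) - ns_per_pkt(4.0));   /* ~13.2 */
	printf("XDP_DRV: %.2f ns saved\n", ns_per_pkt(22.0) - ns_per_pkt(29.4)); /* ~11.4 */
	/* mitigations=off: dispatcher vs. baseline */
	printf("XDP_DRV: %.2f ns saved\n", ns_per_pkt(29.6) - ns_per_pkt(30.7)); /* ~1.2 */
	return 0;
}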
On Mon, 9 Dec 2019 at 16:00, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Björn Töpel <bjorn.topel@gmail.com> writes:
>
[...]
>
> I like the new version where it's integrated into bpf_prog_run_xdp();
> nice! :)
>

Yes, me too! Nice suggestion!

> > The XDP dispatcher is always enabled, if available, because it helps
> > even when retpolines are disabled. Please refer to the "Performance"
> > section below.
>
> Looking at those numbers, I think I would moderate "helps" to "doesn't
> hurt" - a difference of less than 1 ns is basically in the noise.
>
> You mentioned in the earlier version that this would impact the time it
> takes to attach an XDP program. Got any numbers for this?
>

Ah, no, I forgot to measure that. I'll get back with that. So, when a
new program is added to or removed from the dispatcher, it needs to be
re-jited, but more importantly -- a text poke is needed. I don't know
if this is a concern or not, but let's measure it.


Björn

> -Toke
>
On Mon, 9 Dec 2019 at 18:00, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> On Mon, 9 Dec 2019 14:55:16 +0100
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> > Performance
> > ===========
> >
> > The tests were performed using the xdp_rxq_info sample program with
> > the following command-line:
> >
> > 1. XDP_DRV:
> >      # xdp_rxq_info --dev eth0 --action XDP_DROP
> > 2. XDP_SKB:
> >      # xdp_rxq_info --dev eth0 -S --action XDP_DROP
> > 3. xdp-perf, from selftests/bpf:
> >      # test_progs -v -t xdp_perf
> >
> >
> > Run with mitigations=auto
> > -------------------------
> >
> > Baseline:
> > 1. 22.0 Mpps
> > 2. 3.8 Mpps
> > 3. 15 ns
> >
> > Dispatcher:
> > 1. 29.4 Mpps (+34%)
> > 2. 4.0 Mpps (+5%)
> > 3. 5 ns (+66%)
>
> Thanks for providing these extra measurement points. This is good
> work. I just want to remind people that when working at these high
> speeds, it is easy to get amazed by a +34% improvement, but we have to
> be careful to understand that this is saving approx 10 ns of time or
> cycles.
>
> In reality the time saved in #2 (3.8 Mpps -> 4.0 Mpps),
> (1/3.8-1/4)*1000 = 13.15 ns, is larger than in #1 (22.0 Mpps -> 29.4
> Mpps), (1/22-1/29.4)*1000 = 11.44 ns. Test #3 keeps us honest: 15 ns ->
> 5 ns = 10 ns. The 10 ns improvement is a big deal in an XDP context,
> and also corresponds to my own experience with retpoline (approx 12 ns
> overhead).
>

Ok, good! :-)

> To Bjørn, I would appreciate more digits on your Mpps numbers, so I get
> more accuracy on the checks-and-balances I described above. I suspect
> the 3.8 Mpps -> 4.0 Mpps number will be closer to the other numbers
> when we get more accuracy.
>

Ok! Let me re-run them. If you have some spare cycles, it would be
great if you could try it out as well on your Mellanox setup.
Historically you've always been able to get more stable numbers than
I. :-)

> > Dispatcher (full; walk all entries, and fallback):
> > 1. 20.4 Mpps (-7%)
> > 2. 3.8 Mpps
> > 3. 18 ns (-20%)
> >
> > Run with mitigations=off
> > ------------------------
> >
> > Baseline:
> > 1. 29.6 Mpps
> > 2. 4.1 Mpps
> > 3. 5 ns
> >
> > Dispatcher:
> > 1. 30.7 Mpps (+4%)
> > 2. 4.1 Mpps
> > 3. 5 ns
>
> While +4% sounds good, it could be measurement noise ;-)
>
> (1/29.6-1/30.7)*1000 = 1.21 ns
>
> As both #3 lines say 5 ns.
>

True. Maybe that simply hints that we shouldn't use the dispatcher
here?

Thanks for the comments!
Björn

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
On Mon, 9 Dec 2019 18:45:12 +0100
Björn Töpel <bjorn.topel@gmail.com> wrote:

> On Mon, 9 Dec 2019 at 18:00, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >
> > On Mon, 9 Dec 2019 14:55:16 +0100
> > Björn Töpel <bjorn.topel@gmail.com> wrote:
> >
> > > Performance
> > > ===========
> > >
> > > The tests were performed using the xdp_rxq_info sample program with
> > > the following command-line:
> > >
> > > 1. XDP_DRV:
> > >      # xdp_rxq_info --dev eth0 --action XDP_DROP
> > > 2. XDP_SKB:
> > >      # xdp_rxq_info --dev eth0 -S --action XDP_DROP
> > > 3. xdp-perf, from selftests/bpf:
> > >      # test_progs -v -t xdp_perf
> > >
> > >
> > > Run with mitigations=auto
> > > -------------------------
> > >
> > > Baseline:
> > > 1. 22.0 Mpps
> > > 2. 3.8 Mpps
> > > 3. 15 ns
> > >
> > > Dispatcher:
> > > 1. 29.4 Mpps (+34%)
> > > 2. 4.0 Mpps (+5%)
> > > 3. 5 ns (+66%)
> >
> > Thanks for providing these extra measurement points. This is good
> > work. I just want to remind people that when working at these high
> > speeds, it is easy to get amazed by a +34% improvement, but we have to
> > be careful to understand that this is saving approx 10 ns of time or
> > cycles.
> >
> > In reality the time saved in #2 (3.8 Mpps -> 4.0 Mpps),
> > (1/3.8-1/4)*1000 = 13.15 ns, is larger than in #1 (22.0 Mpps -> 29.4
> > Mpps), (1/22-1/29.4)*1000 = 11.44 ns. Test #3 keeps us honest: 15 ns ->
> > 5 ns = 10 ns. The 10 ns improvement is a big deal in an XDP context,
> > and also corresponds to my own experience with retpoline (approx 12 ns
> > overhead).
> >
>
> Ok, good! :-)
>
> > To Bjørn, I would appreciate more digits on your Mpps numbers, so I get
> > more accuracy on the checks-and-balances I described above. I suspect
> > the 3.8 Mpps -> 4.0 Mpps number will be closer to the other numbers
> > when we get more accuracy.
> >
>
> Ok! Let me re-run them.

Well, I don't think you should waste your time re-running these... It
clearly shows a significant improvement. I'm just complaining that I
didn't have enough digits to do accurate checks-and-balances; they are
close enough that I believe them.

> If you have some spare cycles, it would be great if you could try it
> out as well on your Mellanox setup.

I'll add it to my TODO list... but no promises.

> Historically you've always been able to get more stable numbers than
> I. :-)
>
> > > Dispatcher (full; walk all entries, and fallback):
> > > 1. 20.4 Mpps (-7%)
> > > 2. 3.8 Mpps
> > > 3. 18 ns (-20%)
> > >
> > > Run with mitigations=off
> > > ------------------------
> > >
> > > Baseline:
> > > 1. 29.6 Mpps
> > > 2. 4.1 Mpps
> > > 3. 5 ns
> > >
> > > Dispatcher:
> > > 1. 30.7 Mpps (+4%)
> > > 2. 4.1 Mpps
> > > 3. 5 ns
> >
> > While +4% sounds good, it could be measurement noise ;-)
> >
> > (1/29.6-1/30.7)*1000 = 1.21 ns
> >
> > As both #3 lines say 5 ns.
> >
>
> True. Maybe that simply hints that we shouldn't use the dispatcher
> here?

No. I actually think it is worth exposing this code as much as
possible. And if it really is a 1.2 ns improvement, then I'll gladly
take that as well ;-)

I think this is awesome work! -- thanks for doing this!!!
On 12/9/2019 5:55 AM, Björn Töpel wrote:
> Overview
> ========
>
> This is the 4th iteration of the series that introduces the BPF
> dispatcher, which is a mechanism to avoid indirect calls.

Good to see the progress with getting a mechanism to avoid indirect
calls upstream.

[...]

> Performance
> ===========
>
> The tests were performed using the xdp_rxq_info sample program with
> the following command-line:
>
> 1. XDP_DRV:
>      # xdp_rxq_info --dev eth0 --action XDP_DROP
> 2. XDP_SKB:
>      # xdp_rxq_info --dev eth0 -S --action XDP_DROP
> 3. xdp-perf, from selftests/bpf:
>      # test_progs -v -t xdp_perf

What is this test_progs? I don't see such an app under selftests/bpf.

> Run with mitigations=auto
> -------------------------
>
> Baseline:
> 1. 22.0 Mpps
> 2. 3.8 Mpps
> 3. 15 ns
>
> Dispatcher:
> 1. 29.4 Mpps (+34%)
> 2. 4.0 Mpps (+5%)
> 3. 5 ns (+66%)
>
> Dispatcher (full; walk all entries, and fallback):
> 1. 20.4 Mpps (-7%)
> 2. 3.8 Mpps
> 3. 18 ns (-20%)

Are these packets received on a single queue? Or multiple queues?
Do you see similar improvements even with xdpsock?

Thanks
Sridhar
On Mon, 9 Dec 2019 at 14:55, Björn Töpel <bjorn.topel@gmail.com> wrote:
>
[...]
>
> Discussion/feedback
> ===================
>
> My measurements did not show any improvements for the jump target 16 B
> alignments. Maybe it would make sense to leave alignment out?
>

I did a micro benchmark with "test_progs -t xdp_perf" for all sizes
(max 48 aligned, max 64 non-aligned) of the dispatcher. I couldn't
measure any difference at all, so I will leave the last patch out
(aligning jump targets). If a workload appears where this is
measurable, it can be added at that point.

The micro benchmark also shows that it makes little sense to disable
the dispatcher when "mitigations=off". The diff is within 1 ns(!) when
mitigations are off.

I'll post the data as a reply to the v4 cover letter, so that the
cover isn't clobbered with data. :-P


Björn
On Tue, 10 Dec 2019 at 20:28, Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>
[...]
> > The tests were performed using the xdp_rxq_info sample program with
> > the following command-line:
> >
> > 1. XDP_DRV:
> >      # xdp_rxq_info --dev eth0 --action XDP_DROP
> > 2. XDP_SKB:
> >      # xdp_rxq_info --dev eth0 -S --action XDP_DROP
> > 3. xdp-perf, from selftests/bpf:
> >      # test_progs -v -t xdp_perf
>
> What is this test_progs? I don't see such an app under selftests/bpf.
>

The "test_progs" program resides in tools/testing/selftests/bpf. The
xdp_perf test is part of this series!

> > Run with mitigations=auto
> > -------------------------
> >
> > Baseline:
> > 1. 22.0 Mpps
> > 2. 3.8 Mpps
> > 3. 15 ns
> >
> > Dispatcher:
> > 1. 29.4 Mpps (+34%)
> > 2. 4.0 Mpps (+5%)
> > 3. 5 ns (+66%)
> >
> > Dispatcher (full; walk all entries, and fallback):
> > 1. 20.4 Mpps (-7%)
> > 2. 3.8 Mpps
> > 3. 18 ns (-20%)
>
> Are these packets received on a single queue? Or multiple queues?
> Do you see similar improvements even with xdpsock?
>

Yes, just a single queue, regular XDP. I left out xdpsock for now, and
only focused on the micro benchmark and XDP. I'll get back with xdpsock
benchmarks.

Cheers,
Björn
On Mon, 9 Dec 2019 at 18:42, Björn Töpel <bjorn.topel@gmail.com> wrote:
>
[...]
> > You mentioned in the earlier version that this would impact the time it
> > takes to attach an XDP program. Got any numbers for this?
> >
>
> Ah, no, I forgot to measure that. I'll get back with that. So, when a
> new program is added to or removed from the dispatcher, it needs to be
> re-jited, but more importantly -- a text poke is needed. I don't know
> if this is a concern or not, but let's measure it.
>

Toke, I tried to measure the impact, but didn't really get anything
useful out. :-(

My concern was mainly that text-poking is a point of contention, and
it messes with the icache. As for contention, we're already
synchronized around the rtnl-lock. As for the icache-flush effects...
well... I'm open to suggestions on how to measure the impact in a
useful way.

>
> Björn
>
> > -Toke
> >
Björn Töpel <bjorn.topel@gmail.com> writes:

> On Mon, 9 Dec 2019 at 18:42, Björn Töpel <bjorn.topel@gmail.com> wrote:
>>
> [...]
>> > You mentioned in the earlier version that this would impact the time it
>> > takes to attach an XDP program. Got any numbers for this?
>> >
>>
>> Ah, no, I forgot to measure that. I'll get back with that. So, when a
>> new program is added to or removed from the dispatcher, it needs to be
>> re-jited, but more importantly -- a text poke is needed. I don't know
>> if this is a concern or not, but let's measure it.
>>
>
> Toke, I tried to measure the impact, but didn't really get anything
> useful out. :-(
>
> My concern was mainly that text-poking is a point of contention, and
> it messes with the icache. As for contention, we're already
> synchronized around the rtnl-lock. As for the icache-flush effects...
> well... I'm open to suggestions on how to measure the impact in a
> useful way.

Hmm, how about:

Test 1:

- Run a test with a simple drop program (like you have been) on a
  physical interface (A), sampling the PPS with interval I.

- Load a new XDP program on interface B (which could just be a veth I
  guess?)

- Record the PPS delta in the sampling interval in which the program
  was loaded on interface B.

You could also record for how many intervals the throughput drops, but
I would guess you'd need a fairly short sampling interval to see
anything for this.

Test 2:

- Run an XDP_TX program that just reflects the packets.

- Have the traffic generator measure per-packet latency (from when a
  packet is transmitted until the same packet comes back).

- As above, load a program on a different interface and look for a blip
  in the recorded latency.

Both of these tests could also be done with the program being replaced
being the one that processes packets on the physical interface (instead
of on another interface). That way you could also see if there's any
difference for that before/after the patch...

-Toke
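A minimal sketch of the sampling side of Toke's Test 1, assuming the packet
counter for the receiving interface is read from
/sys/class/net/<ifname>/statistics/. Which counter is meaningful under
XDP_DROP is driver-dependent, and the xdp_rxq_info sample already reports
per-second counts, so this only illustrates the interval/delta bookkeeping;
the interface name and sample count are placeholders.

#include <stdio.h>
#include <unistd.h>

/* Read a packet counter for an interface from sysfs. */
static unsigned long long read_rx_packets(const char *ifname)
{
	char path[256];
	unsigned long long val = 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/net/%s/statistics/rx_packets", ifname);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%llu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(void)
{
	const char *ifname = "eth0";		/* placeholder: interface A */
	const unsigned int interval = 1;	/* sampling interval, seconds */
	unsigned long long prev, cur;
	int i;

	prev = read_rx_packets(ifname);
	for (i = 0; i < 60; i++) {	/* load a program on interface B mid-run */
		sleep(interval);
		cur = read_rx_packets(ifname);
		printf("sample %d: %llu pps\n", i, (cur - prev) / interval);
		prev = cur;
	}
	return 0;
}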