Message ID: 20190612155605.22450-6-maximmi@mellanox.com
State: Changes Requested
Delegated to: BPF Maintainers
Series: AF_XDP infrastructure improvements and mlx5e support
On Wed, 12 Jun 2019 15:56:43 +0000, Maxim Mikityanskiy wrote:
> The typical XDP memory scheme is one packet per page. Change the AF_XDP
> frame size in libbpf to 4096, which is the page size on x86, to allow
> libbpf to be used with the drivers with the packet-per-page scheme.

This is slightly surprising. Why does the driver care about the bufsz?
You're not supposed to do page operations on UMEM pages, anyway. And the
RX size filter should be configured according to MTU regardless of XDP
state. Can you explain?

> Add a command line option -f to xdpsock to allow to specify a custom
> frame size.
>
> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
On 2019-06-12 23:10, Jakub Kicinski wrote:
> On Wed, 12 Jun 2019 15:56:43 +0000, Maxim Mikityanskiy wrote:
>> The typical XDP memory scheme is one packet per page. Change the AF_XDP
>> frame size in libbpf to 4096, which is the page size on x86, to allow
>> libbpf to be used with the drivers with the packet-per-page scheme.
>
> This is slightly surprising. Why does the driver care about the bufsz?

The classic XDP implementation supports only the packet-per-page scheme.
mlx5e implements this scheme, because it perfectly fits with the
xdp_return and page pool APIs. AF_XDP relies on XDP, and even though
AF_XDP doesn't really allocate or release pages, it works on top of XDP,
and the XDP implementation in mlx5e does allocate and release pages (in
the general case) and works with the packet-per-page scheme.

> You're not supposed to do page operations on UMEM pages, anyway.
> And the RX size filter should be configured according to MTU regardless
> of XDP state.

Yes, of course, MTU is taken into account.

> Can you explain?
>
>> Add a command line option -f to xdpsock to allow to specify a custom
>> frame size.
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
>> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
On Thu, 13 Jun 2019 14:01:39 +0000, Maxim Mikityanskiy wrote:
> On 2019-06-12 23:10, Jakub Kicinski wrote:
>> On Wed, 12 Jun 2019 15:56:43 +0000, Maxim Mikityanskiy wrote:
>>> The typical XDP memory scheme is one packet per page. Change the AF_XDP
>>> frame size in libbpf to 4096, which is the page size on x86, to allow
>>> libbpf to be used with the drivers with the packet-per-page scheme.
>>
>> This is slightly surprising. Why does the driver care about the bufsz?
>
> The classic XDP implementation supports only the packet-per-page scheme.
> mlx5e implements this scheme, because it perfectly fits with the
> xdp_return and page pool APIs. AF_XDP relies on XDP, and even though
> AF_XDP doesn't really allocate or release pages, it works on top of XDP,
> and the XDP implementation in mlx5e does allocate and release pages (in
> the general case) and works with the packet-per-page scheme.

Yes, okay, I get that. But I still don't know what's the exact use you
have for AF_XDP buffers being 4k. Could you point us to the place in the
code which relies on all buffers being 4k in any XDP scenario?

>> You're not supposed to do page operations on UMEM pages, anyway.
>> And the RX size filter should be configured according to MTU regardless
>> of XDP state.
>
> Yes, of course, MTU is taken into account.
>
>> Can you explain?
>>
>>> Add a command line option -f to xdpsock to allow to specify a custom
>>> frame size.
>>>
>>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>>> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
>>> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
On 2019-06-13 20:29, Jakub Kicinski wrote:
> On Thu, 13 Jun 2019 14:01:39 +0000, Maxim Mikityanskiy wrote:
>> On 2019-06-12 23:10, Jakub Kicinski wrote:
>>> On Wed, 12 Jun 2019 15:56:43 +0000, Maxim Mikityanskiy wrote:
>>>> The typical XDP memory scheme is one packet per page. Change the AF_XDP
>>>> frame size in libbpf to 4096, which is the page size on x86, to allow
>>>> libbpf to be used with the drivers with the packet-per-page scheme.
>>>
>>> This is slightly surprising. Why does the driver care about the bufsz?
>>
>> The classic XDP implementation supports only the packet-per-page scheme.
>> mlx5e implements this scheme, because it perfectly fits with the
>> xdp_return and page pool APIs. AF_XDP relies on XDP, and even though
>> AF_XDP doesn't really allocate or release pages, it works on top of XDP,
>> and the XDP implementation in mlx5e does allocate and release pages (in
>> the general case) and works with the packet-per-page scheme.
>
> Yes, okay, I get that. But I still don't know what's the exact use you
> have for AF_XDP buffers being 4k. Could you point us to the place in the
> code which relies on all buffers being 4k in any XDP scenario?

1. An XDP program is set on all queues, so to support non-4k AF_XDP
frames, we would also need to support multiple-packet-per-page XDP for
regular queues.

2. Page allocation in mlx5e perfectly fits page-sized XDP frames. Some
examples in the code are:

2.1. mlx5e_free_rx_mpwqe calls a generic mlx5e_page_release to release
the pages of an MPWQE (multi-packet work queue element), which is
implemented as xsk_umem_fq_reuse for the case of XSK. We avoid extra
overhead by using the fact that packet == page.

2.2. mlx5e_free_xdpsq_desc performs cleanup after XDP transmits. In case
of XDP_TX, we can free/recycle the pages without having a refcount
overhead, by using the fact that packet == page.

>>> You're not supposed to do page operations on UMEM pages, anyway.
>>> And the RX size filter should be configured according to MTU regardless
>>> of XDP state.
>>
>> Yes, of course, MTU is taken into account.
>>
>>> Can you explain?
>>>
>>>> Add a command line option -f to xdpsock to allow to specify a custom
>>>> frame size.
>>>>
>>>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>>>> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
>>>> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
On Fri, 14 Jun 2019 13:25:28 +0000, Maxim Mikityanskiy wrote:
> On 2019-06-13 20:29, Jakub Kicinski wrote:
>> On Thu, 13 Jun 2019 14:01:39 +0000, Maxim Mikityanskiy wrote:
>>
>> Yes, okay, I get that. But I still don't know what's the exact use you
>> have for AF_XDP buffers being 4k. Could you point us to the place in the
>> code which relies on all buffers being 4k in any XDP scenario?

Okay, I still don't get it, but thanks for explaining :) Perhaps it will
become clearer when you respin with patch 17 split into reviewable
chunks :)

> 1. An XDP program is set on all queues, so to support non-4k AF_XDP
> frames, we would also need to support multiple-packet-per-page XDP for
> regular queues.

Mm.. do you have some materials on how the mlx5 DMA/RX works? I'd think
that if you do single packet per buffer, as long as all packets are
guaranteed to fit in the buffer (based on MRU), the HW shouldn't care
what the size of the buffer is.

> 2. Page allocation in mlx5e perfectly fits page-sized XDP frames. Some
> examples in the code are:
>
> 2.1. mlx5e_free_rx_mpwqe calls a generic mlx5e_page_release to release
> the pages of an MPWQE (multi-packet work queue element), which is
> implemented as xsk_umem_fq_reuse for the case of XSK. We avoid extra
> overhead by using the fact that packet == page.
>
> 2.2. mlx5e_free_xdpsq_desc performs cleanup after XDP transmits. In case
> of XDP_TX, we can free/recycle the pages without having a refcount
> overhead, by using the fact that packet == page.
On 2019-06-15 04:40, Jakub Kicinski wrote:
> On Fri, 14 Jun 2019 13:25:28 +0000, Maxim Mikityanskiy wrote:
>> On 2019-06-13 20:29, Jakub Kicinski wrote:
>>> On Thu, 13 Jun 2019 14:01:39 +0000, Maxim Mikityanskiy wrote:
>>>
>>> Yes, okay, I get that. But I still don't know what's the exact use you
>>> have for AF_XDP buffers being 4k. Could you point us to the place in the
>>> code which relies on all buffers being 4k in any XDP scenario?
>
> Okay, I still don't get it, but thanks for explaining :) Perhaps it will
> become clearer when you respin with patch 17 split into reviewable
> chunks :)

I'm sorry, but as I said above, I don't think splitting it is necessary
or a good thing to do. I used to have it separated, but I squashed the
patches to shorten the series and to avoid having stupid /* TODO:
implement */ comments in empty functions that are implemented in the
next patch. Unsquashing them is going to take more time, which I
unfortunately don't have, as I'm flying to Netconf tomorrow and then
going on vacation. So, I would really like to avoid it unless absolutely
necessary. Moreover, it won't increase readability - you'll have to jump
between the patches to see the complete implementation of a single
function - it's a single feature, after all.

>> 1. An XDP program is set on all queues, so to support non-4k AF_XDP
>> frames, we would also need to support multiple-packet-per-page XDP for
>> regular queues.
>
> Mm.. do you have some materials on how the mlx5 DMA/RX works? I'd think
> that if you do single packet per buffer, as long as all packets are
> guaranteed to fit in the buffer (based on MRU), the HW shouldn't care
> what the size of the buffer is.

It's not related to hardware; it helps get better performance by
utilizing the page pool in the optimal way (without having refcnt == 2
on pages). Maybe Tariq or Saeed could explain it more clearly.

>> 2. Page allocation in mlx5e perfectly fits page-sized XDP frames. Some
>> examples in the code are:
>>
>> 2.1. mlx5e_free_rx_mpwqe calls a generic mlx5e_page_release to release
>> the pages of an MPWQE (multi-packet work queue element), which is
>> implemented as xsk_umem_fq_reuse for the case of XSK. We avoid extra
>> overhead by using the fact that packet == page.
>>
>> 2.2. mlx5e_free_xdpsq_desc performs cleanup after XDP transmits. In case
>> of XDP_TX, we can free/recycle the pages without having a refcount
>> overhead, by using the fact that packet == page.
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index d08ee1ab7bb4..86d173a332c1 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -68,6 +68,7 @@ static int opt_queue;
 static int opt_poll;
 static int opt_interval = 1;
 static u32 opt_xdp_bind_flags;
+static int opt_xsk_frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
 static __u32 prog_id;
 
 struct xsk_umem_info {
@@ -276,6 +277,12 @@ static size_t gen_eth_frame(struct xsk_umem_info *umem, u64 addr)
 static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
 {
 	struct xsk_umem_info *umem;
+	struct xsk_umem_config cfg = {
+		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+		.frame_size = opt_xsk_frame_size,
+		.frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM,
+	};
 	int ret;
 
 	umem = calloc(1, sizeof(*umem));
@@ -283,7 +290,7 @@ static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
 		exit_with_error(errno);
 
 	ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
-			       NULL);
+			       &cfg);
 	if (ret)
 		exit_with_error(-ret);
 
@@ -323,11 +330,9 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
 				     &idx);
 	if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS)
 		exit_with_error(-ret);
-	for (i = 0;
-	     i < XSK_RING_PROD__DEFAULT_NUM_DESCS *
-		     XSK_UMEM__DEFAULT_FRAME_SIZE;
-	     i += XSK_UMEM__DEFAULT_FRAME_SIZE)
-		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = i;
+	for (i = 0; i < XSK_RING_PROD__DEFAULT_NUM_DESCS; i++)
+		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) =
+			i * opt_xsk_frame_size;
 	xsk_ring_prod__submit(&xsk->umem->fq,
 			      XSK_RING_PROD__DEFAULT_NUM_DESCS);
 
@@ -346,6 +351,7 @@ static struct option long_options[] = {
 	{"interval", required_argument, 0, 'n'},
 	{"zero-copy", no_argument, 0, 'z'},
 	{"copy", no_argument, 0, 'c'},
+	{"frame-size", required_argument, 0, 'f'},
 	{0, 0, 0, 0}
 };
 
@@ -365,8 +371,9 @@ static void usage(const char *prog)
 		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
 		"  -z, --zero-copy	Force zero-copy mode.\n"
 		"  -c, --copy		Force copy mode.\n"
+		"  -f, --frame-size=n	Set the frame size (must be a power of two, default is %d).\n"
 		"\n";
-	fprintf(stderr, str, prog);
+	fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
 	exit(EXIT_FAILURE);
 }
 
@@ -377,7 +384,7 @@ static void parse_command_line(int argc, char **argv)
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
+		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:", long_options,
 				&option_index);
 		if (c == -1)
 			break;
@@ -420,6 +427,9 @@ static void parse_command_line(int argc, char **argv)
 		case 'F':
 			opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
 			break;
+		case 'f':
+			opt_xsk_frame_size = atoi(optarg);
+			break;
 		default:
 			usage(basename(argv[0]));
 		}
@@ -432,6 +442,11 @@ static void parse_command_line(int argc, char **argv)
 		usage(basename(argv[0]));
 	}
 
+	if (opt_xsk_frame_size & (opt_xsk_frame_size - 1)) {
+		fprintf(stderr, "--frame-size=%d is not a power of two\n",
+			opt_xsk_frame_size);
+		usage(basename(argv[0]));
+	}
 }
 
 static void kick_tx(struct xsk_socket_info *xsk)
@@ -583,8 +598,7 @@ static void tx_only(struct xsk_socket_info *xsk)
 
 		for (i = 0; i < BATCH_SIZE; i++) {
 			xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr
-				= (frame_nb + i) <<
-				XSK_UMEM__DEFAULT_FRAME_SHIFT;
+				= (frame_nb + i) * opt_xsk_frame_size;
 			xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
 				sizeof(pkt_data) - 1;
 		}
@@ -661,21 +675,19 @@ int main(int argc, char **argv)
 	}
 
 	ret = posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
-			     NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE);
+			     NUM_FRAMES * opt_xsk_frame_size);
 	if (ret)
 		exit_with_error(ret);
 
 	/* Create sockets... */
-	umem = xsk_configure_umem(bufs,
-				  NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE);
+	umem = xsk_configure_umem(bufs, NUM_FRAMES * opt_xsk_frame_size);
 	xsks[num_socks++] = xsk_configure_socket(umem);
 
 	if (opt_bench == BENCH_TXONLY) {
 		int i;
 
-		for (i = 0; i < NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE;
-		     i += XSK_UMEM__DEFAULT_FRAME_SIZE)
-			(void)gen_eth_frame(umem, i);
+		for (i = 0; i < NUM_FRAMES; i++)
+			(void)gen_eth_frame(umem, i * opt_xsk_frame_size);
 	}
 
 	signal(SIGINT, int_exit);
diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index 82ea71a0f3ec..833a6e60d065 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -167,7 +167,7 @@ LIBBPF_API int xsk_socket__fd(const struct xsk_socket *xsk);
 
 #define XSK_RING_CONS__DEFAULT_NUM_DESCS      2048
 #define XSK_RING_PROD__DEFAULT_NUM_DESCS      2048
-#define XSK_UMEM__DEFAULT_FRAME_SHIFT    11 /* 2048 bytes */
+#define XSK_UMEM__DEFAULT_FRAME_SHIFT    12 /* 4096 bytes */
 #define XSK_UMEM__DEFAULT_FRAME_SIZE     (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT)
 #define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0
 