| Message ID | e17f52a6-b2ad-6b39-a655-2e8779a5d192@mojatatu.com |
|---|---|
| State | RFC, archived |
| Delegated to | David Miller |
On Sat, 2017-04-15 at 13:07 -0400, Jamal Hadi Salim wrote:
> Eric,
>
> How does attached look instead of the 32K?
> I found it helps to let user space suggest something
> larger.
>
> cheers,
> jamal

Looks dangerous to me, for various reasons.

1) Memory allocations might not like it

Have you tried your change after user does a
setsockopt(SO_RCVBUFFORCE, 256 Mbytes), and a
recvmsg( .. 64 Mbytes) ?

Presumably, we could replace 32768 by (PAGE_SIZE <<
PAGE_ALLOC_COSTLY_ORDER), but this will not matter on x86.

2) We might have paths in the kernel filling a potential big skb without
yielding cpu or a spinlock or a mutex. -> latency source.

What perf numbers do you have, using 1MB buffers instead of 32KB ?

The syscall overhead seems tiny compared to the actual cost of filling
the netlink message, accessing thousands of cache lines all over the
place.
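To make the failure mode concrete, here is a minimal userspace sketch of the test Eric describes. It is assumed code, not from the thread: the RTM_GETLINK dump is only a convenient stand-in for the action dump under discussion, error handling is abbreviated, and SO_RCVBUFFORCE requires CAP_NET_ADMIN.

```c
/* Sketch of Eric's suggested stress test: force a huge receive buffer,
 * then recvmsg() with a very large user buffer while a dump is pending. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	int rcvbuf = 256 * 1024 * 1024;          /* 256 Mbytes, as in the mail */

	if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE,
				 &rcvbuf, sizeof(rcvbuf)) < 0) {
		perror("socket/setsockopt");
		return 1;
	}

	/* Ask for a dump so there is something to receive; RTM_GETLINK is a
	 * stand-in here for whatever dump is being exercised. */
	struct {
		struct nlmsghdr nlh;
		struct rtgenmsg g;
	} req = {
		.nlh = { .nlmsg_len = NLMSG_LENGTH(sizeof(struct rtgenmsg)),
			 .nlmsg_type = RTM_GETLINK,
			 .nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP },
		.g = { .rtgen_family = AF_UNSPEC },
	};
	send(fd, &req, req.nlh.nlmsg_len, 0);

	/* recvmsg() with a 64 Mbyte buffer, the other half of Eric's test. */
	size_t buflen = 64 * 1024 * 1024;
	char *buf = malloc(buflen);
	struct iovec iov = { .iov_base = buf, .iov_len = buflen };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
	ssize_t n = recvmsg(fd, &msg, 0);

	printf("received %zd bytes\n", n);
	free(buf);
	close(fd);
	return 0;
}
```

With the proposed patch, a recvmsg() this large ratchets nlk->max_recvmsg_len up accordingly, so subsequent dump iterations attempt correspondingly large kernel allocations - the behavior Eric's point 1) is questioning.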
On 17-04-15 11:08 PM, Eric Dumazet wrote:
> On Sat, 2017-04-15 at 13:07 -0400, Jamal Hadi Salim wrote:
>> Eric,
>>
>> How does attached look instead of the 32K?
>> I found it helps to let user space suggest something
>> larger.
>>
>> cheers,
>> jamal
>
> Looks dangerous to me, for various reasons.
>
> 1) Memory allocations might not like it
>
> Have you tried your change after user does a
> setsockopt(SO_RCVBUFFORCE, 256 Mbytes), and a
> recvmsg( .. 64 Mbytes) ?
>
> Presumably, we could replace 32768 by (PAGE_SIZE <<
> PAGE_ALLOC_COSTLY_ORDER), but this will not matter on x86.
>

For my use case I don't need to go that high, but I can see the
plausibility that someone else will. Is there a reasonably
large number other than 32K? 128K-512K would be more than sufficient.

> 2) We might have paths in the kernel filling a potential big skb without
> yielding cpu or a spinlock or a mutex. -> latency source.
>
>
> What perf numbers do you have, using 1MB buffers instead of 32KB ?
>
> The syscall overhead seems tiny compared to the actual cost of filling
> the netlink message, accessing thousands of cache lines all over the
> place.
>

The syscall is affecting me - but I have only compared with limited
traffic running at the same time as dumping. The more I can batch,
the sooner I can stop polluting the cache.

The tests I have done are with a default socket buffer of 4M
and, say, recvmsg(... 128K). I don't need to go higher
than 256-512K to achieve my goals.
With the default of 32K I can fit about 250-260 actions in one batch.
With 128K I can fit 4x that.
It takes about 1.5 minutes for one process to dump 1M actions
on my laptop (Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz) with
32K; 25% of that time with 128K. tc is single threaded, so I can
keep one cpu busy 100% while I dump, which means the latency fear
is lowered.

My eventual need: to dump all relevant stats every 5 seconds.
I will send the other patch I talked about, which filters based
on time; that helps in most cases but not always.

I am also now thinking of adding "a range index filter" and then
multi-threading several parallel requests, one for each range of
indices.

cheers,
jamal
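For reference, the receive side of the dump loop Jamal is timing looks roughly like the sketch below. This is assumed code, not from the thread: the dump request itself (RTM_GETACTION for tc actions) is taken as already sent on fd, since only the recvmsg() buffer size matters for the batching being measured.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Drain one netlink dump, assuming the NLM_F_DUMP request was already sent
 * on fd. Each recvmsg() returns as many batched messages as fit in buf, so
 * going from a 32K to a 128K buffer cuts the syscall count roughly 4x,
 * matching the numbers in the mail above. */
static int drain_dump(int fd)
{
	char buf[128 * 1024];       /* the 128K recvmsg() size from the mail */
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
	int msgs = 0;

	for (;;) {
		ssize_t n = recvmsg(fd, &msg, 0);

		if (n <= 0)
			return -1;

		for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		     NLMSG_OK(nlh, n);
		     nlh = NLMSG_NEXT(nlh, n)) {
			if (nlh->nlmsg_type == NLMSG_DONE) {
				printf("dump complete: %d messages\n", msgs);
				return 0;
			}
			if (nlh->nlmsg_type == NLMSG_ERROR)
				return -1;
			/* each message may carry many objects, e.g. a
			 * batch of tc actions */
			msgs++;
		}
	}
}
```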
On Sun, 16 Apr 2017 09:03:08 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> On 17-04-15 11:08 PM, Eric Dumazet wrote:
> > On Sat, 2017-04-15 at 13:07 -0400, Jamal Hadi Salim wrote:
> >> Eric,
> >>
> >> How does attached look instead of the 32K?
> >> I found it helps to let user space suggest something
> >> larger.
> >>
> >> cheers,
> >> jamal
> >
> > Looks dangerous to me, for various reasons.
> >
> > 1) Memory allocations might not like it
> >
> > Have you tried your change after user does a
> > setsockopt(SO_RCVBUFFORCE, 256 Mbytes), and a
> > recvmsg( .. 64 Mbytes) ?
> >
> > Presumably, we could replace 32768 by (PAGE_SIZE <<
> > PAGE_ALLOC_COSTLY_ORDER), but this will not matter on x86.
> >
>
> For my use case I don't need to go that high, but I can see the
> plausibility that someone else will. Is there a reasonably
> large number other than 32K? 128K-512K would be more than sufficient.

It was common with routing daemons to set SO_RCVBUF to very large
values to avoid losing notifications.

> > 2) We might have paths in the kernel filling a potential big skb without
> > yielding cpu or a spinlock or a mutex. -> latency source.
> >
> >
> > What perf numbers do you have, using 1MB buffers instead of 32KB ?
> >
> > The syscall overhead seems tiny compared to the actual cost of filling
> > the netlink message, accessing thousands of cache lines all over the
> > place.
> >
>
> The syscall is affecting me - but I have only compared with limited
> traffic running at the same time as dumping. The more I can batch,
> the sooner I can stop polluting the cache.
>
> The tests I have done are with a default socket buffer of 4M
> and, say, recvmsg(... 128K). I don't need to go higher
> than 256-512K to achieve my goals.
> With the default of 32K I can fit about 250-260 actions in one batch.
> With 128K I can fit 4x that.
> It takes about 1.5 minutes for one process to dump 1M actions
> on my laptop (Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz) with
> 32K; 25% of that time with 128K. tc is single threaded, so I can
> keep one cpu busy 100% while I dump, which means the latency fear
> is lowered.
>
> My eventual need: to dump all relevant stats every 5 seconds.
> I will send the other patch I talked about, which filters based
> on time; that helps in most cases but not always.
>
> I am also now thinking of adding "a range index filter" and then
> multi-threading several parallel requests, one for each range of
> indices.
>
> cheers,
> jamal
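The routing-daemon pattern referred to here looks roughly like the sketch below. The 8 MB size and the multicast groups are illustrative choices, not from the thread, and error checks are omitted.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Open a NETLINK_ROUTE monitor socket the way routing daemons typically
 * do: subscribe to notification groups with a large SO_RCVBUF so bursts
 * of events queue up instead of being dropped. Values above
 * net.core.rmem_max are clamped unless SO_RCVBUFFORCE is used. */
int open_monitor(void)
{
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	int rcvbuf = 8 * 1024 * 1024;     /* illustrative; a tuning knob */
	struct sockaddr_nl addr = {
		.nl_family = AF_NETLINK,
		.nl_groups = RTMGRP_LINK | RTMGRP_IPV4_ROUTE, /* example groups */
	};

	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	return fd;
}

/* Even a large buffer can overflow: ENOBUFS from recvmsg() means the
 * kernel dropped notifications and the daemon must re-dump full state. */
void handle_recv_error(int err)
{
	if (err == ENOBUFS)
		fprintf(stderr, "notifications lost, resyncing\n");
}
```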
```
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7b73c7c..bc982ef 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1849,7 +1849,7 @@ static int netlink_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 	/* Record the max length of recvmsg() calls for future allocations */
 	nlk->max_recvmsg_len = max(nlk->max_recvmsg_len, len);
 	nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len,
-				     SKB_WITH_OVERHEAD(32768));
+				     sk->sk_rcvbuf / 4);
 
 	copied = data_skb->len;
 	if (len < copied) {
```
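For a sense of scale, a rough worked example of the new cap (my numbers, not from the thread; it relies on the kernel storing roughly double the value userspace passes to setsockopt(SO_RCVBUF)):

```c
#include <stdio.h>

int main(void)
{
	/* Illustrative arithmetic only. */
	int requested = 4 * 1024 * 1024;   /* hypothetical setsockopt(SO_RCVBUF) value */
	int sk_rcvbuf = 2 * requested;     /* what the kernel records: 8 MB   */
	int cap = sk_rcvbuf / 4;           /* patched cap on dump allocations */

	printf("max_recvmsg_len capped at %d bytes (old fixed cap: ~32K)\n", cap);
	return 0;
}
```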