Message ID:   1320859475.3916.21.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC
State:        Accepted, archived
Delegated to: David Miller
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 09 Nov 2011 18:24:35 +0100

> [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
>
> When a socket uses IP_PKTINFO notifications, we currently force a dst
> reference for each received skb. Reader has to access dst to get needed
> information (rt_iif & rt_spec_dst) and must release dst reference.
>
> We also forced a dst reference if skb was put in socket backlog, even
> without IP_PKTINFO handling. This happens under stress/load.
>
> We can instead store the needed information in skb->cb[], so that only
> softirq handler really access dst, improving cache hit ratios.
>
> This removes two atomic operations per packet, and false sharing as
> well.
>
> On a benchmark using a mono threaded receiver (doing only recvmsg()
> calls), I can reach 720.000 pps instead of 570.000 pps.
>
> IP_PKTINFO is typically used by DNS servers, and any multihomed aware
> UDP application.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good, if it compiles I'll push it out to net-next :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday, November 9, 2011 at 16:37 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 09 Nov 2011 18:24:35 +0100
>
> > [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
> > [...]
>
> Looks good, if it compiles I'll push it out to net-next :-)

Arg :( I cross my fingers :)

BTW, on my bnx2x adapter, even small UDP frames use more than PAGE_SIZE
bytes: skb->truesize=4352, len=26 (payload only).

Truesize being more precise now, we hit the shared udp_memory_allocated
counter badly, even with single frames.

I wonder if we shouldn't increase SK_MEM_QUANTUM a bit to avoid the
ping/pong:

-#define SK_MEM_QUANTUM ((int)PAGE_SIZE)
+#define SK_MEM_QUANTUM ((int)PAGE_SIZE * 2)
diff --git a/include/net/ip.h b/include/net/ip.h
index eca0ef7..fd1561e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -450,7 +450,7 @@ extern int ip_options_rcv_srr(struct sk_buff *skb);
 /*
  *	Functions provided by ip_sockglue.c
  */
-extern int	ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+extern void	ipv4_pktinfo_prepare(struct sk_buff *skb);
 extern void	ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb);
 extern int	ip_cmsg_send(struct net *net,
 			     struct msghdr *msg, struct ipcm_cookie *ipc);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 09ff51b..b516030 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -55,20 +55,13 @@
 /*
  *	SOL_IP control messages.
  */
+#define PKTINFO_SKB_CB(__skb) ((struct in_pktinfo *)((__skb)->cb))
 
 static void ip_cmsg_recv_pktinfo(struct msghdr *msg, struct sk_buff *skb)
 {
-	struct in_pktinfo info;
-	struct rtable *rt = skb_rtable(skb);
-
+	struct in_pktinfo info = *PKTINFO_SKB_CB(skb);
+
 	info.ipi_addr.s_addr = ip_hdr(skb)->daddr;
-	if (rt) {
-		info.ipi_ifindex = rt->rt_iif;
-		info.ipi_spec_dst.s_addr = rt->rt_spec_dst;
-	} else {
-		info.ipi_ifindex = 0;
-		info.ipi_spec_dst.s_addr = 0;
-	}
 
 	put_cmsg(msg, SOL_IP, IP_PKTINFO, sizeof(info), &info);
 }
@@ -992,20 +985,28 @@ e_inval:
 }
 
 /**
- * ip_queue_rcv_skb - Queue an skb into sock receive queue
+ * ipv4_pktinfo_prepare - transfert some info from rtable to skb
  * @sk: socket
  * @skb: buffer
  *
- * Queues an skb into socket receive queue. If IP_CMSG_PKTINFO option
- * is not set, we drop skb dst entry now, while dst cache line is hot.
+ * To support IP_CMSG_PKTINFO option, we store rt_iif and rt_spec_dst
+ * in skb->cb[] before dst drop.
+ * This way, receiver doesnt make cache line misses to read rtable.
 */
-int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+void ipv4_pktinfo_prepare(struct sk_buff *skb)
 {
-	if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
-		skb_dst_drop(skb);
-	return sock_queue_rcv_skb(sk, skb);
+	struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);
+	const struct rtable *rt = skb_rtable(skb);
+
+	if (rt) {
+		pktinfo->ipi_ifindex = rt->rt_iif;
+		pktinfo->ipi_spec_dst.s_addr = rt->rt_spec_dst;
+	} else {
+		pktinfo->ipi_ifindex = 0;
+		pktinfo->ipi_spec_dst.s_addr = 0;
+	}
+	skb_dst_drop(skb);
 }
-EXPORT_SYMBOL(ip_queue_rcv_skb);
 
 int ip_setsockopt(struct sock *sk, int level, int optname,
 		char __user *optval, unsigned int optlen)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 007e2eb..7a8410d 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -292,7 +292,8 @@
 static int raw_rcv_skb(struct sock * sk, struct sk_buff * skb)
 {
 	/* Charge it to the socket. */
-	if (ip_queue_rcv_skb(sk, skb) < 0) {
+	ipv4_pktinfo_prepare(skb);
+	if (sock_queue_rcv_skb(sk, skb) < 0) {
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ab0966d..6854f58 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1357,7 +1357,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	if (inet_sk(sk)->inet_daddr)
 		sock_rps_save_rxhash(sk, skb);
-	rc = ip_queue_rcv_skb(sk, skb);
+	rc = sock_queue_rcv_skb(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 
@@ -1473,6 +1473,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	rc = 0;
 
+	ipv4_pktinfo_prepare(skb);
 	bh_lock_sock(sk);
 	if (!sock_owned_by_user(sk))
 		rc = __udp_queue_rcv_skb(sk, skb);
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 331af3b..204f2e8 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -383,7 +383,8 @@ static inline int rawv6_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	}
 
 	/* Charge it to the socket. */
-	if (ip_queue_rcv_skb(sk, skb) < 0) {
+	skb_dst_drop(skb);
+	if (sock_queue_rcv_skb(sk, skb) < 0) {
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 846f475..b4a4a15 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -538,7 +538,9 @@ int udpv6_queue_rcv_skb(struct sock * sk, struct sk_buff *skb)
 		goto drop;
 	}
 
-	if ((rc = ip_queue_rcv_skb(sk, skb)) < 0) {
+	skb_dst_drop(skb);
+	rc = sock_queue_rcv_skb(sk, skb);
+	if (rc < 0) {
 		/* Note that an ENOMEM error is charged twice */
 		if (rc == -ENOMEM)
 			UDP6_INC_STATS_BH(sock_net(sk),