Message ID | 5705F759.9020003@huawei.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
On Thu, 2016-04-07 at 13:59 +0800, Yang Yingliang wrote:
>
> On 2016/3/30 21:47, Eric Dumazet wrote:
> > On Wed, 2016-03-30 at 13:56 +0800, Yang Yingliang wrote:
> >
> >> Sorry, I made a mistake. I am very sure my kernel has these two patches.
> >> And I can get some dropping of the packets in 10Gb eth.
> >>
> >> # netstat -s | grep -i backlog
> >>     TCPBacklogDrop: 4135
> >> # netstat -s | grep -i backlog
> >>     TCPBacklogDrop: 4167
> >
> > Sender will retransmit and the receiver backlog will likely be emptied
> > before the packets arrive again.
> >
> > Are you sure these are TCP drops ?
> Yes.
>
> >
> > Which 10Gb NIC is it ? (ethtool -i eth0)
> The NIC driver is not upstream. And my system is arm64.
>
> >
> > What is the max size of sendmsg() chunks are generated by your apps ?
> 256KB
>
> >
> > Are they forcing small SO_RCVBUF or SO_SNDBUF ?
> I am not sure.
> I add some debug message in kernel:
> [2016-04-06 10:56:55][ 1365.477140] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12402232 rmem_alloc:0 truesize:53320
> [2016-04-06 10:56:55][ 1365.477170] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12460884 rmem_alloc:55986 truesize:58652
> [2016-04-06 10:56:55][ 1365.477192] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12506206 rmem_alloc:0 truesize:45322
> [2016-04-06 10:56:55][ 1365.477226] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12519536 rmem_alloc:7998 truesize:13330
> [2016-04-06 10:56:55][ 1365.477254] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12575522 rmem_alloc:0 truesize:55986
> [2016-04-06 10:56:55][ 1365.477282] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12634174 rmem_alloc:0 truesize:58652
> [2016-04-06 10:56:55][ 1365.477301] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12634174 rmem_alloc:26660 truesize:31992
> [2016-04-06 10:56:55][ 1365.477321] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12634174 rmem_alloc:58652 truesize:26660
> [2016-04-06 10:56:55][ 1365.477341] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12634174 rmem_alloc:58652 truesize:42656
> [2016-04-06 10:56:55][ 1365.477384] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12634174 rmem_alloc:0 truesize:58652
> [2016-04-06 10:56:55][ 1365.477403] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12634174 rmem_alloc:0 truesize:34658
>
> >
> > What percentage of drops do you have ?
> netstat -s | grep -i TCPBacklogDrop increases 20-40 per second.
> It's about 1.2% (117724(TCPBacklogDrop)/214502873(InSegs of cat
> /proc/net/snmp)).
>
> >
> > Here (at Google), we have less than one backlog drop per billion
> > packets, on host facing the public Internet.
> >
> > If a TCP sender sends a burst of tiny packets because it is misbehaving,
> > you absolutely will drop packets, especially if applications use
> > sendmsg() with very big lengths and big SO_SNDBUF.
> >
> > Trying to not drop these hostile packets as you did is simply opening
> > your host to DOS attacks.
> >
> > Eventually, we should even drop earlier in TCP stack (before taking
> > socket lock).
> >
> >
> How about expand the buffer like:

Please do not send patches before really understanding the issue you
have.

Having a backlog of 12506206 bytes is ridiculous. Dropping packets is
absolutely fine if this ever happens.

Something is really wrong on your host, or the sender simply does not
comply with TCP protocol (not caring of receiver window at all)

Since you added a trace of truesize, please also trace skb->len

On Thu, 2016-04-07 at 03:21 -0700, Eric Dumazet wrote:
> Please do not send patches before really understanding the issue you
> have.
>
> Having a backlog of 12506206 bytes is ridiculous. Dropping packets is
> absolutely fine if this ever happens.
>
> Something is really wrong on your host, or the sender simply does not
> comply with TCP protocol (not caring of receiver window at all)
>
> Since you added a trace of truesize, please also trace skb->len
>

BTW, have you played with /proc/sys/net/ipv4/tcp_adv_win_scale ?

On 2016/4/7 22:51, Eric Dumazet wrote:
> On Thu, 2016-04-07 at 03:21 -0700, Eric Dumazet wrote:
>
>> Please do not send patches before really understanding the issue you
>> have.
>>
>> Having a backlog of 12506206 bytes is ridiculous. Dropping packets is
>> absolutely fine if this ever happens.
>>
>> Something is really wrong on your host, or the sender simply does not
>> comply with TCP protocol (not caring of receiver window at all)
>>
>> Since you added a trace of truesize, please also trace skb->len
>>

[2016-04-08 18:33:39][ 9748.726948] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:31992, len:17540
[2016-04-08 18:33:39][ 9748.726964] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:29326, truesize:18662, len:10240
[2016-04-08 18:33:39][ 9748.726986] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:39990, len:21920
[2016-04-08 18:33:39][ 9748.727028] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.727068] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.727082] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:21328, truesize:5332, len:2940
[2016-04-08 18:33:39][ 9748.727310] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:53320, len:29220
[2016-04-08 18:33:39][ 9748.727326] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:26660, truesize:7998, len:4400
[2016-04-08 18:33:39][ 9748.727352] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:47988, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.727389] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:39990, len:21920
[2016-04-08 18:33:39][ 9748.727409] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:12607514 rmem_alloc:58652, truesize:18662, len:10240

If I expand buffer 5 times((sndbuf+rcvbuf)*5). There are only 5M data in
backlog at most.

[2016-04-08 18:33:39][ 9748.777743] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:5435954 rmem_alloc:0, truesize:55986, len:30680
[2016-04-08 18:33:39][ 9748.777762] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:5457282 rmem_alloc:58652, truesize:21328, len:11700
[2016-04-08 18:33:39][ 9748.777804] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:5515934 rmem_alloc:55986, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.777818] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:5537262 rmem_alloc:0, truesize:21328, len:11700
[2016-04-08 18:33:39][ 9748.777839] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:5574586 rmem_alloc:0, truesize:37324, len:20460
[2016-04-08 18:33:39][ 9748.777854] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:5601246 rmem_alloc:58652, truesize:26660, len:14620
[2016-04-08 18:33:39][ 9748.777881] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:5659898 rmem_alloc:21328, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.777894] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:5675894 rmem_alloc:37324, truesize:15996, len:8780
[2016-04-08 18:33:39][ 9748.778047] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:58652 rmem_alloc:0, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.778075] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:117304 rmem_alloc:0, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.778084] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:122636 rmem_alloc:0, truesize:5332, len:2940
[2016-04-08 18:33:39][ 9748.778109] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:175956 rmem_alloc:0, truesize:53320, len:29220
[2016-04-08 18:33:39][ 9748.778156] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:234608 rmem_alloc:0, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.778178] TCP: rcvbuf:10485760 sndbuf:2097152 limit:12582912 backloglen:282596 rmem_alloc:58652, truesize:47988, len:26300

>
> BTW, have you played with /proc/sys/net/ipv4/tcp_adv_win_scale ?
>

I expand tcp_adv_win_scale and tcp_rmem. It has no effect.

On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
> I expand tcp_adv_win_scale and tcp_rmem. It has no effect.
Try :
echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
And restart your flows.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 08 Apr 2016 07:44:25 -0700

> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>
>> I expand tcp_adv_win_scale and tcp_rmem. It has no effect.
>
> Try :
>
> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>
> And restart your flows.

I'm honestly beginning to suspect a bug in their driver and how they
handle skb->truesize.

Yang, until you show us the driver you are using and how it handles
receive packets, we are largely in the dark about a major component
of this issue and that is entirely unfair to us.

On Fri, 2016-04-08 at 12:53 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 08 Apr 2016 07:44:25 -0700
>
> > On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
> >
> >> I expand tcp_adv_win_scale and tcp_rmem. It has no effect.
> >
> > Try :
> >
> > echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
> >
> > And restart your flows.
>
> I'm honestly beginning to suspect a bug in their driver and how they
> handle skb->truesize.
>
> Yang, until you show us the driver you are using and how it handles
> receive packets, we are largely in the dark about a major component
> of this issue and that is entirely unfair to us.

Apparently their skb->truesize and skb->len combinations are correct.

I suspect an issue with rcvbuf autotuning on a bidirectional tcp traffic.
We mostly focus on unidirectional flows, but they seem to use a mixed
case.

Also, the fact that sendmsg() locks the socket for the duration of the call
is problematic : I suspect their issues would mostly disappear by using
smaller chunk sizes (ie 64KB per sendmsg() instead of 256KB).

We also could add resched points in sendmsg() (processing backlog if it
gets too hot), but I fear this would slow down the fast path.

On 2016/4/8 22:44, Eric Dumazet wrote:
> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>
>> I expand tcp_adv_win_scale and tcp_rmem. It has no effect.
>
> Try :
>
> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>
> And restart your flows.
>

cat /proc/sys/net/ipv4/tcp_rmem
10240 2097152 10485760

echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale

It seems has not effect.

On Mon, 2016-04-11 at 19:57 +0800, Yang Yingliang wrote:
>
> On 2016/4/8 22:44, Eric Dumazet wrote:
> > On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
> >
> >> I expand tcp_adv_win_scale and tcp_rmem. It has no effect.
> >
> > Try :
> >
> > echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
> >
> > And restart your flows.
> >
> cat /proc/sys/net/ipv4/tcp_rmem
> 10240 2097152 10485760

What about leaving the default values ?

$ cat /proc/sys/net/ipv4/tcp_rmem
4096 87380 6291456

>
> echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>
> It seems has not effect.
>

I have no idea what you did on the sender side to allow it to send more
than 1.5 MB then.

On 2016/4/9 1:04, Eric Dumazet wrote:
> On Fri, 2016-04-08 at 12:53 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Fri, 08 Apr 2016 07:44:25 -0700
>>
>>> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>>>
>>>> I expand tcp_adv_win_scale and tcp_rmem. It has no effect.
>>>
>>> Try :
>>>
>>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>>
>>> And restart your flows.
>>
>> I'm honestly beginning to suspect a bug in their driver and how they
>> handle skb->truesize.
>>
>> Yang, until you show us the driver you are using and how it handles
>> receive packets, we are largely in the dark about a major component
>> of this issue and that is entirely unfair to us.
>
> Apparently their skb->truesize and skb->len combinations are correct.
>
> I suspect an issue with rcvbuf autotuning on a bidirectional tcp traffic.
> We mostly focus on unidirectional flows, but they seem to use a mixed
> case.
>
> Also, fact that sendmsg() locks the socket for the duration of the call
> is problematic : I suspect their issues would mostly disappear by using
> smaller chunk sizes (ie 64KB per sendmsg() instead of 256KB).

It's less packets dropping with using 64KB chunk.

>
> We also could add resched points in sendmsg() (processing backlog if it
> gets too hot), but I fear this would slow down the fast path.
>

On 2016/4/11 20:13, Eric Dumazet wrote:
> On Mon, 2016-04-11 at 19:57 +0800, Yang Yingliang wrote:
>>
>> On 2016/4/8 22:44, Eric Dumazet wrote:
>>> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>>>
>>>> I expand tcp_adv_win_scale and tcp_rmem. It has no effect.
>>>
>>> Try :
>>>
>>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>>
>>> And restart your flows.
>>>
>> cat /proc/sys/net/ipv4/tcp_rmem
>> 10240 2097152 10485760
>
> What about leaving the default values ?

I tried, it did not work.

>
> $ cat /proc/sys/net/ipv4/tcp_rmem
> 4096 87380 6291456
>
>>
>> echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>
>> It seems has not effect.
>>
>
> I have no idea what you did on the sender side to allow it to send more
> than 1.5 MB then.

We are doing performance test. The sender send 256KB per-block with 128
threads to one socket. And the receiver uses 10Gb NIC to handle the
data on ARM64. The data flow is driver->ip layer->tcp layer->iscsi.

I added some debug messages and found handling backlog packets in
__release_sock() cost about 11ms at most. This can cause backlog queue
overflow. The sk_data_ready is re-assigned, it may cost time in our
program. I will check it out.

On 2016/4/12 10:59, Yang Yingliang wrote:
>
>
> On 2016/4/11 20:13, Eric Dumazet wrote:
>> On Mon, 2016-04-11 at 19:57 +0800, Yang Yingliang wrote:
>>>
>>> On 2016/4/8 22:44, Eric Dumazet wrote:
>>>> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>>>>
>>>>> I expand tcp_adv_win_scale and tcp_rmem. It has no effect.
>>>>
>>>> Try :
>>>>
>>>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>>>
>>>> And restart your flows.
>>>>
>>> cat /proc/sys/net/ipv4/tcp_rmem
>>> 10240 2097152 10485760
>>
>> What about leaving the default values ?
> I tried, it did not work.
>
>>
>> $ cat /proc/sys/net/ipv4/tcp_rmem
>> 4096 87380 6291456
>>
>>>
>>> echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
>>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>>
>>> It seems has not effect.
>>>
>>
>> I have no idea what you did on the sender side to allow it to send more
>> than 1.5 MB then.
>
> We are doing performance test. The sender send 256KB per-block with 128
> threads to one socket. And the receiver uses 10Gb NIC to handle the
> data on ARM64. The data flow is driver->ip layer->tcp layer->iscsi.
>
> I added some debug messages and found handling backlog packets in
> __release_sock() cost about 11ms at most. This can cause backlog queue
> overflow. The sk_data_ready is re-assigned, it may cost time in our
> program. I will check it out.
>

I traced the cost cycles of handling backlog packets in
__release_sock().
16.97 ms to handling about 12MB backlog packets, of which 13.66ms to do
sk_data_ready.
The speed of handling packets in TCP is 5.65Gb/s which is smaller than
the NIC's bandwidth. So the packets will be dropped.

If the cost of sk_data_ready cannot be reduced, do we have other choice
exclude dropping packets ?

On Tue, 2016-04-12 at 20:31 +0800, Yang Yingliang wrote:

> I traced the cost cycles of handling backlog packets in
> __release_sock().
> 16.97 ms to handling about 12MB backlog packets, of which 13.66ms to do
> sk_data_ready.
> The speed of handling packets in TCP is 5.65Gb/s which is smaller than
> the NIC's bandwidth. So the packets will be dropped.
>
> If the cost of sk_data_ready cannot be reduced, do we have other choice
> exclude dropping packets ?

Normally, TCP stack sends ACK packets with appropriate RWIN.

Sender should not send more packets than allowed in RWIN, even if there
are 128 threads using one TCP socket, it does not matter.

Imagine you do not have a backlog problem (nothing does the sendmsg()
while you receive data), and nothing reads the socket.

Then the receiver should eventually send WIN 0 back to the sender and
sender should stop, before any drop can possibly happen.

I have no problem receiving one TCP flow at 34Gbit, so it must be
something related to the huge windows you seem to use.

One possibility could be to tweak in ACK packets a reduced rwin so that
the sender is not allowed to continue the flood while we are painfully
processing a huge backlog.

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6d204f3..da1bc16 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -281,6 +281,7 @@ extern unsigned int sysctl_tcp_notsent_lowat;
 extern int sysctl_tcp_min_tso_segs;
 extern int sysctl_tcp_autocorking;
 extern int sysctl_tcp_invalid_ratelimit;
+extern int sysctl_tcp_backlog_buf_multi;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index f0e8297..9511410 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -631,6 +631,13 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "tcp_backlog_buf_multi",
+		.data		= &sysctl_tcp_backlog_buf_multi,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #ifdef CONFIG_NETLABEL
 	{
 		.procname	= "cipso_cache_enable",
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 87463c8..337ad55 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -101,6 +101,8 @@ int sysctl_tcp_thin_dupack __read_mostly;
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_early_retrans __read_mostly = 3;
 int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
+int sysctl_tcp_backlog_buf_multi __read_mostly = 1;
+EXPORT_SYMBOL(sysctl_tcp_backlog_buf_multi);
 
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 13b92d5..39272f3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1635,7 +1635,8 @@ process:
 		if (!tcp_prequeue(sk, skb))
 			ret = tcp_v4_do_rcv(sk, skb);
 	} else if (unlikely(sk_add_backlog(sk, skb,
-					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
+					   (sk->sk_rcvbuf + sk->sk_sndbuf) *
+					   sysctl_tcp_backlog_buf_multi))) {
 		bh_unlock_sock(sk);
 		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
 		goto discard_and_relse;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c1147ac..1e8f709 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1433,7 +1433,8 @@ process:
 		if (!tcp_prequeue(sk, skb))
 			ret = tcp_v6_do_rcv(sk, skb);
 	} else if (unlikely(sk_add_backlog(sk, skb,
-					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
+					   (sk->sk_rcvbuf + sk->sk_sndbuf) *
+					   sysctl_tcp_backlog_buf_multi))) {
 		bh_unlock_sock(sk);
 		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
 		goto discard_and_relse;