
[RFC] net: decrease the length of backlog queue immediately after it's detached from sk

Message ID 5705F759.9020003@huawei.com
State RFC, archived
Delegated to: David Miller

Commit Message

Yang Yingliang April 7, 2016, 5:59 a.m. UTC
On 2016/3/30 21:47, Eric Dumazet wrote:
> On Wed, 2016-03-30 at 13:56 +0800, Yang Yingliang wrote:
>
>> Sorry, I made a mistake. I am very sure my kernel has these two patches.
>> And I can see some packet drops on the 10Gb eth.
>>
>> # netstat -s | grep -i backlog
>>       TCPBacklogDrop: 4135
>> # netstat -s | grep -i backlog
>>       TCPBacklogDrop: 4167
>
> Sender will retransmit and the receiver backlog will likely be emptied
> before the packets arrive again.
>
> Are you sure these are TCP drops ?
Yes.

>
> Which 10Gb NIC is it ? (ethtool -i eth0)
The NIC driver is not upstream. And my system is arm64.

>
> What is the max size of the sendmsg() chunks generated by your apps ?
256KB

>
> Are they forcing small SO_RCVBUF or SO_SNDBUF ?
I am not sure.
I added some debug messages in the kernel:
[2016-04-06 10:56:55][ 1365.477140] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12402232 rmem_alloc:0 truesize:53320
[2016-04-06 10:56:55][ 1365.477170] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12460884 rmem_alloc:55986 truesize:58652
[2016-04-06 10:56:55][ 1365.477192] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12506206 rmem_alloc:0 truesize:45322
[2016-04-06 10:56:55][ 1365.477226] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12519536 rmem_alloc:7998 truesize:13330
[2016-04-06 10:56:55][ 1365.477254] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12575522 rmem_alloc:0 truesize:55986
[2016-04-06 10:56:55][ 1365.477282] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12634174 rmem_alloc:0 truesize:58652
[2016-04-06 10:56:55][ 1365.477301] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12634174 rmem_alloc:26660 truesize:31992
[2016-04-06 10:56:55][ 1365.477321] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12634174 rmem_alloc:58652 truesize:26660
[2016-04-06 10:56:55][ 1365.477341] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12634174 rmem_alloc:58652 truesize:42656
[2016-04-06 10:56:55][ 1365.477384] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12634174 rmem_alloc:0 truesize:58652
[2016-04-06 10:56:55][ 1365.477403] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12634174 rmem_alloc:0 truesize:34658

>
> What percentage of drops do you have ?
The TCPBacklogDrop counter (netstat -s | grep -i TCPBacklogDrop)
increases by 20-40 per second.
It's about 0.055% (117724 (TCPBacklogDrop) / 214502873 (InSegs in cat
/proc/net/snmp)).

>
> Here (at Google), we have less than one backlog drop per billion
> packets, on hosts facing the public Internet.
>
> If a TCP sender sends a burst of tiny packets because it is misbehaving,
> you absolutely will drop packets, especially if applications use
> sendmsg() with very big lengths and big SO_SNDBUF.
>
> Trying to not drop these hostile packets as you did is simply opening
> your host to DoS attacks.
>
> Eventually, we should even drop earlier in TCP stack (before taking
> socket lock).
>
>
How about expanding the buffer like this:

--

Comments

Eric Dumazet April 7, 2016, 10:21 a.m. UTC | #1
On Thu, 2016-04-07 at 13:59 +0800, Yang Yingliang wrote:
> 
> On 2016/3/30 21:47, Eric Dumazet wrote:
> > On Wed, 2016-03-30 at 13:56 +0800, Yang Yingliang wrote:
> >
> >> Sorry, I made a mistake. I am very sure my kernel has these two patches.
> >> And I can see some packet drops on the 10Gb eth.
> >>
> >> # netstat -s | grep -i backlog
> >>       TCPBacklogDrop: 4135
> >> # netstat -s | grep -i backlog
> >>       TCPBacklogDrop: 4167
> >
> > Sender will retransmit and the receiver backlog will likely be emptied
> > before the packets arrive again.
> >
> > Are you sure these are TCP drops ?
> Yes.
> 
> >
> > Which 10Gb NIC is it ? (ethtool -i eth0)
> The NIC driver is not upstream. And my system is arm64.
> 
> >
> > What is the max size of the sendmsg() chunks generated by your apps ?
> 256KB
> 
> >
> > Are they forcing small SO_RCVBUF or SO_SNDBUF ?
> I am not sure.
> I added some debug messages in the kernel:
> [2016-04-06 10:56:55][ 1365.477140] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12402232 rmem_alloc:0 truesize:53320
> [2016-04-06 10:56:55][ 1365.477170] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12460884 rmem_alloc:55986 truesize:58652
> [2016-04-06 10:56:55][ 1365.477192] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12506206 rmem_alloc:0 truesize:45322
> [2016-04-06 10:56:55][ 1365.477226] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12519536 rmem_alloc:7998 truesize:13330
> [2016-04-06 10:56:55][ 1365.477254] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12575522 rmem_alloc:0 truesize:55986
> [2016-04-06 10:56:55][ 1365.477282] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12634174 rmem_alloc:0 truesize:58652
> [2016-04-06 10:56:55][ 1365.477301] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12634174 rmem_alloc:26660 truesize:31992
> [2016-04-06 10:56:55][ 1365.477321] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12634174 rmem_alloc:58652 truesize:26660
> [2016-04-06 10:56:55][ 1365.477341] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12634174 rmem_alloc:58652 truesize:42656
> [2016-04-06 10:56:55][ 1365.477384] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12634174 rmem_alloc:0 truesize:58652
> [2016-04-06 10:56:55][ 1365.477403] TCP: rcvbuf:10485760 sndbuf:2097152 
> limit:12582912 backloglen:12634174 rmem_alloc:0 truesize:34658
> 
> >
> > What percentage of drops do you have ?
> The TCPBacklogDrop counter (netstat -s | grep -i TCPBacklogDrop)
> increases by 20-40 per second.
> It's about 0.055% (117724 (TCPBacklogDrop) / 214502873 (InSegs in cat
> /proc/net/snmp)).
> 
> >
> > Here (at Google), we have less than one backlog drop per billion
> > packets, on hosts facing the public Internet.
> >
> > If a TCP sender sends a burst of tiny packets because it is misbehaving,
> > you absolutely will drop packets, especially if applications use
> > sendmsg() with very big lengths and big SO_SNDBUF.
> >
> > Trying to not drop these hostile packets as you did is simply opening
> > your host to DoS attacks.
> >
> > Eventually, we should even drop earlier in TCP stack (before taking
> > socket lock).
> >
> >
> How about expanding the buffer like this:

Please do not send patches before really understanding the issue you
have.

Having a backlog of 12506206 bytes is ridiculous. Dropping packets is
absolutely fine if this ever happens.

Something is really wrong on your host, or the sender simply does not
comply with the TCP protocol (not caring about the receiver window at all).

Since you added a trace of truesize, please also trace skb->len
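
For reference, a minimal sketch of the kind of debug trace used in this
thread (the hook point, function name and format string are our
assumptions, not the actual debug patch). It would be called from
tcp_v4_rcv() just before sk_add_backlog(); note that the "limit:" field
in the traces equals sk_rcvbuf + sk_sndbuf = 10485760 + 2097152 =
12582912:

#include <linux/skbuff.h>
#include <net/sock.h>

static void tcp_backlog_trace(const struct sock *sk,
			      const struct sk_buff *skb,
			      unsigned int limit)
{
	/* Mirrors the fields shown in the traces quoted in this thread */
	pr_info("TCP: rcvbuf:%d sndbuf:%d limit:%u backloglen:%d rmem_alloc:%d, truesize:%u, len:%u\n",
		sk->sk_rcvbuf, sk->sk_sndbuf, limit,
		sk->sk_backlog.len,
		atomic_read(&sk->sk_rmem_alloc),
		skb->truesize, skb->len);
}
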
Eric Dumazet April 7, 2016, 2:51 p.m. UTC | #2
On Thu, 2016-04-07 at 03:21 -0700, Eric Dumazet wrote:

> Please do not send patches before really understanding the issue you
> have.
> 
> Having a backlog of 12506206 bytes is ridiculous. Dropping packets is
> absolutely fine if this ever happens.
> 
> Something is really wrong on your host, or the sender simply does not
> comply with the TCP protocol (not caring about the receiver window at all).
> 
> Since you added a trace of truesize, please also trace skb->len
> 

BTW, have you played with /proc/sys/net/ipv4/tcp_adv_win_scale ?
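
For context, tcp_adv_win_scale controls how much of the receive-buffer
space is advertised as TCP window versus reserved for skb->truesize
overhead. In kernels of this era the computation was essentially the
following (quoted from memory from include/net/tcp.h):

extern int sysctl_tcp_adv_win_scale;	/* /proc/sys/net/ipv4/tcp_adv_win_scale */

static inline int tcp_win_from_space(int space)
{
	/* scale <= 0: advertise space >> -scale, so the -2 suggested
	 * below advertises only a quarter of the buffer; scale > 0:
	 * advertise space minus a 1/2^scale fraction kept as overhead.
	 */
	return sysctl_tcp_adv_win_scale <= 0 ?
	       (space >> (-sysctl_tcp_adv_win_scale)) :
	       space - (space >> sysctl_tcp_adv_win_scale);
}
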
Yang Yingliang April 8, 2016, 11:18 a.m. UTC | #3
On 2016/4/7 22:51, Eric Dumazet wrote:
> On Thu, 2016-04-07 at 03:21 -0700, Eric Dumazet wrote:
>
>> Please do not send patches before really understanding the issue you
>> have.
>>
>> Having a backlog of 12506206 bytes is ridiculous. Dropping packets is
>> absolutely fine if this ever happens.
>>
>> Something is really wrong on your host, or the sender simply does not
>> comply with the TCP protocol (not caring about the receiver window at all).
>>
>> Since you added a trace of truesize, please also trace skb->len
>>

[2016-04-08 18:33:39][ 9748.726948] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:31992, len:17540
[2016-04-08 18:33:39][ 9748.726964] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:29326, truesize:18662, 
len:10240
[2016-04-08 18:33:39][ 9748.726986] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:39990, len:21920
[2016-04-08 18:33:39][ 9748.727028] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.727068] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.727082] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:21328, truesize:5332, len:2940
[2016-04-08 18:33:39][ 9748.727310] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:53320, len:29220
[2016-04-08 18:33:39][ 9748.727326] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:26660, truesize:7998, len:4400
[2016-04-08 18:33:39][ 9748.727352] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:47988, truesize:58652, 
len:32140
[2016-04-08 18:33:39][ 9748.727389] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:0, truesize:39990, len:21920
[2016-04-08 18:33:39][ 9748.727409] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:12607514 rmem_alloc:58652, truesize:18662, 
len:10240

If I expand the buffer 5 times ((sndbuf+rcvbuf)*5), there are only about
5MB of data in the backlog at most.

[2016-04-08 18:33:39][ 9748.777743] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:5435954 rmem_alloc:0, truesize:55986, len:30680
[2016-04-08 18:33:39][ 9748.777762] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:5457282 rmem_alloc:58652, truesize:21328, 
len:11700
[2016-04-08 18:33:39][ 9748.777804] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:5515934 rmem_alloc:55986, truesize:58652, 
len:32140
[2016-04-08 18:33:39][ 9748.777818] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:5537262 rmem_alloc:0, truesize:21328, len:11700
[2016-04-08 18:33:39][ 9748.777839] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:5574586 rmem_alloc:0, truesize:37324, len:20460
[2016-04-08 18:33:39][ 9748.777854] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:5601246 rmem_alloc:58652, truesize:26660, 
len:14620
[2016-04-08 18:33:39][ 9748.777881] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:5659898 rmem_alloc:21328, truesize:58652, 
len:32140
[2016-04-08 18:33:39][ 9748.777894] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:5675894 rmem_alloc:37324, truesize:15996, len:8780
[2016-04-08 18:33:39][ 9748.778047] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:58652 rmem_alloc:0, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.778075] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:117304 rmem_alloc:0, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.778084] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:122636 rmem_alloc:0, truesize:5332, len:2940
[2016-04-08 18:33:39][ 9748.778109] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:175956 rmem_alloc:0, truesize:53320, len:29220
[2016-04-08 18:33:39][ 9748.778156] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:234608 rmem_alloc:0, truesize:58652, len:32140
[2016-04-08 18:33:39][ 9748.778178] TCP: rcvbuf:10485760 sndbuf:2097152 
limit:12582912 backloglen:282596 rmem_alloc:58652, truesize:47988, len:26300
>
> BTW, have you played with /proc/sys/net/ipv4/tcp_adv_win_scale ?
>

I expanded tcp_adv_win_scale and tcp_rmem. It has no effect.
Eric Dumazet April 8, 2016, 2:44 p.m. UTC | #4
On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:

> I expanded tcp_adv_win_scale and tcp_rmem. It has no effect.

Try :

echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale

And restart your flows.
David Miller April 8, 2016, 4:53 p.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 08 Apr 2016 07:44:25 -0700

> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
> 
>> I expanded tcp_adv_win_scale and tcp_rmem. It has no effect.
> 
> Try :
> 
> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
> 
> And restart your flows.

I'm honestly beginning to suspect a bug in their driver and how they
handle skb->truesize.

Yang, until you show us the driver you are using and how it handles
receive packets, we are largely in the dark about a major component
of this issue and that is entirely unfair to us.
Eric Dumazet April 8, 2016, 5:04 p.m. UTC | #6
On Fri, 2016-04-08 at 12:53 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 08 Apr 2016 07:44:25 -0700
> 
> > On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
> > 
> >> I expanded tcp_adv_win_scale and tcp_rmem. It has no effect.
> > 
> > Try :
> > 
> > echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
> > 
> > And restart your flows.
> 
> I'm honestly beginning to suspect a bug in their driver and how they
> handle skb->truesize.
> 
> Yang, until you show us the driver you are using and how it handles
> receive packets, we are largely in the dark about a major component
> of this issue and that is entirely unfair to us.

Apparently their skb->truesize and skb->len combinations are correct.

I suspect an issue with rcvbuf autotuning on bidirectional TCP traffic.
We mostly focus on unidirectional flows, but they seem to use a mixed
case.

Also, the fact that sendmsg() locks the socket for the duration of the
call is problematic: I suspect their issues would mostly disappear by
using smaller chunk sizes (i.e. 64KB per sendmsg() instead of 256KB).

We could also add resched points in sendmsg() (processing the backlog if
it gets too hot), but I fear this would slow down the fast path.
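
A minimal sketch of such a resched point (our illustration, not a
proposed patch; the threshold macro and helper name are assumptions).
Inside the tcp_sendmsg() copy loop, the socket lock could periodically
be dropped so that __release_sock() drains whatever softirqs have queued
to the backlog in the meantime:

#include <net/sock.h>

#define TCP_SENDMSG_BACKLOG_DRAIN_THRESH	(64 * 1024)	/* arbitrary */

static inline void tcp_sendmsg_maybe_drain_backlog(struct sock *sk)
{
	if (sk->sk_backlog.len >= TCP_SENDMSG_BACKLOG_DRAIN_THRESH) {
		/* release_sock() processes the whole backlog via
		 * __release_sock() before waking any lock waiters.
		 */
		release_sock(sk);
		lock_sock(sk);
	}
}

The extra unlock/lock pair on every check is exactly the fast-path cost
being worried about here.
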
Yang Yingliang April 11, 2016, 11:57 a.m. UTC | #7
On 2016/4/8 22:44, Eric Dumazet wrote:
> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>
>> I expanded tcp_adv_win_scale and tcp_rmem. It has no effect.
>
> Try :
>
> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>
> And restart your flows.
>
cat /proc/sys/net/ipv4/tcp_rmem
10240 2097152 10485760

echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale

It seems to have no effect.
Eric Dumazet April 11, 2016, 12:13 p.m. UTC | #8
On Mon, 2016-04-11 at 19:57 +0800, Yang Yingliang wrote:
> 
> On 2016/4/8 22:44, Eric Dumazet wrote:
> > On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
> >
> >> I expanded tcp_adv_win_scale and tcp_rmem. It has no effect.
> >
> > Try :
> >
> > echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
> >
> > And restart your flows.
> >
> cat /proc/sys/net/ipv4/tcp_rmem
> 10240 2097152 10485760

What about leaving the default values ?

$ cat /proc/sys/net/ipv4/tcp_rmem
4096	87380	6291456

> 
> echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
> 
> It seems to have no effect.
> 

I have no idea what you did on the sender side to allow it to send more
than 1.5 MB then.
Yang Yingliang April 11, 2016, 2:42 p.m. UTC | #9
On 2016/4/9 1:04, Eric Dumazet wrote:
> On Fri, 2016-04-08 at 12:53 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Fri, 08 Apr 2016 07:44:25 -0700
>>
>>> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>>>
>>>> I expanded tcp_adv_win_scale and tcp_rmem. It has no effect.
>>>
>>> Try :
>>>
>>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>>
>>> And restart your flows.
>>
>> I'm honestly beginning to suspect a bug in their driver and how they
>> handle skb->truesize.
>>
>> Yang, until you show us the driver you are using and how it handles
>> receive packets, we are largely in the dark about a major component
>> of this issue and that is entirely unfair to us.
>
> Apparently their skb->truesize and skb->len combinations are correct.
>
> I suspect an issue with rcvbuf autotuning on bidirectional TCP traffic.
> We mostly focus on unidirectional flows, but they seem to use a mixed
> case.
>
> Also, the fact that sendmsg() locks the socket for the duration of the
> call is problematic: I suspect their issues would mostly disappear by
> using smaller chunk sizes (i.e. 64KB per sendmsg() instead of 256KB).
There is less packet dropping when using 64KB chunks.

>
> We could also add resched points in sendmsg() (processing the backlog if
> it gets too hot), but I fear this would slow down the fast path.
Yang Yingliang April 12, 2016, 2:59 a.m. UTC | #10
On 2016/4/11 20:13, Eric Dumazet wrote:
> On Mon, 2016-04-11 at 19:57 +0800, Yang Yingliang wrote:
>>
>> On 2016/4/8 22:44, Eric Dumazet wrote:
>>> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>>>
>>>>> I expanded tcp_adv_win_scale and tcp_rmem. It has no effect.
>>>
>>> Try :
>>>
>>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>>
>>> And restart your flows.
>>>
>> cat /proc/sys/net/ipv4/tcp_rmem
>> 10240 2097152 10485760
>
> What about leaving the default values ?
I tried, it did not work.

>
> $ cat /proc/sys/net/ipv4/tcp_rmem
> 4096	87380	6291456
>
>>
>> echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>
>> It seems to have no effect.
>>
>
> I have no idea what you did on the sender side to allow it to send more
> than 1.5 MB then.

We are doing a performance test. The sender sends 256KB blocks from 128
threads to one socket. And the receiver uses a 10Gb NIC to handle the
data on ARM64. The data flow is driver->ip layer->tcp layer->iscsi.

I added some debug messages and found that handling backlog packets in
__release_sock() costs about 11ms at most. This can cause the backlog
queue to overflow. sk_data_ready is re-assigned in our program and may
cost time there; I will check it out.
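
For reference, __release_sock() in kernels of this era looked roughly
like the following (quoted from memory from net/core/sock.c, slightly
simplified). The lock owner drains the entire detached backlog, calling
sk_backlog_rcv() (and hence sk_data_ready) for every skb, and
sk_backlog.len is only zeroed at the very end; that deferred zeroing is
what the RFC in the subject line wants to change:

static void __release_sock(struct sock *sk)
{
	struct sk_buff *skb = sk->sk_backlog.head;

	do {
		/* Detach the whole queue, then process it unlocked */
		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
		bh_unlock_sock(sk);

		do {
			struct sk_buff *next = skb->next;

			skb->next = NULL;
			sk_backlog_rcv(sk, skb); /* tcp_v4_do_rcv() for TCP */
			cond_resched_softirq();
			skb = next;
		} while (skb != NULL);

		bh_lock_sock(sk);
	} while ((skb = sk->sk_backlog.head) != NULL);

	/* Zeroing len here, not at detach time, guarantees we cannot
	 * loop forever while a wild producer floods the backlog.
	 */
	sk->sk_backlog.len = 0;
}
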
Yang Yingliang April 12, 2016, 12:31 p.m. UTC | #11
On 2016/4/12 10:59, Yang Yingliang wrote:
>
>
> On 2016/4/11 20:13, Eric Dumazet wrote:
>> On Mon, 2016-04-11 at 19:57 +0800, Yang Yingliang wrote:
>>>
>>> On 2016/4/8 22:44, Eric Dumazet wrote:
>>>> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>>>>
>>>>> I expanded tcp_adv_win_scale and tcp_rmem. It has no effect.
>>>>
>>>> Try :
>>>>
>>>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>>>
>>>> And restart your flows.
>>>>
>>> cat /proc/sys/net/ipv4/tcp_rmem
>>> 10240 2097152 10485760
>>
>> What about leaving the default values ?
> I tried, it did not work.
>
>>
>> $ cat /proc/sys/net/ipv4/tcp_rmem
>> 4096    87380    6291456
>>
>>>
>>> echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
>>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>>
>>> It seems to have no effect.
>>>
>>
>> I have no idea what you did on the sender side to allow it to send more
>> than 1.5 MB then.
>
> We are doing a performance test. The sender sends 256KB blocks from 128
> threads to one socket. And the receiver uses a 10Gb NIC to handle the
> data on ARM64. The data flow is driver->ip layer->tcp layer->iscsi.
>
> I added some debug messages and found that handling backlog packets in
> __release_sock() costs about 11ms at most. This can cause the backlog
> queue to overflow. sk_data_ready is re-assigned in our program and may
> cost time there; I will check it out.
>
I traced the cost in cycles of handling backlog packets in
__release_sock().
It took 16.97 ms to handle about 12MB of backlog packets, of which
13.66 ms was spent in sk_data_ready.
The TCP packet-handling speed is 5.65Gb/s, which is lower than the
NIC's bandwidth, so packets will be dropped.

If the cost of sk_data_ready cannot be reduced, do we have any choice
other than dropping packets ?
Eric Dumazet April 13, 2016, 2:42 a.m. UTC | #12
On Tue, 2016-04-12 at 20:31 +0800, Yang Yingliang wrote:

> I traced the cost in cycles of handling backlog packets in
> __release_sock().
> It took 16.97 ms to handle about 12MB of backlog packets, of which
> 13.66 ms was spent in sk_data_ready.
> The TCP packet-handling speed is 5.65Gb/s, which is lower than the
> NIC's bandwidth, so packets will be dropped.
> 
> If the cost of sk_data_ready cannot be reduced, do we have any choice
> other than dropping packets ?

Normally, the TCP stack sends ACK packets with an appropriate RWIN.

The sender should not send more packets than allowed by RWIN; even if
there are 128 threads using one TCP socket, it does not matter.

Imagine you do not have a backlog problem (nothing does a sendmsg()
while you receive data), and nothing reads the socket. Then the receiver
should eventually send WIN 0 back to the sender, and the sender should
stop before any drop can possibly happen.

I have no problem receiving one TCP flow at 34Gbit, so it must be
something related to the huge windows you seem to use.

One possibility could be to advertise a reduced rwin in ACK packets so
that the sender is not allowed to continue the flood while we are
painfully processing a huge backlog.
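
A speculative sketch of that last idea (purely illustrative; the helper
name, clamp condition and scaling are our assumptions, and the real
advertised window is computed in tcp_select_window()):

#include <net/sock.h>

static u32 tcp_clamp_rwin_for_backlog(const struct sock *sk, u32 cur_win)
{
	/* If the backlog already holds more than half of what we are
	 * willing to queue, halve the advertised window so the sender
	 * backs off while we drain.
	 */
	if (sk->sk_backlog.len > (sk->sk_rcvbuf + sk->sk_sndbuf) / 2)
		cur_win >>= 1;

	return cur_win;
}
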

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6d204f3..da1bc16 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -281,6 +281,7 @@  extern unsigned int sysctl_tcp_notsent_lowat;
  extern int sysctl_tcp_min_tso_segs;
  extern int sysctl_tcp_autocorking;
  extern int sysctl_tcp_invalid_ratelimit;
+extern int sysctl_tcp_backlog_buf_multi;

  extern atomic_long_t tcp_memory_allocated;
  extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index f0e8297..9511410 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -631,6 +631,13 @@  static struct ctl_table ipv4_table[] = {
  		.mode		= 0644,
  		.proc_handler	= proc_dointvec
  	},
+	{
+		.procname	= "tcp_backlog_buf_multi",
+		.data		= &sysctl_tcp_backlog_buf_multi,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
  #ifdef CONFIG_NETLABEL
  	{
  		.procname	= "cipso_cache_enable",
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 87463c8..337ad55 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -101,6 +101,8 @@  int sysctl_tcp_thin_dupack __read_mostly;
  int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
  int sysctl_tcp_early_retrans __read_mostly = 3;
  int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
+int sysctl_tcp_backlog_buf_multi __read_mostly = 1;
+EXPORT_SYMBOL(sysctl_tcp_backlog_buf_multi);

  #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
  #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 13b92d5..39272f3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1635,7 +1635,8 @@  process:
  		if (!tcp_prequeue(sk, skb))
  			ret = tcp_v4_do_rcv(sk, skb);
  	} else if (unlikely(sk_add_backlog(sk, skb,
-					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
+					   (sk->sk_rcvbuf + sk->sk_sndbuf) *
+					   sysctl_tcp_backlog_buf_multi))) {
  		bh_unlock_sock(sk);
  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
  		goto discard_and_relse;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c1147ac..1e8f709 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1433,7 +1433,8 @@  process:
  		if (!tcp_prequeue(sk, skb))
  			ret = tcp_v6_do_rcv(sk, skb);
  	} else if (unlikely(sk_add_backlog(sk, skb,
-					   sk->sk_rcvbuf + sk->sk_sndbuf))) {
+					   (sk->sk_rcvbuf + sk->sk_sndbuf) *
+					   sysctl_tcp_backlog_buf_multi))) {
  		bh_unlock_sock(sk);
  		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
  		goto discard_and_relse;
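
For context, the limit argument that both hunks above scale is enforced
in sk_add_backlog() (include/net/sock.h; quoted from memory, slightly
simplified). Both the bytes already queued to the backlog and the bytes
charged to the receive queue count against the limit, and a failure here
is what tcp_v4_rcv()/tcp_v6_rcv() account as TCPBacklogDrop:

static inline __must_check int sk_add_backlog(struct sock *sk,
					      struct sk_buff *skb,
					      unsigned int limit)
{
	unsigned int qsize = sk->sk_backlog.len +
			     atomic_read(&sk->sk_rmem_alloc);

	if (qsize > limit)
		return -ENOBUFS;	/* caller bumps LINUX_MIB_TCPBACKLOGDROP */

	__sk_add_backlog(sk, skb);
	sk->sk_backlog.len += skb->truesize;
	return 0;
}

Under the patch, setting sysctl_tcp_backlog_buf_multi to 5 reproduces
the "(sndbuf+rcvbuf)*5" experiment reported earlier in the thread.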