
[net-next] tcp: reduce memory needs of out of order queue

Message ID 1318576791.2533.99.camel@edumazet-laptop
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Oct. 14, 2011, 7:19 a.m. UTC
Many drivers allocate a big skb to store a single TCP frame.
(WIFI drivers, or NICs using PAGE_SIZE fragments)

It's now common to get skb->truesize bigger than 4096 to store a
~1500-byte TCP frame.

TCP sessions with large RTT and packet losses can fill their Out Of
Order queue with such oversized skbs and hit their sk_rcvbuf limit,
triggering a prune of the complete OFO queue without giving the missing
packet(s) a chance to arrive and move skbs from OFO to the receive queue.

This patch adds a skb_reduce_truesize() helper and uses it for all skbs
queued into the OFO queue.

Spending some time to perform a copy is worth the pain, since it gives
SACK processing a chance to complete across the RTT barrier.

This greatly improves user experience, without added cost on the fast path.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/tcp_input.c |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
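
For context when reading skb_reduce_truesize() in the patch below:
SKB_TRUESIZE() estimates the "honest" cost of holding X bytes of data.
Its definition in include/linux/skbuff.h of this vintage is, as best I
recall (treat as a sketch, not gospel):

#define SKB_TRUESIZE(X) ((X) +						\
			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))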




Comments

David Miller Oct. 14, 2011, 7:42 a.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 14 Oct 2011 09:19:51 +0200

> Many drivers allocate a big skb to store a single TCP frame.
> (WIFI drivers, or NICs using PAGE_SIZE fragments)
> 
> It's now common to get skb->truesize bigger than 4096 to store a
> ~1500-byte TCP frame.
> 
> TCP sessions with large RTT and packet losses can fill their Out Of
> Order queue with such oversized skbs and hit their sk_rcvbuf limit,
> triggering a prune of the complete OFO queue without giving the missing
> packet(s) a chance to arrive and move skbs from OFO to the receive queue.
> 
> This patch adds a skb_reduce_truesize() helper and uses it for all skbs
> queued into the OFO queue.
> 
> Spending some time to perform a copy is worth the pain, since it gives
> SACK processing a chance to complete across the RTT barrier.
> 
> This greatly improves user experience, without added cost on the fast path.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

No objection from me, although I wish wireless drivers were able to
size their SKBs more appropriately.  I wonder how many problems that
look like "OMG we gotz da Buffer Bloat, arrr!" are actually due to
this truesize issue.

I think such large truesize SKBs will cause problems even in non-loss
situations, in that the receive buffer will hit its limits more
quickly.  I'm not sure that the receive buffer autotuning is built to
handle this sort of scenario as a common occurrence.

You might want to check if this is the actual root cause of your
problems.  If the receive buffer autotuning doesn't expand the receive
buffer enough to hold two windows worth of these large truesize SKBs,
that's the real reason why we end up pruning.

We have to decide if these kinds of SKBs are acceptable as a normal
situation for MSS sized frames.  And if they are then it's probably
a good idea to adjust the receive buffer autotuning code too.

Although I realize it might be difficult, getting rid of these weird
SKBs in the first place would be ideal.

It would also be a good idea to put the truesize inaccuracies into
perspective when selecting how to fix this.  It's there to prevent
1-byte packets not accounting for the 256 bytes of SKB and metadata.
That kind of case, with such a high ratio of wastage, is the important
one.

On the other hand, using 2048 bytes for a 1500 byte packet and claiming
the truesize is 1500 + sizeof(metadata)... that might be an acceptable
lie to tell :-)  This is especially true if it allows an easy solution
to this wireless problem.

Just some thoughts...  and I wonder if the wireless thing is due to
some hardware limitation or similar.

Eric Dumazet Oct. 14, 2011, 8:05 a.m. UTC | #2
On Friday 14 October 2011 at 03:42 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 14 Oct 2011 09:19:51 +0200
> 
> No objection from me, although I wish wireless drivers were able to
> size their SKBs more appropriately.  I wonder how many problems that
> look like "OMG we gotz da Buffer Bloat, arrr!" are actually due to
> this truesize issue.
> 
> I think such large truesize SKBs will cause problems even in non-loss
> situations, in that the receive buffer will hit its limits more
> quickly.  I'm not sure that the receive buffer autotuning is built to
> handle this sort of scenario as a common occurrence.
> 
> You might want to check if this is the actual root cause of your
> problems.  If the receive buffer autotuning doesn't expand the receive
> buffer enough to hold two windows worth of these large truesize SKBs,
> that's the real reason why we end up pruning.
> 
> We have to decide if these kinds of SKBs are acceptable as a normal
> situation for MSS sized frames.  And if they are then it's probably
> a good idea to adjust the receive buffer autotuning code too.
> 
> Although I realize it might be difficult, getting rid of these weird
> SKBs in the first place would be ideal.
> 
> It would also be a good idea to put the truesize inaccuracies into
> perspective when selecting how to fix this.  It's there to prevent
> 1-byte packets not accounting for the 256 bytes of SKB and metadata.
> That kind of case, with such a high ratio of wastage, is the important
> one.
> 
> On the other hand, using 2048 bytes for a 1500 byte packet and claiming
> the truesize is 1500 + sizeof(metadata)... that might be an acceptable
> lie to tell :-)  This is especially true if it allows an easy solution
> to this wireless problem.
> 
> Just some thoughts...  and I wonder if the wireless thing is due to
> some hardware limitation or similar.
> 

This patch specifically addresses the OFO problem, trying to lower
memory usage for machines handling lots of sockets (proxies, for example).

For the general case, I believe we have to tune/change
tcp_win_from_space() to take into account the general tendency to get
fat skbs.

sysctl_tcp_adv_win_scale is not fine-grained enough today, and the
default value (2) causes too many collapses. It's also a very complex
setting; I am pretty sure nobody knows how to use it.

tcp_win_from_space(int space) -> 75% of space  [ default, tcp_adv_win_scale = 2 ]

The only other choices current kernels offer are to set it to 1 or -1:

tcp_win_from_space(int space) -> 50% of space

or to -2:

tcp_win_from_space(int space) -> 25% of space
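
For reference, this is the helper in question as it looks in kernels of
this era (include/net/tcp.h; quoted from memory, so treat as a sketch):
negative sysctl values take a power-of-two fraction of the buffer as
window, positive values reserve 1/2^scale of it as overhead.

static inline int tcp_win_from_space(int space)
{
	return sysctl_tcp_adv_win_scale <= 0 ?
	       (space >> (-sysctl_tcp_adv_win_scale)) :
	       space - (space >> sysctl_tcp_adv_win_scale);
}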



Rick Jones Oct. 14, 2011, 3:50 p.m. UTC | #3
On 10/14/2011 12:42 AM, David Miller wrote:

> No objection from me, although I wish wireless drivers were able to
> size their SKBs more appropriately.  I wonder how many problems that
> look like "OMG we gotz da Buffer Bloat, arrr!" are actually due to
> this truesize issue.

I think the buffer bloat folks are looking at latency through transmit 
queues - now perhaps some of their latency is really coming from 
retransmissions caused by packets being dropped from overfilled socket 
buffers, but I'm pretty sure they are clever enough to look for that.

> I think such large truesize SKBs will cause problems even in non-loss
> situations, in that the receive buffer will hit its limits more
> quickly.  I'm not sure that the receive buffer autotuning is built to
> handle this sort of scenario as a common occurrence.

I believe that may be the case - at least during something like:

netperf -t TCP_RR -H <host> -l 30 -- -b 256 -D

which on an otherwise quiet test setup will report a non-trivial number 
of retransmissions - either via looking at netstat -s output, or by 
adding local_transport_retrans,remote_transport_retrans to an output 
selector for netperf (eg -o 
throughput,burst_size,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end)

(I plan on providing more data after a laptop has gone through some 
upgrades)

> You might want to check if this is the actual root cause of your
> problems.  If the receive buffer autotuning doesn't expand the receive
> buffer enough to hold two windows worth of these large truesize SKBs,
> that's the real reason why we end up pruning.
>
> We have to decide if these kinds of SKBs are acceptable as a normal
> situation for MSS sized frames.  And if they are then it's probably
> a good idea to adjust the receive buffer autotuning code too.
>
> Although I realize it might be difficult, getting rid of these weird
> SKBs in the first place would be ideal.

That means a semi-arbitrary alloc/copy in drivers, even when/if the 
wasted space isn't going to be a problem, no?  That TCP_RR test above 
would run "just fine" if the burst size was much smaller, but if there 
was an arbitrary allocate/copy it would take a service-demand and thus 
transaction-rate hit.

> It would also be a good idea to put the truesize inaccuracies into
> perspective when selecting how to fix this.  It's there to prevent
> 1-byte packets not accounting for the 256 bytes of SKB and metadata.
> That kind of case, with such a high ratio of wastage, is the important
> one.
>
> On the other hand, using 2048 bytes for a 1500 byte packet and claiming
> the truesize is 1500 + sizeof(metadata)... that might be an acceptable
> lie to tell :-)  This is especially true if it allows an easy solution
> to this wireless problem.

Is the wireless problem strictly a wireless problem?  Many of the 
drivers where Eric has been fixing the truesize accounting have been 
wired devices, no?

rick jones
Eric Dumazet Oct. 14, 2011, 4 p.m. UTC | #4
On Friday 14 October 2011 at 08:50 -0700, Rick Jones wrote:

> Is the wireless problem strictly a wireless problem?  Many of the 
> drivers where Eric has been fixing the truesize accounting have been 
> wired devices, no?

Yes, but the goal of such fixes is to make the bugs happen with said
wired devices too ;)

About WIFI, I get these TCP collapses on two different machines, one of
them using the drivers/net/wireless/rt2x00 driver.

Extract from drivers/net/wireless/rt2x00/rt2x00queue.h

/**
 * DOC: Entry frame size
 * 
 * Ralink PCI devices demand the Frame size to be a multiple of 128 bytes,
 * for USB devices this restriction does not apply, but the value of
 * 2432 makes sense since it is big enough to contain the maximum fragment
 * size according to the ieee802.11 specs. 
 * The aggregation size depends on support from the driver, but should
 * be something around 3840 bytes.
 */
#define DATA_FRAME_SIZE         2432
#define MGMT_FRAME_SIZE         256
#define AGGREGATION_SIZE        3840



You can see why we end up using skb->truesize > 4096 buffers.
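
As a rough illustration (userspace sketch; the struct sizes below are
illustrative guesses, not the exact kernel values), here is the
arithmetic that pushes these frames past a 4096-byte slab object:

#include <stdio.h>

int main(void)
{
	unsigned int frame  = 2432;  /* DATA_FRAME_SIZE */
	unsigned int shinfo = 320;   /* ~sizeof(struct skb_shared_info), rough */
	unsigned int skbuff = 256;   /* ~aligned sizeof(struct sk_buff), rough */
	unsigned int slab;

	/* kmalloc() rounds the head allocation up to a power of two */
	for (slab = 32; slab < frame + shinfo; slab <<= 1)
		;
	/* prints: data 2752 -> slab 4096 -> truesize ~4352 */
	printf("data %u -> slab %u -> truesize ~%u\n",
	       frame + shinfo, slab, slab + skbuff);
	return 0;
}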

I liked doing the copybreak only when needed; I found the OFO case was
responsible for the collapses most of the time.

Now we could also do the copybreak for frames queued into the regular
receive_queue, if the current wmem_alloc is above 25% of rcvbuf space...



Eric Dumazet Oct. 14, 2011, 5:33 p.m. UTC | #5
On Friday 14 October 2011 at 10:05 +0200, Eric Dumazet wrote:

> This patch specifically addresses the OFO problem, trying to lower
> memory usage for machines handling lot of sockets (proxies for example)

Well, a bit more thinking about this is needed, so please zap the patch.

Thanks


Rick Jones Oct. 14, 2011, 10:12 p.m. UTC | #6
> I believe that may be the case - at least during something like:
>
> netperf -t TCP_RR -H <host> -l 30 -- -b 256 -D
>
> which on an otherwise quiet test setup will report a non-trivial number
> of retransmissions - either via looking at netstat -s output, or by
> adding local_transport_retrans,remote_transport_retrans to an output
> selector for netperf (eg -o
> throughput,burst_size,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end)
>
>
> (I plan on providing more data after a laptop has gone through some
> upgrades)

So, a test as above from a system running 2.6.38-11-generic to a system 
running 3.0.0-12-generic.  On the sender we have:

raj@tardy:~/netperf2_trunk$ netstat -s > before; src/netperf -H 
raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o 
throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end 
; netstat -s > after
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET 
to internal-host.americas.hpqcorp.net (16.89.245.115) port 0 AF_INET : 
nodelay : first burst 256
Throughput,Local Transport Retransmissions,Remote Transport 
Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final
76752.43,274,0,16384,98304

274 retransmissions at the sender.  The "beforeafter" of that on the sender:

raj@tardy:~/netperf2_trunk$ cat delta.send
Ip:
     766747 total packets received
     12 with invalid addresses
     0 forwarded
     0 incoming packets discarded
     766735 incoming packets delivered
     734689 requests sent out
     0 dropped because of missing route
Icmp:
     0 ICMP messages received
     0 input ICMP message failed.
     ICMP input histogram:
         destination unreachable: 0
         echo requests: 0
         echo replies: 0
     0 ICMP messages sent
     0 ICMP messages failed
     ICMP output histogram:
         destination unreachable: 0
         echo request: 0
         echo replies: 0
IcmpMsg:
         InType0: 0
         InType3: 0
         InType8: 0
         OutType0: 0
         OutType3: 0
         OutType8: 0
Tcp:
     2 active connections openings
     0 passive connection openings
     0 failed connection attempts
     0 connection resets received
     0 connections established
     766727 segments received
     734408 segments send out
     274 segments retransmited
     0 bad segments received.
     0 resets sent
Udp:
     7 packets received
     0 packets to unknown port received.
     0 packet receive errors
     7 packets sent
UdpLite:
TcpExt:
     0 packets pruned from receive queue because of socket buffer overrun
     0 ICMP packets dropped because they were out-of-window
     0 TCP sockets finished time wait in fast timer
     2 delayed acks sent
     0 delayed acks further delayed because of locked socket
     Quick ack mode was activated 0 times
     170856 packets directly queued to recvmsg prequeue.
     1204 bytes directly in process context from backlog
     170678 bytes directly received in process context from prequeue
     592090 packet headers predicted
     170626 packets header predicted and directly queued to user
     1375 acknowledgments not containing data payload received
     174911 predicted acknowledgments
     150 times recovered from packet loss by selective acknowledgements
     0 congestion windows recovered without slow start by DSACK
     0 congestion windows recovered without slow start after partial ack
     299 TCP data loss events
     TCPLostRetransmit: 9
     0 timeouts after reno fast retransmit
     0 timeouts after SACK recovery
     253 fast retransmits
     14 forward retransmits
     6 retransmits in slow start
     0 other TCP timeouts
     1 SACK retransmits failed
     0 times receiver scheduled too late for direct processing
     0 packets collapsed in receive queue due to low socket buffer
     0 DSACKs sent for old packets
     0 DSACKs received
     0 connections reset due to unexpected data
     0 connections reset due to early user close
     0 connections aborted due to timeout
     0 times unabled to send RST due to no memory
     TCPDSACKIgnoredOld: 0
     TCPDSACKIgnoredNoUndo: 0
     TCPSackShifted: 0
     TCPSackMerged: 1031
     TCPSackShiftFallback: 240
     TCPBacklogDrop: 0
     IPReversePathFilter: 0
IpExt:
     InMcastPkts: 0
     OutMcastPkts: 0
     InBcastPkts: 1
     InOctets: -1012182764
     OutOctets: -1436530450
     InMcastOctets: 0
     OutMcastOctets: 0
     InBcastOctets: 147

and then the deltas on the receiver:

raj@raj-8510w:~/netperf2_trunk$ cat delta.recv
Ip:
     734669 total packets received
     0 with invalid addresses
     0 forwarded
     0 incoming packets discarded
     734669 incoming packets delivered
     766696 requests sent out
     0 dropped because of missing route
Icmp:
     0 ICMP messages received
     0 input ICMP message failed.
     ICMP input histogram:
         destination unreachable: 0
     0 ICMP messages sent
     0 ICMP messages failed
     ICMP output histogram:
IcmpMsg:
         InType3: 0
Tcp:
     0 active connections openings
     2 passive connection openings
     0 failed connection attempts
     0 connection resets received
     0 connections established
     734651 segments received
     766695 segments send out
     0 segments retransmited
     0 bad segments received.
     0 resets sent
Udp:
     1 packets received
     0 packets to unknown port received.
     0 packet receive errors
     1 packets sent
UdpLite:
TcpExt:
     28 packets pruned from receive queue because of socket buffer overrun
     0 delayed acks sent
     0 delayed acks further delayed because of locked socket
     19 packets directly queued to recvmsg prequeue.
     0 bytes directly in process context from backlog
     667 bytes directly received in process context from prequeue
     727842 packet headers predicted
     9 packets header predicted and directly queued to user
     161 acknowledgments not containing data payload received
     229704 predicted acknowledgments
     6774 packets collapsed in receive queue due to low socket buffer
     TCPBacklogDrop: 276
IpExt:
     InMcastPkts: 0
     OutMcastPkts: 0
     InBcastPkts: 17
     OutBcastPkts: 0
     InOctets: 38973144
     OutOctets: 40673137
     InMcastOctets: 0
     OutMcastOctets: 0
     InBcastOctets: 1816
     OutBcastOctets: 0

this is an otherwise clean network, no errors reported by ifconfig or 
ethtool -S, and the packet rate was well within the limits of 1 GbE and 
the ProCurve 2724 switch between the two systems.

 From just a very quick look it looks like tcp_v[46]_rcv is called, 
finds that the socket is owned by the user, attempts to add to the 
backlog, but the path called by sk_add_backlog does not seem to make any 
attempts to compress things, so when the quantity of data is << the 
truesize it starts tossing babies out with the bathwater.

rick jones
David Miller Oct. 14, 2011, 11:18 p.m. UTC | #7
From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 14 Oct 2011 15:12:04 -0700

> From just a very quick look it looks like tcp_v[46]_rcv is called,
> finds that the socket is owned by the user, attempts to add to the
> backlog, but the path called by sk_add_backlog does not seem to make
> any attempts to compress things, so when the quantity of data is <<
> the truesize it starts tossing babies out with the bathwater.

This is why I don't believe the right fix is to add bandaids all
around the TCP layer.

The wastage has to be avoided at a higher level.
Eric Dumazet Oct. 15, 2011, 6:39 a.m. UTC | #8
On Friday 14 October 2011 at 15:12 -0700, Rick Jones wrote:

Thanks Rick


> 274 retransmissions at the sender.  The "beforeafter" of that on the sender:
> 
> raj@tardy:~/netperf2_trunk$ cat delta.send

> Tcp:
>      2 active connections openings
>      0 passive connection openings
>      0 failed connection attempts
>      0 connection resets received
>      0 connections established
>      766727 segments received
>      734408 segments send out

>      274 segments retransmited

	Exactly the count of frames dropped because the receiver's
sk_rmem_alloc + backlog.len hit the receiver's sk_rcvbuf:

static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb)
{
        unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);

        return qsize + skb->truesize > sk->sk_rcvbuf;
}

static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
        if (sk_rcvqueues_full(sk, skb))
                return -ENOBUFS;

        __sk_add_backlog(sk, skb);
        sk->sk_backlog.len += skb->truesize;
        return 0;
}

In very old kernels there was no limit on the backlog, so we could queue
lots of extra skbs in it and eventually consume all kernel memory (OOM).

refs: commit c377411f249 ("net: sk_add_backlog() take rmem_alloc into account")
      commit 6b03a53a5ab7 ("tcp: use limited socket backlog")
      commit 8eae939f14003 ("net: add limit for socket backlog")

Now that we enforce a limit, it is better to choose a correct limit /
TCP window combination so that normal traffic doesn't trigger drops at
the receiver.

> and then the deltas on the receiver:
> 
> raj@raj-8510w:~/netperf2_trunk$ cat delta.recv


>      6774 packets collapsed in receive queue due to low socket buffer
>      TCPBacklogDrop: 276

	Yes, these two counters explain it all.

	1) "6774 packets collapsed in receive queue due to low socket buffer"

We spend a _lot_ of cpu time in the "collapsing" process: taking several
skbs and building a compound one (using one PAGE and trying to fill all
the available bytes in it with contiguous parts).
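
As a rough illustration (hypothetical numbers): sixteen 250-byte
segments, each carried in a ~2304-byte-truesize skb, account for about
36 KB of rcvbuf; collapsed into one page-backed skb, the same 4000 bytes
of payload cost roughly PAGE_SIZE plus metadata, i.e. ~4.4 KB.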

Doing this work is of course the last desperate attempt before the much
more painful:

	2) TCPBacklogDrop: 276

	We plain drop incoming messages because too much kernel memory is used
by the socket.

>  From just a very quick look it looks like tcp_v[46]_rcv is called, 
> finds that the socket is owned by the user, attempts to add to the 
> backlog, but the path called by sk_add_backlog does not seem to make any 
> attempts to compress things, so when the quantity of data is << the 
> truesize it starts tossing babies out with the bathwater.
> 

Rick, could you redo the test, using the following on the receiver:

echo 1 >/proc/sys/net/ipv4/tcp_adv_win_scale

If you still have collapses/retransmits, you could then try:

echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale

Thanks !


Eric Dumazet Oct. 15, 2011, 6:54 a.m. UTC | #9
On Friday 14 October 2011 at 19:18 -0400, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 14 Oct 2011 15:12:04 -0700
> 
> > From just a very quick look it looks like tcp_v[46]_rcv is called,
> > finds that the socket is owned by the user, attempts to add to the
> > backlog, but the path called by sk_add_backlog does not seem to make
> > any attempts to compress things, so when the quantity of data is <<
> > the truesize it starts tossing babies out with the bathwater.
> 
> This is why I don't believe the right fix is to add bandaids all
> around the TCP layer.
> 
> The wastage has to be avoided at a higher level.

We can't do that at a higher level without smart hardware (like NIU) or
adding a copy.

It's a tradeoff between space and speed.

Most drivers have to allocate a large skb1 and post it to hardware to
receive a frame (unknown length; only the max length is known).

Some drivers have a copybreak feature, doing a copy of small incoming
frames into a smaller skb2 (skb2->truesize < skb1->truesize)

This strategy does save memory for small frames, but not for 1500-byte
frames.

I think the problem is in the TCP layer (and maybe in other protocols):

1) Either tune rcvbuf to allow more memory to be used for a particular
TCP window,

   or lower the TCP window to allow fewer packets in flight for a given
rcvbuf.

2) TCP collapse already tries to reduce the memory cost of a TCP socket
with many packets in the OFO queue. But fixing 1) would make these
collapses never happen in the first place. People wanting high TCP
bandwidth [with, say, more than 500 in-flight packets per session] can
certainly afford enough memory.



David Miller Oct. 17, 2011, 12:53 a.m. UTC | #10
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 15 Oct 2011 08:54:42 +0200

> I think the problem is in the TCP layer (and maybe in other protocols):
> 
> 1) Either tune rcvbuf to allow more memory to be used for a particular
> TCP window,
> 
>    or lower the TCP window to allow fewer packets in flight for a given
> rcvbuf.
> 
> 2) TCP collapse already tries to reduce the memory cost of a TCP socket
> with many packets in the OFO queue. But fixing 1) would make these
> collapses never happen in the first place. People wanting high TCP
> bandwidth [with, say, more than 500 in-flight packets per session] can
> certainly afford enough memory.

So perhaps the best solution is to divorce truesize from such driver
and device details?  If there is one calculation, then TCP need only
be concerned with one case.

Look at how confusing and useless tcp_adv_win_scale ends up being for
this problem.

Therefore I'll make the mostly-serious proposal that truesize be
something like "initial_real_total_data + sizeof(metadata)"

So if a device receives a 512 byte packet, it's:

	512 + sizeof(metadata)

It still provides the necessary protection that truesize is meant to
provide, yet sanitizes all of the receive and send buffer overhead
handling.

TCP should be absolutely, and completely, impervious to details like
how buffering needs to be done for some random wireless card.  Just
the mere fact that using a larger buffer in a driver ruins TCP
performance indicates a serious design failure.


Eric Dumazet Oct. 17, 2011, 7:02 a.m. UTC | #11
On Sunday 16 October 2011 at 20:53 -0400, David Miller wrote:

> So perhaps the best solution is to divorce truesize from such driver
> and device details?  If there is one calculation, then TCP need only
> be concerned with one case.
> 
> Look at how confusing and useless tcp_adv_win_scale ends up being for
> this problem.
> 
> Therefore I'll make the mostly-serious proposal that truesize be
> something like "initial_real_total_data + sizeof(metadata)"
> 
> So if a device receives a 512 byte packet, it's:
> 
> 	512 + sizeof(metadata)
> 

That would probably OOM in stress situations, with thousands of sockets.

> It still provides the necessary protection that truesize is meant to
> provide, yet sanitizes all of the receive and send buffer overhead
> handling.
> 
> TCP should be absolutely, and completely, impervious to details like
> how buffering needs to be done for some random wireless card.  Just
> the mere fact that using a larger buffer in a driver ruins TCP
> performance indicates a serious design failure.
> 

I don't think it's a design failure. It's the same problem as when
computing the TCP window given the rcvspace (memory we allow the socket
to consume) based on the MSS: if the sender uses 1-byte frames only,
then the receiver hits the memory limit and performance drops.

Right now our TCP window tuning really assumes too much: a perfect MSS
skb using _exactly_ MSS + sizeof(metadata), while we already know that
the real slab cost is higher:

  __roundup_pow_of_two(MSS + sizeof(struct skb_shared_info)) +
  SKB_DATA_ALIGN(sizeof(struct sk_buff))

and now, with paged-frag devices:

  PAGE_SIZE + SKB_DATA_ALIGN(sizeof(struct sk_buff))

We assume the sender behaves correctly and that drivers don't use 64KB
pages to store a single 72-byte frame.
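
Plugging rough numbers into those formulas (MSS = 1448, ~256 bytes for
the aligned sk_buff and ~320 for skb_shared_info -- illustrative
figures, the exact struct sizes vary by config): the slab formula gives
__roundup_pow_of_two(1448 + 320) + 256 = 2048 + 256 = 2304 bytes, and
the paged-frag one gives 4096 + 256 = 4352 bytes -- about 3x the 1448
bytes of payload that the window accounting assumes.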

I would say the first thing the TCP stack must respect is the memory
limits the admin set for it. That's what skb->truesize is for.

# cat /proc/sys/net/ipv4/tcp_rmem
4096	87380	4127616

In this case, we allow up to 4 Mbytes of receiver memory per session,
not 20 or 30 Mbytes...

We must translate this to a TCP window, suitable for current hardware.
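
To make that concrete (illustrative arithmetic, using the
tcp_win_from_space() behavior described earlier in the thread): with the
default tcp_adv_win_scale = 2, a 4127616-byte rcvbuf is advertised as
roughly a 3 Mbyte window, which is only safe if an skb really costs
about 4/3 of its payload. If truesize runs at ~3x payload, the same
rcvbuf can only back about 4127616 / 3 ~= 1.3 Mbytes of in-flight data
before we start collapsing and pruning.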



Rick Jones Oct. 17, 2011, 4:47 p.m. UTC | #12
>
> Rick, could you redo the test, using the following on the receiver:
>
> echo 1 >/proc/sys/net/ipv4/tcp_adv_win_scale


raj@tardy:~/netperf2_trunk$ netstat -s > before; src/netperf -H 
raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o 
throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end 
; netstat -s > after
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET 
to internal-host.americas.hpqcorp.net (16.89.245.115) port 0 AF_INET : 
nodelay : first burst 256
Throughput,Local Transport Retransmissions,Remote Transport 
Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final
78527.68,289,0,16384,98304

Deltas on the receiver:

TcpExt:
     27 packets pruned from receive queue because of socket buffer overrun
     0 TCP sockets finished time wait in fast timer
     0 delayed acks sent
     0 delayed acks further delayed because of locked socket
     Quick ack mode was activated 0 times
     19 packets directly queued to recvmsg prequeue.
     0 bytes directly in process context from backlog
     670 bytes directly received in process context from prequeue
     739983 packet headers predicted
     14 packets header predicted and directly queued to user
     127 acknowledgments not containing data payload received
     235774 predicted acknowledgments
     0 other TCP timeouts
     6553 packets collapsed in receive queue due to low socket buffer
     0 DSACKs sent for old packets
     TCPBacklogDrop: 294

So, moving on to:

> If you still have collapses/retransmits, you could then try:
>
> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale

raj@tardy:~/netperf2_trunk$ netstat -s > before; src/netperf -H 
raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o 
throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end 
; netstat -s > after
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET 
to internal-host.americas.hpqcorp.net (16.89.245.115) port 0 AF_INET : 
nodelay : first burst 256
Throughput,Local Transport Retransmissions,Remote Transport 
Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final
95981.83,0,0,121200,156600

No retransmissions in that one.

rick

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c1653fe..1d10edb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4426,6 +4426,25 @@  static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
 	return 0;
 }
 
+/*
+ * Caller wants to reduce memory needs before queueing skb.
+ * The (expensive) copy should not be done in the fast path.
+ */
+static struct sk_buff *skb_reduce_truesize(struct sk_buff *skb)
+{
+	if (skb->truesize > 2 * SKB_TRUESIZE(skb->len)) {
+		struct sk_buff *nskb;
+
+		nskb = skb_copy_expand(skb, skb_headroom(skb), 0,
+				       GFP_ATOMIC | __GFP_NOWARN);
+		if (nskb) {
+			__kfree_skb(skb);
+			skb = nskb;
+		}
+	}
+	return skb;
+}
+
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcphdr *th = tcp_hdr(skb);
@@ -4553,6 +4572,11 @@  drop:
 	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
 		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
 
+	/* Since this skb might stay in the ofo queue a long time, try to
+	 * reduce its truesize (if it's too big) to avoid future pruning.
+	 * Many drivers allocate large buffers even to hold tiny frames.
+	 */
+	skb = skb_reduce_truesize(skb);
 	skb_set_owner_r(skb, sk);
 
 	if (!skb_peek(&tp->out_of_order_queue)) {