[net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"

Message ID 1397504717-19566-1-git-send-email-dborkman@redhat.com
State Accepted, archived
Delegated to: David Miller

Commit Message

Daniel Borkmann April 14, 2014, 7:45 p.m. UTC
This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
to reflect real state of the receiver's buffer") as it introduced a
serious performance regression on SCTP over IPv4 and IPv6, though not
as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

Current state:

[root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 17:56:21 GMT
Connecting to host 192.168.241.3, port 5201
      Cookie: Lab200slot2.1397238981.812898.548918
[  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
[  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
[  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
[  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
[  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
[  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
[  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
[  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
[  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
[  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
[  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
[  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
[  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
(etc)

[root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 19:08:41 GMT
Connecting to host 2001:db8:0:f101::1, port 5201
      Cookie: Lab200slot2.1397243321.714295.2b3f7c
[  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
[  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
[  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
[  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
[  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
[  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
[  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
[  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
[  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
[  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
[  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
[  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
(etc)

After patch:

[root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
Time: Mon, 14 Apr 2014 16:40:48 GMT
Connecting to host 192.168.240.3, port 5201
      Cookie: Lab200slot2.1397493648.413274.65e131
[  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
[  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
[  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
[  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec

With the revert applied, SCTP performance on latest upstream is back to
normal for both IPv4 and IPv6, matching the throughput of the 3.4.2 test
kernel; steady and interval reports are smooth again.

Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
Reported-by: Peter Butler <pbutler@sonusnet.com>
Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Peter Butler <pbutler@sonusnet.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
---
 As commit ef2820a735f7 affects kernels 3.11 and onwards, this needs to
 be reworked by Matija for net-next, so that this fix can go back to
 -stable and restore performance for 3.11-3.15 kernels.

 include/net/sctp/structs.h | 14 +++++++-
 net/sctp/associola.c       | 82 ++++++++++++++++++++++++++++++++++++----------
 net/sctp/sm_statefuns.c    |  2 +-
 net/sctp/socket.c          |  6 ++++
 net/sctp/ulpevent.c        |  8 ++---
 5 files changed, 87 insertions(+), 25 deletions(-)
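
For orientation before the discussion below: the reverted commit derived
rwnd directly from receive buffer occupancy ((sk_rcvbuf - rx_count) >> 1
on every update), while this revert restores explicit per-byte accounting
with slop-over handling. The following is a condensed user-space model of
the restored logic, a sketch only: the rwnd_press recovery path is
omitted, and the window-update threshold is assumed from the in-code
comment referencing RFC 1122, Section 4.2.3.3.

#include <stdio.h>

struct assoc {
	unsigned int rwnd;	/* current receive window */
	unsigned int rwnd_over;	/* bytes the window slopped over */
	unsigned int a_rwnd;	/* last rwnd advertised via SACK */
	unsigned int pathmtu;
	unsigned int rcvbuf;
};

/* Charge len bytes of received data against the window. */
static void rwnd_decrease(struct assoc *a, unsigned int len)
{
	if (a->rwnd >= len) {
		a->rwnd -= len;
	} else {
		a->rwnd_over = len - a->rwnd;	/* remember the deficit */
		a->rwnd = 0;
	}
}

/* Return len bytes to the window once the data has been consumed. */
static void rwnd_increase(struct assoc *a, unsigned int len)
{
	unsigned int thresh;

	if (a->rwnd_over >= len) {
		a->rwnd_over -= len;	/* pay back the deficit first */
	} else {
		a->rwnd += len - a->rwnd_over;
		a->rwnd_over = 0;
	}

	/* Window-update SACK once the window has reopened by at least
	 * min(PMTU, rcvbuf / 2), per the comment in the diff below. */
	thresh = a->pathmtu < a->rcvbuf / 2 ? a->pathmtu : a->rcvbuf / 2;
	if (a->rwnd > a->a_rwnd && a->rwnd - a->a_rwnd >= thresh) {
		a->a_rwnd = a->rwnd;
		printf("send window update SACK, a_rwnd=%u\n", a->a_rwnd);
	}
}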

Comments

Vladislav Yasevich April 14, 2014, 7:57 p.m. UTC | #1
On 04/14/2014 03:45 PM, Daniel Borkmann wrote:
> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
> to reflect real state of the receiver's buffer") as it introduced a
> serious performance regression on SCTP over IPv4 and IPv6, though not
> as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
> 
> Current state:
> 
 ...
> 
> With the revert applied, SCTP performance on latest upstream is back to
> normal for both IPv4 and IPv6, matching the throughput of the 3.4.2 test
> kernel; steady and interval reports are smooth again.
> 
> Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
> Reported-by: Peter Butler <pbutler@sonusnet.com>
> Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
> Tested-by: Peter Butler <pbutler@sonusnet.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
> Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
> Cc: Vlad Yasevich <vyasevich@gmail.com>

Acked-by: Vlad Yasevich <vyasevich@gmail.com>

The base approach is sound.  The idea is to calculate rwnd based
on the receiver buffer available.  The algorithm chosen, however,
gives a much higher preference to small data and penalizes large
data transfers.  We need to figure out something else here.

-vlad


David Miller April 14, 2014, 8:48 p.m. UTC | #2
From: Daniel Borkmann <dborkman@redhat.com>
Date: Mon, 14 Apr 2014 21:45:17 +0200

> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
> to reflect real state of the receiver's buffer") as it introduced a
> serious performance regression on SCTP over IPv4 and IPv6, though not
> as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
> 
> Current state:
 ...
> With the revert applied, SCTP performance on latest upstream is back to
> normal for both IPv4 and IPv6, matching the throughput of the 3.4.2 test
> kernel; steady and interval reports are smooth again.
> 
> Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
> Reported-by: Peter Butler <pbutler@sonusnet.com>
> Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
> Tested-by: Peter Butler <pbutler@sonusnet.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>

Applied and queued up for -stable.
Alexander Sverdlin April 15, 2014, 6:43 a.m. UTC | #3
Hello Daniel,

On 14/04/14 21:45, ext Daniel Borkmann wrote:
> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
> to reflect real state of the receiver's buffer") as it introduced a
> serious performance regression on SCTP over IPv4 and IPv6, though not
> as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

Could you please share other HW details? I wonder how much CPU power one needs for such throughput?
Daniel Borkmann April 15, 2014, 7:08 a.m. UTC | #4
Hi Matija,

[cc'ing Peter]

On 04/15/2014 08:43 AM, Alexander Sverdlin wrote:
> On 14/04/14 21:45, ext Daniel Borkmann wrote:
>> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
>> to reflect real state of the receiver's buffer") as it introduced a
>> serious performance regression on SCTP over IPv4 and IPv6, though not
>> as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
>
> Could you please share other HW details? I wonder how much CPU power one needs for such throughput?

If you would like the exact specifics from the bug report, or the numbers
from the commit message, I refer you to Peter's setup [1]: he used ixgbe
NICs, as these are among the few with SCTP checksum offloading available.

Thanks,

Daniel

  [1] http://www.spinics.net/lists/linux-sctp/msg03290.html
Alexander Sverdlin April 15, 2014, 8:46 a.m. UTC | #5
Hi!

On 14/04/14 22:48, ext David Miller wrote:
>> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
>> to reflect real state of the receiver's buffer") as it introduced a
>> serious performance regression on SCTP over IPv4 and IPv6, though not
>> as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
>>
>> Current state:
>  ...
>> With the revert applied, SCTP performance on latest upstream is back to
>> normal for both IPv4 and IPv6, matching the throughput of the 3.4.2 test
>> kernel; steady and interval reports are smooth again.
>>
>> Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
>> Reported-by: Peter Butler <pbutler@sonusnet.com>
>> Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
>> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
>> Tested-by: Peter Butler <pbutler@sonusnet.com>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> 
> Applied and queued up for -stable.

Should this not actually be fixed in the SCTP congestion control part?
RWND calculation is not responsible for congestion control.
And this revert reintroduces a serious bug, which leads to SCTP getting stuck completely in particular
multi-homed use-cases (refer to http://www.spinics.net/lists/linux-sctp/msg02516.html).

We are not arguing against another version of the patch, but:
- you are choosing speed over stability here
- you are masking the problem by reverting code that is not responsible for the problem observed
Daniel Borkmann April 15, 2014, 8:57 a.m. UTC | #6
On 04/15/2014 10:46 AM, Alexander Sverdlin wrote:
...
> Should this not actually be fixed in the SCTP congestion control part?
> RWND calculation is not responsible for congestion control.
> And this revert reintroduces a serious bug, which leads to SCTP getting stuck completely in particular
> multi-homed use-cases (refer to http://www.spinics.net/lists/linux-sctp/msg02516.html).
>
> We are not arguing against another version of the patch, but:
> - you are choosing speed over stability here
> - you are masking the problem by reverting code that is not responsible for the problem observed

So on 10Gbit Ethernet it is reasonable to regress from 2Gbit down to 50Mbit???
Did you actually measure that? I'm not arguing against the original approach
to the fix, but you need to rework it differently for net-next, as Vlad
already stated, that's all.
Butler, Peter April 15, 2014, 2:27 p.m. UTC | #7
I'm not sure if the following processing power is actually required to reproduce the problem, but this is my setup:

- 2 single-board computers (SBCs), each a Xeon C5528 @ 2.13GHz (4-core with HT = 8 cores each)
- 10Gb Intel 82599EB NICs (IXGBE); each SBC has 2 of these but only 1 was used for this SCTP testing (i.e. NOT multi-homed)
- 10Gb backplane, approx. 0.2 ms RTT (however I also tested this with RTT = 10 ms, 20 ms, 50 ms and it is reproducible in all scenarios)
- send buffer size = 1200000, receive buffer size = 3000000
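
Not from Peter's actual test code, but for reference, buffer sizes like the
above would typically be set per socket along these lines (note the kernel
doubles the requested value and caps it at net.core.{w,r}mem_max):

#include <sys/socket.h>

/* Hypothetical helper mirroring the buffer sizes above. */
static int set_test_buffers(int sd)
{
	int snd = 1200000;	/* send buffer size */
	int rcv = 3000000;	/* receive buffer size */

	if (setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &snd, sizeof(snd)) < 0)
		return -1;
	return setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &rcv, sizeof(rcv));
}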




-----Original Message-----
From: linux-sctp-owner@vger.kernel.org [mailto:linux-sctp-owner@vger.kernel.org] On Behalf Of Alexander Sverdlin
Sent: April-15-14 2:44 AM
To: ext Daniel Borkmann; davem@davemloft.net
Cc: netdev@vger.kernel.org; linux-sctp@vger.kernel.org; Matija Glavinic Pecotic; Vlad Yasevich
Subject: Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"

Hello Daniel,

On 14/04/14 21:45, ext Daniel Borkmann wrote:
> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
> to reflect real state of the receiver's buffer") as it introduced a
> serious performance regression on SCTP over IPv4 and IPv6, though not
> as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

Could you please share other HW details? I wonder how much CPU power one needs for such throughput?

--
Best regards,
Alexander Sverdlin.
Matija Glavinic Pecotic April 16, 2014, 6:57 a.m. UTC | #8
Hello Vlad,

On 04/14/2014 09:57 PM, ext Vlad Yasevich wrote:
> The base approach is sound.  The idea is to calculate rwnd based
> on the receiver buffer available.  The algorithm chosen, however,
> gives a much higher preference to small data and penalizes large
> data transfers.  We need to figure out something else here.

I don't follow you here. Could you please explain what you see as the penalty?

Thanks,

Matija
宋冬生 (Dongsheng Song) April 16, 2014, 8:39 a.m. UTC | #9
From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
the penalty is 99%.

http://www.spinics.net/lists/linux-sctp/msg03308.html


On Wed, Apr 16, 2014 at 2:57 PM, Matija Glavinic Pecotic
<matija.glavinic-pecotic.ext@nsn.com> wrote:
>
> Hello Vlad,
>
> On 04/14/2014 09:57 PM, ext Vlad Yasevich wrote:
> > The base approach is sound.  The idea is to calculate rwnd based
> > on the receiver buffer available.  The algorithm chosen, however,
> > gives a much higher preference to small data and penalizes large
> > data transfers.  We need to figure out something else here.
>
> I don't follow you here. Could you please explain what you see as the penalty?
>
> Thanks,
>
> Matija
Alexander Sverdlin April 16, 2014, 9:02 a.m. UTC | #10
Hi Dongsheng!

On 16/04/14 10:39, ext Dongsheng Song wrote:
> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
> the penalty is 99%.

The question was, do you see this as a problem of the new rwnd algorithm?
If yes, how exactly? The algorithm actually has no preference for any
amount of data. It was fine-tuned before to serve as a congestion control
algorithm, but this should be located elsewhere. Perhaps, indeed, reusing
the congestion control modules from TCP would be possible...

Matija Glavinic Pecotic April 16, 2014, 11:55 a.m. UTC | #11
Hello,

On 16.04.2014 11:02, Alexander Sverdlin wrote:
> Hi Dongsheng!
>
> On 16/04/14 10:39, ext Dongsheng Song wrote:
>>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
>>> the penalty is 99%.
>
> The question was, do you see this as a problem of the new rwnd algorithm?
> If yes, how exactly? The algorithm actually has no preference for any
> amount of data. It was fine-tuned before to serve as a congestion control
> algorithm, but this should be located elsewhere. Perhaps, indeed, reusing
> the congestion control modules from TCP would be possible...

It's also worth noting that SCTP specifies RFC 2581 for congestion
control; TCP obsoleted that one in favor of RFC 5681.

@Vlad, after Alexander's comment, it seems you were referring to a
performance penalty. At first, I understood you to mean some penalty in
the rwnd calculation against the buffer/rwnd value/something else.
That's why I asked.

What also might be is that we are hitting SWS (silly window syndrome). I
remember us observing some scenarios in which SWS avoidance is broken;
the new rwnd might have triggered it fully.

In any case, after some thought in the meantime, I'm pretty much sure
that we need to improve congestion control and that the new rwnd
calculation is the correct approach.

Vladislav Yasevich April 16, 2014, 1:32 p.m. UTC | #12
On 04/16/2014 07:55 AM, Matija Glavinic Pecotic wrote:
> Hello,
> 
> On 16.04.2014 11:02, Alexander Sverdlin wrote:
>> The question was, do you see this as a problem of the new rwnd algorithm?
>> If yes, how exactly? The algorithm actually has no preference for any
>> amount of data. It was fine-tuned before to serve as a congestion control
>> algorithm, but this should be located elsewhere. Perhaps, indeed, reusing
>> the congestion control modules from TCP would be possible...
> 
> It's also worth noting that SCTP specifies RFC 2581 for congestion
> control; TCP obsoleted that one in favor of RFC 5681.
> 
> @Vlad, after Alexander's comment, it seems you were referring to a
> performance penalty. At first, I understood you to mean some penalty in
> the rwnd calculation against the buffer/rwnd value/something else.
> That's why I asked.
> 
> What also might be is that we are hitting SWS (silly window syndrome). I
> remember us observing some scenarios in which SWS avoidance is broken;
> the new rwnd might have triggered it fully.
> 
> In any case, after some thought in the meantime, I'm pretty much sure
> that we need to improve congestion control and that the new rwnd
> calculation is the correct approach.

I am not sure where congestion control is broken.  It might be nice to
add a periodic SCTP_STATUS call to netperf/iperf to see what the state
of the congestion window and peer receive window is.

Alternatively, a quick stap script to examine these values could also
be useful.

-vlad
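
For anyone wanting to try that, below is a minimal sketch of such a probe
(not part of the patch): it assumes a one-to-one SCTP socket and the
lksctp-tools <netinet/sctp.h> header, and simply dumps the fields Vlad
mentions via the SCTP_STATUS socket option.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/sctp.h>

static void dump_sctp_status(int sd)
{
	struct sctp_status st;
	socklen_t len = sizeof(st);

	memset(&st, 0, sizeof(st));	/* assoc id 0: the socket's assoc */
	if (getsockopt(sd, IPPROTO_SCTP, SCTP_STATUS, &st, &len) < 0) {
		perror("getsockopt(SCTP_STATUS)");
		return;
	}
	/* sstat_rwnd is the peer's advertised receive window; spinfo_cwnd
	 * is the congestion window of the primary path. */
	printf("peer rwnd=%u cwnd=%u unacked=%u srtt=%ums\n",
	       st.sstat_rwnd, st.sstat_primary.spinfo_cwnd,
	       (unsigned)st.sstat_unackdata, st.sstat_primary.spinfo_srtt);
}

Called once per reporting interval from the test loop, this would show
whether the sender is being throttled by the peer's rwnd or by its own
cwnd.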

Vladislav Yasevich April 16, 2014, 6:36 p.m. UTC | #13

Patch

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 6ee76c8..d992ca3 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1653,6 +1653,17 @@  struct sctp_association {
 	/* This is the last advertised value of rwnd over a SACK chunk. */
 	__u32 a_rwnd;
 
+	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
+	 * to slop over a maximum of the association's frag_point.
+	 */
+	__u32 rwnd_over;
+
+	/* Keeps track of rwnd pressure.  This happens when we have
+	 * a window, but no receive buffer (i.e., small packets).  This one
+	 * is released slowly (1 PMTU at a time).
+	 */
+	__u32 rwnd_press;
+
 	/* This is the sndbuf size in use for the association.
 	 * This corresponds to the sndbuf size for the association,
 	 * as specified in the sk->sndbuf.
@@ -1881,7 +1892,8 @@  void sctp_assoc_update(struct sctp_association *old,
 __u32 sctp_association_get_next_tsn(struct sctp_association *);
 
 void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
-void sctp_assoc_rwnd_update(struct sctp_association *, bool);
+void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
+void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
 void sctp_assoc_set_primary(struct sctp_association *,
 			    struct sctp_transport *);
 void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 4f6d6f9..39579c3 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1395,35 +1395,44 @@  static inline bool sctp_peer_needs_update(struct sctp_association *asoc)
 	return false;
 }
 
-/* Update asoc's rwnd for the approximated state in the buffer,
- * and check whether SACK needs to be sent.
- */
-void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
+/* Increase asoc's rwnd by len and send any window update SACK if needed. */
+void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
 {
-	int rx_count;
 	struct sctp_chunk *sack;
 	struct timer_list *timer;
 
-	if (asoc->ep->rcvbuf_policy)
-		rx_count = atomic_read(&asoc->rmem_alloc);
-	else
-		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
+	if (asoc->rwnd_over) {
+		if (asoc->rwnd_over >= len) {
+			asoc->rwnd_over -= len;
+		} else {
+			asoc->rwnd += (len - asoc->rwnd_over);
+			asoc->rwnd_over = 0;
+		}
+	} else {
+		asoc->rwnd += len;
+	}
 
-	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
-		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
-	else
-		asoc->rwnd = 0;
+	/* If we had window pressure, start recovering it
+	 * once our rwnd had reached the accumulated pressure
+	 * threshold.  The idea is to recover slowly, but up
+	 * to the initial advertised window.
+	 */
+	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
+		int change = min(asoc->pathmtu, asoc->rwnd_press);
+		asoc->rwnd += change;
+		asoc->rwnd_press -= change;
+	}
 
-	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
-		 __func__, asoc, asoc->rwnd, rx_count,
-		 asoc->base.sk->sk_rcvbuf);
+	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
+		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
+		 asoc->a_rwnd);
 
 	/* Send a window update SACK if the rwnd has increased by at least the
 	 * minimum of the association's PMTU and half of the receive buffer.
 	 * The algorithm used is similar to the one described in
 	 * Section 4.2.3.3 of RFC 1122.
 	 */
-	if (update_peer && sctp_peer_needs_update(asoc)) {
+	if (sctp_peer_needs_update(asoc)) {
 		asoc->a_rwnd = asoc->rwnd;
 
 		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
@@ -1445,6 +1454,45 @@  void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
 	}
 }
 
+/* Decrease asoc's rwnd by len. */
+void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
+{
+	int rx_count;
+	int over = 0;
+
+	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
+		pr_debug("%s: association:%p has asoc->rwnd:%u, "
+			 "asoc->rwnd_over:%u!\n", __func__, asoc,
+			 asoc->rwnd, asoc->rwnd_over);
+
+	if (asoc->ep->rcvbuf_policy)
+		rx_count = atomic_read(&asoc->rmem_alloc);
+	else
+		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
+
+	/* If we've reached or overflowed our receive buffer, announce
+	 * a 0 rwnd if rwnd would still be positive.  Store the
+	 * potential pressure overflow so that the window can be restored
+	 * back to its original value.
+	 */
+	if (rx_count >= asoc->base.sk->sk_rcvbuf)
+		over = 1;
+
+	if (asoc->rwnd >= len) {
+		asoc->rwnd -= len;
+		if (over) {
+			asoc->rwnd_press += asoc->rwnd;
+			asoc->rwnd = 0;
+		}
+	} else {
+		asoc->rwnd_over = len - asoc->rwnd;
+		asoc->rwnd = 0;
+	}
+
+	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
+		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
+		 asoc->rwnd_press);
+}
 
 /* Build the bind address list for the association based on info from the
  * local endpoint and the remote peer.
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 01e0024..ae9fbeb 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -6178,7 +6178,7 @@  static int sctp_eat_data(const struct sctp_association *asoc,
 	 * PMTU.  In cases, such as loopback, this might be a rather
 	 * large spill over.
 	 */
-	if ((!chunk->data_accepted) && (!asoc->rwnd ||
+	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
 	    (datalen > asoc->rwnd + asoc->frag_point))) {
 
 		/* If this is the next TSN, consider reneging to make
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index e13519e..ff20e2d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -2115,6 +2115,12 @@  static int sctp_recvmsg(struct kiocb *iocb, struct sock *sk,
 		sctp_skb_pull(skb, copied);
 		skb_queue_head(&sk->sk_receive_queue, skb);
 
+		/* When only partial message is copied to the user, increase
+		 * rwnd by that amount. If all the data in the skb is read,
+		 * rwnd is updated when the event is freed.
+		 */
+		if (!sctp_ulpevent_is_notification(event))
+			sctp_assoc_rwnd_increase(event->asoc, copied);
 		goto out;
 	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
 		   (event->msg_flags & MSG_EOR))
diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c
index 8d198ae..85c6465 100644
--- a/net/sctp/ulpevent.c
+++ b/net/sctp/ulpevent.c
@@ -989,7 +989,7 @@  static void sctp_ulpevent_receive_data(struct sctp_ulpevent *event,
 	skb = sctp_event2skb(event);
 	/* Set the owner and charge rwnd for bytes received.  */
 	sctp_ulpevent_set_owner(event, asoc);
-	sctp_assoc_rwnd_update(asoc, false);
+	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
 
 	if (!skb->data_len)
 		return;
@@ -1011,7 +1011,6 @@  static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
 {
 	struct sk_buff *skb, *frag;
 	unsigned int	len;
-	struct sctp_association *asoc;
 
 	/* Current stack structures assume that the rcv buffer is
 	 * per socket.   For UDP style sockets this is not true as
@@ -1036,11 +1035,8 @@  static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
 	}
 
 done:
-	asoc = event->asoc;
-	sctp_association_hold(asoc);
+	sctp_assoc_rwnd_increase(event->asoc, len);
 	sctp_ulpevent_release_owner(event);
-	sctp_assoc_rwnd_update(asoc, true);
-	sctp_association_put(asoc);
 }
 
 static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)