
[net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"

Message ID 534ED0FD.4040709@gmail.com
State RFC, archived
Delegated to: David Miller

Commit Message

Vladislav Yasevich April 16, 2014, 6:50 p.m. UTC
On 04/16/2014 05:02 AM, Alexander Sverdlin wrote:
> Hi Dongsheng!
>
> On 16/04/14 10:39, ext Dongsheng Song wrote:
>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s,
>> the penalty is 99 %.
>
> The question was, do you see this as a problem of the new rwnd algorithm?
> If yes, how exactly?

The algorithm isn't wrong, but the implementation appears to have
a bug with window update SACKs.  The problem is that
sk->sk_rmem_alloc is updated by the skb destructor when
skb is freed.  This happens after we call sctp_assoc_rwnd_update()
which tries to send the update SACK.  As a result, in default
config with per-socket accounting, the test
    if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
uses the wrong values for rx_count and results in advertisement
of decreased rwnd instead of what is really available.

Can you try this patch without the revert applied?

Thanks
-vlad

 	 * per socket.   For UDP style sockets this is not true as
@@ -1036,11 +1035,7 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
 	}

 done:
-	asoc = event->asoc;
-	sctp_association_hold(asoc);
 	sctp_ulpevent_release_owner(event);
-	sctp_assoc_rwnd_update(asoc, true);
-	sctp_association_put(asoc);
 }

 static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)
@@ -1071,12 +1066,21 @@ done:
  */
 void sctp_ulpevent_free(struct sctp_ulpevent *event)
 {
 +	struct sctp_association *asoc = event->asoc;
+
 	if (sctp_ulpevent_is_notification(event))
 		sctp_ulpevent_release_owner(event);
 	else
 		sctp_ulpevent_release_data(event);

 	kfree_skb(sctp_event2skb(event));
 +	/* The socket is locked and the association can't go anywhere
 +	 * since we are walking the ulpqueue.  No need to hold
+	 * another ref on the association.  Now that the skb has been
+	 * freed and accounted for everywhere, see if we need to send
+	 * a window update SACK.
+	 */
+	sctp_assoc_rwnd_update(asoc, true);
 }

 /* Purge the skb lists holding ulpevents. */


> The algorithm actually has no preference to any amount of data.
> It was fine-tuned before to serve as congestion control algorithm, but
this should
> be located elsewhere. Perhaps, indeed, a re-use of congestion control
modules from
> TCP would be possible...
>
>> http://www.spinics.net/lists/linux-sctp/msg03308.html
>>
>>
>> On Wed, Apr 16, 2014 at 2:57 PM, Matija Glavinic Pecotic
>> <matija.glavinic-pecotic.ext@nsn.com> wrote:
>>>
>>> Hello Vlad,
>>>
>>> On 04/14/2014 09:57 PM, ext Vlad Yasevich wrote:
>>>> The base approach is sound.  The idea is to calculate rwnd based
>>>> on the available receiver buffer.  The algorithm chosen, however,
>>>> gives a much higher preference to small data and penalizes large
>>>> data transfers.  We need to figure out something else here.
>>>
>>> I don't follow you here. Could you please explain what you see as
>>> the penalty?
>>>
>>> Thanks,
>>>
>>> Matija
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Daniel Borkmann April 16, 2014, 7:05 p.m. UTC | #1
On 04/16/2014 08:50 PM, Vlad Yasevich wrote:
> On 04/16/2014 05:02 AM, Alexander Sverdlin wrote:
>> Hi Dongsheng!
>>
>> On 16/04/14 10:39, ext Dongsheng Song wrote:
>>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s,
>>> the penalty is 99 %.
>>
>> The question was, do you see this as a problem of the new rwnd algorithm?
>> If yes, how exactly?

[ Default config ./test_timetolive from the lksctp-test suite
   apparently triggered this as well, i.e. the app never woke
   up from the 3 sec timeout. ]

> The algorithm isn't wrong, but the implementation appears to have
> a bug with window update SACKs.  The problem is that
> sk->sk_rmem_alloc is updated by the skb destructor when
> skb is freed.  This happens after we call sctp_assoc_rwnd_update()
> which tries to send the update SACK.  As a result, in default
> config with per-socket accounting, the test
>      if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> uses the wrong values for rx_count and results in advertisement
> of decreased rwnd instead of what is really available.
Matija Glavinic Pecotic April 16, 2014, 7:24 p.m. UTC | #2
On 04/16/2014 09:05 PM, ext Daniel Borkmann wrote:
> On 04/16/2014 08:50 PM, Vlad Yasevich wrote:
>> On 04/16/2014 05:02 AM, Alexander Sverdlin wrote:
>>> Hi Dongsheng!
>>>
>>> On 16/04/14 10:39, ext Dongsheng Song wrote:
>>>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s,
>>>> the penalty is 99 %.
>>>
>>> The question was, do you see this as a problem of the new rwnd algorithm?
>>> If yes, how exactly?
> 
> [ Default config ./test_timetolive from lksctp-test suite triggered
>   that as well actually it appears, i.e. showing that the app never
>   woke up from the 3 sec timeout. ]

We had a different case there. The test wasn't hanging due to decreased performance, but due to the fact that, with the patch, the sender created a very large message, as opposed to the situation before the patch, where the test message was much smaller.

http://www.spinics.net/lists/linux-sctp/msg03185.html

>> The algorithm isn't wrong, but the implementation appears to have
>> a bug with window update SACKs.  The problem is that
>> sk->sk_rmem_alloc is updated by the skb destructor when
>> skb is freed.  This happens after we call sctp_assoc_rwnd_update()
>> which tries to send the update SACK.  As a result, in default
>> config with per-socket accounting, the test
>>      if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> uses the wrong values for rx_count and results in advertisement
>> of decreased rwnd instead of what is really available.
Vladislav Yasevich April 16, 2014, 7:47 p.m. UTC | #3
On 04/16/2014 03:24 PM, Matija Glavinic Pecotic wrote:
> On 04/16/2014 09:05 PM, ext Daniel Borkmann wrote:
>> On 04/16/2014 08:50 PM, Vlad Yasevich wrote:
>>> On 04/16/2014 05:02 AM, Alexander Sverdlin wrote:
>>>> Hi Dongsheng!
>>>>
>>>> On 16/04/14 10:39, ext Dongsheng Song wrote:
>>>>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s,
>>>>> the penalty is 99 %.
>>>>
>>>> The question was, do you see this as a problem of the new rwnd algorithm?
>>>> If yes, how exactly?
>>
>> [ Default config ./test_timetolive from lksctp-test suite triggered
>>   that as well actually it appears, i.e. showing that the app never
>>   woke up from the 3 sec timeout. ]
> 
> We had a different case there. Test wasnt hanging due to decreased performance, but due to fact that with the patch sender created very large message, as opposed to situation before the patch where test message was of much smaller size.
> 
> http://www.spinics.net/lists/linux-sctp/msg03185.html

The problem with the test is that it tries to completely fill the
receive window by using a single SCTP message.  This all goes well
and the test expects a 0-rwnd to be advertised.

The test then consumes said message.  At this point, the test expects
the window to be opened and subsequent messages be sent or timed-out.
This doesn't happen, because the window update is not sent.  So
the sender thinks that the window is closed, which it technically is,
since we never actually update asoc->rwnd.  But the receive buffer
is empty since we drained the data.
We have a stuck association.

Hard to do when traffic is always flowing one way or the other, but
in a test, it's easy.

-vlad

> 
>>> The algorithm isn't wrong, but the implementation appears to have
>>> a bug with window update SACKs.  The problem is that
>>> sk->sk_rmem_alloc is updated by the skb destructor when
>>> skb is freed.  This happens after we call sctp_assoc_rwnd_update()
>>> which tries to send the update SACK.  As a result, in default
>>> config with per-socket accounting, the test
>>>      if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>>> uses the wrong values for rx_count and results in advertisement
>>> of decreased rwnd instead of what is really available.

Matija Glavinic Pecotic April 21, 2014, 7:12 p.m. UTC | #4
On 04/16/2014 09:47 PM, ext Vlad Yasevich wrote:
> On 04/16/2014 03:24 PM, Matija Glavinic Pecotic wrote:
>> On 04/16/2014 09:05 PM, ext Daniel Borkmann wrote:
>>> On 04/16/2014 08:50 PM, Vlad Yasevich wrote:
>>>> On 04/16/2014 05:02 AM, Alexander Sverdlin wrote:
>>>>> Hi Dongsheng!
>>>>>
>>>>> On 16/04/14 10:39, ext Dongsheng Song wrote:
>>>>>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s,
>>>>>> the penalty is 99 %.
>>>>>
>>>>> The question was, do you see this as a problem of the new rwnd algorithm?
>>>>> If yes, how exactly?
>>>
>>> [ Default config ./test_timetolive from lksctp-test suite triggered
>>>   that as well actually it appears, i.e. showing that the app never
>>>   woke up from the 3 sec timeout. ]
>>
>> We had a different case there. Test wasnt hanging due to decreased performance, but due to fact that with the patch sender created very large message, as opposed to situation before the patch where test message was of much smaller size.
>>
>> http://www.spinics.net/lists/linux-sctp/msg03185.html
> 
> The problem with the test is that it tries to completely fill the
> receive window by using a single SCTP message.  This all goes well
> and the test expects a 0-rwnd to be advertised.
> 
> The test then consumes said message.  At this point, the test expects
> the window to be opened and subsequent messages be sent or timed-out.
> This doesn't happen, because the window update is not sent.  So
> the sender thinks that the window is closed which it technically is
> since we never actually update asoc->rwnd.  But the receive buffer
> is empty since we drained the data.
> We have a stuck association.
> 
> Hard to do when traffic is always flowing one way or the other, but
> in a test, it's easy.

I'm not sure we hit exactly this scenario in this test case.

The problem with this TC is that it relied on a_rwnd staying at the initial value once SO_RCVBUF was set on the socket and later changed (the same thing that was discussed when this TC was fixed a few months ago -> http://www.spinics.net/lists/linux-sctp/msg03185.html).

What happened once rwnd became "honest" is that the TC did not use a small value to create the fill message (fillmsg = malloc(gstatus.sstat_rwnd+RWND_SLOP);) -> (SMALL_RCVBUF+RWND_SLOP); due to the new behavior, it used the later-advertised value, or what is referred to in the TC as the original value. This value is even bigger than the REALLY_BIG value in the TC, and in my case it is 164k:

> Sending the message of size 163837...

With these parameters, we will deplete the receiver in just two SCTP packets on lo (~65k of data plus a really big overhead due to MAXSEG being set to 100). We can confirm this by looking at the assocs' state at the time the TC hangs:

glavinic@slon:~$ cat /proc/net/sctp/assocs 
 ASSOC     SOCK   STY SST ST HBKT ASSOC-ID TX_QUEUE RX_QUEUE UID INODE LPORT RPORT LADDRS <-> RADDRS HBINT INS OUTS MAXRT T1X T2X RTXC wmema wmemq sndbuf rcvbuf
f2c0d800 f599c780 0   7   3  8332   13   753877        0    1000 27635 1024   1025  127.0.0.1 <-> *127.0.0.1 	    7500    10    10   10    0    0    15604  1428953  1153600   163840   163840
f2c09800 f599cb40 0   10  3  9101   14        0   327916    1000 27636 1025   1024  127.0.0.1 <-> *127.0.0.1 	    7500    10    10   10    0    0        0        1        0   163840   327680
glavinic@slon:~$ 

What happens is that the TC hangs on the send. Since the TC hung there, we never get to the point where we start reading, and we have locked ourselves up forever.

On the other side, I see a possible pitfall with this late rwnd update, especially for large messages accompanied by large MTUs and when we come close to closing the rwnd, so I'm looking forward to the retest.

Regards,

Matija
 
> -vlad
> 
>>
>>>> The algorithm isn't wrong, but the implementation appears to have
>>>> a bug with window update SACKs.  The problem is that
>>>> sk->sk_rmem_alloc is updated by the skb destructor when
>>>> skb is freed.  This happens after we call sctp_assoc_rwnd_update()
>>>> which tries to send the update SACK.  As a result, in default
>>>> config with per-socket accounting, the test
>>>>      if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>>>> uses the wrong values for rx_count and results in advertisement
>>>> of decreased rwnd instead of what is really available.
> 

Patch

diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c
index 8d198ae..cc2d440 100644
--- a/net/sctp/ulpevent.c
+++ b/net/sctp/ulpevent.c
@@ -1011,7 +1011,6 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
 {
 	struct sk_buff *skb, *frag;
 	unsigned int	len;
-	struct sctp_association *asoc;

 	/* Current stack structures assume that the rcv buffer is