diff mbox

[net-next,2/2] sctp: add support for MSG_MORE

Message ID d6f2835b0bd655a563896e86648cbd9b4e01374e.1487440272.git.lucien.xin@gmail.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Xin Long Feb. 18, 2017, 5:52 p.m. UTC
This patch is to add support for MSG_MORE on sctp.

It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after
creating datamsg according to the send flag. sctp_packet_can_append_data
then uses it to decide if the chunks of this msg will be sent at once or
delay it.

Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of
in assoc. As sctp enqueues the chunks first, then dequeue them one by
one. If it's saved in assoc,the current msg's send flag (MSG_MORE) may
affect other chunks' bundling.

Since last patch, sctp flush out queue once assoc state falls into
SHUTDOWN_PENDING, the close block problem mentioned in [1] has been
solved as well.

[1] https://patchwork.ozlabs.org/patch/372404/

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/sctp/structs.h | 1 +
 net/sctp/output.c          | 9 +++------
 net/sctp/socket.c          | 1 +
 3 files changed, 5 insertions(+), 6 deletions(-)

Comments

David Laight Feb. 21, 2017, 2:27 p.m. UTC | #1
From: Xin Long 
> Sent: 18 February 2017 17:53
> This patch is to add support for MSG_MORE on sctp.
> 
> It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after
> creating datamsg according to the send flag. sctp_packet_can_append_data
> then uses it to decide if the chunks of this msg will be sent at once or
> delay it.
> 
> Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of
> in assoc. As sctp enqueues the chunks first, then dequeue them one by
> one. If it's saved in assoc,the current msg's send flag (MSG_MORE) may
> affect other chunks' bundling.

I thought about that and decided that the MSG_MORE flag on the last data
chunk was the only one that mattered.
Indeed looking at any others is broken.

Consider what happens if you have two small chunks queued, the first
with MSG_MORE set, the second with it clear.

I think that sctp_outq_flush() will look at the first chunk and decide it
doesn't need to do anything because sctp_packet_transmit_chunk()
returns SCTP_XMIT_DELAY.
The data chunk with MSG_MORE clear won't even be looked at.
So the data will never be sent.

I wouldn't worry about having messages queued that have MSG_MORE clean
when the final message has it set.
While it might be 'nice' to send the data (would have to be tx credit)
waiting for the next data chunk shouldn't be a problem.

I'm not sure I even want to test the current patch!

	David
Xin Long Feb. 23, 2017, 3:45 a.m. UTC | #2
On Tue, Feb 21, 2017 at 10:27 PM, David Laight <David.Laight@aculab.com> wrote:
> From: Xin Long
>> Sent: 18 February 2017 17:53
>> This patch is to add support for MSG_MORE on sctp.
>>
>> It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after
>> creating datamsg according to the send flag. sctp_packet_can_append_data
>> then uses it to decide if the chunks of this msg will be sent at once or
>> delay it.
>>
>> Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of
>> in assoc. As sctp enqueues the chunks first, then dequeue them one by
>> one. If it's saved in assoc,the current msg's send flag (MSG_MORE) may
>> affect other chunks' bundling.
>
> I thought about that and decided that the MSG_MORE flag on the last data
> chunk was the only one that mattered.
> Indeed looking at any others is broken.
>
> Consider what happens if you have two small chunks queued, the first
> with MSG_MORE set, the second with it clear.
>
> I think that sctp_outq_flush() will look at the first chunk and decide it
> doesn't need to do anything because sctp_packet_transmit_chunk()
> returns SCTP_XMIT_DELAY.
> The data chunk with MSG_MORE clear won't even be looked at.
> So the data will never be sent.
It's not that bad as you thought, in sctp_packet_can_append_data():
when inflight == 0 || sctp_sk(asoc->base.sk)->nodelay, the chunks
would be still sent out.

What MSG_MORE flag actually does is ignore inflight == 0 and
sctp_sk(asoc->base.sk)->nodelay to delay the chunks, but still
it has to respect the original logic (like !chunk->msg->can_delay
|| !sctp_packet_empty(packet) || ...)

To delay the chunks with MSG_MORE set even when inflight is 0
it especially important here for users.

>
> I wouldn't worry about having messages queued that have MSG_MORE clean
> when the final message has it set.
Yeah, It's an old optimization for bundling. MSG_MORE should NOT
break that.

> While it might be 'nice' to send the data (would have to be tx credit)
> waiting for the next data chunk shouldn't be a problem.
sorry, you mean it shouldn't send the data if it's waiting for the
next data whenever ?

>
> I'm not sure I even want to test the current patch!
>
>         David
>
David Laight Feb. 23, 2017, 4:04 p.m. UTC | #3
From: Xin Long

> Sent: 23 February 2017 03:46

> On Tue, Feb 21, 2017 at 10:27 PM, David Laight <David.Laight@aculab.com> wrote:

> > From: Xin Long

> >> Sent: 18 February 2017 17:53

> >> This patch is to add support for MSG_MORE on sctp.

> >>

> >> It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after

> >> creating datamsg according to the send flag. sctp_packet_can_append_data

> >> then uses it to decide if the chunks of this msg will be sent at once or

> >> delay it.

> >>

> >> Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of

> >> in assoc. As sctp enqueues the chunks first, then dequeue them one by

> >> one. If it's saved in assoc,the current msg's send flag (MSG_MORE) may

> >> affect other chunks' bundling.

> >

> > I thought about that and decided that the MSG_MORE flag on the last data

> > chunk was the only one that mattered.

> > Indeed looking at any others is broken.

> >

> > Consider what happens if you have two small chunks queued, the first

> > with MSG_MORE set, the second with it clear.

> >

> > I think that sctp_outq_flush() will look at the first chunk and decide it

> > doesn't need to do anything because sctp_packet_transmit_chunk()

> > returns SCTP_XMIT_DELAY.

> > The data chunk with MSG_MORE clear won't even be looked at.

> > So the data will never be sent.


> It's not that bad as you thought, in sctp_packet_can_append_data():

> when inflight == 0 || sctp_sk(asoc->base.sk)->nodelay, the chunks

> would be still sent out.


One of us isn't understanding the other :-)

IIRC sctp_packet_can_append_data() is called for the first queued
data chunk in order to decide whether to generate a message that
consists only of data chunks.
If it returns SCTP_XMIT_OK then a message is built collecting the
rest of the queued data chunks (until the window fills).

So if I send a message with MSG_MORE set (on an idle connection)
SCTP_XMIT_DELAY is returned and a message isn't sent.

I now send a second small message, this time with MSG_MORE clear.
The message is queued, then the code looks to see if it can send anything.

sctp_packet_can_append_data() is called for the first queued chunk.
Since it has force_delay set SCTP_XMIT_DELAY is returned and no
message is built.
The second message isn't even looked at.

> What MSG_MORE flag actually does is ignore inflight == 0 and

> sctp_sk(asoc->base.sk)->nodelay to delay the chunks, but still

> it has to respect the original logic (like !chunk->msg->can_delay

> || !sctp_packet_empty(packet) || ...)

> 

> To delay the chunks with MSG_MORE set even when inflight is 0

> it especially important here for users.


I'm not too worried about that.
Sending the first message was a cheap way to ensure something got
sent if the application lied and didn't send a subsequent message.

The change has hit Linus's tree, I'll should be able to test that
and confirm what I think is going on.

	David
Marcelo Ricardo Leitner Feb. 23, 2017, 5:40 p.m. UTC | #4
On Thu, Feb 23, 2017 at 04:04:10PM +0000, David Laight wrote:
> From: Xin Long
> > Sent: 23 February 2017 03:46
> > On Tue, Feb 21, 2017 at 10:27 PM, David Laight <David.Laight@aculab.com> wrote:
> > > From: Xin Long
> > >> Sent: 18 February 2017 17:53
> > >> This patch is to add support for MSG_MORE on sctp.
> > >>
> > >> It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after
> > >> creating datamsg according to the send flag. sctp_packet_can_append_data
> > >> then uses it to decide if the chunks of this msg will be sent at once or
> > >> delay it.
> > >>
> > >> Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of
> > >> in assoc. As sctp enqueues the chunks first, then dequeue them one by
> > >> one. If it's saved in assoc,the current msg's send flag (MSG_MORE) may
> > >> affect other chunks' bundling.
> > >
> > > I thought about that and decided that the MSG_MORE flag on the last data
> > > chunk was the only one that mattered.
> > > Indeed looking at any others is broken.
> > >
> > > Consider what happens if you have two small chunks queued, the first
> > > with MSG_MORE set, the second with it clear.
> > >
> > > I think that sctp_outq_flush() will look at the first chunk and decide it
> > > doesn't need to do anything because sctp_packet_transmit_chunk()
> > > returns SCTP_XMIT_DELAY.
> > > The data chunk with MSG_MORE clear won't even be looked at.
> > > So the data will never be sent.
> 
> > It's not that bad as you thought, in sctp_packet_can_append_data():
> > when inflight == 0 || sctp_sk(asoc->base.sk)->nodelay, the chunks
> > would be still sent out.
> 
> One of us isn't understanding the other :-)
> 
> IIRC sctp_packet_can_append_data() is called for the first queued
> data chunk in order to decide whether to generate a message that

Perhaps here lies the source of the confusion?
sctp_packet_can_append_data() is called for all queued data chunks, and
not just the first one.

sctp_outq_flush
  (retransmissions here, omitted for simplicity)
  /* Finally, transmit new packets.  */
  while ((chunk = sctp_outq_dequeue_data(q)) != NULL) {
    sctp_packet_transmit_chunk
      sctp_packet_append_chunk
        sctp_packet_can_append_data
        __sctp_packet_append_chunk

So chunks are checked one by one.

> consists only of data chunks.

That's not really its purpose. It's to check if it can append a data
chunk to the packet being prepared, while respecting asoc state, cwnd,
etc.

HTH!

  Marcelo

> If it returns SCTP_XMIT_OK then a message is built collecting the
> rest of the queued data chunks (until the window fills).
> 
> So if I send a message with MSG_MORE set (on an idle connection)
> SCTP_XMIT_DELAY is returned and a message isn't sent.
> 
> I now send a second small message, this time with MSG_MORE clear.
> The message is queued, then the code looks to see if it can send anything.
> 
> sctp_packet_can_append_data() is called for the first queued chunk.
> Since it has force_delay set SCTP_XMIT_DELAY is returned and no
> message is built.
> The second message isn't even looked at.
> 
> > What MSG_MORE flag actually does is ignore inflight == 0 and
> > sctp_sk(asoc->base.sk)->nodelay to delay the chunks, but still
> > it has to respect the original logic (like !chunk->msg->can_delay
> > || !sctp_packet_empty(packet) || ...)
> > 
> > To delay the chunks with MSG_MORE set even when inflight is 0
> > it especially important here for users.
> 
> I'm not too worried about that.
> Sending the first message was a cheap way to ensure something got
> sent if the application lied and didn't send a subsequent message.
> 
> The change has hit Linus's tree, I'll should be able to test that
> and confirm what I think is going on.
> 
> 	David
>
Xin Long Feb. 23, 2017, 6:16 p.m. UTC | #5
On Fri, Feb 24, 2017 at 1:40 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Thu, Feb 23, 2017 at 04:04:10PM +0000, David Laight wrote:
>> From: Xin Long
>> > Sent: 23 February 2017 03:46
>> > On Tue, Feb 21, 2017 at 10:27 PM, David Laight <David.Laight@aculab.com> wrote:
>> > > From: Xin Long
>> > >> Sent: 18 February 2017 17:53
>> > >> This patch is to add support for MSG_MORE on sctp.
>> > >>
>> > >> It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after
>> > >> creating datamsg according to the send flag. sctp_packet_can_append_data
>> > >> then uses it to decide if the chunks of this msg will be sent at once or
>> > >> delay it.
>> > >>
>> > >> Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of
>> > >> in assoc. As sctp enqueues the chunks first, then dequeue them one by
>> > >> one. If it's saved in assoc,the current msg's send flag (MSG_MORE) may
>> > >> affect other chunks' bundling.
>> > >
>> > > I thought about that and decided that the MSG_MORE flag on the last data
>> > > chunk was the only one that mattered.
>> > > Indeed looking at any others is broken.
>> > >
>> > > Consider what happens if you have two small chunks queued, the first
>> > > with MSG_MORE set, the second with it clear.
>> > >
>> > > I think that sctp_outq_flush() will look at the first chunk and decide it
>> > > doesn't need to do anything because sctp_packet_transmit_chunk()
>> > > returns SCTP_XMIT_DELAY.
>> > > The data chunk with MSG_MORE clear won't even be looked at.
>> > > So the data will never be sent.
>>
>> > It's not that bad as you thought, in sctp_packet_can_append_data():
>> > when inflight == 0 || sctp_sk(asoc->base.sk)->nodelay, the chunks
>> > would be still sent out.
>>
>> One of us isn't understanding the other :-)
>>
>> IIRC sctp_packet_can_append_data() is called for the first queued
>> data chunk in order to decide whether to generate a message that
>
> Perhaps here lies the source of the confusion?
> sctp_packet_can_append_data() is called for all queued data chunks, and
> not just the first one.
>
> sctp_outq_flush
>   (retransmissions here, omitted for simplicity)
>   /* Finally, transmit new packets.  */
>   while ((chunk = sctp_outq_dequeue_data(q)) != NULL) {
>     sctp_packet_transmit_chunk
>       sctp_packet_append_chunk
>         sctp_packet_can_append_data
>         __sctp_packet_append_chunk
>
> So chunks are checked one by one.
I think I got David's point.
like, the queue is:

chunk3[null]-->chunk2 [msg_more]-->chunk1 [msg_more]

it dequeue from chunk1, once it returns SCTP_XMIT_DELAY
chunk2, chunk3 will has no chance to dequeue, as it will
goto: sctpflush_out in sctp_outq_flush(), But we are expecting
to send all chunks.

>
>> consists only of data chunks.
>
> That's not really its purpose. It's to check if it can append a data
> chunk to the packet being prepared, while respecting asoc state, cwnd,
> etc.
>
> HTH!
>
>   Marcelo
>
>> If it returns SCTP_XMIT_OK then a message is built collecting the
>> rest of the queued data chunks (until the window fills).
>>
>> So if I send a message with MSG_MORE set (on an idle connection)
>> SCTP_XMIT_DELAY is returned and a message isn't sent.
>>
>> I now send a second small message, this time with MSG_MORE clear.
>> The message is queued, then the code looks to see if it can send anything.
>>
>> sctp_packet_can_append_data() is called for the first queued chunk.
>> Since it has force_delay set SCTP_XMIT_DELAY is returned and no
>> message is built.
>> The second message isn't even looked at.
>>
>> > What MSG_MORE flag actually does is ignore inflight == 0 and
>> > sctp_sk(asoc->base.sk)->nodelay to delay the chunks, but still
>> > it has to respect the original logic (like !chunk->msg->can_delay
>> > || !sctp_packet_empty(packet) || ...)
>> >
>> > To delay the chunks with MSG_MORE set even when inflight is 0
>> > it especially important here for users.
>>
>> I'm not too worried about that.
>> Sending the first message was a cheap way to ensure something got
>> sent if the application lied and didn't send a subsequent message.
>>
>> The change has hit Linus's tree, I'll should be able to test that
>> and confirm what I think is going on.
>>
>>       David
>>
Marcelo Ricardo Leitner Feb. 23, 2017, 6:39 p.m. UTC | #6
On Fri, Feb 24, 2017 at 02:16:02AM +0800, Xin Long wrote:
> On Fri, Feb 24, 2017 at 1:40 AM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > On Thu, Feb 23, 2017 at 04:04:10PM +0000, David Laight wrote:
> >> From: Xin Long
> >> > Sent: 23 February 2017 03:46
> >> > On Tue, Feb 21, 2017 at 10:27 PM, David Laight <David.Laight@aculab.com> wrote:
> >> > > From: Xin Long
> >> > >> Sent: 18 February 2017 17:53
> >> > >> This patch is to add support for MSG_MORE on sctp.
> >> > >>
> >> > >> It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after
> >> > >> creating datamsg according to the send flag. sctp_packet_can_append_data
> >> > >> then uses it to decide if the chunks of this msg will be sent at once or
> >> > >> delay it.
> >> > >>
> >> > >> Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of
> >> > >> in assoc. As sctp enqueues the chunks first, then dequeue them one by
> >> > >> one. If it's saved in assoc,the current msg's send flag (MSG_MORE) may
> >> > >> affect other chunks' bundling.
> >> > >
> >> > > I thought about that and decided that the MSG_MORE flag on the last data
> >> > > chunk was the only one that mattered.
> >> > > Indeed looking at any others is broken.
> >> > >
> >> > > Consider what happens if you have two small chunks queued, the first
> >> > > with MSG_MORE set, the second with it clear.
> >> > >
> >> > > I think that sctp_outq_flush() will look at the first chunk and decide it
> >> > > doesn't need to do anything because sctp_packet_transmit_chunk()
> >> > > returns SCTP_XMIT_DELAY.
> >> > > The data chunk with MSG_MORE clear won't even be looked at.
> >> > > So the data will never be sent.
> >>
> >> > It's not that bad as you thought, in sctp_packet_can_append_data():
> >> > when inflight == 0 || sctp_sk(asoc->base.sk)->nodelay, the chunks
> >> > would be still sent out.
> >>
> >> One of us isn't understanding the other :-)
> >>
> >> IIRC sctp_packet_can_append_data() is called for the first queued
> >> data chunk in order to decide whether to generate a message that
> >
> > Perhaps here lies the source of the confusion?
> > sctp_packet_can_append_data() is called for all queued data chunks, and
> > not just the first one.
> >
> > sctp_outq_flush
> >   (retransmissions here, omitted for simplicity)
> >   /* Finally, transmit new packets.  */
> >   while ((chunk = sctp_outq_dequeue_data(q)) != NULL) {
> >     sctp_packet_transmit_chunk
> >       sctp_packet_append_chunk
> >         sctp_packet_can_append_data
> >         __sctp_packet_append_chunk
> >
> > So chunks are checked one by one.
> I think I got David's point.
> like, the queue is:
> 
> chunk3[null]-->chunk2 [msg_more]-->chunk1 [msg_more]
> 
> it dequeue from chunk1, once it returns SCTP_XMIT_DELAY
> chunk2, chunk3 will has no chance to dequeue, as it will
> goto: sctpflush_out in sctp_outq_flush(), But we are expecting
> to send all chunks.

Ahh yes, exactly.

> 
> >
> >> consists only of data chunks.
> >
> > That's not really its purpose. It's to check if it can append a data
> > chunk to the packet being prepared, while respecting asoc state, cwnd,
> > etc.
> >
> > HTH!
> >
> >   Marcelo
> >
> >> If it returns SCTP_XMIT_OK then a message is built collecting the
> >> rest of the queued data chunks (until the window fills).
> >>
> >> So if I send a message with MSG_MORE set (on an idle connection)
> >> SCTP_XMIT_DELAY is returned and a message isn't sent.
> >>
> >> I now send a second small message, this time with MSG_MORE clear.
> >> The message is queued, then the code looks to see if it can send anything.
> >>
> >> sctp_packet_can_append_data() is called for the first queued chunk.
> >> Since it has force_delay set SCTP_XMIT_DELAY is returned and no
> >> message is built.
> >> The second message isn't even looked at.
> >>
> >> > What MSG_MORE flag actually does is ignore inflight == 0 and
> >> > sctp_sk(asoc->base.sk)->nodelay to delay the chunks, but still
> >> > it has to respect the original logic (like !chunk->msg->can_delay
> >> > || !sctp_packet_empty(packet) || ...)
> >> >
> >> > To delay the chunks with MSG_MORE set even when inflight is 0
> >> > it especially important here for users.
> >>
> >> I'm not too worried about that.
> >> Sending the first message was a cheap way to ensure something got
> >> sent if the application lied and didn't send a subsequent message.
> >>
> >> The change has hit Linus's tree, I'll should be able to test that
> >> and confirm what I think is going on.
> >>
> >>       David
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Xin Long Feb. 24, 2017, 6:43 a.m. UTC | #7
On Fri, Feb 24, 2017 at 12:04 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Xin Long
>> Sent: 23 February 2017 03:46
>> On Tue, Feb 21, 2017 at 10:27 PM, David Laight <David.Laight@aculab.com> wrote:
>> > From: Xin Long
>> >> Sent: 18 February 2017 17:53
>> >> This patch is to add support for MSG_MORE on sctp.
>> >>
>> >> It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after
>> >> creating datamsg according to the send flag. sctp_packet_can_append_data
>> >> then uses it to decide if the chunks of this msg will be sent at once or
>> >> delay it.
>> >>
>> >> Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of
>> >> in assoc. As sctp enqueues the chunks first, then dequeue them one by
>> >> one. If it's saved in assoc,the current msg's send flag (MSG_MORE) may
>> >> affect other chunks' bundling.
>> >
>> > I thought about that and decided that the MSG_MORE flag on the last data
>> > chunk was the only one that mattered.
>> > Indeed looking at any others is broken.
>> >
>> > Consider what happens if you have two small chunks queued, the first
>> > with MSG_MORE set, the second with it clear.
>> >
>> > I think that sctp_outq_flush() will look at the first chunk and decide it
>> > doesn't need to do anything because sctp_packet_transmit_chunk()
>> > returns SCTP_XMIT_DELAY.
>> > The data chunk with MSG_MORE clear won't even be looked at.
>> > So the data will never be sent.
>
>> It's not that bad as you thought, in sctp_packet_can_append_data():
>> when inflight == 0 || sctp_sk(asoc->base.sk)->nodelay, the chunks
>> would be still sent out.
>
> One of us isn't understanding the other :-)
>
> IIRC sctp_packet_can_append_data() is called for the first queued
> data chunk in order to decide whether to generate a message that
> consists only of data chunks.
> If it returns SCTP_XMIT_OK then a message is built collecting the
> rest of the queued data chunks (until the window fills).
>
> So if I send a message with MSG_MORE set (on an idle connection)
> SCTP_XMIT_DELAY is returned and a message isn't sent.
>
> I now send a second small message, this time with MSG_MORE clear.
> The message is queued, then the code looks to see if it can send anything.
>
> sctp_packet_can_append_data() is called for the first queued chunk.
> Since it has force_delay set SCTP_XMIT_DELAY is returned and no
> message is built.
> The second message isn't even looked at.
You're right. I can see the problem now.

What I expected is it should work like:

1, send 3 small chunks with MSG_MORE set, the queue is:
  chk3 [set] -> chk2 [set] -> chk1 [set]
2. send 1 more chunk with MSG_MORE clear, the queue is:
  chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
3. then if user send more small chunks with MSG_MORE set,
the queue is like:
  chkB[set] -> chkA[set] -> chk4[clear] -> chk3 [clear] -> chk2
[clear] -> chk1 [clear]
so that the new small chunks' flag will not affect the other chunks bundling.

is it ok to you to work like that ?
if yes, I propose the fix for this issue is:

@@ -303,6 +303,17 @@ void sctp_outq_tail(struct sctp_outq *q, struct
sctp_chunk *chunk, gfp_t gfp)
                         sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)) :
                         "illegal chunk");

+               if (q->has_delay && !chunk->msg->force_delay) {
+                       struct sctp_chunk *chk;
+
+                       list_for_each_entry(chk, &q->out_chunk_list, list) {
+                               if (!chk->msg->force_delay)
+                                       break;
+                               chk->msg->force_delay = 0;
+                       }
+               }
+               q->has_delay = chunk->msg->force_delay;
+

Thanks.

>
>> What MSG_MORE flag actually does is ignore inflight == 0 and
>> sctp_sk(asoc->base.sk)->nodelay to delay the chunks, but still
>> it has to respect the original logic (like !chunk->msg->can_delay
>> || !sctp_packet_empty(packet) || ...)
>>
>> To delay the chunks with MSG_MORE set even when inflight is 0
>> it especially important here for users.
>
> I'm not too worried about that.
> Sending the first message was a cheap way to ensure something got
> sent if the application lied and didn't send a subsequent message.
>
> The change has hit Linus's tree, I'll should be able to test that
> and confirm what I think is going on.
>
>         David
>
David Laight Feb. 24, 2017, 10:14 a.m. UTC | #8
From: Xin Long

> Sent: 24 February 2017 06:44

...
> > IIRC sctp_packet_can_append_data() is called for the first queued

> > data chunk in order to decide whether to generate a message that

> > consists only of data chunks.

> > If it returns SCTP_XMIT_OK then a message is built collecting the

> > rest of the queued data chunks (until the window fills).

> >

> > So if I send a message with MSG_MORE set (on an idle connection)

> > SCTP_XMIT_DELAY is returned and a message isn't sent.

> >

> > I now send a second small message, this time with MSG_MORE clear.

> > The message is queued, then the code looks to see if it can send anything.

> >

> > sctp_packet_can_append_data() is called for the first queued chunk.

> > Since it has force_delay set SCTP_XMIT_DELAY is returned and no

> > message is built.

> > The second message isn't even looked at.

> You're right. I can see the problem now.

> 

> What I expected is it should work like:

> 

> 1, send 3 small chunks with MSG_MORE set, the queue is:

>   chk3 [set] -> chk2 [set] -> chk1 [set]


Strange way to write a queue! chk1 points to chk2 :-)

> 2. send 1 more chunk with MSG_MORE clear, the queue is:

>   chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]


I don't think processing the entire queue is a good idea.
Both from execution time and the effects on the data cache.
The SCTP code is horrid enough as it is.

> 3. then if user send more small chunks with MSG_MORE set,

> the queue is like:

>   chkB[set] -> chkA[set] -> chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]

> so that the new small chunks' flag will not affect the other chunks bundling.


That isn't really necessary.
The user can't expect to have absolute control over which chunks get bundled
together.
If the above chunks still aren't big enough to fill a frame the code might
as well wait for the next chunk instead of building a packet that contains
chk1 through to chkB.

Remember you'll only get a queued chunk with MSG_MORE clear if data can't be sent.
As soon as data can be sent, if the first chunk has MSG_MORE clear all of the
queued chunks will be sent.

So immediately after your (3) the application is expected to send a chunk
with MSG_MORE clear - at that point all the queued chunks can be sent in
a single packet.

So just save the last MSG_MORE on the association as I did.

	David
Xin Long Feb. 25, 2017, 8:41 a.m. UTC | #9
On Fri, Feb 24, 2017 at 6:14 PM, David Laight <David.Laight@aculab.com> wrote:
>
> From: Xin Long
> > Sent: 24 February 2017 06:44
> ...
> > > IIRC sctp_packet_can_append_data() is called for the first queued
> > > data chunk in order to decide whether to generate a message that
> > > consists only of data chunks.
> > > If it returns SCTP_XMIT_OK then a message is built collecting the
> > > rest of the queued data chunks (until the window fills).
> > >
> > > So if I send a message with MSG_MORE set (on an idle connection)
> > > SCTP_XMIT_DELAY is returned and a message isn't sent.
> > >
> > > I now send a second small message, this time with MSG_MORE clear.
> > > The message is queued, then the code looks to see if it can send anything.
> > >
> > > sctp_packet_can_append_data() is called for the first queued chunk.
> > > Since it has force_delay set SCTP_XMIT_DELAY is returned and no
> > > message is built.
> > > The second message isn't even looked at.
> > You're right. I can see the problem now.
> >
> > What I expected is it should work like:
> >
> > 1, send 3 small chunks with MSG_MORE set, the queue is:
> >   chk3 [set] -> chk2 [set] -> chk1 [set]
>
> Strange way to write a queue! chk1 points to chk2 :-)
haha, just  a model.

>
> > 2. send 1 more chunk with MSG_MORE clear, the queue is:
> >   chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
>
> I don't think processing the entire queue is a good idea.
> Both from execution time and the effects on the data cache.
> The SCTP code is horrid enough as it is.
you check the codes in last email, it's not processing the entire queue.

1). only when queue has delay chunk inside by checking queue->has_delay
    and current chunk has msg_more flag.

2). will break on the first chunk with clear in the queue.

but yes, in 2), extra work has to be done than before, but not much.

>
> > 3. then if user send more small chunks with MSG_MORE set,
> > the queue is like:
> >   chkB[set] -> chkA[set] -> chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
> > so that the new small chunks' flag will not affect the other chunks bundling.
>
> That isn't really necessary.
> The user can't expect to have absolute control over which chunks get bundled
> together.
> If the above chunks still aren't big enough to fill a frame the code might
> as well wait for the next chunk instead of building a packet that contains
> chk1 through to chkB.
>
> Remember you'll only get a queued chunk with MSG_MORE clear if data can't be sent.
> As soon as data can be sent, if the first chunk has MSG_MORE clear all of the
> queued chunks will be sent.
>
> So immediately after your (3) the application is expected to send a chunk
> with MSG_MORE clear - at that point all the queued chunks can be sent in
> a single packet.
understand this.

what I'm worried about is if the msg_more is saved in assoc:
     chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
then when you send a small chkA with MSG_MORE,
the queue will be like:
     chkA [set] -> chk4[set] -> chk3 [set] -> chk2 [set] -> chk1 [set]
because msg_more is saved in assoc, every chunk can look at it.
chk1 - chk4 are big enough to be packed into a packet, they were
not sent last time because a lot of chunks are in the retransmit
queue.

But now even if retransmit queue is null, chk1-chk4 are still blocked.

can you accept that chkA may block the old chunks ?

>
> So just save the last MSG_MORE on the association as I did.
I will think about it, thanks.

>
>         David
Xin Long Feb. 27, 2017, 4:49 a.m. UTC | #10
On Sat, Feb 25, 2017 at 4:41 PM, Xin Long <lucien.xin@gmail.com> wrote:
> On Fri, Feb 24, 2017 at 6:14 PM, David Laight <David.Laight@aculab.com> wrote:
>>
>> From: Xin Long
>> > Sent: 24 February 2017 06:44
>> ...
>> > > IIRC sctp_packet_can_append_data() is called for the first queued
>> > > data chunk in order to decide whether to generate a message that
>> > > consists only of data chunks.
>> > > If it returns SCTP_XMIT_OK then a message is built collecting the
>> > > rest of the queued data chunks (until the window fills).
>> > >
>> > > So if I send a message with MSG_MORE set (on an idle connection)
>> > > SCTP_XMIT_DELAY is returned and a message isn't sent.
>> > >
>> > > I now send a second small message, this time with MSG_MORE clear.
>> > > The message is queued, then the code looks to see if it can send anything.
>> > >
>> > > sctp_packet_can_append_data() is called for the first queued chunk.
>> > > Since it has force_delay set SCTP_XMIT_DELAY is returned and no
>> > > message is built.
>> > > The second message isn't even looked at.
>> > You're right. I can see the problem now.
>> >
>> > What I expected is it should work like:
>> >
>> > 1, send 3 small chunks with MSG_MORE set, the queue is:
>> >   chk3 [set] -> chk2 [set] -> chk1 [set]
>>
>> Strange way to write a queue! chk1 points to chk2 :-)
> haha, just  a model.
>
>>
>> > 2. send 1 more chunk with MSG_MORE clear, the queue is:
>> >   chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
>>
>> I don't think processing the entire queue is a good idea.
>> Both from execution time and the effects on the data cache.
>> The SCTP code is horrid enough as it is.
> you check the codes in last email, it's not processing the entire queue.
>
> 1). only when queue has delay chunk inside by checking queue->has_delay
>     and current chunk has msg_more flag.
>
> 2). will break on the first chunk with clear in the queue.
>
> but yes, in 2), extra work has to be done than before, but not much.
>
>>
>> > 3. then if user send more small chunks with MSG_MORE set,
>> > the queue is like:
>> >   chkB[set] -> chkA[set] -> chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
>> > so that the new small chunks' flag will not affect the other chunks bundling.
>>
>> That isn't really necessary.
>> The user can't expect to have absolute control over which chunks get bundled
>> together.
>> If the above chunks still aren't big enough to fill a frame the code might
>> as well wait for the next chunk instead of building a packet that contains
>> chk1 through to chkB.
>>
>> Remember you'll only get a queued chunk with MSG_MORE clear if data can't be sent.
>> As soon as data can be sent, if the first chunk has MSG_MORE clear all of the
>> queued chunks will be sent.
>>
>> So immediately after your (3) the application is expected to send a chunk
>> with MSG_MORE clear - at that point all the queued chunks can be sent in
>> a single packet.
> understand this.
>
> what I'm worried about is if the msg_more is saved in assoc:
>      chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
> then when you send a small chkA with MSG_MORE,
> the queue will be like:
>      chkA [set] -> chk4[set] -> chk3 [set] -> chk2 [set] -> chk1 [set]
> because msg_more is saved in assoc, every chunk can look at it.
> chk1 - chk4 are big enough to be packed into a packet, they were
> not sent last time because a lot of chunks are in the retransmit
> queue.
>
> But now even if retransmit queue is null, chk1-chk4 are still blocked.
>
> can you accept that chkA may block the old chunks ?
even also block the retransmit chunks.

>
>>
>> So just save the last MSG_MORE on the association as I did.
> I will think about it, thanks.
>
>>
>>         David
David Laight Feb. 27, 2017, 10:48 a.m. UTC | #11
From: Xin Long

> Sent: 27 February 2017 04:49

...
> > what I'm worried about is if the msg_more is saved in assoc:

> >      chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]

> > then when you send a small chkA with MSG_MORE,

> > the queue will be like:

> >      chkA [set] -> chk4[set] -> chk3 [set] -> chk2 [set] -> chk1 [set]

> > because msg_more is saved in assoc, every chunk can look at it.

> > chk1 - chk4 are big enough to be packed into a packet, they were

> > not sent last time because a lot of chunks are in the retransmit

> > queue.


Or just waiting for transmit window space.

> > But now even if retransmit queue is null, chk1-chk4 are still blocked.

> >

> > can you accept that chkA may block the old chunks ?

> even also block the retransmit chunks.


I don't see a problem, the application is about to send another chunk.
It is likely that the whole lot will go in one packet.

If the connection is flow controlled off (even if due to packet loss)
the minor delay waiting for the application won't make any real difference.
Indeed reducing the number of ethernet frames may help increase
overall throughput.

Does a request for chunks to be retransmitted force a response packet be built?
If so it will pick up all the queued data chunks regardless of any saved
MSG_MORE bits.

	David
Marcelo Ricardo Leitner March 21, 2017, 10:04 p.m. UTC | #12
Hi,

On Fri, Feb 24, 2017 at 10:14:09AM +0000, David Laight wrote:
> From: Xin Long
> > Sent: 24 February 2017 06:44
> ...
> > > IIRC sctp_packet_can_append_data() is called for the first queued
> > > data chunk in order to decide whether to generate a message that
> > > consists only of data chunks.
> > > If it returns SCTP_XMIT_OK then a message is built collecting the
> > > rest of the queued data chunks (until the window fills).
> > >
> > > So if I send a message with MSG_MORE set (on an idle connection)
> > > SCTP_XMIT_DELAY is returned and a message isn't sent.
> > >
> > > I now send a second small message, this time with MSG_MORE clear.
> > > The message is queued, then the code looks to see if it can send anything.
> > >
> > > sctp_packet_can_append_data() is called for the first queued chunk.
> > > Since it has force_delay set SCTP_XMIT_DELAY is returned and no
> > > message is built.
> > > The second message isn't even looked at.
> > You're right. I can see the problem now.
> > 
> > What I expected is it should work like:
> > 
> > 1, send 3 small chunks with MSG_MORE set, the queue is:
> >   chk3 [set] -> chk2 [set] -> chk1 [set]
> 
> Strange way to write a queue! chk1 points to chk2 :-)

Strictly speaking, it's actually both together, as it's a double-linked
list. :-)

> 
> > 2. send 1 more chunk with MSG_MORE clear, the queue is:
> >   chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
> 
> I don't think processing the entire queue is a good idea.
> Both from execution time and the effects on the data cache.

It won't be processing the entire queue if not needed, and it will only
process it on the last sendmsg() call. As the list is double-linked, it
can walk backwards as necessary and stop just at the right point.  So
this doesn't imply on any quadratic or exponential factor here, but
linear and only if/when finishing the MSG_MORE block.

If the application is not using MSG_MORE, impact is zero.

> The SCTP code is horrid enough as it is.
> 
> > 3. then if user send more small chunks with MSG_MORE set,
> > the queue is like:
> >   chkB[set] -> chkA[set] -> chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
> > so that the new small chunks' flag will not affect the other chunks bundling.
> 
> That isn't really necessary.
> The user can't expect to have absolute control over which chunks get bundled
> together.

So...?
I mean, I'm okay with that but that doesn't explain why we can't do as
Xin proposed on previous email here.

> If the above chunks still aren't big enough to fill a frame the code might
> as well wait for the next chunk instead of building a packet that contains
> chk1 through to chkB.

Our expectations are the same and that's what the proposed solution also
achieves, no?

> 
> Remember you'll only get a queued chunk with MSG_MORE clear if data can't be sent.
> As soon as data can be sent, if the first chunk has MSG_MORE clear all of the
> queued chunks will be sent.

With the fix proposed by Xin, this would be more like: ... all of the
_non-held_ chunks will be sent.
After all, application asked to hold them, for whatever reason it had.

> 
> So immediately after your (3) the application is expected to send a chunk
> with MSG_MORE clear - at that point all the queued chunks can be sent in
> a single packet.

Yes. Isn't that the idea?

> 
> So just save the last MSG_MORE on the association as I did.

I don't see the reason to change that. Your reply seem to reject the
idea but I cannot get the reason why. The solution proposed is more
complex, yes, allows more control, yes, but those aren't real issues
here.

  Marcelo
David Laight March 22, 2017, 2:07 p.m. UTC | #13
From: Marcelo Ricardo Leitner [mailto:marcelo.leitner@gmail.com]
> Sent: 21 March 2017 22:04
> Hi,
...
> > > 2. send 1 more chunk with MSG_MORE clear, the queue is:
> > >   chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
> >
> > I don't think processing the entire queue is a good idea.
> > Both from execution time and the effects on the data cache.
> 
> It won't be processing the entire queue if not needed, and it will only
> process it on the last sendmsg() call. As the list is double-linked, it
> can walk backwards as necessary and stop just at the right point.  So
> this doesn't imply on any quadratic or exponential factor here, but
> linear and only if/when finishing the MSG_MORE block.
> 
> If the application is not using MSG_MORE, impact is zero.
> 
> > The SCTP code is horrid enough as it is.
> >
> > > 3. then if user send more small chunks with MSG_MORE set,
> > > the queue is like:
> > >   chkB[set] -> chkA[set] -> chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
> > > so that the new small chunks' flag will not affect the other chunks bundling.
> >
> > That isn't really necessary.
> > The user can't expect to have absolute control over which chunks get bundled
> > together.
> 
> So...?
> I mean, I'm okay with that but that doesn't explain why we can't do as
> Xin proposed on previous email here.
> 
> > If the above chunks still aren't big enough to fill a frame the code might
> > as well wait for the next chunk instead of building a packet that contains
> > chk1 through to chkB.
> 
> Our expectations are the same and that's what the proposed solution also
> achieves, no?

Not really.

> > Remember you'll only get a queued chunk with MSG_MORE clear if data can't be sent.
> > As soon as data can be sent, if the first chunk has MSG_MORE clear all of the
> > queued chunks will be sent.
> 
> With the fix proposed by Xin, this would be more like: ... all of the
> _non-held_ chunks will be sent.
> After all, application asked to hold them, for whatever reason it had.

You are mis-understanding what I think MSG_MORE is for.
It isn't the application saying 'don't send this packet', but rather
'there is no point sending ANY data because I've more data to send'.
There is also the inference that the application will immediately
send the next piece of data.

So it isn't a property of the queued chunk, it is an indication that
the application is going to send more data immediately.

> > So immediately after your (3) the application is expected to send a chunk
> > with MSG_MORE clear - at that point all the queued chunks can be sent in
> > a single packet.
> 
> Yes. Isn't that the idea?
> 
> >
> > So just save the last MSG_MORE on the association as I did.
> 
> I don't see the reason to change that. Your reply seem to reject the
> idea but I cannot get the reason why. The solution proposed is more
> complex, yes, allows more control, yes, but those aren't real issues
> here.

I think you are trying to provide control of chunking that is neither
necessary nor desirable and may give a false indication of what it
might be sensible for the application to have control over.

Regardless of the MSG_MORE flags associated with any specific send()
request there will always be protocol effects (like retransmissions
or flow control 'on') that will generate different 'chunking'.

	David
Marcelo Ricardo Leitner March 22, 2017, 5:33 p.m. UTC | #14
On Wed, Mar 22, 2017 at 02:07:37PM +0000, David Laight wrote:
> From: Marcelo Ricardo Leitner [mailto:marcelo.leitner@gmail.com]
> > Sent: 21 March 2017 22:04
> > Hi,
> ...
> > > > 2. send 1 more chunk with MSG_MORE clear, the queue is:
> > > >   chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
> > >
> > > I don't think processing the entire queue is a good idea.
> > > Both from execution time and the effects on the data cache.
> > 
> > It won't be processing the entire queue if not needed, and it will only
> > process it on the last sendmsg() call. As the list is double-linked, it
> > can walk backwards as necessary and stop just at the right point.  So
> > this doesn't imply on any quadratic or exponential factor here, but
> > linear and only if/when finishing the MSG_MORE block.
> > 
> > If the application is not using MSG_MORE, impact is zero.
> > 
> > > The SCTP code is horrid enough as it is.
> > >
> > > > 3. then if user send more small chunks with MSG_MORE set,
> > > > the queue is like:
> > > >   chkB[set] -> chkA[set] -> chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
> > > > so that the new small chunks' flag will not affect the other chunks bundling.
> > >
> > > That isn't really necessary.
> > > The user can't expect to have absolute control over which chunks get bundled
> > > together.
> > 
> > So...?
> > I mean, I'm okay with that but that doesn't explain why we can't do as
> > Xin proposed on previous email here.
> > 
> > > If the above chunks still aren't big enough to fill a frame the code might
> > > as well wait for the next chunk instead of building a packet that contains
> > > chk1 through to chkB.
> > 
> > Our expectations are the same and that's what the proposed solution also
> > achieves, no?
> 
> Not really.
> 
> > > Remember you'll only get a queued chunk with MSG_MORE clear if data can't be sent.
> > > As soon as data can be sent, if the first chunk has MSG_MORE clear all of the
> > > queued chunks will be sent.
> > 
> > With the fix proposed by Xin, this would be more like: ... all of the
> > _non-held_ chunks will be sent.
> > After all, application asked to hold them, for whatever reason it had.
> 
> You are mis-understanding what I think MSG_MORE is for.
> It isn't the application saying 'don't send this packet', but rather
> 'there is no point sending ANY data because I've more data to send'.

I see the difference now, thanks.

> There is also the inference that the application will immediately
> send the next piece of data.
> 
> So it isn't a property of the queued chunk, it is an indication that
> the application is going to send more data immediately.

It would basically avoid sending the chunk immediately when inflight==0
and the rest could stay as is then.
As the application is supposed to send more data immediately and clear
the flag, we can send packets as they become full, and also let normal
Nagle processing be. TCP also has a ceiling on this waiting, 200ms
according to man 7 tcp/TCP_CORK.

> 
> > > So immediately after your (3) the application is expected to send a chunk
> > > with MSG_MORE clear - at that point all the queued chunks can be sent in
> > > a single packet.
> > 
> > Yes. Isn't that the idea?
> > 
> > >
> > > So just save the last MSG_MORE on the association as I did.
> > 
> > I don't see the reason to change that. Your reply seem to reject the
> > idea but I cannot get the reason why. The solution proposed is more
> > complex, yes, allows more control, yes, but those aren't real issues
> > here.
> 
> I think you are trying to provide control of chunking that is neither
> necessary nor desirable and may give a false indication of what it
> might be sensible for the application to have control over.

Agreed.

> 
> Regardless of the MSG_MORE flags associated with any specific send()
> request there will always be protocol effects (like retransmissions
> or flow control 'on') that will generate different 'chunking'.

Yes, those are the ones that may lead to some confusion on how it
actually works, and mangling them is not really desired for the
sideeffects that it might have.

Sooner or later we could have bug reports like "hey this chunk shouldn't
have been packed with that." if we stick with the initial proposition,
while with David's view, we are only promising to not send packets with
a single chunk and as long as the application send more data fast enough.

David, are we on the same page now? ;-)

Xin, what do you think?

  Marcelo
Xin Long March 23, 2017, 4:35 a.m. UTC | #15
On Thu, Mar 23, 2017 at 1:33 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Wed, Mar 22, 2017 at 02:07:37PM +0000, David Laight wrote:
>> From: Marcelo Ricardo Leitner [mailto:marcelo.leitner@gmail.com]
>> > Sent: 21 March 2017 22:04
>> > Hi,
>> ...
>> > > > 2. send 1 more chunk with MSG_MORE clear, the queue is:
>> > > >   chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
>> > >
>> > > I don't think processing the entire queue is a good idea.
>> > > Both from execution time and the effects on the data cache.
>> >
>> > It won't be processing the entire queue if not needed, and it will only
>> > process it on the last sendmsg() call. As the list is double-linked, it
>> > can walk backwards as necessary and stop just at the right point.  So
>> > this doesn't imply on any quadratic or exponential factor here, but
>> > linear and only if/when finishing the MSG_MORE block.
>> >
>> > If the application is not using MSG_MORE, impact is zero.
>> >
>> > > The SCTP code is horrid enough as it is.
>> > >
>> > > > 3. then if user send more small chunks with MSG_MORE set,
>> > > > the queue is like:
>> > > >   chkB[set] -> chkA[set] -> chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
>> > > > so that the new small chunks' flag will not affect the other chunks bundling.
>> > >
>> > > That isn't really necessary.
>> > > The user can't expect to have absolute control over which chunks get bundled
>> > > together.
>> >
>> > So...?
>> > I mean, I'm okay with that but that doesn't explain why we can't do as
>> > Xin proposed on previous email here.
>> >
>> > > If the above chunks still aren't big enough to fill a frame the code might
>> > > as well wait for the next chunk instead of building a packet that contains
>> > > chk1 through to chkB.
>> >
>> > Our expectations are the same and that's what the proposed solution also
>> > achieves, no?
>>
>> Not really.
>>
>> > > Remember you'll only get a queued chunk with MSG_MORE clear if data can't be sent.
>> > > As soon as data can be sent, if the first chunk has MSG_MORE clear all of the
>> > > queued chunks will be sent.
>> >
>> > With the fix proposed by Xin, this would be more like: ... all of the
>> > _non-held_ chunks will be sent.
>> > After all, application asked to hold them, for whatever reason it had.
>>
>> You are mis-understanding what I think MSG_MORE is for.
>> It isn't the application saying 'don't send this packet', but rather
>> 'there is no point sending ANY data because I've more data to send'.
>
> I see the difference now, thanks.
>
>> There is also the inference that the application will immediately
>> send the next piece of data.
>>
>> So it isn't a property of the queued chunk, it is an indication that
>> the application is going to send more data immediately.
>
> It would basically avoid sending the chunk immediately when inflight==0
> and the rest could stay as is then.
> As the application is supposed to send more data immediately and clear
> the flag, we can send packets as they become full, and also let normal
> Nagle processing be. TCP also has a ceiling on this waiting, 200ms
> according to man 7 tcp/TCP_CORK.
>
>>
>> > > So immediately after your (3) the application is expected to send a chunk
>> > > with MSG_MORE clear - at that point all the queued chunks can be sent in
>> > > a single packet.
>> >
>> > Yes. Isn't that the idea?
>> >
>> > >
>> > > So just save the last MSG_MORE on the association as I did.
>> >
>> > I don't see the reason to change that. Your reply seem to reject the
>> > idea but I cannot get the reason why. The solution proposed is more
>> > complex, yes, allows more control, yes, but those aren't real issues
>> > here.
>>
>> I think you are trying to provide control of chunking that is neither
>> necessary nor desirable and may give a false indication of what it
>> might be sensible for the application to have control over.
>
> Agreed.
>
>>
>> Regardless of the MSG_MORE flags associated with any specific send()
>> request there will always be protocol effects (like retransmissions
>> or flow control 'on') that will generate different 'chunking'.
>
> Yes, those are the ones that may lead to some confusion on how it
> actually works, and mangling them is not really desired for the
> sideeffects that it might have.
>
> Sooner or later we could have bug reports like "hey this chunk shouldn't
> have been packed with that." if we stick with the initial proposition,
> while with David's view, we are only promising to not send packets with
> a single chunk and as long as the application send more data fast enough.
>
> David, are we on the same page now? ;-)
>
> Xin, what do you think?
If we insist that MSG_MORE means not to send  ANY data, I compromise.
does ANY include retransmission DATA? should MSG_MORE block
retransmission ?

>
>   Marcelo
>
Marcelo Ricardo Leitner March 23, 2017, 4:42 p.m. UTC | #16
On Thu, Mar 23, 2017 at 12:35:46PM +0800, Xin Long wrote:
> On Thu, Mar 23, 2017 at 1:33 AM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > On Wed, Mar 22, 2017 at 02:07:37PM +0000, David Laight wrote:
> >> Regardless of the MSG_MORE flags associated with any specific send()
> >> request there will always be protocol effects (like retransmissions
> >> or flow control 'on') that will generate different 'chunking'.
> >
> > Yes, those are the ones that may lead to some confusion on how it
> > actually works, and mangling them is not really desired for the
> > sideeffects that it might have.
> >
> > Sooner or later we could have bug reports like "hey this chunk shouldn't
> > have been packed with that." if we stick with the initial proposition,
> > while with David's view, we are only promising to not send packets with
> > a single chunk and as long as the application send more data fast enough.
> >
> > David, are we on the same page now? ;-)
> >
> > Xin, what do you think?
> If we insist that MSG_MORE means not to send  ANY data, I compromise.
> does ANY include retransmission DATA? should MSG_MORE block
> retransmission ?

That's not really what he meant by that, I think. That "ANY" in there is
a way to refer to the entire buf and not that msg sendmsg is sending.
Later I explained what I got from his explanation, which should be more
like:
"If MSG_MORE was used, and there are no packets in flight, do not send a
packet right away because the application is going to send more data."
Would have to handle the (Not-)Nagle situation too:
"If not using Nagle and using MSG_MORE, try to not generate a packet
right away." (because this may send packets with a single chunk even if
in_flight != 0)
In both cases, if the flush is generated by other triggers, it's okay.

Because if there are chunks already queued, they will be sent as soon as
in_flight reaches 0 or some other break is lifted (flow control).
Holding the chunk that was queued with MSG_MORE and sending a partial
packet in this case because of MSG_MORE is not good, it's possibly not
saving anything.

  Marcelo
Xin Long March 24, 2017, 4:09 p.m. UTC | #17
On Fri, Mar 24, 2017 at 12:42 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Thu, Mar 23, 2017 at 12:35:46PM +0800, Xin Long wrote:
>> On Thu, Mar 23, 2017 at 1:33 AM, Marcelo Ricardo Leitner
>> <marcelo.leitner@gmail.com> wrote:
>> > On Wed, Mar 22, 2017 at 02:07:37PM +0000, David Laight wrote:
>> >> Regardless of the MSG_MORE flags associated with any specific send()
>> >> request there will always be protocol effects (like retransmissions
>> >> or flow control 'on') that will generate different 'chunking'.
>> >
>> > Yes, those are the ones that may lead to some confusion on how it
>> > actually works, and mangling them is not really desired for the
>> > sideeffects that it might have.
>> >
>> > Sooner or later we could have bug reports like "hey this chunk shouldn't
>> > have been packed with that." if we stick with the initial proposition,
>> > while with David's view, we are only promising to not send packets with
>> > a single chunk and as long as the application send more data fast enough.
>> >
>> > David, are we on the same page now? ;-)
>> >
>> > Xin, what do you think?
>> If we insist that MSG_MORE means not to send  ANY data, I compromise.
>> does ANY include retransmission DATA? should MSG_MORE block
>> retransmission ?
>
> That's not really what he meant by that, I think. That "ANY" in there is
> a way to refer to the entire buf and not that msg sendmsg is sending.
> Later I explained what I got from his explanation, which should be more
> like:
> "If MSG_MORE was used, and there are no packets in flight, do not send a
> packet right away because the application is going to send more data."
> Would have to handle the (Not-)Nagle situation too:
> "If not using Nagle and using MSG_MORE, try to not generate a packet
> right away." (because this may send packets with a single chunk even if
> in_flight != 0)
> In both cases, if the flush is generated by other triggers, it's okay.
>
> Because if there are chunks already queued, they will be sent as soon as
> in_flight reaches 0 or some other break is lifted (flow control).
> Holding the chunk that was queued with MSG_MORE and sending a partial
> packet in this case because of MSG_MORE is not good, it's possibly not
> saving anything.

Makes sence, thanks for making this clear, will post a new fix.
David Laight March 24, 2017, 5:38 p.m. UTC | #18
From: Marcelo Ricardo Leitner
> Sent: 23 March 2017 16:42
...
> > > David, are we on the same page now? ;-)

Probably...

> > > Xin, what do you think?
> > If we insist that MSG_MORE means not to send  ANY data, I compromise.
> > does ANY include retransmission DATA? should MSG_MORE block
> > retransmission ?
> 
> That's not really what he meant by that, I think. That "ANY" in there is
> a way to refer to the entire buf and not that msg sendmsg is sending.
> Later I explained what I got from his explanation, which should be more
> like:
> "If MSG_MORE was used, and there are no packets in flight, do not send a
> packet right away because the application is going to send more data."
> Would have to handle the (Not-)Nagle situation too:
> "If not using Nagle and using MSG_MORE, try to not generate a packet
> right away." (because this may send packets with a single chunk even if
> in_flight != 0)
> In both cases, if the flush is generated by other triggers, it's okay.
> 
> Because if there are chunks already queued, they will be sent as soon as
> in_flight reaches 0 or some other break is lifted (flow control).
> Holding the chunk that was queued with MSG_MORE and sending a partial
> packet in this case because of MSG_MORE is not good, it's possibly not
> saving anything.

It is also worth remembering that this is all about whether the first
chunk in an ethernet frame is a data chunk.
If a frame is being sent for some other reason (eg ack or heartbeat)
then it will collect queued data chunks.

I've seen hearbeats collect data chunks, I've not checked that this
doesn't happen for heartbeats that are probing IP addresses.

	David
David Laight March 28, 2017, 10:29 a.m. UTC | #19
From: Marcelo Ricardo Leitner
> Sent: 21 March 2017 22:04
...
> > > 3. then if user send more small chunks with MSG_MORE set,
> > > the queue is like:
> > >   chkB[set] -> chkA[set] -> chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
> > > so that the new small chunks' flag will not affect the other chunks bundling.
> >
> > That isn't really necessary.
> > The user can't expect to have absolute control over which chunks get bundled
> > together.
> 
> So...?
> I mean, I'm okay with that but that doesn't explain why we can't do as
> Xin proposed on previous email here.
> 
> > If the above chunks still aren't big enough to fill a frame the code might
> > as well wait for the next chunk instead of building a packet that contains
> > chk1 through to chkB.
> 
> Our expectations are the same and that's what the proposed solution also
> achieves, no?
> 
> >
> > Remember you'll only get a queued chunk with MSG_MORE clear if data can't be sent.
> > As soon as data can be sent, if the first chunk has MSG_MORE clear all of the
> > queued chunks will be sent.
> 
> With the fix proposed by Xin, this would be more like: ... all of the
> _non-held_ chunks will be sent.
> After all, application asked to hold them, for whatever reason it had.
> 
> >
> > So immediately after your (3) the application is expected to send a chunk
> > with MSG_MORE clear - at that point all the queued chunks can be sent in
> > a single packet.
> 
> Yes. Isn't that the idea?
> 
> >
> > So just save the last MSG_MORE on the association as I did.
> 
> I don't see the reason to change that. Your reply seem to reject the
> idea but I cannot get the reason why. The solution proposed is more
> complex, yes, allows more control, yes, but those aren't real issues
> here.

I think I've realised why we were disagreeing....

For TCP MSG_MORE was implemented as an alternative to 'corking' for
allocations that used multiple send() calls to send a single application
level message (eg those that send a length and data separately).
The real solution is to use sendv() (or writev() if sendv() doesn't exist).

SCTP is different, every send() generates a separate data chunk.
The problem in SCTP is that if an application has more than one data
chunk to send it needs to stop the kernel sending anything until
it has sent all the data chunks - otherwise they all go out in
separate ethernet packets.

A TCP application can avoid this by doing a single send() for all the
data. This doesn't work for SCTP because it would generate a single
data chunk.

	David
Marcelo Ricardo Leitner March 28, 2017, 6:12 p.m. UTC | #20
On Tue, Mar 28, 2017 at 10:29:27AM +0000, David Laight wrote:
> From: Marcelo Ricardo Leitner
> > Sent: 21 March 2017 22:04
> ...
> > > > 3. then if user send more small chunks with MSG_MORE set,
> > > > the queue is like:
> > > >   chkB[set] -> chkA[set] -> chk4[clear] -> chk3 [clear] -> chk2 [clear] -> chk1 [clear]
> > > > so that the new small chunks' flag will not affect the other chunks bundling.
> > >
> > > That isn't really necessary.
> > > The user can't expect to have absolute control over which chunks get bundled
> > > together.
> > 
> > So...?
> > I mean, I'm okay with that but that doesn't explain why we can't do as
> > Xin proposed on previous email here.
> > 
> > > If the above chunks still aren't big enough to fill a frame the code might
> > > as well wait for the next chunk instead of building a packet that contains
> > > chk1 through to chkB.
> > 
> > Our expectations are the same and that's what the proposed solution also
> > achieves, no?
> > 
> > >
> > > Remember you'll only get a queued chunk with MSG_MORE clear if data can't be sent.
> > > As soon as data can be sent, if the first chunk has MSG_MORE clear all of the
> > > queued chunks will be sent.
> > 
> > With the fix proposed by Xin, this would be more like: ... all of the
> > _non-held_ chunks will be sent.
> > After all, application asked to hold them, for whatever reason it had.
> > 
> > >
> > > So immediately after your (3) the application is expected to send a chunk
> > > with MSG_MORE clear - at that point all the queued chunks can be sent in
> > > a single packet.
> > 
> > Yes. Isn't that the idea?
> > 
> > >
> > > So just save the last MSG_MORE on the association as I did.
> > 
> > I don't see the reason to change that. Your reply seem to reject the
> > idea but I cannot get the reason why. The solution proposed is more
> > complex, yes, allows more control, yes, but those aren't real issues
> > here.
> 
> I think I've realised why we were disagreeing....
> 
> For TCP MSG_MORE was implemented as an alternative to 'corking' for
> allocations that used multiple send() calls to send a single application
> level message (eg those that send a length and data separately).
> The real solution is to use sendv() (or writev() if sendv() doesn't exist).
> 
> SCTP is different, every send() generates a separate data chunk.
> The problem in SCTP is that if an application has more than one data
> chunk to send it needs to stop the kernel sending anything until
> it has sent all the data chunks - otherwise they all go out in
> separate ethernet packets.
> 
> A TCP application can avoid this by doing a single send() for all the
> data. This doesn't work for SCTP because it would generate a single
> data chunk.

Yep, it was based on TCP one since day 0. There is also the way MSG_MORE
works for UDP, in which it appends the subsequent write to the same
datagram. But then, for SCTP, we should use EOR flag, as Michael Tuexen
had indicated here on the list a few months ago:
"""
What about adding support for the explicit EOR mode as specified in
https://tools.ietf.org/html/rfc6458#section-8.1.26
"""
Thus why it was very based on TCP implementation, so maybe in the future
we could have both options available for SCTP.

  Marcelo
diff mbox

Patch

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 387c802..a244db5 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -497,6 +497,7 @@  struct sctp_datamsg {
 	/* Did the messenge fail to send? */
 	int send_error;
 	u8 send_failed:1,
+	   force_delay:1,
 	   can_delay;	    /* should this message be Nagle delayed */
 };
 
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 814eac0..85406d5 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -704,18 +704,15 @@  static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
 	 * unacknowledged.
 	 */
 
-	if (sctp_sk(asoc->base.sk)->nodelay)
-		/* Nagle disabled */
+	if ((sctp_sk(asoc->base.sk)->nodelay || inflight == 0) &&
+	    !chunk->msg->force_delay)
+		/* Nothing unacked */
 		return SCTP_XMIT_OK;
 
 	if (!sctp_packet_empty(packet))
 		/* Append to packet */
 		return SCTP_XMIT_OK;
 
-	if (inflight == 0)
-		/* Nothing unacked */
-		return SCTP_XMIT_OK;
-
 	if (!sctp_state(asoc, ESTABLISHED))
 		return SCTP_XMIT_OK;
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 75f35ce..b532148 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1964,6 +1964,7 @@  static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
 		err = PTR_ERR(datamsg);
 		goto out_free;
 	}
+	datamsg->force_delay = !!(msg->msg_flags & MSG_MORE);
 
 	/* Now send the (possibly) fragmented message. */
 	list_for_each_entry(chunk, &datamsg->chunks, frag_list) {