diff mbox

SPLICE_F_NONBLOCK semantics...

Message ID 20091001.151102.09812927.davem@davemloft.net
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

David Miller Oct. 1, 2009, 10:11 p.m. UTC
Linus, I plan on putting the fix below into my tree.

It depends upon our interpretation of how you intended the
SPLICE_F_NONBLOCK flag to work when you added it way back
when.

Could you take a quick look and make sure our interpretation matches
your intent?  This behavior has been bugging people for a while and I
want to close this out, one way or another.

Thanks!

[PATCH] net: splice() from tcp to pipe should take into account O_NONBLOCK

tcp_splice_read() doesnt take into account socket's O_NONBLOCK flag

Before this patch :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); 
causes a random endless block (if pipe is full) and
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.

User application has no way to instruct splice() that socket should be in blocking mode
but pipe in nonblock more.

Many projects cannot use splice(tcp -> pipe) because of this flaw.
 
http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD
http://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html

Linus introduced  SPLICE_F_NONBLOCK in commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d
(splice: add SPLICE_F_NONBLOCK flag )

  It doesn't make the splice itself necessarily nonblocking (because the
  actual file descriptors that are spliced from/to may block unless they
  have the O_NONBLOCK flag set), but it makes the splice pipe operations
  nonblocking.


Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only


This patch instruct tcp_splice_read() to use the underlying file O_NONBLOCK
flag, as other socket operations do.


Users will then call :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK ); 

to block on data coming from socket (if file is in blocking mode),
and not block on pipe output (to avoid deadlock)

First version of this patch was submitted by Octavian Purdila

Reported-by: Volker Lendecke <vl@samba.org>
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Linus Torvalds Oct. 1, 2009, 10:21 p.m. UTC | #1
On Thu, 1 Oct 2009, David Miller wrote:
> 
> It depends upon our interpretation of how you intended the
> SPLICE_F_NONBLOCK flag to work when you added it way back
> when.
> 
> Linus introduced  SPLICE_F_NONBLOCK in commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d
> (splice: add SPLICE_F_NONBLOCK flag )
> 
>   It doesn't make the splice itself necessarily nonblocking (because the
>   actual file descriptors that are spliced from/to may block unless they
>   have the O_NONBLOCK flag set), but it makes the splice pipe operations
>   nonblocking.
> 
> Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only

Ack. The original intent was for the flag to affect the buffering, not the 
end points.

> splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK ); 
> 
> to block on data coming from socket (if file is in blocking mode),
> and not block on pipe output (to avoid deadlock)

Yes. Sounds correct. Although the more I think about it, the more I 
suspect that the whole NONBLOCK thing should probably have been two bits, 
and simply been about "nonblocking input" vs "nonblocking output" (so that 
you could control both sides on a call-by-call basis).

But I think that your patch is fundamentally correct with the semantics 
as-is.

		Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Oct. 1, 2009, 10:24 p.m. UTC | #2
On Thu, 1 Oct 2009, Linus Torvalds wrote:
> 
> Ack. The original intent was for the flag to affect the buffering, not the 
> end points.

Side note, in case it wasn't clear: I added Jens to the cc, because in the 
end my "original intent" probably doesn't matter all that much. I may have 
set up the basic ideas and so on, but I never wrote any apps that use it, 
and Jens has done most of the actual fixes since.

So give "my original intent" the weight (or rather, lack there-of) that it 
deserves. 

		Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 1, 2009, 10:27 p.m. UTC | #3
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 1 Oct 2009 15:21:44 -0700 (PDT)

> On Thu, 1 Oct 2009, David Miller wrote:
>> 
>> It depends upon our interpretation of how you intended the
>> SPLICE_F_NONBLOCK flag to work when you added it way back
>> when.
>> 
>> Linus introduced  SPLICE_F_NONBLOCK in commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d
>> (splice: add SPLICE_F_NONBLOCK flag )
>> 
>>   It doesn't make the splice itself necessarily nonblocking (because the
>>   actual file descriptors that are spliced from/to may block unless they
>>   have the O_NONBLOCK flag set), but it makes the splice pipe operations
>>   nonblocking.
>> 
>> Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only
> 
> Ack. The original intent was for the flag to affect the buffering, not the 
> end points.

Great, thanks for reviewing.

> Although the more I think about it, the more I suspect that the
> whole NONBLOCK thing should probably have been two bits, and simply
> been about "nonblocking input" vs "nonblocking output" (so that you
> could control both sides on a call-by-call basis).

I think we could still extend things in this way if we wanted to.
So if you specify the explicit input and/or output nonblock flag,
it takes precedence over the SPLICE_F_NONBLOCK thing.

Anyways, just an idea.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe Oct. 2, 2009, 7:47 a.m. UTC | #4
On Thu, Oct 01 2009, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Thu, 1 Oct 2009 15:21:44 -0700 (PDT)
> 
> > On Thu, 1 Oct 2009, David Miller wrote:
> >> 
> >> It depends upon our interpretation of how you intended the
> >> SPLICE_F_NONBLOCK flag to work when you added it way back
> >> when.
> >> 
> >> Linus introduced  SPLICE_F_NONBLOCK in commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d
> >> (splice: add SPLICE_F_NONBLOCK flag )
> >> 
> >>   It doesn't make the splice itself necessarily nonblocking (because the
> >>   actual file descriptors that are spliced from/to may block unless they
> >>   have the O_NONBLOCK flag set), but it makes the splice pipe operations
> >>   nonblocking.
> >> 
> >> Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only
> > 
> > Ack. The original intent was for the flag to affect the buffering, not the 
> > end points.
> 
> Great, thanks for reviewing.
> 
> > Although the more I think about it, the more I suspect that the
> > whole NONBLOCK thing should probably have been two bits, and simply
> > been about "nonblocking input" vs "nonblocking output" (so that you
> > could control both sides on a call-by-call basis).
> 
> I think we could still extend things in this way if we wanted to.
> So if you specify the explicit input and/or output nonblock flag,
> it takes precedence over the SPLICE_F_NONBLOCK thing.

Yes I agree, thank god for having a 'flags' parameter for the syscalls
:-). I'll make a note to add and test bidirectional nonblock hints.

The net patch looks fine and correct to me, feel free to add my acked-by
if you want.
David Miller Oct. 2, 2009, 4:45 p.m. UTC | #5
From: Jens Axboe <jens.axboe@oracle.com>
Date: Fri, 2 Oct 2009 09:47:54 +0200

> The net patch looks fine and correct to me, feel free to add my acked-by
> if you want.

Thanks Jens.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 21387eb..8cdfab6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -580,7 +580,7 @@  ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 
 	lock_sock(sk);
 
-	timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
+	timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
 	while (tss.len) {
 		ret = __tcp_splice_read(sk, &tss);
 		if (ret < 0)