Splice on blocking TCP sockets again..

Message ID 4AC2E481.5060509@gmail.com
State Superseded, archived
Delegated to: David Miller
Commit Message

Eric Dumazet Sept. 30, 2009, 4:54 a.m. UTC
Jason Gunthorpe wrote:
> Eric,
> 
> I saw your patch from January regarding splicing on blocking sockets,
> and I wondered what ever happened to it?
> 
> http://lkml.org/lkml/2009/1/13/507
> 
> It doesn't look like it has been applied.. I see the patch thread died
> at davem's comments?
> 
> I have run into exactly the same problem as Samba, where I'd like the
> TCP socket to be blocking, and the pipe to be non blocking ...
> 
> As it stands, 
>   splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); 
> causes a random endless block and
>   splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
> will return 0 immediately if the TCP buffer is empty.
> 
> FWIW, it looks like samba has a splice code now, but doesn't enable it
> due to this issue?
> 
> http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD
> 
> Thanks,
> Jason

Hi Jason, thanks for the reminder.

Hmm, most probably I did not reply properly to David's objection, which was:

Date	Wed, 14 Jan 2009 20:58:39 -0800 (PST)
Subject	Re: maximum buffer size for splice(2) tcp->pipe?
From	David Miller <>

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Wed, 14 Jan 2009 00:38:32 +0100
> [PATCH] net: splice() from tcp to socket should take into account O_NONBLOCK
> 
> Instead of using SPLICE_F_NONBLOCK to select a non blocking mode both on
> source tcp socket and pipe destination, we use the underlying file flag (O_NONBLOCK)
> for selecting a non blocking socket.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

This needs at least some more thought.

It seems, for one thing, that this change will interfere with the
intentions of the code in splice_direct_to_actor which goes:

	/*
	 * Don't block on output, we have to drain the direct pipe.
	 */
	sd->flags &= ~SPLICE_F_NONBLOCK;

------------------------------------------------------------------------------

But splice_direct_to_actor() handles a REG/BLK file as input and a pipe as output,
so I believe my patch won't change splice_direct_to_actor() behavior.

My patch title was wrong:

net: splice() from tcp to socket should take into account O_NONBLOCK


So maybe David was misled by the title :)


[PATCH] net: splice() from tcp to pipe should take into account O_NONBLOCK

Before this patch :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); 
causes a random endless block (if pipe is full) and
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.

A user application has no way to tell splice() that the socket should be in blocking mode
but the pipe in nonblocking mode.
 
http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD

One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.

Users will then call :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK ); 

to block on data coming from socket (if file is in blocking mode),
and not block on pipe output (to avoid deadlock)

Reported-by: Volker Lendecke <vl@samba.org>
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
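
For illustration only (not part of the patch): a minimal userspace sketch of
the calling convention described in the changelog, assuming a kernel where
tcp_splice_read() honours the socket's O_NONBLOCK flag as proposed here. The
helper names set_blocking() and splice_to_pipe() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: clear O_NONBLOCK so that reads on fd may wait. */
int set_blocking(int fd)
{
	int fl = fcntl(fd, F_GETFL);

	if (fl < 0)
		return -1;
	return fcntl(fd, F_SETFL, fl & ~O_NONBLOCK);
}

/*
 * Hypothetical helper: move up to len bytes from in_fd (a TCP socket in
 * the use case discussed here) into the write end of a pipe.
 *
 * SPLICE_F_NONBLOCK is always passed so the call does not sleep on a full
 * pipe. With the proposed semantics, whether the call waits for socket
 * data is decided by O_NONBLOCK on in_fd, not by this flag.
 *
 * Returns the number of bytes moved, 0 at end of stream, or -1 with errno
 * set (EAGAIN typically means the pipe is full, or that no data is
 * available on a non-blocking input).
 */
ssize_t splice_to_pipe(int in_fd, int pipe_wr, size_t len)
{
	ssize_t n;

	assert(in_fd >= 0 && pipe_wr >= 0);

	/* Retry only when interrupted by a signal. */
	do {
		n = splice(in_fd, NULL, pipe_wr, NULL, len,
			   SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
	} while (n < 0 && errno == EINTR);

	return n;
}
```

A caller would run set_blocking(sock) once and then loop on
splice_to_pipe(sock, pipefd[1], 128 * 1024), draining the pipe into the
destination file whenever the call fails with EAGAIN.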

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jason Gunthorpe Sept. 30, 2009, 5:40 a.m. UTC | #1
> One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
> flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.

Thanks Eric, this seems reasonable from my userspace perspective.

I admit I don't understand why SPLICE_F_NONBLOCK exists, it seems very
un-unixy to have a syscall completely ignore the NONBLOCK flag of the
fd it is called on. Ie setting NONBLOCK on the pipe itself does
nothing when using splice..

Jason
Eric Dumazet Sept. 30, 2009, 5:51 a.m. UTC | #2
Jason Gunthorpe wrote:
>> One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
>> flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.

Argh, that should have been tcp_splice_read(), of course.

> 
> Thanks Eric, this seems reasonable from my userspace perspective.
> 
> I admit I don't understand why SPLICE_F_NONBLOCK exists, it seems very
> un-unixy to have a syscall completely ignore the NONBLOCK flag of the
> fd it is called on. Ie setting NONBLOCK on the pipe itself does
> nothing when using splice..
> 

Hmm, good question. I don't have the answer, but I'll dig one up.
Eric Dumazet Sept. 30, 2009, 6 a.m. UTC | #3
Eric Dumazet wrote:
> Jason Gunthorpe wrote:
>>> One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
>>> flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.
> 
> arg, this was tcp_splice_read() of course
> 
>> Thanks Eric, this seems reasonable from my userspace perspective.
>>
>> I admit I don't understand why SPLICE_F_NONBLOCK exists, it seems very
>> un-unixy to have a syscall completely ignore the NONBLOCK flag of the
>> fd it is called on. Ie setting NONBLOCK on the pipe itself does
>> nothing when using splice..
>>
> 
> Hmm, good question, I dont have the answer but I'll digg one.
> 

commit	29e350944fdc2dfca102500790d8ad6d6ff4f69d
splice: add SPLICE_F_NONBLOCK flag

It doesn't make the splice itself necessarily nonblocking (because the
actual file descriptors that are spliced from/to may block unless they
have the O_NONBLOCK flag set), but it makes the splice pipe operations
nonblocking.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>


So Linus's intention was pretty clear: O_NONBLOCK should be taken into account
by the 'actual file descriptors that are spliced from/to', regardless of the
SPLICE_F_NONBLOCK flag.
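
For illustration only: a small userspace model of which flag decides what,
before and after the one-line change to tcp_splice_read(). This is a
simplified sketch, not kernel code; struct splice_wait and the two functions
are hypothetical names, and only the flag constants come from <fcntl.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Model only: may splice(socket -> pipe) sleep on each side? */
struct splice_wait {
	bool socket_may_block;	/* wait for data on the TCP socket */
	bool pipe_may_block;	/* wait for room in the pipe */
};

/* Current behavior: SPLICE_F_NONBLOCK decides both sides. */
struct splice_wait wait_before_patch(int f_flags, unsigned int splice_flags)
{
	bool nonblock = splice_flags & SPLICE_F_NONBLOCK;

	(void)f_flags;		/* the socket's O_NONBLOCK is ignored */
	return (struct splice_wait){ !nonblock, !nonblock };
}

/* Proposed behavior: O_NONBLOCK for the socket, SPLICE_F_NONBLOCK for the pipe. */
struct splice_wait wait_after_patch(int f_flags, unsigned int splice_flags)
{
	return (struct splice_wait){
		.socket_may_block = !(f_flags & O_NONBLOCK),
		.pipe_may_block = !(splice_flags & SPLICE_F_NONBLOCK),
	};
}

/* The case from this thread: blocking socket, non-blocking pipe. */
void check_model(void)
{
	unsigned int fl = SPLICE_F_MOVE | SPLICE_F_NONBLOCK;
	struct splice_wait w = wait_before_patch(0, fl);

	/* Today the socket side cannot wait once the pipe side must not. */
	assert(!w.socket_may_block && !w.pipe_may_block);

	w = wait_after_patch(0, fl);
	/* With the change, the two sides are controlled independently. */
	assert(w.socket_may_block && !w.pipe_may_block);
}
```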

Jason Gunthorpe Oct. 1, 2009, 10:17 p.m. UTC | #4
On Wed, Sep 30, 2009 at 08:00:04AM +0200, Eric Dumazet wrote:

> >> I admit I don't understand why SPLICE_F_NONBLOCK exists, it seems very
> >> un-unixy to have a syscall completely ignore the NONBLOCK flag of the
> >> fd it is called on. Ie setting NONBLOCK on the pipe itself does
> >> nothing when using splice..
> > 
> > Hmm, good question, I dont have the answer but I'll digg one.
> > 
> 
> commit	29e350944fdc2dfca102500790d8ad6d6ff4f69d
> splice: add SPLICE_F_NONBLOCK flag
> 
> It doesn't make the splice itself necessarily nonblocking (because the
> actual file descriptors that are spliced from/to may block unless they
> have the O_NONBLOCK flag set), but it makes the splice pipe operations
> nonblocking.
> 
> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> See Linus intention was pretty clear : O_NONBLOCK should be taken
> into account by 'actual file that are spliced from/to', regardless
> of SPLICE_F_NONBLOCK flag

Yes, that seems reasonable.

What confuses me is that if O_NONBLOCK is set on the _pipe_ and
SPLICE_F_NONBLOCK is not set on the splice call, the splice still blocks.
That is unlike other unix APIs, e.g. MSG_DONTWAIT.

It seems to me that SPLICE_F_NONBLOCK should be or'd with O_NONBLOCK on
the pipe?

Thanks,
Jason
Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 21387eb..8cdfab6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -580,7 +580,7 @@  ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 
 	lock_sock(sk);
 
-	timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
+	timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
 	while (tss.len) {
 		ret = __tcp_splice_read(sk, &tss);
 		if (ret < 0)