
tcp: splice as many packets as possible at once

Message ID 20090108173028.GA22531@1wt.eu
State Superseded, archived
Delegated to: David Miller

Commit Message

Willy Tarreau Jan. 8, 2009, 5:30 p.m. UTC
Jens,

here's the other patch I was talking about, for better behaviour of
non-blocking splice(). Ben Mansell also confirms similar improvements
in his tests, where non-blocking splice() initially showed half of
read()/write() performance. Ben, would you mind adding a Tested-By
line?

Also, please note that this is unrelated to the corruption bug I reported
and does not fix it.

Regards,
Willy

From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 8 Jan 2009 17:10:13 +0100
Subject: [PATCH] tcp: splice as many packets as possible at once

Currently, in non-blocking mode, tcp_splice_read() returns after
splicing one segment regardless of the len argument. This results
in low performance and very high overhead due to syscall rate when
splicing from interfaces which do not support LRO.

The fix simply consists in not breaking out of the loop after the
first read. That way, we can read up to the size requested by the
caller and still return when there is no data left.

Performance has significantly improved with this fix, with the
number of calls to splice() divided by about 20, and CPU usage
dropped from 100% to 75%.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 net/ipv4/tcp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
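
For context, here is a minimal userspace sketch (illustrative only, not part
of the patch; the descriptor names, the 64 kB chunk size and the assumption
that the destination fd accepts the data without blocking are mine) of the
kind of non-blocking splice() forwarding loop this change speeds up, draining
the socket until EAGAIN instead of returning after one segment per syscall:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Forward whatever is currently queued on 'sock' to 'dst' through a pipe.
 * 'sock' is a non-blocking TCP socket that poll() reported readable. */
static ssize_t forward_ready_data(int sock, int dst, int pipefd[2])
{
	ssize_t in, out, total = 0;

	for (;;) {
		/* socket -> pipe: with the patch this can move several
		 * segments per call, up to the 64 kB asked for */
		in = splice(sock, NULL, pipefd[1], NULL, 65536,
			    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
		if (in < 0)
			return (errno == EAGAIN) ? total : -1; /* drained or error */
		if (in == 0)
			return total;                          /* peer closed */

		/* pipe -> destination: push out what was just spliced in */
		while (in > 0) {
			out = splice(pipefd[0], NULL, dst, NULL, (size_t)in,
				     SPLICE_F_MOVE | SPLICE_F_MORE);
			if (out <= 0)
				return -1;
			in -= out;
			total += out;
		}
	}
}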

Comments

Jens Axboe Jan. 8, 2009, 7:44 p.m. UTC | #1
On Thu, Jan 08 2009, Willy Tarreau wrote:
> Jens,
> 
> here's the other patch I was talking about, for better behaviour of
> non-blocking splice(). Ben Mansell also confirms similar improvements
> in his tests, where non-blocking splice() initially showed half of
> read()/write() performance. Ben, would you mind adding a Tested-By
> line ?

Looks very good to me, I don't see any valid reason to have that !timeo
break. So feel free to add my acked-by to that patch and send it to
Dave.

> 
> Also, please note that this is unrelated to the corruption bug I reported
> and does not fix it.
> 
> Regards,
> Willy
> 
> From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 8 Jan 2009 17:10:13 +0100
> Subject: [PATCH] tcp: splice as many packets as possible at once
> 
> Currently, in non-blocking mode, tcp_splice_read() returns after
> splicing one segment regardless of the len argument. This results
> in low performance and very high overhead due to syscall rate when
> splicing from interfaces which do not support LRO.
> 
> The fix simply consists in not breaking out of the loop after the
> first read. That way, we can read up to the size requested by the
> caller and still return when there is no data left.
> 
> Performance has significantly improved with this fix, with the
> number of calls to splice() divided by about 20, and CPU usage
> dropped from 100% to 75%.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  net/ipv4/tcp.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 35bcddf..80261b4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
>  		lock_sock(sk);
>  
>  		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
> -		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
> +		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
>  		    signal_pending(current))
>  			break;
>  	}
> -- 
> 1.6.0.3
>
Ben Mansell Jan. 8, 2009, 9:50 p.m. UTC | #2
> From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 8 Jan 2009 17:10:13 +0100
> Subject: [PATCH] tcp: splice as many packets as possible at once
> 
> Currently, in non-blocking mode, tcp_splice_read() returns after
> splicing one segment regardless of the len argument. This results
> in low performance and very high overhead due to syscall rate when
> splicing from interfaces which do not support LRO.
> 
> The fix simply consists in not breaking out of the loop after the
> first read. That way, we can read up to the size requested by the
> caller and still return when there is no data left.
> 
> Performance has significantly improved with this fix, with the
> number of calls to splice() divided by about 20, and CPU usage
> dropped from 100% to 75%.
> 

I get similar results with my testing here. Benchmarking an application 
with this patch shows that more than one packet is being splice()d in at 
once; as a result, I see a doubling in throughput.

Tested-by: Ben Mansell <ben@zeus.com>

> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  net/ipv4/tcp.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 35bcddf..80261b4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
>  		lock_sock(sk);
>  
>  		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
> -		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
> +		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
>  		    signal_pending(current))
>  			break;
>  	}

David Miller Jan. 8, 2009, 9:55 p.m. UTC | #3
From: Ben Mansell <ben@zeus.com>
Date: Thu, 08 Jan 2009 21:50:44 +0000

> > From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Thu, 8 Jan 2009 17:10:13 +0100
> > Subject: [PATCH] tcp: splice as many packets as possible at once
> > Currently, in non-blocking mode, tcp_splice_read() returns after
> > splicing one segment regardless of the len argument. This results
> > in low performance and very high overhead due to syscall rate when
> > splicing from interfaces which do not support LRO.
> > The fix simply consists in not breaking out of the loop after the
> > first read. That way, we can read up to the size requested by the
> > caller and still return when there is no data left.
> > Performance has significantly improved with this fix, with the
> > number of calls to splice() divided by about 20, and CPU usage
> > dropped from 100% to 75%.
> > 
> 
> I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput.
> 
> Tested-by: Ben Mansell <ben@zeus.com>

I'm not applying this until someone explains to me why
we should remove this test from the splice receive but
keep it in the tcp_recvmsg() code where it has been
essentially forever.
Willy Tarreau Jan. 8, 2009, 10:03 p.m. UTC | #4
On Thu, Jan 08, 2009 at 08:44:24PM +0100, Jens Axboe wrote:
> On Thu, Jan 08 2009, Willy Tarreau wrote:
> > Jens,
> > 
> > here's the other patch I was talking about, for better behaviour of
> > non-blocking splice(). Ben Mansell also confirms similar improvements
> > in his tests, where non-blocking splice() initially showed half of
> > read()/write() performance. Ben, would you mind adding a Tested-By
> > line ?
> 
> Looks very good to me, I don't see any valid reason to have that !timeo
> break. So feel free to add my acked-by to that patch and send it to
> Dave.

Done, thanks.

Also, I tried to follow the code path from the userspace splice() call down
into the kernel. While splicing from a socket to a pipe seems obvious up to
tcp_splice_read(), I have a hard time finding the correct path from the pipe
to the socket.

It *seems* to me that we proceed like this; I'd like someone to confirm:

sys_splice()
  do_splice()
    do_splice_from()
      generic_splice_sendpage()      [ I assumed this from socket_file_ops ]
        splice_from_pipe(,,pipe_to_sendpage)
          __splice_from_pipe(,,pipe_to_sendpage)
            pipe_to_sendpage()
              tcp_sendpage()         [ I assumed this from inet_stream_ops ]
                do_tcp_sendpages()   (if NETIF_F_SG) or sock_no_sendpage()

I hope I guessed right, because it does not appear obvious to me. It
will help me better understand the corruption problem.

Now for the read part: am I wrong in thinking that tcp_splice_read() will
return -EAGAIN if the pipe is full? It seems to me that this would be brought
up by splice_to_pipe(), via __tcp_splice_read(), tcp_read_sock(),
tcp_splice_data_recv() and skb_splice_bits().

If so, this can cause serious trouble: if there is data available in a
socket, poll() will return POLLIN. Then we call splice(sock, pipe), but the
pipe is full, so we get -EAGAIN, which makes the application think it needs
to poll again, and since no data has moved, it will immediately call
splice() again. In my early experiments, I'm pretty sure I already came
across such a situation. Shouldn't we return -EBUSY or -ENOSPC in this case?
I can work on patches if there is a consensus around those issues, and this
will also help me better understand the splice internals. I just don't want
to work in the wrong direction for nothing.
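
To make the scenario concrete, here is a sketch of the loop that would spin
(my own illustration, assuming splice() really does report a full pipe as
EAGAIN as described above; the names and the 64 kB length are arbitrary):

#define _GNU_SOURCE
#include <fcntl.h>
#include <poll.h>
#include <errno.h>

/* poll() keeps returning POLLIN because the socket data has not moved,
 * and EAGAIN cannot be told apart from "no data yet", so the loop spins. */
static void spin_hazard(int sock, int pipe_wr)
{
	struct pollfd pfd = { .fd = sock, .events = POLLIN };

	while (poll(&pfd, 1, -1) > 0) {
		if (splice(sock, NULL, pipe_wr, NULL, 65536,
			   SPLICE_F_MOVE | SPLICE_F_NONBLOCK) < 0 &&
		    errno == EAGAIN)
			continue;	/* full pipe or empty socket? no way to know */
		/* normal forwarding of the spliced data would go here */
	}
}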

Thanks,
Willy

Willy Tarreau Jan. 8, 2009, 10:20 p.m. UTC | #5
On Thu, Jan 08, 2009 at 01:55:15PM -0800, David Miller wrote:
> From: Ben Mansell <ben@zeus.com>
> Date: Thu, 08 Jan 2009 21:50:44 +0000
> 
> > > From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> > > From: Willy Tarreau <w@1wt.eu>
> > > Date: Thu, 8 Jan 2009 17:10:13 +0100
> > > Subject: [PATCH] tcp: splice as many packets as possible at once
> > > Currently, in non-blocking mode, tcp_splice_read() returns after
> > > splicing one segment regardless of the len argument. This results
> > > in low performance and very high overhead due to syscall rate when
> > > splicing from interfaces which do not support LRO.
> > > The fix simply consists in not breaking out of the loop after the
> > > first read. That way, we can read up to the size requested by the
> > > caller and still return when there is no data left.
> > > Performance has significantly improved with this fix, with the
> > > number of calls to splice() divided by about 20, and CPU usage
> > > dropped from 100% to 75%.
> > > 
> > 
> > I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput.
> > 
> > Tested-by: Ben Mansell <ben@zeus.com>
> 
> I'm not applying this until someone explains to me why
> we should remove this test from the splice receive but
> keep it in the tcp_recvmsg() code where it has been
> essentially forever.

In my opinion, the code structure differs between the two functions. In
tcp_recvmsg(), we test for it only if (copied > 0), where copied is the total
amount of data processed since entering the function. If we removed the test
there, we could not break out of the loop once we have copied something. In
tcp_splice_read(), the test remains in the (!ret) code path, where ret is the
number of bytes processed by the last iteration, so the test is still
performed regardless of what has previously been transferred.

So in summary, without this test tcp_splice_read() gets back to the top of
the loop, and if __tcp_splice_read() returns 0, we break out of the loop.

I don't know if my explanation is clear or not; it's easier to follow the
loops with the code in front of you :-/

Willy

Eric Dumazet Jan. 9, 2009, 6:47 a.m. UTC | #6
David Miller wrote:
> From: Ben Mansell <ben@zeus.com>
> Date: Thu, 08 Jan 2009 21:50:44 +0000
> 
>>> From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
>>> From: Willy Tarreau <w@1wt.eu>
>>> Date: Thu, 8 Jan 2009 17:10:13 +0100
>>> Subject: [PATCH] tcp: splice as many packets as possible at once
>>> Currently, in non-blocking mode, tcp_splice_read() returns after
>>> splicing one segment regardless of the len argument. This results
>>> in low performance and very high overhead due to syscall rate when
>>> splicing from interfaces which do not support LRO.
>>> The fix simply consists in not breaking out of the loop after the
>>> first read. That way, we can read up to the size requested by the
>>> caller and still return when there is no data left.
>>> Performance has significantly improved with this fix, with the
>>> number of calls to splice() divided by about 20, and CPU usage
>>> dropped from 100% to 75%.
>>>
>> I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput.
>>
>> Tested-by: Ben Mansell <ben@zeus.com>
> 
> I'm not applying this until someone explains to me why
> we should remove this test from the splice receive but
> keep it in the tcp_recvmsg() code where it has been
> essentially forever.

I found this patch useful in my testing, but had a feeling something
was not complete. If the goal is to reduce the number of splice() calls,
we should also reduce the number of wakeups. If splice() is used in
non-blocking mode, there is nothing we can do here, of course, since the
application will use a poll()/select()/epoll() event before calling
splice(). A good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve
things.

I tested this on the current tree and it is not working: we still get
one wakeup for each frame (the ethernet link is a 100 Mb/s one):

bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(6, 5)                            = 0
accept(6, 0, NULL)                      = 7
setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [32768], 4) = 0
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1024
splice(0x3, 0, 0x5, 0, 0x400, 0x5)      = 1024
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460


About tcp_recvmsg(), we might remove the "!timeo" test as well, but
more testing is needed. Keep in mind that if an application provides
a large buffer to the tcp_recvmsg() call, removing the test will reduce
the number of syscalls but might use more D-cache. It could reduce
performance on old CPUs. With the splice() call, we expect not to
copy memory and trash the D-cache, and with pipe buffers limited to 16,
we cope with a limited working set.



Willy Tarreau Jan. 9, 2009, 7:04 a.m. UTC | #7
On Fri, Jan 09, 2009 at 07:47:16AM +0100, Eric Dumazet wrote:
> > I'm not applying this until someone explains to me why
> > we should remove this test from the splice receive but
> > keep it in the tcp_recvmsg() code where it has been
> > essentially forever.
> 
> I found this patch usefull in my testings, but had a feeling something
> was not complete. If the goal is to reduce number of splice() calls,
> we also should reduce number of wakeups. If splice() is used in non
> blocking mode, nothing we can do here of course, since the application
> will use a poll()/select()/epoll() event before calling splice(). A
> good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things.
> 
> I tested this on current tree and it is not working : we still have
> one wakeup for each frame (ethernet link is a 100 Mb/s one)

Well, it simply means that data is not coming in fast enough compared to
the tiny amount of work you have to perform to forward it; there's nothing
wrong with that. It is important in my opinion not to wait for *enough* data
to come in, otherwise it might become impossible to forward small chunks.
I mean, if there are only 300 bytes left to forward, we must not wait
indefinitely for more data to come; we must forward those 300 bytes.

In your case below, it simply means that the performance improvement brought
by splice will be really minor, because you only avoid two memory copies,
which are ridiculously cheap at 100 Mbps. If you changed your program
to use recv/send, you would observe the exact same pattern, because as soon
as poll() wakes you up, you still only have one frame in the system buffers.
On a small machine I have here (a 500 MHz Geode), I easily have multiple
frames queued at 100 Mbps, because when epoll() wakes me up I have traffic
on something like 10-100 sockets, and by the time I process the first ones,
the later ones have had time to queue up more data.

> bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> listen(6, 5)                            = 0
> accept(6, 0, NULL)                      = 7
> setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [32768], 4) = 0
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1024
> splice(0x3, 0, 0x5, 0, 0x400, 0x5)      = 1024
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460

How much CPU is used for that? Probably something like 1-2%? If you could
retry with 100-1000 concurrent sessions, you would observe more traffic on
each socket, which is where the gain from splice becomes really important.

> About tcp_recvmsg(), we might also remove the "!timeo" test as well,
> more testings are needed.

No, right now we can't (or we must at least move it somewhere else), because
once at least one byte has been received (copied != 0), no other check
will break out of the loop (or at least I have not found one).

> But remind that if an application provides
> a large buffer to tcp_recvmsg() call, removing the test will reduce
> the number of syscalls but might use more DCACHE. It could reduce
> performance on old cpus. With splice() call, we expect to not
> copy memory and trash DCACHE, and pipe buffers being limited to 16,
> we cope with a limited working set. 

That's an interesting point indeed!

Regards,
Willy

Eric Dumazet Jan. 9, 2009, 7:28 a.m. UTC | #8
Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 07:47:16AM +0100, Eric Dumazet wrote:
>>> I'm not applying this until someone explains to me why
>>> we should remove this test from the splice receive but
>>> keep it in the tcp_recvmsg() code where it has been
>>> essentially forever.
>> I found this patch usefull in my testings, but had a feeling something
>> was not complete. If the goal is to reduce number of splice() calls,
>> we also should reduce number of wakeups. If splice() is used in non
>> blocking mode, nothing we can do here of course, since the application
>> will use a poll()/select()/epoll() event before calling splice(). A
>> good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things.
>>
>> I tested this on current tree and it is not working : we still have
>> one wakeup for each frame (ethernet link is a 100 Mb/s one)
> 
> Well, it simply means that data are not coming in fast enough compared to
> the tiny amount of work you have to perform to forward them, there's nothing
> wrong with that. It is important in my opinion not to wait for *enough* data
> to come in, otherwise it might become impossible to forward small chunks.
> I mean, if there are only 300 bytes left to forward, we must not wait
> indefinitely for more data to come, we must forward those 300 bytes.
> 
> In your case below, it simply means that the performance improvement brought
> by splice will be really minor because you'll just avoid 2 memory copies,
> which are ridiculously cheap at 100 Mbps. If you would change your program
> to use recv/send, you would observe the exact same pattern, because as soon
> as poll() wakes you up, you still only have one frame in the system buffers.
> On a small machine I have here (geode 500 MHz), I easily have multiple
> frames queued at 100 Mbps because when epoll() wakes me up, I have traffic
> on something like 10-100 sockets, and by the time I process the first ones,
> the later have time to queue up more data.

My point is to use Gigabit or 10Gb links and hundreds or thousands of flows :)

But if it doesn't work on a single flow, it won't work on many :)

I tried my test program with a Gb link and one flow, and splice() calls return
23000 bytes on average, using a little too much CPU: if poll() could wait a
little bit longer, the CPU could be available for other tasks.

If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, [32768], 4),
it would be good if the kernel was smart enough to reduce the number of wakeups.

(The next blocking point is the fixed limit of 16 pages per pipe, but that's
another story.)

> 
>> About tcp_recvmsg(), we might also remove the "!timeo" test as well,
>> more testings are needed.
> 
> No right now we can't (we must move it somewhere else at least). Because
> once at least one byte has been received (copied != 0), no other check
> will break out of the loop (or at least I have not found it).
> 

Of course we can't remove the test totally, but we can change the logic so
that several skbs might be used/consumed per tcp_recvmsg() call, like your
patch did for splice().

Let's focus on functional changes, not on implementation details :)


Willy Tarreau Jan. 9, 2009, 7:42 a.m. UTC | #9
On Fri, Jan 09, 2009 at 08:28:09AM +0100, Eric Dumazet wrote:
> My point is to use Gigabit links or 10Gb links and hundred or thousand of flows :)
> 
> But if it doesnt work on a single flow, it wont work on many :)

Yes it will, precisely because during the time you spend processing flow #1,
you're still receiving data for flow #2. I really invite you to try. That's
what I've observed over years of writing userland proxies.

> I tried my test program with a Gb link, one flow, and got splice() calls returns 23000 bytes
> in average, using a litle too much of CPU : If poll() could wait a litle bit more, CPU
> could be available for other tasks.

I also observe 23000 bytes on average on gigabit, which is very good
(only about 5000 calls per second). CPU usage is lower than with
recv/send, and I'd like to be able to run some profiling, because
I have observed very different performance patterns depending on the
network cards used. Generally, almost all the time is spent in softirqs.

It's easy to make poll() wait a little bit more: call it later and do
your work before calling it. Also, epoll_wait() lets you ask it to
return only a limited number of FDs. This really improves data gathering.
I generally observe the best performance between 30-200 FDs per call, even
with 10000 concurrent connections. While I process the first
200 FDs, data is accumulating in the other sockets' buffers.
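
As an illustration of that batching (a sketch under my own assumptions:
handle_fd() is an assumed application callback and the 200-event cap is just
the figure mentioned above), limiting the number of events returned per
epoll_wait() call looks like this:

#include <sys/epoll.h>

#define MAX_EVENTS 200			/* handle at most 200 fds per wakeup */

extern void handle_fd(int fd);		/* assumed application callback */

/* While the first fds of a batch are being served, the kernel keeps
 * queuing data on the other sockets, so each later splice()/recv()
 * moves a bigger chunk per syscall. */
static void event_loop(int epfd)
{
	struct epoll_event events[MAX_EVENTS];
	int i, n;

	for (;;) {
		n = epoll_wait(epfd, events, MAX_EVENTS, -1);
		if (n < 0)
			break;
		for (i = 0; i < n; i++)
			handle_fd(events[i].data.fd);
	}
}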

> If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, [32768], 4), it
> would be good if kernel was smart enough and could reduce number of wakeups.

Yes, I agree about that. But my comment was about not making this
behaviour mandatory for splice(). Letting the application choose is
the way to go, of course.

> (Next blocking point is the fixed limit of 16 pages per pipe, but
> thats another story)

Yes, but it's not always easy to guess how much data you can feed
into the pipe. It seems that depending on how the segments are
gathered, you can store between 16 segments and 64 kB. I have
observed some cases in blocking mode where I could not push more
than a few kbytes with a small MSS, indicating to me that all those
segments were each on a distinct page. I don't know precisely how
that's handled internally.

> >> About tcp_recvmsg(), we might also remove the "!timeo" test as well,
> >> more testings are needed.
> > 
> > No right now we can't (we must move it somewhere else at least). Because
> > once at least one byte has been received (copied != 0), no other check
> > will break out of the loop (or at least I have not found it).
> > 
> 
> Of course we cant remove the test totally, but change the logic so that several skb
> might be used/consumed per tcp_recvmsg() call, like your patch did for splice()

OK, I initially understood that you suggested we could simply remove
it like I did for splice.

> Lets focus on functional changes, not on implementation details :)

Agreed :-)

Willy

Eric Dumazet Jan. 9, 2009, 3:42 p.m. UTC | #10
Eric Dumazet wrote:
> David Miller wrote:
>> I'm not applying this until someone explains to me why
>> we should remove this test from the splice receive but
>> keep it in the tcp_recvmsg() code where it has been
>> essentially forever.

Reading tcp_recvmsg() again, I found it is already able to eat several skbs
even in non-blocking mode:

setsockopt(5, SOL_SOCKET, SO_RCVLOWAT, [61440], 4) = 0
ioctl(5, FIONBIO, [1])                  = 0
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536


David, if you referred to the code at line 1374 of net/ipv4/tcp.c, I believe
there is no issue with it. We really want to break out of this loop if !timeo.

Willy's patch makes splice() behave like tcp_recvmsg(), but we might call
tcp_cleanup_rbuf() several times, with copied=1460 (once per frame processed).

I wonder if the right fix should be done in tcp_read_sock(): it is the
one that should eat several skbs IMHO, if we want optimal ACK generation.

We break out of its loop at line 1246

if (!desc->count) /* this test is always true */
	break;

(__tcp_splice_read() sets count to 0 right before calling tcp_read_sock().)

So the code at line 1246 (in tcp_read_sock()) seems wrong, or at least
pessimistic.



Willy Tarreau Jan. 9, 2009, 6:54 p.m. UTC | #11
Hi Eric,

On Fri, Jan 09, 2009 at 04:42:44PM +0100, Eric Dumazet wrote:
(...)
> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call
> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)
> 
> I wonder if the right fix should be done in tcp_read_sock() : this is the
> one who should eat several skbs IMHO, if we want optimal ACK generation.
> 
> We break out of its loop at line 1246
> 
> if (!desc->count) /* this test is always true */
> 	break;
> 
> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
> 
> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.

That's a very interesting discovery you made here. I have made
measurements with this line commented out just to get an idea. The
hardest part was finding a CPU-bound machine; in the end I slowed my
laptop down to 300 MHz (in fact 600 MHz with 50% throttling, but let's
call that 300). That way, I cannot saturate the PCI-based tg3 and
I can observe the effects of various changes on the data rate.

- original tcp_splice_read(), with "!timeo"    : 24.1 MB/s
- modified tcp_splice_read(), without "!timeo" : 32.5 MB/s (+34%)
- original with line #1246 commented out       : 34.5 MB/s (+43%)

So you're right, avoiding calling tcp_read_sock() all the time
gives a nice performance boost.

Also, I found that tcp_splice_read() behaves like this when breaking
out of the loop:

    lock_sock();
    while () {
         ...
         __tcp_splice_read();
         ...
         release_sock();
         lock_sock();
         if (break condition)
              break;
    }
    release_sock();

Which means that when breaking out of the loop on (!timeo)
with ret > 0, we do release_sock/lock_sock/release_sock.

So I tried a minor modification, consisting in moving the
test before release_sock(), and leaving !timeo there with
line #1246 commented out. That's a noticeable winner, as
the data rate went up to 35.7 MB/s (+48%).

Also, in your second mail you say that your change
might return more data than requested by the user. I can't
see why; could you please explain, as I'm still quite
ignorant in this area?

Thanks,
Willy

Eric Dumazet Jan. 9, 2009, 8:51 p.m. UTC | #12
Willy Tarreau wrote:
> Hi Eric,
> 
> On Fri, Jan 09, 2009 at 04:42:44PM +0100, Eric Dumazet wrote:
> (...)
>> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call
>> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)
>>
>> I wonder if the right fix should be done in tcp_read_sock() : this is the
>> one who should eat several skbs IMHO, if we want optimal ACK generation.
>>
>> We break out of its loop at line 1246
>>
>> if (!desc->count) /* this test is always true */
>> 	break;
>>
>> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
>>
>> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.
> 
> That's a very interesting discovery that you made here. I have made
> mesurements with this line commented out just to get an idea. The
> hardest part was to find a CPU-bound machine. Finally I slowed my
> laptop down to 300 MHz (in fact, 600 with throttle 50%, but let's
> call that 300). That way, I cannot saturate the PCI-based tg3 and
> I can observe the effects of various changes on the data rate.
> 
> - original tcp_splice_read(), with "!timeo"    : 24.1 MB/s
> - modified tcp_splice_read(), without "!timeo" : 32.5 MB/s (+34%)
> - original with line #1246 commented out       : 34.5 MB/s (+43%)
> 
> So you're right, avoiding calling tcp_read_sock() all the time
> gives a nice performance boost.
> 
> Also, I found that tcp_splice_read() behaves like this when breaking
> out of the loop :
> 
>     lock_sock();
>     while () {
>          ...
>          __tcp_splice_read();
>          ...
>          release_sock();
>          lock_sock();
>          if (break condition)
>               break;
>     }
>     release_sock();
> 
> Which means that when breaking out of the loop on (!timeo)
> with ret > 0, we do release_sock/lock_sock/release_sock.
> 
> So I tried a minor modification, consisting in moving the
> test before release_sock(), and leaving !timeo there with
> line #1246 commented out. That's a noticeable winner, as
> the data rate went up to 35.7 MB/s (+48%).
> 
> Also, in your second mail, you're saying that your change
> might return more data than requested by the user. I can't
> find why, could you please explain to me, as I'm still quite
> ignorant in this area ?

Well, I just tested various user programs and indeed got this
strange result:

Here I call splice() with len=1000 (0x3e8), and you can see
it returns 1460 at the second call.

I suspect a bug in the splice code that my patch just exposed.

pipe([3, 4])                            = 0
open("/dev/null", O_RDWR|O_CREAT|O_TRUNC, 0644) = 5
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(6, 5)                            = 0
accept(6, 0, NULL)                      = 7
setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [1000], 4) = 0
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x3e8, 0x3)      = 1000                OK
splice(0x3, 0, 0x5, 0, 0x3e8, 0x5)      = 1000
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x3e8, 0x3)      = 1460                Oh well... 1460 > 1000 !!!
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x3e8, 0x3)      = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460



Willy Tarreau Jan. 9, 2009, 9:24 p.m. UTC | #13
On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
(...)
> > Also, in your second mail, you're saying that your change
> > might return more data than requested by the user. I can't
> > find why, could you please explain to me, as I'm still quite
> > ignorant in this area ?
> 
> Well, I just tested various user programs and indeed got this
> strange result :
> 
> Here I call splice() with len=1000 (0x3e8), and you can see
> it gives a result of 1460 at the second call.

huh, not nice indeed!

While looking at the code to see how this could be possible, I
came across this minor thing (unrelated, IMHO):

	if (__skb_splice_bits(skb, &offset, &tlen, &spd))
		goto done;
>>>>>>	else if (!tlen)    <<<<<<
		goto done;

	/*
	 * now see if we have a frag_list to map
	 */
	if (skb_shinfo(skb)->frag_list) {
		struct sk_buff *list = skb_shinfo(skb)->frag_list;

		for (; list && tlen; list = list->next) {
			if (__skb_splice_bits(list, &offset, &tlen, &spd))
				break;
		}
	}

    done:

On the highlighted line above, we'd better remove the else and leave a plain
"if (!tlen)". Otherwise, when the first call to __skb_splice_bits() zeroes
tlen, we still enter the if and evaluate the for condition for nothing. But
let's leave that for later.
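
In other words, the shape being suggested would simply be (a sketch of the
same lines with the else dropped, not a submitted patch):

	if (__skb_splice_bits(skb, &offset, &tlen, &spd))
		goto done;
	if (!tlen)	/* plain test: skip the frag_list walk when nothing is left */
		goto done;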

> I suspect a bug in splice code, that my patch just exposed.

I've checked in skb_splice_bits() and below and can't see how we can move
more than the requested len.

However, with your change, I don't clearly see how we break out of
the loop in tcp_read_sock(). Maybe we first read 1000 bytes, then loop again
and read the remaining data? I suspect that we should at least exit when
((struct tcp_splice_state *)desc->arg.data)->len == 0.

At least that's something easy to add just before or after the !desc->count
check, for a test.

Regards,
Willy

David Miller Jan. 13, 2009, 11:08 p.m. UTC | #14
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 8 Jan 2009 23:20:39 +0100

> On Thu, Jan 08, 2009 at 01:55:15PM -0800, David Miller wrote:
> > I'm not applying this until someone explains to me why
> > we should remove this test from the splice receive but
> > keep it in the tcp_recvmsg() code where it has been
> > essentially forever.
> 
> In my opinion, the code structure is different between both functions. In
> tcp_recvmsg(), we test for it if (copied > 0), where copied is the sum of
> all data which have been processed since the entry in the function. If we
> removed the test here, we could not break out of the loop once we have
> copied something. In tcp_splice_read(), the test is still present in the
> (!ret) code path, where ret is the last number of bytes processed, so the
> test is still performed regardless of what has been previously transferred.
> 
> So in summary, in tcp_splice_read without this test, we get back to the
> top of the loop, and if __tcp_splice_read() returns 0, then we break out
> of the loop.

Ok, I see what you're saying: the !timeo check is only
necessary when the receive queue has been exhausted.
David Miller Jan. 13, 2009, 11:26 p.m. UTC | #15
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 09 Jan 2009 07:47:16 +0100

> I found this patch usefull in my testings, but had a feeling something
> was not complete. If the goal is to reduce number of splice() calls,
> we also should reduce number of wakeups. If splice() is used in non
> blocking mode, nothing we can do here of course, since the application
> will use a poll()/select()/epoll() event before calling splice(). A
> good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things.

Splice read does not handle SO_RCVLOWAT like tcp_recvmsg() does.

We should probably add a:

	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);

and check 'target' against 'spliced' in the main loop of
tcp_splice_read().
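
Expressed in the pseudo-code shape Willy posted earlier in the thread, the
suggestion would sit roughly like this (a sketch only, not a tested patch;
everything except the sock_rcvlowat() line above is elided or assumed):

	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);

	lock_sock(sk);
	while () {
		...
		ret = __tcp_splice_read(sk, &tss);
		...
		spliced += ret;
		...
		release_sock(sk);
		lock_sock(sk);
		if (spliced >= target || other break conditions)
			break;
	}
	release_sock(sk);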

> About tcp_recvmsg(), we might also remove the "!timeo" test as well,
> more testings are needed. But remind that if an application provides
> a large buffer to tcp_recvmsg() call, removing the test will reduce
> the number of syscalls but might use more DCACHE. It could reduce
> performance on old cpus. With splice() call, we expect to not
> copy memory and trash DCACHE, and pipe buffers being limited to 16,
> we cope with a limited working set. 

I sometimes have a suspicion that we can remove this test too, but it's
not really that clear.

If an application is doing non-blocking reads and cares about
latency, it shouldn't be providing huge buffers.  This much I
agree with, but...

If you look at where this check is placed in the recvmsg() case, it is
done after we have verified that there is no socket backlog.

		if (copied >= target && !sk->sk_backlog.tail)
			break;

		if (copied) {
			if (sk->sk_err ||
			    sk->sk_state == TCP_CLOSE ||
			    (sk->sk_shutdown & RCV_SHUTDOWN) ||
			    !timeo ||
			    signal_pending(current))
				break;
		} else {

So either:

1) We haven't met the target.  And note that target is one unless
   the user makes an explicit receive low-water setting.

2) Or there is no backlog.

When we get to the 'if (copied)' check.

You can view this "!timeo" check as meaning "non-blocking".  When
we get to it, we are guaranteed that we haven't met the target
and we have no backlog.  So it is absolutely appropriate to break
out of recvmsg() processing here if non-blocking.

There is a lot of logic and feature handling in tcp_splice_read() and
that's why the semantics of "!timeo" cases are not being handled
properly here.
David Miller Jan. 13, 2009, 11:27 p.m. UTC | #16
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 09 Jan 2009 08:28:09 +0100

> If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT,
> [32768], 4), it would be good if kernel was smart enough and could
> reduce number of wakeups.

Right, and as I pointed out in previous replies the problem is
that splice() receive in TCP doesn't check the low water mark
at all.
David Miller Jan. 13, 2009, 11:31 p.m. UTC | #17
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 09 Jan 2009 16:42:44 +0100

> David, if you referred to code at line 1374 of net/ipv4/tcp.c, I
> believe there is no issue with it. We really want to break from this
> loop if !timeo .

Correct, I agree, and I gave some detailed analysis of this in
another response :-)

> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call
> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)

"Like", sure, but not the same since splice() lacks the low-water
and backlog checks.

> I wonder if the right fix should be done in tcp_read_sock() : this is the
> one who should eat several skbs IMHO, if we want optimal ACK generation.
> 
> We break out of its loop at line 1246
> 
> if (!desc->count) /* this test is always true */
> 	break;
> 
> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
> 
> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.

Yes, that's very odd.
Eric Dumazet Jan. 13, 2009, 11:35 p.m. UTC | #18
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 09 Jan 2009 08:28:09 +0100
> 
>> If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT,
>> [32768], 4), it would be good if kernel was smart enough and could
>> reduce number of wakeups.
> 
> Right, and as I pointed out in previous replies the problem is
> that splice() receive in TCP doesn't check the low water mark
> at all.
> 
> 

Yes, I understand, but if splice() is running, a wakeup already occurred, and
there is no need to check whether the wakeup was good or not... just proceed
and consume some skbs, since we were already woken up.

Also, an application might set up a high SO_RCVLOWAT but want to splice()
only a few bytes, so RCVLOWAT is only a hint for the thing that performs the
wakeup, not for the consumer.



Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 35bcddf..80261b4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -615,7 +615,7 @@  ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 		lock_sock(sk);
 
 		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
-		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
+		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
 		    signal_pending(current))
 			break;
 	}