Patchwork phonet: Check input from user before allocating

login
register
mail settings
Submitter Sasha Levin
Date April 2, 2012, 8:31 p.m.
Message ID <1333398660-11552-1-git-send-email-levinsasha928@gmail.com>
Download mbox | patch
Permalink /patch/150199/
State Changes Requested
Delegated to: David Miller
Headers show

Comments

Rémi Denis-Courmont - April 2, 2012, 7 p.m.
Le lundi 2 avril 2012 23:31:00 Sasha Levin, vous avez écrit :
> A phonet packet is limited to USHRT_MAX bytes, this is never checked during
> tx which means that the user can specify any size he wishes, and the kernel
> will attempt to allocate that size.

> 
> In the good case, it'll lead to the following warning, but it may also
> cause the kernel to kick in the OOM and kill a random task on the server.
> 
> [ 8921.744094] WARNING: at mm/page_alloc.c:2255
> __alloc_pages_slowpath+0x65/0x730() [ 8921.749770] Pid: 5081, comm:
> trinity Tainted: G        W    3.4.0-rc1-next-20120402-sasha #46 [
> 8921.756672] Call Trace:
> [ 8921.758185]  [<ffffffff810b2ba7>] warn_slowpath_common+0x87/0xb0
> [ 8921.762868]  [<ffffffff810b2be5>] warn_slowpath_null+0x15/0x20
> [ 8921.765399]  [<ffffffff8117eae5>] __alloc_pages_slowpath+0x65/0x730
> [ 8921.769226]  [<ffffffff81179c8a>] ? zone_watermark_ok+0x1a/0x20
> [ 8921.771686]  [<ffffffff8117d045>] ? get_page_from_freelist+0x625/0x660
> [ 8921.773919]  [<ffffffff8117f3a8>] __alloc_pages_nodemask+0x1f8/0x240
> [ 8921.776248]  [<ffffffff811c03e0>] kmalloc_large_node+0x70/0xc0
> [ 8921.778294]  [<ffffffff811c4bd4>] __kmalloc_node_track_caller+0x34/0x1c0
> [ 8921.780847]  [<ffffffff821b0e3c>] ? sock_alloc_send_pskb+0xbc/0x260
> [ 8921.783179]  [<ffffffff821b3c65>] __alloc_skb+0x75/0x170
> [ 8921.784971]  [<ffffffff821b0e3c>] sock_alloc_send_pskb+0xbc/0x260
> [ 8921.787111]  [<ffffffff821b002e>] ? release_sock+0x7e/0x90
> [ 8921.788973]  [<ffffffff821b0ff0>] sock_alloc_send_skb+0x10/0x20
> [ 8921.791052]  [<ffffffff824cfc20>] pep_sendmsg+0x60/0x380
> [ 8921.792931]  [<ffffffff824cb4a6>] ? pn_socket_bind+0x156/0x180
> [ 8921.794917]  [<ffffffff824cb50f>] ? pn_socket_autobind+0x3f/0x90
> [ 8921.797053]  [<ffffffff824cb63f>] pn_socket_sendmsg+0x4f/0x70
> [ 8921.798992]  [<ffffffff821ab8e7>] sock_aio_write+0x187/0x1b0
> [ 8921.801395]  [<ffffffff810e325e>] ? sub_preempt_count+0xae/0xf0
> [ 8921.803501]  [<ffffffff8111842c>] ? __lock_acquire+0x42c/0x4b0
> [ 8921.805505]  [<ffffffff821ab760>] ? __sock_recv_ts_and_drops+0x140/0x140
> [ 8921.807860]  [<ffffffff811e07cc>] do_sync_readv_writev+0xbc/0x110
> [ 8921.809986]  [<ffffffff811958e7>] ? might_fault+0x97/0xa0
> [ 8921.811998]  [<ffffffff817bd99e>] ? security_file_permission+0x1e/0x90
> [ 8921.814595]  [<ffffffff811e17e2>] do_readv_writev+0xe2/0x1e0
> [ 8921.816702]  [<ffffffff810b8dac>] ? do_setitimer+0x1ac/0x200
> [ 8921.818819]  [<ffffffff810e2ec1>] ? get_parent_ip+0x11/0x50
> [ 8921.820863]  [<ffffffff810e325e>] ? sub_preempt_count+0xae/0xf0
> [ 8921.823318]  [<ffffffff811e1926>] vfs_writev+0x46/0x60
> [ 8921.825219]  [<ffffffff811e1a3f>] sys_writev+0x4f/0xb0
> [ 8921.827127]  [<ffffffff82658039>] system_call_fastpath+0x16/0x1b
> [ 8921.829384] ---[ end trace dffe390f30db9eb7 ]---
> 
> Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> ---
>  net/phonet/pep.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/net/phonet/pep.c b/net/phonet/pep.c
> index 9f60008..caee99e 100644
> --- a/net/phonet/pep.c
> +++ b/net/phonet/pep.c
> @@ -1130,6 +1130,9 @@ static int pep_sendmsg(struct kiocb *iocb, struct
> sock *sk, int flags = msg->msg_flags;
>  	int err, done;
> 
> +	if (len > USHRT_MAX)
> +		return -E2BIG;

I think EMSGSIZE is specified in that case.

> +
>  	if ((msg->msg_flags & ~(MSG_DONTWAIT|MSG_EOR|MSG_NOSIGNAL|
>  				MSG_CMSG_COMPAT)) ||
>  			!(msg->msg_flags & MSG_EOR))
Rémi Denis-Courmont - April 2, 2012, 7:01 p.m.
Le lundi 2 avril 2012 23:31:00 Sasha Levin, vous avez écrit :
> A phonet packet is limited to USHRT_MAX bytes, this is never checked during
> tx which means that the user can specify any size he wishes, and the kernel
> will attempt to allocate that size.

Does this really solve the problem?  I guess 128kb is still possible with 
USHRT_MAX plus skbuff overhead, which might still trigger OOM relatively 
easily once the memory gets sufficiently fragmented.

How does UDP deal with this?
Sasha Levin - April 2, 2012, 8:31 p.m.
A phonet packet is limited to USHRT_MAX bytes, this is never checked during
tx which means that the user can specify any size he wishes, and the kernel
will attempt to allocate that size.

In the good case, it'll lead to the following warning, but it may also cause
the kernel to kick in the OOM and kill a random task on the server.

[ 8921.744094] WARNING: at mm/page_alloc.c:2255 __alloc_pages_slowpath+0x65/0x730()
[ 8921.749770] Pid: 5081, comm: trinity Tainted: G        W    3.4.0-rc1-next-20120402-sasha #46
[ 8921.756672] Call Trace:
[ 8921.758185]  [<ffffffff810b2ba7>] warn_slowpath_common+0x87/0xb0
[ 8921.762868]  [<ffffffff810b2be5>] warn_slowpath_null+0x15/0x20
[ 8921.765399]  [<ffffffff8117eae5>] __alloc_pages_slowpath+0x65/0x730
[ 8921.769226]  [<ffffffff81179c8a>] ? zone_watermark_ok+0x1a/0x20
[ 8921.771686]  [<ffffffff8117d045>] ? get_page_from_freelist+0x625/0x660
[ 8921.773919]  [<ffffffff8117f3a8>] __alloc_pages_nodemask+0x1f8/0x240
[ 8921.776248]  [<ffffffff811c03e0>] kmalloc_large_node+0x70/0xc0
[ 8921.778294]  [<ffffffff811c4bd4>] __kmalloc_node_track_caller+0x34/0x1c0
[ 8921.780847]  [<ffffffff821b0e3c>] ? sock_alloc_send_pskb+0xbc/0x260
[ 8921.783179]  [<ffffffff821b3c65>] __alloc_skb+0x75/0x170
[ 8921.784971]  [<ffffffff821b0e3c>] sock_alloc_send_pskb+0xbc/0x260
[ 8921.787111]  [<ffffffff821b002e>] ? release_sock+0x7e/0x90
[ 8921.788973]  [<ffffffff821b0ff0>] sock_alloc_send_skb+0x10/0x20
[ 8921.791052]  [<ffffffff824cfc20>] pep_sendmsg+0x60/0x380
[ 8921.792931]  [<ffffffff824cb4a6>] ? pn_socket_bind+0x156/0x180
[ 8921.794917]  [<ffffffff824cb50f>] ? pn_socket_autobind+0x3f/0x90
[ 8921.797053]  [<ffffffff824cb63f>] pn_socket_sendmsg+0x4f/0x70
[ 8921.798992]  [<ffffffff821ab8e7>] sock_aio_write+0x187/0x1b0
[ 8921.801395]  [<ffffffff810e325e>] ? sub_preempt_count+0xae/0xf0
[ 8921.803501]  [<ffffffff8111842c>] ? __lock_acquire+0x42c/0x4b0
[ 8921.805505]  [<ffffffff821ab760>] ? __sock_recv_ts_and_drops+0x140/0x140
[ 8921.807860]  [<ffffffff811e07cc>] do_sync_readv_writev+0xbc/0x110
[ 8921.809986]  [<ffffffff811958e7>] ? might_fault+0x97/0xa0
[ 8921.811998]  [<ffffffff817bd99e>] ? security_file_permission+0x1e/0x90
[ 8921.814595]  [<ffffffff811e17e2>] do_readv_writev+0xe2/0x1e0
[ 8921.816702]  [<ffffffff810b8dac>] ? do_setitimer+0x1ac/0x200
[ 8921.818819]  [<ffffffff810e2ec1>] ? get_parent_ip+0x11/0x50
[ 8921.820863]  [<ffffffff810e325e>] ? sub_preempt_count+0xae/0xf0
[ 8921.823318]  [<ffffffff811e1926>] vfs_writev+0x46/0x60
[ 8921.825219]  [<ffffffff811e1a3f>] sys_writev+0x4f/0xb0
[ 8921.827127]  [<ffffffff82658039>] system_call_fastpath+0x16/0x1b
[ 8921.829384] ---[ end trace dffe390f30db9eb7 ]---

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
---
 net/phonet/pep.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)
David Miller - April 2, 2012, 9:38 p.m.
From: "Rémi Denis-Courmont" <remi@remlab.net>
Date: Mon, 2 Apr 2012 22:00:40 +0300

> Le lundi 2 avril 2012 23:31:00 Sasha Levin, vous avez écrit :
>> +	if (len > USHRT_MAX)
>> +		return -E2BIG;
> 
> I think EMSGSIZE is specified in that case.

Agreed.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - April 2, 2012, 9:40 p.m.
From: "Rémi Denis-Courmont" <remi@remlab.net>
Date: Mon, 2 Apr 2012 22:01:40 +0300

> Le lundi 2 avril 2012 23:31:00 Sasha Levin, vous avez écrit :
>> A phonet packet is limited to USHRT_MAX bytes, this is never checked during
>> tx which means that the user can specify any size he wishes, and the kernel
>> will attempt to allocate that size.
> 
> Does this really solve the problem?  I guess 128kb is still possible with 
> USHRT_MAX plus skbuff overhead, which might still trigger OOM relatively 
> easily once the memory gets sufficiently fragmented.
> 
> How does UDP deal with this?

UDP generates a fragment list of MTU sized SKBs.

Phonet could avoid the large allocations by building page based
SKBs.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 3, 2012, 1:53 a.m.
On Mon, 2012-04-02 at 17:40 -0400, David Miller wrote:
> From: "Rémi Denis-Courmont" <remi@remlab.net>
> Date: Mon, 2 Apr 2012 22:01:40 +0300
> 
> > Le lundi 2 avril 2012 23:31:00 Sasha Levin, vous avez écrit :
> >> A phonet packet is limited to USHRT_MAX bytes, this is never checked during
> >> tx which means that the user can specify any size he wishes, and the kernel
> >> will attempt to allocate that size.
> > 
> > Does this really solve the problem?  I guess 128kb is still possible with 
> > USHRT_MAX plus skbuff overhead, which might still trigger OOM relatively 
> > easily once the memory gets sufficiently fragmented.
> > 
> > How does UDP deal with this?
> 
> UDP generates a fragment list of MTU sized SKBs.
> 
> Phonet could avoid the large allocations by building page based
> SKBs.

Not that AF_UNIX does nothing in this respect, it can use order-XX pages
for large datagrams.

(I beleve I sent a patch some time ago to address this point)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - April 3, 2012, 1:59 a.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 03:53:17 +0200

> Not that AF_UNIX does nothing in this respect, it can use order-XX pages
> for large datagrams.
> 
> (I beleve I sent a patch some time ago to address this point)

Yes, on the datagram side it's a problem.

For stream AF_UNIX sockets the allocation is capped at SKB_MAX_ALLOC
which evaluates to an order 2 page.

Overall, AF_UNIX ought to be easy to deal with since all of the
routines that copy data between userspace and SKBs can handle
segmented SKBs and thus most of the work is converting over to
sock_alloc_send_pskb() and setting data_len how we set the normal
length of sock_alloc_skb_skb() currently.

Anyways, feel free to resubmit your patch.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 3, 2012, 2:15 a.m.
On Mon, 2012-04-02 at 21:59 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 03:53:17 +0200
> 
> > Not that AF_UNIX does nothing in this respect, it can use order-XX pages
> > for large datagrams.
> > 
> > (I beleve I sent a patch some time ago to address this point)
> 
> Yes, on the datagram side it's a problem.
> 
> For stream AF_UNIX sockets the allocation is capped at SKB_MAX_ALLOC
> which evaluates to an order 2 page.
> 
> Overall, AF_UNIX ought to be easy to deal with since all of the
> routines that copy data between userspace and SKBs can handle
> segmented SKBs and thus most of the work is converting over to
> sock_alloc_send_pskb() and setting data_len how we set the normal
> length of sock_alloc_skb_skb() currently.
> 
> Anyways, feel free to resubmit your patch.

This was indeed a basic patch, but it probably can lower raw performance
on some apps, (if memory frag is not an issue) so I need to bench it. 


Any idea of a representative benchmark in dgram af_unix ?

http://patchwork.ozlabs.org/patch/114103/

I'll respin it with proper performance resuts.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - April 3, 2012, 2:23 a.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 04:15:04 +0200

> Any idea of a representative benchmark in dgram af_unix ?

Maybe you could try to make bw_unix from lmbench use
SOCK_DGRAM? :-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 3, 2012, 2:29 a.m.
On Mon, 2012-04-02 at 22:23 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 04:15:04 +0200
> 
> > Any idea of a representative benchmark in dgram af_unix ?
> 
> Maybe you could try to make bw_unix from lmbench use
> SOCK_DGRAM? :-)
> 

Yes, I also converted hackbench to SOCK_DGRAM and change its DATASIZE
from 100 to 10000 bytes per message.

Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - April 3, 2012, 2:29 a.m.
On 04/02/2012 07:23 PM, David Miller wrote:
> From: Eric Dumazet<eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 04:15:04 +0200
>
>> Any idea of a representative benchmark in dgram af_unix ?
>
> Maybe you could try to make bw_unix from lmbench use
> SOCK_DGRAM? :-)

I don't know it isn't entirely bitrotted, but there are streaming and 
datagram AF_UNIX tests in netperf - they require conditional inclusion 
via ./configure --enable-unixdomain:

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#DG_005fSTREAM

happy benchmarking,

rick jones
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 3, 2012, 2:34 a.m.
On Mon, 2012-04-02 at 19:29 -0700, Rick Jones wrote:
> On 04/02/2012 07:23 PM, David Miller wrote:
> > From: Eric Dumazet<eric.dumazet@gmail.com>
> > Date: Tue, 03 Apr 2012 04:15:04 +0200
> >
> >> Any idea of a representative benchmark in dgram af_unix ?
> >
> > Maybe you could try to make bw_unix from lmbench use
> > SOCK_DGRAM? :-)
> 
> I don't know it isn't entirely bitrotted, but there are streaming and 
> datagram AF_UNIX tests in netperf - they require conditional inclusion 
> via ./configure --enable-unixdomain:
> 
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#DG_005fSTREAM
> 

Ah yes of course, I'll try that.

BTW, do you have plans to support vmsplice()/splice() as a way to
provide 0-copy to TCP_STREAM ?

I ask this because I am currently working on fixing splice(pipe ->
socket) performance problem and need a standard benchmark to publish
performance results.

Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - April 3, 2012, 2:39 a.m.
On 04/02/2012 07:34 PM, Eric Dumazet wrote:
> On Mon, 2012-04-02 at 19:29 -0700, Rick Jones wrote:
>> I don't know it isn't entirely bitrotted, but there are streaming and
>> datagram AF_UNIX tests in netperf - they require conditional inclusion
>> via ./configure --enable-unixdomain:
>>
>> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#DG_005fSTREAM
>>
>
> Ah yes of course, I'll try that.
>
> BTW, do you have plans to support vmsplice()/splice() as a way to
> provide 0-copy to TCP_STREAM ?

I'd probably plead "too platform specific" but I've already got some 
platform-specific stuff in there like TCP_INFO so I probably cannot hide 
behind that.  Particularly if I ever do make "native" WSA calls for 
Windows...

Until now I'd not thought of vmsplice/splice though - only sendfile for 
zero copy, which should be there for TCP in the form of the TCP_SENDFILE 
test (unmigrated, so none of the omni output selection is available)

> I ask this because I am currently working on fixing splice(pipe ->
> socket) performance problem and need a standard benchmark to publish
> performance results.

I'll start to ponder, without promises.

rick
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - April 3, 2012, 3:14 a.m.
On Mon, 2012-04-02 at 19:39 -0700, Rick Jones wrote:
> On 04/02/2012 07:34 PM, Eric Dumazet wrote:
> > On Mon, 2012-04-02 at 19:29 -0700, Rick Jones wrote:
> >> I don't know it isn't entirely bitrotted, but there are streaming and
> >> datagram AF_UNIX tests in netperf - they require conditional inclusion
> >> via ./configure --enable-unixdomain:
> >>
> >> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#DG_005fSTREAM
> >>
> >
> > Ah yes of course, I'll try that.
> >

It seems netperf has some problems zith AF_UNIX dgram :

socket(PF_FILE, SOCK_DGRAM, 0)          = 4
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [0], 4) = 0
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [2048], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [2288], [4]) = 0
sendto(3, "\0\0\0+\377\377\377\377\0\0\0\0\0\0\10\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0"..., 256, 0, NULL, 0) = 256
select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 979635})
recvfrom(3, "\0\0\0,\0\0\0\0\0\0\10\360\0\0\10\0\0\0\0\0\0\0\0\n\0\0\0\0\0\0\0\0"..., 256, 0, NULL, NULL) = 256
connect(4, {sa_family=AF_FILE, path="/tmp/netpe62HGNM"}, 110) = 0
rt_sigaction(SIGALRM, {0x403171, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f691f026af0}, NULL, 8) = 0
alarm(10)                               = 0
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 2048, 0, NULL, 0) = -1 EMSGSIZE (Message too long)
dup(2)                                  = 5
fcntl(5, F_GETFL)                       = 0x8002 (flags O_RDWR|O_LARGEFILE)
fstat(5, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 3), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f691fa18000
lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
write(5, "dg_send: data send error: Messag"..., 43dg_send: data send error: Message too long
) = 43
close(5)                                = 0
munmap(0x7f691fa18000, 4096)            = 0
exit_group(1)                           = ?

I guess the SO_SNDBUF/SO_RCVBUF limits are a bit too low ?

Tried this on an old kernel as well.

Linux edumazet-glaptop 2.6.38-13-generic #57~lucid1-Ubuntu SMP Tue Mar 6 20:05:46 UTC 2012 x86_64 GNU/Linux


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rémi Denis-Courmont - April 3, 2012, 6:36 a.m.
On Mon, 02 Apr 2012 17:40:06 -0400 (EDT), David Miller
<davem@davemloft.net> wrote:
> UDP generates a fragment list of MTU sized SKBs.
> 
> Phonet could avoid the large allocations by building page based
> SKBs.

Oh right. And Phonet devices don't support scatter/gather, so that I guess
that would merely delay the problem.

Also sendmsg() code would need to be reectored to look up the output
device and then the MTU before allocating the socket buffer. This will only
work if the default MTU is reduced first :/
David Miller - April 3, 2012, 6:38 a.m.
From: Rémi Denis-Courmont <remi@remlab.net>
Date: Tue, 03 Apr 2012 08:36:20 +0200

> Oh right. And Phonet devices don't support scatter/gather, so that I
> guess that would merely delay the problem.

And this lack of SG support in the drivers never materialized largely
because the phonet stack never generated such things.

So yes indeed both sides of this equation will need to be addressed.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - April 3, 2012, 6:18 p.m.
On 04/02/2012 08:14 PM, Eric Dumazet wrote:
> It seems netperf has some problems zith AF_UNIX dgram :
>
> socket(PF_FILE, SOCK_DGRAM, 0)          = 4
> setsockopt(4, SOL_SOCKET, SO_SNDBUF, [0], 4) = 0
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [2048], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [2288], [4]) = 0
> sendto(3, "\0\0\0+\377\377\377\377\0\0\0\0\0\0\10\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0"..., 256, 0, NULL, 0) = 256
> select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 979635})
> recvfrom(3, "\0\0\0,\0\0\0\0\0\0\10\360\0\0\10\0\0\0\0\0\0\0\0\n\0\0\0\0\0\0\0\0"..., 256, 0, NULL, NULL) = 256
> connect(4, {sa_family=AF_FILE, path="/tmp/netpe62HGNM"}, 110) = 0
> rt_sigaction(SIGALRM, {0x403171, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f691f026af0}, NULL, 8) = 0
> alarm(10)                               = 0
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 2048, 0, NULL, 0) = -1 EMSGSIZE (Message too long)
> dup(2)                                  = 5
> fcntl(5, F_GETFL)                       = 0x8002 (flags O_RDWR|O_LARGEFILE)
> fstat(5, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 3), ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f691fa18000
> lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
> write(5, "dg_send: data send error: Messag"..., 43dg_send: data send error: Message too long
> ) = 43
> close(5)                                = 0
> munmap(0x7f691fa18000, 4096)            = 0
> exit_group(1)                           = ?
>
> I guess the SO_SNDBUF/SO_RCVBUF limits are a bit too low ?

The AF_UNIX support in netperf is from a time when the code was very 
simplistic - if no socket buffer size specified on the command line, 
take the system default.  If no send size specified, use the SO_SNDBUF 
size, though some limits were applied.

Of course, along the way, there was a change in the value for socket 
buffer size that meant "take the default" - it used to be 0 but then 
became -1.  Something to do with when Windows would decide to do copy 
avoidance - tell Windows the socket buffer size is 0 and it will do copy 
avoidance. The code in src/nettest_unix.c was still initializing to 0 
rather than -1.

I've fixed that in the top-of-trunk:

raj@tardy:~/netperf2_trunk$ src/netperf -t DG_STREAM
DG UNIDIRECTIONAL SEND TEST
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

126976   65507    10.00     1195647      0     811.07
  2288            10.00     1195647            811.07

BTW, the local CPU utilization seems sane, but "remote" (which is 
redundant anyway) seems to be fubar somehow.  More bitrot I suspect.

happy benchmarking,

rick jones
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index 9f60008..caee99e 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -1130,6 +1130,9 @@  static int pep_sendmsg(struct kiocb *iocb, struct sock *sk,
 	int flags = msg->msg_flags;
 	int err, done;
 
+	if (len > USHRT_MAX)
+		return -E2BIG;
+
 	if ((msg->msg_flags & ~(MSG_DONTWAIT|MSG_EOR|MSG_NOSIGNAL|
 				MSG_CMSG_COMPAT)) ||
 			!(msg->msg_flags & MSG_EOR))