[RFC,1/2] net: Introduce recvmmsg socket syscall

Message ID 20090520230652.GB5956@ghostprotocols.net
State RFC, archived
Delegated to: David Miller

Commit Message

Arnaldo Carvalho de Melo May 20, 2009, 11:06 p.m. UTC
Meaning receive multiple messages, reducing the number of syscalls and
net stack entry/exit operations.

Next patches will introduce mechanisms where protocols that want to
optimize this operation will provide an unlocked_recvmsg operation.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 arch/alpha/kernel/systbls.S      |    1 +
 arch/arm/kernel/calls.S          |    1 +
 arch/ia64/kernel/entry.S         |    1 +
 arch/mips/kernel/scall32-o32.S   |    1 +
 arch/mips/kernel/scall64-64.S    |    1 +
 arch/mips/kernel/scall64-n32.S   |    1 +
 arch/mips/kernel/scall64-o32.S   |    1 +
 arch/sh/kernel/syscalls_64.S     |    1 +
 arch/sparc/kernel/systbls_64.S   |    2 +-
 arch/x86/include/asm/unistd_64.h |    2 +
 include/linux/socket.h           |    6 +++
 include/linux/syscalls.h         |    3 ++
 include/net/compat.h             |    7 ++++
 kernel/sys_ni.c                  |    2 +
 net/compat.c                     |    7 ++++
 net/socket.c                     |   72 ++++++++++++++++++++++++++++++-------
 16 files changed, 94 insertions(+), 15 deletions(-)
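
For illustration, a minimal userspace sketch of driving the proposed call.
No libc wrapper exists at this point, so a raw syscall(2) with the x86_64
number from this patch (299) is assumed, and bind()/error handling are
trimmed; this is a sketch, not part of the patch:

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>
#include <stdio.h>

#define __NR_recvmmsg	299	/* x86_64 number from this patch */
#define VLEN		8

struct mmsghdr {			/* mirrors the new kernel struct */
	struct msghdr	msg_hdr;
	unsigned int	msg_len;	/* filled in: bytes in this datagram */
};

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);	/* bind() elided */
	struct mmsghdr msgs[VLEN];
	struct iovec iovs[VLEN];
	char bufs[VLEN][256];
	int i, n;

	for (i = 0; i < VLEN; i++) {
		iovs[i].iov_base = bufs[i];
		iovs[i].iov_len	 = sizeof(bufs[i]);
		msgs[i].msg_hdr	 = (struct msghdr){
			.msg_iov    = &iovs[i],
			.msg_iovlen = 1,
		};
	}

	/* one syscall drains up to VLEN queued datagrams */
	n = syscall(__NR_recvmmsg, fd, msgs, VLEN, MSG_DONTWAIT);
	for (i = 0; i < n; i++)
		printf("datagram %d: %u bytes\n", i, msgs[i].msg_len);
	return 0;
}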

Comments

Neil Horman May 21, 2009, 12:46 a.m. UTC | #1
On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo wrote:
> Meaning receive multiple messages, reducing the number of syscalls and
> net stack entry/exit operations.
> 
> Next patches will introduce mechanisms where protocols that want to
> optimize this operation will provide an unlocked_recvmsg operation.
> 
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
It's a neat idea; I like the possibility of saving lots of syscalls for
busy sockets, but I imagine the addition of a new syscall gives people pause.  I
wonder if simply augmenting the existing recvmsg syscall with a message flag to
indicate that multiple messages can be received on that call would suffice.

What I would propose looks something like:

1) Define a new flag in the msghdr's msg_flags, MSG_COMPOUND.  Setting
this on the call tells the protocol that we can store multiple messages.

2) If this flag is set, the msg_control pointer should contain a cmsghdr with
a new type, MSG_COMPOUND_NEXT, in which the size is sizeof(void *) and the
data contains a pointer to the next msghdr.

3) The kernel can iteratively fill out the buffers passed in through the chain,
setting the MSG_COMPOUND flag on each msghdr that contains valid data.  The
first msghdr that does not have the MSG_COMPOUND flag set denotes the last
buffer the kernel put valid data in.  This way the buffer chain pointer is kept
unchanged, and userspace can follow it to free the data if need be (see the
sketch after this list).
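
A rough sketch of building such a chain in userspace; MSG_COMPOUND and
MSG_COMPOUND_NEXT are hypothetical values from this proposal, with numbers
made up purely for illustration:

#include <sys/socket.h>
#include <string.h>

#define MSG_COMPOUND		0x40000000	/* hypothetical flag */
#define MSG_COMPOUND_NEXT	1000		/* hypothetical cmsg type */

/* Link 'next' after 'cur' through a MSG_COMPOUND_NEXT control message;
 * cbuf must provide at least CMSG_SPACE(sizeof(struct msghdr *)) bytes. */
static void msg_chain(struct msghdr *cur, struct msghdr *next,
		      char *cbuf, size_t cbuf_len)
{
	struct cmsghdr *cm;

	cur->msg_flags	    = MSG_COMPOUND;
	cur->msg_control    = cbuf;
	cur->msg_controllen = cbuf_len;

	cm = CMSG_FIRSTHDR(cur);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type  = MSG_COMPOUND_NEXT;
	cm->cmsg_len   = CMSG_LEN(sizeof(next));
	memcpy(CMSG_DATA(cm), &next, sizeof(next));
}

/* On return from recvmsg(), userspace would walk the chain until it hits
 * the first msghdr without MSG_COMPOUND set, per point 3 above. */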

Thoughts?

Neil

Arnaldo Carvalho de Melo May 21, 2009, 2:05 a.m. UTC | #2
Em Wed, May 20, 2009 at 08:46:34PM -0400, Neil Horman escreveu:
> On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo wrote:
> > Meaning receive multiple messages, reducing the number of syscalls and
> > net stack entry/exit operations.
> > 
> > Next patches will introduce mechanisms where protocols that want to
> > optimize this operation will provide an unlocked_recvmsg operation.
> > 
> > Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> Its a neat idea, I like the possibility on saving lots of syscalls for
> busy sockets, but I imagine the addition of a new syscall gives people pause.  I
> wonder if simply augmenting the existing recvmsg syscall with a message flag to
> indicate that multiple messages can be received on that call.
> 
> What I would propose looks something like:
> 
> 1) define a new flag in the msghdr pointer for msg_flags, MSG_COMPOUND.  Setting
> this on the call lets the protocol we can store multiple messages
> 
> 2) if this flag is set the msg_control pointer should contain a cmsghdr with a
> new type MSG_COMPOUND_NEXT, in which the size is sizeof(void *) and the data
> contains a pointer to the next msghdr pointer.
> 
> 3) The kernel can iteratively fill out buffers passed in through the chain,
> setting the MSG_COMPOUND flag on each msghdr that contains valid data.  The
> first msghdr to not have the MSG_COMPOUND flag set denotes the last buffer that
> the kernel put valid data in.  This way the buffer chain pointer is kept
> unchanged, and userspace can follow it to free the data if need be.
> 
> Thoughts?

I didn't go into such detail when discussing this with Dave on IRC,
but I thought about something like using a setsockopt to tell the kernel
that the socket was in multiple message mode; lemme look at the
discussion to be faithful to it...

[18:22] <acme> I see, but the bastardization I was thinking was about just
putting a datagram per iovec instead of taking a datagram and go on
spilling it over the iovec entries, if some sockopt was set, as a first
try ;-)
[18:23] <davem> Oh I see
[18:23] <davem> that would work too

But I think that the interface I proposed, which was Dave's general idea,
should be OK as well for sendmmsg, to send multiple messages to
different destinations, using markings like one msg_iovlen to signal that
the previous msg_iov/msg_iovlen should be used for a different
destination.

The reasoning behind the proposed interface was to mostly keep the
existing way of passing iovecs to the kernel, but this time around
passing multiple iovecs instead of just one.

Existing code would just have to make the iovecs, msg_name, etc. be
arrays instead of rethinking how to talk to the kernel completely.

So... let's hear more opinions :-)

Ah, I went to a local pub to relax and left three machines non-stop
pounding a "chrt -f 1 ./rcvmmsg 5001 64" patched server and it held up
for hours:

nr_datagrams received: 24
    4352 bytes received from mica.ghostprotocols.net in 17 datagrams
    1536 bytes received from doppio.ghostprotocols.net in 6 datagrams
    256 bytes received from filo.ghostprotocols.net in 1 datagrams
nr_datagrams received: 18
    256 bytes received from filo.ghostprotocols.net in 1 datagrams
    3072 bytes received from doppio.ghostprotocols.net in 12 datagrams
    256 bytes received from mica.ghostprotocols.net in 1 datagrams
    256 bytes received from doppio.ghostprotocols.net in 1 datagrams
    256 bytes received from mica.ghostprotocols.net in 1 datagrams
    256 bytes received from doppio.ghostprotocols.net in 1 datagrams
    256 bytes received from mica.ghostprotocols.net in 1 datagrams
nr_datagrams received: 26
    5120 bytes received from mica.ghostprotocols.net in 20 datagrams
    256 bytes received from filo.ghostprotocols.net in 1 datagrams
    1280 bytes received from doppio.ghostprotocols.net in 5 datagrams
nr_datagrams received: 18
    256 bytes received from filo.ghostprotocols.net in 1 datagrams
    1792 bytes received from doppio.ghostprotocols.net in 7 datagrams
    256 bytes received from filo.ghostprotocols.net in 1 datagrams
    1792 bytes received from doppio.ghostprotocols.net in 7 datagrams
    256 bytes received from mica.ghostprotocols.net in 1 datagrams
    256 bytes received from do^C    256 bytes received from filo.ghostprotocols.net in 1 datagrams

:-)

- Arnaldo
Neil Horman May 21, 2009, 2:26 a.m. UTC | #3
On Wed, May 20, 2009 at 11:05:41PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Wed, May 20, 2009 at 08:46:34PM -0400, Neil Horman escreveu:
> > On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo wrote:
> > > Meaning receive multiple messages, reducing the number of syscalls and
> > > net stack entry/exit operations.
> > > 
> > > Next patches will introduce mechanisms where protocols that want to
> > > optimize this operation will provide an unlocked_recvmsg operation.
> > > 
> > > Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > Its a neat idea, I like the possibility on saving lots of syscalls for
> > busy sockets, but I imagine the addition of a new syscall gives people pause.  I
> > wonder if simply augmenting the existing recvmsg syscall with a message flag to
> > indicate that multiple messages can be received on that call.
> > 
> > What I would propose looks something like:
> > 
> > 1) define a new flag in the msghdr pointer for msg_flags, MSG_COMPOUND.  Setting
> > this on the call lets the protocol we can store multiple messages
> > 
> > 2) if this flag is set the msg_control pointer should contain a cmsghdr with a
> > new type MSG_COMPOUND_NEXT, in which the size is sizeof(void *) and the data
> > contains a pointer to the next msghdr pointer.
> > 
> > 3) The kernel can iteratively fill out buffers passed in through the chain,
> > setting the MSG_COMPOUND flag on each msghdr that contains valid data.  The
> > first msghdr to not have the MSG_COMPOUND flag set denotes the last buffer that
> > the kernel put valid data in.  This way the buffer chain pointer is kept
> > unchanged, and userspace can follow it to free the data if need be.
> > 
> > Thoughts?
> 
> I didn't went into such detail when discussing this with Dave on IRC,
> but I thought about something like using a setsockopt to tell the kernel
> that the socket was in multiple message mode, lemme look at the
> discussion to be faithful to it...
> 
> [18:22] <acme> I see, but the bastardization I was thinking was about just
> putting a datagram per iovec instead of taking a datagram and go on
> spilling it over the iovec entries, if some sockopt was set, as a first
> try ;-)
> [18:23] <davem> Oh I see
> [18:23] <davem> that would work too
> 
> But I think that the interface I proposed, that was Dave's general idea,
> should be ok as well for sendmmsg, to send multiple messages to
> different destinations using markings like one msg_iovlen to signal that
> the previous msg_iov/msg_iovlen should be used for a different
> destination.
> 
> The reasoning behing the proposed interface was to mostly keep the
> existing way of passing iovecs to the kernel, but this time around
> passing multiple iovecs instead of just one.
> 
> Existing code would just have to make the iovecs, msg_name, etc be
> arrays instead of rethinking how to talk to the kernel completely.
> 
> So... lets hear more opinions :-)
> 
I agree, your way of doing this definitely lets you layer on top of the existing
vetted implementation, which is nice; I just thought that avoiding the creation
of another syscall might be worth a little extra work in the kernel.  Instead of
arrays of msghdrs, we'd be looking at chains like this:
msghdr->(struct msghdr *)msg_control[i].data->msghdr->etc

Not too hard to parse, I don't think.  But I'll defer to brighter minds than
mine.  If the creation of another syscall isn't too difficult a barrier to
overcome (assuming this is going to occur for sendmsg, and various other I/O ops
as well), then your way here is probably the way to go.
Neil

> Ah, I went to a local pub to relax and left three machines non-stop
> pounding a "chrt -f 1 ./rcvmmsg 5001 64" patched server and it hold up
> for hours:
> 
> nr_datagrams received: 24
>     4352 bytes received from mica.ghostprotocols.net in 17 datagrams
>     1536 bytes received from doppio.ghostprotocols.net in 6 datagrams
>     256 bytes received from filo.ghostprotocols.net in 1 datagrams
> nr_datagrams received: 18
>     256 bytes received from filo.ghostprotocols.net in 1 datagrams
>     3072 bytes received from doppio.ghostprotocols.net in 12 datagrams
>     256 bytes received from mica.ghostprotocols.net in 1 datagrams
>     256 bytes received from doppio.ghostprotocols.net in 1 datagrams
>     256 bytes received from mica.ghostprotocols.net in 1 datagrams
>     256 bytes received from doppio.ghostprotocols.net in 1 datagrams
>     256 bytes received from mica.ghostprotocols.net in 1 datagrams
> nr_datagrams received: 26
>     5120 bytes received from mica.ghostprotocols.net in 20 datagrams
>     256 bytes received from filo.ghostprotocols.net in 1 datagrams
>     1280 bytes received from doppio.ghostprotocols.net in 5 datagrams
> nr_datagrams received: 18
>     256 bytes received from filo.ghostprotocols.net in 1 datagrams
>     1792 bytes received from doppio.ghostprotocols.net in 7 datagrams
>     256 bytes received from filo.ghostprotocols.net in 1 datagrams
>     1792 bytes received from doppio.ghostprotocols.net in 7 datagrams
>     256 bytes received from mica.ghostprotocols.net in 1 datagrams
>     256 bytes received from do^C    256 bytes received from filo.ghostprotocols.net in 1 datagrams
> 
> :-)
> 
> - Arnaldo
David Miller May 21, 2009, 3:50 a.m. UTC | #4
From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 20 May 2009 22:26:21 -0400

> I agree, your way of doing this definately lets you layer on top of
> the existing vetted implementation, which is nice, I just thought
> that avoiding the creation of another syscall might be worth a
> little extra work in the kernel.  Instead of arrays of msghdrs, We'd
> be looking at chains like this: msghdr->(struct msghdr
> *)msg_control[i].data->msghdr->etc
> 
> Not too hard to parse, I dont think.  But I'll defer to brighter
> minds than mine.  If the creation of another syscall isn't too
> difficult a barrier to overcome (assuming this is going to occur for
> sendmsg, and various other i/o ops as well), then your way here is
> probably the way to go.

Unfortunately you can't use msg flags for this.

We accept any message flag we don't understand without signalling
any errors.

So there is no way to determine if the kernel supports the flag
or not.  Whereas with a socket option, we'll always get an error
on older kernels for unsupported options.

I think the system call is the cleanest, because it's not only a
semantic change but also a data type change.  I also think the
socket option scheme is too cumbersome.  I think it would be
common for an application to want to use both modes of sending,
especially if that application uses lots of existing library
code to compose some messages.  An extra setsockopt() around
every call down into that library?  Yikes, good luck getting
that right all the time.  It's way too error prone.
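
To make that concrete: with a hypothetical SO_MULTI_MSG option (invented here
only to illustrate the point), every boundary into legacy library code would
need bracketing like this, and missing a single toggle silently changes the
socket's semantics under the library's feet:

#include <sys/socket.h>

#define SO_MULTI_MSG 64			/* hypothetical option */

extern int library_send_stuff(int fd);	/* expects classic sendmsg mode */

static int call_library(int fd)
{
	int on = 1, off = 0, ret;

	setsockopt(fd, SOL_SOCKET, SO_MULTI_MSG, &off, sizeof(off));
	ret = library_send_stuff(fd);
	setsockopt(fd, SOL_SOCKET, SO_MULTI_MSG, &on, sizeof(on));
	return ret;
}
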
Neil Horman May 21, 2009, 10:40 a.m. UTC | #5
On Wed, May 20, 2009 at 08:50:44PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Wed, 20 May 2009 22:26:21 -0400
> 
> > I agree, your way of doing this definately lets you layer on top of
> > the existing vetted implementation, which is nice, I just thought
> > that avoiding the creation of another syscall might be worth a
> > little extra work in the kernel.  Instead of arrays of msghdrs, We'd
> > be looking at chains like this: msghdr->(struct msghdr
> > *)msg_control[i].data->msghdr->etc
> > 
> > Not too hard to parse, I dont think.  But I'll defer to brighter
> > minds than mine.  If the creation of another syscall isn't too
> > difficult a barrier to overcome (assuming this is going to occur for
> > sendmsg, and various other i/o ops as well), then your way here is
> > probably the way to go.
> 
> Unfortunately you can't use msg flags for this.
> 
> We accept any message flag we don't understand without signalling
> any errors.
> 

I assume that silently ignoring flags we don't understand is a specified part
of the sockets API?  I was looking at the POSIX definition of recvmsg and I
don't see where it's required that we ignore unknown flags (although I can see
why it would be useful to do so).

> So there is no way to determine if the kernel supports the flag
> or not.  Whereas with a socket option, we'll always get an error
> on older kernels for unsupported options.
> 
> I think the system call is the cleanest, because it's not only a
> semantic change but also a data type change.  I also think the
> socket option scheme is too cumbersome.  I think it would be
> common for an application to want to use both modes of sending,
> especially if that application uses lots of existing library
> mode to compose some messages.  And extra setsockopt() around
> every call down into that library?  Yikes, good luck getting
> that right all the time.  It's way too error prone.

The new system call is definitely cleaner, as it lets you specify the array
semantics in a way that is much more efficient and less cumbersome for both user
and kernel space.  I just thought I'd propose an alternate method that avoids
the need for a new syscall.  It sounds like you and Acme have already hashed
through this though, so I'll sit down :)

Thanks guys!
Neil

Paul Moore May 21, 2009, 2:16 p.m. UTC | #6
On Wednesday 20 May 2009 07:06:52 pm Arnaldo Carvalho de Melo wrote:
> Meaning receive multiple messages, reducing the number of syscalls and
> net stack entry/exit operations.

NOTE: adding the LSM list to the CC line

If this approach is accepted I wonder if it would also make sense to move the
security_socket_recvmsg() hook out of __sock_recvmsg and into the callers.  I
personally can't see a reason why we would need to call into the LSM for each
message in the case of the new recvmmsg() syscall.  The downside is that there
is now some code duplication (although we are only talking about duplicating
~three lines of code) but the upside is that we won't end up calling into the
LSM for each of the messages when recvmmsg() is called, which seems to fit well
with the performance-oriented nature of the new syscall.
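
For concreteness, a sketch of that split; the _nosec name is made up here,
and this is not a tested patch:

/* Pay the LSM check once in callers that want it, and give recvmmsg()
 * a hook-free path for the 2nd..Nth datagram of a batch. */
static int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
			  struct msghdr *msg, size_t size, int flags)
{
	int err = security_socket_recvmsg(sock, msg, size, flags);

	if (err)
		return err;
	/* __sock_recvmsg_nosec() would be today's __sock_recvmsg() body,
	 * i.e. the sock->ops->recvmsg() path, minus the security hook;
	 * recvmmsg() would call the _nosec variant directly in its loop. */
	return __sock_recvmsg_nosec(iocb, sock, msg, size, flags);
}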

> Next patches will introduce mechanisms where protocols that want to
> optimize this operation will provide an unlocked_recvmsg operation.
>
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>  arch/alpha/kernel/systbls.S      |    1 +
>  arch/arm/kernel/calls.S          |    1 +
>  arch/ia64/kernel/entry.S         |    1 +
>  arch/mips/kernel/scall32-o32.S   |    1 +
>  arch/mips/kernel/scall64-64.S    |    1 +
>  arch/mips/kernel/scall64-n32.S   |    1 +
>  arch/mips/kernel/scall64-o32.S   |    1 +
>  arch/sh/kernel/syscalls_64.S     |    1 +
>  arch/sparc/kernel/systbls_64.S   |    2 +-
>  arch/x86/include/asm/unistd_64.h |    2 +
>  include/linux/socket.h           |    6 +++
>  include/linux/syscalls.h         |    3 ++
>  include/net/compat.h             |    7 ++++
>  kernel/sys_ni.c                  |    2 +
>  net/compat.c                     |    7 ++++
>  net/socket.c                     |   72 ++++++++++++++++++++++++++++++-------
>  16 files changed, 94 insertions(+), 15 deletions(-)
>
> diff --git a/arch/alpha/kernel/systbls.S b/arch/alpha/kernel/systbls.S
> index 95c9aef..cda6b8b 100644
> --- a/arch/alpha/kernel/systbls.S
> +++ b/arch/alpha/kernel/systbls.S
> @@ -497,6 +497,7 @@ sys_call_table:
>  	.quad sys_signalfd
>  	.quad sys_ni_syscall
>  	.quad sys_eventfd
> +	.quad sys_recvmmsg
>
>  	.size sys_call_table, . - sys_call_table
>  	.type sys_call_table, @object
> diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
> index 1680e9e..80fa7fc 100644
> --- a/arch/arm/kernel/calls.S
> +++ b/arch/arm/kernel/calls.S
> @@ -372,6 +372,7 @@
>  /* 360 */	CALL(sys_inotify_init1)
>  		CALL(sys_preadv)
>  		CALL(sys_pwritev)
> +		CALL(sys_recvmmsg)
>  #ifndef syscalls_counted
>  .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
>  #define syscalls_counted
> diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
> index 7bebac0..aef1e61 100644
> --- a/arch/ia64/kernel/entry.S
> +++ b/arch/ia64/kernel/entry.S
> @@ -1805,6 +1805,7 @@ sys_call_table:
>  	data8 sys_inotify_init1
>  	data8 sys_preadv
>  	data8 sys_pwritev			// 1320
> +	data8 sys_recvmmsg
>
>  	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
>  #endif /* __IA64_ASM_PARAVIRTUALIZED_NATIVE */
> diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
> index 0b31b9b..283383c 100644
> --- a/arch/mips/kernel/scall32-o32.S
> +++ b/arch/mips/kernel/scall32-o32.S
> @@ -652,6 +652,7 @@ einval:	li	v0, -ENOSYS
>  	sys	sys_inotify_init1	1
>  	sys	sys_preadv		6	/* 4330 */
>  	sys	sys_pwritev		6
> +	sys	sys_recvmmsg		4
>  	.endm
>
>  	/* We pre-compute the number of _instruction_ bytes needed to
> diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
> index c647fd6..109550f 100644
> --- a/arch/mips/kernel/scall64-64.S
> +++ b/arch/mips/kernel/scall64-64.S
> @@ -489,4 +489,5 @@ sys_call_table:
>  	PTR	sys_inotify_init1
>  	PTR	sys_preadv
>  	PTR	sys_pwritev			/* 5390 */
> +	PTR	sys_recvmmsg
>  	.size	sys_call_table,.-sys_call_table
> diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
> index c2c16ef..d45f22c 100644
> --- a/arch/mips/kernel/scall64-n32.S
> +++ b/arch/mips/kernel/scall64-n32.S
> @@ -415,4 +415,5 @@ EXPORT(sysn32_call_table)
>  	PTR	sys_inotify_init1
>  	PTR	sys_preadv
>  	PTR	sys_pwritev
> +	PTR	compat_sys_recvmmsg
>  	.size	sysn32_call_table,.-sysn32_call_table
> diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
> index 002fac2..de93e46 100644
> --- a/arch/mips/kernel/scall64-o32.S
> +++ b/arch/mips/kernel/scall64-o32.S
> @@ -535,4 +535,5 @@ sys_call_table:
>  	PTR	sys_inotify_init1
>  	PTR	compat_sys_preadv		/* 4330 */
>  	PTR	compat_sys_pwritev
> +	PTR	compat_sys_recvmmsg
>  	.size	sys_call_table,.-sys_call_table
> diff --git a/arch/sh/kernel/syscalls_64.S b/arch/sh/kernel/syscalls_64.S
> index a083609..4d180ac 100644
> --- a/arch/sh/kernel/syscalls_64.S
> +++ b/arch/sh/kernel/syscalls_64.S
> @@ -389,3 +389,4 @@ sys_call_table:
>  	.long sys_inotify_init1		/* 360 */
>  	.long sys_preadv
>  	.long sys_pwritev
> +	.long sys_recvmmsg
> diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
> index 82b5bf8..5807ed5 100644
> --- a/arch/sparc/kernel/systbls_64.S
> +++ b/arch/sparc/kernel/systbls_64.S
> @@ -156,4 +156,4 @@ sys_call_table:
>  	.word sys_set_mempolicy, sys_kexec_load, sys_move_pages, sys_getcpu, sys_epoll_pwait
>  /*310*/	.word sys_utimensat, sys_signalfd, sys_timerfd_create, sys_eventfd, sys_fallocate
>  	.word sys_timerfd_settime, sys_timerfd_gettime, sys_signalfd4, sys_eventfd2, sys_epoll_create1
> -/*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv, sys_pwritev
> +/*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv, sys_pwritev, sys_recvmmsg
> diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
> index 900e161..713a32a 100644
> --- a/arch/x86/include/asm/unistd_64.h
> +++ b/arch/x86/include/asm/unistd_64.h
> @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
>  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
>  #define __NR_perf_counter_open			298
>  __SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)
> +#define __NR_recvmmsg				299
> +__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
>
>  #ifndef __NO_STUBS
>  #define __ARCH_WANT_OLD_READDIR
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 421afb4..50c6c44 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -65,6 +65,12 @@ struct msghdr {
>  	unsigned	msg_flags;
>  };
>
> +/* For recvmmsg/sendmmsg */
> +struct mmsghdr {
> +	struct msghdr	msg_hdr;
> +	unsigned	msg_len;
> +};
> +
>  /*
>   *	POSIX 1003.1g - ancillary data object information
>   *	Ancillary data consits of a sequence of pairs of
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 677d159..dbf94dd 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -25,6 +25,7 @@ struct linux_dirent64;
>  struct list_head;
>  struct msgbuf;
>  struct msghdr;
> +struct mmsghdr;
>  struct msqid_ds;
>  struct new_utsname;
>  struct nfsctl_arg;
> @@ -555,6 +556,8 @@ asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
>  asmlinkage long sys_recvfrom(int, void __user *, size_t, unsigned, struct sockaddr __user *, int __user *);
>  asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg, unsigned flags);
> +asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
> +			     unsigned int vlen, unsigned flags);
>  asmlinkage long sys_socket(int, int, int);
>  asmlinkage long sys_socketpair(int, int, int, int __user *);
>  asmlinkage long sys_socketcall(int call, unsigned long __user *args);
> diff --git a/include/net/compat.h b/include/net/compat.h
> index 5bbf8bf..35acabb 100644
> --- a/include/net/compat.h
> +++ b/include/net/compat.h
> @@ -18,6 +18,11 @@ struct compat_msghdr {
>  	compat_uint_t	msg_flags;
>  };
>
> +struct compat_mmsghdr {
> +	struct compat_msghdr msg_hdr;
> +	compat_uint_t	     msg_len;
> +};
> +
>  struct compat_cmsghdr {
>  	compat_size_t	cmsg_len;
>  	compat_int_t	cmsg_level;
> @@ -35,6 +40,8 @@ extern int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *);
>  extern int verify_compat_iovec(struct msghdr *, struct iovec *, struct sockaddr *, int);
>  extern asmlinkage long compat_sys_sendmsg(int,struct compat_msghdr __user *,unsigned);
>  extern asmlinkage long compat_sys_recvmsg(int,struct compat_msghdr __user *,unsigned);
> +extern asmlinkage long compat_sys_recvmmsg(int, struct compat_mmsghdr __user *,
> +					   unsigned, unsigned);
>  extern asmlinkage long compat_sys_getsockopt(int, int, int, char __user *, int __user *);
>  extern int put_cmsg_compat(struct msghdr*, int, int, int, void *);
>
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 68320f6..f581fb0 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -48,7 +48,9 @@ cond_syscall(sys_shutdown);
>  cond_syscall(sys_sendmsg);
>  cond_syscall(compat_sys_sendmsg);
>  cond_syscall(sys_recvmsg);
> +cond_syscall(sys_recvmmsg);
>  cond_syscall(compat_sys_recvmsg);
> +cond_syscall(compat_sys_recvmmsg);
>  cond_syscall(sys_socketcall);
>  cond_syscall(sys_futex);
>  cond_syscall(compat_sys_futex);
> diff --git a/net/compat.c b/net/compat.c
> index 8d73905..1ec778f 100644
> --- a/net/compat.c
> +++ b/net/compat.c
> @@ -743,6 +743,13 @@ asmlinkage long compat_sys_recvmsg(int fd, struct compat_msghdr __user *msg, uns
>  	return sys_recvmsg(fd, (struct msghdr __user *)msg, flags | MSG_CMSG_COMPAT);
>  }
>
> +asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg,
> +				    unsigned vlen, unsigned int flags)
> +{
> +	return sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen,
> +			    flags | MSG_CMSG_COMPAT);
> +}
> +
>  asmlinkage long compat_sys_socketcall(int call, u32 __user *args)
>  {
>  	int ret;
> diff --git a/net/socket.c b/net/socket.c
> index 791d71a..f0249cb 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -1965,22 +1965,16 @@ out:
>  	return err;
>  }
>
> -/*
> - *	BSD recvmsg interface
> - */
> -
> -SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
> -		unsigned int, flags)
> +static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg,
> +			 unsigned flags)
>  {
>  	struct compat_msghdr __user *msg_compat =
>  	    (struct compat_msghdr __user *)msg;
> -	struct socket *sock;
>  	struct iovec iovstack[UIO_FASTIOV];
>  	struct iovec *iov = iovstack;
>  	struct msghdr msg_sys;
>  	unsigned long cmsg_ptr;
>  	int err, iov_size, total_len, len;
> -	int fput_needed;
>
>  	/* kernel mode address */
>  	struct sockaddr_storage addr;
> @@ -1996,13 +1990,9 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
>  	else if (copy_from_user(&msg_sys, msg, sizeof(struct msghdr)))
>  		return -EFAULT;
>
> -	sock = sockfd_lookup_light(fd, &err, &fput_needed);
> -	if (!sock)
> -		goto out;
> -
>  	err = -EMSGSIZE;
>  	if (msg_sys.msg_iovlen > UIO_MAXIOV)
> -		goto out_put;
> +		goto out;
>
>  	/* Check whether to allocate the iovec area */
>  	err = -ENOMEM;
> @@ -2010,7 +2000,7 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
>  	if (msg_sys.msg_iovlen > UIO_FASTIOV) {
>  		iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
>  		if (!iov)
> -			goto out_put;
> +			goto out;
>  	}
>
>  	/*
> @@ -2066,9 +2056,63 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
>  out_freeiov:
>  	if (iov != iovstack)
>  		sock_kfree_s(sock->sk, iov, iov_size);
> +out:
> +	return err;
> +}
> +
> +/*
> + *	BSD recvmsg interface
> + */
> +
> +SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
> +		unsigned int, flags)
> +{
> +	int fput_needed, err;
> +	struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed);
> +
> +	if (!sock)
> +		goto out;
> +
> +	err = __sys_recvmsg(sock, msg, flags);
> +
> +	fput_light(sock->file, fput_needed);
> +out:
> +	return err;
> +}
> +
> +/*
> + *	Linux recvmmsg interface
> + */
> +
> +SYSCALL_DEFINE4(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
> +		unsigned int, vlen, unsigned int, flags)
> +{
> +	int fput_needed, err, datagrams = 0;
> +	struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed);
> +	struct mmsghdr __user *entry = mmsg;
> +
> +	if (!sock)
> +		goto out;
> +
> +	while (datagrams < vlen) {
> +		err = __sys_recvmsg(sock, (struct msghdr __user *)entry, flags);
> +		if (err < 0)
> +			goto out_put;
> +		err = __put_user(err, &entry->msg_len);
> +		if (err)
> +			goto out_put;
> +		++entry;
> +		++datagrams;
> +	}
>  out_put:
>  	fput_light(sock->file, fput_needed);
>  out:
> +	/*
> +	 * We may return less entries than requested (vlen) if the
> +	 * sock is non block and there aren't enough datagrams.
> +	 */
> +	if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> +		return datagrams;
>  	return err;
>  }
Arnaldo Carvalho de Melo May 21, 2009, 2:47 p.m. UTC | #7
Em Thu, May 21, 2009 at 10:16:17AM -0400, Paul Moore escreveu:
> On Wednesday 20 May 2009 07:06:52 pm Arnaldo Carvalho de Melo wrote:
> > Meaning receive multiple messages, reducing the number of syscalls and
> > net stack entry/exit operations.
> 
> NOTE: adding the LSM list to the CC line

thanks!
 
> If this approach is accepted I wonder if it would also make sense to move the 
> security_socket_recvmsg() hook out of __sock_recvmsg and into the callers.  I 
> personally can't see a reason why we would need to call into the LSM for each 
> message in the case of the new recvmmsg() syscall.  The downside is that there 
> is now some code duplication (although we are only talking duplicating ~three 
> lines of code) but the upside is that we wont end up calling into the LSM for 
> each of the messages when recvmmsg() is called which seems to fit well with 
> the performance oriented nature of the new syscall.

Agreed that we must do this earlier to avoid vlen calls to
security_socket_recvmsg, but there are many callers of sock_recvmsg...

Also, shouldn't recvmmsg have a different LSM hook? It doesn't look right
at first sight to reuse security_socket_recvmsg, as we are now passing
many msghdrs and sockaddrs, etc.

If security_socket_recvmsg receives the msg and inspects it, I think
fully inspecting the mmsg and vlen can be something LSM policies would
be interested in too, no?

- Arnaldo
Paul Moore May 21, 2009, 3:03 p.m. UTC | #8
On Thursday 21 May 2009 10:47:39 am Arnaldo Carvalho de Melo wrote:
> Em Thu, May 21, 2009 at 10:16:17AM -0400, Paul Moore escreveu:
> > On Wednesday 20 May 2009 07:06:52 pm Arnaldo Carvalho de Melo wrote:
> > > Meaning receive multiple messages, reducing the number of syscalls and
> > > net stack entry/exit operations.
> >
> > NOTE: adding the LSM list to the CC line
>
> thanks!
>
> > If this approach is accepted I wonder if it would also make sense to move
> > the security_socket_recvmsg() hook out of __sock_recvmsg and into the
> > callers.  I personally can't see a reason why we would need to call into
> > the LSM for each message in the case of the new recvmmsg() syscall.  The
> > downside is that there is now some code duplication (although we are only
> > talking duplicating ~three lines of code) but the upside is that we wont
> > end up calling into the LSM for each of the messages when recvmmsg() is
> > called which seems to fit well with the performance oriented nature of
> > the new syscall.
>
> Agreed that we must do this earlier to avoind vlen calls to
> security_socket_recvmsg, but there are many callers of sock_recvmsg...

Yeah, like I said there is a downside to this approach; the question is how
much do we care about performance and what are we willing to give up for it?
I don't know the answer but I thought the question needed to be asked.

> Also shouldn't recvmmsg have a different LSM hook? It doesn't look right
> at first sight to reuse security_socket_recvmsg, as we now are passing
> many msghdrs and sockaddrs, etc.

Well, right now the only LSM of the three in the mainline kernel that makes 
use of the recvmsg hook is SELinux and in SELinux the recvmsg hook really only 
checks to see if the process can read from the socket - there is no access 
check against the message itself.  In general, all of the per-packet/message 
access controls happen below the socket layer in SELinux so I see no reason 
why we would need to execute the recvmsg hook multiple times for each 
recvmmsg() syscall.

If I'm wrong I'm sure the LSM brain trust will quickly step in ...

> If security_socket_recvmsg receives the msg and inspects it, I think
> fully inspecting the mmsg and vlen can be something LSM policies can be
> interested in inspecting too, no?

Maybe, but not with what we currently have in-tree.  Perhaps this is a 
sign/opportunity that we can trim the arguments to security_socket_recvmsg() 
too?
Arnaldo Carvalho de Melo May 21, 2009, 3:11 p.m. UTC | #9
Em Thu, May 21, 2009 at 11:03:26AM -0400, Paul Moore escreveu:
> On Thursday 21 May 2009 10:47:39 am Arnaldo Carvalho de Melo wrote:
> > Em Thu, May 21, 2009 at 10:16:17AM -0400, Paul Moore escreveu:
> > > On Wednesday 20 May 2009 07:06:52 pm Arnaldo Carvalho de Melo wrote:
> > > > Meaning receive multiple messages, reducing the number of syscalls and
> > > > net stack entry/exit operations.
> > >
> > > NOTE: adding the LSM list to the CC line
> >
> > thanks!
> >
> > > If this approach is accepted I wonder if it would also make sense to move
> > > the security_socket_recvmsg() hook out of __sock_recvmsg and into the
> > > callers.  I personally can't see a reason why we would need to call into
> > > the LSM for each message in the case of the new recvmmsg() syscall.  The
> > > downside is that there is now some code duplication (although we are only
> > > talking duplicating ~three lines of code) but the upside is that we wont
> > > end up calling into the LSM for each of the messages when recvmmsg() is
> > > called which seems to fit well with the performance oriented nature of
> > > the new syscall.
> >
> > Agreed that we must do this earlier to avoind vlen calls to
> > security_socket_recvmsg, but there are many callers of sock_recvmsg...
> 
> Yeah, like I said there is a downside to this approach, the question is how 
> much do we care about performance and what are we willing to give up for it?  
> I don't know the answer but I thought the question needed to be asked.

Well, if we only check if the process can read from the socket, I also
see no reason for a new security_socket_recvmmsg nor for checking it
multiple times in recvmmsg, since what changes (the msg) is of no
interest to the LSM.
 
> > Also shouldn't recvmmsg have a different LSM hook? It doesn't look right
> > at first sight to reuse security_socket_recvmsg, as we now are passing
> > many msghdrs and sockaddrs, etc.
> 
> Well, right now the only LSM of the three in the mainline kernel that makes 
> use of the recvmsg hook is SELinux and in SELinux the recvmsg hook really only 
> checks to see if the process can read from the socket - there is no access 
> check against the message itself.  In general, all of the per-packet/message 
> access controls happen below the socket layer in SELinux so I see no reason 
> why we would need to execute the recvmsg hook multiple times for each 
> recvmmsg() syscall.

Agreed
 
> If I'm wrong I'm sure the LSM brain trust will quickly step in ...
> 
> > If security_socket_recvmsg receives the msg and inspects it, I think
> > fully inspecting the mmsg and vlen can be something LSM policies can be
> > interested in inspecting too, no?
> 
> Maybe, but not with what we currently have in-tree.  Perhaps this is a 
> sign/opportunity that we can trim the arguments to security_socket_recvmsg() 
> too?

Perhaps, but it's up to the LSM folks to tell if this was really a case where
passing the msg was something that ended up being overkill.

- Arnaldo
Paul Moore May 21, 2009, 3:24 p.m. UTC | #10
On Thursday 21 May 2009 11:11:09 am Arnaldo Carvalho de Melo wrote:
> Em Thu, May 21, 2009 at 11:03:26AM -0400, Paul Moore escreveu:
> > On Thursday 21 May 2009 10:47:39 am Arnaldo Carvalho de Melo wrote:
> > > Em Thu, May 21, 2009 at 10:16:17AM -0400, Paul Moore escreveu:
> > > > On Wednesday 20 May 2009 07:06:52 pm Arnaldo Carvalho de Melo wrote:
> > > > > Meaning receive multiple messages, reducing the number of syscalls
> > > > > and net stack entry/exit operations.
> > > >
> > > > NOTE: adding the LSM list to the CC line
> > >
> > > thanks!
> > >
> > > > If this approach is accepted I wonder if it would also make sense to
> > > > move the security_socket_recvmsg() hook out of __sock_recvmsg and
> > > > into the callers.  I personally can't see a reason why we would need
> > > > to call into the LSM for each message in the case of the new
> > > > recvmmsg() syscall.  The downside is that there is now some code
> > > > duplication (although we are only talking duplicating ~three lines of
> > > > code) but the upside is that we wont end up calling into the LSM for
> > > > each of the messages when recvmmsg() is called which seems to fit
> > > > well with the performance oriented nature of the new syscall.
> > >
> > > Agreed that we must do this earlier to avoind vlen calls to
> > > security_socket_recvmsg, but there are many callers of sock_recvmsg...
> >
> > Yeah, like I said there is a downside to this approach, the question is
> > how much do we care about performance and what are we willing to give up
> > for it? I don't know the answer but I thought the question needed to be
> > asked.
>
> Well, if we only check if the process can read from the socket, I also
> see no reasons for a new security_socket_recvmmsg nor for checking it
> multiple times in recvmmsg, since what changes (the msg) is of no
> interest to LSM.

Exactly.

> > If I'm wrong I'm sure the LSM brain trust will quickly step in ...
> >
> > > If security_socket_recvmsg receives the msg and inspects it, I think
> > > fully inspecting the mmsg and vlen can be something LSM policies can be
> > > interested in inspecting too, no?
> >
> > Maybe, but not with what we currently have in-tree.  Perhaps this is a
> > sign/opportunity that we can trim the arguments to
> > security_socket_recvmsg() too?
>
> Perhaps, but up to LSM folks to tell if this was really a case where
> passing the msg was something that ended up being overkill.

Yep, and since they are a rather ornery bunch I'm sure they will speak up.  
However, I would say to go ahead and trim the unused args and if we need them 
in the future we can always add them back in; none of this is really visible 
outside the kernel so changes are relatively easy.

I'd hate to see us miss an opportunity to make things better over concerns 
about what might happen at some unknown point in the future.
Evgeniy Polyakov May 21, 2009, 4:10 p.m. UTC | #11
Hi Arnaldo.

On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo (acme@redhat.com) wrote:
> Meaning receive multiple messages, reducing the number of syscalls and
> net stack entry/exit operations.
> 
> Next patches will introduce mechanisms where protocols that want to
> optimize this operation will provide an unlocked_recvmsg operation.
 
What's the difference from a single msg with multiple iovecs?

Can we implement receiving from multiple sockets using this or a similar interface?
Arnaldo Carvalho de Melo May 21, 2009, 4:27 p.m. UTC | #12
Em Thu, May 21, 2009 at 08:10:00PM +0400, Evgeniy Polyakov escreveu:
> Hi Arnaldo.
> 
> On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo (acme@redhat.com) wrote:
> > Meaning receive multiple messages, reducing the number of syscalls and
> > net stack entry/exit operations.
> > 
> > Next patches will introduce mechanisms where protocols that want to
> > optimize this operation will provide an unlocked_recvmsg operation.
>  
> What's the difference from the single msg with multiple iovecs?

recvmsg consumes just one skb, a datagram, truncating if it has more
bytes than asked for and returning fewer bytes than asked for if the skb
being consumed is smaller than requested.

WRT the iovec, it takes this skb/datagram and goes on filling the iovec
entry by entry, till it exhausts the skb.

The use case here is: a UDP socket has multiple skbs in its receive queue,
so the application will make several syscalls to get those skbs, while we
could return multiple datagrams in just one syscall + fd lookup + LSM
validation + lock_sock + release_sock.

We could use some sort of setsockopt to instead put a datagram per iovec
entry, but that would be cumbersome; using recvmsg + recvmmsg in the same
code should be possible without requiring setsockopts to switch modes.

The proposed API just adds an "array mode" for recvmsg, keeping the
existing semantics as much as possible to ease converting applications
to it; libraries would just do some internal caching (sketched below) and
their users wouldn't notice the change.
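
A sketch of that library-side caching; my_recvmsg() and recvmmsg_syscall()
are hypothetical names, the latter standing in for however the new call
ends up being wrapped:

#include <sys/types.h>
#include <sys/socket.h>
#include <string.h>

struct mmsghdr {		/* as introduced by this patch */
	struct msghdr	msg_hdr;
	unsigned int	msg_len;
};

/* hypothetical wrapper over the proposed syscall */
extern int recvmmsg_syscall(int fd, struct mmsghdr *vec, unsigned int vlen,
			    unsigned int flags);

#define BATCH 16

static struct mmsghdr cache[BATCH];	/* msg_hdr iovecs pre-wired to... */
static char slab[BATCH][2048];		/* ...these buffers (init elided) */
static int cached, next;

/* Drop-in replacement for recvmsg(): refill with one recvmmsg, then hand
 * out one cached datagram per call; the caller notices nothing. */
ssize_t my_recvmsg(int fd, struct msghdr *msg, int flags)
{
	size_t len;

	if (next == cached) {
		int n = recvmmsg_syscall(fd, cache, BATCH, flags);
		if (n <= 0)
			return n;
		cached = n;
		next = 0;
	}
	len = cache[next].msg_len;
	if (len > msg->msg_iov[0].iov_len)	/* honour the caller's buffer */
		len = msg->msg_iov[0].iov_len;
	memcpy(msg->msg_iov[0].iov_base, slab[next], len);
	next++;
	return len;
}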
 
> Can we implement receiving from multiple sockets using this or similar interface?

Not really; we don't pass something like a poll_fd array, as the operation
proposed is per socket.

- Arnaldo
steve@chygwyn.com May 21, 2009, 4:33 p.m. UTC | #13
Hi,

On Thu, 2009-05-21 at 13:27 -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, May 21, 2009 at 08:10:00PM +0400, Evgeniy Polyakov escreveu:
> > Hi Arnaldo.
> > 
> > On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo (acme@redhat.com) wrote:
> > > Meaning receive multiple messages, reducing the number of syscalls and
> > > net stack entry/exit operations.
> > > 
> > > Next patches will introduce mechanisms where protocols that want to
> > > optimize this operation will provide an unlocked_recvmsg operation.
> >  
> > What's the difference from the single msg with multiple iovecs?
> 
> recvmsg consumes just one skb, a datagram, truncating if it has more
> bytes than asked and giving less bytes than asked for if the skb being
> consumer is smaller than requested.
> 
> WRT iovec, it gets this skb/datagram and goes on filling iovec entry by
> entry, till it exhausts the skb.
> 
> The usecase here is: UDP socket has multiple skbs in its receive queue,
> so application will make several syscalls to get those skbs while we
> could return multiple datagrams in just one syscall + fd lookup + LSM
> validation + lock_sock + release_sock.
> 
It's not just UDP/SOCK_DGRAM either. SOCK_SEQPACKET has to return after
each message because it is not allowed to (nor can it, given the API)
return more than one message in a single call. So with small messages
that can add up to a lot of calls. This way the MSG_EOR flags can be
preserved in the correct places.

Steve.


Caitlin Bestler May 21, 2009, 4:38 p.m. UTC | #14
> +
> +       while (datagrams < vlen) {
> +               err = __sys_recvmsg(sock, (struct msghdr __user *)entry, flags);
> +               if (err < 0)
> +                       goto out_put;
> +               err = __put_user(err, &entry->msg_len);
> +               if (err)
> +                       goto out_put;
> +               ++entry;
> +               ++datagrams;
> +       }
>  out_put:
>        fput_light(sock->file, fput_needed);
>  out:
> +       /*
> +        * We may return less entries than requested (vlen) if the
> +        * sock is non block and there aren't enough datagrams.
> +        */
> +       if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> +               return datagrams;
>        return err;
>  }
>

There is an assumption here that unless MSG_DONTWAIT is set, or there is an
error, the caller will be willing to wait indefinitely for N messages to
show up -- and that it is never worth waking up the caller earlier with
fewer than N messages.

I think an application would more typically want to wait at most m msecs
after the first message is received to see if any other messages can be
delivered at the same time.  A busy server could simply use DONTWAIT in a
polling loop every cycle, but it would be nice to be able to wait
indefinitely for *any* of your clients to send you a message.

Further, with some sockets there are some messages that are more equal than
others.  Although valid messages, with no errors, they should be delivered
to user mode immediately.  The example that leaps to my mind immediately is
SCTP Events, particularly with one-to-many sockets.  You could be waiting
for N messages, knowing that a specific peer has been asked to send N
messages.  The 2nd message could be an SCTP event announcing that the
specific association has been torn down (and hence the remaining messages
will not be arriving).  Waiting for a different association to send enough
messages to complete the request will not provide very prompt service.
Arnaldo Carvalho de Melo May 21, 2009, 4:45 p.m. UTC | #15
Em Thu, May 21, 2009 at 05:33:35PM +0100, Steven Whitehouse escreveu:
> Hi,
> 
> On Thu, 2009-05-21 at 13:27 -0300, Arnaldo Carvalho de Melo wrote:
> > Em Thu, May 21, 2009 at 08:10:00PM +0400, Evgeniy Polyakov escreveu:
> > > Hi Arnaldo.
> > > 
> > > On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo (acme@redhat.com) wrote:
> > > > Meaning receive multiple messages, reducing the number of syscalls and
> > > > net stack entry/exit operations.
> > > > 
> > > > Next patches will introduce mechanisms where protocols that want to
> > > > optimize this operation will provide an unlocked_recvmsg operation.
> > >  
> > > What's the difference from the single msg with multiple iovecs?
> > 
> > recvmsg consumes just one skb, a datagram, truncating if it has more
> > bytes than asked and giving less bytes than asked for if the skb being
> > consumer is smaller than requested.
> > 
> > WRT iovec, it gets this skb/datagram and goes on filling iovec entry by
> > entry, till it exhausts the skb.
> > 
> > The usecase here is: UDP socket has multiple skbs in its receive queue,
> > so application will make several syscalls to get those skbs while we
> > could return multiple datagrams in just one syscall + fd lookup + LSM
> > validation + lock_sock + release_sock.
> > 
> Its not just UDP/SOCK_DGRAM either. SOCK_SEQPACKET has to return after
> each message because it is not allowed to (nor can it, given the api)
> return more than one message in a single call. So with small messages
> that can add up to a lot of calls. This way the MSG_EOR flags can be
> preserved in the correct places,

Yeah, SOCK_SEQPACKET is just not optimized in the current patch; it could
be, by just providing a sk_prot->unlocked_recvmsg call, etc.

And by using sock_common_recvmsg in its socket->ops, which not all
protocol families do, but it was my plan long ago to make them all do so,
so that we could shorten the path, etc. :-)

- Arnaldo
Arnaldo Carvalho de Melo May 21, 2009, 4:55 p.m. UTC | #16
Em Thu, May 21, 2009 at 09:38:47AM -0700, Caitlin Bestler escreveu:
> > +
> > +       while (datagrams < vlen) {
> > +               err = __sys_recvmsg(sock, (struct msghdr __user *)entry, flags);
> > +               if (err < 0)
> > +                       goto out_put;
> > +               err = __put_user(err, &entry->msg_len);
> > +               if (err)
> > +                       goto out_put;
> > +               ++entry;
> > +               ++datagrams;
> > +       }
> >  out_put:
> >        fput_light(sock->file, fput_needed);
> >  out:
> > +       /*
> > +        * We may return less entries than requested (vlen) if the
> > +        * sock is non block and there aren't enough datagrams.
> > +        */
> > +       if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> > +               return datagrams;
> >        return err;
> >  }
> >
 
> There is an assumption here that unless MSG_DONTWAIT is set, or there
> is an error, that the caller will be willing to wait indefinitely for
> N messages to show up -- and that it is never worth waking up the
> caller earlier with less than N messages.
 
> I think an application would more typically want to wait at most m
> msecs after the first message is received to see if any other messages
> can be delivered at the same time.

> A busy server could simply use DONTWAIT in a polling loop every cycle,
> but it would be nice to be able to wait indefinitely for *any* of your
> clients to send you a message.
 
> Further, with some sockets there are some messages that are more equal
> than others.

> Although valid messages, with no errors, they should be delivered to
> user-mode immediately.

> The example that leaps to my mind immediately are SCTP Events,
> particularly with one-to-many sockets. You could be waiting for N
> messages, knowing that a specific peer has been asked to send N
> messages. The 2nd message could be an SCTP event announcing that the
> specific association has been torn down (and hence the remaining
> messages will not be arriving).  Waiting for a different association
> to send enough messages to complete the request will not provide very
> prompt service.

Good points; that will require some sort of change to sk->sk_rcvtimeo
handling so that the timeout for each datagram is deducted from the
configured timeout for the socket, so that the existing code doesn't
have to be changed.

I.e. recvmmsg would save the value of sk->sk_rcvtimeo at entry and
restore it at exit, and would go on subtracting the time each
sock_recvmsg() took from it so that the following call finds a reduced
sk->sk_rcvtimeo, iff it was configured in the first place and the socket
is in blocking mode (a rough sketch follows).
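
Roughly, in the recvmmsg loop (sketch only, not tested):

	struct sock *sk = sock->sk;
	long orig = sk->sk_rcvtimeo, left = orig;

	while (datagrams < vlen) {
		unsigned long t0 = jiffies;

		err = __sys_recvmsg(sock, (struct msghdr __user *)entry, flags);
		if (err < 0)
			break;
		err = __put_user(err, &entry->msg_len);
		if (err)
			break;
		++entry;
		++datagrams;

		if (orig) {			/* SO_RCVTIMEO configured? */
			left -= (long)(jiffies - t0);
			if (left <= 0)		/* budget exhausted */
				break;
			sk->sk_rcvtimeo = left;	/* next recv sees less time */
		}
	}
	sk->sk_rcvtimeo = orig;			/* restore user-visible value */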

How does that sound?

- Arnaldo
Caitlin Bestler May 21, 2009, 5:26 p.m. UTC | #17
On Thu, May 21, 2009 at 9:55 AM, Arnaldo Carvalho de Melo
<acme@ghostprotocols.net> wrote:

>
> I.e. recvmmsg would save the value of sk->sk_rcvtimeo at entry and
> restore at exit, and would somehow go on subtracting the time
> sock_recvmsg() took from it so that the following call finds a reduced
> sk->sk_rcvtimeo iif it was configured in the first place and the socket
> is in blocking mode.
>
> How does that sound?
>

I suspect that an additional timeout value will be needed ultimately.

Essentially there is the existing timeout (how long will I wait before I
want to be told that there have been no messages) and an additional
"delivery pending" timeout (how long can the delivery of a message
be delayed to attempt to coalesce it with other messages).

There is also one sticky compliance issue with SCTP: delivery of an
UNORDERED message MUST NOT be delayed waiting for other traffic.
There is no guarantee that the local client will be scheduled immediately,
just a prohibition on delaying delivery for the purpose of bundling.

That might mean a specific transport would have to support multiple
conditional timeouts: maximum delay after an SCTP Event, after an
UNORDERED message, after a message with PPID X, etc. Or for
TCP after a PUSH FLAG, or after certain flags have been set, etc.

Those could probably be reduced to delivering immediately after
any message that the transport flags as "urgent", and *all* error
completions are urgent (which is what the code first shown does).

The specific transport could then use setsockopt to control what
messages qualified as "urgent" in a transport specific manner.
Arnaldo Carvalho de Melo May 21, 2009, 5:51 p.m. UTC | #18
Em Thu, May 21, 2009 at 10:26:58AM -0700, Caitlin Bestler escreveu:
> On Thu, May 21, 2009 at 9:55 AM, Arnaldo Carvalho de Melo
> <acme@ghostprotocols.net> wrote:
> 
> >
> > I.e. recvmmsg would save the value of sk->sk_rcvtimeo at entry and
> > restore at exit, and would somehow go on subtracting the time
> > sock_recvmsg() took from it so that the following call finds a reduced
> > sk->sk_rcvtimeo iif it was configured in the first place and the socket
> > is in blocking mode.
> >
> > How does that sound?
> >
> 
> I suspect that an additional timeout value will be needed ultimately.

So you mean we need a timeout to wait for a datagram, which remains
sk->sk_rcvtimeo (SO_RCVTIMEO), and one that is passed as a parameter to
recvmmsg?
 
> Essentially there is the existing timeout (how long will I wait before I
> want to be told that there have been no messages) and an additional
> "delivery pending" timeout (how long can the delivery of a message
> be delayed to attempt to coalesce it with other messages).

I gather this is a yes to the question I asked above :)
 
> There is also one sticky compliance issue with SCTP, delivery of an
> UNORDERED message MUST NOT be delayed waiting for other traffic.
> There is no guarantee that the local client will be scheduled immediately,
> just a prohibition on delaying delivery for the purpose of bundling.
> 
> That might mean a specific transport would have to support multiple
> conditional timeouts: maximum delay after an SCTP Event, after an
> UNORDERED message, after a message with PPID X, etc. Or for
> TCP after a PUSH FLAG, or after certain flags have been set, etc.
> 
> Those could probably be reduced to delivering immediately after
> any message that the transport flags as "urgent", and *all* error
> completions are urgent (which is what the code first shown does).

Checking if MSG_OOB is set in the last datagram returned by the
unlocked_recvmsg call would make recvmmsg return immediately even
if in blocking mode and not having filled the array, OK.
 
> The specific transport could then use setsockopt to control what
> messages qualified as "urgent" in a transport specific manner.

OK

- Arnaldo
Rémi Denis-Courmont May 22, 2009, 7:22 a.m. UTC | #19
On Thursday 21 May 2009 02:06:52 ext Arnaldo Carvalho de Melo wrote:
> +       /*
> +        * We may return fewer entries than requested (vlen) if the
> +        * socket is non-blocking and there aren't enough datagrams.
> +        */
> +       if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> +               return datagrams;
>         return err;
>  }

Could there be a situation whereby we receive one or more datagrams, then get 
an error? How does userland get the datagrams then?
steve@chygwyn.com May 22, 2009, 8:31 a.m. UTC | #20
Hi,

On Fri, May 22, 2009 at 10:22:48AM +0300, Rémi Denis-Courmont wrote:
> On Thursday 21 May 2009 02:06:52 ext Arnaldo Carvalho de Melo wrote:
> > +       /*
> > +        * We may return fewer entries than requested (vlen) if the
> > +        * socket is non-blocking and there aren't enough datagrams.
> > +        */
> > +       if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> > +               return datagrams;
> >         return err;
> >  }
> 
> Could there be a situation whereby we receive one or more datagrams, then get 
> an error? How does userland get the datagrams then?
>
Normally you'd expect the call to return what it has read without an
error, and then the socket error would be picked up on the next call.
We need to check whether sk->sk_err is being reset in the common code, I
guess.
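
I.e. the tail of sys_recvmmsg would become something like (untested
sketch):

	/*
	 * Never throw away datagrams already copied to userspace: report
	 * the count, and let the error surface on the next call via
	 * sk->sk_err.
	 */
	if (datagrams)
		return datagrams;
	return err;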

Steve.
steve@chygwyn.com May 22, 2009, 8:32 a.m. UTC | #21
Hi,

On Thu, May 21, 2009 at 02:51:22PM -0300, Arnaldo Carvalho de Melo wrote:
> On Thu, May 21, 2009 at 10:26:58AM -0700, Caitlin Bestler wrote:
> > On Thu, May 21, 2009 at 9:55 AM, Arnaldo Carvalho de Melo
> > <acme@ghostprotocols.net> wrote:
> > 
> > >
> > > I.e. recvmmsg would save the value of sk->sk_rcvtimeo at entry and
> > > restore at exit, and would somehow go on subtracting the time
> > > sock_recvmsg() took from it so that the following call finds a reduced
> > > sk->sk_rcvtimeo iff it was configured in the first place and the socket
> > > is in blocking mode.
> > >
> > > How does that sound?
> > >
> > 
> > I suspect that an additional timeout value will be needed ultimately.
> 
> So you mean we need a timeout to wait for a datagram, that remains being
> sk->sk_rcvtimeo (SO_RCVTIMEO), and one that is passed as a parameter to
> recvmmsg?
>
The other thing which might also need looking at is the low water mark
(SO_RCVLOWAT): should it apply to each individual message, or to all the
received messages in combination? I think the latter is more useful.
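
For reference, the knob in question as userspace sets it today (a
sketch; which way it should apply to recvmmsg is exactly the open
question):

#include <stdio.h>
#include <sys/socket.h>

static void set_rcvlowat(int fd, int bytes)
{
	/* SO_RCVLOWAT currently gates a single recv*() call; the open
	 * question is whether, for recvmmsg, it should apply to each
	 * message or to the batch as a whole. */
	if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &bytes,
		       sizeof(bytes)) < 0)
		perror("setsockopt(SO_RCVLOWAT)");
}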

Steve.
Caitlin Bestler May 22, 2009, 4:39 p.m. UTC | #22
2009/5/22 Rémi Denis-Courmont <remi.denis-courmont@nokia.com>:
> On Thursday 21 May 2009 02:06:52 ext Arnaldo Carvalho de Melo wrote:
>> +       /*
>> +        * We may return fewer entries than requested (vlen) if the
>> +        * socket is non-blocking and there aren't enough datagrams.
>> +        */
>> +       if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
>> +               return datagrams;
>>         return err;
>>  }
>
> Could there be a situation whereby we receive one or more datagrams, then get
> an error? How does userland get the datagrams then?
>

This is exactly how a server using SOCK_DGRAM or SOCK_SEQPACKET would
want to work -- give me up to N messages received over time X (waiting at most
time Y after the first message is received).

The server application then:
+ wakes up with at most Y delay on delivery of any single message.
+ but wakes up only once during that period (unless N or more messages
are received)
+ wakes up at least every X to deal with idle processing.

It could simply wake up every Y msecs, then use recvmmsg to receive
everything that is there, but that would mean a lot of idle wake-ups
when the number of clients is low.

These are all less relevant for SOCK_STREAM, since statistical averaging
of incoming messages is rarely feasible for a single connection.

As commented earlier, however, it would make sense for each transport to
be able to declare a given message to be "urgent" (by whatever means
appropriate to that transport) and thereby trigger an earlier delivery.
RDMA interfaces use a "solicited event" bit for these purposes.
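
To make the batching pattern concrete, a minimal userspace sketch
against the interface proposed in this RFC -- the struct layout and the
x86_64 syscall number 299 come from the patch below, a kernel with this
patch applied is assumed, and since there is no per-call timeout yet,
SO_RCVTIMEO would have to stand in for X:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>

#ifndef __NR_recvmmsg
#define __NR_recvmmsg 299	/* x86_64 number assigned in this patch */
#endif

struct mmsghdr {		/* layout as defined in this patch;
				 * glibc knows nothing of it yet */
	struct msghdr	msg_hdr;
	unsigned int	msg_len;
};

#define BATCH 8

static int recv_batch(int fd)
{
	char bufs[BATCH][1500];
	struct iovec iov[BATCH];
	struct mmsghdr msgs[BATCH];
	int i, n;

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = sizeof(bufs[i]);
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* One trap for up to BATCH datagrams; on success the return
	 * value is the number of datagrams actually received (see the
	 * return-value discussion above). */
	n = syscall(__NR_recvmmsg, fd, msgs, BATCH, 0);
	if (n < 0)
		perror("recvmmsg");
	else
		for (i = 0; i < n; i++)
			printf("msg %d: %u bytes\n", i, msgs[i].msg_len);
	return n;
}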
Neil Horman May 22, 2009, 8:06 p.m. UTC | #23
On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo wrote:
> Meaning receive multiple messages, reducing the number of syscalls and
> net stack entry/exit operations.
> 
> Next patches will introduce mechanisms where protocols that want to
> optimize this operation will provide an unlocked_recvmsg operation.
> 


Not to throw more questions into the mix again, but didn't Ingo write a
batching syscall a while back, which let you issue several syscalls in one trap
to kernel space?  I understand that your approach has some efficiency gains over
that, but did that ever get accepted upstream?  Is the overlap there sufficient
to make this approach redundant?  Or are the gains in performance here
sufficient to warrant this new call?

Regards
Neil
Andrew Grover June 4, 2009, 1:44 a.m. UTC | #24
On Fri, May 22, 2009 at 1:06 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> Not to throw more questions into the mix again, but didn't Ingo write a
> batching syscall a while back, which let you issue several syscalls in one trap
> to kernel space?  I understand that your approach has some efficiency gains over
> that, but did that ever get accepted upstream?  Is the overlap there sufficient
> to make this approach redundant?  Or are the gains in performance here
> sufficient to warrant this new call?

I couldn't find this via Google, or in any posts by Ingo to LKML.

I'm very interested in multiple send/recvmsg support. So, it's
definitely going to happen? :-)

Regards -- Andy
Arnaldo Carvalho de Melo June 4, 2009, 1:46 a.m. UTC | #25
On Wed, Jun 03, 2009 at 06:44:22PM -0700, Andrew Grover wrote:
> On Fri, May 22, 2009 at 1:06 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > Not to throw more questions into the mix again, but didn't Ingo write a
> > batching syscall a while back, which let you issue several syscalls in one trap
> > to kernel space?  I understand that your approach has some efficiency gains over
> > that, but did that ever get accepted upstream?  Is the overlap there sufficient
> > to make this approach redundant?  Or are the gains in performance here
> > sufficient to warrant this new call?
> 
> I couldn't find this via Google, or in any posts by Ingo to LKML.
> 
> I'm very interested in multiple send/recvmsg support. So, it's
> definitely going to happen? :-)

I got sidetracked with some other activities, and was waiting for more
comments and reports from some testers, but will be back to implement a
v2 patch soon.

- Arnaldo
Neil Horman June 4, 2009, 10:47 a.m. UTC | #26
On Wed, Jun 03, 2009 at 06:44:22PM -0700, Andrew Grover wrote:
> On Fri, May 22, 2009 at 1:06 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > Not to throw more questions into the mix again, but didn't Ingo write a
> > batching syscall a while back, which let you issue several syscalls in one trap
> > to kernel space?  I understand that your approach has some efficiency gains over
> > that, but did that ever get accepted upstream?  Is the overlap there sufficient
> > to make this approach redundant?  Or are the gains in performance here
> > sufficient to warrant this new call?
> 
> I couldn't find this via Google, or in any posts by Ingo to LKML.
> 
> I'm very interested in multiple send/recvmsg support. So, it's
> definitely going to happen? :-)
> 
> Regards -- Andy
> 

It's entirely possible it was just something he talked about on IRC; I
just brought it up because of the overlap.
Neil


Patch

diff --git a/arch/alpha/kernel/systbls.S b/arch/alpha/kernel/systbls.S
index 95c9aef..cda6b8b 100644
--- a/arch/alpha/kernel/systbls.S
+++ b/arch/alpha/kernel/systbls.S
@@ -497,6 +497,7 @@  sys_call_table:
 	.quad sys_signalfd
 	.quad sys_ni_syscall
 	.quad sys_eventfd
+	.quad sys_recvmmsg
 
 	.size sys_call_table, . - sys_call_table
 	.type sys_call_table, @object
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 1680e9e..80fa7fc 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -372,6 +372,7 @@ 
 /* 360 */	CALL(sys_inotify_init1)
 		CALL(sys_preadv)
 		CALL(sys_pwritev)
+		CALL(sys_recvmmsg)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
index 7bebac0..aef1e61 100644
--- a/arch/ia64/kernel/entry.S
+++ b/arch/ia64/kernel/entry.S
@@ -1805,6 +1805,7 @@  sys_call_table:
 	data8 sys_inotify_init1
 	data8 sys_preadv
 	data8 sys_pwritev			// 1320
+	data8 sys_recvmmsg
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
 #endif /* __IA64_ASM_PARAVIRTUALIZED_NATIVE */
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 0b31b9b..283383c 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -652,6 +652,7 @@  einval:	li	v0, -ENOSYS
 	sys	sys_inotify_init1	1
 	sys	sys_preadv		6	/* 4330 */
 	sys	sys_pwritev		6
+	sys	sys_recvmmsg		4
 	.endm
 
 	/* We pre-compute the number of _instruction_ bytes needed to
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index c647fd6..109550f 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -489,4 +489,5 @@  sys_call_table:
 	PTR	sys_inotify_init1
 	PTR	sys_preadv
 	PTR	sys_pwritev			/* 5390 */
+	PTR	sys_recvmmsg
 	.size	sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index c2c16ef..d45f22c 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -415,4 +415,5 @@  EXPORT(sysn32_call_table)
 	PTR	sys_inotify_init1
 	PTR	sys_preadv
 	PTR	sys_pwritev
+	PTR	compat_sys_recvmmsg
 	.size	sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index 002fac2..de93e46 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -535,4 +535,5 @@  sys_call_table:
 	PTR	sys_inotify_init1
 	PTR	compat_sys_preadv		/* 4330 */
 	PTR	compat_sys_pwritev
+	PTR	compat_sys_recvmmsg
 	.size	sys_call_table,.-sys_call_table
diff --git a/arch/sh/kernel/syscalls_64.S b/arch/sh/kernel/syscalls_64.S
index a083609..4d180ac 100644
--- a/arch/sh/kernel/syscalls_64.S
+++ b/arch/sh/kernel/syscalls_64.S
@@ -389,3 +389,4 @@  sys_call_table:
 	.long sys_inotify_init1		/* 360 */
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_recvmmsg
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index 82b5bf8..5807ed5 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -156,4 +156,4 @@  sys_call_table:
 	.word sys_set_mempolicy, sys_kexec_load, sys_move_pages, sys_getcpu, sys_epoll_pwait
 /*310*/	.word sys_utimensat, sys_signalfd, sys_timerfd_create, sys_eventfd, sys_fallocate
 	.word sys_timerfd_settime, sys_timerfd_gettime, sys_signalfd4, sys_eventfd2, sys_epoll_create1
-/*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv, sys_pwritev
+/*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv, sys_pwritev, sys_recvmmsg
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 900e161..713a32a 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,8 @@  __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_counter_open			298
 __SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)
+#define __NR_recvmmsg				299
+__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 421afb4..50c6c44 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -65,6 +65,12 @@  struct msghdr {
 	unsigned	msg_flags;
 };
 
+/* For recvmmsg/sendmmsg */
+struct mmsghdr {
+	struct msghdr	msg_hdr;
+	unsigned	msg_len;
+};
+
 /*
  *	POSIX 1003.1g - ancillary data object information
  *	Ancillary data consits of a sequence of pairs of
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 677d159..dbf94dd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -25,6 +25,7 @@  struct linux_dirent64;
 struct list_head;
 struct msgbuf;
 struct msghdr;
+struct mmsghdr;
 struct msqid_ds;
 struct new_utsname;
 struct nfsctl_arg;
@@ -555,6 +556,8 @@  asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
 asmlinkage long sys_recvfrom(int, void __user *, size_t, unsigned,
 				struct sockaddr __user *, int __user *);
 asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg, unsigned flags);
+asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
+			     unsigned int vlen, unsigned flags);
 asmlinkage long sys_socket(int, int, int);
 asmlinkage long sys_socketpair(int, int, int, int __user *);
 asmlinkage long sys_socketcall(int call, unsigned long __user *args);
diff --git a/include/net/compat.h b/include/net/compat.h
index 5bbf8bf..35acabb 100644
--- a/include/net/compat.h
+++ b/include/net/compat.h
@@ -18,6 +18,11 @@  struct compat_msghdr {
 	compat_uint_t	msg_flags;
 };
 
+struct compat_mmsghdr {
+	struct compat_msghdr msg_hdr;
+	compat_uint_t	     msg_len;
+};
+
 struct compat_cmsghdr {
 	compat_size_t	cmsg_len;
 	compat_int_t	cmsg_level;
@@ -35,6 +40,8 @@  extern int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *);
 extern int verify_compat_iovec(struct msghdr *, struct iovec *, struct sockaddr *, int);
 extern asmlinkage long compat_sys_sendmsg(int,struct compat_msghdr __user *,unsigned);
 extern asmlinkage long compat_sys_recvmsg(int,struct compat_msghdr __user *,unsigned);
+extern asmlinkage long compat_sys_recvmmsg(int, struct compat_mmsghdr __user *,
+					   unsigned, unsigned);
 extern asmlinkage long compat_sys_getsockopt(int, int, int, char __user *, int __user *);
 extern int put_cmsg_compat(struct msghdr*, int, int, int, void *);
 
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 68320f6..f581fb0 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -48,7 +48,9 @@  cond_syscall(sys_shutdown);
 cond_syscall(sys_sendmsg);
 cond_syscall(compat_sys_sendmsg);
 cond_syscall(sys_recvmsg);
+cond_syscall(sys_recvmmsg);
 cond_syscall(compat_sys_recvmsg);
+cond_syscall(compat_sys_recvmmsg);
 cond_syscall(sys_socketcall);
 cond_syscall(sys_futex);
 cond_syscall(compat_sys_futex);
diff --git a/net/compat.c b/net/compat.c
index 8d73905..1ec778f 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -743,6 +743,13 @@  asmlinkage long compat_sys_recvmsg(int fd, struct compat_msghdr __user *msg, uns
 	return sys_recvmsg(fd, (struct msghdr __user *)msg, flags | MSG_CMSG_COMPAT);
 }
 
+asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg,
+				    unsigned vlen, unsigned int flags)
+{
+	return sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen,
+			    flags | MSG_CMSG_COMPAT);
+}
+
 asmlinkage long compat_sys_socketcall(int call, u32 __user *args)
 {
 	int ret;
diff --git a/net/socket.c b/net/socket.c
index 791d71a..f0249cb 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1965,22 +1965,16 @@  out:
 	return err;
 }
 
-/*
- *	BSD recvmsg interface
- */
-
-SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
-		unsigned int, flags)
+static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg,
+			 unsigned flags)
 {
 	struct compat_msghdr __user *msg_compat =
 	    (struct compat_msghdr __user *)msg;
-	struct socket *sock;
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov = iovstack;
 	struct msghdr msg_sys;
 	unsigned long cmsg_ptr;
 	int err, iov_size, total_len, len;
-	int fput_needed;
 
 	/* kernel mode address */
 	struct sockaddr_storage addr;
@@ -1996,13 +1990,9 @@  SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 	else if (copy_from_user(&msg_sys, msg, sizeof(struct msghdr)))
 		return -EFAULT;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
-	if (!sock)
-		goto out;
-
 	err = -EMSGSIZE;
 	if (msg_sys.msg_iovlen > UIO_MAXIOV)
-		goto out_put;
+		goto out;
 
 	/* Check whether to allocate the iovec area */
 	err = -ENOMEM;
@@ -2010,7 +2000,7 @@  SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 	if (msg_sys.msg_iovlen > UIO_FASTIOV) {
 		iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
 		if (!iov)
-			goto out_put;
+			goto out;
 	}
 
 	/*
@@ -2066,9 +2056,63 @@  SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 out_freeiov:
 	if (iov != iovstack)
 		sock_kfree_s(sock->sk, iov, iov_size);
+out:
+	return err;
+}
+
+/*
+ *	BSD recvmsg interface
+ */
+
+SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
+		unsigned int, flags)
+{
+	int fput_needed, err;
+	struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed);
+
+	if (!sock)
+		goto out;
+
+	err = __sys_recvmsg(sock, msg, flags);
+
+	fput_light(sock->file, fput_needed);
+out:
+	return err;
+}
+
+/*
+ *	Linux recvmmsg interface
+ */
+
+SYSCALL_DEFINE4(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
+		unsigned int, vlen, unsigned int, flags)
+{
+	int fput_needed, err, datagrams = 0;
+	struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	struct mmsghdr __user *entry = mmsg;
+
+	if (!sock)
+		goto out;
+
+	while (datagrams < vlen) {
+		err = __sys_recvmsg(sock, (struct msghdr __user *)entry, flags);
+		if (err < 0)
+			goto out_put;
+		err = __put_user(err, &entry->msg_len);
+		if (err)
+			goto out_put;
+		++entry;
+		++datagrams;
+	}
 out_put:
 	fput_light(sock->file, fput_needed);
 out:
+	/*
+	 * We may return fewer entries than requested (vlen) if the
+	 * socket is non-blocking and there aren't enough datagrams.
+	 */
+	if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
+		return datagrams;
 	return err;
 }