Message ID: 20090520230652.GB5956@ghostprotocols.net
State: RFC, archived
Delegated to: David Miller
On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo wrote:
> Meaning receive multiple messages, reducing the number of syscalls and
> net stack entry/exit operations.
>
> Next patches will introduce mechanisms where protocols that want to
> optimize this operation will provide an unlocked_recvmsg operation.
>
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

It's a neat idea; I like the possibility of saving lots of syscalls for
busy sockets, but I imagine the addition of a new syscall gives people
pause. I wonder if we could simply augment the existing recvmsg syscall
with a message flag to indicate that multiple messages can be received
on that call.

What I would propose looks something like:

1) Define a new flag for msg_flags in the msghdr, MSG_COMPOUND. Setting
   this on the call lets the protocol know it can store multiple
   messages.

2) If this flag is set, the msg_control pointer should contain a cmsghdr
   with a new type, MSG_COMPOUND_NEXT, in which the size is
   sizeof(void *) and the data contains a pointer to the next msghdr.

3) The kernel can iteratively fill out buffers passed in through the
   chain, setting the MSG_COMPOUND flag on each msghdr that contains
   valid data. The first msghdr without the MSG_COMPOUND flag set
   denotes the last buffer the kernel put valid data in. This way the
   buffer chain pointer is kept unchanged, and userspace can follow it
   to free the data if need be.

Thoughts?
Neil
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Em Wed, May 20, 2009 at 08:46:34PM -0400, Neil Horman escreveu:
> On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo wrote:
> > Meaning receive multiple messages, reducing the number of syscalls and
> > net stack entry/exit operations.
> >
> > Next patches will introduce mechanisms where protocols that want to
> > optimize this operation will provide an unlocked_recvmsg operation.
> >
> > Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
>
> It's a neat idea; I like the possibility of saving lots of syscalls for
> busy sockets, but I imagine the addition of a new syscall gives people
> pause. I wonder if we could simply augment the existing recvmsg syscall
> with a message flag to indicate that multiple messages can be received
> on that call.
>
> What I would propose looks something like:
>
> 1) Define a new flag for msg_flags in the msghdr, MSG_COMPOUND. Setting
>    this on the call lets the protocol know it can store multiple
>    messages.
>
> 2) If this flag is set, the msg_control pointer should contain a cmsghdr
>    with a new type, MSG_COMPOUND_NEXT, in which the size is
>    sizeof(void *) and the data contains a pointer to the next msghdr.
>
> 3) The kernel can iteratively fill out buffers passed in through the
>    chain, setting the MSG_COMPOUND flag on each msghdr that contains
>    valid data. The first msghdr without the MSG_COMPOUND flag set
>    denotes the last buffer the kernel put valid data in. This way the
>    buffer chain pointer is kept unchanged, and userspace can follow it
>    to free the data if need be.
>
> Thoughts?

I didn't go into such detail when discussing this with Dave on IRC, but
I thought about something like using a setsockopt to tell the kernel
that the socket was in multiple-message mode; lemme look at the
discussion to be faithful to it...
[18:22] <acme> I see, but the bastardization I was thinking was about just
putting a datagram per iovec instead of taking a datagram and go on
spilling it over the iovec entries, if some sockopt was set, as a first
try ;-)
[18:23] <davem> Oh I see
[18:23] <davem> that would work too

But I think that the interface I proposed, which was Dave's general idea,
should be OK as well for sendmmsg, to send multiple messages to
different destinations, using markings like one msg_iovlen to signal that
the previous msg_iov/msg_iovlen should be used for a different
destination.

The reasoning behind the proposed interface was mostly to keep the
existing way of passing iovecs to the kernel, but this time around
passing multiple iovecs instead of just one.

Existing code would just have to make the iovecs, msg_name, etc. be
arrays instead of rethinking how to talk to the kernel completely.

So... let's hear more opinions :-)

Ah, I went to a local pub to relax and left three machines non-stop
pounding a "chrt -f 1 ./rcvmmsg 5001 64" patched server, and it held up
for hours:

nr_datagrams received: 24
4352 bytes received from mica.ghostprotocols.net in 17 datagrams
1536 bytes received from doppio.ghostprotocols.net in 6 datagrams
256 bytes received from filo.ghostprotocols.net in 1 datagrams
nr_datagrams received: 18
256 bytes received from filo.ghostprotocols.net in 1 datagrams
3072 bytes received from doppio.ghostprotocols.net in 12 datagrams
256 bytes received from mica.ghostprotocols.net in 1 datagrams
256 bytes received from doppio.ghostprotocols.net in 1 datagrams
256 bytes received from mica.ghostprotocols.net in 1 datagrams
256 bytes received from doppio.ghostprotocols.net in 1 datagrams
256 bytes received from mica.ghostprotocols.net in 1 datagrams
nr_datagrams received: 26
5120 bytes received from mica.ghostprotocols.net in 20 datagrams
256 bytes received from filo.ghostprotocols.net in 1 datagrams
1280 bytes received from doppio.ghostprotocols.net in 5 datagrams
nr_datagrams received: 18
256 bytes received from filo.ghostprotocols.net in 1 datagrams
1792 bytes received from doppio.ghostprotocols.net in 7 datagrams
256 bytes received from filo.ghostprotocols.net in 1 datagrams
1792 bytes received from doppio.ghostprotocols.net in 7 datagrams
256 bytes received from mica.ghostprotocols.net in 1 datagrams
256 bytes received from do^C
256 bytes received from filo.ghostprotocols.net in 1 datagrams

:-)

- Arnaldo
On Wed, May 20, 2009 at 11:05:41PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Wed, May 20, 2009 at 08:46:34PM -0400, Neil Horman escreveu:
> > On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo wrote:
> > > Meaning receive multiple messages, reducing the number of syscalls and
> > > net stack entry/exit operations.
> > >
> > > Next patches will introduce mechanisms where protocols that want to
> > > optimize this operation will provide an unlocked_recvmsg operation.
> > >
> > > Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> >
> > It's a neat idea; I like the possibility of saving lots of syscalls for
> > busy sockets, but I imagine the addition of a new syscall gives people
> > pause. I wonder if we could simply augment the existing recvmsg syscall
> > with a message flag to indicate that multiple messages can be received
> > on that call.
> >
> > What I would propose looks something like:
> >
> > 1) Define a new flag for msg_flags in the msghdr, MSG_COMPOUND. Setting
> >    this on the call lets the protocol know it can store multiple
> >    messages.
> >
> > 2) If this flag is set, the msg_control pointer should contain a cmsghdr
> >    with a new type, MSG_COMPOUND_NEXT, in which the size is
> >    sizeof(void *) and the data contains a pointer to the next msghdr.
> >
> > 3) The kernel can iteratively fill out buffers passed in through the
> >    chain, setting the MSG_COMPOUND flag on each msghdr that contains
> >    valid data. The first msghdr without the MSG_COMPOUND flag set
> >    denotes the last buffer the kernel put valid data in. This way the
> >    buffer chain pointer is kept unchanged, and userspace can follow it
> >    to free the data if need be.
> >
> > Thoughts?
>
> I didn't go into such detail when discussing this with Dave on IRC, but
> I thought about something like using a setsockopt to tell the kernel
> that the socket was in multiple-message mode; lemme look at the
> discussion to be faithful to it...
>
> [18:22] <acme> I see, but the bastardization I was thinking was about just
> putting a datagram per iovec instead of taking a datagram and go on
> spilling it over the iovec entries, if some sockopt was set, as a first
> try ;-)
> [18:23] <davem> Oh I see
> [18:23] <davem> that would work too
>
> But I think that the interface I proposed, which was Dave's general idea,
> should be OK as well for sendmmsg, to send multiple messages to
> different destinations, using markings like one msg_iovlen to signal that
> the previous msg_iov/msg_iovlen should be used for a different
> destination.
>
> The reasoning behind the proposed interface was mostly to keep the
> existing way of passing iovecs to the kernel, but this time around
> passing multiple iovecs instead of just one.
>
> Existing code would just have to make the iovecs, msg_name, etc. be
> arrays instead of rethinking how to talk to the kernel completely.
>
> So... let's hear more opinions :-)

I agree, your way of doing this definitely lets you layer on top of the
existing vetted implementation, which is nice. I just thought that
avoiding the creation of another syscall might be worth a little extra
work in the kernel. Instead of arrays of msghdrs, we'd be looking at
chains like this:

	msghdr->(struct msghdr *)msg_control[i].data->msghdr->etc

Not too hard to parse, I don't think. But I'll defer to brighter minds
than mine. If the creation of another syscall isn't too difficult a
barrier to overcome (assuming this is going to occur for sendmsg, and
various other I/O ops as well), then your way here is probably the way
to go.

Neil

> Ah, I went to a local pub to relax and left three machines non-stop
> pounding a "chrt -f 1 ./rcvmmsg 5001 64" patched server, and it held up
> for hours:
>
> nr_datagrams received: 24
> 4352 bytes received from mica.ghostprotocols.net in 17 datagrams
> 1536 bytes received from doppio.ghostprotocols.net in 6 datagrams
> 256 bytes received from filo.ghostprotocols.net in 1 datagrams
> nr_datagrams received: 18
> 256 bytes received from filo.ghostprotocols.net in 1 datagrams
> 3072 bytes received from doppio.ghostprotocols.net in 12 datagrams
> 256 bytes received from mica.ghostprotocols.net in 1 datagrams
> 256 bytes received from doppio.ghostprotocols.net in 1 datagrams
> 256 bytes received from mica.ghostprotocols.net in 1 datagrams
> 256 bytes received from doppio.ghostprotocols.net in 1 datagrams
> 256 bytes received from mica.ghostprotocols.net in 1 datagrams
> nr_datagrams received: 26
> 5120 bytes received from mica.ghostprotocols.net in 20 datagrams
> 256 bytes received from filo.ghostprotocols.net in 1 datagrams
> 1280 bytes received from doppio.ghostprotocols.net in 5 datagrams
> nr_datagrams received: 18
> 256 bytes received from filo.ghostprotocols.net in 1 datagrams
> 1792 bytes received from doppio.ghostprotocols.net in 7 datagrams
> 256 bytes received from filo.ghostprotocols.net in 1 datagrams
> 1792 bytes received from doppio.ghostprotocols.net in 7 datagrams
> 256 bytes received from mica.ghostprotocols.net in 1 datagrams
> 256 bytes received from do^C
> 256 bytes received from filo.ghostprotocols.net in 1 datagrams
>
> :-)
>
> - Arnaldo
From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 20 May 2009 22:26:21 -0400

> I agree, your way of doing this definitely lets you layer on top of
> the existing vetted implementation, which is nice. I just thought
> that avoiding the creation of another syscall might be worth a
> little extra work in the kernel. Instead of arrays of msghdrs, we'd
> be looking at chains like this:
> msghdr->(struct msghdr *)msg_control[i].data->msghdr->etc
>
> Not too hard to parse, I don't think. But I'll defer to brighter
> minds than mine. If the creation of another syscall isn't too
> difficult a barrier to overcome (assuming this is going to occur for
> sendmsg, and various other I/O ops as well), then your way here is
> probably the way to go.

Unfortunately you can't use msg flags for this.

We accept any message flag we don't understand without signalling
any errors.

So there is no way to determine if the kernel supports the flag
or not. Whereas with a socket option, we'll always get an error
on older kernels for unsupported options.

I think the system call is the cleanest, because it's not only a
semantic change but also a data type change. I also think the
socket option scheme is too cumbersome. I think it would be
common for an application to want to use both modes of sending,
especially if that application uses lots of existing library
code to compose some messages. An extra setsockopt() around
every call down into that library? Yikes, good luck getting
that right all the time. It's way too error prone.
On Wed, May 20, 2009 at 08:50:44PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Wed, 20 May 2009 22:26:21 -0400
>
> > I agree, your way of doing this definitely lets you layer on top of
> > the existing vetted implementation, which is nice. I just thought
> > that avoiding the creation of another syscall might be worth a
> > little extra work in the kernel. Instead of arrays of msghdrs, we'd
> > be looking at chains like this:
> > msghdr->(struct msghdr *)msg_control[i].data->msghdr->etc
> >
> > Not too hard to parse, I don't think. But I'll defer to brighter
> > minds than mine. If the creation of another syscall isn't too
> > difficult a barrier to overcome (assuming this is going to occur for
> > sendmsg, and various other i/o ops as well), then your way here is
> > probably the way to go.
>
> Unfortunately you can't use msg flags for this.
>
> We accept any message flag we don't understand without signalling
> any errors.

I assume that silently ignoring flags we don't understand is a specified
part of the sockets API? I was looking at the POSIX definition of
recvmsg and I don't see where it's required that we ignore unknown flags
(although I can see why it would be useful to do so).

> So there is no way to determine if the kernel supports the flag
> or not. Whereas with a socket option, we'll always get an error
> on older kernels for unsupported options.
>
> I think the system call is the cleanest, because it's not only a
> semantic change but also a data type change. I also think the
> socket option scheme is too cumbersome. I think it would be
> common for an application to want to use both modes of sending,
> especially if that application uses lots of existing library
> code to compose some messages. An extra setsockopt() around
> every call down into that library? Yikes, good luck getting
> that right all the time. It's way too error prone.

The new system call is definitely cleaner, as it lets you specify the
array semantics in a way that is much more efficient and less cumbersome
for both user and kernel space. I just thought I'd propose an alternate
method that avoids the need for a new syscall. It sounds like you and
Acme have already hashed through this, though, so I'll sit down :)

Thanks guys!
Neil
On Wednesday 20 May 2009 07:06:52 pm Arnaldo Carvalho de Melo wrote:
> Meaning receive multiple messages, reducing the number of syscalls and
> net stack entry/exit operations.

NOTE: adding the LSM list to the CC line

If this approach is accepted, I wonder if it would also make sense to
move the security_socket_recvmsg() hook out of __sock_recvmsg() and into
the callers. I personally can't see a reason why we would need to call
into the LSM for each message in the case of the new recvmmsg() syscall.
The downside is that there is now some code duplication (although we are
only talking about duplicating ~three lines of code), but the upside is
that we won't end up calling into the LSM for each of the messages when
recvmmsg() is called, which seems to fit well with the
performance-oriented nature of the new syscall.

> Next patches will introduce mechanisms where protocols that want to
> optimize this operation will provide an unlocked_recvmsg operation.
>
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>  arch/alpha/kernel/systbls.S      |    1 +
>  arch/arm/kernel/calls.S          |    1 +
>  arch/ia64/kernel/entry.S         |    1 +
>  arch/mips/kernel/scall32-o32.S   |    2 +
>  arch/mips/kernel/scall64-64.S    |    1 +
>  arch/mips/kernel/scall64-n32.S   |    1 +
>  arch/mips/kernel/scall64-o32.S   |    1 +
>  arch/sh/kernel/syscalls_64.S     |    1 +
>  arch/sparc/kernel/systbls_64.S   |    2 +-
>  arch/x86/include/asm/unistd_64.h |    2 +
>  include/linux/socket.h           |    6 +++
>  include/linux/syscalls.h         |    3 ++
>  include/net/compat.h             |    7 ++++
>  kernel/sys_ni.c                  |    2 +
>  net/compat.c                     |    7 ++++
>  net/socket.c                     |   72 ++++++++++++++++++++++++++++++-------
>  16 files changed, 95 insertions(+), 15 deletions(-)
>
> diff --git a/arch/alpha/kernel/systbls.S b/arch/alpha/kernel/systbls.S
> index 95c9aef..cda6b8b 100644
> --- a/arch/alpha/kernel/systbls.S
> +++ b/arch/alpha/kernel/systbls.S
> @@ -497,6 +497,7 @@ sys_call_table:
>  	.quad sys_signalfd
>  	.quad sys_ni_syscall
>  	.quad sys_eventfd
> +	.quad sys_recvmmsg
>
>  	.size sys_call_table, . - sys_call_table
>  	.type sys_call_table, @object
> diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
> index 1680e9e..80fa7fc 100644
> --- a/arch/arm/kernel/calls.S
> +++ b/arch/arm/kernel/calls.S
> @@ -372,6 +372,7 @@
> /* 360 */	CALL(sys_inotify_init1)
>  		CALL(sys_preadv)
>  		CALL(sys_pwritev)
> +		CALL(sys_recvmmsg)
>  #ifndef syscalls_counted
>  .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
>  #define syscalls_counted
> diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
> index 7bebac0..aef1e61 100644
> --- a/arch/ia64/kernel/entry.S
> +++ b/arch/ia64/kernel/entry.S
> @@ -1805,6 +1805,7 @@ sys_call_table:
>  	data8 sys_inotify_init1
>  	data8 sys_preadv
>  	data8 sys_pwritev			// 1320
> +	data8 sys_recvmmsg
>
>  	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
>  #endif /* __IA64_ASM_PARAVIRTUALIZED_NATIVE */
> diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
> index 0b31b9b..283383c 100644
> --- a/arch/mips/kernel/scall32-o32.S
> +++ b/arch/mips/kernel/scall32-o32.S
> @@ -652,6 +652,8 @@ einval:	li	v0, -ENOSYS
>  	sys	sys_inotify_init1	1
>  	sys	sys_preadv		6	/* 4330 */
>  	sys	sys_pwritev		6
> +	sys	sys_pwritev		6
> +	sys	sys_recvmmsg		4
>  	.endm
>
>  /* We pre-compute the number of _instruction_ bytes needed to
> diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
> index c647fd6..109550f 100644
> --- a/arch/mips/kernel/scall64-64.S
> +++ b/arch/mips/kernel/scall64-64.S
> @@ -489,4 +489,5 @@ sys_call_table:
>  	PTR	sys_inotify_init1
>  	PTR	sys_preadv
>  	PTR	sys_pwritev			/* 5390 */
> +	PTR	sys_recvmmsg
>  	.size	sys_call_table,.-sys_call_table
> diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
> index c2c16ef..d45f22c 100644
> --- a/arch/mips/kernel/scall64-n32.S
> +++ b/arch/mips/kernel/scall64-n32.S
> @@ -415,4 +415,5 @@ EXPORT(sysn32_call_table)
>  	PTR	sys_inotify_init1
>  	PTR	sys_preadv
>  	PTR	sys_pwritev
> +	PTR	compat_sys_recvmmsg
>  	.size	sysn32_call_table,.-sysn32_call_table
> diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
> index 002fac2..de93e46 100644
> --- a/arch/mips/kernel/scall64-o32.S
> +++ b/arch/mips/kernel/scall64-o32.S
> @@ -535,4 +535,5 @@ sys_call_table:
>  	PTR	sys_inotify_init1
>  	PTR	compat_sys_preadv		/* 4330 */
>  	PTR	compat_sys_pwritev
> +	PTR	compat_sys_recvmmsg
>  	.size	sys_call_table,.-sys_call_table
> diff --git a/arch/sh/kernel/syscalls_64.S b/arch/sh/kernel/syscalls_64.S
> index a083609..4d180ac 100644
> --- a/arch/sh/kernel/syscalls_64.S
> +++ b/arch/sh/kernel/syscalls_64.S
> @@ -389,3 +389,4 @@ sys_call_table:
>  	.long sys_inotify_init1	/* 360 */
>  	.long sys_preadv
>  	.long sys_pwritev
> +	.long sys_recvmmsg
> diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
> index 82b5bf8..5807ed5 100644
> --- a/arch/sparc/kernel/systbls_64.S
> +++ b/arch/sparc/kernel/systbls_64.S
> @@ -156,4 +156,4 @@ sys_call_table:
>  	.word sys_set_mempolicy, sys_kexec_load, sys_move_pages, sys_getcpu, sys_epoll_pwait
> /*310*/	.word sys_utimensat, sys_signalfd, sys_timerfd_create, sys_eventfd, sys_fallocate
>  	.word sys_timerfd_settime, sys_timerfd_gettime, sys_signalfd4, sys_eventfd2, sys_epoll_create1
> -/*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv, sys_pwritev
> +/*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv, sys_pwritev, sys_recvmmsg
> diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
> index 900e161..713a32a 100644
> --- a/arch/x86/include/asm/unistd_64.h
> +++ b/arch/x86/include/asm/unistd_64.h
> @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
>  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
>  #define __NR_perf_counter_open			298
>  __SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)
> +#define __NR_recvmmsg				299
> +__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
>
>  #ifndef __NO_STUBS
>  #define __ARCH_WANT_OLD_READDIR
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 421afb4..50c6c44 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -65,6 +65,12 @@ struct msghdr {
>  	unsigned	msg_flags;
>  };
>
> +/* For recvmmsg/sendmmsg */
> +struct mmsghdr {
> +	struct msghdr	msg_hdr;
> +	unsigned	msg_len;
> +};
> +
>  /*
>   *	POSIX 1003.1g - ancillary data object information
>   *	Ancillary data consits of a sequence of pairs of
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 677d159..dbf94dd 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -25,6 +25,7 @@ struct linux_dirent64;
>  struct list_head;
>  struct msgbuf;
>  struct msghdr;
> +struct mmsghdr;
>  struct msqid_ds;
>  struct new_utsname;
>  struct nfsctl_arg;
> @@ -555,6 +556,8 @@ asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
>  asmlinkage long sys_recvfrom(int, void __user *, size_t, unsigned,
>  				struct sockaddr __user *, int __user *);
>  asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg, unsigned flags);
> +asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
> +			     unsigned int vlen, unsigned flags);
>  asmlinkage long sys_socket(int, int, int);
>  asmlinkage long sys_socketpair(int, int, int, int __user *);
>  asmlinkage long sys_socketcall(int call, unsigned long __user *args);
> diff --git a/include/net/compat.h b/include/net/compat.h
> index 5bbf8bf..35acabb 100644
> --- a/include/net/compat.h
> +++ b/include/net/compat.h
> @@ -18,6 +18,11 @@ struct compat_msghdr {
>  	compat_uint_t	msg_flags;
>  };
>
> +struct compat_mmsghdr {
> +	struct compat_msghdr	msg_hdr;
> +	compat_uint_t		msg_len;
> +};
> +
>  struct compat_cmsghdr {
>  	compat_size_t	cmsg_len;
>  	compat_int_t	cmsg_level;
> @@ -35,6 +40,8 @@ extern int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *);
>  extern int verify_compat_iovec(struct msghdr *, struct iovec *, struct sockaddr *, int);
>  extern asmlinkage long compat_sys_sendmsg(int,struct compat_msghdr __user *,unsigned);
>  extern asmlinkage long compat_sys_recvmsg(int,struct compat_msghdr __user *,unsigned);
> +extern asmlinkage long compat_sys_recvmmsg(int, struct compat_mmsghdr __user *,
> +					   unsigned, unsigned);
>  extern asmlinkage long compat_sys_getsockopt(int, int, int, char __user *, int __user *);
>  extern int put_cmsg_compat(struct msghdr*, int, int, int, void *);
>
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 68320f6..f581fb0 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -48,7 +48,9 @@ cond_syscall(sys_shutdown);
>  cond_syscall(sys_sendmsg);
>  cond_syscall(compat_sys_sendmsg);
>  cond_syscall(sys_recvmsg);
> +cond_syscall(sys_recvmmsg);
>  cond_syscall(compat_sys_recvmsg);
> +cond_syscall(compat_sys_recvmmsg);
>  cond_syscall(sys_socketcall);
>  cond_syscall(sys_futex);
>  cond_syscall(compat_sys_futex);
> diff --git a/net/compat.c b/net/compat.c
> index 8d73905..1ec778f 100644
> --- a/net/compat.c
> +++ b/net/compat.c
> @@ -743,6 +743,13 @@ asmlinkage long compat_sys_recvmsg(int fd, struct compat_msghdr __user *msg, uns
>  	return sys_recvmsg(fd, (struct msghdr __user *)msg, flags | MSG_CMSG_COMPAT);
>  }
>
> +asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg,
> +				    unsigned vlen, unsigned int flags)
> +{
> +	return sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen,
> +			    flags | MSG_CMSG_COMPAT);
> +}
> +
>  asmlinkage long compat_sys_socketcall(int call, u32 __user *args)
>  {
>  	int ret;
> diff --git a/net/socket.c b/net/socket.c
> index 791d71a..f0249cb 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -1965,22 +1965,16 @@ out:
>  	return err;
>  }
>
> -/*
> - *	BSD recvmsg interface
> - */
> -
> -SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
> -		unsigned int, flags)
> +static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg,
> +			 unsigned flags)
>  {
>  	struct compat_msghdr __user *msg_compat =
>  	    (struct compat_msghdr __user *)msg;
> -	struct socket *sock;
>  	struct iovec iovstack[UIO_FASTIOV];
>  	struct iovec *iov = iovstack;
>  	struct msghdr msg_sys;
>  	unsigned long cmsg_ptr;
>  	int err, iov_size, total_len, len;
> -	int fput_needed;
>
>  	/* kernel mode address */
>  	struct sockaddr_storage addr;
> @@ -1996,13 +1990,9 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
>  	else if (copy_from_user(&msg_sys, msg, sizeof(struct msghdr)))
>  		return -EFAULT;
>
> -	sock = sockfd_lookup_light(fd, &err, &fput_needed);
> -	if (!sock)
> -		goto out;
> -
>  	err = -EMSGSIZE;
>  	if (msg_sys.msg_iovlen > UIO_MAXIOV)
> -		goto out_put;
> +		goto out;
>
>  	/* Check whether to allocate the iovec area */
>  	err = -ENOMEM;
> @@ -2010,7 +2000,7 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
>  	if (msg_sys.msg_iovlen > UIO_FASTIOV) {
>  		iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
>  		if (!iov)
> -			goto out_put;
> +			goto out;
>  	}
>
>  	/*
> @@ -2066,9 +2056,63 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
>  out_freeiov:
>  	if (iov != iovstack)
>  		sock_kfree_s(sock->sk, iov, iov_size);
> +out:
> +	return err;
> +}
> +
> +/*
> + *	BSD recvmsg interface
> + */
> +
> +SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
> +		unsigned int, flags)
> +{
> +	int fput_needed, err;
> +	struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed);
> +
> +	if (!sock)
> +		goto out;
> +
> +	err = __sys_recvmsg(sock, msg, flags);
> +
> +	fput_light(sock->file, fput_needed);
> +out:
> +	return err;
> +}
> +
> +/*
> + *	Linux recvmmsg interface
> + */
> +
> +SYSCALL_DEFINE4(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
> +		unsigned int, vlen, unsigned int, flags)
> +{
> +	int fput_needed, err, datagrams = 0;
> +	struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed);
> +	struct mmsghdr __user *entry = mmsg;
> +
> +	if (!sock)
> +		goto out;
> +
> +	while (datagrams < vlen) {
> +		err = __sys_recvmsg(sock, (struct msghdr __user *)entry, flags);
> +		if (err < 0)
> +			goto out_put;
> +		err = __put_user(err, &entry->msg_len);
> +		if (err)
> +			goto out_put;
> +		++entry;
> +		++datagrams;
> +	}
> +out_put:
>  	fput_light(sock->file, fput_needed);
>  out:
> +	/*
> +	 * We may return less entries than requested (vlen) if the
> +	 * sock is non block and there aren't enough datagrams.
> +	 */
> +	if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> +		return datagrams;
>  	return err;
>  }
Em Thu, May 21, 2009 at 10:16:17AM -0400, Paul Moore escreveu:
> On Wednesday 20 May 2009 07:06:52 pm Arnaldo Carvalho de Melo wrote:
> > Meaning receive multiple messages, reducing the number of syscalls and
> > net stack entry/exit operations.
>
> NOTE: adding the LSM list to the CC line

thanks!

> If this approach is accepted, I wonder if it would also make sense to
> move the security_socket_recvmsg() hook out of __sock_recvmsg() and into
> the callers. I personally can't see a reason why we would need to call
> into the LSM for each message in the case of the new recvmmsg() syscall.
> The downside is that there is now some code duplication (although we are
> only talking about duplicating ~three lines of code), but the upside is
> that we won't end up calling into the LSM for each of the messages when
> recvmmsg() is called, which seems to fit well with the
> performance-oriented nature of the new syscall.

Agreed that we must do this earlier to avoid vlen calls to
security_socket_recvmsg, but there are many callers of sock_recvmsg...

Also, shouldn't recvmmsg have a different LSM hook? It doesn't look
right at first sight to reuse security_socket_recvmsg, as we are now
passing many msghdrs and sockaddrs, etc.

If security_socket_recvmsg receives the msg and inspects it, I think
fully inspecting the mmsg and vlen can be something LSM policies can be
interested in inspecting too, no?

- Arnaldo
On Thursday 21 May 2009 10:47:39 am Arnaldo Carvalho de Melo wrote:
> Em Thu, May 21, 2009 at 10:16:17AM -0400, Paul Moore escreveu:
> > On Wednesday 20 May 2009 07:06:52 pm Arnaldo Carvalho de Melo wrote:
> > > Meaning receive multiple messages, reducing the number of syscalls and
> > > net stack entry/exit operations.
> >
> > NOTE: adding the LSM list to the CC line
>
> thanks!
>
> > If this approach is accepted, I wonder if it would also make sense to
> > move the security_socket_recvmsg() hook out of __sock_recvmsg() and
> > into the callers. I personally can't see a reason why we would need to
> > call into the LSM for each message in the case of the new recvmmsg()
> > syscall. The downside is that there is now some code duplication
> > (although we are only talking about duplicating ~three lines of code),
> > but the upside is that we won't end up calling into the LSM for each
> > of the messages when recvmmsg() is called, which seems to fit well
> > with the performance-oriented nature of the new syscall.
>
> Agreed that we must do this earlier to avoid vlen calls to
> security_socket_recvmsg, but there are many callers of sock_recvmsg...

Yeah, like I said, there is a downside to this approach; the question is
how much do we care about performance and what are we willing to give up
for it? I don't know the answer, but I thought the question needed to be
asked.

> Also, shouldn't recvmmsg have a different LSM hook? It doesn't look
> right at first sight to reuse security_socket_recvmsg, as we are now
> passing many msghdrs and sockaddrs, etc.

Well, right now the only LSM of the three in the mainline kernel that
makes use of the recvmsg hook is SELinux, and in SELinux the recvmsg
hook really only checks to see if the process can read from the socket -
there is no access check against the message itself.

In general, all of the per-packet/message access controls happen below
the socket layer in SELinux, so I see no reason why we would need to
execute the recvmsg hook multiple times for each recvmmsg() syscall. If
I'm wrong, I'm sure the LSM brain trust will quickly step in ...

> If security_socket_recvmsg receives the msg and inspects it, I think
> fully inspecting the mmsg and vlen can be something LSM policies can be
> interested in inspecting too, no?

Maybe, but not with what we currently have in-tree. Perhaps this is a
sign/opportunity that we can trim the arguments to
security_socket_recvmsg() too?
Em Thu, May 21, 2009 at 11:03:26AM -0400, Paul Moore escreveu:
> On Thursday 21 May 2009 10:47:39 am Arnaldo Carvalho de Melo wrote:
> > Em Thu, May 21, 2009 at 10:16:17AM -0400, Paul Moore escreveu:
> > > On Wednesday 20 May 2009 07:06:52 pm Arnaldo Carvalho de Melo wrote:
> > > > Meaning receive multiple messages, reducing the number of syscalls and
> > > > net stack entry/exit operations.
> > >
> > > NOTE: adding the LSM list to the CC line
> >
> > thanks!
> >
> > > If this approach is accepted I wonder if it would also make sense to
> > > move the security_socket_recvmsg() hook out of __sock_recvmsg and into
> > > the callers. I personally can't see a reason why we would need to call
> > > into the LSM for each message in the case of the new recvmmsg()
> > > syscall. The downside is that there is now some code duplication
> > > (although we are only talking about duplicating ~three lines of code)
> > > but the upside is that we won't end up calling into the LSM for each of
> > > the messages when recvmmsg() is called, which seems to fit well with
> > > the performance-oriented nature of the new syscall.
> >
> > Agreed that we must do this earlier to avoid vlen calls to
> > security_socket_recvmsg, but there are many callers of sock_recvmsg...
>
> Yeah, like I said there is a downside to this approach; the question is how
> much do we care about performance and what are we willing to give up for
> it? I don't know the answer, but I thought the question needed to be asked.

Well, if we only check whether the process can read from the socket, I also
see no reason for a new security_socket_recvmmsg, nor for checking it
multiple times in recvmmsg, since what changes (the msg) is of no interest
to the LSM.

> > Also, shouldn't recvmmsg have a different LSM hook? It doesn't look right
> > at first sight to reuse security_socket_recvmsg, as we are now passing
> > many msghdrs and sockaddrs, etc.
>
> Well, right now the only LSM of the three in the mainline kernel that makes
> use of the recvmsg hook is SELinux, and in SELinux the recvmsg hook really
> only checks to see if the process can read from the socket - there is no
> access check against the message itself. In general, all of the
> per-packet/message access controls happen below the socket layer in
> SELinux, so I see no reason why we would need to execute the recvmsg hook
> multiple times for each recvmmsg() syscall.

Agreed

> If I'm wrong I'm sure the LSM brain trust will quickly step in ...
>
> > If security_socket_recvmsg receives the msg and inspects it, I think
> > fully inspecting the mmsg and vlen can be something LSM policies can be
> > interested in inspecting too, no?
>
> Maybe, but not with what we currently have in-tree. Perhaps this is a
> sign/opportunity that we can trim the arguments to
> security_socket_recvmsg() too?

Perhaps, but it's up to the LSM folks to tell if this was really a case
where passing the msg was something that ended up being overkill.

- Arnaldo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday 21 May 2009 11:11:09 am Arnaldo Carvalho de Melo wrote:
> Em Thu, May 21, 2009 at 11:03:26AM -0400, Paul Moore escreveu:
> > On Thursday 21 May 2009 10:47:39 am Arnaldo Carvalho de Melo wrote:
> > > Em Thu, May 21, 2009 at 10:16:17AM -0400, Paul Moore escreveu:
> > > > On Wednesday 20 May 2009 07:06:52 pm Arnaldo Carvalho de Melo wrote:
> > > > > Meaning receive multiple messages, reducing the number of syscalls
> > > > > and net stack entry/exit operations.
> > > >
> > > > NOTE: adding the LSM list to the CC line
> > >
> > > thanks!
> > >
> > > > If this approach is accepted I wonder if it would also make sense to
> > > > move the security_socket_recvmsg() hook out of __sock_recvmsg and
> > > > into the callers. I personally can't see a reason why we would need
> > > > to call into the LSM for each message in the case of the new
> > > > recvmmsg() syscall. The downside is that there is now some code
> > > > duplication (although we are only talking about duplicating ~three
> > > > lines of code) but the upside is that we won't end up calling into
> > > > the LSM for each of the messages when recvmmsg() is called, which
> > > > seems to fit well with the performance-oriented nature of the new
> > > > syscall.
> > >
> > > Agreed that we must do this earlier to avoid vlen calls to
> > > security_socket_recvmsg, but there are many callers of sock_recvmsg...
> >
> > Yeah, like I said there is a downside to this approach; the question is
> > how much do we care about performance and what are we willing to give up
> > for it? I don't know the answer, but I thought the question needed to be
> > asked.
>
> Well, if we only check if the process can read from the socket, I also
> see no reason for a new security_socket_recvmmsg nor for checking it
> multiple times in recvmmsg, since what changes (the msg) is of no
> interest to the LSM.

Exactly.

> > If I'm wrong I'm sure the LSM brain trust will quickly step in ...
>
> > > If security_socket_recvmsg receives the msg and inspects it, I think
> > > fully inspecting the mmsg and vlen can be something LSM policies can be
> > > interested in inspecting too, no?
> >
> > Maybe, but not with what we currently have in-tree. Perhaps this is a
> > sign/opportunity that we can trim the arguments to
> > security_socket_recvmsg() too?
>
> Perhaps, but it's up to the LSM folks to tell if this was really a case
> where passing the msg was something that ended up being overkill.

Yep, and since they are a rather ornery bunch I'm sure they will speak up.
However, I would say to go ahead and trim the unused args; if we need them
in the future we can always add them back in - none of this is really
visible outside the kernel, so changes are relatively easy. I'd hate to see
us miss an opportunity to make things better over concerns about what might
happen at some unknown point in the future.
Hi Arnaldo.

On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo (acme@redhat.com) wrote:
> Meaning receive multiple messages, reducing the number of syscalls and
> net stack entry/exit operations.
>
> Next patches will introduce mechanisms where protocols that want to
> optimize this operation will provide an unlocked_recvmsg operation.

What's the difference from a single msg with multiple iovecs? Can we
implement receiving from multiple sockets using this or a similar interface?
Em Thu, May 21, 2009 at 08:10:00PM +0400, Evgeniy Polyakov escreveu:
> Hi Arnaldo.
>
> On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo (acme@redhat.com) wrote:
> > Meaning receive multiple messages, reducing the number of syscalls and
> > net stack entry/exit operations.
> >
> > Next patches will introduce mechanisms where protocols that want to
> > optimize this operation will provide an unlocked_recvmsg operation.
>
> What's the difference from the single msg with multiple iovecs?

recvmsg consumes just one skb, a datagram, truncating if it has more bytes
than asked for and giving fewer bytes than asked for if the skb being
consumed is smaller than requested.

WRT iovec, it gets this skb/datagram and goes on filling iovec entry by
entry, till it exhausts the skb.

The usecase here is: a UDP socket has multiple skbs in its receive queue, so
the application will make several syscalls to get those skbs, while we could
return multiple datagrams in just one syscall + fd lookup + LSM validation +
lock_sock + release_sock.

We could use some sort of setsockopt to instead put a datagram per iovec
entry, but that would be cumbersome; using recvmsg + recvmmsg on the same
code should be possible without requiring setsockopts to switch modes.

The proposed API just adds an "array mode" for recvmsg, keeping the existing
semantics as much as possible to ease converting applications to it;
libraries would just do some internal caching and their users wouldn't
notice the change.

> Can we implement receiving from multiple sockets using this or similar interface?

Not really; we don't pass something like a poll_fd array, and the operation
as proposed is per socket.

- Arnaldo
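[Editor's note: the "array mode" Arnaldo describes can be illustrated from userspace with the recvmmsg() wrapper as it eventually shipped in glibc. Note the shipped signature grew a fifth timeout argument after this RFC, so treat the exact prototype as an assumption; the helper name below is invented for the sketch. Each mmsghdr in the array gets one whole datagram and its own msg_len.]

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Receive up to 'vlen' queued datagrams with one syscall, returning
 * how many mmsghdr entries were filled in.  Each entry points at its
 * own 64-byte buffer via a single-element iovec.
 */
static int recv_batch(int fd, char bufs[][64], struct mmsghdr *msgs,
		      struct iovec *iovs, unsigned int vlen)
{
	memset(msgs, 0, vlen * sizeof(*msgs));
	for (unsigned int i = 0; i < vlen; i++) {
		iovs[i].iov_base = bufs[i];
		iovs[i].iov_len  = sizeof(bufs[i]);
		msgs[i].msg_hdr.msg_iov    = &iovs[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}
	/* Blocking call: waits until 'vlen' datagrams have arrived. */
	return recvmmsg(fd, msgs, vlen, 0, NULL);
}
```

A library could hide this behind the plain recvmsg() interface, handing out one cached datagram per call, which is exactly the "internal caching" point above.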
Hi,

On Thu, 2009-05-21 at 13:27 -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, May 21, 2009 at 08:10:00PM +0400, Evgeniy Polyakov escreveu:
> > Hi Arnaldo.
> >
> > On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo (acme@redhat.com) wrote:
> > > Meaning receive multiple messages, reducing the number of syscalls and
> > > net stack entry/exit operations.
> > >
> > > Next patches will introduce mechanisms where protocols that want to
> > > optimize this operation will provide an unlocked_recvmsg operation.
> >
> > What's the difference from the single msg with multiple iovecs?
>
> recvmsg consumes just one skb, a datagram, truncating if it has more
> bytes than asked for and giving fewer bytes than asked for if the skb
> being consumed is smaller than requested.
>
> WRT iovec, it gets this skb/datagram and goes on filling iovec entry by
> entry, till it exhausts the skb.
>
> The usecase here is: a UDP socket has multiple skbs in its receive queue,
> so the application will make several syscalls to get those skbs, while we
> could return multiple datagrams in just one syscall + fd lookup + LSM
> validation + lock_sock + release_sock.

It's not just UDP/SOCK_DGRAM either. SOCK_SEQPACKET has to return after
each message because it is not allowed to (nor can it, given the API)
return more than one message in a single call. So with small messages that
can add up to a lot of calls. This way the MSG_EOR flags can be preserved
in the correct places.

Steve.
> +
> +	while (datagrams < vlen) {
> +		err = __sys_recvmsg(sock, (struct msghdr __user *)entry, flags);
> +		if (err < 0)
> +			goto out_put;
> +		err = __put_user(err, &entry->msg_len);
> +		if (err)
> +			goto out_put;
> +		++entry;
> +		++datagrams;
> +	}
>  out_put:
>  	fput_light(sock->file, fput_needed);
>  out:
> +	/*
> +	 * We may return less entries than requested (vlen) if the
> +	 * sock is non block and there aren't enough datagrams.
> +	 */
> +	if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> +		return datagrams;
>  	return err;
>  }
>

There is an assumption here that unless MSG_DONTWAIT is set, or there is an
error, the caller will be willing to wait indefinitely for N messages to
show up - and that it is never worth waking up the caller earlier with
fewer than N messages. I think an application would more typically want to
wait at most m msecs after the first message is received to see if any
other messages can be delivered at the same time.

A busy server could simply use DONTWAIT in a polling loop every cycle, but
it would be nice to be able to wait indefinitely for *any* of your clients
to send you a message.

Further, with some sockets there are some messages that are more equal than
others. Although valid messages, with no errors, they should be delivered
to user-mode immediately. The example that leaps to my mind immediately is
SCTP Events, particularly with one-to-many sockets. You could be waiting
for N messages, knowing that a specific peer has been asked to send N
messages. The 2nd message could be an SCTP event announcing that the
specific association has been torn down (and hence the remaining messages
will not be arriving). Waiting for a different association to send enough
messages to complete the request will not provide very prompt service.
Em Thu, May 21, 2009 at 05:33:35PM +0100, Steven Whitehouse escreveu:
> Hi,
>
> On Thu, 2009-05-21 at 13:27 -0300, Arnaldo Carvalho de Melo wrote:
> > Em Thu, May 21, 2009 at 08:10:00PM +0400, Evgeniy Polyakov escreveu:
> > > Hi Arnaldo.
> > >
> > > On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo (acme@redhat.com) wrote:
> > > > Meaning receive multiple messages, reducing the number of syscalls and
> > > > net stack entry/exit operations.
> > > >
> > > > Next patches will introduce mechanisms where protocols that want to
> > > > optimize this operation will provide an unlocked_recvmsg operation.
> > >
> > > What's the difference from the single msg with multiple iovecs?
> >
> > recvmsg consumes just one skb, a datagram, truncating if it has more
> > bytes than asked for and giving fewer bytes than asked for if the skb
> > being consumed is smaller than requested.
> >
> > WRT iovec, it gets this skb/datagram and goes on filling iovec entry by
> > entry, till it exhausts the skb.
> >
> > The usecase here is: a UDP socket has multiple skbs in its receive queue,
> > so the application will make several syscalls to get those skbs, while we
> > could return multiple datagrams in just one syscall + fd lookup + LSM
> > validation + lock_sock + release_sock.
>
> It's not just UDP/SOCK_DGRAM either. SOCK_SEQPACKET has to return after
> each message because it is not allowed to (nor can it, given the API)
> return more than one message in a single call. So with small messages
> that can add up to a lot of calls. This way the MSG_EOR flags can be
> preserved in the correct places.

Yeah, SOCK_SEQPACKET is just not optimized in the current patch; it could
be, just by providing a sk_prot->unlocked_recvmsg call, etc., and by using
sock_common_recvmsg in its socket->ops, which not all protocol families do,
but it was my plan long ago to make them all use it so that we could
shorten the path, etc. :-)

- Arnaldo
Em Thu, May 21, 2009 at 09:38:47AM -0700, Caitlin Bestler escreveu:
> > +
> > +	while (datagrams < vlen) {
> > +		err = __sys_recvmsg(sock, (struct msghdr __user *)entry, flags);
> > +		if (err < 0)
> > +			goto out_put;
> > +		err = __put_user(err, &entry->msg_len);
> > +		if (err)
> > +			goto out_put;
> > +		++entry;
> > +		++datagrams;
> > +	}
> >  out_put:
> >  	fput_light(sock->file, fput_needed);
> >  out:
> > +	/*
> > +	 * We may return less entries than requested (vlen) if the
> > +	 * sock is non block and there aren't enough datagrams.
> > +	 */
> > +	if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> > +		return datagrams;
> >  	return err;
> >  }
> >
> There is an assumption here that unless MSG_DONTWAIT is set, or there
> is an error, that the caller will be willing to wait indefinitely for
> N messages to show up -- and that it is never worth waking up the
> caller earlier with less than N messages.
>
> I think an application would more typically want to wait at most m
> msecs after the first message is received to see if any other messages
> can be delivered at the same time.
>
> A busy server could simply use DONTWAIT in a polling loop every cycle,
> but it would be nice to be able to wait indefinitely for *any* of your
> clients to send you a message.
>
> Further, with some sockets there are some messages that are more equal
> than others. Although valid messages, with no errors, they should be
> delivered to user-mode immediately.
>
> The example that leaps to my mind immediately is SCTP Events,
> particularly with one-to-many sockets. You could be waiting for N
> messages, knowing that a specific peer has been asked to send N
> messages. The 2nd message could be an SCTP event announcing that the
> specific association has been torn down (and hence the remaining
> messages will not be arriving). Waiting for a different association
> to send enough messages to complete the request will not provide very
> prompt service.

Good points. That will require some sort of change to sk->sk_rcvtimeo
handling so that the time spent on each datagram is deducted from the
configured timeout for the socket, so that the existing code doesn't have
to be changed.

I.e. recvmmsg would save the value of sk->sk_rcvtimeo at entry and restore
it at exit, and would somehow go on subtracting the time sock_recvmsg()
took from it, so that the following call finds a reduced sk->sk_rcvtimeo if
it was configured in the first place and the socket is in blocking mode.

How does that sound?

- Arnaldo
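[Editor's note: the save/subtract/restore bookkeeping Arnaldo proposes can be sketched in userspace terms. This is a mock, not the actual sock_recvmsg path - the function names are invented, 'rcvtimeo' plays the role of sk->sk_rcvtimeo in jiffies, and MAX_SCHEDULE_TIMEOUT stands in for the kernel's "block forever" sentinel.]

```c
#include <assert.h>
#include <limits.h>

#define MAX_SCHEDULE_TIMEOUT LONG_MAX   /* "no timeout configured" */

/*
 * Given the budget a datagram started with and the time its
 * sock_recvmsg() call took, return the budget the *next* datagram
 * in the batch should see.  0 means "don't block at all".
 */
static long remaining_rcvtimeo(long rcvtimeo, long elapsed)
{
	if (rcvtimeo == MAX_SCHEDULE_TIMEOUT)   /* unlimited stays unlimited */
		return rcvtimeo;
	if (elapsed >= rcvtimeo)                /* budget exhausted */
		return 0;
	return rcvtimeo - elapsed;
}

/*
 * Batch loop: hand the shrinking budget to each simulated receive
 * (cost[i] = time call i took) and stop once the budget is gone.
 * Returns how many datagrams fit inside the original timeout.
 */
static int drain(long rcvtimeo, const long cost[], int n)
{
	long budget = rcvtimeo;
	int got = 0;

	for (int i = 0; i < n; i++) {
		got++;
		budget = remaining_rcvtimeo(budget, cost[i]);
		if (budget == 0)
			break;
	}
	return got;
}
```

The point of the trick is that the per-datagram receive path needs no changes at all: it keeps honouring whatever sk->sk_rcvtimeo it finds, and only the recvmmsg wrapper shrinks and finally restores that value.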
On Thu, May 21, 2009 at 9:55 AM, Arnaldo Carvalho de Melo
<acme@ghostprotocols.net> wrote:
>
> I.e. recvmmsg would save the value of sk->sk_rcvtimeo at entry and
> restore it at exit, and would somehow go on subtracting the time
> sock_recvmsg() took from it, so that the following call finds a reduced
> sk->sk_rcvtimeo if it was configured in the first place and the socket
> is in blocking mode.
>
> How does that sound?

I suspect that an additional timeout value will be needed ultimately.
Essentially there is the existing timeout (how long will I wait before I
want to be told that there have been no messages) and an additional
"delivery pending" timeout (how long can the delivery of a message be
delayed to attempt to coalesce it with other messages).

There is also one sticky compliance issue with SCTP: delivery of an
UNORDERED message MUST NOT be delayed waiting for other traffic. There is
no guarantee that the local client will be scheduled immediately, just a
prohibition on delaying delivery for the purpose of bundling.

That might mean a specific transport would have to support multiple
conditional timeouts: maximum delay after an SCTP Event, after an UNORDERED
message, after a message with PPID X, etc. Or for TCP, after a PUSH flag,
or after certain flags have been set, etc.

Those could probably be reduced to delivering immediately after any message
that the transport flags as "urgent", and *all* error completions are
urgent (which is what the code first shown does). The specific transport
could then use setsockopt to control what messages qualify as "urgent" in a
transport-specific manner.
Em Thu, May 21, 2009 at 10:26:58AM -0700, Caitlin Bestler escreveu:
> On Thu, May 21, 2009 at 9:55 AM, Arnaldo Carvalho de Melo
> <acme@ghostprotocols.net> wrote:
> >
> > I.e. recvmmsg would save the value of sk->sk_rcvtimeo at entry and
> > restore it at exit, and would somehow go on subtracting the time
> > sock_recvmsg() took from it, so that the following call finds a reduced
> > sk->sk_rcvtimeo if it was configured in the first place and the socket
> > is in blocking mode.
> >
> > How does that sound?
>
> I suspect that an additional timeout value will be needed ultimately.

So you mean we need a timeout to wait for a datagram, which remains
sk->sk_rcvtimeo (SO_RCVTIMEO), and one that is passed as a parameter to
recvmmsg?

> Essentially there is the existing timeout (how long will I wait before I
> want to be told that there have been no messages) and an additional
> "delivery pending" timeout (how long can the delivery of a message
> be delayed to attempt to coalesce it with other messages).

I gather this is a yes to the question I asked above :)

> There is also one sticky compliance issue with SCTP: delivery of an
> UNORDERED message MUST NOT be delayed waiting for other traffic.
> There is no guarantee that the local client will be scheduled immediately,
> just a prohibition on delaying delivery for the purpose of bundling.
>
> That might mean a specific transport would have to support multiple
> conditional timeouts: maximum delay after an SCTP Event, after an
> UNORDERED message, after a message with PPID X, etc. Or for
> TCP, after a PUSH flag, or after certain flags have been set, etc.
>
> Those could probably be reduced to delivering immediately after
> any message that the transport flags as "urgent", and *all* error
> completions are urgent (which is what the code first shown does).

Checking if MSG_OOB is set in the last datagram returned by the
unlocked_recvmsg call would make recvmmsg return immediately even when in
blocking mode and not having filled the array, OK.

> The specific transport could then use setsockopt to control what
> messages qualify as "urgent" in a transport-specific manner.

OK

- Arnaldo
On Thursday 21 May 2009 02:06:52 ext Arnaldo Carvalho de Melo wrote:
> +	/*
> +	 * We may return less entries than requested (vlen) if the
> +	 * sock is non block and there aren't enough datagrams.
> +	 */
> +	if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> +		return datagrams;
>  	return err;
>  }

Could there be a situation whereby we receive one or more datagrams, then
get an error? How does userland get the datagrams then?
Hi,

On Fri, May 22, 2009 at 10:22:48AM +0300, Rémi Denis-Courmont wrote:
> On Thursday 21 May 2009 02:06:52 ext Arnaldo Carvalho de Melo wrote:
> > +	/*
> > +	 * We may return less entries than requested (vlen) if the
> > +	 * sock is non block and there aren't enough datagrams.
> > +	 */
> > +	if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> > +		return datagrams;
> >  	return err;
> >  }
>
> Could there be a situation whereby we receive one or more datagrams, then
> get an error? How does userland get the datagrams then?

Normally you'd expect the call to return what it has read without an error,
and then the socket error would be picked up on the next call. Need to
check if sk->sk_err is being reset in the common code, I guess.

Steve.
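[Editor's note: the convention Steve describes - already-received datagrams win over a late error, which is then surfaced by the *next* call - can be written out as a small return-value function. This mirrors the tail of the quoted patch plus the suggested partial-success rule; it is a sketch of the semantics, not kernel code.]

```c
#include <assert.h>
#include <errno.h>

#ifndef MSG_DONTWAIT
#define MSG_DONTWAIT 0x40       /* value as on Linux; illustrative */
#endif

/*
 * Decide what a batched receive should return, given the last
 * per-datagram result 'err' (0 or a negative errno) and the count
 * of datagrams already copied to userspace.
 */
static int recvmmsg_retval(int err, int datagrams, int flags)
{
	/* The cases from the patch: success, or expected EAGAIN
	 * on a non-blocking call, return the count. */
	if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
		return datagrams;
	/* Partial success: report what was read; the socket error
	 * stays pending and is returned by the next call. */
	if (datagrams > 0)
		return datagrams;
	return err;             /* nothing read: report the error now */
}
```

From userland's point of view this means a return value > 0 never hides data, and an error return guarantees that no datagrams were consumed by that call.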
Hi,

On Thu, May 21, 2009 at 02:51:22PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, May 21, 2009 at 10:26:58AM -0700, Caitlin Bestler escreveu:
> > On Thu, May 21, 2009 at 9:55 AM, Arnaldo Carvalho de Melo
> > <acme@ghostprotocols.net> wrote:
> > >
> > > I.e. recvmmsg would save the value of sk->sk_rcvtimeo at entry and
> > > restore it at exit, and would somehow go on subtracting the time
> > > sock_recvmsg() took from it, so that the following call finds a reduced
> > > sk->sk_rcvtimeo if it was configured in the first place and the socket
> > > is in blocking mode.
> > >
> > > How does that sound?
> >
> > I suspect that an additional timeout value will be needed ultimately.
>
> So you mean we need a timeout to wait for a datagram, which remains
> sk->sk_rcvtimeo (SO_RCVTIMEO), and one that is passed as a parameter to
> recvmmsg?

The other thing which might also need looking at is the low-water mark:
should that apply to each individual message, or to all the received
messages in combination? I think the latter is more useful.

Steve.
2009/5/22 Rémi Denis-Courmont <remi.denis-courmont@nokia.com>:
> On Thursday 21 May 2009 02:06:52 ext Arnaldo Carvalho de Melo wrote:
> > +	/*
> > +	 * We may return less entries than requested (vlen) if the
> > +	 * sock is non block and there aren't enough datagrams.
> > +	 */
> > +	if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT)))
> > +		return datagrams;
> >  	return err;
> >  }
>
> Could there be a situation whereby we receive one or more datagrams, then
> get an error? How does userland get the datagrams then?

This is exactly how a server using SOCK_DGRAM or SOCK_SEQPACKET would want
to work - give me up to N messages received over time X (waiting at most
time Y after the first message is received). The server application then:

 + wakes up with at most Y delay on delivery of any single message.
 + but wakes up only once during that period (unless N or more messages
   are received)
 + wakes up at least every X to deal with idle processing.

It could simply wake up every y msecs, then use recvmmsg to receive
everything that is there, but that would mean a lot of idle wake-ups when
the number of clients is low. These are all less relevant for SOCK_STREAM,
since statistical averaging of incoming messages is rarely feasible for a
single connection.

As commented earlier, however, it would make sense for each transport to be
able to declare a given message to be "urgent" (by whatever means
appropriate to that transport) and thereby trigger an earlier delivery.
RDMA interfaces use a "solicited event" bit for these purposes.
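[Editor's note: the busy-server fallback mentioned in this subthread - block for readiness, then drain everything already queued without blocking - needs no new timeout machinery at all. The sketch below uses poll(2) plus the glibc recvmmsg() wrapper as it later shipped; the helper name is invented, and it assumes the caller has already set up the mmsghdr array.]

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Block until the socket is readable, then take every datagram that
 * is already queued, in one batched non-blocking call.  Returns the
 * number of datagrams received, or -1 on error.
 */
static int drain_ready(int fd, struct mmsghdr *msgs, unsigned int vlen)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	if (poll(&pfd, 1, -1) < 0)
		return -1;
	/* MSG_DONTWAIT: take what is there, never block here. */
	return recvmmsg(fd, msgs, vlen, MSG_DONTWAIT, NULL);
}
```

This gives "wait indefinitely for *any* client" semantics with exactly one wake-up per burst, at the price of one extra poll() syscall per batch.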
On Wed, May 20, 2009 at 08:06:52PM -0300, Arnaldo Carvalho de Melo wrote:
> Meaning receive multiple messages, reducing the number of syscalls and
> net stack entry/exit operations.
>
> Next patches will introduce mechanisms where protocols that want to
> optimize this operation will provide an unlocked_recvmsg operation.

Not to throw more questions into the mix, but didn't Ingo write a batching
syscall a while back, which let you issue several syscalls in one trap to
kernel space? I understand that your approach has some efficiency gains
over that, but did that ever get accepted upstream? Is the overlap there
sufficient to make this approach redundant? Or are the gains in performance
here sufficient to warrant this new call?

Regards
Neil
On Fri, May 22, 2009 at 1:06 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> Not to throw more questions into the mix, but didn't Ingo write a batching
> syscall a while back, which let you issue several syscalls in one trap to
> kernel space? I understand that your approach has some efficiency gains
> over that, but did that ever get accepted upstream? Is the overlap there
> sufficient to make this approach redundant? Or are the gains in
> performance here sufficient to warrant this new call?

I couldn't find this via Google or in any posts to LKML by Ingo.

I'm very interested in multiple send/recvmsg support. So, it's definitely
going to happen? :-)

Regards -- Andy
Em Wed, Jun 03, 2009 at 06:44:22PM -0700, Andrew Grover escreveu:
> On Fri, May 22, 2009 at 1:06 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > Not to throw more questions into the mix, but didn't Ingo write a
> > batching syscall a while back, which let you issue several syscalls in
> > one trap to kernel space? I understand that your approach has some
> > efficiency gains over that, but did that ever get accepted upstream? Is
> > the overlap there sufficient to make this approach redundant? Or are the
> > gains in performance here sufficient to warrant this new call?
>
> I couldn't find this via Google or in any posts to LKML by Ingo.
>
> I'm very interested in multiple send/recvmsg support. So, it's
> definitely going to happen? :-)

I got sidetracked with some other activities and was waiting for more
comments and reports from some testers, but I will be back to implement a
v2 patch soon.

- Arnaldo
On Wed, Jun 03, 2009 at 06:44:22PM -0700, Andrew Grover wrote:
> On Fri, May 22, 2009 at 1:06 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > Not to throw more questions into the mix, but didn't Ingo write a
> > batching syscall a while back, which let you issue several syscalls in
> > one trap to kernel space? I understand that your approach has some
> > efficiency gains over that, but did that ever get accepted upstream? Is
> > the overlap there sufficient to make this approach redundant? Or are the
> > gains in performance here sufficient to warrant this new call?
>
> I couldn't find this via Google or in any posts to LKML by Ingo.
>
> I'm very interested in multiple send/recvmsg support. So, it's
> definitely going to happen? :-)
>
> Regards -- Andy

It's entirely possible it was just something he talked about on IRC; I just
brought it up because of the overlap.

Neil
diff --git a/arch/alpha/kernel/systbls.S b/arch/alpha/kernel/systbls.S
index 95c9aef..cda6b8b 100644
--- a/arch/alpha/kernel/systbls.S
+++ b/arch/alpha/kernel/systbls.S
@@ -497,6 +497,7 @@ sys_call_table:
 	.quad sys_signalfd
 	.quad sys_ni_syscall
 	.quad sys_eventfd
+	.quad sys_recvmmsg
 	.size sys_call_table, . - sys_call_table
 	.type sys_call_table, @object
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 1680e9e..80fa7fc 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -372,6 +372,7 @@
 /* 360 */	CALL(sys_inotify_init1)
 		CALL(sys_preadv)
 		CALL(sys_pwritev)
+		CALL(sys_recvmmsg)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
index 7bebac0..aef1e61 100644
--- a/arch/ia64/kernel/entry.S
+++ b/arch/ia64/kernel/entry.S
@@ -1805,6 +1805,7 @@ sys_call_table:
 	data8 sys_inotify_init1
 	data8 sys_preadv
 	data8 sys_pwritev			// 1320
+	data8 sys_recvmmsg
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
 #endif /* __IA64_ASM_PARAVIRTUALIZED_NATIVE */
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 0b31b9b..283383c 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -652,6 +652,8 @@ einval:	li	v0, -ENOSYS
 	sys	sys_inotify_init1	1
 	sys	sys_preadv		6	/* 4330 */
 	sys	sys_pwritev		6
+	sys	sys_pwritev		6
+	sys	sys_recvmmsg		4
 	.endm

 	/* We pre-compute the number of _instruction_ bytes needed to
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index c647fd6..109550f 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -489,4 +489,5 @@ sys_call_table:
 	PTR	sys_inotify_init1
 	PTR	sys_preadv
 	PTR	sys_pwritev			/* 5390 */
+	PTR	sys_recvmmsg
 	.size	sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index c2c16ef..d45f22c 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -415,4 +415,5 @@ EXPORT(sysn32_call_table)
 	PTR	sys_inotify_init1
 	PTR	sys_preadv
 	PTR	sys_pwritev
+	PTR	compat_sys_recvmmsg
 	.size	sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index 002fac2..de93e46 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -535,4 +535,5 @@ sys_call_table:
 	PTR	sys_inotify_init1
 	PTR	compat_sys_preadv	/* 4330 */
 	PTR	compat_sys_pwritev
+	PTR	compat_sys_recvmmsg
 	.size	sys_call_table,.-sys_call_table
diff --git a/arch/sh/kernel/syscalls_64.S b/arch/sh/kernel/syscalls_64.S
index a083609..4d180ac 100644
--- a/arch/sh/kernel/syscalls_64.S
+++ b/arch/sh/kernel/syscalls_64.S
@@ -389,3 +389,4 @@ sys_call_table:
 	.long sys_inotify_init1		/* 360 */
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_recvmmsg
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index 82b5bf8..5807ed5 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -156,4 +156,4 @@ sys_call_table:
 	.word sys_set_mempolicy, sys_kexec_load, sys_move_pages, sys_getcpu, sys_epoll_pwait
/*310*/	.word sys_utimensat, sys_signalfd, sys_timerfd_create, sys_eventfd, sys_fallocate
 	.word sys_timerfd_settime, sys_timerfd_gettime, sys_signalfd4, sys_eventfd2, sys_epoll_create1
-/*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv, sys_pwritev
+/*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv, sys_pwritev, sys_recvmmsg
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 900e161..713a32a 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_counter_open			298
 __SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)
+#define __NR_recvmmsg				299
+__SYSCALL(__NR_recvmmsg, sys_recvmmsg)

 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 421afb4..50c6c44 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -65,6 +65,12 @@ struct msghdr {
 	unsigned	msg_flags;
 };

+/* For recvmmsg/sendmmsg */
+struct mmsghdr {
+	struct msghdr	msg_hdr;
+	unsigned	msg_len;
+};
+
 /*
  *	POSIX 1003.1g - ancillary data object information
  *	Ancillary data consits of a sequence of pairs of
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 677d159..dbf94dd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -25,6 +25,7 @@ struct linux_dirent64;
 struct list_head;
 struct msgbuf;
 struct msghdr;
+struct mmsghdr;
 struct msqid_ds;
 struct new_utsname;
 struct nfsctl_arg;
@@ -555,6 +556,8 @@ asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
 asmlinkage long sys_recvfrom(int, void __user *, size_t, unsigned,
 				struct sockaddr __user *, int __user *);
 asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg, unsigned flags);
+asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
+				unsigned int vlen, unsigned flags);
 asmlinkage long sys_socket(int, int, int);
 asmlinkage long sys_socketpair(int, int, int, int __user *);
 asmlinkage long sys_socketcall(int call, unsigned long __user *args);
diff --git a/include/net/compat.h b/include/net/compat.h
index 5bbf8bf..35acabb 100644
--- a/include/net/compat.h
+++ b/include/net/compat.h
@@ -18,6 +18,11 @@ struct compat_msghdr {
 	compat_uint_t	msg_flags;
 };

+struct compat_mmsghdr {
+	struct compat_msghdr	msg_hdr;
+	compat_uint_t		msg_len;
+};
+
 struct compat_cmsghdr {
 	compat_size_t	cmsg_len;
 	compat_int_t	cmsg_level;
@@ -35,6 +40,8 @@ extern int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *);
 extern int verify_compat_iovec(struct msghdr *, struct iovec *, struct sockaddr *, int);
 extern asmlinkage long compat_sys_sendmsg(int,struct compat_msghdr
__user *,unsigned); extern asmlinkage long compat_sys_recvmsg(int,struct compat_msghdr __user *,unsigned); +extern asmlinkage long compat_sys_recvmmsg(int, struct compat_mmsghdr __user *, + unsigned, unsigned); extern asmlinkage long compat_sys_getsockopt(int, int, int, char __user *, int __user *); extern int put_cmsg_compat(struct msghdr*, int, int, int, void *); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 68320f6..f581fb0 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -48,7 +48,9 @@ cond_syscall(sys_shutdown); cond_syscall(sys_sendmsg); cond_syscall(compat_sys_sendmsg); cond_syscall(sys_recvmsg); +cond_syscall(sys_recvmmsg); cond_syscall(compat_sys_recvmsg); +cond_syscall(compat_sys_recvmmsg); cond_syscall(sys_socketcall); cond_syscall(sys_futex); cond_syscall(compat_sys_futex); diff --git a/net/compat.c b/net/compat.c index 8d73905..1ec778f 100644 --- a/net/compat.c +++ b/net/compat.c @@ -743,6 +743,13 @@ asmlinkage long compat_sys_recvmsg(int fd, struct compat_msghdr __user *msg, uns return sys_recvmsg(fd, (struct msghdr __user *)msg, flags | MSG_CMSG_COMPAT); } +asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg, + unsigned vlen, unsigned int flags) +{ + return sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen, + flags | MSG_CMSG_COMPAT); +} + asmlinkage long compat_sys_socketcall(int call, u32 __user *args) { int ret; diff --git a/net/socket.c b/net/socket.c index 791d71a..f0249cb 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1965,22 +1965,16 @@ out: return err; } -/* - * BSD recvmsg interface - */ - -SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg, - unsigned int, flags) +static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg, + unsigned flags) { struct compat_msghdr __user *msg_compat = (struct compat_msghdr __user *)msg; - struct socket *sock; struct iovec iovstack[UIO_FASTIOV]; struct iovec *iov = iovstack; struct msghdr msg_sys; unsigned long cmsg_ptr; int err, iov_size, 
total_len, len; - int fput_needed; /* kernel mode address */ struct sockaddr_storage addr; @@ -1996,13 +1990,9 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg, else if (copy_from_user(&msg_sys, msg, sizeof(struct msghdr))) return -EFAULT; - sock = sockfd_lookup_light(fd, &err, &fput_needed); - if (!sock) - goto out; - err = -EMSGSIZE; if (msg_sys.msg_iovlen > UIO_MAXIOV) - goto out_put; + goto out; /* Check whether to allocate the iovec area */ err = -ENOMEM; @@ -2010,7 +2000,7 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg, if (msg_sys.msg_iovlen > UIO_FASTIOV) { iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL); if (!iov) - goto out_put; + goto out; } /* @@ -2066,9 +2056,63 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg, out_freeiov: if (iov != iovstack) sock_kfree_s(sock->sk, iov, iov_size); +out: + return err; +} + +/* + * BSD recvmsg interface + */ + +SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg, + unsigned int, flags) +{ + int fput_needed, err; + struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed); + + if (!sock) + goto out; + + err = __sys_recvmsg(sock, msg, flags); + + fput_light(sock->file, fput_needed); +out: + return err; +} + +/* + * Linux recvmmsg interface + */ + +SYSCALL_DEFINE4(recvmmsg, int, fd, struct mmsghdr __user *, mmsg, + unsigned int, vlen, unsigned int, flags) +{ + int fput_needed, err, datagrams = 0; + struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed); + struct mmsghdr __user *entry = mmsg; + + if (!sock) + goto out; + + while (datagrams < vlen) { + err = __sys_recvmsg(sock, (struct msghdr __user *)entry, flags); + if (err < 0) + goto out_put; + err = __put_user(err, &entry->msg_len); + if (err) + goto out_put; + ++entry; + ++datagrams; + } out_put: fput_light(sock->file, fput_needed); out: + /* + * We may return less entries than requested (vlen) if the + * sock is non block and there aren't enough datagrams. 
+ */ + if (err == 0 || (err == -EAGAIN && (flags & MSG_DONTWAIT))) + return datagrams; return err; }
Meaning receive multiple messages, reducing the number of syscalls and
net stack entry/exit operations.

Next patches will introduce mechanisms where protocols that want to
optimize this operation will provide an unlocked_recvmsg operation.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 arch/alpha/kernel/systbls.S      |    1 +
 arch/arm/kernel/calls.S          |    1 +
 arch/ia64/kernel/entry.S         |    1 +
 arch/mips/kernel/scall32-o32.S   |    2 +
 arch/mips/kernel/scall64-64.S    |    1 +
 arch/mips/kernel/scall64-n32.S   |    1 +
 arch/mips/kernel/scall64-o32.S   |    1 +
 arch/sh/kernel/syscalls_64.S     |    1 +
 arch/sparc/kernel/systbls_64.S   |    2 +-
 arch/x86/include/asm/unistd_64.h |    2 +
 include/linux/socket.h           |    6 +++
 include/linux/syscalls.h         |    3 ++
 include/net/compat.h             |    7 ++++
 kernel/sys_ni.c                  |    2 +
 net/compat.c                     |    7 ++++
 net/socket.c                     |   72 ++++++++++++++++++++++++++++++-------
 16 files changed, 95 insertions(+), 15 deletions(-)