Message ID | m1bp7oq1u8.fsf@fess.ebiederm.org |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
On 09/23/2010 12:51 PM, Eric W. Biederman wrote:
>
> Add a system call for creating sockets in a specified network namespace.

What for?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2010-09-23 at 12:56 +0400, Pavel Emelyanov wrote:
> On 09/23/2010 12:51 PM, Eric W. Biederman wrote:
> >
> > Add a system call for creating sockets in a specified network namespace.
>
> What for?

I can see many uses if my understanding is correct..
ex, from mother namespace:
fdx = open socket at namespace blah
from mother namespace, read/write/poll fdx
(eg add route with netlink socket)

cheers,
jamal
On 09/23/2010 03:19 PM, jamal wrote:
> On Thu, 2010-09-23 at 12:56 +0400, Pavel Emelyanov wrote:
>> On 09/23/2010 12:51 PM, Eric W. Biederman wrote:
>>>
>>> Add a system call for creating sockets in a specified network namespace.
>>
>> What for?
>
> I can see many uses if my understanding is correct..
> ex, from mother namespace:
> fdx = open socket at namespace blah
> from mother namespace, read/write/poll fdx
> (eg add route with netlink socket)

This particular usecase is unneeded once you have the "enter" ability.

> cheers,
> jamal
On Thu, 2010-09-23 at 15:33 +0400, Pavel Emelyanov wrote:
> This particular usecase is unneeded once you have the "enter" ability.
Is that cheaper from a syscall count/cost?
i.e do I have to enter every time i want to write/read this fd?
How does poll/select work in that enter scenario?
cheers,
jamal
On 09/23/2010 03:40 PM, jamal wrote:
> On Thu, 2010-09-23 at 15:33 +0400, Pavel Emelyanov wrote:
>
>> This particular usecase is unneeded once you have the "enter" ability.
>
> Is that cheaper from a syscall count/cost?

Why does it matter? You told, that the usage scenario was to
add routes to container. If I do 2 syscalls instead of 1, is
it THAT worse?

> i.e do I have to enter every time i want to write/read this fd?

No - you enter once, create a socket and do whatever you need
within the entered namespace.

> How does poll/select work in that enter scenario?

Just like it used to before the enter.

> cheers,
> jamal
On Thu, 2010-09-23 at 15:53 +0400, Pavel Emelyanov wrote:

> Why does it matter? You told, that the usage scenario was to
> add routes to container. If I do 2 syscalls instead of 1, is
> it THAT worse?
>

Anything to do with socket IO that requires namespace awareness
applies for usage; it could be tcp/udp/etc socket. If it doesnt
make any difference performance wise using one scheme vs other
to write/read heavy messages then i dont see an issue and socketat
is redundant.
If i was to pick blindly - I would say whatever approach with less
syscalls is better even if just a "slow" path one time thing. I could
create a scenario which would make it bad to have more syscalls.
But theres also the simplicity aspect in doing:

fdx = socketat namespace foo
use fdx for read/write/poll into foo without any wrapper code.

Vs

enter foo
fdx = socket ..
read/write fdx
leave foo.

> Just like it used to before the enter.
>

So if i enter foo, get a fdx, leave foo i can use it in ns0 as if
it was in ns0?

cheers,
jamal
On 09/23/2010 04:11 PM, jamal wrote:
> On Thu, 2010-09-23 at 15:53 +0400, Pavel Emelyanov wrote:
>
>> Why does it matter? You told, that the usage scenario was to
>> add routes to container. If I do 2 syscalls instead of 1, is
>> it THAT worse?
>>
>
> Anything to do with socket IO that requires namespace awareness
> applies for usage; it could be tcp/udp/etc socket. If it doesnt
> make any difference performance wise using one scheme vs other
> to write/read heavy messages then i dont see an issue and socketat
> is redundant.

That's what my point is about - unless we know why would we need it
we don't need it.

Eric, please clarify, what is the need in creating a socket in foreign
net namespace?
On Thu, Sep 23, 2010 at 04:34:37PM +0400, Pavel Emelyanov wrote:
> On 09/23/2010 04:11 PM, jamal wrote:
> > On Thu, 2010-09-23 at 15:53 +0400, Pavel Emelyanov wrote:
> >
> >> Why does it matter? You told, that the usage scenario was to
> >> add routes to container. If I do 2 syscalls instead of 1, is
> >> it THAT worse?
> >>
> >
> > Anything to do with socket IO that requires namespace awareness
> > applies for usage; it could be tcp/udp/etc socket. If it doesnt
> > make any difference performance wise using one scheme vs other
> > to write/read heavy messages then i dont see an issue and socketat
> > is redundant.
>
> That's what my point is about - unless we know why would we need it
> we don't need it.
>
> Eric, please clarify, what is the need in creating a socket in foreign
> net namespace?

Hmm. If you somewhere get the fd to a socket from another namespace, it
definitely does work (I'm currently implementing my "socketat" with fd
passing through AF_UNIX sockets, so i know it works), so the

	setns(other...)
	fd = socket(...)
	setns(orig...)

sequence would certainly work. However, there might be other things
happening in between, like a signal (imagine AIO particularly). While
signals are user-controllable (and therefore to be managed/excluded by
the user), we need to think about whether there are other problems with
doing this as a sequence.

If there are no other problematic conditions with this, socketat should
probably be moved to a user library.

-David
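
[Editor's sketch] David mentions handing a socket from one namespace to another process by "fd passing through AF_UNIX sockets". His code is not shown in the thread, so the following is only a minimal illustration of the standard SCM_RIGHTS mechanism that such an approach would presumably rely on. The helper names send_fd() and recv_fd() are hypothetical and not from the thread.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor fd over the AF_UNIX socket chan. Returns 0 or -1. */
static int send_fd(int chan, int fd)
{
	char byte = 'x';	/* one byte of ordinary data carries the control message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {			/* union guarantees cmsghdr alignment */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from chan. Returns the new fd or -1. */
static int recv_fd(int chan)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(chan, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In the scenario discussed here, a helper process living in the target namespace would call socket() there and pass the result with send_fd(); the receiver can then use the descriptor from its own namespace, since the socket keeps the namespace it was created in.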
Pavel Emelyanov <xemul@parallels.com> writes:

> On 09/23/2010 04:11 PM, jamal wrote:
>> On Thu, 2010-09-23 at 15:53 +0400, Pavel Emelyanov wrote:
>>
>>> Why does it matter? You told, that the usage scenario was to
>>> add routes to container. If I do 2 syscalls instead of 1, is
>>> it THAT worse?
>>>
>>
>> Anything to do with socket IO that requires namespace awareness
>> applies for usage; it could be tcp/udp/etc socket. If it doesnt
>> make any difference performance wise using one scheme vs other
>> to write/read heavy messages then i dont see an issue and socketat
>> is redundant.
>
> That's what my point is about - unless we know why would we need it
> we don't need it.
>
> Eric, please clarify, what is the need in creating a socket in foreign
> net namespace?

Strictly speaking, you can implement this functionality with setns().
aka

int socketat(int nsfd, int domain, int type, int protocol)
{
	int sk;
	setns(0, nsfd);
	sk = socket(domain, type, protocol);
	setns(0, default_nsfd);
	return sk;
}

The major difference is that socketat in userspace suffers from
races, with signals etc.

The use case is the handful of networking applications that find
that it makes sense to listen to sockets from multiple network
namespaces at once. Say a home machine that has a vpn into your
office network and the vpn into the office network runs in a
different network namespace so you don't have to worry about
address conflicts between the two networks, the chance of accidentally
bridging between them, and so you can use different dns resolvers
for the different networks.

In that scenario it would be nice if I could run some services on
both networks. Starting two+ copies of the daemons just so they can
live in all of the networks is ok, but in the fullness of time I
expect that there will be daemons that want to optimize things and
have sockets in all of the network namespaces you are connected to.

In a multiple network namespace aware application when it goes to open
a socket it will want to specify which network namespace the socket is
in. If it is a general listener it will probably be listening to events
in /proc/mounts waiting for extra namespaces to be mounted under a
standard location say: /var/run/netns/<netnsname>/ns. Once the
application receives the event for a new network namespace showing up
it will want to create a new socket listening for connections in the
new network namespace.

In that scenario none of those network namespaces are foreign, but one
network namespace will be the default and the rest will be non-default
network namespaces.

To support a multiple network namespace aware daemon I need to
implement socketat() somewhere. So I figured I would see if anyone
minded a trivial in kernel race free implementation. To me it is a
wart in the API and I am busily removing warts in the API.

I don't know of any scenarios with other namespaces where there would
be applications that would be native in multiple namespaces. So I
haven't done any work in that direction.

Eric
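
[Editor's sketch] For readers who want to try the userspace variant Eric describes, here is a minimal version with error handling. It assumes the setns(fd, nstype) argument order that glibc exposes today, which differs from the setns(0, nsfd) order in the RFC snippet above. The name socketat_user() is hypothetical, and treating nsfd == -1 as "current namespace" simply mirrors the fd == -1 branch of the posted patch. The sketch is not signal safe: a handler that runs between the two setns() calls sees the other namespace, which is exactly the race being discussed. It also assumes a single-threaded caller, because /proc/self/ns/net names the namespace of the main thread.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Userspace approximation of the proposed socketat().
 * nsfd is an open descriptor on a network namespace file (for example
 * /proc/<pid>/ns/net), or -1 for the caller's current namespace.
 * Returns a socket descriptor, or -1 with errno set.
 */
int socketat_user(int nsfd, int domain, int type, int protocol)
{
	int orig, sk, saved_errno;

	if (nsfd == -1)
		return socket(domain, type, protocol);

	/* Remember where we came from so we can switch back. */
	orig = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
	if (orig < 0)
		return -1;

	if (setns(nsfd, CLONE_NEWNET) < 0) {
		saved_errno = errno;
		close(orig);
		errno = saved_errno;
		return -1;
	}

	/* The socket is bound to the namespace that is current right now. */
	sk = socket(domain, type, protocol);
	saved_errno = errno;

	/* If switching back fails, the caller is left in the other namespace. */
	if (setns(orig, CLONE_NEWNET) < 0) {
		saved_errno = errno;
		if (sk >= 0)
			close(sk);
		sk = -1;
	}

	close(orig);
	errno = saved_errno;
	return sk;
}
```

Entering another network namespace with setns() needs CAP_SYS_ADMIN, so only the nsfd == -1 path works for an unprivileged caller. A more careful version would presumably block signals around the switch, which is part of why an in-kernel socketat() looks simpler.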
On 09/23/2010 01:53 PM, Pavel Emelyanov wrote:
> On 09/23/2010 03:40 PM, jamal wrote:
>
>> On Thu, 2010-09-23 at 15:33 +0400, Pavel Emelyanov wrote:
>>
>>> This particular usecase is unneeded once you have the "enter" ability.
>>>
>> Is that cheaper from a syscall count/cost?
>>
> Why does it matter? You told, that the usage scenario was to
> add routes to container. If I do 2 syscalls instead of 1, is
> it THAT worse?
>
>> i.e do I have to enter every time i want to write/read this fd?
>>
> No - you enter once, create a socket and do whatever you need
> within the entered namespace.
>

Just to clarify this point. You enter the namespace, create the socket
and go back to the initial namespace (or create a new one). Further
operations can be made against this fd because it is the network
namespace stored in the sock struct which is used, not the current
process network namespace which is used at the socket creation only.

We can actually already do that by unsharing and then creating a socket.
This socket will pin the namespace and can be used as a control socket
for the namespace (assuming the socket domain will be ok for all the
operations).

Jamal, I don't know what kind of application you want to use but if I
assume you want to create a process controlling 1024 netns, let's try to
identify what happens with setns and with socketat :

With setns:

  * open /proc/self/ns/net (1)
  * unshare the netns
  * open /proc/self/ns/net (2)
  * setns (1)
  * create a virtual network device
  * move the virtual device to (2) (using the set netns by fd)
  * unshare the netns
  ...

With socketat:

  * open a socket (1)
  * unshare the netns
  * open a netlink with socketat(1) => (2)
  * create a virtual device using (2) (at this point it is init_net_ns)
  * move the virtual device to the current netns (using the set netns
    by pid)
  * open a socket (3)
  * unshare the netns
  ...

We have the same number of file descriptors kept opened. Except, with
setns we can bind mount the directory somewhere, that will pin the
namespace and then we can close the /proc/self/ns/net file descriptors
and reopen them later.

If your application has to do a lot of specific network processing,
during its life cycle, in different namespaces, the socketat syscall
will be better because it will reduce the number of syscalls but at the
cost of keeping the file descriptors opened (potentially a big number).
Otherwise, setns should fit your needs.

>> How does poll/select work in that enter scenario?
>>
> Just like it used to before the enter.
>
>> cheers,
>> jamal
Hi Daniel,

Thanks for clarifying this ..

On Sat, 2010-10-02 at 23:13 +0200, Daniel Lezcano wrote:

> Just to clarify this point. You enter the namespace, create the socket
> and go back to the initial namespace (or create a new one). Further
> operations can be made against this fd because it is the network
> namespace stored in the sock struct which is used, not the current
> process network namespace which is used at the socket creation only.
>
> We can actually already do that by unsharing and then create a
> socket.
> This socket will pin the namespace and can be used as a control socket
> for the namespace (assuming the socket domain will be ok for all the
> operations).
>
> Jamal, I don't know what kind of application you want to use but if I
> assume you want to create a process controlling 1024 netns,

At the moment i am looking at 8K on a Nehalem with lots of RAM. They
will mostly be created at startup but some could be created afterwards.
Each will have its own netdevs etc. also created at startup (and some
other config that may happen later).
Because startup time may accumulate, it is clearly important to me
to pick whatever scheme that reduces the number of calls...

> let's try to identify what happens with setns and with socketat :
>
> With setns:
>
>   * open /proc/self/ns/net (1)
>   * unshare the netns
>   * open /proc/self/ns/net (2)
>   * setns (1)
>   * create a virtual network device
>   * move the virtual device to (2) (using the set netns by fd)
>   * unshare the netns
>   ...
>
> With socketat:
>
>   * open a socket (1)
>   * unshare the netns
>   * open a netlink with socketat(1) => (2)
>   * create a virtual device using (2) (at this point it is
>     init_net_ns)
>   * move the virtual device to the current netns (using the set
>     netns by pid)
>   * open a socket (3)
>   * unshare the netns
>   ...
>
> We have the same number of file descriptors kept opened. Except, with
> setns we can bind mount the directory somewhere, that will pin the
> namespace and then we can close the /proc/self/ns/net file descriptors
> and reopen them later.
>

Ok, so a wrapper such as: create_socket_on(namespaceid)
will have generally less system calls with socketat()

> If your application has to do a lot of specific network processing,
> during its life cycle, in different namespaces, the socketat syscall
> will be better because it will reduce the number of syscalls but at
> the cost of keeping the file descriptors opened (potentially a big
> number). Otherwise, setns should fit your needs.

Makes sense.

One thing still confuses me...
The app control point is in namespace0. I still want to be able to
"boot" namespaces first and maybe a few seconds later do a socketat()...
and create devices, tcp sockets etc. I suspect create_ns(namespace-name)
would involve:
 * open /proc/self/ns/net (namespace-name)
 * unshare the netns
Is this correct?

cheers,
jamal
On 10/03/2010 03:44 PM, jamal wrote:
> Hi Daniel,
>
> Thanks for clarifying this ..
>
> On Sat, 2010-10-02 at 23:13 +0200, Daniel Lezcano wrote:
>
>> Just to clarify this point. You enter the namespace, create the socket
>> and go back to the initial namespace (or create a new one). Further
>> operations can be made against this fd because it is the network
>> namespace stored in the sock struct which is used, not the current
>> process network namespace which is used at the socket creation only.
>>
>> We can actually already do that by unsharing and then create a
>> socket.
>> This socket will pin the namespace and can be used as a control socket
>> for the namespace (assuming the socket domain will be ok for all the
>> operations).
>>
>> Jamal, I don't know what kind of application you want to use but if I
>> assume you want to create a process controlling 1024 netns,
>>
> At the moment i am looking at 8K on a Nehalem with lots of RAM. They
> will mostly be created at startup but some could be created afterwards.
> Each will have its own netdevs etc. also created at startup (and some
> other config that may happen later).
> Because startup time may accumulate, it is clearly important to me
> to pick whatever scheme that reduces the number of calls...
>

8K ! whow ! :)

>> let's try to identify what happens with setns and with socketat :
>>
>> With setns:
>>
>>   * open /proc/self/ns/net (1)
>>   * unshare the netns
>>   * open /proc/self/ns/net (2)
>>   * setns (1)
>>   * create a virtual network device
>>   * move the virtual device to (2) (using the set netns by fd)
>>   * unshare the netns
>>   ...
>>
>> With socketat:
>>
>>   * open a socket (1)
>>   * unshare the netns
>>   * open a netlink with socketat(1) => (2)
>>   * create a virtual device using (2) (at this point it is
>>     init_net_ns)
>>   * move the virtual device to the current netns (using the set
>>     netns by pid)
>>   * open a socket (3)
>>   * unshare the netns
>>   ...
>>
>> We have the same number of file descriptors kept opened. Except, with
>> setns we can bind mount the directory somewhere, that will pin the
>> namespace and then we can close the /proc/self/ns/net file descriptors
>> and reopen them later.
>>
> Ok, so a wrapper such as: create_socket_on(namespaceid)
> will have generally less system calls with socketat()
>

Yes, I think so.

>> If your application has to do a lot of specific network processing,
>> during its life cycle, in different namespaces, the socketat syscall
>> will be better because it will reduce the number of syscalls but at
>> the cost of keeping the file descriptors opened (potentially a big
>> number). Otherwise, setns should fit your needs.
>>
> Makes sense.
>
> One thing still confuses me...
> The app control point is in namespace0. I still want to be able to
> "boot" namespaces first and maybe a few seconds later do a socketat()...
> and create devices, tcp sockets etc. I suspect create_ns(namespace-name)
> would involve:
> * open /proc/self/ns/net (namespace-name)
> * unshare the netns
> Is this correct?
>

Maybe I am misunderstanding, but if you are trying to save some
syscalls, you should use socketat only and keep the app control
namespace0 socket for it. The process will be in the last netns you
unshared (maybe you can use here one setns syscall to return back to
the namespace0).

(1) socketat :

  * pros : 1 syscall to create a socket
  * cons : a file descriptor per namespace, namespace is only
    manageable via a socket

(2) setns :

  * pros : namespace is fully manageable with a generic code
  * cons : 2 syscalls (or 3 if we want to return to the initial netns)
    to create a socket (setns + socket [ + setns ]), a file descriptor
    per namespace

(3) setns + bind mount :

  * pros : no file descriptor needs to be kept opened
  * cons : startup longer (unshare + mount --bind), 4 syscalls to
    create a socket in the namespace (open, setns, socket, close)
    (may be 5 syscalls if we want to return to the initial netns).

Depending on the scheme you choose, the startup will be:

(1) socketat :

  * open /proc/self/ns/net (one time to 'save' and pin the initial netns)

and then

int create_ns(void)
{
	unshare(CLONE_NEWNET);
	return socket(...);
}

and,

for (i = 0; i < 8192; i++)
	mynsfd[i] = create_ns();

(2) setns :

  * open /proc/self/ns/net (one time to 'save' and pin the initial netns)

and then

int create_ns(void)
{
	unshare(CLONE_NEWNET);
	return open("/proc/self/ns/net", O_RDONLY);
}

and,

for (i = 0; i < 8192; i++)
	mynsfd[i] = create_ns();

(3) setns + mount :

  * open /proc/self/ns/net (one time to 'save' and pin the initial netns)

and then

int create_ns(const char *nspath)
{
	unshare(CLONE_NEWNET);
	close(creat(nspath, 0444));	/* the bind mount target must exist */
	return mount("/proc/self/ns/net", nspath, NULL, MS_BIND, NULL);
}

for (i = 0; i < 8192; i++)
	create_ns(mynspath[i]);

Hope that helps.

  -- Daniel
jamal <hadi@cyberus.ca> writes:

> One thing still confuses me...
> The app control point is in namespace0. I still want to be able to
> "boot" namespaces first and maybe a few seconds later do a socketat()...
> and create devices, tcp sockets etc. I suspect create_ns(namespace-name)
> would involve:
> * open /proc/self/ns/net (namespace-name)
> * unshare the netns
> Is this correct?

Almost. create should be:
 * verify namespace-name is not already in use
 * mkdir -p /var/run/netns/<namespace-name>
 * unshare the netns
 * mount --bind /proc/self/ns/net /var/run/netns/<namespace-name>

Are you talking about replacing something that used to use the linux
vrf patches that are floating around?

Eric
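
[Editor's sketch] Eric's four steps can be written out in C roughly as follows. This is only an illustration under several assumptions: the function names netns_path() and create_named_netns() are hypothetical, the bind mount target is a regular file named "ns" inside the per-namespace directory (matching the /var/run/netns/<netnsname>/ns layout mentioned earlier in the thread), and the base directory is assumed to exist already rather than being created as with mkdir -p. The caller needs CAP_SYS_ADMIN, and after a successful unshare() the calling process stays in the new namespace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

/* Build "<base>/<name>/ns" in buf. Rejects names that could escape base. */
int netns_path(char *buf, size_t len, const char *base, const char *name)
{
	int n;

	if (name[0] == '\0' || strchr(name, '/') ||
	    !strcmp(name, ".") || !strcmp(name, "..")) {
		errno = EINVAL;
		return -1;
	}
	n = snprintf(buf, len, "%s/%s/ns", base, name);
	if (n < 0 || (size_t)n >= len) {
		errno = ENAMETOOLONG;
		return -1;
	}
	return 0;
}

/* Create a new network namespace and pin it at <base>/<name>/ns. */
int create_named_netns(const char *base, const char *name)
{
	char path[4096], dir[4096];
	int fd, saved_errno;

	if (netns_path(path, sizeof(path), base, name) < 0)
		return -1;
	strcpy(dir, path);
	dir[strlen(dir) - 3] = '\0';	/* strip the trailing "/ns" */

	/* mkdir fails with EEXIST if the name is already in use. */
	if (mkdir(dir, 0755) < 0)
		return -1;

	/* A bind mount of a file needs an existing file as its target. */
	fd = open(path, O_RDONLY | O_CREAT | O_EXCL, 0444);
	if (fd < 0)
		goto fail_rmdir;
	close(fd);

	if (unshare(CLONE_NEWNET) < 0)
		goto fail_unlink;

	/* The mount keeps the namespace alive after this process leaves it. */
	if (mount("/proc/self/ns/net", path, NULL, MS_BIND, NULL) < 0)
		goto fail_unlink;

	return 0;

fail_unlink:
	saved_errno = errno;
	unlink(path);
	errno = saved_errno;
fail_rmdir:
	saved_errno = errno;
	rmdir(dir);
	errno = saved_errno;
	return -1;
}
```

A later open() of the pinned path would give a descriptor usable with setns() or with the proposed socketat(), so no descriptor has to stay open in between.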
Eric et al,

Did these patches make it in? I was looking at
two Davem net trees and i dont see them.

cheers,
jamal
Eric,

Ping?
If you are too busy to push these in maybe have
someone clueful like Daniel help out submitting? I think it
should probably be reasonable to leave out the socketat
patch initially if it is deemed controversial..

cheers,
jamal

On Fri, 2010-10-15 at 08:30 -0400, jamal wrote:
> Eric et al,
>
> Did these patches make it in? I was looking at
> two Davem net trees and i dont see them.
>
> cheers,
> jamal
>
jamal <hadi@cyberus.ca> writes:

> Eric,
>
> Ping?
> If you are too busy to push these in maybe have
> someone clueful like Daniel help out submitting? I think it
> should probably be reasonable to leave out the socketat
> patch initially if it is deemed controversial..

This merge cycle I am too busy, and my patches did not make it into
linux-next before the merge window. Everything except socketat seems
non-controversial. socketat makes sense to postpone a little bit until
we start converting applications, and there is a little real world
experience about what is needed.

I anticipate some time freeing up in the next couple of weeks so I
should be ready for the next merge window.

Eric
diff --git a/net/socket.c b/net/socket.c
index 2270b94..1116f3c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1269,7 +1269,7 @@ int sock_create_kern(int family, int type, int protocol, struct socket **res)
 }
 EXPORT_SYMBOL(sock_create_kern);
 
-SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
+static int do_socket(struct net *net, int family, int type, int protocol)
 {
 	int retval;
 	struct socket *sock;
@@ -1289,7 +1289,7 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
 		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
 
-	retval = sock_create(family, type, protocol, &sock);
+	retval = __sock_create(net, family, type, protocol, &sock, 0);
 	if (retval < 0)
 		goto out;
 
@@ -1306,6 +1306,28 @@ out_release:
 	return retval;
 }
 
+SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
+{
+	return do_socket(current->nsproxy->net_ns, family, type, protocol);
+}
+
+SYSCALL_DEFINE4(socketat, int, fd, int, family, int, type, int, protocol)
+{
+	struct net *net;
+	int retval;
+
+	if (fd == -1) {
+		net = get_net(current->nsproxy->net_ns);
+	} else {
+		net = get_net_ns_by_fd(fd);
+		if (IS_ERR(net))
+			return PTR_ERR(net);
+	}
+	retval = do_socket(net, family, type, protocol);
+	put_net(net);
+	return retval;
+}
+
 /*
  * Create a pair of connected sockets.
  */
Add a system call for creating sockets in a specified network namespace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 net/socket.c |   26 ++++++++++++++++++++++++--
 1 files changed, 24 insertions(+), 2 deletions(-)