Message ID: 20160311182949.GG3383@oracle.com
State: Changes Requested, archived
Delegated to: David Miller
On Fri, 11 Mar 2016 13:29:49 -0500 Sowmini Varadhan <sowmini.varadhan@oracle.com> wrote:

> -#define RDS_TCP_DEFAULT_BUFSIZE (128 * 1024)
> +static int sndbuf_size = 16384;
> +module_param(sndbuf_size, int, 0444);
> +MODULE_PARM_DESC(sndbuf_size, "SO_SNDBUF size of kernel tcp socket");
> +
> +static int rcvbuf_size = 87380;
> +module_param(rcvbuf_size, int, 0444);
> +MODULE_PARM_DESC(rcvbuf_size, "SO_RCVBUF size of kernel tcp socket");

Module parameters are a problem for distributions and should only be used as a last resort.
On (03/11/16 11:09), Stephen Hemminger wrote:
> Module parameters are a problem for distributions and should only be used
> as a last resort.

I was not aware of that; out of curiosity, what is the associated problem? What would be the alternative recommendation in this case?

--Sowmini
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Fri, 11 Mar 2016 13:29:49 -0500

> Some payload sizes/patterns stand to gain performance benefits by
> tuning the size of the TCP socket buffers, so this commit adds
> module parameters to customize those values when desired.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>

No module parameters, please.
On (03/11/16 11:09), Stephen Hemminger wrote:
> Module parameters are a problem for distributions and should only be used
> as a last resort.

I don't know the history of what the distribution problem is, but I did try to use sysctl as an alternative for this. I'm starting to believe that this is one case where module params, with all their problems, are the least evil option. Here's what I find if I use sysctl:

- being able to tune the sndbuf and rcvbuf actually gives me a noticeable 2X perf improvement over the default for DB/Cluster request/response transactions, where the classic req size is 8K bytes, the response is 256 bytes, and there are a large number of such concurrent transactions queued up on the kernel tcp socket. (The defaults work well for larger packet sizes, but as I noted in the commit, packet sizes vary in actual deployment.)

Assuming I use sysctl:

- by the time the admin gets to execute the sysctl, the kernel listen socket for RDS_TCP_PORT would already have been created, and an arbitrary number of accept/connect (kernel) endpoints may have been created. According to tcp(7), you should set the buf sizes before connect/listen, so using sysctl means you have to reset a variable number of existing cluster connections. All of this can be done, but it adds a large mass of confusing code to reset kernel sockets and also get the cluster HA/failover right.

- at first I thought sysctl was attractive because it was netns aware, but found that it is only superficially so. The ->proc_handler does not pass in the struct net *, and setting up the ctl_table's ->data to point at a per-net var is not a simple thing to do. Thus, even though register_net_sysctl() takes a net * pointer, my handler has to do extra ugly things to get to the per-net vars.

I don't know if there is a better alternative than sysctl/module_param for achieving what I'm trying to do, which is to set up the params for the kernel sockets before creating them. If yes, some hints/rtfms would be helpful.

--Sowmini
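The per-net ->data problem described in the last bullet has a conventional (if clunky) kernel pattern: kmemdup() a ctl_table template in the pernet init hook, point each entry's ->data at that namespace's copy of the tunable, and have the handler recover its per-net container from ->data. The following is only a rough sketch of that pattern; struct rds_tcp_net, rds_tcp_netid and the handler name are hypothetical (not part of the posted patch), and the proc_handler signature is the one used by kernels of this vintage.

#include <linux/sysctl.h>
#include <linux/slab.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>

/* Hypothetical per-netns storage; assumed to be registered via
 * pernet_operations with .id = &rds_tcp_netid and
 * .size = sizeof(struct rds_tcp_net). */
struct rds_tcp_net {
	struct ctl_table_header *ctl_hdr;
	int sndbuf_size;
	int rcvbuf_size;
};

static unsigned int rds_tcp_netid;

static int rds_tcp_sndbuf_handler(struct ctl_table *table, int write,
				  void __user *buffer, size_t *lenp,
				  loff_t *ppos)
{
	/* ->proc_handler gets no struct net *; recover the per-net
	 * container from ->data, which the init hook below pointed at
	 * this namespace's copy of the tunable. */
	struct rds_tcp_net *rtn =
		container_of(table->data, struct rds_tcp_net, sndbuf_size);
	int err;

	err = proc_dointvec(table, write, buffer, lenp, ppos);
	if (err || !write)
		return err;
	/* rtn now identifies the netns whose kernel sockets would need
	 * to be re-tuned or reset. */
	(void)rtn;
	return 0;
}

static struct ctl_table rds_tcp_sysctl_template[] = {
	{
		.procname	= "sndbuf_size",
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= rds_tcp_sndbuf_handler,
	},
	{ }
};

static int __net_init rds_tcp_sysctl_init_net(struct net *net)
{
	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
	struct ctl_table *tbl;

	tbl = kmemdup(rds_tcp_sysctl_template,
		      sizeof(rds_tcp_sysctl_template), GFP_KERNEL);
	if (!tbl)
		return -ENOMEM;
	tbl[0].data = &rtn->sndbuf_size;	/* per-netns, not global */
	rtn->ctl_hdr = register_net_sysctl(net, "net/rds/tcp", tbl);
	if (!rtn->ctl_hdr) {
		kfree(tbl);
		return -ENOMEM;
	}
	return 0;
}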
On Fri, Mar 11, 2016 at 6:43 PM, Sowmini Varadhan <sowmini.varadhan@oracle.com> wrote:
> On (03/11/16 11:09), Stephen Hemminger wrote:
>> Module parameters are a problem for distributions and should only be used
>> as a last resort.
>
> I don't know the history of what the distribution problem is, but I
> did try to use sysctl as an alternative for this. I'm starting to
> believe that this is one case where module params, with all their
> problems, are the least evil option. Here's what I find if I use sysctl:
>
> - being able to tune the sndbuf and rcvbuf actually gives me a noticeable
> 2X perf improvement over the default for DB/Cluster request/response
> transactions, where the classic req size is 8K bytes, the response is 256
> bytes, and there are a large number of such concurrent transactions
> queued up on the kernel tcp socket. (The defaults work well for
> larger packet sizes, but as I noted in the commit, packet sizes vary
> in actual deployment.)
>
> Assuming I use sysctl:
>
> - by the time the admin gets to execute the sysctl, the kernel listen
> socket for RDS_TCP_PORT would already have been created, and an
> arbitrary number of accept/connect (kernel) endpoints may have been
> created. According to tcp(7), you should set the buf sizes before
> connect/listen, so using sysctl means you have to reset a variable
> number of existing cluster connections. All of this can be done, but
> it adds a large mass of confusing code to reset kernel sockets and
> also get the cluster HA/failover right.
>
> - at first I thought sysctl was attractive because it was netns aware,
> but found that it is only superficially so. The ->proc_handler does
> not pass in the struct net *, and setting up the ctl_table's ->data
> to point at a per-net var is not a simple thing to do. Thus, even though
> register_net_sysctl() takes a net * pointer, my handler has to do
> extra ugly things to get to the per-net vars.
>
> I don't know if there is a better alternative than sysctl/module_param
> for achieving what I'm trying to do, which is to set up the params
> for the kernel sockets before creating them. If yes, some
> hints/rtfms would be helpful.
>

You could consider an alternate model where connection management (connect, listen, etc.) is performed in a userspace daemon and the sockets are then attached to the kernel (pretty much like KCM does). This moves complexity out of the kernel, and gives you the flexibility to arbitrarily configure the TCP sockets (for instance, the 4 setsockopts in rds_tcp_keepalive could be configurable). The other cool thing this would allow is switching an existing TCP connection into RDS mode (like is done with several protocols on an http port).

Tom

> --Sowmini
>
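To make the KCM analogy concrete, here is a rough userspace-side sketch of that model: the daemon owns connect()/listen(), tunes the TCP socket however it likes, then hands the established socket to the kernel multiplexor. The KCM calls used below exist (linux/kcm.h), but using them as a stand-in for a hypothetical RDS attach interface is purely illustrative.

#include <sys/socket.h>
#include <sys/ioctl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/kcm.h>

/* Illustrative only: the "userspace owns the TCP connection, kernel
 * consumes it" model, using KCM's attach ioctl as the example. */
static int attach_tuned_tcp_socket(int kcm_fd, int tcp_fd, int bpf_prog_fd)
{
	int sndbuf = 16384, rcvbuf = 87380, one = 1;
	struct kcm_attach attach = {
		.fd = tcp_fd,		/* established TCP socket */
		.bpf_fd = bpf_prog_fd,	/* BPF program for message framing */
	};

	/* userspace can apply any per-connection tuning before handoff */
	setsockopt(tcp_fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
	setsockopt(tcp_fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
	setsockopt(tcp_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

	return ioctl(kcm_fd, SIOCKCMATTACH, &attach);
}

/* kcm_fd would come from:
 *	socket(AF_KCM, SOCK_SEQPACKET, KCMPROTO_CONNECTED);
 */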
On (03/11/16 19:21), Tom Herbert wrote:
> You could consider an alternate model where connection management
> (connect, listen, etc.) is performed in a userspace daemon and the
> sockets are then attached to the kernel (pretty much like KCM does). This moves

You have to remember that, like it or not, RDS has some IB legacy compat requirements that cannot be shed easily. The above suggestion would suddenly require an extra daemon per node in a cluster involving 100's of nodes with very complex DB setup procedures and rigid HA requirements (nodes in the cluster try to reconnect on connection failure). An extra daemon, with its own startup requirements and additional latency, would be hard to justify for something that can otherwise be done with module_param.

As such, the above suggestion is even more non-trivial than the "ugly" code I referred to in my previous email.

It is actually far simpler to just tell the cluster-setup scripts to refrain from an ifup of the relevant interfaces till all the config is set up.

Besides, the basic problem remains: for an arbitrary kernel module that has parameters that need to be customized before module init, what are the options today?

--Sowmini
On Fri, Mar 11, 2016 at 7:44 PM, Sowmini Varadhan <sowmini.varadhan@oracle.com> wrote:
> On (03/11/16 19:21), Tom Herbert wrote:
>> You could consider an alternate model where connection management
>> (connect, listen, etc.) is performed in a userspace daemon and the
>> sockets are then attached to the kernel (pretty much like KCM does). This moves
>
> You have to remember that, like it or not, RDS has some IB legacy
> compat requirements that cannot be shed easily. The above suggestion
> would suddenly require an extra daemon per node in a cluster
> involving 100's of nodes with very complex DB setup procedures and
> rigid HA requirements (nodes in the cluster try to reconnect
> on connection failure). An extra daemon, with its own startup requirements
> and additional latency, would be hard to justify for something
> that can otherwise be done with module_param.
>

You are describing your deployment of RDS, not the kernel implementation. What I see is a very rigid implementation that would make it hard for many of us to ever even consider deploying. The ability of applications to tune TCP connections is well understood, very prevalent, and really fundamental to making TCP-based datacenters work at large scale.

Anyway, it was just an idea... ;-)

> As such, the above suggestion is even more non-trivial than the "ugly" code
> I referred to in my previous email.
>
> It is actually far simpler to just tell the cluster-setup scripts to
> refrain from an ifup of the relevant interfaces till all the config
> is set up.
>
> Besides, the basic problem remains: for an arbitrary kernel module
> that has parameters that need to be customized before module init,
> what are the options today?
>

Maybe add one module parameter that indicates the module should just load but not start, configure whatever is needed via netlink, and then send one more netlink command to start operations.

> --Sowmini
On (03/11/16 20:07), Tom Herbert wrote:
> You are describing your deployment of RDS, not the kernel implementation.
> What I see is a very rigid implementation that would make it hard for
> many of us to ever even consider deploying. The ability of applications
> to tune TCP connections is well understood, very prevalent, and really
> fundamental to making TCP-based datacenters work at large scale.

Sorry, but historically, OS DDI/DKIs for kernel modules have always had some way to set up startup parameters at module load time. And clusters have been around for a while.

So let's just focus on the technical question around module config here, which involves more than TCP sockets, btw.

> Anyway, it was just an idea... ;-)

Thank you.

Moving on,

> Maybe add one module parameter that indicates the module should just
> load but not start, configure whatever is needed via netlink, and then
> send one more netlink command to start operations.

Even that needs an extra daemon.

Without getting into the vast number of questions that it raises (such as: does every module with startup params now need a userspace counterpart? what about modprobe -r behavior and namespace behavior? why netlink for every kernel module? etc.)...

One module parameter is as much a "distribution management" problem as 10 of them, yes? I hope you see that I don't need that module param and daemon baggage; I can just use sysctl to set up all my params, including one bit for module_can_start_now, to achieve the same thing.

But it is still more than the handful of lines of code in my patch, so it would be nice to understand what the "distribution" issue is.

Stepping back, how do we make sysctl fully namespace friendly?

btw, setting kernel socket keepalive params via sysctl is not a problem to implement at all, if it ever shows up as a requirement for customers.

--Sowmini
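On the keepalive point: rds_tcp_keepalive() (mentioned earlier in the thread) amounts to a handful of kernel_setsockopt() calls on the kernel socket, so backing those values with tunables is mechanically the same exercise as sndbuf/rcvbuf. The sketch below is only an illustration of that idea; the tunable names and defaults are hypothetical, and kernel_setsockopt() is used with the signature it had in kernels of this era, not a claim about the in-tree helper.

#include <linux/net.h>
#include <linux/tcp.h>
#include <net/sock.h>

/* Hypothetical tunables; could be module params, sysctls, or per-net vars. */
static int rds_tcp_keepidle = 5;	/* secs of idle before first probe */
static int rds_tcp_keepcnt = 5;		/* unacked probes before teardown */

static int rds_tcp_set_keepalive(struct socket *sock)
{
	int keepalive = 1;
	int ret;

	ret = kernel_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE,
				(char *)&keepalive, sizeof(keepalive));
	if (ret < 0)
		return ret;

	ret = kernel_setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT,
				(char *)&rds_tcp_keepcnt,
				sizeof(rds_tcp_keepcnt));
	if (ret < 0)
		return ret;

	ret = kernel_setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE,
				(char *)&rds_tcp_keepidle,
				sizeof(rds_tcp_keepidle));
	if (ret < 0)
		return ret;

	/* reuse the idle value as the inter-probe interval in this sketch */
	return kernel_setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL,
				 (char *)&rds_tcp_keepidle,
				 sizeof(rds_tcp_keepidle));
}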
On Fri, Mar 11, 2016 at 8:39 PM, Sowmini Varadhan <sowmini.varadhan@oracle.com> wrote:
> On (03/11/16 20:07), Tom Herbert wrote:
>
>> You are describing your deployment of RDS, not the kernel implementation.
>> What I see is a very rigid implementation that would make it hard for
>> many of us to ever even consider deploying. The ability of applications
>> to tune TCP connections is well understood, very prevalent, and really
>> fundamental to making TCP-based datacenters work at large scale.
>
> Sorry, but historically, OS DDI/DKIs for kernel modules have always
> had some way to set up startup parameters at module load time. And
> clusters have been around for a while.
>
> So let's just focus on the technical question around module config here,
> which involves more than TCP sockets, btw.
>
>> Anyway, it was just an idea... ;-)
>
> Thank you.
>
> Moving on,
>
>> Maybe add one module parameter that indicates the module should just
>> load but not start, configure whatever is needed via netlink, and then
>> send one more netlink command to start operations.
>
> Even that needs an extra daemon.
>
> Without getting into the vast number of questions that it raises (such
> as: does every module with startup params now need a userspace counterpart?
> what about modprobe -r behavior and namespace behavior? why netlink for
> every kernel module? etc.)...
>

Most modules of any significant complexity are managed via netlink, so all of your questions have likely already been answered. Yes, netlink assumes userspace configuration tools ("ip" configures many different networking modules). There are simple interfaces to create/delete a module's netlink hooks when the module is added/removed. Netlink operates in the context of network namespaces. Netlink is preferred since it is far more extensible and generic for configuration than sysctl, module params, etc.

Tom

> One module parameter is as much a "distribution management"
> problem as 10 of them, yes? I hope you see that I don't need that module
> param and daemon baggage; I can just use sysctl to set up all my params,
> including one bit for module_can_start_now, to achieve the same thing.
>
> But it is still more than the handful of lines of code in my patch,
> so it would be nice to understand what the "distribution" issue is.
>
> Stepping back, how do we make sysctl fully namespace friendly?
>
> btw, setting kernel socket keepalive params via sysctl is not a problem
> to implement at all, if it ever shows up as a requirement for customers.
>
> --Sowmini
>
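To show what the netlink route could look like in practice, here is a rough sketch of a generic netlink family with a single "start" command, along the lines suggested above. The family name, command enum, and the idea of a dormant-until-started module are all hypothetical, and the genl_family fields shown follow the current (post-4.10) genetlink API rather than what was in the tree in 2016.

#include <linux/module.h>
#include <net/genetlink.h>

/* Hypothetical command set: module loads dormant, userspace configures
 * it, then sends START to bring up the kernel sockets. */
enum {
	RDS_TCP_CMD_UNSPEC,
	RDS_TCP_CMD_START,
};

static int rds_tcp_genl_start(struct sk_buff *skb, struct genl_info *info)
{
	struct net *net = genl_info_net(info);	/* netns-aware for free */

	/* mark this netns as started and create listen/connect sockets */
	(void)net;
	return 0;
}

static const struct genl_ops rds_tcp_genl_ops[] = {
	{
		.cmd	= RDS_TCP_CMD_START,
		.doit	= rds_tcp_genl_start,
		.flags	= GENL_ADMIN_PERM,
	},
};

static struct genl_family rds_tcp_genl_family = {
	.name		= "rds_tcp",	/* hypothetical family name */
	.version	= 1,
	.netnsok	= true,
	.module		= THIS_MODULE,
	.ops		= rds_tcp_genl_ops,
	.n_ops		= ARRAY_SIZE(rds_tcp_genl_ops),
};

/* module init would then call:
 *	genl_register_family(&rds_tcp_genl_family);
 */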
In any case, to wrap up this thread: I managed to set this up with sysctl. The end result gives me a tunable per netns (which modparam would not), and thanks to the RDS reconnect infra, it can be applied dynamically at any time, not just at startup. And it is more compact than a daemon-y solution. I'll send out the patches later this week after some more cleanup and testing.

However, it would still be nice to know exactly what distribution issues come out of modparam.

--Sowmini
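For the per-netns lifecycle implied above, the usual mechanics are a pernet_operations registration whose init/exit hooks create and tear down the per-namespace sysctl table. A minimal sketch follows, reusing the hypothetical names from the earlier sysctl sketch; it is not the actual patch series being promised here.

#include <linux/sysctl.h>
#include <linux/slab.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>

/* Pairs with the rds_tcp_sysctl_init_net() sketch earlier in the thread. */
static void __net_exit rds_tcp_sysctl_exit_net(struct net *net)
{
	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
	struct ctl_table *tbl = rtn->ctl_hdr->ctl_table_arg;

	unregister_net_sysctl_table(rtn->ctl_hdr);
	kfree(tbl);	/* frees the kmemdup'd per-net copy */
}

static struct pernet_operations rds_tcp_net_ops = {
	.init	= rds_tcp_sysctl_init_net,
	.exit	= rds_tcp_sysctl_exit_net,
	.id	= &rds_tcp_netid,
	.size	= sizeof(struct rds_tcp_net),
};

/* module init:  register_pernet_device(&rds_tcp_net_ops);
 * module exit:  unregister_pernet_device(&rds_tcp_net_ops);
 */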
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Mon, 14 Mar 2016 14:06:42 -0400

> However, it would still be nice to know exactly what distribution
> issues come out of modparam.

Module parameters mean unstable interfaces, and provide inconsistent ways to do the same exact thing from driver to driver and protocol to protocol. Do not use them, ever.
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index ad60299..b59e7a2 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -52,7 +52,13 @@ static LIST_HEAD(rds_tcp_conn_list);
 
 static struct kmem_cache *rds_tcp_conn_slab;
 
-#define RDS_TCP_DEFAULT_BUFSIZE (128 * 1024)
+static int sndbuf_size = 16384;
+module_param(sndbuf_size, int, 0444);
+MODULE_PARM_DESC(sndbuf_size, "SO_SNDBUF size of kernel tcp socket");
+
+static int rcvbuf_size = 87380;
+module_param(rcvbuf_size, int, 0444);
+MODULE_PARM_DESC(rcvbuf_size, "SO_RCVBUF size of kernel tcp socket");
 
 /* doing it this way avoids calling tcp_sk() */
 void rds_tcp_nonagle(struct socket *sock)
@@ -72,7 +78,15 @@ void rds_tcp_nonagle(struct socket *sock)
  */
 void rds_tcp_tune(struct socket *sock)
 {
+	struct sock *sk = sock->sk;
+
 	rds_tcp_nonagle(sock);
+
+	lock_sock(sk);
+	sk->sk_sndbuf = sndbuf_size;
+	sk->sk_rcvbuf = rcvbuf_size;
+	sk->sk_userlocks |= SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK;
+	release_sock(sk);
 }
 
 u32 rds_tcp_snd_nxt(struct rds_tcp_connection *tc)
Some payload sizes/patterns stand to gain performance benefits by tuning the size of the TCP socket buffers, so this commit adds module parameters to customize those values when desired.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/tcp.c | 16 +++++++++++++++-
 1 files changed, 15 insertions(+), 1 deletions(-)
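For reference, with the patch above applied the buffers would be tuned at load time, e.g. "modprobe rds_tcp sndbuf_size=262144 rcvbuf_size=262144" (the values here are arbitrary examples). Because both parameters are declared with mode 0444, the active values can be read back, but not changed, via /sys/module/rds_tcp/parameters/sndbuf_size and /sys/module/rds_tcp/parameters/rcvbuf_size after the module is loaded.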