mbox series

[-next,00/22] remove in-kernel syscall invocations (part 2 == netdev)

Message ID 20180316170614.5392-1-linux@dominikbrodowski.net
Headers show
Series remove in-kernel syscall invocations (part 2 == netdev) | expand

Message

Dominik Brodowski March 16, 2018, 5:05 p.m. UTC
Here is another series of patches which reduce the number of syscall
invocations from within the kernel. This series is focused solely on
the net/ part of the kernel and get rids of syscall and compat_syscall
invocations from within the kernel completely. It is also available at

	https://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux.git syscalls-net-next

The rationale of this change is described in patch 1 of part 1[*] as follows:

	The syscall entry points to the kernel defined by SYSCALL_DEFINEx()
	and COMPAT_SYSCALL_DEFINEx() should only be called from userspace
	through kernel entry points, but not from the kernel itself. This
	will allow cleanups and optimizations to the entry paths *and* to
	the parts of the kernel code which currently need to pretend to be
	userspace in order to make use of syscalls.

At present, these patches are based on v4.16-rc5; there is one trivial
conflict against net-next. Dave, I presume that you prefer to take them
through net-next? If you want to, I can re-base them against net-next.
If you prefer otherwise, though, I can route them as part of my whole
syscall series.

Thanks,
	Dominik

[*] The cover letter for v2 is available at
    http://lkml.kernel.org/r/20180315190529.20943-1-linux@dominikbrodowski.net ;
    the whole patchset -- in its current, slightly modified form -- is available at
    at https://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux.git syscalls-next	

Dominik Brodowski (22):
  net: socket: add __sys_recvfrom() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_sendto() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_accept4() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_socket() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_bind() helper; remove in-kernel call to syscall
  net: socket: add __sys_connect() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_listen() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_getsockname() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_getpeername() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_socketpair() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_shutdown() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_setsockopt() helper; remove in-kernel call to
    syscall
  net: socket: add __sys_getsockopt() helper; remove in-kernel call to
    syscall
  net: socket: add do_sys_recvmmsg() helper; remove in-kernel call to
    syscall
  net: socket: move check for forbid_cmsg_compat to __sys_...msg()
  net: socket: replace calls to sys_send() with __sys_sendto()
  net: socket: replace call to sys_recv() with __sys_recvfrom()
  net: socket: add __compat_sys_recvfrom() helper; remove in-kernel call
    to compat syscall
  net: socket: add __compat_sys_setsockopt() helper; remove in-kernel
    call to compat syscall
  net: socket: add __compat_sys_getsockopt() helper; remove in-kernel
    call to compat syscall
  net: socket: add __compat_sys_recvmmsg() helper; remove in-kernel call
    to compat syscall
  net: socket: add __compat_sys_...msg() helpers; remove in-kernel calls
    to compat syscalls

 include/linux/socket.h |  37 +++++++-
 net/compat.c           | 136 +++++++++++++++++++---------
 net/socket.c           | 234 ++++++++++++++++++++++++++++++++++---------------
 3 files changed, 291 insertions(+), 116 deletions(-)

Comments

David Miller March 16, 2018, 6:30 p.m. UTC | #1
From: Dominik Brodowski <linux@dominikbrodowski.net>
Date: Fri, 16 Mar 2018 18:05:52 +0100

> The rationale of this change is described in patch 1 of part 1[*] as follows:
> 
> 	The syscall entry points to the kernel defined by SYSCALL_DEFINEx()
> 	and COMPAT_SYSCALL_DEFINEx() should only be called from userspace
> 	through kernel entry points, but not from the kernel itself. This
> 	will allow cleanups and optimizations to the entry paths *and* to
> 	the parts of the kernel code which currently need to pretend to be
> 	userspace in order to make use of syscalls.
> 
> At present, these patches are based on v4.16-rc5; there is one trivial
> conflict against net-next. Dave, I presume that you prefer to take them
> through net-next? If you want to, I can re-base them against net-next.
> If you prefer otherwise, though, I can route them as part of my whole
> syscall series.

So the transformations themeselves are relatively trivial, so on that
aspect I don't have any problems with these changes.

But overall I have to wonder.

I imagine one of the things you'd like to do is declare that syscall
entries use a different (better) argument passing scheme.  For
example, passing values in registers instead of on the stack.

But in situations where you split out the system call function
completely into one of these "helpers", the compiler is going
to have two choices:

1) Expand the helper into the syscall function inline, thus we end up
   with two copies of the function.

2) Call the helper from the syscall function.  Well, then the compiler
   will need to pop the syscal obtained arguments from the registers
   onto the stack.

So this doesn't seem like such a total win to me.

Maybe you can explain things better to ease my concerns.

About merging, I'm fine with you taking this via your tree.  I do not
see there being any terribly difficult conflicts arising (famous
last words).

Thanks.
Dominik Brodowski March 16, 2018, 7:39 p.m. UTC | #2
On Fri, Mar 16, 2018 at 02:30:21PM -0400, David Miller wrote:
> From: Dominik Brodowski <linux@dominikbrodowski.net>
> Date: Fri, 16 Mar 2018 18:05:52 +0100
> 
> > The rationale of this change is described in patch 1 of part 1[*] as follows:
> > 
> > 	The syscall entry points to the kernel defined by SYSCALL_DEFINEx()
> > 	and COMPAT_SYSCALL_DEFINEx() should only be called from userspace
> > 	through kernel entry points, but not from the kernel itself. This
> > 	will allow cleanups and optimizations to the entry paths *and* to
> > 	the parts of the kernel code which currently need to pretend to be
> > 	userspace in order to make use of syscalls.
> > 
> > At present, these patches are based on v4.16-rc5; there is one trivial
> > conflict against net-next. Dave, I presume that you prefer to take them
> > through net-next? If you want to, I can re-base them against net-next.
> > If you prefer otherwise, though, I can route them as part of my whole
> > syscall series.
> 
> So the transformations themeselves are relatively trivial, so on that
> aspect I don't have any problems with these changes.

Thank you for your fast feedback.

> But overall I have to wonder.
> 
> I imagine one of the things you'd like to do is declare that syscall
> entries use a different (better) argument passing scheme.  For
> example, passing values in registers instead of on the stack.

Well, sort of. Currently, x86-64 decodes all six registers unconditionally:

		regs->ax = sys_call_table[nr](
			regs->di, regs->si, regs->dx,
			regs->r10, regs->r8, regs->r9);

so that in do_syscall_64(), we have to get six parameters from the
stack:

	mov    0x38(%rbx),%rcx
	mov    0x60(%rbx),%rdx
	mov    0x68(%rbx),%rsi
	mov    0x70(%rbx),%rdi
	mov    0x40(%rbx),%r9
	mov    0x48(%rbx),%r8

Instead, the aim is to do

	regs->ax = sys_call_table[nr](regs)

... which results in just a register rename operation:

	mov    %rbp,%rdi

> But in situations where you split out the system call function
> completely into one of these "helpers", the compiler is going
> to have two choices:
> 
> 1) Expand the helper into the syscall function inline, thus we end up
>    with two copies of the function.

That's only sensible for very short stubs, which just call another function
(e.g. __compat_sys_sendmsg()).

> 2) Call the helper from the syscall function.  Well, then the compiler
>    will need to pop the syscal obtained arguments from the registers
>    onto the stack.
> 
> So this doesn't seem like such a total win to me.
> 
> Maybe you can explain things better to ease my concerns.

For example, for sys_recv() and sys_recvfrom(), if all is complete, this
results in:

sys_x86_64_recv:
	callq <__fentry__>
	/* decode struct pt_regs for exactly those parameters
	 * we care about
	 */
	mov    0x38(%rdi),%rcx
	xor    %r9d,%r9d
	xor    %r8d,%r8d
	mov    0x60(%rdi),%rdx
	mov    0x68(%rdi),%rsi
	mov    0x70(%rdi),%rdi

	/* call __sys_recvfrom */
	callq  <__sys_recvfrom>

	/* cleanup and return */
	cltq
	retq

That's only obtaining four entries from the stack, and two register clearing
operations; sys_x86_64_recvfrom is similar (6 movs from stack, one register
rename mov, no xor).

__sys_recvfrom() then does the actual work, starting with pushing some
register contect out of the way and moving registers around, more or less
what SyS_recvfrom() does today.

So the result is nothing spectacular or unusual, but pretty equivalent and
possibly even shorter compared to current codepath.

Thanks,
	Dominik
Linus Torvalds March 16, 2018, 7:42 p.m. UTC | #3
On Fri, Mar 16, 2018 at 11:30 AM, David Miller <davem@davemloft.net> wrote:
>
> I imagine one of the things you'd like to do is declare that syscall
> entries use a different (better) argument passing scheme.  For
> example, passing values in registers instead of on the stack.

Actually, it's almost exactly the reverse.

On x86-64, we'd like to just pass the 'struct pt_regs *' pointer, and
have the sys_xyz() function itself just pick out the arguments it
needs from there.

That has a few reasons for it:

 - we can clear all registers at system call entry, which helps defeat
some of the "pass seldom used register with user-controlled value that
survives deep into the callchain" things that people used to leak
information

 - we can streamline the low-level system call code, which needs to
pass around 'struct pt_regs *' anyway, and the system call only picks
up the values it actually needs

 - it's really quite easy(*) to just make the SYSCALL_DEFINEx() macros
just do it all with a wrapper inline function

but it fundamentally means that you *cannot* call 'sys_xyz()' from
within the kernel, unless you then do it with something crazy like

    struct pt_regs myregs;
    ... fill in the right registers for this architecture _if_ this
architecture uses ptregs ..
    sys_xyz(&regs);

which I somehow really doubt you want to do in the networking code.

Now, I did do one version that just created two entrypoints for every
single system call - the "kernel version" and the "real" system call
version. That sucks, because you have two choices:

 - either pointlessly generate extra code for the 200+ system calls
that are *not* used by the kernel

 - or let gcc just merge the two, and make code generation suck where
the real system call just loads the registers and jumps to the common
code.

That second option really does suck, because if you let the compiler
just generate the _single_ system call, it will do the "load actual
value from ptregs" much more nicely, and only when it needs it, and
schedules it all into the system call code.

So just making the rule be: "you mustn't call the SYSCALL_DEFINEx()
functions from anything but the system call code" really makes
everything better.

Then you only need to fix up the *handful* of so system calls that
actually have in-kernel callers.

Many of them end up being things that could be improved on further
anyway (ie there's discussion about further cleanup and trying to
avoid using "set_fs()" for arguments etc, because there already exists
helper functions that take the kernel-space versions, and the
sys_xyz() version is actually just going through stupid extra work for
a kernel user).

                    Linus

(*) The "really quite easy" is only true on 64-bit architectures.
32-bit architectures have issues with packing 64-bit values into two
registers, so using macro expansion with just the number of arguments
doesn't work.