Message ID | 20180316170614.5392-1-linux@dominikbrodowski.net |
---|---|
Headers | show |
Series | remove in-kernel syscall invocations (part 2 == netdev) | expand |
From: Dominik Brodowski <linux@dominikbrodowski.net> Date: Fri, 16 Mar 2018 18:05:52 +0100 > The rationale of this change is described in patch 1 of part 1[*] as follows: > > The syscall entry points to the kernel defined by SYSCALL_DEFINEx() > and COMPAT_SYSCALL_DEFINEx() should only be called from userspace > through kernel entry points, but not from the kernel itself. This > will allow cleanups and optimizations to the entry paths *and* to > the parts of the kernel code which currently need to pretend to be > userspace in order to make use of syscalls. > > At present, these patches are based on v4.16-rc5; there is one trivial > conflict against net-next. Dave, I presume that you prefer to take them > through net-next? If you want to, I can re-base them against net-next. > If you prefer otherwise, though, I can route them as part of my whole > syscall series. So the transformations themeselves are relatively trivial, so on that aspect I don't have any problems with these changes. But overall I have to wonder. I imagine one of the things you'd like to do is declare that syscall entries use a different (better) argument passing scheme. For example, passing values in registers instead of on the stack. But in situations where you split out the system call function completely into one of these "helpers", the compiler is going to have two choices: 1) Expand the helper into the syscall function inline, thus we end up with two copies of the function. 2) Call the helper from the syscall function. Well, then the compiler will need to pop the syscal obtained arguments from the registers onto the stack. So this doesn't seem like such a total win to me. Maybe you can explain things better to ease my concerns. About merging, I'm fine with you taking this via your tree. I do not see there being any terribly difficult conflicts arising (famous last words). Thanks.
On Fri, Mar 16, 2018 at 02:30:21PM -0400, David Miller wrote: > From: Dominik Brodowski <linux@dominikbrodowski.net> > Date: Fri, 16 Mar 2018 18:05:52 +0100 > > > The rationale of this change is described in patch 1 of part 1[*] as follows: > > > > The syscall entry points to the kernel defined by SYSCALL_DEFINEx() > > and COMPAT_SYSCALL_DEFINEx() should only be called from userspace > > through kernel entry points, but not from the kernel itself. This > > will allow cleanups and optimizations to the entry paths *and* to > > the parts of the kernel code which currently need to pretend to be > > userspace in order to make use of syscalls. > > > > At present, these patches are based on v4.16-rc5; there is one trivial > > conflict against net-next. Dave, I presume that you prefer to take them > > through net-next? If you want to, I can re-base them against net-next. > > If you prefer otherwise, though, I can route them as part of my whole > > syscall series. > > So the transformations themeselves are relatively trivial, so on that > aspect I don't have any problems with these changes. Thank you for your fast feedback. > But overall I have to wonder. > > I imagine one of the things you'd like to do is declare that syscall > entries use a different (better) argument passing scheme. For > example, passing values in registers instead of on the stack. Well, sort of. Currently, x86-64 decodes all six registers unconditionally: regs->ax = sys_call_table[nr]( regs->di, regs->si, regs->dx, regs->r10, regs->r8, regs->r9); so that in do_syscall_64(), we have to get six parameters from the stack: mov 0x38(%rbx),%rcx mov 0x60(%rbx),%rdx mov 0x68(%rbx),%rsi mov 0x70(%rbx),%rdi mov 0x40(%rbx),%r9 mov 0x48(%rbx),%r8 Instead, the aim is to do regs->ax = sys_call_table[nr](regs) ... which results in just a register rename operation: mov %rbp,%rdi > But in situations where you split out the system call function > completely into one of these "helpers", the compiler is going > to have two choices: > > 1) Expand the helper into the syscall function inline, thus we end up > with two copies of the function. That's only sensible for very short stubs, which just call another function (e.g. __compat_sys_sendmsg()). > 2) Call the helper from the syscall function. Well, then the compiler > will need to pop the syscal obtained arguments from the registers > onto the stack. > > So this doesn't seem like such a total win to me. > > Maybe you can explain things better to ease my concerns. For example, for sys_recv() and sys_recvfrom(), if all is complete, this results in: sys_x86_64_recv: callq <__fentry__> /* decode struct pt_regs for exactly those parameters * we care about */ mov 0x38(%rdi),%rcx xor %r9d,%r9d xor %r8d,%r8d mov 0x60(%rdi),%rdx mov 0x68(%rdi),%rsi mov 0x70(%rdi),%rdi /* call __sys_recvfrom */ callq <__sys_recvfrom> /* cleanup and return */ cltq retq That's only obtaining four entries from the stack, and two register clearing operations; sys_x86_64_recvfrom is similar (6 movs from stack, one register rename mov, no xor). __sys_recvfrom() then does the actual work, starting with pushing some register contect out of the way and moving registers around, more or less what SyS_recvfrom() does today. So the result is nothing spectacular or unusual, but pretty equivalent and possibly even shorter compared to current codepath. Thanks, Dominik
On Fri, Mar 16, 2018 at 11:30 AM, David Miller <davem@davemloft.net> wrote: > > I imagine one of the things you'd like to do is declare that syscall > entries use a different (better) argument passing scheme. For > example, passing values in registers instead of on the stack. Actually, it's almost exactly the reverse. On x86-64, we'd like to just pass the 'struct pt_regs *' pointer, and have the sys_xyz() function itself just pick out the arguments it needs from there. That has a few reasons for it: - we can clear all registers at system call entry, which helps defeat some of the "pass seldom used register with user-controlled value that survives deep into the callchain" things that people used to leak information - we can streamline the low-level system call code, which needs to pass around 'struct pt_regs *' anyway, and the system call only picks up the values it actually needs - it's really quite easy(*) to just make the SYSCALL_DEFINEx() macros just do it all with a wrapper inline function but it fundamentally means that you *cannot* call 'sys_xyz()' from within the kernel, unless you then do it with something crazy like struct pt_regs myregs; ... fill in the right registers for this architecture _if_ this architecture uses ptregs .. sys_xyz(®s); which I somehow really doubt you want to do in the networking code. Now, I did do one version that just created two entrypoints for every single system call - the "kernel version" and the "real" system call version. That sucks, because you have two choices: - either pointlessly generate extra code for the 200+ system calls that are *not* used by the kernel - or let gcc just merge the two, and make code generation suck where the real system call just loads the registers and jumps to the common code. That second option really does suck, because if you let the compiler just generate the _single_ system call, it will do the "load actual value from ptregs" much more nicely, and only when it needs it, and schedules it all into the system call code. So just making the rule be: "you mustn't call the SYSCALL_DEFINEx() functions from anything but the system call code" really makes everything better. Then you only need to fix up the *handful* of so system calls that actually have in-kernel callers. Many of them end up being things that could be improved on further anyway (ie there's discussion about further cleanup and trying to avoid using "set_fs()" for arguments etc, because there already exists helper functions that take the kernel-space versions, and the sys_xyz() version is actually just going through stupid extra work for a kernel user). Linus (*) The "really quite easy" is only true on 64-bit architectures. 32-bit architectures have issues with packing 64-bit values into two registers, so using macro expansion with just the number of arguments doesn't work.