mbox series

[RFC,0/9] patchable function pointers for pluggable crypto routines

Message ID 20181005081333.15018-1-ard.biesheuvel@linaro.org (mailing list archive)
Headers show
Series patchable function pointers for pluggable crypto routines | expand

Message

Ard Biesheuvel Oct. 5, 2018, 8:13 a.m. UTC
Linux's crypto API is widely regarded as being difficult to use for cases
where support for asynchronous accelerators or runtime dispatch of algorithms
(i.e., passed in as a string) are not needed. This leads to kludgy code and
also to actual security issues [0], although arguably, using AES in the wrong
mode can be done using any kind of crypto abstraction.

At the moment, the Zinc library [1] is being proposed as a solution for that,
and while it does address the usability problems, it does a lot more than
that, and what we end up with is a lot less flexible than what we have now.

Currently, arch specific implementations (based on SIMD or optimized
assembler) live in arch/*/crypto, where each sub-community is in charge
of its own specialized versions of the various primitives (although still
under the purview of the crypto maintainer). Any proposal to change this
model should be judged on its own merit, and not blindly accepted as the
fallout of cleaning up some library code.

Also, Zinc removes the possibility to plug different versions of a routine,
and instead, keeps all versions in the same module. Currently, the kernel's
module support permits user land to take charge of the policies that decide
which version to use in which context (by loading or blacklisting modules).
Instead, Zinc moves to a model where the policy is hardcoded into the module
(e.g., ChaCha20 on ARM uses NEON unless on Cortex-A5 or A7). In the ARM
community, we have always attempted to avoid hardcoding policy like that:
the ARM SoC space is a very heteregeneous one, and putting policy like that
into the code will lead to an endless stream of patches from silicon vendors
that want tweaks for their cores into various parts of the kernel.

Creating monolithic modules for algorithms also makes it very difficult to
support driver based implementations. The kernel's driver model is heavily
based on modules, using alias strings to match drivers against the hardware.
As an example, there ARM ST platforms that support synchronous hardware based
CRC32, and plugging that into the monolithic Zinc model is difficult if not
impossible.

The primary justification for moving away from the module system is that it
depends heavily on function pointers, and in the post-Spectre world, those
are vulnerable and costly. For this reason, this series proposes a patchable
function pointer abstraction that, in its generic incarnation, consists
simply of function pointers and some plumbing to set or reset them (patch #1)

Patch #2 illustrates how architectures could optimize these abstractions by
using code patching techniques to get rid of the indirect calls, and use
fixed jumps instead. This has the side effect of making the function
pointer const, making it more robust as well.

The remainder of the series shows how we can use this abstraction to clean
up the CRC-T10DIF handling in the kernel, which is a pretty depressing
example of how cumbersome it is to expose crypto API algorithms as library
routines.

Note that the various arch specific implementations are kept in their
original place as modules, which can be automatically dispatched by udev
(as before) based on CPU feature bits.

Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org

[0] commit 428490e38b2e ("security/keys: rewrite all of big_key crypto")
[1] https://lore.kernel.org/lkml/20180925145622.29959-3-Jason@zx2c4.com/

Ard Biesheuvel (9):
  kernel: add support for patchable function pointers
  arm64: kernel: add arch support for patchable function pointers
  crypto: crc-t10dif - make crc_t10dif a static inline
  crypto: crc-t10dif - use patchable function pointer for core update
    routine
  crypto: crc-t10dif/arm64 - move PMULL based code into core library
  crypto: crc-t10dif/arm - move PMULL based code into core library
  crypto: crct10dif/generic - switch crypto API driver to core library
  crypto: crc-t10dif/powerpc - move PMULL based code into core library
  crypto: crc-t10dif/x86 - move PMULL based code into core library

 arch/Kconfig                                |   3 +
 arch/arm/crypto/crct10dif-ce-glue.c         |  78 +++-------
 arch/arm64/Kconfig                          |   1 +
 arch/arm64/crypto/crct10dif-ce-glue.c       |  61 ++------
 arch/arm64/include/asm/ffp.h                |  35 +++++
 arch/arm64/kernel/insn.c                    |  22 +++
 arch/powerpc/crypto/crct10dif-vpmsum_glue.c |  55 +-------
 arch/x86/crypto/crct10dif-pclmul_glue.c     |  98 ++-----------
 crypto/Kconfig                              |   1 +
 crypto/Makefile                             |   2 +-
 crypto/crct10dif_common.c                   |  82 -----------
 crypto/crct10dif_generic.c                  |   4 +-
 include/linux/crc-t10dif.h                  |  24 +++-
 include/linux/ffp.h                         |  43 ++++++
 lib/Kconfig                                 |   2 -
 lib/crc-t10dif.c                            | 149 +++++++-------------
 16 files changed, 235 insertions(+), 425 deletions(-)
 create mode 100644 arch/arm64/include/asm/ffp.h
 delete mode 100644 crypto/crct10dif_common.c
 create mode 100644 include/linux/ffp.h

Comments

Jason A. Donenfeld Oct. 5, 2018, 1:37 p.m. UTC | #1
Hi Ard,

On Fri, Oct 5, 2018 at 10:13 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> At the moment, the Zinc library [1] is being proposed as a solution for that,
> and while it does address the usability problems, it does a lot more than
> that, and what we end up with is a lot less flexible than what we have now.
>
> Currently, arch specific implementations (based on SIMD or optimized
> assembler) live in arch/*/crypto, where each sub-community is in charge
> of its own specialized versions of the various primitives (although still
> under the purview of the crypto maintainer). Any proposal to change this
> model should be judged on its own merit, and not blindly accepted as the
> fallout of cleaning up some library code.
>
> Also, Zinc removes the possibility to plug different versions of a routine,
> and instead, keeps all versions in the same module. Currently, the kernel's
> module support permits user land to take charge of the policies that decide
> which version to use in which context (by loading or blacklisting modules).

I think this explanation misunderstands many of the design goals of Zinc and
also points to why going your direction instead is a bad idea that will cause
even more problems down the road.

Zinc does several important things:

- Introduces direct C function calls throughout, as a central way of being
  implemented and as a central way of being used by consumers of the
  API. This has various obvious benefits for the consumers of the API,
  but it also has big benefits for the developers of the library as
  well. Namely, it keeps the relationship between different parts
  extremely clear and direct. It's an explicit choice for simplicity.
  And by being the simpler and more direct solution, it also gives gcc
  an important opportunity to optimize and inline.

- Reorganizes crypto routines so that they're grouped together by
  primitive. This again leads to a much simpler design and layout,
  making it more obvious what's happening, and making things generally
  cleaner. It is not only useful and clearer for developers, but it
  also makes contributors and auditors more easily aware of what
  implementations are available.

- Has higher standards for code and implementations that are introduced.
  Zinc prefers code that has been formally verified, that has been in
  widespread usage and has received many eyeballs and fuzzing hours,
  that has been fuzzed extensively, that is simple in design, that comes
  from well-known high-quality authors -- in roughly but not precisely
  that order of preference. That's a bit different from the current
  practices of the existing crypto API.

- Has a simpler mechanism that is just as effective for choosing
  available implementations. This is, again, more obvious and direct
  than the present crypto API's module approach, leads to smaller code
  size, and has the potential of being just as flexible with the
  inevitable desire for nobs, adjustable from userspace, from
  kernelspace, or from elsewhere.

- Is designed to promote collaboration with the larger cryptography
  community and with academia, which will yield better implementations
  and for assurance.

- Can easily be extracted to userspace libraries (perhaps a future
  libzinc could be easily based on it), which makes testing and fuzzing
  using tools like libfuzzer and afl more accessible.

- Has faster implementations than the current crypto API.

- Has, again, a very strong focus on being simple and minimal, as
  opposed to bloated and complicated, so that it's actually possible to
  understand and audit the library.

Therefore, I think this patch goes in exactly the wrong direction. I
mean, if you want to introduce dynamic patching as a means for making
the crypto API's dynamic dispatch stuff not as slow in a post-spectre
world, sure, go for it; that may very well be a good idea. But
presenting it as an alternative to Zinc very widely misses the point and
serves to prolong a series of bad design choices, which are now able to
be rectified by putting energy into Zinc instead.

Jason
Ard Biesheuvel Oct. 5, 2018, 5:15 p.m. UTC | #2
On 5 October 2018 at 15:37, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
...
> Therefore, I think this patch goes in exactly the wrong direction. I
> mean, if you want to introduce dynamic patching as a means for making
> the crypto API's dynamic dispatch stuff not as slow in a post-spectre
> world, sure, go for it; that may very well be a good idea. But
> presenting it as an alternative to Zinc very widely misses the point and
> serves to prolong a series of bad design choices, which are now able to
> be rectified by putting energy into Zinc instead.
>

This series has nothing to do with dynamic dispatch: the call sites
call crypto functions using ordinary function calls (although my
example uses CRC-T10DIF), and these calls are redirected via what is
essentially a PLT entry, so that we can supsersede those routines at
runtime.
Andy Lutomirski Oct. 5, 2018, 5:26 p.m. UTC | #3
On Fri, Oct 5, 2018 at 10:15 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On 5 October 2018 at 15:37, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> ...
> > Therefore, I think this patch goes in exactly the wrong direction. I
> > mean, if you want to introduce dynamic patching as a means for making
> > the crypto API's dynamic dispatch stuff not as slow in a post-spectre
> > world, sure, go for it; that may very well be a good idea. But
> > presenting it as an alternative to Zinc very widely misses the point and
> > serves to prolong a series of bad design choices, which are now able to
> > be rectified by putting energy into Zinc instead.
> >
>
> This series has nothing to do with dynamic dispatch: the call sites
> call crypto functions using ordinary function calls (although my
> example uses CRC-T10DIF), and these calls are redirected via what is
> essentially a PLT entry, so that we can supsersede those routines at
> runtime.

If you really want to do it PLT-style, then just do:

extern void whatever_func(args);

Call it like:
whatever_func(args here);

And rig up something to emit asm like:

GLOBAL(whatever_func)
  jmpq default_whatever_func
ENDPROC(whatever_func)

Architectures without support can instead do:

void whatever_func(args)
{
  READ_ONCE(patchable_function_struct_for_whatever_func->ptr)(args);
}

and patch the asm function for basic support.  It will be slower than
necessary, but maybe the relocation trick could be used on top of this
to redirect the call to whatever_func directly to the target for
architectures that want to squeeze out the last bit of performance.
This might actually be the best of all worlds: easy implementation on
all architectures, no inline asm, and the totally non-magical version
works with okay performance.

(Is this what your code is doing?  I admit I didn't follow all the way
through all the macros.)
Ard Biesheuvel Oct. 5, 2018, 5:28 p.m. UTC | #4
On 5 October 2018 at 19:26, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Oct 5, 2018 at 10:15 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>>
>> On 5 October 2018 at 15:37, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>> ...
>> > Therefore, I think this patch goes in exactly the wrong direction. I
>> > mean, if you want to introduce dynamic patching as a means for making
>> > the crypto API's dynamic dispatch stuff not as slow in a post-spectre
>> > world, sure, go for it; that may very well be a good idea. But
>> > presenting it as an alternative to Zinc very widely misses the point and
>> > serves to prolong a series of bad design choices, which are now able to
>> > be rectified by putting energy into Zinc instead.
>> >
>>
>> This series has nothing to do with dynamic dispatch: the call sites
>> call crypto functions using ordinary function calls (although my
>> example uses CRC-T10DIF), and these calls are redirected via what is
>> essentially a PLT entry, so that we can supsersede those routines at
>> runtime.
>
> If you really want to do it PLT-style, then just do:
>
> extern void whatever_func(args);
>
> Call it like:
> whatever_func(args here);
>
> And rig up something to emit asm like:
>
> GLOBAL(whatever_func)
>   jmpq default_whatever_func
> ENDPROC(whatever_func)
>
> Architectures without support can instead do:
>
> void whatever_func(args)
> {
>   READ_ONCE(patchable_function_struct_for_whatever_func->ptr)(args);
> }
>
> and patch the asm function for basic support.  It will be slower than
> necessary, but maybe the relocation trick could be used on top of this
> to redirect the call to whatever_func directly to the target for
> architectures that want to squeeze out the last bit of performance.
> This might actually be the best of all worlds: easy implementation on
> all architectures, no inline asm, and the totally non-magical version
> works with okay performance.
>
> (Is this what your code is doing?  I admit I didn't follow all the way
> through all the macros.)

Basically
Andy Lutomirski Oct. 5, 2018, 6 p.m. UTC | #5
On Fri, Oct 5, 2018 at 10:28 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On 5 October 2018 at 19:26, Andy Lutomirski <luto@kernel.org> wrote:
> > On Fri, Oct 5, 2018 at 10:15 AM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> >>
> >> On 5 October 2018 at 15:37, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >> ...
> >> > Therefore, I think this patch goes in exactly the wrong direction. I
> >> > mean, if you want to introduce dynamic patching as a means for making
> >> > the crypto API's dynamic dispatch stuff not as slow in a post-spectre
> >> > world, sure, go for it; that may very well be a good idea. But
> >> > presenting it as an alternative to Zinc very widely misses the point and
> >> > serves to prolong a series of bad design choices, which are now able to
> >> > be rectified by putting energy into Zinc instead.
> >> >
> >>
> >> This series has nothing to do with dynamic dispatch: the call sites
> >> call crypto functions using ordinary function calls (although my
> >> example uses CRC-T10DIF), and these calls are redirected via what is
> >> essentially a PLT entry, so that we can supsersede those routines at
> >> runtime.
> >
> > If you really want to do it PLT-style, then just do:
> >
> > extern void whatever_func(args);
> >
> > Call it like:
> > whatever_func(args here);
> >
> > And rig up something to emit asm like:
> >
> > GLOBAL(whatever_func)
> >   jmpq default_whatever_func
> > ENDPROC(whatever_func)
> >
> > Architectures without support can instead do:
> >
> > void whatever_func(args)
> > {
> >   READ_ONCE(patchable_function_struct_for_whatever_func->ptr)(args);
> > }
> >
> > and patch the asm function for basic support.  It will be slower than
> > necessary, but maybe the relocation trick could be used on top of this
> > to redirect the call to whatever_func directly to the target for
> > architectures that want to squeeze out the last bit of performance.
> > This might actually be the best of all worlds: easy implementation on
> > all architectures, no inline asm, and the totally non-magical version
> > works with okay performance.
> >
> > (Is this what your code is doing?  I admit I didn't follow all the way
> > through all the macros.)
>
> Basically

Adding Josh Poimboeuf.

Here's a sketch of how this could work for better performance.  For a
static call "foo" that returns void and takes no arguments, the
generic implementation does something like this:

extern void foo(void);

struct static_call {
  void (*target)(void);

  /* arch-specific part containing an array of struct static_call_site */
};

void foo(void)
{
  READ_ONCE(__static_call_foo->target)();
}

Arch code overrides it to:

GLOBAL(foo)
  jmpq *__static_call_foo(%rip)
ENDPROC(foo)

and some extra asm to emit a static_call_site object saying that the
address "foo" is a jmp/call instruction where the operand is at offset
1 into the instruction.  (Or whatever the offset is.)

The patch code is like:

void set_static_call(struct static_call *call, void *target)
{
  /* take a spinlock? */
  WRITE_ONCE(call->target, target);
  arch_set_static_call(call, target);
}

and the arch code patches the call site if needed.

On x86, an even better implementation would have objtool make a bunch
of additional static_call_site objects for each call to foo, and
arch_set_static_call() would update all of them, too.  Using
text_poke_bp() if needed, and "if needed" can maybe be clever and
check the alignment of the instruction.  I admit that I never actually
remember the full rules for atomically patching an instruction on x86
SMP.  (Hmm.  This will be really epically slow.  Maybe we don't care.
Or we could finally optimize text_poke, etc to take a list of pokes to
do and do them as a batch.  But that's not a prerequisite for the rest
of this.)

What do you all think?