diff mbox series

[RFC,v4,1/1] random: WARN on large getrandom() waits and introduce getrandom2()

Message ID 20190918211713.GA2225@darwi-home-pc
State Not Applicable
Headers show
Series random: WARN on large getrandom() waits and introduce getrandom2() | expand

Commit Message

Ahmed S. Darwish Sept. 18, 2019, 9:17 p.m. UTC
Since Linux v3.17, getrandom(2) has been created as a new and more
secure interface for pseudorandom data requests.  It attempted to
solve three problems, as compared to /dev/urandom:

  1. the need to access filesystem paths, which can fail, e.g. under a
     chroot

  2. the need to open a file descriptor, which can fail under file
     descriptor exhaustion attacks

  3. the possibility of getting not-so-random data from /dev/urandom,
     due to an incompletely initialized kernel entropy pool

To solve the third point, getrandom(2) was made to block until a
proper amount of entropy has been accumulated to initialize the
CHACHA20 cipher.  This basically made the system call have no
guaranteed upper-bound for its initial waiting time.

Thus when it was introduced at c6e9d6f38894 ("random: introduce
getrandom(2) system call"), it came with a clear warning: "Any
userspace program which uses this new functionality must take care to
assure that if it is used during the boot process, that it will not
cause the init scripts or other portions of the system startup to hang
indefinitely."

Unfortunately, due to multiple factors, including not having this
warning written in a scary-enough language in the manpages, and due to
glibc since v2.25 implementing a BSD-like getentropy(3) in terms of
getrandom(2), modern user-space is calling getrandom(2) in the boot
path everywhere.

Embedded Linux systems were first hit by this, and reports of embedded
systems "getting stuck at boot" began to be common.  Over time, the
issue began to even creep into consumer-level x86 laptops: mainstream
distributions, like Debian Buster, began to recommend installing
haveged as a duct-tape workaround... just to let the system boot. (!)

Moreover, filesystem optimizations in EXT4 and XFS, e.g. b03755ad6f33
("ext4: make __ext4_get_inode_loc plug"), which merged directory
lookup code inode table IO, and very fast systemd boots, further
exaggerated the problem by limiting interrupt-based entropy sources.
This led to large delays until the kernel's cryptographic random
number generator (CRNG) got initialized.

Mitigate the problem, as a first step, in two ways:

  1. Issue a big WARN_ON when any process gets stuck on getrandom(2)
     for more than CONFIG_GETRANDOM_WAIT_THRESHOLD_SEC seconds.

  2. Introduce the new getrandom2(2) system call, with clear semantics
     that can guide user-space in doing the right thing.

On the author's Thinkpad E480 x86 laptop and an ArchLinux user-space,
the ext4 commit earlier mentioned reliably blocked the system on GDM
gnome-session boot. Complain loudly through a WARN_ON if processes
get stuck on getrandom(2). Beside its obvious informational purposes,
the WARN_ON also reliably gets the system unstuck.

Set CONFIG_GETRANDOM_WAIT_THRESHOLD_SEC to a heuristic 30-second
default value. We __deeply encourage__ system integrators and
distribution builders not to increase it much: during system boot, you
either have entropy, or you don't. And if you didn't have entropy, it
will stay like this forever, because if you had, you wouldn't have
blocked in the first place. It's an atomic "either/or" situation, with
no middle ground. Please think twice.

For the new getrandom2(2) system call, it tries to avoid the problems
introduced by its earlier siblings. As Linus mentioned several times
in the bug report thread, Linux should have never provided the
"/dev/random" and "getrandom(GRND_RANDOM)" APIs. These interfaces are
broken by design due to their almost-permanent blockage, leading to
the current misuse of /dev/urandom and getrandom(flags=0) calls. Thus
for getrandom2, introduce the flags:

  1. GRND2_SECURE_UNBOUNDED_INITIAL_WAIT
  2. GRND2_INSECURE

where both extract randomness __exclusively__ from the urandom source.
Due to the clear nature of its new GRND2_* flags, the getrandom2()
system call will never issue any warnings on the kernel log.

OpenBSD, to its credit, got that correctly from the start by making
both of /dev/random and /dev/urandom equivalent.

Rreported-by: Ahmed S. Darwish <darwish.07@gmail.com>
Link: https://lkml.kernel.org/r/20190910042107.GA1517@darwi-home-pc
Link: https://lkml.kernel.org/r/20190912034421.GA2085@darwi-home-pc
Link: https://lkml.kernel.org/r/20190914222432.GC19710@mit.edu
Link: https://lkml.kernel.org/r/20180514003034.GI14763@thunk.org
Link: https://lkml.kernel.org/r/CAHk-=wjyH910+JRBdZf_Y9G54c1M=LBF8NKXB6vJcm9XjLnRfg@mail.gmail.com
Link: https://lkml.kernel.org/r/20190917052438.GA26923@1wt.eu
Link: https://lkml.kernel.org/r/20190917160844.GC31567@gardel-login
Link: https://lkml.kernel.org/r/CAHk-=wjABG3+daJFr4w3a+OWuraVcZpi=SMUg=pnZ+7+O0E2FA@mail.gmail.com
Link: https://lkml.kernel.org/r/CAHk-=wjQeiYu8Q_wcMgM-nAcW7KsBfG1+90DaTD5WF2cCeGCgA@mail.gmail.com
Link: https://factorable.net ("Widespread Weak Keys in Network Devices")
Link: https://man.openbsd.org/man4/random.4
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
---
 drivers/char/Kconfig        | 60 ++++++++++++++++++++++++--
 drivers/char/random.c       | 85 ++++++++++++++++++++++++++++++++-----
 include/uapi/linux/random.h | 20 +++++++--
 3 files changed, 148 insertions(+), 17 deletions(-)

Comments

Linus Torvalds Sept. 18, 2019, 11:57 p.m. UTC | #1
On Wed, Sep 18, 2019 at 2:17 PM Ahmed S. Darwish <darwish.07@gmail.com> wrote:
>
> Since Linux v3.17, getrandom(2) has been created as a new and more
> secure interface for pseudorandom data requests.  It attempted to
> solve three problems, as compared to /dev/urandom:

I don't think your patch is really _wrong_, but I think it's silly to
introduce a new system call, when we have 30 bits left in the flags of
the old one, and the old system call checked them.

So it's much simpler and more straightforward to  just introduce a
single new bit #2 that says "I actually know what I'm doing, and I'm
explicitly asking for secure/insecure random data".

And then say that the existing bit #1 just means "I want to wait for entropy".

So then you end up with this:

    /*
     * Flags for getrandom(2)
     *
     * GRND_NONBLOCK    Don't block and return EAGAIN instead
     * GRND_WAIT_ENTROPY        Explicitly wait for entropy
     * GRND_EXPLICIT    Make it clear you know what you are doing
     */
    #define GRND_NONBLOCK               0x0001
    #define GRND_WAIT_ENTROPY   0x0002
    #define GRND_EXPLICIT               0x0004

    #define GRND_SECURE (GRND_EXPLICIT | GRND_WAIT_ENTROPY)
    #define GRND_INSECURE       (GRND_EXPLICIT | GRND_NONBLOCK)

    /* Nobody wants /dev/random behavior, nobody should use it */
    #define GRND_RANDOM 0x0002

which is actually fairly easy to understand. So now we have three
bits, and the values are:

 000  - ambiguous "secure or just lazy/ignorant"
 001 - -EAGAIN or secure
 010 - blocking /dev/random DO NOT USE
 011 - nonblocking /dev/random DO NOT USE
 100 - nonsense, returns -EINVAL
 101 - /dev/urandom without warnings
 110 - blocking secure
 111 - -EAGAIN or secure

and people would be encouraged to use one of these three:

 - GRND_INSECURE
 - GRND_SECURE
 - GRND_SECURE | GRND_NONBLOCK

all of which actually make sense, and none of which have any
ambiguity. And while "GRND_INSECURE | GRND_NONBLOCK" works, it's
exactly the same as just plain GRND_INSECURE - the point is that it
doesn't block for entropy anyway, so non-blocking makes no different.

NOTE! This patch looks bigger than it really is. I've changed the
if-statement in getrandom() to a switch-statement, and I did this:

-       if (count > INT_MAX)
-               count = INT_MAX;
+       count = min_t(size_t, count, INT_MAX >> (ENTROPY_SHIFT + 3));

to match what "urandom_read()" already did. That changes the semantics
a bit, but only for the /dev/random case, and only for insanity (the
limit we truncate to is now 32MB read, rather than 2GB - and we
already had that limit for urandom).

There is *one* other small semantic change: The old code did
urandom_read() which added warnings, but each warning also _reset_ the
crng_init_cnt. Until it decided not to warn any more, at which point
it also stops that resetting of crng_init_cnt.

And that reset of crng_init_cnt, btw, is some cray cray.

It's basically a "we used up entropy" thing, which is very
questionable to begin with as the whole discussion has shown, but
since it stops doing it after 10 cases, it's not even good security
assuming the "use up entropy" case makes sense in the first place.

So I didn't copy that insanity either. And I'm wondering if removing
it from /dev/urandom might also end up helping Ahmed's case of getting
entropy earlier, when we don't reset the counter.

But other than those two details, none of the existing semantics
changed, we just added the three actually _sane_ cases without any
ambiguity.

In particular, this still leaves the semantics of that nasty
"getrandom(0)" as the same "blocking urandom" that it currently is.
But now it's a separate case, and we can make that perhaps do the
timeout, or at least the warning.

And the new cases are defined to *not* warn. In particular,
GRND_INSECURE very much does *not* warn about early urandom access
when crng isn't ready. Because the whole point of that new mode is
that the user knows it isn't secure.

So that should make getrandom(GRND_INSECURE) palatable to the systemd
kind of use that wanted to avoid the pointless kernel warning.

And we could mark this for stable and try to get it backported so that
it will have better coverage, and encourage people to use the new sane
_explicit_ waiting (or not) for entropy.

Comments? Full patch as attachment.

                  Linus
Theodore Ts'o Sept. 19, 2019, 2:34 p.m. UTC | #2
(Adding linux-api since this patch proposes an API change; both by
changing the existing behavior, and adding new flags and possibly a
new system call.)

On Wed, Sep 18, 2019 at 04:57:58PM -0700, Linus Torvalds wrote:
> On Wed, Sep 18, 2019 at 2:17 PM Ahmed S. Darwish <darwish.07@gmail.com> wrote:
> >
> > Since Linux v3.17, getrandom(2) has been created as a new and more
> > secure interface for pseudorandom data requests.  It attempted to
> > solve three problems, as compared to /dev/urandom:
> 
> I don't think your patch is really _wrong_, but I think it's silly to
> introduce a new system call, when we have 30 bits left in the flags of
> the old one, and the old system call checked them.

The only reason to introduce a new system call is if we were going to
keep the existing behavior of getrandom.  Given that the patch changes
what getrandom(0), I agree there's no point to adding a new system
call.

> There is *one* other small semantic change: The old code did
> urandom_read() which added warnings, but each warning also _reset_ the
> crng_init_cnt. Until it decided not to warn any more, at which point
> it also stops that resetting of crng_init_cnt.
> 
> And that reset of crng_init_cnt, btw, is some cray cray.
> 
> It's basically a "we used up entropy" thing, which is very
> questionable to begin with as the whole discussion has shown, but
> since it stops doing it after 10 cases, it's not even good security
> assuming the "use up entropy" case makes sense in the first place.

It was a bug that it stopped doing it after 10 tries, and there's a
really good reason for it.  Yes, the "using up entropy" thing doesn't
make much sense in the general case.  But we still need some threshold
for deciding whether or not it's been sufficiently initialized such
that we consider the CRNG initialized.

The reason for zeroing it after we expose state is because otherwise
if the pool starts in a known state (the attacker knows the starting
configuration, knows the DMI table that we're mixing into the pool
since that's a constant, etc.), then after we've injected a small
amount of uncertainty in the pool --- say, we started with a single
known state of the pool, and after injecting some randomness, there
are 64 possible states of the pool.  If the attacker can read from
/dev/urandom, the attacker can know which of the 64 possible states of
the pool it's in.  Now suppose we inject more uncertainty, so that
there's another 64 unknown states, and the attacker is able to
constantly read from /dev/urandom in a tight loop; it'll be able to
keep up with the injection of entropy insertion, and so even though
we've injected 256 "bits" of uncertainty, the attacker will still know
the state of the pool.  That's why when we read from the pool, we need
to clear the entropy bits.

This is sometimes called a "state extension attack", and there have
been attacks that have been carried out against RNG's that's don't
protect against it.  What happened is when I added the rate-limiting
to the uninitialized /dev/urandom warning, I accidentally wiped out
the protection.  But it was there for a reason.

> And the new cases are defined to *not* warn. In particular,
> GRND_INSECURE very much does *not* warn about early urandom access
> when crng isn't ready. Because the whole point of that new mode is
> that the user knows it isn't secure.
> 
> So that should make getrandom(GRND_INSECURE) palatable to the systemd
> kind of use that wanted to avoid the pointless kernel warning.

Yes, that's clearly the right thing to do.  I do think we need to
restore the state extension attack protections, though.

> +	/*
> +	 * People are really confused about whether
> +	 * this is secure or insecure. Traditional
> +	 * behavior is secure, but there are users
> +	 * who clearly didn't want that, and just
> +	 * never thought about it.
> +	 */
> +	case 0:
>  		ret = wait_for_random_bytes();
> -		if (unlikely(ret))
> +		if (ret)
>  			return ret;
> +		break;

I'm happy this proposed is not changing the behavior of getrandom(0).
Why not just remap 0 to GRND_EXPLICIT | GRND_WAIT_ENTROPY, though?  It
will have the same effect, and it's make it clear what we're doing.

Later on, when we rip out /dev/random pool code (and make reading from
/dev/random the equivalent of getrandom(GRND_SECURE)), we'll need to
similarly map the legacy combination of flags for GRND_RANDOM and
GRND_RANDOM | GRND_NONBLOCK.

						- Ted
Linus Torvalds Sept. 19, 2019, 3:20 p.m. UTC | #3
On Thu, Sep 19, 2019 at 7:34 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> > It's basically a "we used up entropy" thing, which is very
> > questionable to begin with as the whole discussion has shown, but
> > since it stops doing it after 10 cases, it's not even good security
> > assuming the "use up entropy" case makes sense in the first place.
>
> It was a bug that it stopped doing it after 10 tries, and there's a
> really good reason for it.

I really doubt that.

> The reason for zeroing it after we expose state is because otherwise
> if the pool starts in a known state (the attacker knows the starting
> configuration, knows the DMI table that we're mixing into the pool
> since that's a constant, etc.),

That's at least partly because our pool hashing has what looks a
fairly sad property.

Yes, it hashes it using a good hash, but it does so in a way that
makes it largely possible to follow the hashing and repeat it and
analyze it.

That breaks if we have hw randomness, because it does the

        if (arch_get_random_long(&v))
                crng->state[14] ^= v;

so it always mixes in hardware randomness as part of the extraction,
but we don't mix anything else unpredictable - or even
process-specific - state in. So without hw randomness, you can try to
get a lot of data over a lot of boots - and for long times during
boots - and maybe find the pattern.

But honestly, this isn't realistic. I can point to emails where *you*
are  arguing against other hashing algorithms because the whole state
extension attack simply isn't realistic.

And I think it's also pretty questionable how we don't try to mix in
anything timing/process-specific when extracting, which is what makes
that "do lots of boots" possible.

The silly "reset crng_init_cnt" does absolutely nothing to help that,
but in fact what it does is to basically give the attacker a way to
get an infinite stream of data without any reseeding (because that
only happens after crng_read()), and able to extend that "block at
boot" time indefinitely while doing so.

Also honestly, if the attacker already has access to the system at
boot, you have some fairly big problems to begin with.

So a much bigger issue than the state extension attack (pretty much
purely theoretical, given any entropy at all, which we _will_ have
even without the crng_init_cnt clearing) is the fact that right now we
really are predictable if there are no hardware interrupts, and people
have used /dev/urandom because other sources weren't useful.

And the fact is, we *know* people use /dev/urandom exactly because
other sources haven't been useful.

And unlike your theoretical state extension attack, I can point you to
black hat presentations that literally talk about using the fact that
we delay m,ixing in the input pull hash to know what's going on:

  https://www.blackhat.com/docs/eu-14/materials/eu-14-Kedmi-Attacking-The-Linux-PRNG-On-Android-Weaknesses-In-Seeding-Of-Entropic-Pools-And-Low-Boot-Time-Entropy.pdf

That's a real attack. Based on the REAL fact that we currently have to
use the urandom logic because the entropy-waiting one is useless, and
in fact depends on the re-seeding happening too late.

Yes, yes, our urandom has changed since that attack, and we use chacha
instead of sha1 these days. We have other changes too. But I don't see
anything fundamentally different.

And all your arguments seem to make that _real_ security issue just
worse, exactly because we also avoid reseeding while crng_init is
zero.

> I'm happy this proposed is not changing the behavior of getrandom(0).
> Why not just remap 0 to GRND_EXPLICIT | GRND_WAIT_ENTROPY, though?  It
> will have the same effect, and it's make it clear what we're doing.

Have you you not followed the whole discussion? Didn't you read the comment?

People use "getrandom(0)" not because they want secure randomness, but
because that's the default.

And we *will* do something about it. This patch didn't, because I want
to be able to backport it to stable, so that everybody is happier with
saying "ok, I'll use the new getrandom(GRND_INSECURE)".

Because getrandom(0) will NOT be the the same as GRND_EXPLICIT |
GRND_WAIT_ENTROPY.

getrandom(0) is the "I don't know what I am doing" thing. It could be
somebody that wants real secure random numbers. Or it could *not* be
one of those, and need the timeout.

> Later on, when we rip out /dev/random pool code (and make reading from
> /dev/random the equivalent of getrandom(GRND_SECURE)), we'll need to
> similarly map the legacy combination of flags for GRND_RANDOM and
> GRND_RANDOM | GRND_NONBLOCK.

And that is completely immaterial, because the "I'm confused" case
isn't about GRND_RANDOM. Nobody uses that anyway, and more importantly
it's not the case that has caused bugs. That one blocks even during
normal execution, so that one - despite being completely useless -
actually has the one good thing going for it that it's testable.
People will see the "oh, that took a long time" during testing. And
then they'll stop using it.

Ted - you really don't seem to be making any distinction between
"these are real problems that should be fixed" vs "this is theory that
isn't relevant".

The "getrandom(0)" is a real problem that needs to be fixed.

The warnings from /dev/urandom are real problems that people
apparently have worked around by (incorrectly) using getrandom(0).

The "hashing the random pool still leaves identities in place" is a
real problem that had a real attack.

The state extension attack? Complete theory (again, I can point to you
saying the same thing in other threads), and the "fix" of resetting
the counter and not reseeding seems to be anything but.

            Linus
Linus Torvalds Sept. 19, 2019, 3:50 p.m. UTC | #4
On Thu, Sep 19, 2019 at 8:20 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The silly "reset crng_init_cnt" does absolutely nothing to help that,
> but in fact what it does is to basically give the attacker a way to
> get an infinite stream of data without any reseeding (because that
> only happens after crng_read()), and able to extend that "block at
> boot" time indefinitely while doing so.

.. btw, instead of bad workarounds for a theoretical attack, here's
something that should add actual *practical* real value: use the time
of day (whether from an RTC device, or from ntp) to add noise to the
random pool.

If you let attackers in before you've set the clock on the device,
you're doing something seriously wrong.

And while this doesn't add much "serious" entropy, it does mean that
the whole "let's look for identical state" which is a _real_ attack,
goes out the window.

In other words, this is about real security, not academic papers.

Of course, attackers can still see possible bad random values from
before the clock was set (possibly from things like TCP sequence
numbers etc, orfrom  that AT_RANDOM of a very early process, which was
part of the Android the attack). But doing things like delaying
reseeding sure isn't helping, which is what the crng_count reset does.

                 Linus
Linus Torvalds Sept. 19, 2019, 8:04 p.m. UTC | #5
On Thu, Sep 19, 2019 at 8:20 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yes, it hashes it using a good hash, but it does so in a way that
> makes it largely possible to follow the hashing and repeat it and
> analyze it.
>
> That breaks if we have hw randomness, because it does the
>
>         if (arch_get_random_long(&v))
>                 crng->state[14] ^= v;
>
> so it always mixes in hardware randomness as part of the extraction,
> but we don't mix anything else unpredictable - or even
> process-specific - state in.

So this is the other actual _serious_ patch I'd suggest: replace the

          if (arch_get_random_long(&v))
                  crng->state[14] ^= v;

with

          if (!arch_get_random_long(&v))
                  v = random_get_entropy();
          crng->state[14] += v;

instead. Yeah, it still doesn't help on machines that don't even have
a cycle counter, but it at least means that you don't have to have a
CPU rdrand (or equivalent) but you do have a cycle counter, now the
extraction of randomness from the pool doesn't just do the
(predictable) mutation for the backtracking, but actually means that
you have some very hard to predict timing effects.

Again, in this case a cycle counter really does add a small amount of
entropy (everybody agrees that modern CPU's are simply too complex to
be predictable at a cycle level), but that's not really the point. The
point is that now doing the extraction really fundamentally changes
the state in unpredictable ways, so that you don't have that "if I
recognize a value, I know what the next value will be" kind of attack.

Which, as mentioned, is actually not a purely theoretical concern.

Note small detail above: I changed the ^= to a +=. Addition tends to
be better (due to carry between bits) when there might be bit
commonalities.  Particularly with something like a cycle count where
two xors can mostly cancel out previous bits rather than move bits
around in the word.

With an actual random input from rdrand, the xor-vs-add is immaterial
and doesn't matter, of course, so the old code made sense in that
context.

In the attached patch I also moved the arch_get_random_long() and
random_get_entropy() to outside the crng spinlock. We're not talking
blocking operations, but it can easily be hundreds of cycles with
rdrand retries, or the random_get_entropy() reading an external clock
on some architectures.

                 Linus
Alexander E. Patrakov Sept. 19, 2019, 8:45 p.m. UTC | #6
20.09.2019 01:04, Linus Torvalds пишет:

> instead. Yeah, it still doesn't help on machines that don't even have
> a cycle counter, but it at least means that you don't have to have a
> CPU rdrand (or equivalent) but you do have a cycle counter, now the
> extraction of randomness from the pool doesn't just do the
> (predictable) mutation for the backtracking, but actually means that
> you have some very hard to predict timing effects.
> 
> Again, in this case a cycle counter really does add a small amount of
> entropy (everybody agrees that modern CPU's are simply too complex to
> be predictable at a cycle level), but that's not really the point. The
> point is that now doing the extraction really fundamentally changes
> the state in unpredictable ways, so that you don't have that "if I
> recognize a value, I know what the next value will be" kind of attack.

This already resembles in-kernel haveged (except that it doesn't credit 
entropy), and Willy Tarreau said "collect the small entropy where it is, 
period" today. So, too many people touched upon the topic in one day, 
and therefore I'll bite.

We already have user-space software (haveged and modern versions of 
rngd) that extract supposed entropy from clock jitter and feed it back 
to the kernel via /dev/random (crediting it). Indeed, at present, on 
some hardware this is the only way for distributions and users to 
collect enough entropy during boot and avoid stalls - all other 
suggestions are simply non-constructive. Also, Google's Fuchsia OS does 
use and credit jitter entropy.

For the record: I do not have a justifiable opinion whether haveged/rngd 
output (known as jitter entropy) actually contains any entropy. I 
understand that there are two possible viewpoints here. The rest of the 
email is written under the assumption that haveged does provide real 
entropy and not fake one.

The problem that I have with the current situation is that distributions 
and users, when they set up their systems to run haveged or rngd, often 
do it incorrectly (even, as mentioned, under the assumption that haveged 
is something valid and useful). The most common mistake is relying on 
systemd-provided default dependencies, thus not starting such software 
as early as possible. Even worse, no initramfs generator allows one to 
easily include haveged/rngd in the initramfs and run it there. And for 
me, the first urandom warning comes from the initramfs, so anything 
started from the main system is, arguably, already too late.

Therefore, I think, an in-kernel hwrng that exposes jitter entropy is 
something useful (for those who agree that jitter entropy is not fake), 
because it avoids the pitfall-ridden userspace setup. Just as an 
exercise, I have implemented a very simple driver (attached as a patch) 
that does just that. I am only half-serious here, the driver is only 
lightly tested in KVM without any devices except an unconnected virtio 
network card, not on any real hardware. Someone else can also find it 
useful as a test/fake hwrng driver.

I am aware that there was an earlier decision that jitter entropy should 
not be credited, i.e. effectively a pre-existing NAK from Theodore Ts'o. 
But, well, distributions are already overriding this decision in 
userspace, and do it badly, so in my viewpoint, the driver would be a 
net win if some mechanism is added that makes it a no-op by default even 
if the driver is built-in. E.g. an explicit "enable" parameter, but I am 
open to other suggestions, too.
Linus Torvalds Sept. 19, 2019, 9:47 p.m. UTC | #7
On Thu, Sep 19, 2019 at 1:45 PM Alexander E. Patrakov
<patrakov@gmail.com> wrote:
>
> This already resembles in-kernel haveged (except that it doesn't credit
> entropy), and Willy Tarreau said "collect the small entropy where it is,
> period" today. So, too many people touched upon the topic in one day,
> and therefore I'll bite.

I'm one of the people who aren't entirely convinced by the jitter
entropy - I definitely believe it exists, I just am not necessarily
convinced about the actual entropy calculations.

So while I do think we should take things like the cycle counter into
account just because I think it's a a useful way to force some noise,
I am *not* a huge fan of the jitter entropy driver either, because of
the whole "I'm not convinced about the amount of entropy".

The whole "third order time difference" thing would make sense if the
time difference was some kind of smooth function - which it is at a
macro level.

But at a micro level, I could easily see the time difference having
some very simple pattern - say that your cycle counter isn't really
cycle-granular, and the load takes 5.33 "cycles" and you see a time
difference pattern of (5, 5, 6, 5, 5, 6, ...). No real entropy at all
there, it is 100% reliable.

At a macro level, that's a very smooth curve, and you'd say "ok, time
difference is 5.3333 (repeating)". But that's not what the jitter
entropy code does. It just does differences of differences.

And that completely non-random pattern has a first-order difference of
0, 1, 1, 0, 1, 1.. and a second order of 1, 0, 1, 1, 0,  and so on
forever. So the "jitter entropy" logic will assign that completely
repeatable thing entropy, because the delta difference doesn't ever go
away.

Maybe I misread it.

We used to (we still do, but we used to too) do that same third-order
delta difference ourselves for the interrupt timing entropy estimation
in add_timer_randomness(). But I think it's more valid with something
that likely has more noise (interrupt timing really _should_ be
noisy). It's not clear that the jitterentropy load really has all that
much noise.

That said, I'm _also_ not a fan of the user mode models - they happen
too late anyway for some users, and as you say, it leaves us open to
random (heh) user mode distribution choices that may be more or less
broken.

I would perhaps be willing to just put my foot down, and say "ok,
we'll solve the 'getrandom(0)' issue by just saying that if that
blocks too  much, we'll do the jitter entropy thing".

Making absolutely nobody happy, but working in practice. And maybe
encouraging the people who don't like jitter entropy to use
GRND_SECURE instead.

              Linus
Alexander E. Patrakov Sept. 19, 2019, 10:23 p.m. UTC | #8
20.09.2019 02:47, Linus Torvalds пишет:
> On Thu, Sep 19, 2019 at 1:45 PM Alexander E. Patrakov
> <patrakov@gmail.com> wrote:
>>
>> This already resembles in-kernel haveged (except that it doesn't credit
>> entropy), and Willy Tarreau said "collect the small entropy where it is,
>> period" today. So, too many people touched upon the topic in one day,
>> and therefore I'll bite.
> 
> I'm one of the people who aren't entirely convinced by the jitter
> entropy - I definitely believe it exists, I just am not necessarily
> convinced about the actual entropy calculations.
> 
> So while I do think we should take things like the cycle counter into
> account just because I think it's a a useful way to force some noise,
> I am *not* a huge fan of the jitter entropy driver either, because of
> the whole "I'm not convinced about the amount of entropy".
> 
> The whole "third order time difference" thing would make sense if the
> time difference was some kind of smooth function - which it is at a
> macro level.
> 
> But at a micro level, I could easily see the time difference having
> some very simple pattern - say that your cycle counter isn't really
> cycle-granular, and the load takes 5.33 "cycles" and you see a time
> difference pattern of (5, 5, 6, 5, 5, 6, ...). No real entropy at all
> there, it is 100% reliable.
> 
> At a macro level, that's a very smooth curve, and you'd say "ok, time
> difference is 5.3333 (repeating)". But that's not what the jitter
> entropy code does. It just does differences of differences.
> 
> And that completely non-random pattern has a first-order difference of
> 0, 1, 1, 0, 1, 1.. and a second order of 1, 0, 1, 1, 0,  and so on
> forever. So the "jitter entropy" logic will assign that completely
> repeatable thing entropy, because the delta difference doesn't ever go
> away.
> 
> Maybe I misread it.

You didn't. Let me generalize and rephrase the part of the concern that 
I agree with, in my own words:

The same code is used in cryptoapi rng, and also a userspace version 
exists. These two have been tested by the author via the "dieharder" 
tool (see the message for commit d9d67c87), so we know that on his 
machine it actually produces good-quality random bits. However, the 
in-kernel self-test is much, much weaker, and would not catch the 
situation when someone's machine is deterministic in a way that you 
describe, or something similar.

OTOH, I thought that at least part of the real entropy, if it exists, 
comes from the interference of the CPU's memory accesses with the 
refresh cycles that are clocked from an independent oscillator. That's 
why (in order to catch more of them before declaring the crng 
initialized) I have set the quality to the minimum possible that is 
guaranteed to be distinct from zero according to the fixed-point math in 
hwrng_fillfn() in drivers/char/hw_random/core.c.

> 
> We used to (we still do, but we used to too) do that same third-order
> delta difference ourselves for the interrupt timing entropy estimation
> in add_timer_randomness(). But I think it's more valid with something
> that likely has more noise (interrupt timing really _should_ be
> noisy). It's not clear that the jitterentropy load really has all that
> much noise.
> 
> That said, I'm _also_ not a fan of the user mode models - they happen
> too late anyway for some users, and as you say, it leaves us open to
> random (heh) user mode distribution choices that may be more or less
> broken.
> 
> I would perhaps be willing to just put my foot down, and say "ok,
> we'll solve the 'getrandom(0)' issue by just saying that if that
> blocks too  much, we'll do the jitter entropy thing".
> 
> Making absolutely nobody happy, but working in practice. And maybe
> encouraging the people who don't like jitter entropy to use
> GRND_SECURE instead.

I think this approach makes sense. For those who don't believe in jitter 
entropy, it changes really nothing (except a one-time delay) to Ahmed's 
first patch that makes getrandom(0) equivalent to /dev/urandom, and 
nobody so far proposed anything better that doesn't break existing 
systems. And for those who do believe in jitter entropy, this makes the 
situation as good as in OpenBSD.
Alexander E. Patrakov Sept. 19, 2019, 11:44 p.m. UTC | #9
20.09.2019 03:23, Alexander E. Patrakov пишет:
> 20.09.2019 02:47, Linus Torvalds пишет:
>> On Thu, Sep 19, 2019 at 1:45 PM Alexander E. Patrakov
>> <patrakov@gmail.com> wrote:
>>>
>>> This already resembles in-kernel haveged (except that it doesn't credit
>>> entropy), and Willy Tarreau said "collect the small entropy where it is,
>>> period" today. So, too many people touched upon the topic in one day,
>>> and therefore I'll bite.
>>
>> I'm one of the people who aren't entirely convinced by the jitter
>> entropy - I definitely believe it exists, I just am not necessarily
>> convinced about the actual entropy calculations.
>>
>> So while I do think we should take things like the cycle counter into
>> account just because I think it's a a useful way to force some noise,
>> I am *not* a huge fan of the jitter entropy driver either, because of
>> the whole "I'm not convinced about the amount of entropy".
>>
>> The whole "third order time difference" thing would make sense if the
>> time difference was some kind of smooth function - which it is at a
>> macro level.
>>
>> But at a micro level, I could easily see the time difference having
>> some very simple pattern - say that your cycle counter isn't really
>> cycle-granular, and the load takes 5.33 "cycles" and you see a time
>> difference pattern of (5, 5, 6, 5, 5, 6, ...). No real entropy at all
>> there, it is 100% reliable.
>>
>> At a macro level, that's a very smooth curve, and you'd say "ok, time
>> difference is 5.3333 (repeating)". But that's not what the jitter
>> entropy code does. It just does differences of differences.
>>
>> And that completely non-random pattern has a first-order difference of
>> 0, 1, 1, 0, 1, 1.. and a second order of 1, 0, 1, 1, 0,  and so on
>> forever. So the "jitter entropy" logic will assign that completely
>> repeatable thing entropy, because the delta difference doesn't ever go
>> away.
>>
>> Maybe I misread it.
> 
> You didn't. Let me generalize and rephrase the part of the concern that 
> I agree with, in my own words:
> 
> The same code is used in cryptoapi rng, and also a userspace version 
> exists. These two have been tested by the author via the "dieharder" 
> tool (see the message for commit d9d67c87), so we know that on his 
> machine it actually produces good-quality random bits. However, the 
> in-kernel self-test is much, much weaker, and would not catch the 
> situation when someone's machine is deterministic in a way that you 
> describe, or something similar.

A constructive suggestion here would be to put the first few thousands 
(ok, a completely made up number) raw timing intervals through a "gzip 
compression test" in addition to the third derivative test, just based 
on what we already have in the kernel.
Theodore Ts'o Sept. 20, 2019, 1:08 p.m. UTC | #10
On Thu, Sep 19, 2019 at 08:20:57AM -0700, Linus Torvalds wrote:
> And unlike your theoretical state extension attack, I can point you to
> black hat presentations that literally talk about using the fact that
> we delay m,ixing in the input pull hash to know what's going on:
> 
>   https://www.blackhat.com/docs/eu-14/materials/eu-14-Kedmi-Attacking-The-Linux-PRNG-On-Android-Weaknesses-In-Seeding-Of-Entropic-Pools-And-Low-Boot-Time-Entropy.pdf
> 
> That's a real attack. Based on the REAL fact that we currently have to
> use the urandom logic because the entropy-waiting one is useless, and
> in fact depends on the re-seeding happening too late.

Actually, that particular case proves my point.

In that particular attack was against Android 4.3 (Android KitKat).
In the 3.4 kernel used by KitKat, before the urandom pool is
considered initialized, 100% of the entropy from
add_interrupt_randomness() goes to the urandom pool, NOT the input
pool.  add_device_entropy() also fed the urandom pool.  And on an
Android device, it doesn't have a keyboard, mouse, or spinning HDD, so
add_timer_randomness() and add_disk_randomness() weren't a factor.

The real problem was that the Android zygote process sampled the the
urandom pool too early, and what the attack did was essentially one
where they were trying to determine the state of the pool by looking
at that sampled output of /dev/urandom.

If we make getrandom(0) work like /dev/urandom, it doesn't solve the
problem, because if you read from the entropy pool before we can get
high quality randomness, you're screwed.  The only real answers are
(a) try to get better entropy early, or (b) get userspace to wait
until it's safe to read from /dev/urandom.

Long-term, (a) is the only real way to solve the problem, and whether
you trust the bootloader, or trust the built-in hardware random number
generator (whether it's RDRAND, or some secure element in the device,
etc), we can't control userspace.  We can try to enforce userspace to
be safe by blocking, but that makes people unhappy.  We can certainly
try to influence userspace by annoying them with WARN() stack traces
in the logs, and hope they pay attention, but that's not guaranteed.

> But honestly, this isn't realistic. I can point to emails where *you*
> are  arguing against other hashing algorithms because the whole state
> extension attack simply isn't realistic.

The blackhat presentation which you pointed at *was* actually a state
extension attack.  When I argued against state extension attacks, that
was in cases where people worried about recovery after the pool is
exposed --- and my argument was if you can read from kernel memory
enough to grab the pool state, you have other problems.  Your
observation that if you can install malware that runs at system
initscript/userspace bootup time, you probably have other problems, is
a similar argument, and it's a fair one.  But it *has* happened, as
the blackhat paper demonstrates.

My thinking at the time is that if people are reading from the CRNG
before it's initialized (which could only happen via /dev/urandom),
that was kind of a disaster anyway, so resetting the initialization
count would at least get us to the point where when the CRNG *was*
declared to be initialized, that was something could state with high
confidence that we were in a secure state.  

> > I'm happy this proposed is not changing the behavior of getrandom(0).
> > Why not just remap 0 to GRND_EXPLICIT | GRND_WAIT_ENTROPY, though?  It
> > will have the same effect, and it's make it clear what we're doing.
> 
> Have you you not followed the whole discussion? Didn't you read the comment?
> 
> People use "getrandom(0)" not because they want secure randomness, but
> because that's the default.
> 
> And we *will* do something about it. This patch didn't, because I want
> to be able to backport it to stable, so that everybody is happier with
> saying "ok, I'll use the new getrandom(GRND_INSECURE)".
> 
> Because getrandom(0) will NOT be the the same as GRND_EXPLICIT |
> GRND_WAIT_ENTROPY.

No, I did read the comment.  And I agree that at the moment, that yes,
it is ambiguous.  What I really care about though, is the HUGE
DEPLOYED BASE which is using getrandom(0) *because* they are
generating cryptographic keys, and we will be changing things out from
under them.

We agree that we don't want to change things out from under the stable
users.  I'm pleading that we not screw over existing userspace --- at
least not right away.  Give them *time* to release update their source
bases to use getrandom(GRND_SECURE).  So what if we make getrandom(0)
print a ratelimited KERN_ERR deprecation notice that program should
switch to either specify either GRND_INSECURE or GRND_SECURE, and not
change the current semantics of getrandom(0) for some period of time?
Say, a year.  Or even six months.

If that's not good enough, what if we change getrandom(0) immediately,
but only for those platforms which have a functional
arch_get_random_long() or random_get_entropy()?  That gets us the x86
platform, which is where pretty much all of the users who have
complained have been coming from.  For the IOT/embedded user cases,
blocking is actually a feature, because the problem will be caught
while the product is in development, when the userspace code can be
fixed.

						- Ted
Theodore Ts'o Sept. 20, 2019, 1:13 p.m. UTC | #11
On Thu, Sep 19, 2019 at 08:50:15AM -0700, Linus Torvalds wrote:
> .. btw, instead of bad workarounds for a theoretical attack, here's
> something that should add actual *practical* real value: use the time
> of day (whether from an RTC device, or from ntp) to add noise to the
> random pool.

Actally, we used to seed the pool from the RTC device --- that was the
case in the 3.4 kernel referenced by the Blackhat attack, and it
didn't stop the researchers.  In later kernels, we moved up when
rand_initialized() got called to before time_init(), so
init_std_data() was no longer seeding the pool from the RTC clock.

That being said, adding calls to add_device_randomness() to
do_settimeofday64() and timekeeping_inject_offset() is an obviously
good thing to do.  I'll prepare a separate patch for the random.git
tree to do that.

					- Ted
Theodore Ts'o Sept. 20, 2019, 1:16 p.m. UTC | #12
On Fri, Sep 20, 2019 at 03:23:58AM +0500, Alexander E. Patrakov wrote:
> OTOH, I thought that at least part of the real entropy, if it exists, comes
> from the interference of the CPU's memory accesses with the refresh cycles
> that are clocked from an independent oscillator.

That's not a valid assumption; on *many* systems, there is only a
single master oscillator.  It saves on power, parts cost, reduces the
amount of RF interference, etc.

						- Ted
Ahmed S. Darwish Sept. 20, 2019, 1:46 p.m. UTC | #13
Hi,

On Wed, Sep 18, 2019 at 04:57:58PM -0700, Linus Torvalds wrote:
> On Wed, Sep 18, 2019 at 2:17 PM Ahmed S. Darwish <darwish.07@gmail.com> wrote:
> >
> > Since Linux v3.17, getrandom(2) has been created as a new and more
> > secure interface for pseudorandom data requests.  It attempted to
> > solve three problems, as compared to /dev/urandom:
  > 
> I don't think your patch is really _wrong_, but I think it's silly to
> introduce a new system call, when we have 30 bits left in the flags of
> the old one, and the old system call checked them.
> 
> So it's much simpler and more straightforward to  just introduce a
> single new bit #2 that says "I actually know what I'm doing, and I'm
> explicitly asking for secure/insecure random data".
> 
> And then say that the existing bit #1 just means "I want to wait for entropy".
> 
> So then you end up with this:
> 
>     /*
>      * Flags for getrandom(2)
>      *
>      * GRND_NONBLOCK    Don't block and return EAGAIN instead
>      * GRND_WAIT_ENTROPY        Explicitly wait for entropy
>      * GRND_EXPLICIT    Make it clear you know what you are doing
>      */
>     #define GRND_NONBLOCK               0x0001
>     #define GRND_WAIT_ENTROPY   0x0002
>     #define GRND_EXPLICIT               0x0004
> 
>     #define GRND_SECURE (GRND_EXPLICIT | GRND_WAIT_ENTROPY)
>     #define GRND_INSECURE       (GRND_EXPLICIT | GRND_NONBLOCK)
> 
>     /* Nobody wants /dev/random behavior, nobody should use it */
>     #define GRND_RANDOM 0x0002
> 
> which is actually fairly easy to understand. So now we have three
> bits, and the values are:
> 
>  000  - ambiguous "secure or just lazy/ignorant"
>  001 - -EAGAIN or secure
>  010 - blocking /dev/random DO NOT USE
>  011 - nonblocking /dev/random DO NOT USE
>  100 - nonsense, returns -EINVAL
>  101 - /dev/urandom without warnings
>  110 - blocking secure
>  111 - -EAGAIN or secure
>

Hmmm, the point of the new syscall was **exactly** to avoid the 2^3
combinations above, and to provide developers only two, sane and easy,
options:

  - GRND2_INSECURE
  - GRND2_SECURE_UNBOUNDED_INITIAL_WAIT

You *must* pick one of these, and that's it. (!)

Then the proposed getrandom_wait(7) manpage, also mentioned in the V4
patch WARN message, would provide a big rationale, and encourage
everyone to use the new getrandom2(2) syscall instead.

But yeah, maybe we should add the extra flags to the old getrandom()
instead, and let glibc implement a getrandom_safe(3) wrapper only
with the sane options available.

Problem is, glibc is still *really* slow in adopting linux syscall
wrappers, so I'm not optimistic about that...

I still see the new system call as the sanest path, even provided
the cost of a new syscall number..

@Linus, @Ted:  Final thoughts?

> and people would be encouraged to use one of these three:
> 
>  - GRND_INSECURE
>  - GRND_SECURE
>  - GRND_SECURE | GRND_NONBLOCK
> 
> all of which actually make sense, and none of which have any
> ambiguity. And while "GRND_INSECURE | GRND_NONBLOCK" works, it's
> exactly the same as just plain GRND_INSECURE - the point is that it
> doesn't block for entropy anyway, so non-blocking makes no different.
>

[...]

> 
> There is *one* other small semantic change: The old code did
> urandom_read() which added warnings, but each warning also _reset_ the
> crng_init_cnt. Until it decided not to warn any more, at which point
> it also stops that resetting of crng_init_cnt.
> 
> And that reset of crng_init_cnt, btw, is some cray cray.
> 
> It's basically a "we used up entropy" thing, which is very
> questionable to begin with as the whole discussion has shown, but
> since it stops doing it after 10 cases, it's not even good security
> assuming the "use up entropy" case makes sense in the first place.
> 
> So I didn't copy that insanity either. And I'm wondering if removing
> it from /dev/urandom might also end up helping Ahmed's case of getting
> entropy earlier, when we don't reset the counter.
>

Yeah, noticed that, but I've learned not to change crypto or
speculative-execution code even if the changes "just look the same" at
first glance ;-)

(out of curiosity, I'll do a quick test with this CRNG entropy reset
part removed. Maybe it was indeed part of the problem..)

> But other than those two details, none of the existing semantics
> changed, we just added the three actually _sane_ cases without any
> ambiguity.
> 
> In particular, this still leaves the semantics of that nasty
> "getrandom(0)" as the same "blocking urandom" that it currently is.
> But now it's a separate case, and we can make that perhaps do the
> timeout, or at least the warning.
>

Yeah, I would propose to keep the V4-submitted "timeout then WARN"
logic. This alone will give user-space / distributions time to adapt.

For example, it was interesting that even the 0day bot had limited
entropy on boot (virtio-rng / TRUST_CPU not enabled):

    https://lkml.kernel.org/r/20190920005120.GP15734@shao2-debian

If user-space didn't get its act together, then the other extreme
measures can be implemented later (the getrandom() length test, using
jitter as a credited kernel entropy source, etc., etc.)

> And the new cases are defined to *not* warn. In particular,
> GRND_INSECURE very much does *not* warn about early urandom access
> when crng isn't ready. Because the whole point of that new mode is
> that the user knows it isn't secure.
>
> So that should make getrandom(GRND_INSECURE) palatable to the systemd
> kind of use that wanted to avoid the pointless kernel warning.
>

Yup, that's what was in the submitted V4 patch too. The caller
explicitly asked for "insecure", so they know what they're doing.

getrandom2(2) never prints any kernel message.

> And we could mark this for stable and try to get it backported so that
> it will have better coverage, and encourage people to use the new sane
> _explicit_ waiting (or not) for entropy.
>

ACK. I'll wait for an answer to the "Final thoughts?" question above,
send a V5 with CC:stable, then disappear from this thread ;-)

Thanks a lot everyone!

--
Ahmed Darwish
Andy Lutomirski Sept. 20, 2019, 2:33 p.m. UTC | #14
On Fri, Sep 20, 2019 at 6:46 AM Ahmed S. Darwish <darwish.07@gmail.com> wrote:
>
> Hi,
>
> On Wed, Sep 18, 2019 at 04:57:58PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 18, 2019 at 2:17 PM Ahmed S. Darwish <darwish.07@gmail.com> wrote:
> > >
> > > Since Linux v3.17, getrandom(2) has been created as a new and more
> > > secure interface for pseudorandom data requests.  It attempted to
> > > solve three problems, as compared to /dev/urandom:
>   >
> > I don't think your patch is really _wrong_, but I think it's silly to
> > introduce a new system call, when we have 30 bits left in the flags of
> > the old one, and the old system call checked them.
> >
> > So it's much simpler and more straightforward to  just introduce a
> > single new bit #2 that says "I actually know what I'm doing, and I'm
> > explicitly asking for secure/insecure random data".
> >
> > And then say that the existing bit #1 just means "I want to wait for entropy".
> >
> > So then you end up with this:
> >
> >     /*
> >      * Flags for getrandom(2)
> >      *
> >      * GRND_NONBLOCK    Don't block and return EAGAIN instead
> >      * GRND_WAIT_ENTROPY        Explicitly wait for entropy
> >      * GRND_EXPLICIT    Make it clear you know what you are doing
> >      */
> >     #define GRND_NONBLOCK               0x0001
> >     #define GRND_WAIT_ENTROPY   0x0002
> >     #define GRND_EXPLICIT               0x0004

What is this GRND_EXPLICIT thing?

A few weeks ago, I sent a whole series to address this, and I
obviously didn't cc enough people.  I'll resend a rebased version
today.  Meanwhile, some comments on this whole mess:

As I think everyone mostly agrees in this whole thread, getrandom()
can't just magically start returning non-random results.  That would
be a big problem.

Linus, I disagree that blocking while waiting for randomness is an
error.  Sometimes you want to generate a key, you want to finish as
quickly as possible, and you don't want to be in the business of
fiddling with the setup of the kernel RNG.  I would argue that *most*
crypto applications are in this category.  I think that the kernel
should, instead, handle this mess itself.  As a first pass, it could
be as simple as noticing that someone is blocking on randomness and
kicking off a thread that does some randomish reads to the rootfs.
This would roughly simulate the old behavior in which an ext4 rootfs
did more IO than necessary.  A fancier version would, as discussed in
this thread, do more clever things.

(As an aside, I am not a fan of xoring or adding stuff to the CRNG
state.  We should just use an actual crypto primitive for this.
Accumulate the state in a buffer and SHA-512 it.  Or use something
like the Keccak duplex sponge.  But this is a discussion for another
day.)

So I'm going to resend my series.  You can all fight over whether the
patch that actually goes in should be based on my series or based on
this patch.

--Andy
Linus Torvalds Sept. 20, 2019, 4:29 p.m. UTC | #15
On Fri, Sep 20, 2019 at 7:34 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> What is this GRND_EXPLICIT thing?

Your own email gives the explanation:

> Linus, I disagree that blocking while waiting for randomness is an
> error.  Sometimes you want to generate a key

That's *exactly* why GRND_EXPLICIT needs to be done regardless.

The keyword there is "Sometimes".

But people currently use "getrandom(0)" when they DO NOT want a key,
they just want some miscellaneous random numbers for some totally
non-security-related reason.

And that will continue. Exactly because the people who do not want a
key by definition aren't thinking about it very hard.

So the interface was very much mis-designed from the get-go. It was
designed purely for key people, even though generating keys is by no
means the most common reason for wanting a block of "random" numbers.

So GRND_EXPLICIT is there very much to make sure people who want true
secure keys will say so, and five years from now we will not have the
confusion between "Oh, I wasn't thinking about bootup". Because at a
minimum, in the near future getrandom(0) will warn about the
ambiguity. Or it will use some questionable jitter entropy that some
real key users will look at sideways and go "I don't want that".

This is an ABI design issue. The old ABI was fundamentally misdesigned
and actively encouraged the current situation of mixing secure and
insecure callers for that getrandom(0).

And it's entirely orthogonal to _any_ actual technical change we will
do (like removing the old GRND_RANDOM behavior entirely, which is
insane for other reasons and nobody ever wanted or likely used).

            Linus
Willy Tarreau Sept. 20, 2019, 5:26 p.m. UTC | #16
Hi Ahmed,

On Fri, Sep 20, 2019 at 03:46:09PM +0200, Ahmed S. Darwish wrote:
> Problem is, glibc is still *really* slow in adopting linux syscall
> wrappers, so I'm not optimistic about that...
>
> I still see the new system call as the sanest path, even provided
> the cost of a new syscall number..

New syscalls are always a pain to deal with in userland, because when
they are introduced, everyone wants them long before they're available
in glibc. So userland has to define NR_xxx for each supported arch and
to perform the call itself.

With flags adoption is instantaneous. Just #ifndef/#define, check if
the flag is supported and that's done. The only valid reason for a new
syscall is when the API changes (e.g. one extra arg, a la accept4()),
which doesn't seem to be the case here. Otherwise please by all means
avoid this in general.

Thanks,
Willy
Andy Lutomirski Sept. 20, 2019, 5:52 p.m. UTC | #17
On Fri, Sep 20, 2019 at 9:30 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Sep 20, 2019 at 7:34 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > What is this GRND_EXPLICIT thing?
>
> Your own email gives the explanation:
>
> > Linus, I disagree that blocking while waiting for randomness is an
> > error.  Sometimes you want to generate a key
>
> That's *exactly* why GRND_EXPLICIT needs to be done regardless.
>
> The keyword there is "Sometimes".
>
> But people currently use "getrandom(0)" when they DO NOT want a key,
> they just want some miscellaneous random numbers for some totally
> non-security-related reason.
>
> And that will continue. Exactly because the people who do not want a
> key by definition aren't thinking about it very hard.

I fully agree that this is a problem.  It's a problem we brought on
ourselves because we screwed up the ABI from the beginning.  The
question is what to do about it that doesn't cause its own set of
nasty problems.

> So GRND_EXPLICIT is there very much to make sure people who want true
> secure keys will say so, and five years from now we will not have the
> confusion between "Oh, I wasn't thinking about bootup". Because at a
> minimum, in the near future getrandom(0) will warn about the
> ambiguity. Or it will use some questionable jitter entropy that some
> real key users will look at sideways and go "I don't want that".

There are programs that call getrandom(0) *today* that expect secure
output.  openssl does a horrible dance in which it calls getentropy()
if available and falls back to syscall(__NR_getrandom, buf, buflen, 0)
otherwise.  We can't break this use case.  Changing the semantics of
getrandom(0) out from under them seems like the worst kind of ABI
break -- existing applications will *appear* to continue working but
will, in fact, become insecure.

IMO, from the beginning, we should have done this:

GRND_INSECURE: insecure.  always works.

GRND_SECURE_BLOCKING: does exactly what it says.

0: -EINVAL.

Using it correctly would be obvious.  Something like GRND_EXPLICIT
would be a head-scratcher: people would have to look at the man page
and actually think about it, and it's still easy to get wrong:

getrandom(..., GRND_EXPLICIT): just fscking give me a number.  it
seems to work and it shuts up the warning

And we're back to square one.


I think that, given existing software, we should make two or three
changes to fix the basic problems here:

1. Add GRND_INSECURE: at least let new applications do the right thing
going forward.

2. Fix what is arguably a straight up kernel bug, not even an ABI
issue: when a user program is blocking in getrandom(..., 0), the
kernel happily sits there doing absolutely nothing and deadlocks the
system as a result.  This IMO isn't an ABI issue -- it's an
implementation problem.  How about we make getrandom() (probably
actually wait_for_random_bytes()) do something useful to try to seed
the RNG if the system is otherwise not doing IO.

3. Optionally, entirely in user code: Get glibc to add new *library*
functions: getentropy_secure_blocking() and getentropy_insecure() or
whatever they want to call them.  Deprecate getentropy().

I think #2 is critical.  Right now, suppose someone has a system that
neets to do a secure network request (a la Red Hat's Clevis).  I have
no idea what Clevis actually does, but it wouldn't be particularly
crazy to do a DH exchange or sign with an EC key to ask some network
server to help unlock a dm-crypt volume.  If the system does this at
boot, it needs to use getrandom(..., 0), GRND_EXPLICIT, or whatever,
because it NEEDS a secure random number.  No about of ABI fiddling
will change this.  The kernel should *work* in this case rather than
deadlocking.

--Andy
Ahmed S. Darwish Sept. 20, 2019, 5:56 p.m. UTC | #18
On Fri, Sep 20, 2019 at 07:26:09PM +0200, Willy Tarreau wrote:
> Hi Ahmed,
> 
> On Fri, Sep 20, 2019 at 03:46:09PM +0200, Ahmed S. Darwish wrote:
> > Problem is, glibc is still *really* slow in adopting linux syscall
> > wrappers, so I'm not optimistic about that...
> >
> > I still see the new system call as the sanest path, even provided
> > the cost of a new syscall number..
> 
> New syscalls are always a pain to deal with in userland, because when
> they are introduced, everyone wants them long before they're available
> in glibc. So userland has to define NR_xxx for each supported arch and
> to perform the call itself.
> 
> With flags adoption is instantaneous. Just #ifndef/#define, check if
> the flag is supported and that's done. The only valid reason for a new
> syscall is when the API changes (e.g. one extra arg, a la accept4()),
> which doesn't seem to be the case here. Otherwise please by all means
> avoid this in general.
> 

I see. Thanks a lot for the explanation above :)

--
Ahmed Darwish
Linus Torvalds Sept. 20, 2019, 6:09 p.m. UTC | #19
On Fri, Sep 20, 2019 at 10:52 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> IMO, from the beginning, we should have done this:
>
> GRND_INSECURE: insecure.  always works.
>
> GRND_SECURE_BLOCKING: does exactly what it says.
>
> 0: -EINVAL.

Violently agreed. And that's kind of what the GRND_EXPLICIT is really
aiming for.

However, it's worth noting that nobody should ever use GRND_EXPLICIT
directly. That's just the name for the bit. The actual users would use
GRND_INSECURE or GRND_SECURE.

And yes, maybe it's worth making the name be GRND_SECURE_BLOCKING just
to make people see what the big deal is.

In the meantime, we need that new bit just to be able to create the
new semantics eventually. With a warning to nudge people in the right
direction.

We may never be able to return -EINVAL, but we can add the pr_notice()
to discourage people from using it.

And yes, we'll have to block - at least for a time - to get some
entropy. But at some point we either start making entropy up, or we
say "0 means jitter-entropy for ten seconds".

That will _work_, but it will also make the security-people nervous,
which is just one more hint that they should move to
GRND_SECURE[_BLOCKING].

> getrandom(..., GRND_EXPLICIT): just fscking give me a number.  it
> seems to work and it shuts up the warning
>
> And we're back to square one.

Actually, you didn't read the GRND_INSECURE patch, did you.

getrandom(GRND_EXPLICIT) on its own returns -EINVAL.

Because yes, I thought about it, and yes, I agree that it's the same
as the old 0.

So GRND_EXPLICIT is a bit that basically means "I am explicit about
what behavior I want". But part of that is that you need to _state_
the behavior too.

So:

 - GRND_INSECURE is (GRND_EXPLICIT | GRND_NONBLOCK)

   As in "I explicitly ask you not to just not ever block": urandom

 - GRND_SECURE_BLOCKING is (GRND_EXPLICIT | GRND_RANDOM)

   As in "I explicitly ask you for those secure random numbers"

 - GRND_SECURE_NONBLOCKING is (GRND_EXPLICIT | GRND_RANDOM | GRND_NONBLOCK)

   As in "I want explicitly secure random numbers, but return -EAGAIN
if that would block".

Which are the three sane behaviors (that last one is useful for the "I
can try to generate entropy if you don't have any" case. I'm not sure
anybody will do it, but it definitely conceptually makes sense).

And I agree that your naming is better.

I had it as just "GRND_SECURE" for the blocking version, and
"GRND_SECURE | GRND_NONBLOCK" for the "secure but return EAGAIN if you
would need to block for entropy" version.

But explicitly stating the blockingness in the name makes it clearer
to the people who just want GRND_INSECURE, and makes them realize that
they don't want the blocking version.

             Linus
Willy Tarreau Sept. 20, 2019, 6:12 p.m. UTC | #20
Hi Andy,

On Fri, Sep 20, 2019 at 10:52:30AM -0700, Andy Lutomirski wrote:
> 2. Fix what is arguably a straight up kernel bug, not even an ABI
> issue: when a user program is blocking in getrandom(..., 0), the
> kernel happily sits there doing absolutely nothing and deadlocks the
> system as a result.  This IMO isn't an ABI issue -- it's an
> implementation problem.  How about we make getrandom() (probably
> actually wait_for_random_bytes()) do something useful to try to seed
> the RNG if the system is otherwise not doing IO.

I thought about it as well with my old MSDOS reflexes, but here I
doubt we can do a lot. It seems fishy to me to start to fiddle with
various drivers from within a getrandom() syscall, we could sometimes
even end up waiting even longer because one device is already locked,
and when we have access there there's not much we can do without
risking to cause some harm. On desktop systems you have a bit more
choice than on headless systems (blink keyboard leds and time the
interrupts, run some disk accesses when there's still a disk, get a
copy of the last buffer of the audio input and/or output, turn on
the microphone and/or webcam, and collect some data). Many of them
cannot always be used. We could do some more portable stuff like scan
and hash the totality of the RAM. But that's all quite bad and
unreliable and at this point it's better to tell userland "here's
what I could get for you, if you want better, do it yourself" and the
userland can then ask the user "dear user, I really need valid entropy
this time to generate your GPG key, please type frantically on this
keyboard". And it will be more reliable this way in my opinion.

My analysis of the problem precisely lies in the fact that we've
always considered that the kernel had to provide randoms for any
use case and had to cover the most difficult cases and imposed
their constraints on simplest ones. Better let the application
decide.

Willy
Alexander E. Patrakov Sept. 20, 2019, 6:15 p.m. UTC | #21
20.09.2019 22:52, Andy Lutomirski пишет:
> I think that, given existing software, we should make two or three
> changes to fix the basic problems here:
> 
> 1. Add GRND_INSECURE: at least let new applications do the right thing
> going forward.
> 
> 2. Fix what is arguably a straight up kernel bug, not even an ABI
> issue: when a user program is blocking in getrandom(..., 0), the
> kernel happily sits there doing absolutely nothing and deadlocks the
> system as a result.  This IMO isn't an ABI issue -- it's an
> implementation problem.  How about we make getrandom() (probably
> actually wait_for_random_bytes()) do something useful to try to seed
> the RNG if the system is otherwise not doing IO.
> 
> 3. Optionally, entirely in user code: Get glibc to add new *library*
> functions: getentropy_secure_blocking() and getentropy_insecure() or
> whatever they want to call them.  Deprecate getentropy().
> 
> I think #2 is critical.  Right now, suppose someone has a system that
> neets to do a secure network request (a la Red Hat's Clevis).  I have
> no idea what Clevis actually does, but it wouldn't be particularly
> crazy to do a DH exchange or sign with an EC key to ask some network
> server to help unlock a dm-crypt volume.  If the system does this at
> boot, it needs to use getrandom(..., 0), GRND_EXPLICIT, or whatever,
> because it NEEDS a secure random number.  No about of ABI fiddling
> will change this.  The kernel should *work* in this case rather than
> deadlocking.

Let me express a little bit of disagreement with the logic here.

I do agree that #2 is critical, and the Clevis use case is a perfect 
example why it is important. I doubt that it is solvable without 
trusting jitter entropy, or without provoking a dummy read on a random 
block device, just for timings, or maybe some other interaction with the 
external world - but Willy already said "it seems fishy". However, _if_ 
it is solved, then we don't need GRND_INSECURE, because solving #2 is 
equivalent to magically making secure random numbers always available.
Willy Tarreau Sept. 20, 2019, 6:16 p.m. UTC | #22
On Fri, Sep 20, 2019 at 11:09:53AM -0700, Linus Torvalds wrote:
(...)
> So:
> 
>  - GRND_INSECURE is (GRND_EXPLICIT | GRND_NONBLOCK)
> 
>    As in "I explicitly ask you not to just not ever block": urandom
> 
>  - GRND_SECURE_BLOCKING is (GRND_EXPLICIT | GRND_RANDOM)
> 
>    As in "I explicitly ask you for those secure random numbers"
> 
>  - GRND_SECURE_NONBLOCKING is (GRND_EXPLICIT | GRND_RANDOM | GRND_NONBLOCK)
> 
>    As in "I want explicitly secure random numbers, but return -EAGAIN
> if that would block".
> 
> Which are the three sane behaviors (that last one is useful for the "I
> can try to generate entropy if you don't have any" case. I'm not sure
> anybody will do it, but it definitely conceptually makes sense).
> 
> And I agree that your naming is better.
> 
> I had it as just "GRND_SECURE" for the blocking version, and
> "GRND_SECURE | GRND_NONBLOCK" for the "secure but return EAGAIN if you
> would need to block for entropy" version.
> 
> But explicitly stating the blockingness in the name makes it clearer
> to the people who just want GRND_INSECURE, and makes them realize that
> they don't want the blocking version.

I really like it this way. Explicit and full control for the application
plus reasonable backwards compatibility, it sounds pretty good.

Willy
Andy Lutomirski Sept. 20, 2019, 6:29 p.m. UTC | #23
> On Sep 20, 2019, at 11:15 AM, Alexander E. Patrakov <patrakov@gmail.com> wrote:
> 
> 20.09.2019 22:52, Andy Lutomirski пишет:
>> I think that, given existing software, we should make two or three
>> changes to fix the basic problems here:
>> 1. Add GRND_INSECURE: at least let new applications do the right thing
>> going forward.
>> 2. Fix what is arguably a straight up kernel bug, not even an ABI
>> issue: when a user program is blocking in getrandom(..., 0), the
>> kernel happily sits there doing absolutely nothing and deadlocks the
>> system as a result.  This IMO isn't an ABI issue -- it's an
>> implementation problem.  How about we make getrandom() (probably
>> actually wait_for_random_bytes()) do something useful to try to seed
>> the RNG if the system is otherwise not doing IO.
>> 3. Optionally, entirely in user code: Get glibc to add new *library*
>> functions: getentropy_secure_blocking() and getentropy_insecure() or
>> whatever they want to call them.  Deprecate getentropy().
>> I think #2 is critical.  Right now, suppose someone has a system that
>> neets to do a secure network request (a la Red Hat's Clevis).  I have
>> no idea what Clevis actually does, but it wouldn't be particularly
>> crazy to do a DH exchange or sign with an EC key to ask some network
>> server to help unlock a dm-crypt volume.  If the system does this at
>> boot, it needs to use getrandom(..., 0), GRND_EXPLICIT, or whatever,
>> because it NEEDS a secure random number.  No about of ABI fiddling
>> will change this.  The kernel should *work* in this case rather than
>> deadlocking.
> 
> Let me express a little bit of disagreement with the logic here.
> 
> I do agree that #2 is critical, and the Clevis use case is a perfect example why it is important. I doubt that it is solvable without trusting jitter entropy, or without provoking a dummy read on a random block device, just for timings, or maybe some other interaction with the external world - but Willy already said "it seems fishy". However, _if_ it is solved, then we don't need GRND_INSECURE, because solving #2 is equivalent to magically making secure random numbers always available.
> 
> 

I beg to differ. There is a big difference between “do your best *right now*” and “give me a real secure result in a vaguely timely manner”.

For example, the former is useful for ASLR or hash table randomization. The latter is not.
Andy Lutomirski Sept. 20, 2019, 7:12 p.m. UTC | #24
> On Sep 20, 2019, at 11:10 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Fri, Sep 20, 2019 at 10:52 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> IMO, from the beginning, we should have done this:
>>
>> GRND_INSECURE: insecure.  always works.
>>
>> GRND_SECURE_BLOCKING: does exactly what it says.
>>
>> 0: -EINVAL.
>
> Violently agreed. And that's kind of what the GRND_EXPLICIT is really
> aiming for.
>
> However, it's worth noting that nobody should ever use GRND_EXPLICIT
> directly. That's just the name for the bit. The actual users would use
> GRND_INSECURE or GRND_SECURE.
>
> And yes, maybe it's worth making the name be GRND_SECURE_BLOCKING just
> to make people see what the big deal is.
>
> In the meantime, we need that new bit just to be able to create the
> new semantics eventually. With a warning to nudge people in the right
> direction.
>
> We may never be able to return -EINVAL, but we can add the pr_notice()
> to discourage people from using it.
>

The problem is that new programs will have to try the new flag value
and, if it returns -EINVAL, fall back to 0.  This isn't so great.

> And yes, we'll have to block - at least for a time - to get some
> entropy. But at some point we either start making entropy up, or we
> say "0 means jitter-entropy for ten seconds".
>
> That will _work_, but it will also make the security-people nervous,
> which is just one more hint that they should move to
> GRND_SECURE[_BLOCKING].

Wait, are you suggesting that 0 means invoke jitter-entropy or
whatever and GRND_SECURE_BLOCKING means not wait forever and deadlock?
 That's no good -- people will want to continue using 0 because the
behavior is better. My point here is that asking for secure random
numbers isn’t some legacy oddity — it’s genuinely necessary. The
kernel should do whatever it needs to in order to make it work.  We
really don’t want a situation where 0 means get me secure random
numbers reliably but spam the logs and GRND_SECURE_BLOCKING means
don’t spam the logs but risk deadlocking. This will encourage people
to pass 0 to get the improved behavior.

> So GRND_EXPLICIT is a bit that basically means "I am explicit about
> what behavior I want". But part of that is that you need to _state_
> the behavior too.
>
> So:
>
> - GRND_INSECURE is (GRND_EXPLICIT | GRND_NONBLOCK)
>
>   As in "I explicitly ask you not to just not ever block": urandom

IMO this is confusing.  The GRND_RANDOM flag was IMO a mistake and
should just be retired.  Let's enumerate useful cases and then give
them sane values.

>
> - GRND_SECURE_BLOCKING is (GRND_EXPLICIT | GRND_RANDOM)
>
>   As in "I explicitly ask you for those secure random numbers"
>
> - GRND_SECURE_NONBLOCKING is (GRND_EXPLICIT | GRND_RANDOM | GRND_NONBLOCK)
>
>   As in "I want explicitly secure random numbers, but return -EAGAIN
> if that would block".
>
> Which are the three sane behaviors (that last one is useful for the "I
> can try to generate entropy if you don't have any" case. I'm not sure
> anybody will do it, but it definitely conceptually makes sense).
>
> And I agree that your naming is better.

I think this is the complete list of "good" behaviors for new programs:

"insecure": always works, never warns.

"secure, blocking": always returns *eventually* with secure output,
i.e., does something to avoid deadlocks

"secure, nonblocking" returns secure output immediately or returns -EAGAIN.

And the only real question is how to map existing users to these
semantics.  I see two sensible choices:

1. 0 means "secure, blocking". I think this is not what we'd do if we
could go back in time and chage the ABI from day 1, but I think it's
actually good enough.  As long as this mode won't deadlock, it's not
*that* bad if programs are using it when they wanted "insecure".

2. 0 means "secure, blocking, but warn".  Some new value means
"secure, blocking, don't warn".  The problem is that new applications
will have to fall back to 0 to continue supporting old kernels.

I briefly thought that maybe GRND_RANDOM would be a reasonable choice
for "secure, blocking, don't warn", but the effect on new programs on
old kernels will be unfortunate.

I'm willing to go along with #2 if you like it better than #1, and
I'll update my patches accordingly, but I prefer #1.

I do think we should make all the ABI changes that we want to make all
in one release.  Let's not make programs think about their behavior on
more versions than necessary.  So I'd like to get rid of the current
/dev/random semantics, add "insecure" mode, and do whatever deadlock
avoidance scheme we settle on in a single release.

--Andy
Andy Lutomirski Sept. 20, 2019, 7:22 p.m. UTC | #25
On Fri, Sep 20, 2019 at 11:12 AM Willy Tarreau <w@1wt.eu> wrote:
>
> Hi Andy,
>
> On Fri, Sep 20, 2019 at 10:52:30AM -0700, Andy Lutomirski wrote:
> > 2. Fix what is arguably a straight up kernel bug, not even an ABI
> > issue: when a user program is blocking in getrandom(..., 0), the
> > kernel happily sits there doing absolutely nothing and deadlocks the
> > system as a result.  This IMO isn't an ABI issue -- it's an
> > implementation problem.  How about we make getrandom() (probably
> > actually wait_for_random_bytes()) do something useful to try to seed
> > the RNG if the system is otherwise not doing IO.
>
> I thought about it as well with my old MSDOS reflexes, but here I
> doubt we can do a lot. It seems fishy to me to start to fiddle with
> various drivers from within a getrandom() syscall, we could sometimes
> even end up waiting even longer because one device is already locked,
> and when we have access there there's not much we can do without
> risking to cause some harm. On desktop systems you have a bit more
> choice than on headless systems (blink keyboard leds and time the
> interrupts, run some disk accesses when there's still a disk, get a
> copy of the last buffer of the audio input and/or output, turn on
> the microphone and/or webcam, and collect some data). Many of them
> cannot always be used. We could do some more portable stuff like scan
> and hash the totality of the RAM. But that's all quite bad and
> unreliable and at this point it's better to tell userland "here's
> what I could get for you, if you want better, do it yourself" and the
> userland can then ask the user "dear user, I really need valid entropy
> this time to generate your GPG key, please type frantically on this
> keyboard". And it will be more reliable this way in my opinion.

Perhaps userland could register a helper that takes over and does
something better?  But I think the kernel really should do something
vaguely reasonable all by itself.  If nothing else, we want the ext4
patch that provoked this whole discussion to be applied, which means
that we need to unbreak userspace somehow, and returning garbage it to
is not a good choice.

Here are some possible approaches that come to mind:

int count;
while (crng isn't inited) {
  msleep(1);
}

and modify add_timer_randomness() to at least credit a tiny bit to
crng_init_cnt.

Or we do something like intentionally triggering readahead on some
offset on the root block device.  We should definitely not trigger
*blocking* IO.

Also, I wonder if the real problem preventing the RNG from staring up
is that the crng_init_cnt threshold is too high.  We have a rather
baroque accounting system, and it seems like we can accumulate and
credit entropy for a very long time indeed without actually
considering ourselves done.

--Andy
Willy Tarreau Sept. 20, 2019, 7:37 p.m. UTC | #26
On Fri, Sep 20, 2019 at 12:22:17PM -0700, Andy Lutomirski wrote:
> Perhaps userland could register a helper that takes over and does
> something better?

If userland sees the failure it can do whatever the developer/distro
packager thought suitable for the system facing this condition.

> But I think the kernel really should do something
> vaguely reasonable all by itself.

Definitely, that's what Linus' proposal was doing. Sleeping for some time
is what I call "vaguely reasonable".

> If nothing else, we want the ext4
> patch that provoked this whole discussion to be applied,

Oh absolutely!

> which means
> that we need to unbreak userspace somehow, and returning garbage it to
> is not a good choice.

It depends how it's used. I'd claim that we certainly use randoms for
other things (such as ASLR/hashtables) *before* using them to generate
long lived keys thus we can have a bit more time to get some more
entropy before reaching the point of producing these keys.

> Here are some possible approaches that come to mind:
> 
> int count;
> while (crng isn't inited) {
>   msleep(1);
> }
> 
> and modify add_timer_randomness() to at least credit a tiny bit to
> crng_init_cnt.

Without a timeout it's sure we'll still face some situations where
it blocks forever, which is the current problem.

> Or we do something like intentionally triggering readahead on some
> offset on the root block device.

You don't necessarily have such a device, especially when you're
in an initramfs. It's precisely where userland can be smarter. When
the caller is sfdisk for example, it does have more chances to try
to perform I/O than when it's a tiny http server starting to present
a configuration page.

> We should definitely not trigger *blocking* IO.

I think I agree.

> Also, I wonder if the real problem preventing the RNG from staring up
> is that the crng_init_cnt threshold is too high.  We have a rather
> baroque accounting system, and it seems like we can accumulate and
> credit entropy for a very long time indeed without actually
> considering ourselves done.

I have no opinion on this, lacking the skills to evaluate the situation.
What I can say for sure is that I've faced the non-booting issue quite a
number of times on headless systems, and conversely in the 2.4 era, my
front reverse-proxy by then had the same SSH key as 89 other machines on
the net. So there's surely a sweet spot to find between those two extremes.
I tend to think that waiting *a little bit* for the *first* random is
acceptable, even 10-15s, by the time the user starts to think about
pressing the reset button the system might finish to boot. Hashing some
RAM locations and the RTC when present can also help a little bit. If
at least my machine by then had combined the RTC's date and time with
the hash, chances for a key collision would have gone down to one over
many thousands.

Willy
Linus Torvalds Sept. 20, 2019, 7:51 p.m. UTC | #27
On Fri, Sep 20, 2019 at 12:12 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> The problem is that new programs will have to try the new flag value
> and, if it returns -EINVAL, fall back to 0.  This isn't so great.

Don't be silly.

Of course they will do that, but so what? With a new kernel, they'll
get the behavior they expect. And with an old kernel, they'll get the
behavior they expect.

They'd never fall back to to "0 means something I didn't want",
exactly because we'd make this new flag be the first change.

> Wait, are you suggesting that 0 means invoke jitter-entropy or
> whatever and GRND_SECURE_BLOCKING means not wait forever and deadlock?
>  That's no good -- people will want to continue using 0 because the
> behavior is better.

I assume that "not wait forever" was meant to be "wait forever".

So the one thing we have to do is break the "0 waits forever".  I
guarantee that will happen. I will override Ted if he just NAk's it,
because we simply _cannot_ continue with it.

So we absolutely _will_ come up with some way 0 ends the wait. Whether
it's _just_ a timeout, or whether it's jitter-entropy or whatever, it
will happen.

But we'll also make getrandom(0) do the annoying warning, because it's
just ambiguous. And I suspect you'll find that a lot of security
people don't really like jitter-entropy, at least not in whatever
cut-down format we'll likely have to use in the kernel.

And we'll also have to make getrandom(0) be really _timely_. Security
people would likely rather wait for minutes before they are happy with
it. But because it's a boot constraint as things are now, it will not
just be jitter-entropy, it will be _accelerated_ jitter-entropy in 15
seconds or whatever, and since it can't use up all of CPU time, it's
realistically more like "15 second timeout, but less of actual CPU
time for jitter".

We can try to be clever with a background thread and a lot of
yielding(), so that if the CPU is actually idle we'll get most of that
15 seconds for whatever jitter, but end result is that it's still
accelerated.

Do I believe we can do a good job in that kind of timeframe?
Absolutely. The whole point should be that it's still "good enough",
and as has been pointed out, that same jitter entropy that people are
worried about is just done in user space right now instead.

But do I believe that security people would prefer a non-accelerated
GRND_SECURE_BLOCKING? Yes I do. That doesn't mean that
GRND_SECURE_BLOCKING shouldn't use jitter entropy too, but it doesn't
need the same kind of "let's hurry this up because it might be during
early boot and block things".

That said, if we can all convince everybody (hah!) that jitter entropy
in the kernel would be sufficient, then we can make the whole point
entirely moot, and just say "we'll just change crng_wait() to do
jitter entropy instead and be done with it. Then any getrandom() user
will just basically wait for a (very limited) time and the system will
be happy.

If that is the case we wouldn't need new flags at all. But I don't
think you can make everybody agree to that, which is why I suspect
we'll need the new flag, and I'll just take the heat for saying "0 is
now off limits, because it does this thing that a lot of people
dislike".

> IMO this is confusing.  The GRND_RANDOM flag was IMO a mistake and
> should just be retired.  Let's enumerate useful cases and then give
> them sane values.


That's basically what I'm doing. I enumerate the new values.

But the enumerations have hidden meaning, because the actual bits do
matter. The GRND_EXPLICIT bit isn't supposed to be used by any user,
but it has the value it has because it makes old kernels return
-EINVAL.

But if people hate the bit names, we can just do an enum and be done with it:

   enum grnd_flags {
      GRND_NONBLOCK = 1,
      GRND_RANDOM, // Don't use!
      GRND_RANDOM_NONBLOCK, // Don't use
      GRND_UNUSED,
      GRND_INSECURE,
      GRND_SECURE_BLOCKING,
      GRND_SECURE_NONBLOCKING,
  };

but the values now have a _hidden_ pattern (because we currently have
that "| GRND_NONBLOCK" pattern that I want to make sure still
continues to work, rather than give unexpected behavior in case
somebody continues to use it).

So the _only_ difference between the above and what I suggested is
that I made the bit pattern explicit rather than hidden in the value.

> And the only real question is how to map existing users to these
> semantics.  I see two sensible choices:
>
> 1. 0 means "secure, blocking". I think this is not what we'd do if we
> could go back in time and chage the ABI from day 1, but I think it's
> actually good enough.  As long as this mode won't deadlock, it's not
> *that* bad if programs are using it when they wanted "insecure".

It's exactly that "as long as it won't deadlock" that is our current problem.

It *does* deadlock.

So it can't mean "blocking" in any long-term meaning.

It can mean "blocks for up to 15 seconds" or something like that. I'd
honestly prefer a smaller number, but I think 15 seconds is an
acceptable "your user space is buggy, but we won't make you think the
machine hung".

> 2. 0 means "secure, blocking, but warn".  Some new value means
> "secure, blocking, don't warn".  The problem is that new applications
> will have to fall back to 0 to continue supporting old kernels.

The same comment about blocking.

Maybe you came in in the middle, and didn't see the whole "reduced IO
patterns means that boot blocks forever" part of the original problem.

THAT is why 0 will absolutely change behaviour.

                Linus
Andy Lutomirski Sept. 20, 2019, 7:52 p.m. UTC | #28
> On Sep 20, 2019, at 12:37 PM, Willy Tarreau <w@1wt.eu> wrote:
> 
> On Fri, Sep 20, 2019 at 12:22:17PM -0700, Andy Lutomirski wrote:
>> Perhaps userland could register a helper that takes over and does
>> something better?
> 
> If userland sees the failure it can do whatever the developer/distro
> packager thought suitable for the system facing this condition.
> 
>> But I think the kernel really should do something
>> vaguely reasonable all by itself.
> 
> Definitely, that's what Linus' proposal was doing. Sleeping for some time
> is what I call "vaguely reasonable".

I don’t buy it. We have existing programs that can deadlock on boot. Just throwing -EAGAIN at them in a syscall that didn’t previously block does not strike me as reasonable.

> 
>> If nothing else, we want the ext4
>> patch that provoked this whole discussion to be applied,
> 
> Oh absolutely!
> 
>> which means
>> that we need to unbreak userspace somehow, and returning garbage it to
>> is not a good choice.
> 
> It depends how it's used. I'd claim that we certainly use randoms for
> other things (such as ASLR/hashtables) *before* using them to generate
> long lived keys thus we can have a bit more time to get some more
> entropy before reaching the point of producing these keys.

The problem is that we don’t know what userspace is doing with the output from getrandom(..., 0), so I think we have to be conservative. New kernels need to work with old user code. It’s okay if they’re slower to boot than they could be.

> 
>> Here are some possible approaches that come to mind:
>> 
>> int count;
>> while (crng isn't inited) {
>>  msleep(1);
>> }
>> 
>> and modify add_timer_randomness() to at least credit a tiny bit to
>> crng_init_cnt.
> 
> Without a timeout it's sure we'll still face some situations where
> it blocks forever, which is the current problem.

The point is that we keep the timer running by looping like this, which should cause add_timer_randomness() to get called continuously, which should prevent the deadlock.  I assume the deadlock is because we go into nohz-idle and we sit there with nothing happening at all.

> 
>> Or we do something like intentionally triggering readahead on some
>> offset on the root block device.
> 
> You don't necessarily have such a device, especially when you're
> in an initramfs. It's precisely where userland can be smarter. When
> the caller is sfdisk for example, it does have more chances to try
> to perform I/O than when it's a tiny http server starting to present
> a configuration page.

What I mean is: allow user code to register a usermode helper that helps get entropy. Or just convince distros to bundle some useful daemon that starts at early boot and lives in the initramfs.
Linus Torvalds Sept. 20, 2019, 8:02 p.m. UTC | #29
On Fri, Sep 20, 2019 at 12:22 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> Here are some possible approaches that come to mind:
>
> int count;
> while (crng isn't inited) {
>   msleep(1);
> }
>
> and modify add_timer_randomness() to at least credit a tiny bit to
> crng_init_cnt.

I'd love that, but we don't actually call add_timer_randomness() for timers.

Yeah, the name is misleading.

What the "timer" in add_timer_randomness() means is that we look at
the timing between calls. And we may actually have (long ago) called
it for timer interrupts. But we don't any more.

The only actual users of add_timer_randomness() is
add_input_randomness() and add_disk_randomness(). And it turns out
that even disk IO doesn't really call add_disk_randomness(), so the
only _real_ user is that keyboard input thing.

Which means that unless you sit at the machine and type things in,
add_timer_randomness() _never_ gets called.

No, the real source of entropy right now is
add_interrupt_randomness(), which is called for all device interrupts.

But note the "device interrupts" part. Not the timer interrupt. That's
special, and has its own low-level architecture rules. So only the
normal IO interrupts (like disk/network/etc).

So timers right now do not add _anything_ to the randomness pool. Not
noise, not entropy.

But yes, what you can do is a jitter entropy thing, which basically
does what you suggest, except instead of "msleep(1)" it does something
like

   while (crng isn't inited) {
       sched_yield();
       do_a_round_of_memory_accesses_etc();
       add_cycle_counter_entropy();
   }

and with a lot of handwaving you'll convince a certain amount of
people that yes, the timing of the above is unpredictable enough that
the entropy you add is real.

             Linus
Alexander E. Patrakov Sept. 20, 2019, 8:11 p.m. UTC | #30
21.09.2019 00:51, Linus Torvalds пишет:

> And we'll also have to make getrandom(0) be really _timely_. Security
> people would likely rather wait for minutes before they are happy with
> it. But because it's a boot constraint as things are now, it will not
> just be jitter-entropy, it will be _accelerated_ jitter-entropy in 15
> seconds or whatever, and since it can't use up all of CPU time, it's
> realistically more like "15 second timeout, but less of actual CPU
> time for jitter".

I don't think that "accelerated jitter" makes sense. The jitterentropy 
hwrng that I sent earlier fills the entropy buffer in less than 2 
seconds, even with quality=4, so there is no need to accelerate it even 
more.

> That said, if we can all convince everybody (hah!) that jitter entropy
> in the kernel would be sufficient, then we can make the whole point
> entirely moot, and just say "we'll just change crng_wait() to do
> jitter entropy instead and be done with it. Then any getrandom() user
> will just basically wait for a (very limited) time and the system will
> be happy.
> 
> If that is the case we wouldn't need new flags at all. But I don't
> think you can make everybody agree to that, which is why I suspect
> we'll need the new flag, and I'll just take the heat for saying "0 is
> now off limits, because it does this thing that a lot of people
> dislike".

I 100% agree with that.
Matthew Garrett Sept. 20, 2019, 8:17 p.m. UTC | #31
On Fri, Sep 20, 2019 at 12:51:12PM -0700, Linus Torvalds wrote:

> So we absolutely _will_ come up with some way 0 ends the wait. Whether
> it's _just_ a timeout, or whether it's jitter-entropy or whatever, it
> will happen.

FWIW, Zircon uses the jitter entropy generator to seed the CRNG and 
documented their findings in 
https://fuchsia.dev/fuchsia-src/zircon/jitterentropy/config-basic .
Andy Lutomirski Sept. 20, 2019, 8:51 p.m. UTC | #32
On Fri, Sep 20, 2019 at 12:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> > And the only real question is how to map existing users to these
> > semantics.  I see two sensible choices:
> >
> > 1. 0 means "secure, blocking". I think this is not what we'd do if we
> > could go back in time and chage the ABI from day 1, but I think it's
> > actually good enough.  As long as this mode won't deadlock, it's not
> > *that* bad if programs are using it when they wanted "insecure".
>
> It's exactly that "as long as it won't deadlock" that is our current problem.
>
> It *does* deadlock.
>
> So it can't mean "blocking" in any long-term meaning.
>
> It can mean "blocks for up to 15 seconds" or something like that. I'd
> honestly prefer a smaller number, but I think 15 seconds is an
> acceptable "your user space is buggy, but we won't make you think the
> machine hung".

To be clear, when I say "blocking", I mean "blocks until we're ready,
but we make sure we're ready in a moderately timely manner".

Rather than answering everything point by point, here's a updated
mini-proposal and some thoughts.  There are two families of security
people that I think we care about.  One is the FIPS or CC or PCI
crowd, and they might, quite reasonably, demand actual hardware RNGs.
We should make the hwrng API stop sucking and they should be happy.
(This means expose an hwrng device node per physical device, IMO.)
The other is the one who wants getrandom(), etc to be convincingly
secure and is willing to do some actual analysis.  And I think we can
make them quite happy like this:

In the kernel, we have two types of requests for random numbers: a
request for "secure" bytes and a request for "insecure" bytes.
Requests for "secure" bytes can block or return -EAGAIN.  Requests for
"insecure" bytes succeed without waiting.  In addition, we have a
jitter entropy mechanism (maybe the one mjg59 referenced, maybe
Alexander's -- doesn't really matter) and we *guarantee* that jitter
entropy, by itself, is enough to get the "secure" generator working
after, say, 5s of effort.  By this, I mean that, on an idle system, it
finishes in 5s and, on a fully loaded system, it's allowed to take a
little while longer but not too much longer.

In other words, I want GRND_SECURE_BLOCKING and /dev/random reads to
genuinely always work and to genuinely never take much longer than 5s.
I don't want a special case where they fail.

The exposed user APIs are, subject to bikeshedding that can happen
later over the actual values, etc:

GRND_SECURE_BLOCKING: returns "secure" output and blocks until it's
ready.  This never fails, but it also never blocks forever.

GRND_SECURE_NONBLOCKING: same but returns -EAGAIN instead of blocking.

GRND_INSECURE: returns "insecure" output immediately.  I think we do
need this -- the "secure" mode may take a little while at early boot,
and libraries that initialize themselves with some randomness really
do want a way to get some numbers without any delay whatsoever.

0: either the same as GRND_SECURE_BLOCKING plus a warning or the
"accelerated" version.  The "accelerated" version means wait up to 2s
for secure numbers and, if there still aren't any, fall back to
"insecure".

GRND_RANDOM: either the same as 0 or the same as GRND_SECURE_BLOCKING
but with a warning.  I don't particularly care either way.

I'm okay with a well-defined semantic like I proposed for an
accelerated mode.  I don't really want to try to define what a
secure-but-not-as-secure mode means as a separate complication that
the underlying RNG needs to support forever.  I don't think the
security folks would like that either.

How does this sound?
Linus Torvalds Sept. 20, 2019, 10:44 p.m. UTC | #33
On Fri, Sep 20, 2019 at 1:51 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> To be clear, when I say "blocking", I mean "blocks until we're ready,
> but we make sure we're ready in a moderately timely manner".

.. an I want a pony.

The problem is that you start from an assumption that we simply can't
seem to do.

> In other words, I want GRND_SECURE_BLOCKING and /dev/random reads to
> genuinely always work and to genuinely never take much longer than 5s.
> I don't want a special case where they fail.

Honestly, if that's the case and we _had_ such a methoc of
initializing the rng, then I suspect we could just ignore the flags
entirely, with the possible exception of GRND_NONBLOCK. And even that
is "possible exception", because once your worst-case is a one-time
delay of 5s at boot time thing, you might as well consider it
nonblocking in general.

Yes, there are some in-kernel users that really can't afford to do
even that 5s delay (not just may they be atomic, but more likely it's
just that we don't want to delay _everything_ by 5s), but they don't
use the getrandom() system call anyway.

> The exposed user APIs are, subject to bikeshedding that can happen
> later over the actual values, etc:

So the thing is, you start from the impossible assumption, and _if_
you hold that assumption then we might as well just keep the existing
"zero means blocking", because nobody mind.

I'd love to say "yes, we can guarantee good enough entropy for
everybody in 5s and we don't even need to warn about it, because
everybody will be comfortable with the state of our entropy at that
point".

It sounds like a _lovely_ model.

But honestly, it simply sounds unlikely.

Now, there are different kinds of unlikely.

In particular, if you actually have a CPU cycle counter that actually
runs at least on the same order of magnitude as the CPU frequency -
then I believe in the jitter entropy more than in many other cases.

Sadly, many platforms don't have that kind of cycle counter.

I've also not seen a hugely believable "yes, the jitter entropy is
real" paper. Alexander points to the existing jitterentropy crypto
code, and claims it can fill all our entropy needs in two seconds, but
there are big caveats:

 (a) that code uses get_random_entropy(), which on a PC is that nice
fast TSC that we want. On other platforms (or on really old PC's - we
technically support CPU's still that don't have rdtsc)? It might be
zero. Every time.

 (b) How was it tested? There are lots of randomness tests, but most
of them can be fooled with a simple counter through a cryptographic
hash - which you basically need to do anyway on whatever entropy
source you have in order to "whiten" it. It's simply _really_ hard to
decide on entropy.

So it's really easy to make the randomness of some input look really
good, without any real idea how good it truly is. And maybe it really
is very very good on one particular machine, and then on another one
(with either a simpler in-order core or a lower-frequency timestamp
counter) it might be horrendously bad, and you'll never know,

So I'd love to believe in your simple model. Really. I just don't see
how to get there reliably.

Matthew Garrettpointed to one analysis on jitterentropy, and that one
wasn't all that optimistic.

I do think jitterentropy would likely be good enough in practice - at
least on PC's with a TSC - for the fairly small window at boot and
getrandom(0). As I mentioned, I don't think it will make anybody
_happy_, but it might be one of those things where it's a compromise
that at least works for people, with the key generation people who are
really unhappy with it having a new option for their case.

And maybe Alexander can convince people that when you run the
jitterentropy code a hundred billion times, the end result (not the
random stream from it, but the jitter bits themselves - but I'm not
even sure how to boil it down) - really is random.

             Linus
Andy Lutomirski Sept. 20, 2019, 11:30 p.m. UTC | #34
On Fri, Sep 20, 2019 at 3:44 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Sep 20, 2019 at 1:51 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > To be clear, when I say "blocking", I mean "blocks until we're ready,
> > but we make sure we're ready in a moderately timely manner".
>
> .. an I want a pony.
>
> The problem is that you start from an assumption that we simply can't
> seem to do.

Eh, fair enough, I wasn't thinking about platforms without fast clocks.

I'm very nervous about allowing getrandom(..., 0) to fail with
-EAGAIN, though.  On a very, very brief search, I didn't find any
programs that would incorrectly assume it worked, but I can easily
imagine programs crashing, and that might be bad, too.  At the end of
the day, most user programmers who call getrandom() really did notice
that we flubbed the ABI, and either they were too lazy to fall back to
/dev/urandom, or they didn't want to for some reason, or they
genuinely want the blocking behavior.  And people who work with little
embedded systems without good clocks that basically can't generate
random numbers already know this, and they have little scripts to help
out.

So I think that just improving the
getrandom()-is-blocking-on-x86-and-arm behavior, adding GRND_INSECURE
and GRND_SECURE_BLOCKING, and adding the warning if 0 is passed is
good enough.  I suppose we could also have separate
GRND_SECURE_BLOCKING and GRND_SECURE_BLOCK_FOREVER.  We could also say
that, if you want to block forever, you should poll() on /dev/random
(with my patches applied, where this actually does what users would
want).

--Andy
Willy Tarreau Sept. 21, 2019, 3:05 a.m. UTC | #35
On Fri, Sep 20, 2019 at 04:30:20PM -0700, Andy Lutomirski wrote:
> So I think that just improving the
> getrandom()-is-blocking-on-x86-and-arm behavior, adding GRND_INSECURE
> and GRND_SECURE_BLOCKING, and adding the warning if 0 is passed is
> good enough.

I think so as well. Anyway, keep in mind that *with a sane API*,
userland can improve very quickly (faster than kernel deployments in
field). But userland developers need reliable and testable support for
features. If it's enough to do #ifndef GRND_xxx/#define GRND_xxx and
call getrandom() with these flags to detect support, it's basically 5
reliable lines of code to add to userland to make a warning disappear
and/or to allow a system that previously failed to boot to now boot. So
this gives strong incentive to userland to adopt the new API, provided
there's a way for the developer to understand what's happening (which
the warning does).

If we do it right, all we'll hear are userland developers complaining
that those stupid kernel developers have changed their API again and
really don't know what they want. That will be a good sign that the
warning flows back to them and that adoption is taking.

And if the change is small enough, maybe it could make sense to backport
it to stable versions to fix boot issues. With a testable feature it
does make sense.

Willy
Florian Weimer Sept. 21, 2019, 6:07 a.m. UTC | #36
* Linus Torvalds:

> Violently agreed. And that's kind of what the GRND_EXPLICIT is really
> aiming for.
>
> However, it's worth noting that nobody should ever use GRND_EXPLICIT
> directly. That's just the name for the bit. The actual users would use
> GRND_INSECURE or GRND_SECURE.

Should we switch glibc's getentropy to GRND_EXPLICIT?  Or something
else?

I don't think we want to print a kernel warning for this function.

Thanks,
Florian
David Laight Sept. 23, 2019, 11:55 a.m. UTC | #37
From: Linus Torvalds
> Sent: 19 September 2019 21:04
...
> Note small detail above: I changed the ^= to a +=. Addition tends to
> be better (due to carry between bits) when there might be bit
> commonalities.  Particularly with something like a cycle count where
> two xors can mostly cancel out previous bits rather than move bits
> around in the word.

There is code in one on the kernel RNG that XORs together the output
of 3 LFSR (CRC) generators.
I think it is used for 'low quality' randomness and reseeded from the main RNG.
Using XOR makes the entire generator 'linear' and thus trivially reversible.
With a relatively small number of consecutive outputs you can determine the state
of all 3 LFSR.
Merge the results with addition and the process is immensely harder.

I've also wondered whether the RC4 generator is a useful entropy store?
It has a lot of state and you can fairly easily feed in values that might (or
might not) contain any randomness without losing any stored entropy.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Andy Lutomirski Sept. 23, 2019, 6:33 p.m. UTC | #38
On Fri, Sep 20, 2019 at 11:07 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Linus Torvalds:
>
> > Violently agreed. And that's kind of what the GRND_EXPLICIT is really
> > aiming for.
> >
> > However, it's worth noting that nobody should ever use GRND_EXPLICIT
> > directly. That's just the name for the bit. The actual users would use
> > GRND_INSECURE or GRND_SECURE.
>
> Should we switch glibc's getentropy to GRND_EXPLICIT?  Or something
> else?
>
> I don't think we want to print a kernel warning for this function.
>

Contemplating this question, I think the answer is that we should just
not introduce GRND_EXPLICIT or anything like it.  glibc is going to
have to do *something*, and getentropy() is unlikely to just go away.
The explicitly documented semantics are that it blocks if the RNG
isn't seeded.

Similarly, FreeBSD has getrandom():

https://www.freebsd.org/cgi/man.cgi?query=getrandom&sektion=2&manpath=freebsd-release-ports

and if we make getrandom(..., 0) warn, then we have a situation where
the *correct* (if regrettable) way to use the function on FreeBSD
causes a warning on Linux.

Let's just add GRND_INSECURE, make the blocking mode work better, and,
if we're feeling a bit more adventurous, add GRND_SECURE_BLOCKING as a
better replacement for 0, convince FreeBSD to add it too, and then
worry about deprecating 0 once we at least get some agreement from the
FreeBSD camp.
Ahmed S. Darwish Sept. 26, 2019, 9:11 p.m. UTC | #39
On Mon, Sep 23, 2019 at 11:33:21AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 20, 2019 at 11:07 PM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * Linus Torvalds:
> >
> > > Violently agreed. And that's kind of what the GRND_EXPLICIT is really
> > > aiming for.
> > >
> > > However, it's worth noting that nobody should ever use GRND_EXPLICIT
> > > directly. That's just the name for the bit. The actual users would use
> > > GRND_INSECURE or GRND_SECURE.
> >
> > Should we switch glibc's getentropy to GRND_EXPLICIT?  Or something
> > else?
> >
> > I don't think we want to print a kernel warning for this function.
> >
> 
> Contemplating this question, I think the answer is that we should just
> not introduce GRND_EXPLICIT or anything like it.  glibc is going to
> have to do *something*, and getentropy() is unlikely to just go away.
> The explicitly documented semantics are that it blocks if the RNG
> isn't seeded.
> 
> Similarly, FreeBSD has getrandom():
> 
> https://www.freebsd.org/cgi/man.cgi?query=getrandom&sektion=2&manpath=freebsd-release-ports
> 
> and if we make getrandom(..., 0) warn, then we have a situation where
> the *correct* (if regrettable) way to use the function on FreeBSD
> causes a warning on Linux.
> 
> Let's just add GRND_INSECURE, make the blocking mode work better, and,
> if we're feeling a bit more adventurous, add GRND_SECURE_BLOCKING as a
> better replacement for 0, ...

This is what's now done in the just-submitted V5, except the "make the
blocking mode work better" part:

    https://lkml.kernel.org/r/20190926204217.GA1366@pc

It's a very conservative patch so far IMHO (minus the loud warning).

Thanks,
--
Ahmed Darwish
diff mbox series

Patch

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index df0fc997dc3e..772765c36fc3 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -535,8 +535,6 @@  config ADI
 	  and SSM (Silicon Secured Memory).  Intended consumers of this
 	  driver include crash and makedumpfile.
 
-endmenu
-
 config RANDOM_TRUST_CPU
 	bool "Trust the CPU manufacturer to initialize Linux's CRNG"
 	depends on X86 || S390 || PPC
@@ -559,4 +557,60 @@  config RANDOM_TRUST_BOOTLOADER
 	device randomness. Say Y here to assume the entropy provided by the
 	booloader is trustworthy so it will be added to the kernel's entropy
 	pool. Otherwise, say N here so it will be regarded as device input that
-	only mixes the entropy pool.
\ No newline at end of file
+	only mixes the entropy pool.
+
+config GETRANDOM_WAIT_THRESHOLD_SEC
+	int
+	default 30
+	help
+	  The getrandom(2) system call, when asking for entropy from the
+	  urandom source, blocks until the kernel's Cryptographic Random
+	  Number Generator (CRNG) gets initialized. This configuration
+	  option sets the maximum wait time, in seconds, for a process
+	  to get blocked on such a system call before the kernel issues
+	  a loud warning. Rationale follows:
+
+	  When the getrandom(2) system call was created, it came with
+	  the clear warning: "Any userspace program which uses this new
+	  functionality must take care to assure that if it is used
+	  during the boot process, that it will not cause the init
+	  scripts or other portions of the system startup to hang
+	  indefinitely.
+
+	  Unfortunately, due to multiple factors, including not having
+	  this warning written in a scary-enough language in the
+	  manpages, and due to glibc since v2.25 implementing a BSD-like
+	  getentropy(3) in terms of getrandom(2), modern user-space is
+	  calling getrandom(2) in the boot path everywhere.
+
+	  Embedded Linux systems were first hit by this, and reports of
+	  embedded system "getting stuck at boot" began to be
+	  common. Over time, the issue began to even creep into consumer
+	  level x86 laptops: mainstream distributions, like Debian
+	  Buster, began to recommend installing haveged as a workaround,
+	  just to let the system boot.
+
+	  Filesystem optimizations in EXT4 and XFS exagerated the
+	  problem, due to aggressive batching of IO requests, and thus
+	  minimizing sources of entropy at boot. This led to large
+	  delays until the kernel's CRNG got initialized.
+
+	  System integrators and distribution builderss are not
+	  encouraged to considerably increase this value: during system
+	  boot, you either have entropy, or you don't. And if you didn't
+	  have entropy, it will stay like this forever, because if you
+	  had, you wouldn't have blocked in the first place. It's an
+	  atomic "either/or" situation, with no middle ground. Please
+	  think twice.
+
+	  Ideally, systems would be configured with hardware random
+	  number generators, and/or configured to trust the CPU-provided
+	  RNG's (CONFIG_RANDOM_TRUST_CPU) or boot-loader provided ones
+	  (CONFIG_RANDOM_TRUST_BOOTLOADER).  In addition, userspace
+	  should generate cryptographic keys only as late as possible,
+	  when they are needed, instead of during early boot.  For
+	  non-cryptographic use cases, such as dictionary seeds or MIT
+	  Magic Cookies, the getrandom2(GRND2_INSECURE) system call,
+	  or even random(3), may be more appropropriate.
+
+endmenu
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 566922df4b7b..74057e496303 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -322,6 +322,7 @@ 
 #include <linux/interrupt.h>
 #include <linux/mm.h>
 #include <linux/nodemask.h>
+#include <linux/sched.h>
 #include <linux/spinlock.h>
 #include <linux/kthread.h>
 #include <linux/percpu.h>
@@ -854,12 +855,21 @@  static void invalidate_batched_entropy(void);
 static void numa_crng_init(void);
 
 static bool trust_cpu __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU);
+static int getrandom_wait_threshold __ro_after_init =
+				CONFIG_GETRANDOM_WAIT_THRESHOLD_SEC;
+
 static int __init parse_trust_cpu(char *arg)
 {
 	return kstrtobool(arg, &trust_cpu);
 }
 early_param("random.trust_cpu", parse_trust_cpu);
 
+static int __init parse_getrandom_wait_threshold(char *arg)
+{
+	return kstrtoint(arg, 0, &getrandom_wait_threshold);
+}
+early_param("random.getrandom_wait_threshold", parse_getrandom_wait_threshold);
+
 static void crng_initialize(struct crng_state *crng)
 {
 	int		i;
@@ -1960,7 +1970,7 @@  random_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 }
 
 static ssize_t
-urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
+_urandom_read(char __user *buf, size_t nbytes, bool warn_on_noninited_crng)
 {
 	unsigned long flags;
 	static int maxwarn = 10;
@@ -1968,7 +1978,7 @@  urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 
 	if (!crng_ready() && maxwarn > 0) {
 		maxwarn--;
-		if (__ratelimit(&urandom_warning))
+		if (warn_on_noninited_crng && __ratelimit(&urandom_warning))
 			printk(KERN_NOTICE "random: %s: uninitialized "
 			       "urandom read (%zd bytes read)\n",
 			       current->comm, nbytes);
@@ -1982,6 +1992,12 @@  urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 	return ret;
 }
 
+static ssize_t
+urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
+{
+	return _urandom_read(buf, nbytes, true);
+}
+
 static __poll_t
 random_poll(struct file *file, poll_table * wait)
 {
@@ -2118,11 +2134,41 @@  const struct file_operations urandom_fops = {
 	.llseek = noop_llseek,
 };
 
-SYSCALL_DEFINE3(getrandom, char __user *, buf, size_t, count,
-		unsigned int, flags)
+static int getrandom_wait(char __user *buf, size_t count,
+			  bool warn_on_large_wait)
 {
+	unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
 	int ret;
 
+	if (warn_on_large_wait && (getrandom_wait_threshold > 0))
+		timeout = HZ * getrandom_wait_threshold;
+
+	do {
+		ret = wait_event_interruptible_timeout(crng_init_wait,
+						       crng_ready(),
+						       timeout);
+		if (ret < 0)
+			return ret;
+
+		if (ret == 0) {
+			WARN(1, "random: %s[%d]: getrandom(%zu bytes) "
+			     "is blocked for more than %d seconds. Check "
+			     "getrandom_wait(7)\n", current->comm,
+			     task_pid_nr(current), count,
+			     getrandom_wait_threshold);
+
+			/* warn once per caller */
+			timeout = MAX_SCHEDULE_TIMEOUT;
+		}
+
+	} while (ret == 0);
+
+	return _urandom_read(buf, count, true);
+}
+
+SYSCALL_DEFINE3(getrandom, char __user *, buf, size_t, count,
+		unsigned int, flags)
+{
 	if (flags & ~(GRND_NONBLOCK|GRND_RANDOM))
 		return -EINVAL;
 
@@ -2132,14 +2178,31 @@  SYSCALL_DEFINE3(getrandom, char __user *, buf, size_t, count,
 	if (flags & GRND_RANDOM)
 		return _random_read(flags & GRND_NONBLOCK, buf, count);
 
-	if (!crng_ready()) {
-		if (flags & GRND_NONBLOCK)
+	if ((flags & GRND_NONBLOCK) && !crng_ready())
 			return -EAGAIN;
-		ret = wait_for_random_bytes();
-		if (unlikely(ret))
-			return ret;
-	}
-	return urandom_read(NULL, buf, count, NULL);
+
+	return getrandom_wait(buf, count, true);
+}
+
+SYSCALL_DEFINE3(getrandom2, char __user *, buf, size_t, count,
+		unsigned int, flags)
+{
+	if (flags & ~(GRND2_SECURE_UNBOUNDED_INITIAL_WAIT|GRND2_INSECURE))
+		return -EINVAL;
+
+	if (flags & (GRND2_SECURE_UNBOUNDED_INITIAL_WAIT|GRND2_INSECURE))
+		return -EINVAL;
+
+	if (count > INT_MAX)
+		count = INT_MAX;
+
+	if (flags & GRND2_SECURE_UNBOUNDED_INITIAL_WAIT)
+		return getrandom_wait(buf, count, false);
+
+	if (flags & GRND2_INSECURE)
+		return _urandom_read(buf, count, false);
+
+	unreachable();
 }
 
 /********************************************************************
diff --git a/include/uapi/linux/random.h b/include/uapi/linux/random.h
index 26ee91300e3e..3f09a8f6aff3 100644
--- a/include/uapi/linux/random.h
+++ b/include/uapi/linux/random.h
@@ -8,6 +8,7 @@ 
 #ifndef _UAPI_LINUX_RANDOM_H
 #define _UAPI_LINUX_RANDOM_H
 
+#include <linux/bits.h>
 #include <linux/types.h>
 #include <linux/ioctl.h>
 #include <linux/irqnr.h>
@@ -23,7 +24,7 @@ 
 /* Get the contents of the entropy pool.  (Superuser only.) */
 #define RNDGETPOOL	_IOR( 'R', 0x02, int [2] )
 
-/* 
+/*
  * Write bytes into the entropy pool and add to the entropy count.
  * (Superuser only.)
  */
@@ -50,7 +51,20 @@  struct rand_pool_info {
  * GRND_NONBLOCK	Don't block and return EAGAIN instead
  * GRND_RANDOM		Use the /dev/random pool instead of /dev/urandom
  */
-#define GRND_NONBLOCK	0x0001
-#define GRND_RANDOM	0x0002
+#define GRND_NONBLOCK				BIT(0)
+#define GRND_RANDOM				BIT(1)
+
+/*
+ * Flags for getrandom2(2)
+ *
+ * GRND2_SECURE		Use urandom pool, block until CRNG is inited
+ * GRND2_INSECURE	Use urandom pool, never block even if CRNG isn't inited
+ *
+ * NOTE: don't mix flag values with GRND, to protect against the
+ * security implications of users passing the invalid flag family
+ * to system calls (GRND_* vs. GRND2_*).
+ */
+#define GRND2_SECURE_UNBOUNDED_INITIAL_WAIT	BIT(7)
+#define GRND2_INSECURE				BIT(8)
 
 #endif /* _UAPI_LINUX_RANDOM_H */