
[16/18] net: mpls: prevent bounds-check bypass via speculative execution

Message ID 151520108080.32271.16420298348259030860.stgit@dwillia2-desk3.amr.corp.intel.com
State Not Applicable, archived
Delegated to: David Miller
Series: prevent bounds-check bypass via speculative execution

Commit Message

Dan Williams Jan. 6, 2018, 1:11 a.m. UTC
Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency reading 'rt' from the 'platform_label'
array.  In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid 'rt' value.

Based on an original patch by Elena Reshetova.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 net/mpls/af_mpls.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Comments

Sergei Shtylyov Jan. 6, 2018, 10:06 a.m. UTC | #1
On 1/6/2018 4:11 AM, Dan Williams wrote:

> Static analysis reports that 'index' may be a user controlled value that
> is used as a data dependency reading 'rt' from the 'platform_label'
> array.  In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue further
> reads based on an invalid 'rt' value.
> 
> Based on an original patch by Elena Reshetova.
> 
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>   net/mpls/af_mpls.c |   12 +++++++-----
>   1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 8ca9915befc8..ebcf0e246cfe 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
[...]
> @@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
>   static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
>   {
>   	struct mpls_route *rt = NULL;
> +	struct mpls_route __rcu **platform_label =
> +		rcu_dereference(net->mpls.platform_label);
> +	struct mpls_route __rcu **rtp;
>   
> -	if (index < net->mpls.platform_labels) {
> -		struct mpls_route __rcu **platform_label =
> -			rcu_dereference(net->mpls.platform_label);
> -		rt = rcu_dereference(platform_label[index]);
> -	}
> +	if ((rtp = nospec_array_ptr(platform_label, index,


    And here...

> +					net->mpls.platform_labels)))
> +		rt = rcu_dereference(*rtp);
>   	return rt;
>   }
>   

MBR, Sergei
Eric W. Biederman Jan. 9, 2018, 3:11 a.m. UTC | #2
Dan Williams <dan.j.williams@intel.com> writes:

> Static analysis reports that 'index' may be a user controlled value that
> is used as a data dependency reading 'rt' from the 'platform_label'
> array.  In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue further
> reads based on an invalid 'rt' value.


In detail.
a) This code is fast path packet forwarding code.  Introducing an
   unconditional pipeline stall is not ok.

   AKA either there is no speculation and so this is invulnerable
   or there is speculation and you are creating an unconditional
   pipeline stall here.

   My back of the napkin calculations say that a pipeline stall
   is about 20 cycles.  Which is about the same length of time
   as a modern cache miss.

   On a good day this code will perform with 0 cache misses. On a less
   good day 1 cache miss.  Which means you are quite possibly doubling
   the runtime of mpls_forward.

b) The array is dynamically allocated which should provide some
   protection, as it will be more difficult to predict the address
   of the array which is needed to craft a malicious userspace value.

c) The code can be trivially modified to say:

static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
{
	struct mpls_route *rt = NULL;

	if (index < net->mpls.platform_labels) {
		struct mpls_route __rcu **platform_label =
			rcu_dereference(net->mpls.platform_label);
		rt = rcu_dereference(platform_label[index & ((1 << 20) - 1)]);
	}
	return rt;
}

AKA a static mask will ensure that there is not a primitive that can be
used to access all of memory.  That is max a 1 cycle slowdown in the
code, which is a much better trade off.

d) If we care more it is straightforward to modify
   resize_platform_label_table() to ensure that the size of the array
   is always a power of 2.

e) The fact that a pointer is returned from the array and it is treated
   like a pointer would seem to provide a defense against the
   exfiltration technique of using the value read as an index into
   a small array, that user space code can probe aliased cached
   lines of, to see which value was dereferenced.


So to this patch in particular.
Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

This code path will be difficult to exploit.  This change messes with
performance.  There are ways to make this code path useless to an attacker while
preserving the performance of the code.

Eric
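
As a concrete illustration of point (d), a hedged sketch of the power-of-2
sizing is below; the helper name is an assumption for illustration, not
something proposed in this thread, and the real change would live inside
resize_platform_label_table():

#include <linux/log2.h>

/*
 * Hedged sketch only: round the requested platform_label table size up to
 * a power of two, so that a cheap static mask of (size - 1) can never
 * index past the allocation.
 */
static unsigned int mpls_label_table_size(unsigned int labels)
{
	return labels ? roundup_pow_of_two(labels) : 0;
}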

>
> Based on an original patch by Elena Reshetova.
>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  net/mpls/af_mpls.c |   12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 8ca9915befc8..ebcf0e246cfe 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -8,6 +8,7 @@
>  #include <linux/ipv6.h>
>  #include <linux/mpls.h>
>  #include <linux/netconf.h>
> +#include <linux/compiler.h>
>  #include <linux/vmalloc.h>
>  #include <linux/percpu.h>
>  #include <net/ip.h>
> @@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
>  static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
>  {
>  	struct mpls_route *rt = NULL;
> +	struct mpls_route __rcu **platform_label =
> +		rcu_dereference(net->mpls.platform_label);
> +	struct mpls_route __rcu **rtp;
>  
> -	if (index < net->mpls.platform_labels) {
> -		struct mpls_route __rcu **platform_label =
> -			rcu_dereference(net->mpls.platform_label);
> -		rt = rcu_dereference(platform_label[index]);
> -	}
> +	if ((rtp = nospec_array_ptr(platform_label, index,
> +					net->mpls.platform_labels)))
> +		rt = rcu_dereference(*rtp);
>  	return rt;
>  }
>
Dan Williams Jan. 9, 2018, 3:42 a.m. UTC | #3
On Mon, Jan 8, 2018 at 7:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> Static analysis reports that 'index' may be a user controlled value that
>> is used as a data dependency reading 'rt' from the 'platform_label'
>> array.  In order to avoid potential leaks of kernel memory values, block
>> speculative execution of the instruction stream that could issue further
>> reads based on an invalid 'rt' value.
>
>
> In detail.
> a) This code is fast path packet forwarding code.  Introducing an
>    unconditional pipeline stall is not ok.
>
>    AKA either there is no speculation and so this is invulnerable
>    or there is speculation and you are creating an unconditional
>    pipeline stall here.
>
>    My back of the napkin calculations say that a pipeline stall
>    is about 20 cycles.  Which is about the same length of time
>    as a modern cache miss.
>
>    On a good day this code will perform with 0 cache misses. On a less
>    good day 1 cache miss.  Which means you are quite possibly doubling
>    the runtime of mpls_forward.
>
> b) The array is dynamically allocated which should provide some
>    protection, as it will be more difficult to predict the address
>    of the array which is needed to craft a malicious userspace value.
>
> c) The code can be trivially modified to say:
>
> static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
> {
>         struct mpls_route *rt = NULL;
>
>         if (index < net->mpls.platform_labels) {
>                 struct mpls_route __rcu **platform_label =
>                         rcu_dereference(net->mpls.platform_label);
>                 rt = rcu_dereference(platform_label[index & ((1 << 20) - 1)]);
>         }
>         return rt;
> }
>
> AKA a static mask will ensure that there is not a primitive that can be
> used to access all of memory.  That is max a 1 cycle slowdown in the
> code, which is a much better trade off.
>
> d) If we care more it is straightforward to modify
>    resize_platform_label_table() to ensure that the size of the array
>    is always a power of 2.
>
> e) The fact that a pointer is returned from the array and it is treated
>    like a pointer would seem to provide a defense against the
>    exfiltration technique of using the value read as an index into
>    a small array, that user space code can probe aliased cached
>    lines of, to see which value was dereferenced.
>
>
> So to this patch in particular.
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> This code path will be difficult to exploit.  This change messes with
> performance.  There are ways to make this code path useless to an attacker while
> preserving the performance of the code.
>

Thanks, Eric, understood. The discussion over the weekend came to the
conclusion that using a mask will be the default approach. The
nospec_array_ptr() will be defined to something similar to the
following, originally from Linus and tweaked by Alexei and I:

#define __nospec_array_ptr(base, idx, sz)                               \
({                                                                      \
        union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;       \
        unsigned long _i = (idx);                                       \
        unsigned long _m = (sz);                                        \
        unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
        OPTIMIZER_HIDE_VAR(_mask);                                      \
        __u._ptr = &base[_i & _mask];                                   \
        __u._bit &= _mask;                                              \
        __u._ptr;                                                       \
})

Does that address your performance concerns?
Linus Torvalds Jan. 9, 2018, 4:13 a.m. UTC | #4
On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> originally from Linus and tweaked by Alexei and I:

Sadly, that tweak - while clever - is wrong.

>         unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\

Why?

Because "(long)(_m-1-_i)" is not negative just because "i >= m". It
can still be positive.

Think "m = 100", "i=bignum". The subtraction will overflow and become
positive again, the shift will shift to zero, and then the mask will
become ~0.

Now, you can fix it, but you need to be a tiny bit more clever.  In
particular, just make sure that you retain the high bit of "_i",
basically making the rule be that a negative index is not ever valid.

And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
Now the sign bit is set if _i had it set, _or_ if the subtraction
turned negative, and you don't have to worry about the overflow
situation.

But it does require that extra step to be trustworthy. Still purely
cheap arithmetic operations, although there is possibly some
additional register pressure there.

Somebody might be able to come up with something even more minimal (or
find a fault in my fix of your tweak).

Obviously, with architecture-specific code, you may well be able to do
better, using the carry flag of the subtraction.

For example, on x86, I think you could do it with just two instructions:

    # carry will be clear if idx >= max
    cmpq %idx,%max

    # mask will be clear if carry was clear, ~0 otherwise
    sbbq %mask,%mask

to generate mask directly. I might have screwed that up. Worth perhaps trying?

               Linus
Linus Torvalds Jan. 9, 2018, 4:21 a.m. UTC | #5
On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>     # carry will be clear if idx >= max
>     cmpq %idx,%max

Bah. Other way around.

        cmpq %max,%idx

I'm a moron.

>     # mask will be clear if carry was clear, ~0 otherwise
>     sbbq %mask,%mask
>
> to generate mask directly. I might have screwed that up. Worth perhaps trying?

More importantly, worth _testing_ and fixing my hand-waving "asm like
this" crap.

But I do think that simple two-instruction cmpq/sbbq sequence could
get it right in just two trivial ALU instructions.

          Linus
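
For reference, a minimal sketch of the cmp/sbb idea with the corrected
operand order; the helper name and the exact constraints are assumptions
for illustration, not code taken from this thread:

/*
 * Returns ~0UL when idx < max and 0 otherwise, using only ALU
 * instructions (x86-64).  cmp computes idx - max and sets CF when
 * idx < max; sbb then produces 0 - CF, i.e. all ones or zero.
 */
static inline unsigned long index_mask(unsigned long idx, unsigned long max)
{
	unsigned long mask;

	asm ("cmpq %2, %1\n\t"
	     "sbbq %0, %0"
	     : "=r" (mask)
	     : "r" (idx), "r" (max)
	     : "cc");
	return mask;
}
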
Eric W. Biederman Jan. 9, 2018, 4:17 p.m. UTC | #6
Dan Williams <dan.j.williams@intel.com> writes:

> On Mon, Jan 8, 2018 at 7:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> Static analysis reports that 'index' may be a user controlled value that
>>> is used as a data dependency reading 'rt' from the 'platform_label'
>>> array.  In order to avoid potential leaks of kernel memory values, block
>>> speculative execution of the instruction stream that could issue further
>>> reads based on an invalid 'rt' value.
>>
>>
>> In detail.
>> a) This code is fast path packet forwarding code.  Introducing an
>>    unconditional pipeline stall is not ok.
>>
>>    AKA either there is no speculation and so this is invulnerable
>>    or there is speculation and you are creating an unconditional
>>    pipeline stall here.
>>
>>    My back of the napkin calculations say that a pipeline stall
>>    is about 20 cycles.  Which is about the same length of time
>>    as a modern cache miss.
>>
>>    On a good day this code will perform with 0 cache misses. On a less
>>    good day 1 cache miss.  Which means you are quite possibly doubling
>>    the runtime of mpls_forward.
>>
>> b) The array is dynamically allocated which should provide some
>>    protection, as it will be more difficult to predict the address
>>    of the array which is needed to craft a malicious userspace value.
>>
>> c) The code can be trivially modified to say:
>>
>> static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
>> {
>>         struct mpls_route *rt = NULL;
>>
>>         if (index < net->mpls.platform_labels) {
>>                 struct mpls_route __rcu **platform_label =
>>                         rcu_dereference(net->mpls.platform_label);
>>                 rt = rcu_dereference(platform_label[index & ((1 << 20) - 1)]);
>>         }
>>         return rt;
>> }
>>
>> AKA a static mask will ensure that there is not a primitive that can be
>> used to access all of memory.  That is max a 1 cycle slowdown in the
>> code, which is a much better trade off.
>>
>> d) If we care more it is straightforward to modify
>>    resize_platform_label_table() to ensure that the size of the array
>>    is always a power of 2.
>>
>> e) The fact that a pointer is returned from the array and it is treated
>>    like a pointer would seem to provide a defense against the
>>    exfiltration technique of using the value read as an index into
>>    a small array, that user space code can probe aliased cached
>>    lines of, to see which value was dereferenced.
>>
>>
>> So to this patch in particular.
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> This code path will be difficult to exploit.  This change messes with
>> performance.  There are ways to make this code path useless to an attacker while
>> preserving the performance of the code.
>>
>
> Thanks, Eric, understood. The discussion over the weekend came to the
> conclusion that using a mask will be the default approach. The
> nospec_array_ptr() will be defined to something similar to the
> following, originally from Linus and tweaked by Alexei and I:
>
> #define __nospec_array_ptr(base, idx, sz)                               \
> ({                                                                      \
>         union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;       \
>         unsigned long _i = (idx);                                       \
>         unsigned long _m = (sz);                                        \
>         unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
>         OPTIMIZER_HIDE_VAR(_mask);                                      \
>         __u._ptr = &base[_i & _mask];                                   \
>         __u._bit &= _mask;                                              \
>         __u._ptr;                                                       \
> })
>
> Does that address your performance concerns?

The user controlled value condition of your patchset implies that you
assume indirect branch predictor poisoning is handled in other ways.

Which means that we can assume speculation will take some variation of
the static call chain.

Further you are worrying about array accesses.  Which means you
are worried about attacks that are the equivalent of meltdown that
can give you reading of all memory available to the kernel.


The mpls code in question reads a pointer from memory.


The only thing the code does with that pointer is verify it is not NULL
and dereference it.

That does not make it possible to exfiltrate the pointer bits via a cache
side-channel as a pointer is 64bits wide.

There might be a timing attack where it is timed how long the
packet takes to deliver.  If you can find the base address of the array,
at best such a timing attack will tell you whether some arbitrary cache
line is already cached in the kernel.  Which is not the class of attack
your patchset is worried about.  Further there are more direct ways
to probe the cache from a local process.

So I submit to you that the mpls code is not vulnerable to the class of
attack you are addressing.

Further I would argue that anything that reads a pointer from memory is
a very strong clue that it falls outside the class of code that you are
addressing.

Show me where I am wrong and I will consider patches.

Eric
Dan Williams Jan. 9, 2018, 6:01 p.m. UTC | #7
On Tue, Jan 9, 2018 at 8:17 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
[..]
> The user controlled value condition of your patchset implies that you
> assume indirect branch predictor poisoning is handled in other ways.
>
> Which means that we can assume speculation will take some variation of
> the static call chain.
>
> Further you are worrying about array accesses.  Which means you
> are worried about attacks that are the equivalent of meltdown that
> can give you reading of all memory available to the kernel.
>
>
> The mpls code in question reads a pointer from memory.
>
>
> The only thing the code does with that pointer is verify it is not NULL
> and dereference it.
>
> That does not make it possible to exfiltrate the pointer bits via a cache
> side-channel as a pointer is 64bits wide.
>
> There might be a timing attack where it is timed how long the
> packet takes to deliver.  If you can find the base address of the array,
> at best such a timing attack will tell you whether some arbitrary cache
> line is already cached in the kernel.  Which is not the class of attack
> your patchset is worried about.  Further there are more direct ways
> to probe the cache from a local process.
>
> So I submit to you that the mpls code is not vulnerable to the class of
> attack you are addressing.
>
> Further I would argue that anything that reads a pointer from memory is
> a very strong clue that it falls outside the class of code that you are
> addressing.
>
> Show me where I am wrong and I will consider patches.

No, the concern is a second dependent read (or write) within the
speculation window after this first bounds-checked dependent read.
I.e. this mpls code path appears to have the setup condition:

    if (x < max)
        val = array1[x];

...but it's not clear that there is an exploit condition later on in
the instruction stream:

        array2[val] = y;
        /* or */
        y = array2[val];

My personal paranoia says submit the patch and not worry about finding
that later exploit condition; if DaveM wants to drop the patch, that's
his prerogative. In general, with the performance conscious version of
nospec_array_ptr() being the default, why worry about what is / is not
in the speculation window?
Dan Williams Jan. 10, 2018, 12:48 a.m. UTC | #8
On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > originally from Linus and tweaked by Alexei and I:
>
> Sadly, that tweak - while clever - is wrong.
>
> >         unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
>
> Why?
>
> Because "(long)(_m-1-_i)" is not negative just because "i >= m". It
> can still be positive.
>
> Think "m = 100", "i=bignum". The subtraction will overflow and become
> positive again, the shift will shift to zero, and then the mask will
> become ~0.
>
> Now, you can fix it, but you need to be a tiny bit more clever.  In
> particular, just make sure that you retain the high bit of "_i",
> basically making the rule be that a negative index is not ever valid.
>
> And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
> Now the sign bit is set if _i had it set, _or_ if the subtraction
> turned negative, and you don't have to worry about the overflow
> situation.
>
> But it does require that extra step to be trustworthy. Still purely
> cheap arithmetic operations, although there is possibly some
> additional register pressure there.
>
> Somebody might be able to come up with something even more minimal (or
> find a fault in my fix of your tweak).

It looks like there is another problem, or I'm misreading the
cleverness. We want the mask to be ~0 in the ok case and 0 in the
out-of-bounds case. As far as I can see we end up with ~0 in the ok
case, and ~1 in the bad case. Don't we need to do something like the
following, at which point are we getting out of the realm of "cheap
ALU instructions"?

#define __nospec_array_ptr(base, idx, sz)                               \
({                                                                      \
        union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;       \
        unsigned long _i = (idx);                                       \
        unsigned long _s = (sz);                                        \
        unsigned long _v = (long)(_i | _s - 1 - _i)                     \
                                        >> BITS_PER_LONG - 1;           \
        unsigned long _mask = _v * ~0UL;                                 \
        OPTIMIZER_HIDE_VAR(_mask);                                      \
        __u._ptr = &base[_i & _mask];                                   \
        __u._bit &= _mask;                                              \
        __u._ptr;                                                       \
})

        elem = nospec_array_ptr(array, idx, 3);
 106:   b8 02 00 00 00          mov    $0x2,%eax
 10b:   48 63 ff                movslq %edi,%rdi
 10e:   48 29 f8                sub    %rdi,%rax
 111:   48 09 f8                or     %rdi,%rax
 114:   48 c1 e8 3f             shr    $0x3f,%rax
 118:   48 21 c7                and    %rax,%rdi
 11b:   48 8d 54 bc 04          lea    0x4(%rsp,%rdi,4),%rdx
 120:   48 21 d0                and    %rdx,%rax
Eric W. Biederman Jan. 10, 2018, 12:54 a.m. UTC | #9
Dan Williams <dan.j.williams@intel.com> writes:

> On Tue, Jan 9, 2018 at 8:17 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
> [..]
>> The user controlled value condition of your patchset implies that you
>> assume indirect branch predictor poisoning is handled in other ways.
>>
>> Which means that we can assume speculation will take some variation of
>> the static call chain.
>>
>> Further you are worrying about array accesses.  Which means you
>> are worried about attacks that are the equivalent of meltdown that
>> can give you reading of all memory available to the kernel.
>>
>>
>> The mpls code in question reads a pointer from memory.
>>
>>
>> The only thing the code does with that pointer is verify it is not NULL
>> and dereference it.
>>
>> That does not make it possible to exfiltrate the pointer bits via a cache
>> side-channel as a pointer is 64bits wide.
>>
>> There might be a timing attack where it is timed how long the
>> packet takes to deliver.  If you can find the base address of the array,
>> at best such a timing attack will tell you whether some arbitrary cache
>> line is already cached in the kernel.  Which is not the class of attack
>> your patchset is worried about.  Further there are more direct ways
>> to probe the cache from a local process.
>>
>> So I submit to you that the mpls code is not vulnerable to the class of
>> attack you are addressing.
>>
>> Further I would argue that anything that reads a pointer from memory is
>> a very strong clue that it falls outside the class of code that you are
>> addressing.
>>
>> Show me where I am wrong and I will consider patches.
>
> No, the concern is a second dependent read (or write) within the
> speculation window after this first bounds-checked dependent read.
> I.e. this mpls code path appears to have the setup condition:
>
>     if (x < max)
>         val = array1[x];
>
> ...but it's not clear that there is an exploit condition later on in
> the instruction stream:
>
>         array2[val] = y;
>         /* or */
>         y = array2[val];
>
> My personal paranoia says submit the patch and not worry about finding
> that later exploit condition; if DaveM wants to drop the patch, that's
> his prerogative. In general, with the performance conscious version of
> nospec_array_ptr() being the default, why worry about what is / is not
> in the speculation window?

Sigh.
I am not worrying about what is in the speculation window.  My
assumption is that the speculation window could be infinite, as we don't
know the limitations of all possible processors.

I am saying there is not a way to get the data out of the speculation
window.

I was saying that when you have:
	if (x < max)
        	val = array1[x];

When val is a pointer not an integer.
Then
	array2[val] = y;
        /* or */
        y = array2[val];

Won't happen.

	val->field;

Will happen.

Which looks similar.  However the address space of pointers is too
large.  Making it impossible for an attacker to know where to look in the
cache to see if "val->field" happened.  At least on the assumption that
val is an arbitrary value.

Further mpls_forward is small enough that the entire scope of "rt", the value
read possibly past the bounds check, is auditable without too much
trouble.  I have looked and I don't see anything that could possibly
allow the value of "rt" to be exfiltrated.  The problem continues to be
that it is a pointer and the only operation on the pointer besides
dereferencing it is testing if it is NULL.

Other types of timing attacks are very hard if not impossible because
any packet presenting with a value outside the bounds check will be
dropped.  So it will be hard if not impossible to find something to time to
see how long it took to drop the packet.

So no this code path does not present with the classic signature of
variant 1: bounds check bypass CVE-2017-5753

Show me where I am wrong and I will gladly take patches.  As it is the
mpls fast path does not appear vulnerable to the issue you are
addressing.  So the best thing we can do for performance is nothing at
all.

All I am after is a plausible scenario.  I just want to see it spelled
out which combinations of things could possibly be a problem.

Eric
Dan Williams Jan. 10, 2018, 1:31 a.m. UTC | #10
On Tue, Jan 9, 2018 at 4:54 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
[..]
> Sigh.
> I am not worrying about what is in the speculation window.  My
> assumption is that the speculation window could be infinite, as we don't
> know the limitations of all possible processors.
>
> I am saying there is not a way to get the data out of the speculation
> window.
>
> I was saying that when you have:
>         if (x < max)
>                 val = array1[x];
>
> When val is a pointer not an integer.
> Then
>         array2[val] = y;
>         /* or */
>         y = array2[val];
>
> Won't happen.
>
>         val->field;
>
> Will happen.
>
> Which looks similar.  However the address space of pointers is too
> large.  Making it impossible for an attacker to know where to look in the
> cache to see if "val->field" happened.  At least on the assumption that
> val is an arbitrary value.
>
> Further mpls_forward is small enough that the entire scope of "rt", the value
> read possibly past the bounds check, is auditable without too much
> trouble.  I have looked and I don't see anything that could possibly
> allow the value of "rt" to be exfiltrated.  The problem continues to be
> that it is a pointer and the only operation on the pointer besides
> dereferencing it is testing if it is NULL.
>
> Other types of timing attacks are very hard if not impossible because
> any packet presenting with a value outside the bounds check will be
> dropped.  So it will be hard if not impossible to find something to time to
> see how long it took to drop the packet.
>
> So no this code path does not present with the classic signature of
> variant 1: bounds check bypass CVE-2017-5753
>
> Show me where I am wrong and I will gladly take patches.  As it is the
> mpls fast path does not appear vulnerable to the issue you are
> addressing.  So the best thing we can do for performance is nothing at
> all.

That's completely valid. Let's drop this one.

> All I am after is a plausible scenario.  I just want to see it spelled
> out which combinations of things could possibly be a problem.

In fact, I'm not here to spell out the speculative execution problem
in this path beyond the initial user controlled speculative read. As
noted in the cover letter thread, this and the other patches are
simply reports that are fully expected to be resolved as false
positives by sub-system owners in some cases. I'm otherwise more
focused on landing common infrastructure that could be used in the
future as data-flow-analysis tooling improves to find these locations
with higher accuracy. In other words, the goal at the end of this
exercise is to identify a default nospec_array_ptr() implementation
that all sub-systems could accept for when the problem report turns
out to be real, and your pushback has already resulted in good
discussion of alternate nospec_array_ptr() implementations.

Thanks for your patience to keep the conversation going.
Dan Williams Jan. 10, 2018, 1:33 a.m. UTC | #11
On Tue, Jan 9, 2018 at 4:48 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> >
>> > originally from Linus and tweaked by Alexei and I:
>>
>> Sadly, that tweak - while clever - is wrong.
>>
>> >         unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
>>
>> Why?
>>
>> Because "(long)(_m-1-_i)" is not negative just because "i >= m". It
>> can still be positive.
>>
>> Think "m = 100", "i=bignum". The subtraction will overflow and become
>> positive again, the shift will shift to zero, and then the mask will
>> become ~0.
>>
>> Now, you can fix it, but you need to be a tiny bit more clever.  In
>> particular, just make sure that you retain the high bit of "_i",
>> basically making the rule be that a negative index is not ever valid.
>>
>> And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
>> Now the sign bit is set if _i had it set, _or_ if the subtraction
>> turned negative, and you don't have to worry about the overflow
>> situation.
>>
>> But it does require that extra step to be trustworthy. Still purely
>> cheap arithmetic operations, although there is possibly some
>> additional register pressure there.
>>
>> Somebody might be able to come up with something even more minimal (or
>> find a fault in my fix of your tweak).
>
> It looks like there is another problem, or I'm misreading the
> cleverness. We want the mask to be ~0 in the ok case and 0 in the
> out-of-bounds case. As far as I can see we end up with ~0 in the ok
> case, and ~1 in the bad case. Don't we need to do something like the
> following, at which point are we getting out of the realm of "cheap
> ALU instructions"?
>
> #define __nospec_array_ptr(base, idx, sz)                               \
> ({                                                                      \
>         union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;       \
>         unsigned long _i = (idx);                                       \
>         unsigned long _s = (sz);                                        \
>         unsigned long _v = (long)(_i | _s - 1 - _i)                     \
>                                         >> BITS_PER_LONG - 1;           \
>         unsigned long _mask = _v * ~0UL;                                 \
>         OPTIMIZER_HIDE_VAR(_mask);                                      \
>         __u._ptr = &base[_i & _mask];                                   \
>         __u._bit &= _mask;                                              \
>         __u._ptr;                                                       \
> })

Sorry, I'm slow; of course ~(-1L) is 0.
Alexei Starovoitov Jan. 10, 2018, 1:57 a.m. UTC | #12
On Tue, Jan 09, 2018 at 04:48:24PM -0800, Dan Williams wrote:
> 
> #define __nospec_array_ptr(base, idx, sz)                               \
> ({                                                                      \
>         union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;       \
>         unsigned long _i = (idx);                                       \
>         unsigned long _s = (sz);                                        \
>         unsigned long _v = (long)(_i | _s - 1 - _i)                     \
>                                         >> BITS_PER_LONG - 1;           \
>         unsigned long _mask = _v * ~0UL;                                 \
>         OPTIMIZER_HIDE_VAR(_mask);                                      \
>         __u._ptr = &base[_i & _mask];                                   \
>         __u._bit &= _mask;                                              \
>         __u._ptr;                                                       \
> })

_v * ~0UL doesn't seem right and non intuitive.
What's wrong with:
  unsigned long _mask = ~(long)(_i | _s - 1 - _i) >> BITS_PER_LONG - 1;

and why OPTIMIZER_HIDE_VAR ?
Could you remove '&' ?
since it doesn't work for:
struct {
  int fd[4];
  ...
} *fdt;
it cannot be used as array_access(fdt->fd, ...);

Could you please drop nospec_ prefix since it is misleading ?
This macro doesn't prevent speculation.
I think array_access() was the best name so far.
Dan Williams Jan. 10, 2018, 2:22 a.m. UTC | #13
On Tue, Jan 9, 2018 at 5:57 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Jan 09, 2018 at 04:48:24PM -0800, Dan Williams wrote:
>>
>> #define __nospec_array_ptr(base, idx, sz)                               \
>> ({                                                                      \
>>         union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;       \
>>         unsigned long _i = (idx);                                       \
>>         unsigned long _s = (sz);                                        \
>>         unsigned long _v = (long)(_i | _s - 1 - _i)                     \
>>                                         >> BITS_PER_LONG - 1;           \
>>         unsigned long _mask = _v * ~0UL;                                 \
>>         OPTIMIZER_HIDE_VAR(_mask);                                      \
>>         __u._ptr = &base[_i & _mask];                                   \
>>         __u._bit &= _mask;                                              \
>>         __u._ptr;                                                       \
>> })
>
> _v * ~0UL doesn't seem right and non intuitive.
> What's wrong with:
>   unsigned long _mask = ~(long)(_i | _s - 1 - _i) >> BITS_PER_LONG - 1;

Yeah, I noticed it was ok immediately after I sent that.

> and why OPTIMIZER_HIDE_VAR ?

It was in Linus' original, but that was when it had the ternary
conditional, I'll drop it. It does not change the generated assembly.

> Could you remove '&' ?

Yes, that should be __u.ptr = base + (i & _mask)

> since it doesn't work for:
> struct {
>   int fd[4];
>   ...
> } *fdt;
> it cannot be used as array_access(fdt->fd, ...);
>
> Could you please drop nospec_ prefix since it is misleading ?

When you came up with that tweak you noted:

"The following:
[..]
is generic and no speculative flows."

> This macro doesn't prevent speculation.

It masks dangerous speculation. At least, I read nospec as "No
Spectre" and it is a prefix used in the Spectre-v2 patches.

I also want to include the option, with a static branch, to switch it
to the hard "no speculation" version with an ifence if worse comes to
worst and we find a compiler / cpu where it doesn't work. The default
will be the fast and practical implementation.

> I think array_access() was the best name so far.

For other usages I need the pointer to the array element, also
array_access() by itself is unsuitable for __fcheck_files because we
still need rcu_dereference_raw() on the element de-reference. So, I
think it's better to get a sanitized array element pointer which can
be used with rcu, READ_ONCE(), etc... directly rather than try to do
the access in the same macro.
Alexei Starovoitov Jan. 10, 2018, 3:07 a.m. UTC | #14
On Tue, Jan 09, 2018 at 06:22:09PM -0800, Dan Williams wrote:
> 
> When you came up with that tweak you noted:
> 
> "The following:
> [..]
> is generic and no speculative flows."

I meant 'no speculative control flow'
load speculation still happens.

> 
> > This macro doesn't prevent speculation.
> 
> It masks dangerous speculation. At least, I read nospec as "No
> Spectre" and it is a prefix used in the Spectre-v2 patches.

ahh. I thought 'nospec' means 'no speculation'.
I think it's too much of an honor to use the bug name for a macro
that will be used in many places in the kernel.

> > I think array_access() was the best name so far.
> 
> For other usages I need the pointer to the array element, also
> array_access() by itself is unsuitable for __fcheck_files because we
> still need rcu_dereference_raw() on the element de-reference. So, I
> think it's better to get a sanitized array element pointer which can
> be used with rcu, READ_ONCE(), etc... directly rather than try to do
> the access in the same macro.

makes sense, then array_ptr() should fit ?
I'm hearing rumors that the first cpu with variant 2 and 3 fixed
will be appearing in early 2019. Which is amazing considering cpu
release cycles, but it also means that variant 1 will stay with us
for a long time and we better pick a clean interface and name for it.
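
Putting the pieces of this thread together, an array_ptr() along the lines
being discussed might look roughly like the sketch below.  This is
illustrative only: it assumes sz <= LONG_MAX and 64-bit longs, and it is
not necessarily the form that ends up merged.

/*
 * Sketch: return a sanitized pointer to base[idx], or NULL when idx is
 * out of bounds, with the out-of-bounds case forced to NULL even under
 * speculation via the arithmetic mask discussed above.
 */
#define array_ptr(base, idx, sz)					\
({									\
	union { typeof(&(base)[0]) _ptr; unsigned long _bit; } __u;	\
	unsigned long _i = (idx);					\
	unsigned long _s = (sz);					\
	unsigned long _mask = ~(long)(_i | (_s - 1 - _i)) >>		\
						(BITS_PER_LONG - 1);	\
	__u._ptr = (base) + (_i & _mask);				\
	__u._bit &= _mask;						\
	__u._ptr;							\
})
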
Linus Torvalds Jan. 10, 2018, 3:27 a.m. UTC | #15
On Tue, Jan 9, 2018 at 4:48 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> I looks like there is another problem, or I'm misreading the
> cleverness.

I think you were misreading it.

I was basically saying that this:

        unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\

doesn't work, and that the "_m -1 - _i" needs to be replaced by "_i |
_m -1 -_i".

So you have

        unsigned long _mask = ~(long)(_i | (_m - 1 - _i)) >> BITS_PER_LONG - 1;\

which should give the right result. No?

But as mentioned, I think you can do it with two instructions if you
do an architecture-specific inline asm:

        unsigned long mask;

        asm("cmpq %1,%2; sbbq %0,%0"
                :"=r" (mask)
                :"g" (max),"r" (idx));

which is likely much faster, and has much better register usage ("max"
can be a constant or loaded directly from memory, and "mask" could be
assigned the same register as idx).

But once again, I didn't really test it.

Note that the "cmpq/sbbq" version works regardless of max/idx values,
since it literally does the math in BITS_PER_LONG+1 bits.

In contrast, the "binary or with idx" version only works if the high
bit set in idx cannot be valid (put another way: 'max' must not be
bigger than MAXLONG+1).

           Linus
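
As a quick sanity check of the generic mask expression discussed above, a
small standalone userspace test (not from the thread; it assumes 64-bit
longs and an arithmetic right shift of negative values) might look like:

#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(long))

/* ~0UL when i < m, 0 otherwise, including the i > LONG_MAX overflow case. */
static unsigned long mask_of(unsigned long i, unsigned long m)
{
	return ~(long)(i | (m - 1 - i)) >> (BITS_PER_LONG - 1);
}

int main(void)
{
	printf("%lx\n", mask_of(0, 100));		/* ffffffffffffffff */
	printf("%lx\n", mask_of(99, 100));		/* ffffffffffffffff */
	printf("%lx\n", mask_of(100, 100));		/* 0 */
	printf("%lx\n", mask_of(1UL << 63, 100));	/* overflow case: 0 */
	return 0;
}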

Patch

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 8ca9915befc8..ebcf0e246cfe 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@ 
 #include <linux/ipv6.h>
 #include <linux/mpls.h>
 #include <linux/netconf.h>
+#include <linux/compiler.h>
 #include <linux/vmalloc.h>
 #include <linux/percpu.h>
 #include <net/ip.h>
@@ -77,12 +78,13 @@  static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
 static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
 	struct mpls_route *rt = NULL;
+	struct mpls_route __rcu **platform_label =
+		rcu_dereference(net->mpls.platform_label);
+	struct mpls_route __rcu **rtp;
 
-	if (index < net->mpls.platform_labels) {
-		struct mpls_route __rcu **platform_label =
-			rcu_dereference(net->mpls.platform_label);
-		rt = rcu_dereference(platform_label[index]);
-	}
+	if ((rtp = nospec_array_ptr(platform_label, index,
+					net->mpls.platform_labels)))
+		rt = rcu_dereference(*rtp);
 	return rt;
 }