| Message ID | 151520108080.32271.16420298348259030860.stgit@dwillia2-desk3.amr.corp.intel.com |
|---|---|
| State | Not Applicable, archived |
| Delegated to | David Miller |
| Series | prevent bounds-check bypass via speculative execution |
On 1/6/2018 4:11 AM, Dan Williams wrote:

> Static analysis reports that 'index' may be a user controlled value that
> is used as a data dependency reading 'rt' from the 'platform_label'
> array. In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue further
> reads based on an invalid 'rt' value.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  net/mpls/af_mpls.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 8ca9915befc8..ebcf0e246cfe 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
[...]
> @@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
>  static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
>  {
>  	struct mpls_route *rt = NULL;
> +	struct mpls_route __rcu **platform_label =
> +		rcu_dereference(net->mpls.platform_label);
> +	struct mpls_route __rcu **rtp;
>
> -	if (index < net->mpls.platform_labels) {
> -		struct mpls_route __rcu **platform_label =
> -			rcu_dereference(net->mpls.platform_label);
> -		rt = rcu_dereference(platform_label[index]);
> -	}
> +	if ((rtp = nospec_array_ptr(platform_label, index,

   And here...

> +			net->mpls.platform_labels)))
> +		rt = rcu_dereference(*rtp);
>  	return rt;
>  }

MBR, Sergei
Dan Williams <dan.j.williams@intel.com> writes:

> Static analysis reports that 'index' may be a user controlled value that
> is used as a data dependency reading 'rt' from the 'platform_label'
> array. In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue further
> reads based on an invalid 'rt' value.

In detail:

a) This code is fast path packet forwarding code. Introducing an
   unconditional pipeline stall is not ok.

   AKA either there is no speculation and so this is invulnerable,
   or there is speculation and you are creating an unconditional
   pipeline stall here.

   My back of the napkin calculations say that a pipeline stall
   is about 20 cycles. Which is about the same length of time
   as a modern cache miss.

   On a good day this code will perform with 0 cache misses. On a less
   good day 1 cache miss. Which means you are quite possibly doubling
   the runtime of mpls_forward.

b) The array is dynamically allocated, which should provide some
   protection, as it will be more difficult to predict the address
   of the array, which is needed to craft a malicious userspace value.

c) The code can be trivially modified to say:

   static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
   {
   	struct mpls_route *rt = NULL;

   	if (index < net->mpls.platform_labels) {
   		struct mpls_route __rcu **platform_label =
   			rcu_dereference(net->mpls.platform_label);
   		rt = rcu_dereference(platform_label[index & ((1 << 20) - 1)]);
   	}
   	return rt;
   }

   AKA a static mask will ensure that there is not a primitive that can be
   used to access all of memory. That is max a 1 cycle slowdown in the
   code, which is a much better trade off.

d) If we care more, it is straightforward to modify
   resize_platform_label_table() to ensure that the size of the array
   is always a power of 2.

e) The fact that a pointer is returned from the array and it is treated
   like a pointer would seem to provide a defense against the
   exfiltration technique of using the value read as an index into
   a small array, that user space code can probe aliased cache
   lines of, to see which value was dereferenced.

So to this patch in particular:

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

This code path will be difficult to exploit. This change messes with
performance. There are ways to make this code path useless while
preserving the performance of the code.

Eric

>
> Based on an original patch by Elena Reshetova.
>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  net/mpls/af_mpls.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 8ca9915befc8..ebcf0e246cfe 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -8,6 +8,7 @@
>  #include <linux/ipv6.h>
>  #include <linux/mpls.h>
>  #include <linux/netconf.h>
> +#include <linux/compiler.h>
>  #include <linux/vmalloc.h>
>  #include <linux/percpu.h>
>  #include <net/ip.h>
> @@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
>  static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
>  {
>  	struct mpls_route *rt = NULL;
> +	struct mpls_route __rcu **platform_label =
> +		rcu_dereference(net->mpls.platform_label);
> +	struct mpls_route __rcu **rtp;
>
> -	if (index < net->mpls.platform_labels) {
> -		struct mpls_route __rcu **platform_label =
> -			rcu_dereference(net->mpls.platform_label);
> -		rt = rcu_dereference(platform_label[index]);
> -	}
> +	if ((rtp = nospec_array_ptr(platform_label, index,
> +			net->mpls.platform_labels)))
> +		rt = rcu_dereference(*rtp);
>  	return rt;
>  }
>
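Eric's point (c) can be modeled in plain userspace C. A minimal sketch, assuming a fixed power-of-2 table bound; the table type and function names below are illustrative, not the kernel code:

```c
#include <assert.h>
#include <stddef.h>

#define LABEL_TABLE_MAX (1 << 20)	/* fixed power-of-2 bound, as in point (c) */

/* Userspace model of the static-mask idea: the bounds check still
 * rejects bad indices architecturally, while the constant mask keeps
 * even a misspeculated load inside a fixed 2^20-entry window, so a
 * hostile index cannot steer a speculative read to an arbitrary
 * address relative to the table base. */
static int table_lookup(const int *table, size_t nlabels, size_t index,
			int *out)
{
	if (index < nlabels) {
		*out = table[index & (LABEL_TABLE_MAX - 1)];
		return 1;
	}
	return 0;
}
```

The mask costs one AND on the index, which is the "max a 1 cycle slowdown" trade-off Eric describes.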
On Mon, Jan 8, 2018 at 7:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> Static analysis reports that 'index' may be a user controlled value that
>> is used as a data dependency reading 'rt' from the 'platform_label'
>> array. In order to avoid potential leaks of kernel memory values, block
>> speculative execution of the instruction stream that could issue further
>> reads based on an invalid 'rt' value.
>
> In detail:
>
> a) This code is fast path packet forwarding code. Introducing an
>    unconditional pipeline stall is not ok.
>
>    AKA either there is no speculation and so this is invulnerable,
>    or there is speculation and you are creating an unconditional
>    pipeline stall here.
>
>    My back of the napkin calculations say that a pipeline stall
>    is about 20 cycles. Which is about the same length of time
>    as a modern cache miss.
>
>    On a good day this code will perform with 0 cache misses. On a less
>    good day 1 cache miss. Which means you are quite possibly doubling
>    the runtime of mpls_forward.
>
> b) The array is dynamically allocated, which should provide some
>    protection, as it will be more difficult to predict the address
>    of the array, which is needed to craft a malicious userspace value.
>
> c) The code can be trivially modified to say:
>
>    static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
>    {
>    	struct mpls_route *rt = NULL;
>
>    	if (index < net->mpls.platform_labels) {
>    		struct mpls_route __rcu **platform_label =
>    			rcu_dereference(net->mpls.platform_label);
>    		rt = rcu_dereference(platform_label[index & ((1 << 20) - 1)]);
>    	}
>    	return rt;
>    }
>
>    AKA a static mask will ensure that there is not a primitive that can be
>    used to access all of memory. That is max a 1 cycle slowdown in the
>    code, which is a much better trade off.
>
> d) If we care more, it is straightforward to modify
>    resize_platform_label_table() to ensure that the size of the array
>    is always a power of 2.
>
> e) The fact that a pointer is returned from the array and it is treated
>    like a pointer would seem to provide a defense against the
>    exfiltration technique of using the value read as an index into
>    a small array, that user space code can probe aliased cache
>    lines of, to see which value was dereferenced.
>
> So to this patch in particular:
>
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> This code path will be difficult to exploit. This change messes with
> performance. There are ways to make this code path useless while
> preserving the performance of the code.

Thanks, Eric, understood. The discussion over the weekend came to the
conclusion that using a mask will be the default approach. The
nospec_array_ptr() will be defined to something similar to the
following, originally from Linus and tweaked by Alexei and I:

#define __nospec_array_ptr(base, idx, sz)				\
({									\
	union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;	\
	unsigned long _i = (idx);					\
	unsigned long _m = (sz);					\
	unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
	OPTIMIZER_HIDE_VAR(_mask);					\
	__u._ptr = &base[_i & _mask];					\
	__u._bit &= _mask;						\
	__u._ptr;							\
})

Does that address your performance concerns?
On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> originally from Linus and tweaked by Alexei and I:

Sadly, that tweak - while clever - is wrong.

> unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\

Why?

Because "(long)(_m - 1 - _i)" is not negative just because "i >= m". It
can still be positive.

Think "m = 100", "i = bignum". The subtraction will overflow and become
positive again, the shift will shift to zero, and then the mask will
become ~0.

Now, you can fix it, but you need to be a tiny bit more clever. In
particular, just make sure that you retain the high bit of "_i",
basically making the rule be that a negative index is not ever valid.

And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
Now the sign bit is set if _i had it set, _or_ if the subtraction
turned negative, and you don't have to worry about the overflow
situation.

But it does require that extra step to be trustworthy. Still purely
cheap arithmetic operations, although there is possibly some
additional register pressure there.

Somebody might be able to come up with something even more minimal (or
find a fault in my fix of your tweak).

Obviously, with architecture-specific code, you may well be able to do
better, using the carry flag of the subtraction.

For example, on x86, I think you could do it with just two instructions:

	# carry will be clear if idx >= max
	cmpq %idx,%max

	# mask will be clear if carry was clear, ~0 otherwise
	sbbq %mask,%mask

to generate mask directly. I might have screwed that up. Worth perhaps trying?

		Linus
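The overflow Linus describes, and his repair, can be checked numerically in userspace. A sketch assuming a 64-bit long and arithmetic right shift of signed values (which gcc/clang provide, as the kernel relies on); the function names are mine:

```c
#include <assert.h>

#define BITS_PER_LONG 64

/* The tweaked mask: sign of (m - 1 - i) alone. With m = 100 and a
 * huge i, the subtraction wraps back to a positive value, so the
 * mask incorrectly becomes ~0 for an out-of-bounds index. */
static unsigned long mask_buggy(unsigned long i, unsigned long m)
{
	return ~(long)(m - 1 - i) >> (BITS_PER_LONG - 1);
}

/* Linus's repair: OR in i itself, so any index with the sign bit set
 * (any "negative" index) forces the mask to 0, closing the
 * wrap-around hole. */
static unsigned long mask_fixed(unsigned long i, unsigned long m)
{
	return ~(long)(i | (m - 1 - i)) >> (BITS_PER_LONG - 1);
}
```

With m = 100 and i = 2^63 + 1000, (m - 1 - i) wraps to the positive value 2^63 - 901, so the buggy version yields an all-ones mask while the fixed version yields 0.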
On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> 	# carry will be clear if idx >= max
> 	cmpq %idx,%max

Bah. Other way around.

	cmpq %max,%idx

I'm a moron.

> 	# mask will be clear if carry was clear, ~0 otherwise
> 	sbbq %mask,%mask
>
> to generate mask directly. I might have screwed that up. Worth perhaps trying?

More importantly, worth _testing_ and fixing my hand-waving "asm like
this" crap. But I do think that simple two-instruction cmpq/sbbq
sequence could get it right in just two trivial ALU instructions.

		Linus
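What the cmpq/sbbq pair computes can be modeled in portable C as a borrow-based mask. This only models the arithmetic: a compiler is free to lower the comparison back into a branch, which is precisely why an asm version is attractive in the kernel:

```c
#include <assert.h>

/* Model of the cmp/sbb idea: the borrow of the comparison idx < max
 * becomes an all-ones or all-zero mask. Unlike the OR trick, this is
 * exact for every idx/max pair, because the comparison effectively
 * works in BITS_PER_LONG+1 bits. */
static unsigned long mask_from_compare(unsigned long idx, unsigned long max)
{
	/* (idx < max) evaluates to 1 or 0; 0UL - 1 == ~0UL, 0UL - 0 == 0,
	 * playing the role of sbbq's carry-in. */
	return 0UL - (unsigned long)(idx < max);
}
```

On x86-64, compilers commonly lower this to a cmp followed by sbb or setb/neg, matching Linus's two-instruction sketch.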
Dan Williams <dan.j.williams@intel.com> writes:

> On Mon, Jan 8, 2018 at 7:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> Static analysis reports that 'index' may be a user controlled value that
>>> is used as a data dependency reading 'rt' from the 'platform_label'
>>> array. In order to avoid potential leaks of kernel memory values, block
>>> speculative execution of the instruction stream that could issue further
>>> reads based on an invalid 'rt' value.
>>
>> In detail:
>>
>> a) This code is fast path packet forwarding code. Introducing an
>>    unconditional pipeline stall is not ok.
>>
>>    AKA either there is no speculation and so this is invulnerable,
>>    or there is speculation and you are creating an unconditional
>>    pipeline stall here.
>>
>>    My back of the napkin calculations say that a pipeline stall
>>    is about 20 cycles. Which is about the same length of time
>>    as a modern cache miss.
>>
>>    On a good day this code will perform with 0 cache misses. On a less
>>    good day 1 cache miss. Which means you are quite possibly doubling
>>    the runtime of mpls_forward.
>>
>> b) The array is dynamically allocated, which should provide some
>>    protection, as it will be more difficult to predict the address
>>    of the array, which is needed to craft a malicious userspace value.
>>
>> c) The code can be trivially modified to say:
>>
>>    static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
>>    {
>>    	struct mpls_route *rt = NULL;
>>
>>    	if (index < net->mpls.platform_labels) {
>>    		struct mpls_route __rcu **platform_label =
>>    			rcu_dereference(net->mpls.platform_label);
>>    		rt = rcu_dereference(platform_label[index & ((1 << 20) - 1)]);
>>    	}
>>    	return rt;
>>    }
>>
>>    AKA a static mask will ensure that there is not a primitive that can be
>>    used to access all of memory. That is max a 1 cycle slowdown in the
>>    code, which is a much better trade off.
>>
>> d) If we care more, it is straightforward to modify
>>    resize_platform_label_table() to ensure that the size of the array
>>    is always a power of 2.
>>
>> e) The fact that a pointer is returned from the array and it is treated
>>    like a pointer would seem to provide a defense against the
>>    exfiltration technique of using the value read as an index into
>>    a small array, that user space code can probe aliased cache
>>    lines of, to see which value was dereferenced.
>>
>> So to this patch in particular:
>>
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> This code path will be difficult to exploit. This change messes with
>> performance. There are ways to make this code path useless while
>> preserving the performance of the code.
>
> Thanks, Eric, understood. The discussion over the weekend came to the
> conclusion that using a mask will be the default approach. The
> nospec_array_ptr() will be defined to something similar to the
> following, originally from Linus and tweaked by Alexei and I:
>
> #define __nospec_array_ptr(base, idx, sz)				\
> ({									\
> 	union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;	\
> 	unsigned long _i = (idx);					\
> 	unsigned long _m = (sz);					\
> 	unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
> 	OPTIMIZER_HIDE_VAR(_mask);					\
> 	__u._ptr = &base[_i & _mask];					\
> 	__u._bit &= _mask;						\
> 	__u._ptr;							\
> })
>
> Does that address your performance concerns?

The user controlled value condition of your patchset implies that you
assume indirect branch predictor poisoning is handled in other ways.

Which means that we can assume speculation will take some variation of
the static call chain.

Further you are worrying about array accesses. Which means you
are worried about attacks that are the equivalent of meltdown that
can give you reading of all memory available to the kernel.

The mpls code in question reads a pointer from memory.

The only thing the code does with that pointer is verify it is not NULL
and dereference it.

That does not make it possible to exfiltrate the pointer bits via a
cache side-channel, as a pointer is 64 bits wide.

There might maybe be a timing attack where it is timed how long the
packet takes to deliver. If you can find the base address of the array,
at best such a timing attack will tell you whether some arbitrary cache
line is already cached in the kernel. Which is not the class of attack
your patchset is worried about. Further there are more direct ways
to probe the cache from a local process.

So I submit to you that the mpls code is not vulnerable to the class of
attack you are addressing.

Further I would argue that anything that reads a pointer from memory is
a very strong clue that it falls outside the class of code that you are
addressing.

Show me where I am wrong and I will consider patches.

Eric
On Tue, Jan 9, 2018 at 8:17 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
[..]
> The user controlled value condition of your patchset implies that you
> assume indirect branch predictor poisoning is handled in other ways.
>
> Which means that we can assume speculation will take some variation of
> the static call chain.
>
> Further you are worrying about array accesses. Which means you
> are worried about attacks that are the equivalent of meltdown that
> can give you reading of all memory available to the kernel.
>
> The mpls code in question reads a pointer from memory.
>
> The only thing the code does with that pointer is verify it is not NULL
> and dereference it.
>
> That does not make it possible to exfiltrate the pointer bits via a
> cache side-channel, as a pointer is 64 bits wide.
>
> There might maybe be a timing attack where it is timed how long the
> packet takes to deliver. If you can find the base address of the array,
> at best such a timing attack will tell you whether some arbitrary cache
> line is already cached in the kernel. Which is not the class of attack
> your patchset is worried about. Further there are more direct ways
> to probe the cache from a local process.
>
> So I submit to you that the mpls code is not vulnerable to the class of
> attack you are addressing.
>
> Further I would argue that anything that reads a pointer from memory is
> a very strong clue that it falls outside the class of code that you are
> addressing.
>
> Show me where I am wrong and I will consider patches.

No, the concern is a second dependent read (or write) within the
speculation window after this first bounds-checked dependent read.
I.e. this mpls code path appears to have the setup condition:

	if (x < max)
		val = array1[x];

...but it's not clear that there is an exploit condition later on in
the instruction stream:

	array2[val] = y;
	/* or */
	y = array2[val];

My personal paranoia says submit the patch and not worry about finding
that later exploit condition; if DaveM wants to drop the patch that's
his prerogative. In general, with the performance conscious version of
nospec_array_ptr() being the default, why worry about what is / is not
in the speculation window?
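For reference, the two-part gadget shape Dan describes can be written out as a harmless functional model. It performs no cache probing and carries no attacker logic; all names are illustrative:

```c
#include <stddef.h>

/* Shape of the classic variant-1 gadget: a bounds-checked ("setup")
 * load whose result feeds a value-dependent ("exploit") load. During
 * misspeculation past the bounds check, the cache line of array2
 * touched by the second load would leak array1[x]; here it is just
 * ordinary, architecturally correct code. The *64 spreads values
 * across distinct cache lines, as a probe array would. */
static unsigned char gadget(const unsigned char *array1, size_t array1_size,
			    const unsigned char *array2, size_t x)
{
	unsigned char val = 0;

	if (x < array1_size)		/* setup condition */
		val = array1[x];
	return array2[val * 64];	/* exploit condition: value-dependent read */
}
```

Eric's argument is that the mpls path has the setup load but no such second, value-indexed access on the loaded value.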
On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > originally from Linus and tweaked by Alexei and I:
>
> Sadly, that tweak - while clever - is wrong.
>
> > unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
>
> Why?
>
> Because "(long)(_m - 1 - _i)" is not negative just because "i >= m". It
> can still be positive.
>
> Think "m = 100", "i = bignum". The subtraction will overflow and become
> positive again, the shift will shift to zero, and then the mask will
> become ~0.
>
> Now, you can fix it, but you need to be a tiny bit more clever. In
> particular, just make sure that you retain the high bit of "_i",
> basically making the rule be that a negative index is not ever valid.
>
> And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
> Now the sign bit is set if _i had it set, _or_ if the subtraction
> turned negative, and you don't have to worry about the overflow
> situation.
>
> But it does require that extra step to be trustworthy. Still purely
> cheap arithmetic operations, although there is possibly some
> additional register pressure there.
>
> Somebody might be able to come up with something even more minimal (or
> find a fault in my fix of your tweak).

It looks like there is another problem, or I'm misreading the
cleverness. We want the mask to be ~0 in the ok case and 0 in the
out-of-bounds case. As far as I can see we end up with ~0 in the ok
case, and ~1 in the bad case. Don't we need to do something like the
following, at which point are we getting out of the realm of "cheap
ALU instructions"?

#define __nospec_array_ptr(base, idx, sz)				\
({									\
	union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;	\
	unsigned long _i = (idx);					\
	unsigned long _s = (sz);					\
	unsigned long _v = (long)(_i | _s - 1 - _i)			\
				>> BITS_PER_LONG - 1;			\
	unsigned long _mask = _v * ~0UL;				\
	OPTIMIZER_HIDE_VAR(_mask);					\
	__u._ptr = &base[_i & _mask];					\
	__u._bit &= _mask;						\
	__u._ptr;							\
})

    elem = nospec_array_ptr(array, idx, 3);
     106:	b8 02 00 00 00	mov    $0x2,%eax
     10b:	48 63 ff	movslq %edi,%rdi
     10e:	48 29 f8	sub    %rdi,%rax
     111:	48 09 f8	or     %rdi,%rax
     114:	48 c1 e8 3f	shr    $0x3f,%rax
     118:	48 21 c7	and    %rax,%rdi
     11b:	48 8d 54 bc 04	lea    0x4(%rsp,%rdi,4),%rdx
     120:	48 21 d0	and    %rdx,%rax
Dan Williams <dan.j.williams@intel.com> writes:

> On Tue, Jan 9, 2018 at 8:17 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
> [..]
>> The user controlled value condition of your patchset implies that you
>> assume indirect branch predictor poisoning is handled in other ways.
>>
>> Which means that we can assume speculation will take some variation of
>> the static call chain.
>>
>> Further you are worrying about array accesses. Which means you
>> are worried about attacks that are the equivalent of meltdown that
>> can give you reading of all memory available to the kernel.
>>
>> The mpls code in question reads a pointer from memory.
>>
>> The only thing the code does with that pointer is verify it is not NULL
>> and dereference it.
>>
>> That does not make it possible to exfiltrate the pointer bits via a
>> cache side-channel, as a pointer is 64 bits wide.
>>
>> There might maybe be a timing attack where it is timed how long the
>> packet takes to deliver. If you can find the base address of the array,
>> at best such a timing attack will tell you whether some arbitrary cache
>> line is already cached in the kernel. Which is not the class of attack
>> your patchset is worried about. Further there are more direct ways
>> to probe the cache from a local process.
>>
>> So I submit to you that the mpls code is not vulnerable to the class of
>> attack you are addressing.
>>
>> Further I would argue that anything that reads a pointer from memory is
>> a very strong clue that it falls outside the class of code that you are
>> addressing.
>>
>> Show me where I am wrong and I will consider patches.
>
> No, the concern is a second dependent read (or write) within the
> speculation window after this first bounds-checked dependent read.
> I.e. this mpls code path appears to have the setup condition:
>
> 	if (x < max)
> 		val = array1[x];
>
> ...but it's not clear that there is an exploit condition later on in
> the instruction stream:
>
> 	array2[val] = y;
> 	/* or */
> 	y = array2[val];
>
> My personal paranoia says submit the patch and not worry about finding
> that later exploit condition; if DaveM wants to drop the patch that's
> his prerogative. In general, with the performance conscious version of
> nospec_array_ptr() being the default, why worry about what is / is not
> in the speculation window?

Sigh.

I am not worrying about what is in the speculation window. My
assumption is that the speculation window could be infinite, as we don't
know the limitations of all possible processors.

I am saying there is not a way to get the data out of the speculation
window.

I was saying that when you have:

	if (x < max)
		val = array1[x];

when val is a pointer, not an integer, then

	array2[val] = y;
	/* or */
	y = array2[val];

won't happen, but

	val->field;

will happen. Which looks similar. However the address space of pointers
is too large, making it impossible for an attacker to know where to look
in the cache to see if "val->field" happened. At least on the assumption
that val is an arbitrary value.

Further, mpls_forward is small enough that the entire scope of "rt", the
value read possibly past the bounds check, is auditable without too much
trouble. I have looked and I don't see anything that could possibly
allow the value of "rt" to be exfiltrated. The problem continuing to be
that it is a pointer, and the only operation on the pointer besides
dereferencing it is testing if it is NULL.

Other types of timing attacks are very hard if not impossible, because
any packet presenting with a value outside the bounds check will be
dropped. So it will be hard if not impossible to find something to time
to see how long it took to drop the packet.

So no, this code path does not present with the classic signature of
variant 1: bounds check bypass CVE-2017-5753.

Show me where I am wrong and I will gladly take patches. As it is, the
mpls fast path does not appear vulnerable to the issue you are
addressing. So the best thing we can do for performance is nothing at
all.

All I am after is a plausible scenario. I just want to see it spelled
out which combinations of things could possibly be a problem.

Eric
On Tue, Jan 9, 2018 at 4:54 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
[..]
> Sigh.
>
> I am not worrying about what is in the speculation window. My
> assumption is that the speculation window could be infinite, as we don't
> know the limitations of all possible processors.
>
> I am saying there is not a way to get the data out of the speculation
> window.
>
> I was saying that when you have:
>
> 	if (x < max)
> 		val = array1[x];
>
> when val is a pointer, not an integer, then
>
> 	array2[val] = y;
> 	/* or */
> 	y = array2[val];
>
> won't happen, but
>
> 	val->field;
>
> will happen. Which looks similar. However the address space of pointers
> is too large, making it impossible for an attacker to know where to look
> in the cache to see if "val->field" happened. At least on the assumption
> that val is an arbitrary value.
>
> Further, mpls_forward is small enough that the entire scope of "rt", the
> value read possibly past the bounds check, is auditable without too much
> trouble. I have looked and I don't see anything that could possibly
> allow the value of "rt" to be exfiltrated. The problem continuing to be
> that it is a pointer, and the only operation on the pointer besides
> dereferencing it is testing if it is NULL.
>
> Other types of timing attacks are very hard if not impossible, because
> any packet presenting with a value outside the bounds check will be
> dropped. So it will be hard if not impossible to find something to time
> to see how long it took to drop the packet.
>
> So no, this code path does not present with the classic signature of
> variant 1: bounds check bypass CVE-2017-5753.
>
> Show me where I am wrong and I will gladly take patches. As it is, the
> mpls fast path does not appear vulnerable to the issue you are
> addressing. So the best thing we can do for performance is nothing at
> all.

That's completely valid. Let's drop this one.

> All I am after is a plausible scenario. I just want to see it spelled
> out which combinations of things could possibly be a problem.

In fact, I'm not here to spell out the speculative execution problem in
this path beyond the initial user controlled speculative read. As noted
in the cover letter thread, this and the other patches are simply
reports that are fully expected to be resolved as false positives by
sub-system owners in some cases. I'm otherwise more focused on landing
common infrastructure that could be used in the future as
data-flow-analysis tooling improves to find these locations with higher
accuracy.

In other words, the goal at the end of this exercise is to identify a
default nospec_array_ptr() implementation that all sub-systems could
accept for when the problem report turns out to be real, and your
pushback has already resulted in good discussion of alternate
nospec_array_ptr() implementations. Thanks for your patience to keep
the conversation going.
On Tue, Jan 9, 2018 at 4:48 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> >
>> > originally from Linus and tweaked by Alexei and I:
>>
>> Sadly, that tweak - while clever - is wrong.
>>
>> > unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
>>
>> Why?
>>
>> Because "(long)(_m - 1 - _i)" is not negative just because "i >= m". It
>> can still be positive.
>>
>> Think "m = 100", "i = bignum". The subtraction will overflow and become
>> positive again, the shift will shift to zero, and then the mask will
>> become ~0.
>>
>> Now, you can fix it, but you need to be a tiny bit more clever. In
>> particular, just make sure that you retain the high bit of "_i",
>> basically making the rule be that a negative index is not ever valid.
>>
>> And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
>> Now the sign bit is set if _i had it set, _or_ if the subtraction
>> turned negative, and you don't have to worry about the overflow
>> situation.
>>
>> But it does require that extra step to be trustworthy. Still purely
>> cheap arithmetic operations, although there is possibly some
>> additional register pressure there.
>>
>> Somebody might be able to come up with something even more minimal (or
>> find a fault in my fix of your tweak).
>
> It looks like there is another problem, or I'm misreading the
> cleverness. We want the mask to be ~0 in the ok case and 0 in the
> out-of-bounds case. As far as I can see we end up with ~0 in the ok
> case, and ~1 in the bad case. Don't we need to do something like the
> following, at which point are we getting out of the realm of "cheap
> ALU instructions"?
>
> #define __nospec_array_ptr(base, idx, sz)				\
> ({									\
> 	union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;	\
> 	unsigned long _i = (idx);					\
> 	unsigned long _s = (sz);					\
> 	unsigned long _v = (long)(_i | _s - 1 - _i)			\
> 				>> BITS_PER_LONG - 1;			\
> 	unsigned long _mask = _v * ~0UL;				\
> 	OPTIMIZER_HIDE_VAR(_mask);					\
> 	__u._ptr = &base[_i & _mask];					\
> 	__u._bit &= _mask;						\
> 	__u._ptr;							\
> })

Sorry, I'm slow. Of course ~(-1L) is 0.
On Tue, Jan 09, 2018 at 04:48:24PM -0800, Dan Williams wrote:
>
> #define __nospec_array_ptr(base, idx, sz)				\
> ({									\
> 	union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;	\
> 	unsigned long _i = (idx);					\
> 	unsigned long _s = (sz);					\
> 	unsigned long _v = (long)(_i | _s - 1 - _i)			\
> 				>> BITS_PER_LONG - 1;			\
> 	unsigned long _mask = _v * ~0UL;				\
> 	OPTIMIZER_HIDE_VAR(_mask);					\
> 	__u._ptr = &base[_i & _mask];					\
> 	__u._bit &= _mask;						\
> 	__u._ptr;							\
> })

_v * ~0UL doesn't seem right and non intuitive. What's wrong with:

	unsigned long _mask = ~(long)(_i | _s - 1 - _i) >> BITS_PER_LONG - 1;

and why OPTIMIZER_HIDE_VAR?

Could you remove '&'? Since it doesn't work for:

	struct {
		int fd[4];
		...
	} *fdt;

it cannot be used as array_access(fdt->fd, ...);

Could you please drop the nospec_ prefix, since it is misleading? This
macro doesn't prevent speculation. I think array_access() was the best
name so far.
On Tue, Jan 9, 2018 at 5:57 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Jan 09, 2018 at 04:48:24PM -0800, Dan Williams wrote:
>>
>> #define __nospec_array_ptr(base, idx, sz)				\
>> ({									\
>> 	union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;	\
>> 	unsigned long _i = (idx);					\
>> 	unsigned long _s = (sz);					\
>> 	unsigned long _v = (long)(_i | _s - 1 - _i)			\
>> 				>> BITS_PER_LONG - 1;			\
>> 	unsigned long _mask = _v * ~0UL;				\
>> 	OPTIMIZER_HIDE_VAR(_mask);					\
>> 	__u._ptr = &base[_i & _mask];					\
>> 	__u._bit &= _mask;						\
>> 	__u._ptr;							\
>> })
>
> _v * ~0UL doesn't seem right and non intuitive. What's wrong with:
>
> 	unsigned long _mask = ~(long)(_i | _s - 1 - _i) >> BITS_PER_LONG - 1;

Yeah, I noticed it was ok immediately after I sent that.

> and why OPTIMIZER_HIDE_VAR?

It was in Linus' original, but that was when it had the ternary
conditional; I'll drop it. It does not change the generated assembly.

> Could you remove '&'?

Yes, that should be:

	__u._ptr = base + (_i & _mask);

> Since it doesn't work for:
>
> 	struct {
> 		int fd[4];
> 		...
> 	} *fdt;
>
> it cannot be used as array_access(fdt->fd, ...);
>
> Could you please drop the nospec_ prefix, since it is misleading?

When you came up with that tweak you noted:

"The following:
[..]
is generic and no speculative flows."

> This macro doesn't prevent speculation.

It masks dangerous speculation. At least, I read nospec as "No
Spectre", and it is a prefix used in the Spectre-v2 patches. I also
want to include the option, with a static branch, to switch it to the
hard "no speculation" version with an ifence if worse comes to worst
and we find a compiler / cpu where it doesn't work. The default will be
the fast and practical implementation.

> I think array_access() was the best name so far.

For other usages I need the pointer to the array element; also,
array_access() by itself is unsuitable for __fcheck_files because we
still need rcu_dereference_raw() on the element de-reference. So, I
think it's better to get a sanitized array element pointer, which can
be used with rcu, READ_ONCE(), etc. directly, rather than try to do
the access in the same macro.
On Tue, Jan 09, 2018 at 06:22:09PM -0800, Dan Williams wrote:
>
> When you came up with that tweak you noted:
>
> "The following:
> [..]
> is generic and no speculative flows."

I meant 'no speculative control flow'; load speculation still happens.

> > This macro doesn't prevent speculation.
>
> It masks dangerous speculation. At least, I read nospec as "No
> Spectre" and it is a prefix used in the Spectre-v2 patches.

ahh. I thought 'nospec' meant 'no speculation'.
I think it's too much of an honor to use the bug name for a macro
that will be used in many places in the kernel.

> > I think array_access() was the best name so far.
>
> For other usages I need the pointer to the array element, and
> array_access() by itself is unsuitable for __fcheck_files because we
> still need rcu_dereference_raw() on the element de-reference. So, I
> think it's better to get a sanitized array element pointer which can
> be used with rcu, READ_ONCE(), etc... directly rather than try to do
> the access in the same macro.

Makes sense; then array_ptr() should fit ?

I'm hearing rumors that the first cpu with variant 2 and 3 fixed will
be appearing in early 2019. Which is amazing considering cpu release
cycles, but it also means that variant 1 will stay with us for a long
time, so we had better pick a clean interface and name for it.
On Tue, Jan 9, 2018 at 4:48 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> It looks like there is another problem, or I'm misreading the
> cleverness.

I think you were misreading it. I was basically saying that this:

        unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\

doesn't work, and that the "_m - 1 - _i" needs to be replaced by
"_i | _m - 1 - _i". So you have

        unsigned long _mask = ~(long)(_i | _m - 1 - _i) >> BITS_PER_LONG - 1;\

which should give the right result. No?

But as mentioned, I think you can do it with two instructions if you
do an architecture-specific inline asm:

        unsigned long mask;

        asm("cmpq %1,%2; sbbq %0,%0"
                :"=r" (mask)
                :"g" (max),"r" (idx));

which is likely much faster, and has much better register usage ("max"
can be a constant or loaded directly from memory, and "mask" could be
assigned the same register as idx).

But once again, I didn't really test it.

Note that the "cmpq/sbbq" version works regardless of max/idx values,
since it literally does the math in BITS_PER_LONG+1 bits.

In contrast, the "binary or with idx" version only works if the high
bit set in idx cannot be valid (put another way: 'max' must not be
bigger than MAXLONG+1).

                Linus
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 8ca9915befc8..ebcf0e246cfe 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@
 #include <linux/ipv6.h>
 #include <linux/mpls.h>
 #include <linux/netconf.h>
+#include <linux/compiler.h>
 #include <linux/vmalloc.h>
 #include <linux/percpu.h>
 #include <net/ip.h>
@@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
 static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
 	struct mpls_route *rt = NULL;
+	struct mpls_route __rcu **platform_label =
+		rcu_dereference(net->mpls.platform_label);
+	struct mpls_route __rcu **rtp;
 
-	if (index < net->mpls.platform_labels) {
-		struct mpls_route __rcu **platform_label =
-			rcu_dereference(net->mpls.platform_label);
-		rt = rcu_dereference(platform_label[index]);
-	}
+	if ((rtp = nospec_array_ptr(platform_label, index,
+				    net->mpls.platform_labels)))
+		rt = rcu_dereference(*rtp);
 	return rt;
 }