| Message ID | 20180213040600.5821-1-sandipan@linux.vnet.ibm.com (mailing list archive) |
|---|---|
| State | Not Applicable |
| Series | [RFC,bpf,v2,1/2] bpf: allow 64-bit offsets for bpf function calls |
On 02/13/2018 05:05 AM, Sandipan Das wrote:
> The imm field of a bpf_insn is a signed 32-bit integer. For
> JIT-ed bpf-to-bpf function calls, it stores the offset from
> __bpf_call_base to the start of the callee function.
>
> For some architectures, such as powerpc64, it was found that
> this offset may be as large as 64 bits because of which this
> cannot be accommodated in the imm field without truncation.
>
> To resolve this, we additionally make aux->func within each
> bpf_prog associated with the functions point to the list
> of all function addresses determined by the verifier.
>
> We keep the value assigned to the off field of the bpf_insn
> as a way to index into aux->func and also set aux->func_cnt
> so that this can be used for performing basic upper bound
> checks for the off field.
>
> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
> ---
> v2: Make aux->func point to the list of functions determined
>     by the verifier rather than allocating a separate callee
>     list for each function.

Approach looks good to me; do you know whether the s390x JIT would have a similar requirement? I think one limitation that would still need to be addressed later with such an approach is the xlated prog dump in bpftool; see 'BPF calls via JIT' in 7105e828c087 ("bpf: allow for correlation of maps and helpers in dump"). Any ideas for this? (Potentially, if we could use off + imm for calls, we'd get to 48 bits, but that still seems not to be enough, as you say.)

Thanks,
Daniel
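To make the truncation being discussed concrete, here is a small userspace sketch of the round trip the current scheme performs through the signed 32-bit imm field. The addresses are invented for illustration (they are not real kernel layout), and the narrowing cast relies on the usual GCC wrap-around behavior:

```c
#include <assert.h>
#include <stdint.h>

/* Pretend a JIT-ed callee and __bpf_call_base live at these (made-up)
 * addresses. On powerpc64 they can be arbitrarily far apart. */
static int64_t call_offset(uint64_t callee, uint64_t base)
{
	return (int64_t)(callee - base);
}

/* What the current scheme does: squeeze the offset into the s32 imm.
 * The high 32 bits are silently dropped (implementation-defined, but
 * GCC wraps). */
static int32_t encode_imm(int64_t off)
{
	return (int32_t)off;
}

/* Recover the callee address: base plus the sign-extended imm. */
static uint64_t decode_callee(uint64_t base, int32_t imm)
{
	return base + (uint64_t)(int64_t)imm;
}

/* The round trip only survives when the offset fits in 32 bits. */
int fits_in_imm(uint64_t callee, uint64_t base)
{
	int64_t off = call_offset(callee, base);

	return decode_callee(base, encode_imm(off)) == callee;
}
```

A nearby callee round-trips fine; one a full address-space region away does not, which is exactly the powerpc64 problem.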
On 02/15/2018 05:25 PM, Daniel Borkmann wrote:
> On 02/13/2018 05:05 AM, Sandipan Das wrote:
>> The imm field of a bpf_insn is a signed 32-bit integer. For
>> JIT-ed bpf-to-bpf function calls, it stores the offset from
>> __bpf_call_base to the start of the callee function.
>> [...]
>
> Approach looks good to me; do you know whether s390x JIT would
> have similar requirement? I think one limitation that would still
> need to be addressed later with such approach would be regarding the
> xlated prog dump in bpftool [...]

One other random thought, although I'm not sure how feasible this is for the ppc64 JIT to realize ... but the idea would be to have something like the below:

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 29ca920..daa7258 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -512,6 +512,11 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
 	return ret;
 }
 
+void * __weak bpf_jit_image_alloc(unsigned long size)
+{
+	return module_alloc(size);
+}
+
 struct bpf_binary_header *
 bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     unsigned int alignment,
@@ -525,7 +530,7 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 	 * random section of illegal instructions.
 	 */
 	size = round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE);
-	hdr = module_alloc(size);
+	hdr = bpf_jit_image_alloc(size);
 	if (hdr == NULL)
 		return NULL;

And the ppc64 JIT could override bpf_jit_image_alloc() in a similar way to how some archs override the module_alloc() helper with a custom implementation, usually via __vmalloc_node_range(), so we could perhaps fit the range for BPF JIT-ed images such that they could use the 32-bit imm in the end? There are typically not that many progs loaded, so the range could be a bit narrower in such a case anyway. (Not sure if this would work out though, but I thought I'd bring it up.)

Thanks,
Daniel
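The point of such an override is to place JIT images where their offset from __bpf_call_base fits the 32-bit imm. A hedged userspace sketch of that invariant, using a toy bump allocator over a reserved window near a pretend call base (real code would have to reserve such a window, e.g. via __vmalloc_node_range(), which is exactly the hard part on ppc64):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative addresses only: a pretend __bpf_call_base and a JIT
 * window starting 1MB above it, capped so every image stays within
 * +2GB of the base. */
#define CALL_BASE 0xc000000001000000ULL
#define JIT_START (CALL_BASE + 0x100000ULL)
#define JIT_END   (CALL_BASE + 0x7fffffffULL)

static uint64_t jit_cursor = JIT_START;

/* Toy bump allocator standing in for a bpf_jit_image_alloc() override:
 * returns a fake image address, or 0 when the window is exhausted. */
uint64_t jit_image_alloc(uint64_t size)
{
	uint64_t addr;

	if (jit_cursor + size > JIT_END)
		return 0;
	addr = jit_cursor;
	jit_cursor += size;
	return addr;
}

/* The invariant the whole idea depends on: the offset from the call
 * base round-trips through a signed 32-bit imm. */
int offset_fits_imm32(uint64_t addr)
{
	int64_t off = (int64_t)(addr - CALL_BASE);

	return off == (int64_t)(int32_t)off;
}

/* Allocate an image and verify the invariant holds for it. */
int alloc_and_check(uint64_t size)
{
	uint64_t addr = jit_image_alloc(size);

	return addr != 0 && offset_fits_imm32(addr);
}
```

Once the allocator can no longer satisfy the invariant it must fail the allocation, which is why the window has to be bounded rather than falling back to an arbitrary vmalloc address.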
Daniel Borkmann wrote:
> On 02/15/2018 05:25 PM, Daniel Borkmann wrote:
>> On 02/13/2018 05:05 AM, Sandipan Das wrote:
>>> [...]
>>
>> Approach looks good to me; do you know whether s390x JIT would
>> have similar requirement? I think one limitation that would still
>> need to be addressed later with such approach would be regarding the
>> xlated prog dump in bpftool [...]

All good points. I'm not really sure how s390x works, so I can't comment on that, but I'm copying Michael Holzheu for his consideration.

With the existing scheme, 48 bits won't be enough, so we rejected that approach. I can also see how this will be a problem with bpftool, but I haven't looked into it in detail. I wonder if we can annotate the output to indicate the function being referred to?

> One other random thought, although I'm not sure how feasible this
> is for ppc64 JIT to realize ... but idea would be to have something
> like the below:
>
> [diff snipped]
>
> And ppc64 JIT could override bpf_jit_image_alloc() in a similar way
> like some archs would override the module_alloc() helper through a
> custom implementation, usually via __vmalloc_node_range(), so we
> could perhaps fit the range for BPF JITed images in a way that they
> could use the 32bit imm in the end? [...]

That'd be a good option to consider. I don't think we want to allocate anything from the linear memory range since users could load unprivileged BPF programs and consume a lot of memory that way. I doubt we can map vmalloc'ed memory into the 0xc0 address range, but I'm not entirely sure.

Michael,
Is the above possible? The question is whether we can have BPF programs allocated within 4GB of __bpf_call_base (which is a kernel symbol), so that calls to those programs can be encoded in a 32-bit immediate field in a BPF instruction. As an extension, we may be able to extend this to 48 bits by combining it with another BPF instruction field (offset). In either case, the vmalloc'ed address range won't work.

The alternative is to pass the full 64-bit address of the BPF program in an auxiliary field (as proposed in this patch set), but we need to fix it up for 'bpftool' as well.

Thanks,
Naveen
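Combining the 16-bit off field with the 32-bit imm to carry a 48-bit offset, as floated above, would amount to arithmetic along these lines. This is only a sketch of the encoding; how the fields would actually be split is a kernel ABI decision, and the struct here is a simplified stand-in for bpf_insn:

```c
#include <assert.h>
#include <stdint.h>

/* A bpf_insn has a 16-bit signed off and a 32-bit signed imm. Pack a
 * signed 48-bit offset across both: imm takes the low 32 bits, off
 * takes the next 16. */
struct fake_insn {
	int16_t off;
	int32_t imm;
};

/* Returns -1 if the offset does not fit in a signed 48-bit value. */
int encode48(struct fake_insn *insn, int64_t offset)
{
	if (offset >= (1LL << 47) || offset < -(1LL << 47))
		return -1;
	insn->imm = (int32_t)(offset & 0xffffffffLL);	/* low 32 bits */
	insn->off = (int16_t)((offset >> 32) & 0xffff);	/* bits 47:32 */
	return 0;
}

int64_t decode48(const struct fake_insn *insn)
{
	/* Sign-extend off into the top half, keep imm's raw low bits. */
	uint64_t hi = (uint64_t)(int64_t)insn->off << 32;

	return (int64_t)(hi | (uint32_t)insn->imm);
}

/* Convenience check: does an offset survive the round trip? */
int roundtrips48(int64_t offset)
{
	struct fake_insn insn;

	if (encode48(&insn, offset) != 0)
		return 0;
	return decode48(&insn) == offset;
}
```

Even with this, a full 64-bit address delta (the powerpc64 case) is rejected, which is why the thread concludes 48 bits is still not enough.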
"Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
> Daniel Borkmann wrote:
>> [...]
>
> That'd be a good option to consider. I don't think we want to allocate
> anything from the linear memory range since users could load
> unprivileged BPF programs and consume a lot of memory that way. I doubt
> if we can map vmalloc'ed memory into the 0xc0 address range, but I'm not
> entirely sure.
>
> Michael,
> Is the above possible? The question is if we can have BPF programs be
> allocated within 4GB of __bpf_call_base (which is a kernel symbol), so
> that calls to those programs can be encoded in a 32-bit immediate field
> in a BPF instruction.

Hmmm.

It's not technically impossible, but I don't think it's really a good option.

The 0xc range is a linear mapping of RAM, and the kernel tends to be near the start of RAM for reasons. That means there generally isn't a hole in the 0xc range within 4GB for you to map BPF programs.

You could create a hole by making the 0xc mapping non-linear, i.e. mapping some RAM near the kernel elsewhere in the 0xc range, to make a hole that you can then remap BPF programs into. But I think that would cause a lot of bugs; it's a pretty fundamental assumption that the linear mapping is 1:1.

> As an extension, we may be able to extend it to
> 48-bits by combining with another BPF instruction field (offset). In
> either case, the vmalloc'ed address range won't work.

48 bits could possibly work; we don't have systems with that much RAM *yet*. So you could remap the BPF programs at the end of the 0xc range, or somewhere we have a gap in RAM.

That would probably still confuse some things, i.e. the 0xc mapping would be a 1:1 mapping of RAM plus some other stuff, but it might be tractable to fix.

We don't have page tables for the 0xc range when using the hash MMU, so we'd need to either fix that or use some bespoke data structure for keeping track of the mappings.

It doesn't really appeal :)

We load modules in the 0xd range, and they call functions in the kernel; we handle that with trampolines. Can you do something similar?

You obviously still need a 64-bit branch somewhere, but perhaps you only need one, or one per BPF program, rather than one per function call.

> The alternative is to pass the full 64-bit address of the BPF program in
> an auxiliary field (as proposed in this patch set) but we need to fix it
> up for 'bpftool' as well.

OK. You'll have to explain to me how bpftool is related; I thought it was just for loading/examining BPF programs.

cheers
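For reference, the reason a 64-bit branch is expensive on ppc64: with 16-bit immediates per instruction, materializing a full address before a mtctr/bctrl takes a five-instruction sequence (lis; ori; shift left 32; oris; ori). The sketch below models the arithmetic each step performs, not the JIT's actual emission code; the instruction mapping in the comments is an approximation:

```c
#include <assert.h>
#include <stdint.h>

/* Model the classic ppc64 64-bit constant load used by trampolines.
 * Each statement corresponds to one instruction's effect on the
 * destination register. */
uint64_t load_imm64(uint64_t addr)
{
	uint64_t r;

	r = (addr >> 48) << 16;			/* lis: bits 63:48 into 31:16 */
	r |= (addr >> 32) & 0xffff;		/* ori: bits 47:32 into 15:0 */
	r = (r << 32) & 0xffffffff00000000ULL;	/* rldicr: move into top half */
	r |= ((addr >> 16) & 0xffff) << 16;	/* oris: bits 31:16 */
	r |= addr & 0xffff;			/* ori: bits 15:0 */
	return r;
}
```

Five instructions plus mtctr and bctrl per call site is what makes a per-program (or single) trampoline attractive compared to inlining this at every bpf-to-bpf call.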
Michael Ellerman wrote:
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
>> [...]
>
> Hmmm.
>
> It's not technically impossible, but I don't think it's really a good
> option.
>
> [...]
>
> It doesn't really appeal :)

Thanks for the detailed explanation. That does look gross :)

> We load modules in the 0xd range, and they call functions in the kernel,
> we handle that with trampolines. Can you do something similar?

Yes, we already handle this in our JIT. The scenario being considered here is how the BPF core informs the BPF JIT engine when it has to call out to a different JIT'ed BPF function. The existing approach encodes an offset from __bpf_call_base() in the imm32 field of the BPF instruction itself. For powerpc64, though, I think we'll have to have the addresses of the JIT'ed BPF functions passed in through other means.

> You obviously still need a 64-bit branch somewhere, but perhaps you only
> need one, or one per BPF program, rather than one per function call.

We may have more than a single branch depending on how the program is laid out. We are looking to see if it is worth detecting that a call target is close by so we can use a relative branch, rather than loading a full 64-bit address and branching to it. For BPF-to-BPF calls, we should also be able to skip all the r12/r2 GEP/TOC handling. We may need to emit an additional branch to skip over some instructions (or emit nops) just so we consistently emit a fixed number of instructions in each pass.

>> The alternative is to pass the full 64-bit address of the BPF program in
>> an auxiliary field (as proposed in this patch set) but we need to fix it
>> up for 'bpftool' as well.
>
> OK. You'll have to explain to me how bpftool is related, I thought it
> was just for loading/examining BPF programs.

It is, and it can also dump loaded BPF programs. When such programs call out to other BPF programs, bpftool currently figures out the address of the target BPF function by essentially computing (__bpf_call_base + imm32) and looking it up in kallsyms. Since we've established that imm32 is not enough for us, this won't work on powerpc64. However, if bpf_jit_kallsyms is not enabled, it looks like bpftool will still fail to resolve the call and only prints out an address today.

I'm wondering if we can instead encode the BPF prog id in imm32. That way, we should be able to indicate the BPF function being called into. Daniel, is that something we can consider?

Thanks!
- Naveen
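The approach this patch set takes — carrying an index in the off field and resolving it through a per-prog table — reduces to an array lookup with a bounds check. A userspace sketch of that shape (the names mirror aux->func/aux->func_cnt from the patch, but the surrounding types are simplified stand-ins):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A JIT-ed BPF function takes five u64 arguments, matching the BPF
 * calling convention. */
typedef uint64_t (*bpf_func_t)(uint64_t, uint64_t, uint64_t,
			       uint64_t, uint64_t);

/* Simplified stand-in for the relevant part of bpf_prog_aux. */
struct fake_prog_aux {
	bpf_func_t *func;	/* callee list, filled by the verifier */
	uint32_t func_cnt;
};

/* Two toy "JIT-ed subprograms". */
static uint64_t sub0(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
		     uint64_t e)
{ (void)b; (void)c; (void)d; (void)e; return a + 1; }

static uint64_t sub1(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
		     uint64_t e)
{ (void)c; (void)d; (void)e; return a + b; }

bpf_func_t table[] = { sub0, sub1 };
struct fake_prog_aux aux = { table, 2 };

/* Resolve a call target from the off index, with the basic upper-bound
 * check the commit message mentions; NULL if out of range. */
bpf_func_t resolve_call(const struct fake_prog_aux *a, int16_t off)
{
	if (off < 0 || (uint32_t)off >= a->func_cnt)
		return NULL;
	return a->func[off];
}
```

Note this sidesteps the imm32 width entirely: the full 64-bit address lives in the table, and the instruction only needs enough bits to index it.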
On Fri, 16 Feb 2018 21:20:09 +0530, "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> wrote:
> Daniel Borkmann wrote:
>> [...]
>
> All good points. I'm not really sure how s390x works, so I can't comment
> on that, but I'm copying Michael Holzheu for his consideration.
>
> [...]
>
> The alternative is to pass the full 64-bit address of the BPF program in
> an auxiliary field (as proposed in this patch set) but we need to fix it
> up for 'bpftool' as well.

Hi Naveen,

Our s390 kernel maintainer Martin Schwidefsky took over eBPF responsibility for s390 from me.

@Martin: Can you answer Naveen's question?

Michael
On Thu, 22 Feb 2018 13:06:40 +0100, Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:
> [...]
>
> @Martin: Can you answer Naveen's question?
>
> Michael

Damn, I forgot to add Martin to cc. Time for vacation ;-)

Michael
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5fb69a85d967..1c4d9cd485ed 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5288,11 +5288,25 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 			    insn->src_reg != BPF_PSEUDO_CALL)
 				continue;
 			subprog = insn->off;
-			insn->off = 0;
 			insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
 				func[subprog]->bpf_func -
 				__bpf_call_base;
 		}
+
+		/* the offset to a callee function from __bpf_call_base
+		 * may be larger than what the 32-bit integer imm can
+		 * accommodate, which will truncate the higher order bits
+		 *
+		 * to avoid this, we additionally utilize the aux data
+		 * of each function to point to a list of all function
+		 * addresses determined by the verifier
+		 *
+		 * the off field of the instruction provides the index
+		 * in this list where the start address of a function
+		 * is available
+		 */
+		func[i]->aux->func = func;
+		func[i]->aux->func_cnt = env->subprog_cnt + 1;
 	}
 
 	for (i = 0; i <= env->subprog_cnt; i++) {
 		old_bpf_func = func[i]->bpf_func;
The imm field of a bpf_insn is a signed 32-bit integer. For JIT-ed bpf-to-bpf function calls, it stores the offset from __bpf_call_base to the start of the callee function.

For some architectures, such as powerpc64, it was found that this offset may be as large as 64 bits, because of which it cannot be accommodated in the imm field without truncation.

To resolve this, we additionally make aux->func within each bpf_prog associated with the functions point to the list of all function addresses determined by the verifier.

We keep the value assigned to the off field of the bpf_insn as a way to index into aux->func and also set aux->func_cnt so that this can be used for performing basic upper-bound checks on the off field.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
---
v2: Make aux->func point to the list of functions determined
    by the verifier rather than allocating a separate callee
    list for each function.
---
 kernel/bpf/verifier.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)