Message ID | 20190911181816.89874-1-christian@cbarcenas.com |
---|---|
State | Changes Requested |
Delegated to: | BPF Maintainers |
Series | [bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check |
On 9/11/19 7:18 PM, Christian Barcenas wrote:
> A process can lock memory addresses into physical RAM explicitly
> (via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
> perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.
>
> CAP_IPC_LOCK allows a process to exceed these limits, and throughout
> the kernel this capability is checked before allowing/denying an attempt
> to lock memory regions into RAM.
>
> Because bpf locks its programs and maps into RAM, it should respect
> CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
> exceeded by a privileged process, which is contrary to documented
> RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.
>
> Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
> Signed-off-by: Christian Barcenas <christian@cbarcenas.com>

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>  kernel/bpf/syscall.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 272071e9112f..e551961f364b 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -183,8 +183,9 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
>  static int bpf_charge_memlock(struct user_struct *user, u32 pages)
>  {
>  	unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	unsigned long locked = atomic_long_add_return(pages, &user->locked_vm);
>
> -	if (atomic_long_add_return(pages, &user->locked_vm) > memlock_limit) {
> +	if (locked > memlock_limit && !capable(CAP_IPC_LOCK)) {
>  		atomic_long_sub(pages, &user->locked_vm);
>  		return -EPERM;
>  	}
> @@ -1231,7 +1232,7 @@ int __bpf_prog_charge(struct user_struct *user, u32 pages)
>
>  	if (user) {
>  		user_bufs = atomic_long_add_return(pages, &user->locked_vm);
> -		if (user_bufs > memlock_limit) {
> +		if (user_bufs > memlock_limit && !capable(CAP_IPC_LOCK)) {
>  			atomic_long_sub(pages, &user->locked_vm);
>  			return -EPERM;
>  		}
>
On 9/11/19 8:18 PM, Christian Barcenas wrote:
> A process can lock memory addresses into physical RAM explicitly
> (via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
> perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.
>
> CAP_IPC_LOCK allows a process to exceed these limits, and throughout
> the kernel this capability is checked before allowing/denying an attempt
> to lock memory regions into RAM.
>
> Because bpf locks its programs and maps into RAM, it should respect
> CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
> exceeded by a privileged process, which is contrary to documented
> RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.

Do you have a link/pointer where this is /clearly/ documented?

Uapi header is not overly clear ...

include/uapi/linux/capability.h says:

    /* Allow locking of shared memory segments */
    /* Allow mlock and mlockall (which doesn't really have anything to do
       with IPC) */

    #define CAP_IPC_LOCK 14

    [...]

    /* Override resource limits. Set resource limits. */
    /* Override quota limits. */
    /* Override reserved space on ext2 filesystem */
    /* Modify data journaling mode on ext3 filesystem (uses journaling
       resources) */
    /* NOTE: ext2 honors fsuid when checking for resource overrides, so
       you can override using fsuid too */
    /* Override size restrictions on IPC message queues */
    /* Allow more than 64hz interrupts from the real-time clock */
    /* Override max number of consoles on console allocation */
    /* Override max number of keymaps */

    #define CAP_SYS_RESOURCE 24

... but my best guess is you are referring to `man 2 mlock`:

    Limits and permissions

        In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK)
        in order to lock memory and the RLIMIT_MEMLOCK soft resource limit
        defines a limit on how much memory the process may lock.

        Since Linux 2.6.9, no limits are placed on the amount of memory that a
        privileged process can lock and the RLIMIT_MEMLOCK soft resource limit
        instead defines a limit on how much memory an unprivileged process may
        lock.

Thanks,
Daniel
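For readers who want to inspect the soft limit that the man page excerpt refers to, the following is a minimal user-space sketch and not part of the patch. The helper name `memlock_soft_limit_pages` is hypothetical; dividing by the page size is intended to mirror the kernel-side `rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT` conversion, assuming a power-of-two page size.

```c
#include <sys/resource.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): read the soft RLIMIT_MEMLOCK
 * of the calling process and convert it from bytes to pages.
 *
 * Returns 0 on success, -1 on failure. When the soft limit is
 * RLIM_INFINITY, *unlimited is set to 1 and *pages is set to 0.
 */
static int memlock_soft_limit_pages(unsigned long *pages, int *unlimited)
{
	struct rlimit rl;
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;

	*unlimited = (rl.rlim_cur == RLIM_INFINITY);
	*pages = *unlimited ? 0 : (unsigned long)(rl.rlim_cur / (rlim_t)page_size);
	return 0;
}
```

The soft limit (`rlim_cur`) is the value the man page says applies to unprivileged processes since Linux 2.6.9; the hard limit (`rlim_max`) only bounds how far the soft limit can be raised.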
> On 9/11/19 8:18 PM, Christian Barcenas wrote:
>> A process can lock memory addresses into physical RAM explicitly
>> (via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
>> perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.
>>
>> CAP_IPC_LOCK allows a process to exceed these limits, and throughout
>> the kernel this capability is checked before allowing/denying an attempt
>> to lock memory regions into RAM.
>>
>> Because bpf locks its programs and maps into RAM, it should respect
>> CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
>> exceeded by a privileged process, which is contrary to documented
>> RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.
>
> Do you have a link/pointer where this is /clearly/ documented?

I admit that after submitting this patch, I did re-think the description
and thought maybe I should have described the CAP_IPC_LOCK behavior as
"expected" rather than "documented". :)

> ... but my best guess is you are referring to `man 2 mlock`:
>
>     Limits and permissions
>
>         In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK)
>         in order to lock memory and the RLIMIT_MEMLOCK soft resource limit
>         defines a limit on how much memory the process may lock.
>
>         Since Linux 2.6.9, no limits are placed on the amount of memory that a
>         privileged process can lock and the RLIMIT_MEMLOCK soft resource limit
>         instead defines a limit on how much memory an unprivileged process may
>         lock.

Yes; this is what I was referring to by "documented
RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior."

Unfortunately - AFAICT - this is the most explicit documentation about
CAP_IPC_LOCK's permission set, but it is incomplete.

I believe it can be understood from other references to RLIMIT and
CAP_IPC_LOCK throughout the kernel that "locking memory" refers not only
to mlock/shmctl syscalls, but also to other code sites where /physical/
memory addresses are allocated for userspace.

After identifying RLIMIT_MEMLOCK checks with

    git grep -C3 '[^(get|set)]rlimit(RLIMIT_MEMLOCK'

we find that RLIMIT_MEMLOCK is bypassed - if CAP_IPC_LOCK is held - in
many locations that have nothing to do with the mlock or shm family of
syscalls. From what I can tell, every time RLIMIT_MEMLOCK is referenced
there is a neighboring check to CAP_IPC_LOCK that bypasses the rlimit,
or in some cases memory accounting entirely!

bpf() is currently the only exception to the above, ie. as far as I can
tell it is the only code that enforces RLIMIT_MEMLOCK but does not honor
CAP_IPC_LOCK.

Selected examples follow:

In net/core/skbuff.c:

    if (capable(CAP_IPC_LOCK) || !size)
        return 0;

    num_pg = (size >> PAGE_SHIFT) + 2; /* worst case */
    max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    user = mmp->user ? : current_user();

    do {
        old_pg = atomic_long_read(&user->locked_vm);
        new_pg = old_pg + num_pg;
        if (new_pg > max_pg)
            return -ENOBUFS;
    } while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
             old_pg);

In net/xdp/xdp_umem.c:

    if (capable(CAP_IPC_LOCK))
        return 0;

    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    umem->user = get_uid(current_user());

    do {
        old_npgs = atomic_long_read(&umem->user->locked_vm);
        new_npgs = old_npgs + umem->npgs;
        if (new_npgs > lock_limit) {
            free_uid(umem->user);
            umem->user = NULL;
            return -ENOBUFS;
        }
    } while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
                                 new_npgs) != old_npgs);
    return 0;

In arch/x86/kvm/svm.c:

    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
        pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
               locked, lock_limit);
        return NULL;
    }

In drivers/infiniband/core/umem.c (and other sites in Infiniband code):

    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
    if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
        atomic64_sub(npages, &mm->pinned_vm);
        ret = -ENOMEM;
        goto out;
    }

In drivers/vfio/vfio_iommu_type1.c, albeit in an indirect way:

    struct vfio_dma {
        bool lock_cap; /* capable(CAP_IPC_LOCK) */
    };

    // ...

    for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
         pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
        // ...
        if (!rsvd && !vfio_find_vpfn(dma, iova)) {
            if (!dma->lock_cap &&
                current->mm->locked_vm + lock_acct + 1 > limit) {
                put_pfn(pfn, dma->prot);
                pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
                        __func__, limit << PAGE_SHIFT);
                ret = -ENOMEM;
                goto unpin_out;
            }
            lock_acct++;
        }
    }

Best,
Christian
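The net/core/skbuff.c and net/xdp/xdp_umem.c excerpts above share one shape: return early when the capability is held (so nothing is accounted), otherwise reserve pages with a compare-and-exchange loop that never lets the counter exceed the limit. The following is a user-space model of that shape using C11 atomics, offered only as a sketch. The function name `model_account_pinned` and the `has_ipc_lock` flag (standing in for `capable(CAP_IPC_LOCK)`) are assumptions made for illustration, not kernel APIs.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * User-space model of the cmpxchg-style accounting quoted above.
 * has_ipc_lock is a stand-in for capable(CAP_IPC_LOCK); locked_vm models
 * the per-user counter of locked pages.
 */
static int model_account_pinned(atomic_ulong *locked_vm, unsigned long num_pg,
				unsigned long max_pg, bool has_ipc_lock)
{
	unsigned long old_pg, new_pg;

	/* Privileged callers skip both the limit check and the accounting. */
	if (has_ipc_lock || num_pg == 0)
		return 0;

	old_pg = atomic_load(locked_vm);
	do {
		new_pg = old_pg + num_pg;
		if (new_pg > max_pg)
			return -ENOBUFS;
		/* On failure, old_pg is reloaded with the current counter value. */
	} while (!atomic_compare_exchange_weak(locked_vm, &old_pg, new_pg));

	return 0;
}
```

In this model the counter is only ever written with a value that already passed the limit check, so a failed request never needs to be undone. The svm.c and Infiniband excerpts take the other approach: add first, then subtract again if the limit was exceeded and the capability is not held.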
On Mon, Sep 16, 2019 at 07:09:06AM -0700, Christian Barcenas wrote:
>
> bpf() is currently the only exception to the above, ie. as far as I can tell
> it is the only code that enforces RLIMIT_MEMLOCK but does not honor
> CAP_IPC_LOCK.

Yes. bpf is not honoring CAP_IPC_LOCK compared to other places in the
kernel, but we cannot change this anymore. User space is already using
rlimit as an enforcement. The bpf_rlimit.h hack we use in selftests is
not a universal way of loading bpf progs. If we make such a change, the
root user will become unlimited and rlimit enforcement will break.
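To make the "rlimit as an enforcement" point concrete: a loader that wants more locked memory has to ask for it explicitly with setrlimit() before calling bpf(). The sketch below shows that general technique only; it does not reproduce bpf_rlimit.h, and the helper name `raise_memlock_soft_limit` is hypothetical. It raises the soft limit to the current hard limit, which an unprivileged process is allowed to do; raising the hard limit itself would need CAP_SYS_RESOURCE.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper (not the selftests header): raise the soft
 * RLIMIT_MEMLOCK of the calling process to its hard limit.
 *
 * Returns 0 on success, -1 on failure (errno is set by the libc call).
 */
static int raise_memlock_soft_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;

	/* The soft limit may be raised up to the hard limit without privilege. */
	rl.rlim_cur = rl.rlim_max;
	return setrlimit(RLIMIT_MEMLOCK, &rl);
}
```

Under the behavior described in the reply above, a deployment can cap what even root-owned loaders may charge by keeping the hard limit low, which is the enforcement that would be lost if CAP_IPC_LOCK bypassed the check.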
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 272071e9112f..e551961f364b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -183,8 +183,9 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 static int bpf_charge_memlock(struct user_struct *user, u32 pages)
 {
 	unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	unsigned long locked = atomic_long_add_return(pages, &user->locked_vm);
 
-	if (atomic_long_add_return(pages, &user->locked_vm) > memlock_limit) {
+	if (locked > memlock_limit && !capable(CAP_IPC_LOCK)) {
 		atomic_long_sub(pages, &user->locked_vm);
 		return -EPERM;
 	}
@@ -1231,7 +1232,7 @@ int __bpf_prog_charge(struct user_struct *user, u32 pages)
 
 	if (user) {
 		user_bufs = atomic_long_add_return(pages, &user->locked_vm);
-		if (user_bufs > memlock_limit) {
+		if (user_bufs > memlock_limit && !capable(CAP_IPC_LOCK)) {
 			atomic_long_sub(pages, &user->locked_vm);
 			return -EPERM;
 		}
A process can lock memory addresses into physical RAM explicitly
(via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.

CAP_IPC_LOCK allows a process to exceed these limits, and throughout
the kernel this capability is checked before allowing/denying an attempt
to lock memory regions into RAM.

Because bpf locks its programs and maps into RAM, it should respect
CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
exceeded by a privileged process, which is contrary to documented
RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.

Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Christian Barcenas <christian@cbarcenas.com>
---
 kernel/bpf/syscall.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
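The behavior change proposed for bpf_charge_memlock() can be tried out in user space with a small model, given here as a sketch under stated assumptions rather than as kernel code. `model_bpf_charge_memlock` is a hypothetical name, `has_ipc_lock` stands in for `capable(CAP_IPC_LOCK)`, and C11 `atomic_fetch_add` plus the added amount is used to imitate `atomic_long_add_return`.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * User-space model of the patched charge logic: charge first, then undo
 * the charge only if the limit is exceeded and the capability is absent.
 */
static int model_bpf_charge_memlock(atomic_ulong *locked_vm,
				    unsigned long pages,
				    unsigned long memlock_limit,
				    bool has_ipc_lock)
{
	/* atomic_fetch_add returns the old value, so add pages for the new total. */
	unsigned long locked = atomic_fetch_add(locked_vm, pages) + pages;

	if (locked > memlock_limit && !has_ipc_lock) {
		atomic_fetch_sub(locked_vm, pages);	/* roll back the charge */
		return -EPERM;
	}
	return 0;
}
```

Note that in this model a privileged caller is still charged: the counter grows past the limit instead of being left untouched. That differs from the early-return pattern in the skbuff and xdp_umem code, where a caller with CAP_IPC_LOCK is not accounted at all.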