[bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check

Message ID 20190911181816.89874-1-christian@cbarcenas.com
State Changes Requested
Delegated to: BPF Maintainers

Commit Message

Christian Barcenas Sept. 11, 2019, 6:18 p.m. UTC
A process can lock memory addresses into physical RAM explicitly
(via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.

CAP_IPC_LOCK allows a process to exceed these limits, and throughout
the kernel this capability is checked before allowing/denying an attempt
to lock memory regions into RAM.

Because bpf locks its programs and maps into RAM, it should respect
CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
exceeded by a privileged process, which is contrary to documented
RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.

Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Christian Barcenas <christian@cbarcenas.com>
---
 kernel/bpf/syscall.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Yonghong Song Sept. 13, 2019, 6:48 p.m. UTC | #1
On 9/11/19 7:18 PM, Christian Barcenas wrote:
> A process can lock memory addresses into physical RAM explicitly
> (via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
> perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.
> 
> CAP_IPC_LOCK allows a process to exceed these limits, and throughout
> the kernel this capability is checked before allowing/denying an attempt
> to lock memory regions into RAM.
> 
> Because bpf locks its programs and maps into RAM, it should respect
> CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
> exceeded by a privileged process, which is contrary to documented
> RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.
> 
> Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
> Signed-off-by: Christian Barcenas <christian@cbarcenas.com>

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>   kernel/bpf/syscall.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 272071e9112f..e551961f364b 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -183,8 +183,9 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
>   static int bpf_charge_memlock(struct user_struct *user, u32 pages)
>   {
>   	unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	unsigned long locked = atomic_long_add_return(pages, &user->locked_vm);
>   
> -	if (atomic_long_add_return(pages, &user->locked_vm) > memlock_limit) {
> +	if (locked > memlock_limit && !capable(CAP_IPC_LOCK)) {
>   		atomic_long_sub(pages, &user->locked_vm);
>   		return -EPERM;
>   	}
> @@ -1231,7 +1232,7 @@ int __bpf_prog_charge(struct user_struct *user, u32 pages)
>   
>   	if (user) {
>   		user_bufs = atomic_long_add_return(pages, &user->locked_vm);
> -		if (user_bufs > memlock_limit) {
> +		if (user_bufs > memlock_limit && !capable(CAP_IPC_LOCK)) {
>   			atomic_long_sub(pages, &user->locked_vm);
>   			return -EPERM;
>   		}
>
Daniel Borkmann Sept. 16, 2019, 9:26 a.m. UTC | #2
On 9/11/19 8:18 PM, Christian Barcenas wrote:
> A process can lock memory addresses into physical RAM explicitly
> (via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
> perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.
> 
> CAP_IPC_LOCK allows a process to exceed these limits, and throughout
> the kernel this capability is checked before allowing/denying an attempt
> to lock memory regions into RAM.
> 
> Because bpf locks its programs and maps into RAM, it should respect
> CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
> exceeded by a privileged process, which is contrary to documented
> RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.

Do you have a link/pointer where this is /clearly/ documented?

Uapi header is not overly clear ...

include/uapi/linux/capability.h says:

   /* Allow locking of shared memory segments */
   /* Allow mlock and mlockall (which doesn't really have anything to do
      with IPC) */

   #define CAP_IPC_LOCK         14

   [...]

   /* Override resource limits. Set resource limits. */
   /* Override quota limits. */
   /* Override reserved space on ext2 filesystem */
   /* Modify data journaling mode on ext3 filesystem (uses journaling
      resources) */
   /* NOTE: ext2 honors fsuid when checking for resource overrides, so
      you can override using fsuid too */
   /* Override size restrictions on IPC message queues */
   /* Allow more than 64hz interrupts from the real-time clock */
   /* Override max number of consoles on console allocation */
   /* Override max number of keymaps */

   #define CAP_SYS_RESOURCE     24

... but my best guess is you are referring to `man 2 mlock`:

    Limits and permissions

        In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK)
        in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines
        a limit on how much memory the process may lock.

        Since  Linux  2.6.9, no limits are placed on the amount of memory that a
        privileged process can lock and the RLIMIT_MEMLOCK soft resource limit
        instead defines a limit on how much memory an unprivileged process may lock.

Thanks,
Daniel
Christian Barcenas Sept. 16, 2019, 2:09 p.m. UTC | #3
> On 9/11/19 8:18 PM, Christian Barcenas wrote:
>> A process can lock memory addresses into physical RAM explicitly
>> (via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
>> perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.
>>
>> CAP_IPC_LOCK allows a process to exceed these limits, and throughout
>> the kernel this capability is checked before allowing/denying an attempt
>> to lock memory regions into RAM.
>>
>> Because bpf locks its programs and maps into RAM, it should respect
>> CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
>> exceeded by a privileged process, which is contrary to documented
>> RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.
> 
> Do you have a link/pointer where this is /clearly/ documented?

I admit that after submitting this patch, I did re-think the description 
and thought maybe I should have described the CAP_IPC_LOCK behavior as 
"expected" rather than "documented". :)

> ... but my best guess is you are referring to `man 2 mlock`:
> 
>     Limits and permissions
> 
>         In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK)
>         in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines
>         a limit on how much memory the process may lock.
> 
>         Since  Linux  2.6.9, no limits are placed on the amount of memory that a
>         privileged process can lock and the RLIMIT_MEMLOCK soft resource limit
>         instead defines a limit on how much memory an unprivileged process may lock.

Yes; this is what I was referring to by "documented 
RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior."

Unfortunately - AFAICT - this is the most explicit documentation about 
CAP_IPC_LOCK's permission set, but it is incomplete.

I believe it can be understood from other references to RLIMIT and 
CAP_IPC_LOCK throughout the kernel that "locking memory" refers not only 
to mlock/shmctl syscalls, but also to other code sites where /physical/ 
memory addresses are allocated for userspace.

After identifying RLIMIT_MEMLOCK checks with

     git grep -C3 '[^(get|set)]rlimit(RLIMIT_MEMLOCK'

we find that RLIMIT_MEMLOCK is bypassed - if CAP_IPC_LOCK is held - in 
many locations that have nothing to do with the mlock or shm family of 
syscalls. From what I can tell, every time RLIMIT_MEMLOCK is referenced 
there is a neighboring check to CAP_IPC_LOCK that bypasses the rlimit, 
or in some cases memory accounting entirely!

bpf() is currently the only exception to the above, i.e., as far as I can
tell it is the only code that enforces RLIMIT_MEMLOCK but does not honor
CAP_IPC_LOCK.

Selected examples follow:

In net/core/skbuff.c:

     if (capable(CAP_IPC_LOCK) || !size)
             return 0;

     num_pg = (size >> PAGE_SHIFT) + 2;      /* worst case */
     max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
     user = mmp->user ? : current_user();

     do {
             old_pg = atomic_long_read(&user->locked_vm);
             new_pg = old_pg + num_pg;
             if (new_pg > max_pg)
                     return -ENOBUFS;
     } while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
              old_pg);

In net/xdp/xdp_umem.c:

     if (capable(CAP_IPC_LOCK))
             return 0;

     lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
     umem->user = get_uid(current_user());

     do {
             old_npgs = atomic_long_read(&umem->user->locked_vm);
             new_npgs = old_npgs + umem->npgs;
             if (new_npgs > lock_limit) {
                     free_uid(umem->user);
                     umem->user = NULL;
                     return -ENOBUFS;
             }
     } while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
                                  new_npgs) != old_npgs);
     return 0;

In arch/x86/kvm/svm.c:

     lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
     if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
             pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
                    locked, lock_limit);
             return NULL;
     }

In drivers/infiniband/core/umem.c (and other sites in Infiniband code):

     lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

     new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
     if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
             atomic64_sub(npages, &mm->pinned_vm);
             ret = -ENOMEM;
             goto out;
     }

In drivers/vfio/vfio_iommu_type1.c, albeit in an indirect way:

     struct vfio_dma {
         bool                 lock_cap;       /* capable(CAP_IPC_LOCK) */
     };

     // ...

     for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
          pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
             // ...

             if (!rsvd && !vfio_find_vpfn(dma, iova)) {
                     if (!dma->lock_cap &&
                         current->mm->locked_vm + lock_acct + 1 > limit) {
                             put_pfn(pfn, dma->prot);
                             pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
                                     __func__, limit << PAGE_SHIFT);
                             ret = -ENOMEM;
                             goto unpin_out;
                     }
                     lock_acct++;
             }
     }

Best,
Christian
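
The skbuff and xdp_umem excerpts above share one shape: return early if the
caller holds CAP_IPC_LOCK, otherwise commit the new page count only if it fits
under the limit. A minimal userspace sketch of that shape follows. It is a
model, not kernel code: model_locked_vm, model_account_pinned and the
has_ipc_lock flag are hypothetical stand-ins for user->locked_vm, the kernel
functions and capable(CAP_IPC_LOCK).

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-user page counter (user->locked_vm). */
static atomic_long model_locked_vm;

/*
 * Check-then-commit accounting: a privileged caller skips accounting
 * altogether; otherwise the counter changes only if the new total fits.
 */
int model_account_pinned(long num_pg, long max_pg, bool has_ipc_lock)
{
	long old_pg, new_pg;

	if (has_ipc_lock || num_pg == 0)
		return 0;

	old_pg = atomic_load(&model_locked_vm);
	do {
		new_pg = old_pg + num_pg;
		if (new_pg > max_pg)
			return -ENOBUFS;
		/* On failure, old_pg is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(&model_locked_vm,
					       &old_pg, new_pg));

	return 0;
}
```

In this variant a privileged caller is never charged at all, so the counter
stays untouched, and a rejected request never shows up in the counter even
briefly.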
Alexei Starovoitov Sept. 16, 2019, 10:19 p.m. UTC | #4
On Mon, Sep 16, 2019 at 07:09:06AM -0700, Christian Barcenas wrote:
> 
> bpf() is currently the only exception to the above, ie. as far as I can tell
> it is the only code that enforces RLIMIT_MEMLOCK but does not honor
> CAP_IPC_LOCK.

Yes. bpf does not honor CAP_IPC_LOCK, unlike other places in the kernel,
but we cannot change this anymore. User space is already using rlimit as an
enforcement mechanism. The bpf_rlimit.h hack we use in selftests is not a
universal way of loading bpf progs. If we made such a change, the root user
would become unlimited and rlimit enforcement would break.
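
For context, user space that loads bpf programs commonly raises
RLIMIT_MEMLOCK itself before calling bpf(). The sketch below shows one way a
loader might do that. It is not a copy of the selftests header, and
lift_memlock_rlimit is a hypothetical name; the fallback to the hard limit is
an assumption about what an unprivileged loader would want.

```c
#include <sys/resource.h>

/*
 * Try to lift RLIMIT_MEMLOCK before loading BPF objects.
 * Returns 0 on success, -1 if neither attempt succeeded.
 */
int lift_memlock_rlimit(void)
{
	struct rlimit old, unlimited = {
		.rlim_cur = RLIM_INFINITY,
		.rlim_max = RLIM_INFINITY,
	};

	if (getrlimit(RLIMIT_MEMLOCK, &old) < 0)
		return -1;

	/* Raising the hard limit normally requires CAP_SYS_RESOURCE. */
	if (setrlimit(RLIMIT_MEMLOCK, &unlimited) == 0)
		return 0;

	/* Fallback: any process may raise its soft limit up to the hard limit. */
	old.rlim_cur = old.rlim_max;
	return setrlimit(RLIMIT_MEMLOCK, &old) < 0 ? -1 : 0;
}
```

A loader would call this once at startup. Because the kernel check discussed
in this thread applies to every caller, a deployment can cap bpf memory by
controlling this limit, which is the enforcement that the reply above refers
to.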

Patch

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 272071e9112f..e551961f364b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -183,8 +183,9 @@  void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 static int bpf_charge_memlock(struct user_struct *user, u32 pages)
 {
 	unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	unsigned long locked = atomic_long_add_return(pages, &user->locked_vm);
 
-	if (atomic_long_add_return(pages, &user->locked_vm) > memlock_limit) {
+	if (locked > memlock_limit && !capable(CAP_IPC_LOCK)) {
 		atomic_long_sub(pages, &user->locked_vm);
 		return -EPERM;
 	}
@@ -1231,7 +1232,7 @@  int __bpf_prog_charge(struct user_struct *user, u32 pages)
 
 	if (user) {
 		user_bufs = atomic_long_add_return(pages, &user->locked_vm);
-		if (user_bufs > memlock_limit) {
+		if (user_bufs > memlock_limit && !capable(CAP_IPC_LOCK)) {
 			atomic_long_sub(pages, &user->locked_vm);
 			return -EPERM;
 		}
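
The behavioral change in the first hunk can be modelled in userspace. This is
a sketch only: struct model_user, model_charge_memlock and the has_ipc_lock
flag are hypothetical stand-ins for struct user_struct, bpf_charge_memlock()
and capable(CAP_IPC_LOCK), and the limit is passed in rather than read from
rlimit().

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct user_struct. */
struct model_user {
	atomic_long locked_vm;	/* pages currently charged to this user */
};

/*
 * Model of the patched check: charge optimistically, then roll back only
 * if the total exceeds the limit and the caller lacks the capability.
 */
int model_charge_memlock(struct model_user *user, long pages,
			 long limit_pages, bool has_ipc_lock)
{
	long locked = atomic_fetch_add(&user->locked_vm, pages) + pages;

	if (locked > limit_pages && !has_ipc_lock) {
		atomic_fetch_sub(&user->locked_vm, pages);
		return -EPERM;
	}
	return 0;
}
```

With has_ipc_lock set to false the function behaves like the pre-patch code.
With it set to true, the charge is still recorded in locked_vm but is never
rejected, so a privileged caller remains accounted for while being unlimited.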